
Change data capture (CDC) is a data integration pattern for tracking changes in a database so that actions can be taken on the changed data. Debezium is probably the most popular open source platform for CDC. While it originally focused on Kafka source connectors, it also provides a ready-to-use standalone application called Debezium Server, which can stream change events to other messaging infrastructure such as Google Cloud Pub/Sub, Amazon Kinesis and Apache Pulsar. In this post, we develop a CDC solution locally using Docker. The source of the theLook eCommerce data generator is modified to generate data continuously, and the data is inserted into multiple tables of a PostgreSQL database. Two of those tables are tracked by the Debezium server, which pushes their row-level changes into Pub/Sub topics on the Pub/Sub emulator. Finally, the topic messages are read by a Python application.
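
A minimal sketch of the consuming side is shown below, assuming the Pub/Sub emulator's default host/port and hypothetical project and subscription names; the post's actual Python application may differ.

```python
import os
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

# Point the client library at the local Pub/Sub emulator (illustrative host/port).
os.environ["PUBSUB_EMULATOR_HOST"] = "localhost:8085"

project_id = "test-project"      # hypothetical project used by the emulator
subscription_id = "orders-sub"   # hypothetical subscription on a change-event topic

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Debezium publishes change events as serialized payloads; print and acknowledge.
    print(message.data.decode("utf-8"))
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)

with subscriber:
    try:
        # Pull messages for a while, then stop.
        streaming_pull_future.result(timeout=30)
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
```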


Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. In this post, we discuss how to set up a data ingestion pipeline using Kafka connectors. Fake customer and order data is ingested into Kafka topics using the MSK Data Generator, and the topic messages are saved into an S3 bucket using the Confluent S3 sink connector. Both the Kafka Connect cluster and the individual connectors are deployed on Kubernetes using Strimzi custom resources.
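
Strimzi manages individual connectors through KafkaConnector custom resources. Below is a minimal sketch of creating such a resource for the Confluent S3 sink with the kubernetes Python client; the cluster label, namespace, topic and bucket names are assumptions for illustration, not the values used in the post.

```python
from kubernetes import client, config

# KafkaConnector custom resource for the Confluent S3 sink, managed by Strimzi.
# Cluster label, namespace, bucket and topic names are illustrative.
s3_sink = {
    "apiVersion": "kafka.strimzi.io/v1beta2",
    "kind": "KafkaConnector",
    "metadata": {
        "name": "s3-sink",
        "namespace": "kafka",
        "labels": {"strimzi.io/cluster": "my-connect-cluster"},
    },
    "spec": {
        "class": "io.confluent.connect.s3.S3SinkConnector",
        "tasksMax": 1,
        "config": {
            "topics": "customer,order",
            "s3.bucket.name": "my-ingestion-bucket",
            "s3.region": "ap-southeast-2",
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "flush.size": "100",
        },
    },
}

# Create the custom resource; the Strimzi operator reconciles it into a running connector.
config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kafka.strimzi.io",
    version="v1beta2",
    namespace="kafka",
    plural="kafkaconnectors",
    body=s3_sink,
)
```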


Kafka Connect can be an effective tool to ingest data from Apache Kafka into OpenSearch. In this post, we will discuss how to develop a data pipeline from Apache Kafka into OpenSearch locally using Docker; the pipeline will be deployed on AWS in the next post. Fake impression and click data will be pushed into Kafka topics using a Kafka source connector, and those records will be ingested into OpenSearch indexes by a sink connector for near real-time analytics.
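
As a rough sketch of how such a sink connector can be registered against the local Kafka Connect REST API, the snippet below posts an OpenSearch sink definition; the connector class and option names follow the Aiven OpenSearch sink connector, and the topic names and endpoints are assumptions for illustration.

```python
import json
import requests

# Connector definition for an OpenSearch sink; class and option names are assumed
# from the Aiven OpenSearch sink connector, with illustrative topics and endpoints.
sink_config = {
    "name": "opensearch-sink",
    "config": {
        "connector.class": "io.aiven.kafka.connect.opensearch.OpensearchSinkConnector",
        "tasks.max": "1",
        "topics": "impressions,clicks",
        "connection.url": "http://opensearch:9200",
        "key.ignore": "true",
        "schema.ignore": "true",
    },
}

# Register the connector with the local Kafka Connect REST API (default port 8083).
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(sink_config),
)
resp.raise_for_status()
print(resp.json())
```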


As part of investigating how to utilize Kafka Connect effectively for AWS service integration, I demonstrated how to develop the Camel DynamoDB sink connector using Docker in Part 2. Fake order data was generated using the MSK Data Generator source connector, and the sink connector was configured to consume the topic messages and ingest them into a DynamoDB table. In this post, I will illustrate how to deploy the data ingestion applications using Amazon MSK and MSK Connect.
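
For the deployment step, MSK Connect connectors can be created programmatically. The following is a minimal sketch using the boto3 kafkaconnect client; the connector class, Camel option names, ARNs and network settings are placeholders and assumptions for illustration, not the values used in the post.

```python
import boto3

client = boto3.client("kafkaconnect")

# Create an MSK Connect connector for a Camel DynamoDB sink (all values illustrative).
response = client.create_connector(
    connectorName="camel-ddb-sink",
    kafkaConnectVersion="2.7.1",
    capacity={"provisionedCapacity": {"mcuCount": 1, "workerCount": 1}},
    connectorConfiguration={
        # Connector class and Camel option names are assumptions for illustration.
        "connector.class": "org.apache.camel.kafkaconnector.awsddbsink.CamelAwsddbsinkSinkConnector",
        "tasks.max": "1",
        "topics": "order",
        "camel.kamelet.aws-ddb-sink.table": "orders",
        "camel.kamelet.aws-ddb-sink.region": "ap-southeast-2",
    },
    kafkaCluster={
        "apacheKafkaCluster": {
            "bootstrapServers": "<msk-bootstrap-servers>",
            "vpc": {"securityGroups": ["<security-group-id>"], "subnets": ["<subnet-id>"]},
        }
    },
    kafkaClusterClientAuthentication={"authenticationType": "IAM"},
    kafkaClusterEncryptionInTransit={"encryptionType": "TLS"},
    plugins=[{"customPlugin": {"customPluginArn": "<custom-plugin-arn>", "revision": 1}}],
    serviceExecutionRoleArn="<execution-role-arn>",
)
print(response["connectorArn"])
```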


In Part 3, we developed a data ingestion pipeline using Kafka Connect source and sink connectors without enabling schemas. Later, in Part 5, we discussed the benefits of a schema registry when developing Kafka applications. In this post, I'll demonstrate how to enhance the existing data ingestion pipeline by integrating the AWS Glue Schema Registry.
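
As a hedged sketch of what the integration can look like, the snippet below merges AWS Glue Schema Registry converter settings into an existing connector's configuration via the Kafka Connect REST API; the converter class and property names follow the aws-glue-schema-registry library, and the registry name, region and connector name are assumptions.

```python
import json
import requests

# Converter settings for the AWS Glue Schema Registry Avro converter; values illustrative.
glue_converter = {
    "value.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
    "value.converter.region": "ap-southeast-2",
    "value.converter.registry.name": "customer",
    "value.converter.avroRecordType": "GENERIC_RECORD",
    "value.converter.schemaAutoRegistrationEnabled": "true",
}

connector_name = "s3-sink"  # hypothetical existing connector
base_url = "http://localhost:8083"

# Fetch the current config, add the converter settings and update the connector in place.
current = requests.get(f"{base_url}/connectors/{connector_name}/config").json()
updated = {**current, **glue_converter}
resp = requests.put(
    f"{base_url}/connectors/{connector_name}/config",
    headers={"Content-Type": "application/json"},
    data=json.dumps(updated),
)
resp.raise_for_status()
```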


Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. In this post, I will illustrate how to set up a data ingestion pipeline using Kafka connectors. Fake customer and order data will be ingested into the corresponding topics using the MSK Data Generator source connector. The topic messages will then be saved into an S3 bucket using the Confluent S3 sink connector.
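
A minimal sketch of registering the source connector with the Kafka Connect REST API is shown below; the genkp/genv generator expressions follow the MSK Data Generator's configuration style, but the field names, throttle values and endpoint are assumptions for illustration.

```python
import json
import requests

# Source connector definition for the MSK Data Generator (illustrative fields and values).
source_config = {
    "name": "order-source",
    "config": {
        "connector.class": "com.amazonaws.mskdatagen.GeneratorSourceConnector",
        "tasks.max": "1",
        # genkp.* defines topic keys, genv.* defines value fields via Java Faker expressions.
        "genkp.customer.with": "#{Code.isbn10}",
        "genv.customer.name.with": "#{Name.full_name}",
        "genkp.order.with": "#{Internet.uuid}",
        "genv.order.product_id.with": "#{number.number_between '101','109'}",
        "genv.order.quantity.with": "#{number.number_between '1','5'}",
        "global.throttle.ms": "500",
        "global.history.records.max": "1000",
    },
}

# Register the source connector with the Kafka Connect REST API.
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(source_config),
)
resp.raise_for_status()
```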


Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It can be used to build real-time data pipelines on AWS effectively. In this post, I will introduce the Kafka connectors that are available mainly for AWS service integration. Developing and deploying some of them will be covered in later posts.


We'll continue the discussion of a Change Data Capture (CDC) solution with a schema registry and cover its deployment to AWS. All major resources are deployed in private subnets, and a VPN is used to access them in order to improve the developer experience. The Apicurio registry is used as the schema registry service, and it is deployed as an ECS service. In order for the connectors to have access to the registry, the Confluent Avro Converter is packaged together with the connector sources. The post ends by illustrating how schema evolution is managed by the schema registry.
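
As an illustration of how a connector can reach the registry, the snippet below shows converter settings that point the Confluent Avro Converter at Apicurio's Confluent-compatible (ccompat) API; the internal registry hostname and the ccompat path version are assumptions, and only the converter-related portion of a Debezium source connector definition is shown.

```python
# Converter portion of a hypothetical Debezium Postgres source connector definition,
# using the Confluent Avro Converter against the Apicurio registry's ccompat API.
debezium_avro_settings = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://registry.internal:8080/apis/ccompat/v6",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://registry.internal:8080/apis/ccompat/v6",
}
```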


We'll discuss a Change Data Capture (CDC) architecture with a schema registry. As a starting point, a local development environment is set up using Docker Compose. The Debezium and Confluent S3 connectors are deployed with the Confluent Avro converter, and the Apicurio registry is used as the schema registry service. A quick example is shown to illustrate how schema evolution can be managed by the schema registry.
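
A hedged sketch of such a schema evolution check follows: it asks the registry whether an evolved Avro schema with an added optional field is compatible with the latest registered version, using Apicurio's Confluent-compatible API. The subject name, registry URL and record fields are assumptions for illustration.

```python
import json
import requests

# Local Apicurio registry exposing the Confluent-compatible (ccompat) API.
registry_url = "http://localhost:8080/apis/ccompat/v6"
subject = "orders-value"  # hypothetical subject name

# Evolved schema: a new optional field with a default keeps the change backward compatible.
evolved_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "discount", "type": ["null", "double"], "default": None},
    ],
}

# Ask the registry whether the evolved schema is compatible with the latest version.
resp = requests.post(
    f"{registry_url}/compatibility/subjects/{subject}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(evolved_schema)}),
)
resp.raise_for_status()
print(resp.json())  # e.g. {"is_compatible": true}
```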


Change data capture (CDC) on Amazon MSK combined with data ingestion using Apache Hudi on Amazon EMR can be used to build an efficient data lake solution. In this post, we'll build a Hudi DeltaStreamer app on Amazon EMR and use the resulting Hudi table with Athena and QuickSight to build a dashboard.
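
As a rough sketch of querying the resulting table, the snippet below runs an ad-hoc Athena query with boto3 and prints the result rows; the database, table and query-result bucket names are assumptions, not the ones used in the post.

```python
import time
import boto3

athena = boto3.client("athena")

# Ad-hoc aggregation over a hypothetical Hudi table registered in the Glue catalog.
query = "SELECT product_id, SUM(amount) AS revenue FROM orders GROUP BY product_id"
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch and print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```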