Data Integration on Jaehyeon Kim

Change Data Capture (CDC) Local Development with PostgreSQL, Debezium Server and Pub/Sub Emulator

Thu, 07 Nov 2024 00:00:00 +0000

Change data capture (CDC) is a data integration pattern that tracks changes in a database so that actions can be taken using the changed data. Debezium is probably the most popular open source platform for CDC. Originally providing Kafka source connectors, it also supports a ready-to-use application called Debezium server. The standalone application can be used to stream change events to other messaging infrastructure such as Google Cloud Pub/Sub, Amazon Kinesis and Apache Pulsar. In this post, we develop a CDC solution locally using Docker. The source of theLook eCommerce is modified to generate data continuously, and the data is inserted into multiple tables of a PostgreSQL database. Two of those tables are tracked by the Debezium server, which pushes their row-level changes into Pub/Sub topics on the Pub/Sub emulator. Finally, the messages in the topics are read by a Python application.
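
To make "row-level changes" concrete, here is a minimal sketch of how a consumer might apply Debezium-style change events to an in-memory copy of a table. It assumes the usual Debezium envelope with `op`, `before` and `after` fields; the `apply_change_event` helper, the `id` key column and the sample rows are hypothetical and not taken from the post, and no Pub/Sub client is involved.

```python
from typing import Any, Dict, Optional

Row = Dict[str, Any]


def apply_change_event(table: Dict[Any, Row], event: Dict[str, Any], key: str = "id") -> None:
    """Apply one Debezium-style change event to an in-memory copy of a table."""
    # The envelope may arrive bare or wrapped as {"schema": ..., "payload": ...}.
    payload = event.get("payload", event)
    op = payload["op"]
    before: Optional[Row] = payload.get("before")
    after: Optional[Row] = payload.get("after")
    if op in ("c", "r", "u"):
        # create, snapshot read, update: "after" holds the current row image
        table[after[key]] = after
    elif op == "d":
        # delete: "after" is null, so the key is taken from "before"
        table.pop(before[key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Hypothetical events for an orders table (assumed columns: id, status).
orders: Dict[Any, Row] = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "Processing"}},
    {"op": "u", "before": {"id": 1, "status": "Processing"}, "after": {"id": 1, "status": "Shipped"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "Processing"}},
    {"op": "d", "before": {"id": 2, "status": "Processing"}, "after": None},
]
for change in events:
    apply_change_event(orders, change)
```

After replaying the four events, `orders` holds only the row with `id` 1 in its latest state, which is the kind of downstream action the pattern enables.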

Kafka Development on Kubernetes - Part 3 Kafka Connect

Thu, 11 Jan 2024 00:00:00 +0000

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. In this post, we discuss how to set up a data ingestion pipeline using Kafka connectors. Fake customer and order data is ingested into Kafka topics using the MSK Data Generator. Also, we use the Confluent S3 sink connector to save the messages of the topics into an S3 bucket.

Kafka Connect for AWS Services Integration - Part 4 Develop Aiven OpenSearch Sink Connector

Mon, 23 Oct 2023 00:00:00 +0000

OpenSearch is a popular search and analytics engine and its use cases cover log analytics, real-time application monitoring, and clickstream analysis. OpenSearch can be deployed on its own or via Amazon OpenSearch Service. Apache Kafka is a distributed event store and stream-processing platform, and it aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. On AWS, Apache Kafka can be deployed via Amazon Managed Streaming for Apache Kafka (MSK).

Kafka Connect for AWS Services Integration - Part 3 Deploy Camel DynamoDB Sink Connector

Mon, 03 Jul 2023 00:00:00 +0000

As part of investigating how to utilize Kafka Connect effectively for AWS services integration, I demonstrated how to develop the Camel DynamoDB sink connector using Docker in Part 2. Fake order data was generated using the MSK Data Generator source connector, and the sink connector was configured to consume the topic messages to ingest them into a DynamoDB table. In this post, I will illustrate how to deploy the data ingestion applications using Amazon MSK and MSK Connect.

Kafka Development with Docker - Part 6 Kafka Connect with Glue Schema Registry

Thu, 15 Jun 2023 00:00:00 +0000

In Part 3, we developed a data ingestion pipeline with fake online order data using Kafka Connect source and sink connectors. Schemas were not enabled on either of them as there was no integrated schema registry. Later, in Part 5, we discussed how producers and consumers of Kafka topics can use schemas to ensure data consistency and compatibility as schemas evolve. In this post, I’ll demonstrate how to enhance the existing data ingestion pipeline by integrating AWS Glue Schema Registry.

Kafka Development with Docker - Part 3 Kafka Connect

Thu, 25 May 2023 00:00:00 +0000

According to the documentation of Apache Kafka, Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. Kafka Connect supports two types of connectors: source and sink. Source connectors ingest messages from external systems into Kafka topics, while sink connectors deliver messages from Kafka topics into external systems.

Kafka Connect for AWS Services Integration - Part 1 Introduction

Wed, 03 May 2023 00:00:00 +0000

Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (MSK) are two managed streaming services offered by AWS. Many resources on the web indicate Kinesis Data Streams is better when it comes to integrating with AWS services. However, that is not necessarily the case once Kafka Connect is taken into account. According to the documentation of Apache Kafka, Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems.

Use External Schema Registry with MSK Connect – Part 2 MSK Deployment

Sun, 03 Apr 2022 00:00:00 +0000

In the previous post, we discussed a Change Data Capture (CDC) solution with a schema registry. A local development environment is set up using Docker Compose. The Debezium and Confluent S3 connectors are deployed with the Confluent Avro converter and the Apicurio registry is used as the schema registry service. A quick example is shown to illustrate how schema evolution can be managed by the schema registry. In this post, we’ll build the solution on AWS using MSK, MSK Connect, Aurora PostgreSQL and ECS.

Use External Schema Registry with MSK Connect – Part 1 Local Development

Mon, 07 Mar 2022 00:00:00 +0000

When we discussed a Change Data Capture (CDC) solution in one of the earlier posts, we used the JSON converter that comes with Kafka Connect. We optionally enabled the key and value schemas, so the topic messages included those schemas together with the payload. It seemed convenient at first as the messages are saved into S3 on their own. However, it became cumbersome when we tried to use the DeltaStreamer utility.
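
As a rough illustration of what "schemas together with the payload" means, the sketch below separates the two parts of a message value. It assumes the JSON converter's wrapped form with top-level `schema` and `payload` fields; the `split_envelope` helper and the sample message are hypothetical and not taken from the post.

```python
import json
from typing import Any, Dict, Optional, Tuple


def split_envelope(raw: bytes) -> Tuple[Optional[Dict[str, Any]], Any]:
    """Return (schema, payload) from a JSON-converter message value."""
    doc = json.loads(raw)
    # With schemas enabled, the value is wrapped as {"schema": ..., "payload": ...}.
    if isinstance(doc, dict) and set(doc) == {"schema", "payload"}:
        return doc["schema"], doc["payload"]
    # Without schemas, the value is the payload itself.
    return None, doc


# Hypothetical message value with an embedded schema.
sample = json.dumps(
    {
        "schema": {
            "type": "struct",
            "fields": [{"field": "id", "type": "int32", "optional": False}],
            "optional": False,
        },
        "payload": {"id": 1},
    }
).encode()

schema, payload = split_envelope(sample)
```

Every message repeats the full schema in this form, which hints at why downstream tooling that expects plain records can find it awkward to consume.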

Data Lake Demo using Change Data Capture (CDC) on AWS – Part 3 Implement Data Lake

Sun, 19 Dec 2021 00:00:00 +0000

In the previous post, we created a VPC that has private and public subnets in 2 availability zones in order to build and deploy the data lake solution on AWS. NAT instances are created to forward outbound traffic to the internet and a VPN bastion host is set up to facilitate deployment. An Aurora PostgreSQL cluster is deployed to host the source database and a Python command line app is used to create the database.

Data Lake Demo using Change Data Capture (CDC) on AWS – Part 2 Implement CDC

Sun, 12 Dec 2021 00:00:00 +0000

In the previous post, we discussed a data lake solution where data ingestion is performed using change data capture (CDC) and the output files are upserted to an Apache Hudi table. Once registered in the Glue Data Catalog, the table can be used for ad-hoc queries and report/dashboard creation. The Northwind database is used as the source database and, following the transactional outbox pattern, order-related changes are upserted to an outbox table by triggers.
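
The trigger-based outbox idea can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in SQLite as a stand-in for PostgreSQL, and the `orders` and `outbox` tables, their columns and the trigger names are hypothetical rather than the Northwind objects used in the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (order_id INTEGER PRIMARY KEY, status TEXT NOT NULL);

    -- Each insert or update on orders upserts the latest state into outbox.
    CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
    BEGIN
        INSERT OR REPLACE INTO outbox (order_id, status)
        VALUES (NEW.order_id, NEW.status);
    END;

    CREATE TRIGGER orders_after_update AFTER UPDATE ON orders
    BEGIN
        INSERT OR REPLACE INTO outbox (order_id, status)
        VALUES (NEW.order_id, NEW.status);
    END;
    """
)

# One transaction: the order change and its outbox row commit together.
with conn:
    conn.execute("INSERT INTO orders (order_id, status) VALUES (1, 'Processing')")
    conn.execute("UPDATE orders SET status = 'Shipped' WHERE order_id = 1")

outbox_rows = conn.execute("SELECT order_id, status FROM outbox").fetchall()
```

Because the triggers run inside the same transaction as the business write, a CDC tool only needs to track the single outbox table instead of every order-related table.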

Data Lake Demo using Change Data Capture (CDC) on AWS – Part 1 Local Development

Sun, 05 Dec 2021 00:00:00 +0000

Change data capture (CDC) is a proven data integration pattern that has a wide range of applications. Among those, data replication to data lakes is a good use case in data engineering. Coupled with best-of-breed data lake formats such as Apache Hudi, we can build an efficient data replication solution. This is the first post of the data lake demo series. Over time, we’ll build a data lake that uses CDC.