<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Data Integration on Jaehyeon Kim</title><link>https://jaehyeon.me/categories/data-integration/</link><description>Recent content in Data Integration on Jaehyeon Kim</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Copyright © 2023-2026 Jaehyeon Kim. All Rights Reserved.</copyright><lastBuildDate>Thu, 07 Nov 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://jaehyeon.me/categories/data-integration/index.xml" rel="self" type="application/rss+xml"/><item><title>Change Data Capture (CDC) Local Development with PostgreSQL, Debezium Server and Pub/Sub Emulator</title><link>https://jaehyeon.me/blog/2024-11-07-cdc-local-dev/</link><pubDate>Thu, 07 Nov 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-11-07-cdc-local-dev/</guid><description><![CDATA[<p><em>Change data capture</em> (CDC) is a data integration pattern to track changes in a database so that actions can be taken using the changed data. <a href="https://debezium.io/" target="_blank" rel="noopener noreferrer"><em>Debezium</em><i class="fas fa-external-link-square-alt ms-1"></i></a> is probably the most popular open source platform for CDC. Originally providing Kafka source connectors, it also supports a ready-to-use application called <a href="https://debezium.io/documentation/reference/stable/operations/debezium-server.html" target="_blank" rel="noopener noreferrer">Debezium server<i class="fas fa-external-link-square-alt ms-1"></i></a>. The standalone application can be used to stream change events to other messaging infrastructure such as Google Cloud Pub/Sub, Amazon Kinesis and Apache Pulsar. In this post, we develop a CDC solution locally using Docker. The source of the <a href="https://console.cloud.google.com/marketplace/product/bigquery-public-data/thelook-ecommerce" target="_blank" rel="noopener noreferrer">theLook eCommerce<i class="fas fa-external-link-square-alt ms-1"></i></a> is modified to generate data continuously, and the data is inserted into multiple tables of a PostgreSQL database. Among those tables, two of them are tracked by the Debezium server, and it pushes row-level changes of those tables into Pub/Sub topics on the <a href="https://cloud.google.com/pubsub/docs/emulator" target="_blank" rel="noopener noreferrer">Pub/Sub emulator<i class="fas fa-external-link-square-alt ms-1"></i></a>. Finally, messages of the topics are read by a Python application.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-11-07-cdc-local-dev/featured.png" length="83605" type="image/png"/></item><item><title>Kafka Development on Kubernetes - Part 3 Kafka Connect</title><link>https://jaehyeon.me/blog/2024-01-11-kafka-development-on-k8s-part-3/</link><pubDate>Thu, 11 Jan 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-01-11-kafka-development-on-k8s-part-3/</guid><description>Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. In this post, we discuss how to set up a data ingestion pipeline using Kafka connectors. Fake customer and order data is ingested into Kafka topics using the MSK Data Generator. Also, we use the Confluent S3 sink connector to save the messages of the topics into a S3 bucket.</description><enclosure url="https://jaehyeon.me/blog/2024-01-11-kafka-development-on-k8s-part-3/featured.png" length="97270" type="image/png"/></item><item><title>Kafka Connect for AWS Services Integration - Part 4 Develop Aiven OpenSearch Sink Connector</title><link>https://jaehyeon.me/blog/2023-10-23-kafka-connect-for-aws-part-4/</link><pubDate>Mon, 23 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-23-kafka-connect-for-aws-part-4/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 OpenSearch is a popular search and analytics engine and its use cases cover log analytics, real-time application monitoring, and clickstream analysis. OpenSearch can be deployed on its own or via Amazon OpenSearch Service.</description><enclosure url="https://jaehyeon.me/blog/2023-10-23-kafka-connect-for-aws-part-4/featured.png" length="61820" type="image/png"/></item><item><title>Kafka Connect for AWS Services Integration - Part 3 Deploy Camel DynamoDB Sink Connector</title><link>https://jaehyeon.me/blog/2023-07-03-kafka-connect-for-aws-part-3/</link><pubDate>Mon, 03 Jul 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-07-03-kafka-connect-for-aws-part-3/</guid><description>As part of investigating how to utilize Kafka Connect effectively for AWS services integration, I demonstrated how to develop the Camel DynamoDB sink connector using Docker in Part 2. Fake order data was generated using the MSK Data Generator source connector, and the sink connector was configured to consume the topic messages to ingest them into a DynamoDB table. In this post, I will illustrate how to deploy the data ingestion applications using Amazon MSK and MSK Connect.</description><enclosure url="https://jaehyeon.me/blog/2023-07-03-kafka-connect-for-aws-part-3/featured.png" length="76240" type="image/png"/></item><item><title>Kafka Development with Docker - Part 6 Kafka Connect with Glue Schema Registry</title><link>https://jaehyeon.me/blog/2023-06-15-kafka-development-with-docker-part-6/</link><pubDate>Thu, 15 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-15-kafka-development-with-docker-part-6/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In Part 3, we developed a data ingestion pipeline with fake online order data using Kafka Connect source and sink connectors. Schemas are not enabled on both of them as there was not an integrated schema registry.</description><enclosure url="https://jaehyeon.me/blog/2023-06-15-kafka-development-with-docker-part-6/featured.png" length="60354" type="image/png"/></item><item><title>Kafka Development with Docker - Part 3 Kafka Connect</title><link>https://jaehyeon.me/blog/2023-05-25-kafka-development-with-docker-part-3/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-05-25-kafka-development-with-docker-part-3/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 According to the documentation of Apache Kafka, Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka.</description><enclosure url="https://jaehyeon.me/blog/2023-05-25-kafka-development-with-docker-part-3/featured.png" length="69998" type="image/png"/></item><item><title>Kafka Connect for AWS Services Integration - Part 1 Introduction</title><link>https://jaehyeon.me/blog/2023-05-03-kafka-connect-for-aws-part-1/</link><pubDate>Wed, 03 May 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-05-03-kafka-connect-for-aws-part-1/</guid><description>Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (MSK) are two managed streaming services offered by AWS. Many resources on the web indicate Kinesis Data Streams is better when it comes to integrating with AWS services. However, it is not necessarily the case with the help of Kafka Connect. According to the documentation of Apache Kafka, Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems.</description><enclosure url="https://jaehyeon.me/blog/2023-05-03-kafka-connect-for-aws-part-1/featured.png" length="22272" type="image/png"/></item><item><title>Use External Schema Registry with MSK Connect – Part 2 MSK Deployment</title><link>https://jaehyeon.me/blog/2022-04-03-schema-registry-part2/</link><pubDate>Sun, 03 Apr 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-04-03-schema-registry-part2/</guid><description>In the previous post, we discussed a Change Data Capture (CDC) solution with a schema registry. A local development environment is set up using Docker Compose. The Debezium and Confluent S3 connectors are deployed with the Confluent Avro converter and the Apicurio registry is used as the schema registry service. A quick example is shown to illustrate how schema evolution can be managed by the schema registry. In this post, we&amp;rsquo;ll build the solution on AWS using MSK, MSK Connect, Aurora PostgreSQL and ECS.</description><enclosure url="https://jaehyeon.me/blog/2022-04-03-schema-registry-part2/featured.png" length="59689" type="image/png"/></item><item><title>Use External Schema Registry with MSK Connect – Part 1 Local Development</title><link>https://jaehyeon.me/blog/2022-03-07-schema-registry-part1/</link><pubDate>Mon, 07 Mar 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-03-07-schema-registry-part1/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 When we discussed a Change Data Capture (CDC) solution in one of the earlier posts, we used the JSON converter that comes with Kafka Connect. We optionally enabled the key and value schemas and the topic messages include those schemas together with payload.</description><enclosure url="https://jaehyeon.me/blog/2022-03-07-schema-registry-part1/featured.png" length="59689" type="image/png"/></item><item><title>Data Lake Demo using Change Data Capture (CDC) on AWS – Part 3 Implement Data Lake</title><link>https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/</link><pubDate>Sun, 19 Dec 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/</guid><description>In the previous post, we created a VPC that has private and public subnets in 2 availability zones in order to build and deploy the data lake solution on AWS. NAT instances are created to forward outbound traffic to the internet and a VPN bastion host is set up to facilitate deployment. An Aurora PostgreSQL cluster is deployed to host the source database and a Python command line app is used to create the database.</description><enclosure url="https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/featured.png" length="164526" type="image/png"/></item><item><title>Data Lake Demo using Change Data Capture (CDC) on AWS – Part 2 Implement CDC</title><link>https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/</link><pubDate>Sun, 12 Dec 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/</guid><description>In the previous post, we discussed a data lake solution where data ingestion is performed using change data capture (CDC) and the output files are upserted to an Apache Hudi table. Being registered to Glue Data Catalog, it can be used for ad-hoc queries and report/dashboard creation. The Northwind database is used as the source database and, following the transactional outbox pattern, order-related changes are _upserted _to an outbox table by triggers.</description><enclosure url="https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/featured.png" length="164526" type="image/png"/></item><item><title>Data Lake Demo using Change Data Capture (CDC) on AWS – Part 1 Local Development</title><link>https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/</link><pubDate>Sun, 05 Dec 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/</guid><description>Change data capture (CDC) is a proven data integration pattern that has a wide range of applications. Among those, data replication to data lakes is a good use case in data engineering. Coupled with best-in-breed data lake formats such as Apache Hudi, we can build an efficient data replication solution. This is the first post of the data lake demo series. Over time, we&amp;rsquo;ll build a data lake that uses CDC.</description><enclosure url="https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/featured.png" length="164526" type="image/png"/></item></channel></rss>