<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Data Streaming on Jaehyeon Kim</title><link>https://jaehyeon.me/categories/data-streaming/</link><description>Recent content in Data Streaming on Jaehyeon Kim</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Copyright © 2023-2026 Jaehyeon Kim. All Rights Reserved.</copyright><lastBuildDate>Wed, 10 Dec 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://jaehyeon.me/categories/data-streaming/index.xml" rel="self" type="application/rss+xml"/><item><title>Stream Processing with Flink in Kotlin</title><link>https://jaehyeon.me/blog/2025-12-10-streaming-processing-with-flink-in-kotlin/</link><pubDate>Wed, 10 Dec 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-12-10-streaming-processing-with-flink-in-kotlin/</guid><description><![CDATA[<p>A couple of years ago, I read <a href="https://www.oreilly.com/library/view/stream-processing-with/9781491974285/" target="_blank" rel="noopener noreferrer">Stream Processing with Apache Flink<i class="fas fa-external-link-square-alt ms-1"></i></a> and worked through the examples using PyFlink. While the book offered a solid introduction to Flink, I frequently hit limitations with the Python API, as many features from the book weren&rsquo;t supported. This time, I decided to revisit the material, but using Kotlin. The experience has been much more rewarding and fun.</p>
<p>In porting the examples to Kotlin, I also took the opportunity to align the code with modern Flink practices. The complete source for this post is available in the <a href="https://github.com/jaehyeon-kim/flink-demos/tree/master/stream-processing-with-flink" target="_blank" rel="noopener noreferrer"><code>stream-processing-with-flink</code><i class="fas fa-external-link-square-alt ms-1"></i></a> directory of the <code>flink-demos</code> GitHub repository.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-12-10-streaming-processing-with-flink-in-kotlin/featured.png" length="31883" type="image/png"/></item><item><title>Self-service Data Platform via a Multi-tenant SQL Gateway</title><link>https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/</link><pubDate>Thu, 17 Jul 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/</guid><description>In the modern data stack, providing direct access to powerful engines like Apache Spark and Flink is a double-edged sword. While it empowers users, it often leads to chaos: resource contention from &amp;ldquo;noisy neighbors,&amp;rdquo; inconsistent security enforcement, and operational fragility. The core problem is the lack of a robust control plane between users and the raw compute power. The solution, therefore, isn&amp;rsquo;t to take power away from users, but to manage it through an intelligent intermediary.</description><enclosure url="https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/featured.png" length="55011" type="image/png"/></item><item><title>Flink Table API - Declarative Analytics for Supplier Stats in Real Time</title><link>https://jaehyeon.me/blog/2025-06-17-kotlin-getting-started-flink-table/</link><pubDate>Tue, 17 Jun 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-06-17-kotlin-getting-started-flink-table/</guid><description><![CDATA[<p>In the last post, we explored the fine-grained control of Flink&rsquo;s DataStream API. Now, we&rsquo;ll approach the same problem from a higher level of abstraction using the <strong>Flink Table API</strong>. This post demonstrates how to build a declarative analytics pipeline that processes our continuous stream of Avro-formatted order events. We will define a <code>Table</code> on top of a <code>DataStream</code> and use SQL-like expressions to perform windowed aggregations. This example highlights the power and simplicity of the Table API for analytical tasks and showcases Flink&rsquo;s seamless integration between its different API layers to handle complex requirements like late data.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-06-17-kotlin-getting-started-flink-table/featured.png" length="144113" type="image/png"/></item><item><title>Flink DataStream API - Scalable Event Processing for Supplier Stats</title><link>https://jaehyeon.me/blog/2025-06-10-kotlin-getting-started-flink-datastream/</link><pubDate>Tue, 10 Jun 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-06-10-kotlin-getting-started-flink-datastream/</guid><description><![CDATA[<p>Building on our exploration of stream processing, we now transition from Kafka&rsquo;s native library to <strong>Apache Flink</strong>, a powerful, general-purpose distributed processing engine. In this post, we&rsquo;ll dive into Flink&rsquo;s foundational <strong>DataStream API</strong>. 
We will tackle the same supplier statistics problem - analyzing a stream of Avro-formatted order events - but this time using Flink&rsquo;s robust features for stateful computation. This example will highlight Flink&rsquo;s sophisticated event-time processing with watermarks and its elegant, built-in mechanisms for handling late-arriving data through side outputs.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-06-10-kotlin-getting-started-flink-datastream/featured.png" length="142918" type="image/png"/></item><item><title>Kafka Streams - Lightweight Real-Time Processing for Supplier Stats</title><link>https://jaehyeon.me/blog/2025-06-03-kotlin-getting-started-kafka-streams/</link><pubDate>Tue, 03 Jun 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-06-03-kotlin-getting-started-kafka-streams/</guid><description><![CDATA[<p>In this post, we shift our focus from basic Kafka clients to real-time stream processing with <strong>Kafka Streams</strong>. We&rsquo;ll explore a Kotlin application designed to analyze a continuous stream of Avro-formatted order events, calculate supplier statistics in tumbling windows, and intelligently handle late-arriving data. This example demonstrates the power of Kafka Streams for building lightweight, yet robust, stream processing applications directly within your Kafka ecosystem, leveraging event-time processing and custom logic.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-06-03-kotlin-getting-started-kafka-streams/featured.png" length="131804" type="image/png"/></item><item><title>Kafka Clients with Avro - Schema Registry and Order Events</title><link>https://jaehyeon.me/blog/2025-05-27-kotlin-getting-started-kafka-avro-clients/</link><pubDate>Tue, 27 May 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-05-27-kotlin-getting-started-kafka-avro-clients/</guid><description><![CDATA[<p>In this post, we&rsquo;ll explore a practical example of building Kafka client applications using Kotlin, Apache Avro for data serialization, and Gradle for build management. We&rsquo;ll walk through the setup of a Kafka producer that generates mock order data and a consumer that processes these orders. This example highlights best practices such as schema management with Avro, robust error handling, and graceful shutdown, providing a solid foundation for your own Kafka-based projects. We&rsquo;ll dive into the build configuration, the Avro schema definition, utility functions for Kafka administration, and the core logic of both the producer and consumer applications.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-05-27-kotlin-getting-started-kafka-avro-clients/featured.png" length="73988" type="image/png"/></item><item><title>Kafka Clients with JSON - Producing and Consuming Order Events</title><link>https://jaehyeon.me/blog/2025-05-20-kotlin-getting-started-kafka-json-clients/</link><pubDate>Tue, 20 May 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-05-20-kotlin-getting-started-kafka-json-clients/</guid><description>&lt;p>This post explores a Kotlin-based Kafka project, meticulously detailing the construction and operation of both a Kafka producer application, responsible for generating and sending order data, and a Kafka consumer application, designed to receive and process these orders. 
We&amp;rsquo;ll delve into each component, from build configuration to message handling, to understand how they work together in an event-driven system.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2025-05-20-kotlin-getting-started-kafka-json-clients/featured.png" length="97922" type="image/png"/></item><item><title>Meet the Streamhouse Trio - Paimon, Fluss, and Iceberg for Unified Data Architectures</title><link>https://jaehyeon.me/blog/2025-05-06-streamhouse-trio/</link><pubDate>Tue, 06 May 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-05-06-streamhouse-trio/</guid><description><![CDATA[<p>The world of data is converging. The traditional divide between batch processing for historical analytics and stream processing for real-time insights is becoming increasingly blurry. Businesses demand architectures that handle both seamlessly. Enter the &ldquo;Streamhouse&rdquo; - an evolution of the Lakehouse concept, designed with streaming as a first-class citizen.</p>
<p>Today, we&rsquo;ll introduce three key open-source technologies shaping this space: <a href="https://paimon.apache.org/" target="_blank" rel="noopener noreferrer"><strong>Apache Paimon™</strong><i class="fas fa-external-link-square-alt ms-1"></i></a>, <a href="https://alibaba.github.io/fluss-docs/" target="_blank" rel="noopener noreferrer"><strong>Fluss</strong><i class="fas fa-external-link-square-alt ms-1"></i></a>, and <a href="https://iceberg.apache.org/" target="_blank" rel="noopener noreferrer"><strong>Apache Iceberg</strong><i class="fas fa-external-link-square-alt ms-1"></i></a>. While each has unique strengths, their true power lies in how they can be integrated to build robust, flexible, and performant data platforms.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-05-06-streamhouse-trio/featured.png" length="288793" type="image/png"/></item><item><title>Run Flink SQL Cookbook in Docker</title><link>https://jaehyeon.me/blog/2025-04-15-sql-cookbook/</link><pubDate>Tue, 15 Apr 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-04-15-sql-cookbook/</guid><description><![CDATA[<p>The <a href="https://github.com/ververica/flink-sql-cookbook" target="_blank" rel="noopener noreferrer">Flink SQL Cookbook<i class="fas fa-external-link-square-alt ms-1"></i></a> by Ververica is a hands-on, example-rich guide to mastering <a href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/overview/" target="_blank" rel="noopener noreferrer">Apache Flink SQL<i class="fas fa-external-link-square-alt ms-1"></i></a> for real-time stream processing. It offers a wide range of self-contained recipes, from basic queries and table operations to more advanced use cases like windowed aggregations, complex joins, user-defined functions (UDFs), and pattern detection. These examples are designed to be run on the Ververica Platform, and as such, the cookbook doesn&rsquo;t include instructions for setting up a Flink cluster.</p>
<p>To help you run these recipes locally and explore Flink SQL without external dependencies, this post walks through setting up a fully functional local Flink cluster using Docker Compose. With this setup, you can experiment with the cookbook examples right on your machine.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-04-15-sql-cookbook/featured.gif" length="319243" type="image/gif"/></item><item><title>Apache Beam Python Examples - Part 10 Develop Streaming File Reader using Splittable DoFn</title><link>https://jaehyeon.me/blog/2024-12-19-beam-examples-10/</link><pubDate>Thu, 19 Dec 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-12-19-beam-examples-10/</guid><description><![CDATA[<p>In <a href="/blog/2024-12-05-beam-examples-9">Part 9</a>, we developed two Apache Beam pipelines using <a href="https://beam.apache.org/documentation/programming-guide/#splittable-dofns" target="_blank" rel="noopener noreferrer"><em>Splittable DoFn (SDF)</em><i class="fas fa-external-link-square-alt ms-1"></i></a>. One of them is a batch file reader, which reads a list of files in an input folder followed by processing them in parallel. We can extend the I/O connector so that, instead of listing files once at the beginning, it scans an input folder periodically for new files and processes them whenever new files are created in the folder. The techniques used in this post can be quite useful as they can be applied to developing I/O connectors that target other unbounded (or streaming) data sources (e.g. Kafka) using the Python SDK.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-12-19-beam-examples-10/featured.png" length="305211" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 9 Develop Batch File Reader and PiSampler using Splittable DoFn</title><link>https://jaehyeon.me/blog/2024-12-05-beam-examples-9/</link><pubDate>Thu, 05 Dec 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-12-05-beam-examples-9/</guid><description><![CDATA[<p>A <a href="https://beam.apache.org/documentation/programming-guide/#splittable-dofns" target="_blank" rel="noopener noreferrer"><em>Splittable DoFn (SDF)</em><i class="fas fa-external-link-square-alt ms-1"></i></a> is a generalization of a <em>DoFn</em> that enables Apache Beam developers to create modular and composable I/O components. Also, it can be applied in advanced non-I/O scenarios such as Monte Carlo simulation. In this post, we develop two Apache Beam pipelines. The first pipeline is an I/O connector, and it reads a list of files in a folder followed by processing each of the file objects in parallel. The second pipeline estimates the value of $\pi$ by performing Monte Carlo simulation.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-12-05-beam-examples-9/featured.png" length="309371" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 8 Enhance Sport Activity Tracker with Runner Motivation</title><link>https://jaehyeon.me/blog/2024-11-21-beam-examples-8/</link><pubDate>Thu, 21 Nov 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-11-21-beam-examples-8/</guid><description>&lt;p>In &lt;a href="/blog/2024-08-01-beam-examples-3">Part 3&lt;/a>, we developed a Beam pipeline that tracks sport activities of users and outputs their speeds periodically. While reporting such values is useful for users on its own, we can provide more engaging information to users if we have a pipeline that reports pacing of their activities over periods.
For example, we can send a message to encourage a user to work harder if he/she has a performance goal and is underperforming for some periods. In this post, we develop a new pipeline that tracks user activities and reports pacing details by comparing short-term metrics to their long-term counterparts.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2024-11-21-beam-examples-8/featured.png" length="402888" type="image/png"/></item><item><title>Change Data Capture (CDC) Local Development with PostgreSQL, Debezium Server and Pub/Sub Emulator</title><link>https://jaehyeon.me/blog/2024-11-07-cdc-local-dev/</link><pubDate>Thu, 07 Nov 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-11-07-cdc-local-dev/</guid><description><![CDATA[<p><em>Change data capture</em> (CDC) is a data integration pattern to track changes in a database so that actions can be taken using the changed data. <a href="https://debezium.io/" target="_blank" rel="noopener noreferrer"><em>Debezium</em><i class="fas fa-external-link-square-alt ms-1"></i></a> is probably the most popular open source platform for CDC. Originally providing Kafka source connectors, it also supports a ready-to-use application called <a href="https://debezium.io/documentation/reference/stable/operations/debezium-server.html" target="_blank" rel="noopener noreferrer">Debezium server<i class="fas fa-external-link-square-alt ms-1"></i></a>. The standalone application can be used to stream change events to other messaging infrastructure such as Google Cloud Pub/Sub, Amazon Kinesis and Apache Pulsar. In this post, we develop a CDC solution locally using Docker. The source of the <a href="https://console.cloud.google.com/marketplace/product/bigquery-public-data/thelook-ecommerce" target="_blank" rel="noopener noreferrer">theLook eCommerce<i class="fas fa-external-link-square-alt ms-1"></i></a> dataset is modified to generate data continuously, and the data is inserted into multiple tables of a PostgreSQL database. Among those tables, two of them are tracked by the Debezium server, and it pushes row-level changes of those tables into Pub/Sub topics on the <a href="https://cloud.google.com/pubsub/docs/emulator" target="_blank" rel="noopener noreferrer">Pub/Sub emulator<i class="fas fa-external-link-square-alt ms-1"></i></a>. Finally, messages of the topics are read by a Python application.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-11-07-cdc-local-dev/featured.png" length="83605" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 7 Separate Droppable Data into Side Output</title><link>https://jaehyeon.me/blog/2024-10-24-beam-examples-7/</link><pubDate>Thu, 24 Oct 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-10-24-beam-examples-7/</guid><description><![CDATA[<p>We develop an Apache Beam pipeline that separates <em>droppable</em> elements from the rest of the data. <em>Droppable</em> elements are those that arrive after the watermark passes the window max timestamp plus allowed lateness. Using a timer in a <em>Stateful</em> DoFn, <em>droppable</em> data is separated from normal data and dispatched into a side output rather than being discarded silently, which is the default behaviour.
Note that this pipeline works in a situation where <em>droppable</em> elements do not appear often, and thus the chance that a <em>droppable</em> element is delivered as the first element in a particular window is low.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-10-24-beam-examples-7/featured.png" length="214574" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 6 Call RPC Service in Batch with Defined Batch Size using Stateful DoFn</title><link>https://jaehyeon.me/blog/2024-10-02-beam-examples-6/</link><pubDate>Wed, 02 Oct 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-10-02-beam-examples-6/</guid><description><![CDATA[<p>In the <a href="/blog/2024-09-25-beam-examples-5">previous post</a>, we continued discussing an Apache Beam pipeline that augments input data by calling a <strong>Remote Procedure Call (RPC)</strong> service. A pipeline was developed that makes a single RPC call for a bundle of elements. The bundle size is determined by the runner; however, we may encounter an issue, e.g. an RPC service may become considerably slower when many elements are included in a single request. We can improve the pipeline using a stateful <code>DoFn</code> where the number of elements to process and the maximum wait seconds can be controlled by <em>state</em> and <em>timers</em>. Note that, although the stateful <code>DoFn</code> used in this post solves the data augmentation task well, in practice, we should use the built-in transforms such as <a href="https://beam.apache.org/documentation/transforms/python/aggregation/batchelements/" target="_blank" rel="noopener noreferrer">BatchElements<i class="fas fa-external-link-square-alt ms-1"></i></a> and <a href="https://beam.apache.org/documentation/transforms/python/aggregation/groupintobatches/" target="_blank" rel="noopener noreferrer">GroupIntoBatches<i class="fas fa-external-link-square-alt ms-1"></i></a> whenever possible.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-10-02-beam-examples-6/featured.png" length="99452" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 5 Call RPC Service in Batch using Stateless DoFn</title><link>https://jaehyeon.me/blog/2024-09-18-beam-examples-5/</link><pubDate>Wed, 18 Sep 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-09-18-beam-examples-5/</guid><description><![CDATA[<p>In the <a href="/blog/2024-08-15-beam-examples-4">previous post</a>, we developed an Apache Beam pipeline where the input data is augmented by a <strong>Remote Procedure Call (RPC)</strong> service. Each input element performs an RPC call and the output is enriched by the response. This is not an efficient way of accessing an external service provided that the service can accept more than one element.
In this post, we discuss how to enhance the pipeline so that a single RPC call is made for a bundle of elements, which can save a significant amount of time compared to making a call for each element.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-09-18-beam-examples-5/featured.png" length="95285" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 4 Call RPC Service for Data Augmentation</title><link>https://jaehyeon.me/blog/2024-08-15-beam-examples-4/</link><pubDate>Thu, 15 Aug 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-08-15-beam-examples-4/</guid><description>&lt;p>In this post, we develop an Apache Beam pipeline where the input data is augmented by a &lt;strong>Remote Procedure Call (RPC)&lt;/strong> service. Each input element performs an RPC call and the output is enriched by the response. This is not an efficient way of accessing an external service provided that the service can accept more than one element. In the subsequent two posts, we will discuss updated pipelines that make RPC calls more efficiently. We begin with illustrating how to manage development resources, followed by demonstrating the RPC service that we use in this series. Finally, we develop a Beam pipeline that accesses the external service to augment the input elements.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2024-08-15-beam-examples-4/featured.png" length="93408" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 3 Build Sport Activity Tracker with/without SQL</title><link>https://jaehyeon.me/blog/2024-08-01-beam-examples-3/</link><pubDate>Thu, 01 Aug 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-08-01-beam-examples-3/</guid><description><![CDATA[<p>In this post, we develop two Apache Beam pipelines that track sport activities of users and output their speed periodically. The first pipeline uses native transforms and <a href="https://beam.apache.org/documentation/dsls/sql/overview/" target="_blank" rel="noopener noreferrer">Beam SQL<i class="fas fa-external-link-square-alt ms-1"></i></a> is used for the second. While <em>Beam SQL</em> can be useful in some situations, its features in the Python SDK are not complete compared to the Java SDK. Therefore, we are not able to build the required tracking pipeline using it. We end up discussing potential improvements of <em>Beam SQL</em> so that it can be used for building competitive applications with the Python SDK.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-08-01-beam-examples-3/featured.png" length="94507" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 2 Calculate Average Word Length with/without Fixed Look back</title><link>https://jaehyeon.me/blog/2024-07-18-beam-examples-2/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-07-18-beam-examples-2/</guid><description>&lt;p>In this post, we develop two Apache Beam pipelines that calculate average word lengths from input texts that are ingested via a Kafka topic. They obtain the statistics from different angles.
The first pipeline emits the global average lengths whenever a new input text arrives while the latter triggers those values in a sliding time window.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2024-07-18-beam-examples-2/featured.png" length="96924" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 1 Calculate K Most Frequent Words and Max Word Length</title><link>https://jaehyeon.me/blog/2024-07-04-beam-examples-1/</link><pubDate>Thu, 04 Jul 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-07-04-beam-examples-1/</guid><description><![CDATA[<p>In this series, we develop <a href="https://beam.apache.org/" target="_blank" rel="noopener noreferrer">Apache Beam<i class="fas fa-external-link-square-alt ms-1"></i></a> Python pipelines. The majority of them are from <a href="https://www.packtpub.com/en-us/product/building-big-data-pipelines-with-apache-beam-9781800564930" target="_blank" rel="noopener noreferrer">Building Big Data Pipelines with Apache Beam by Jan Lukavský<i class="fas fa-external-link-square-alt ms-1"></i></a>. Mainly relying on the Java SDK, the book teaches fundamentals of Apache Beam using hands-on tasks, and we convert those tasks using the Python SDK. We focus on streaming pipelines, and they are deployed on a local (or embedded) <a href="https://flink.apache.org/" target="_blank" rel="noopener noreferrer">Apache Flink<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster using the <a href="https://beam.apache.org/documentation/runners/flink/" target="_blank" rel="noopener noreferrer">Apache Flink Runner<i class="fas fa-external-link-square-alt ms-1"></i></a>. Beginning with setting up the development environment, we build two pipelines that obtain top K most frequent words and the word that has the longest word length in this post.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-07-04-beam-examples-1/featured.png" length="96881" type="image/png"/></item><item><title>Deploy Python Stream Processing App on Kubernetes - Part 2 Beam Pipeline on Flink Runner</title><link>https://jaehyeon.me/blog/2024-06-06-beam-deploy-2/</link><pubDate>Thu, 06 Jun 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-06-06-beam-deploy-2/</guid><description><![CDATA[<p>In this post, we develop an <a href="https://beam.apache.org/" target="_blank" rel="noopener noreferrer">Apache Beam<i class="fas fa-external-link-square-alt ms-1"></i></a> pipeline using the <a href="https://beam.apache.org/documentation/sdks/python/" target="_blank" rel="noopener noreferrer">Python SDK<i class="fas fa-external-link-square-alt ms-1"></i></a> and deploy it on an <a href="https://flink.apache.org/" target="_blank" rel="noopener noreferrer">Apache Flink<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster via the <a href="https://beam.apache.org/documentation/runners/flink/" target="_blank" rel="noopener noreferrer">Apache Flink Runner<i class="fas fa-external-link-square-alt ms-1"></i></a>. 
Same as <a href="/blog/2024-05-30-beam-deploy-1">Part I</a>, we deploy a Kafka cluster using the <a href="https://strimzi.io/" target="_blank" rel="noopener noreferrer">Strimzi Operator<i class="fas fa-external-link-square-alt ms-1"></i></a> on a <a href="https://minikube.sigs.k8s.io/docs/" target="_blank" rel="noopener noreferrer">minikube<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster as the pipeline uses <a href="https://kafka.apache.org/" target="_blank" rel="noopener noreferrer">Apache Kafka<i class="fas fa-external-link-square-alt ms-1"></i></a> topics for its data source and sink. Then, we develop the pipeline as a Python package and add the package to a custom Docker image so that Python user code can be executed externally. For deployment, we create a Flink session cluster via the <a href="https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/" target="_blank" rel="noopener noreferrer">Flink Kubernetes Operator<i class="fas fa-external-link-square-alt ms-1"></i></a>, and deploy the pipeline using a Kubernetes job. Finally, we check the output of the application by sending messages to the input Kafka topic using a Python producer application.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-06-06-beam-deploy-2/featured.png" length="58020" type="image/png"/></item><item><title>Deploy Python Stream Processing App on Kubernetes - Part 1 PyFlink Application</title><link>https://jaehyeon.me/blog/2024-05-30-beam-deploy-1/</link><pubDate>Thu, 30 May 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-05-30-beam-deploy-1/</guid><description><![CDATA[<p><a href="https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/" target="_blank" rel="noopener noreferrer">Flink Kubernetes Operator<i class="fas fa-external-link-square-alt ms-1"></i></a> acts as a control plane to manage the complete deployment lifecycle of Apache Flink applications. With the operator, we can simplify deployment and management of Python stream processing applications. In this series, we discuss how to deploy a PyFlink application and Python Apache Beam pipeline on the <a href="https://beam.apache.org/documentation/runners/flink/" target="_blank" rel="noopener noreferrer">Flink Runner<i class="fas fa-external-link-square-alt ms-1"></i></a> on Kubernetes. In Part 1, we first deploy a Kafka cluster on a <a href="https://minikube.sigs.k8s.io/docs/" target="_blank" rel="noopener noreferrer">minikube<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster as the source and sink of the PyFlink application are Kafka topics. Then, the application source is packaged in a custom Docker image and deployed on the minikube cluster using the Flink Kubernetes Operator. Finally, the output of the application is checked by sending messages to the input Kafka topic using a Python producer application.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-05-30-beam-deploy-1/featured.png" length="64457" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 5 Testing Pipelines</title><link>https://jaehyeon.me/blog/2024-05-09-beam-local-dev-5/</link><pubDate>Thu, 09 May 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-05-09-beam-local-dev-5/</guid><description>We developed batch and streaming pipelines in Part 2 and Part 4. Often it is faster and simpler to identify and fix bugs on the pipeline code by performing local unit testing. 
Moreover, especially when it comes to creating a streaming pipeline, unit test cases can facilitate development further by using TestStream, as it allows us to advance watermarks or processing time according to different scenarios. In this post, we discuss how to perform unit testing of the batch and streaming pipelines that we developed earlier.</description><enclosure url="https://jaehyeon.me/blog/2024-05-09-beam-local-dev-5/featured.png" length="53603" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 4 Streaming Pipelines</title><link>https://jaehyeon.me/blog/2024-05-02-beam-local-dev-4/</link><pubDate>Thu, 02 May 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-05-02-beam-local-dev-4/</guid><description>In Part 3, we discussed the portability layer of Apache Beam as it helps understand (1) how Python pipelines run on the Flink Runner and (2) how multiple SDKs can be used in a single pipeline, followed by demonstrating local Flink and Kafka cluster creation for developing streaming pipelines. In this post, we build a streaming pipeline that aggregates page visits by user in a fixed time window of 20 seconds.</description><enclosure url="https://jaehyeon.me/blog/2024-05-02-beam-local-dev-4/featured.png" length="54556" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 3 Flink Runner</title><link>https://jaehyeon.me/blog/2024-04-18-beam-local-dev-3/</link><pubDate>Thu, 18 Apr 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-04-18-beam-local-dev-3/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In this series, we discuss local development of Apache Beam pipelines using Python. In the previous posts, we mainly talked about batch pipelines with/without Beam SQL. Beam pipelines are portable between batch and streaming semantics, and we will discuss streaming pipeline development in this and the next post.</description><enclosure url="https://jaehyeon.me/blog/2024-04-18-beam-local-dev-3/featured.png" length="262307" type="image/png"/></item><item><title>Kafka Development on Kubernetes - Part 3 Kafka Connect</title><link>https://jaehyeon.me/blog/2024-01-11-kafka-development-on-k8s-part-3/</link><pubDate>Thu, 11 Jan 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-01-11-kafka-development-on-k8s-part-3/</guid><description>Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. In this post, we discuss how to set up a data ingestion pipeline using Kafka connectors. Fake customer and order data is ingested into Kafka topics using the MSK Data Generator. Also, we use the Confluent S3 sink connector to save the messages of the topics into an S3 bucket.</description><enclosure url="https://jaehyeon.me/blog/2024-01-11-kafka-development-on-k8s-part-3/featured.png" length="97270" type="image/png"/></item><item><title>Kafka Development on Kubernetes - Part 2 Producer and Consumer</title><link>https://jaehyeon.me/blog/2024-01-04-kafka-development-on-k8s-part-2/</link><pubDate>Thu, 04 Jan 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-01-04-kafka-development-on-k8s-part-2/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Kafka has five core APIs, and we can develop applications to send/read streams of data to/from topics in a Kafka cluster using the producer and consumer APIs. While the main Kafka project maintains only the Java APIs, there are several open source projects that provide the Kafka client APIs in Python.</description><enclosure url="https://jaehyeon.me/blog/2024-01-04-kafka-development-on-k8s-part-2/featured.png" length="75889" type="image/png"/></item><item><title>Kafka Development on Kubernetes - Part 1 Cluster Setup</title><link>https://jaehyeon.me/blog/2023-12-21-kafka-development-on-k8s-part-1/</link><pubDate>Thu, 21 Dec 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-12-21-kafka-development-on-k8s-part-1/</guid><description>Apache Kafka is one of the key technologies for implementing data streaming architectures. Strimzi provides a way to run an Apache Kafka cluster and related resources on Kubernetes in various deployment configurations. In this series of posts, we will discuss how to create a Kafka cluster, to develop Kafka client applications in Python and to build a data pipeline using Kafka connectors on Kubernetes.
Part 1 Cluster Setup (this post)
Part 2 Producer and Consumer
Part 3 Kafka Connect
Setup Kafka Cluster
The Kafka cluster is deployed using the Strimzi Operator on a Minikube cluster.</description><enclosure url="https://jaehyeon.me/blog/2023-12-21-kafka-development-on-k8s-part-1/featured.png" length="108975" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 6 Consume data from Kafka using Lambda</title><link>https://jaehyeon.me/blog/2023-12-14-real-time-streaming-with-kafka-and-flink-7/</link><pubDate>Thu, 14 Dec 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-12-14-real-time-streaming-with-kafka-and-flink-7/</guid><description>Amazon MSK can be configured as an event source of a Lambda function. Lambda internally polls for new messages from the event source and then synchronously invokes the target Lambda function. With this feature, we can develop a Kafka consumer application in a serverless environment where developers can focus on application logic. In this lab, we will discuss how to create a Kafka consumer using a Lambda function.
Introduction
Lab 1 Produce data to Kafka using Lambda
Lab 2 Write data to Kafka from S3 using Flink
Lab 3 Transform and write data to S3 from Kafka using Flink
Lab 4 Clean, Aggregate, and Enrich Events with Flink
Lab 5 Write data to DynamoDB using Kafka Connect
Lab 6 Consume data from Kafka using Lambda (this post)
Architecture
Fake taxi ride data is sent to a Kafka topic by the Kafka producer application that is discussed in Lab 1.</description><enclosure url="https://jaehyeon.me/blog/2023-12-14-real-time-streaming-with-kafka-and-flink-7/featured.png" length="138986" type="image/png"/></item><item><title>Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images</title><link>https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/</link><pubDate>Thu, 07 Dec 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Flink became generally available for Amazon EMR on EKS from the EMR 6.15.0 release, and we are able to pull the Flink (as well as Spark) container images from the ECR Public Gallery.</description><enclosure url="https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/featured.png" length="133053" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 5 Write data to DynamoDB using Kafka Connect</title><link>https://jaehyeon.me/blog/2023-11-30-real-time-streaming-with-kafka-and-flink-6/</link><pubDate>Thu, 30 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-30-real-time-streaming-with-kafka-and-flink-6/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka.</description><enclosure url="https://jaehyeon.me/blog/2023-11-30-real-time-streaming-with-kafka-and-flink-6/featured.png" length="113252" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 4 Clean, Aggregate, and Enrich Events with Flink</title><link>https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/</link><pubDate>Thu, 23 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/</guid><description>The value of data can be maximised when it is used without delay. With Apache Flink, we can build streaming analytics applications that incorporate the latest events with low latency. In this lab, we will create a Pyflink application that writes accumulated taxi rides data into an OpenSearch cluster. It aggregates the number of trips/passengers and trip durations by vendor ID for a window of 5 seconds. The data is then used to create a chart that monitors the status of taxi rides in the OpenSearch Dashboard.</description><enclosure url="https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/featured.png" length="112340" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 3 Transform and write data to S3 from Kafka using Flink</title><link>https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/</link><pubDate>Thu, 16 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In this lab, we will create a Pyflink application that exports Kafka topic messages into an S3 bucket. The app enriches the records by adding a new column using a user-defined function and writes them via the FileSystem SQL connector.</description><enclosure url="https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/featured.png" length="160359" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 2 Write data to Kafka from S3 using Flink</title><link>https://jaehyeon.me/blog/2023-11-09-real-time-streaming-with-kafka-and-flink-3/</link><pubDate>Thu, 09 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-09-real-time-streaming-with-kafka-and-flink-3/</guid><description>In this lab, we will create a Pyflink application that reads records from S3 and sends them into a Kafka topic. A custom pipeline Jar file will be created as the Kafka cluster is authenticated by IAM, and it will be demonstrated how to execute the app in a Flink cluster deployed on Docker as well as locally as a typical Python app. We can assume the S3 data is static metadata that needs to be joined into another stream, and this exercise can be useful for data enrichment.</description><enclosure url="https://jaehyeon.me/blog/2023-11-09-real-time-streaming-with-kafka-and-flink-3/featured.png" length="139114" type="image/png"/></item><item><title>Benefits and Opportunities of Stateful Stream Processing</title><link>https://jaehyeon.me/blog/2023-11-02-stateful-stream-processing/</link><pubDate>Thu, 02 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-02-stateful-stream-processing/</guid><description>Stream processing technology is becoming more and more popular with companies big and small because it provides superior solutions for many established use cases such as data analytics, ETL, and transactional applications, but also facilitates novel applications, software architectures, and business opportunities. Beginning with traditional data infrastructures and application/data development patterns, this post introduces stateful stream processing and demonstrates to what extent it can improve the traditional development patterns. A consulting company can partner with its clients on their journeys of adopting stateful stream processing, and it can bring huge opportunities.</description><enclosure url="https://jaehyeon.me/blog/2023-11-02-stateful-stream-processing/featured.png" length="244920" type="image/png"/></item><item><title>Kafka Connect for AWS Services Integration - Part 5 Deploy Aiven OpenSearch Sink Connector</title><link>https://jaehyeon.me/blog/2023-10-30-kafka-connect-for-aws-part-5/</link><pubDate>Mon, 30 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-30-kafka-connect-for-aws-part-5/</guid><description>In the previous post, we discussed how to develop a data pipeline from Apache Kafka into OpenSearch locally using Docker. The pipeline will be deployed on AWS using Amazon MSK, Amazon MSK Connect and Amazon OpenSearch Service using Terraform in this post. First, the infrastructure will be deployed that covers a Virtual Private Cloud (VPC), Virtual Private Network (VPN) server, MSK Cluster and OpenSearch domain.
Then, Kafka source and sink connectors will be deployed on MSK Connect, followed by a quick data analysis.</description><enclosure url="https://jaehyeon.me/blog/2023-10-30-kafka-connect-for-aws-part-5/featured.png" length="85575" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 1 Produce data to Kafka using Lambda</title><link>https://jaehyeon.me/blog/2023-10-26-real-time-streaming-with-kafka-and-flink-2/</link><pubDate>Thu, 26 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-26-real-time-streaming-with-kafka-and-flink-2/</guid><description>In this lab, we will create a Kafka producer application using AWS Lambda, which sends fake taxi ride data into a Kafka topic on Amazon MSK. A configurable number of producer Lambda functions will be invoked by an Amazon EventBridge schedule rule. In this way, we are able to generate test data concurrently based on the desired volume of messages.
Introduction
Lab 1 Produce data to Kafka using Lambda (this post)
Lab 2 Write data to Kafka from S3 using Flink
Lab 3 Transform and write data to S3 from Kafka using Flink
Lab 4 Clean, Aggregate, and Enrich Events with Flink
Lab 5 Write data to DynamoDB using Kafka Connect
Lab 6 Consume data from Kafka using Lambda
[Update 2023-11-06] Initially I planned to deploy Pyflink applications on Amazon Managed Service for Apache Flink, but I changed the plan to use a local Flink cluster deployed on Docker.</description><enclosure url="https://jaehyeon.me/blog/2023-10-26-real-time-streaming-with-kafka-and-flink-2/featured.png" length="138560" type="image/png"/></item><item><title>Kafka Connect for AWS Services Integration - Part 4 Develop Aiven OpenSearch Sink Connector</title><link>https://jaehyeon.me/blog/2023-10-23-kafka-connect-for-aws-part-4/</link><pubDate>Mon, 23 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-23-kafka-connect-for-aws-part-4/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 OpenSearch is a popular search and analytics engine and its use cases cover log analytics, real-time application monitoring, and clickstream analysis. OpenSearch can be deployed on its own or via Amazon OpenSearch Service.</description><enclosure url="https://jaehyeon.me/blog/2023-10-23-kafka-connect-for-aws-part-4/featured.png" length="61820" type="image/png"/></item><item><title>Building Apache Flink Applications in Python</title><link>https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/</link><pubDate>Thu, 19 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/</guid><description>Building Apache Flink Applications in Java is a course to introduce Apache Flink through a series of hands-on exercises, and it is provided by Confluent. Utilising the Flink DataStream API, the course develops three Flink applications that populate multiple source data sets, collect them into a standardised data set, and aggregate it to produce usage statistics. As part of learning the Flink DataStream API in Pyflink, I converted the Java apps into Python equivalent while performing the course exercises in Pyflink.</description><enclosure url="https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/featured.png" length="154736" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Introduction</title><link>https://jaehyeon.me/blog/2023-10-05-real-time-streaming-with-kafka-and-flink-1/</link><pubDate>Thu, 05 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-05-real-time-streaming-with-kafka-and-flink-1/</guid><description>Real Time Streaming with Amazon Kinesis is an AWS workshop that helps users build a streaming analytics application on AWS. Incoming events are stored in a number of streams of the Amazon Kinesis Data Streams service, and various other AWS services and tools are used to process and analyse data.
Apache Kafka is a popular distributed event store and stream processing platform, and it stores incoming events in topics. As part of learning real time streaming analytics on AWS, we can rebuild the analytics applications by replacing the Kinesis streams with Kafka topics.</description><enclosure url="https://jaehyeon.me/blog/2023-10-05-real-time-streaming-with-kafka-and-flink-1/featured.png" length="138141" type="image/png"/></item><item><title>Kafka, Flink and DynamoDB for Real Time Fraud Detection - Part 2 Deployment via AWS Managed Flink</title><link>https://jaehyeon.me/blog/2023-09-14-fraud-detection-part-2/</link><pubDate>Thu, 14 Sep 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-09-14-fraud-detection-part-2/</guid><description>This series aims to help those who are new to Apache Flink and Amazon Managed Service for Apache Flink by re-implementing a simple fraud detection application that is discussed in an AWS workshop titled AWS Kafka and DynamoDB for real time fraud detection. In part 1, I demonstrated how to develop the application locally, and the app will be deployed via Amazon Managed Service for Apache Flink in this post.</description><enclosure url="https://jaehyeon.me/blog/2023-09-14-fraud-detection-part-2/featured.png" length="66221" type="image/png"/></item><item><title>Getting Started with Pyflink on AWS - Part 3 AWS Managed Flink and MSK</title><link>https://jaehyeon.me/blog/2023-09-04-getting-started-with-pyflink-on-aws-part-3/</link><pubDate>Mon, 04 Sep 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-09-04-getting-started-with-pyflink-on-aws-part-3/</guid><description>In this series of posts, we discuss a Flink (Pyflink) application that reads/writes from/to Kafka topics. In the previous posts, I demonstrated a Pyflink app that targets a local Kafka cluster as well as a Kafka cluster on Amazon MSK. The app was executed in a virtual environment as well as in a local Flink cluster for improved monitoring. In this post, the app will be deployed via Amazon Managed Service for Apache Flink, which is the easiest option to run Flink applications on AWS.</description><enclosure url="https://jaehyeon.me/blog/2023-09-04-getting-started-with-pyflink-on-aws-part-3/featured.png" length="74618" type="image/png"/></item><item><title>Getting Started with Pyflink on AWS - Part 2 Local Flink and MSK</title><link>https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/</link><pubDate>Mon, 28 Aug 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/</guid><description>In this series of posts, we discuss a Flink (Pyflink) application that reads/writes from/to Kafka topics. In part 1, an app that targets a local Kafka cluster was created. In this post, we will update the app by connecting to a Kafka cluster on Amazon MSK. The Kafka cluster is authenticated by IAM, and the app has an additional jar dependency.
As Amazon Managed Service for Apache Flink does not allow you to specify multiple pipeline jar files, we have to build a custom Uber Jar that combines multiple jar files.</description><enclosure url="https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/featured.png" length="64005" type="image/png"/></item><item><title>Getting Started with Pyflink on AWS - Part 1 Local Flink and Local Kafka</title><link>https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/</link><pubDate>Thu, 17 Aug 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Flink is an open-source, unified stream-processing and batch-processing framework. Its core is a distributed streaming data-flow engine that you can use to run real-time stream processing on high-throughput data sources. Currently, it is widely used to build applications for fraud/anomaly detection, rule-based alerting, business process monitoring, and continuous ETL to name a few.</description><enclosure url="https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/featured.png" length="55960" type="image/png"/></item><item><title>Kafka, Flink and DynamoDB for Real Time Fraud Detection - Part 1 Local Development</title><link>https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/</link><pubDate>Thu, 10 Aug 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Flink is an open-source, unified stream-processing and batch-processing framework. Its core is a distributed streaming data-flow engine that you can use to run real-time stream processing on high-throughput data sources. Currently, it is widely used to build applications for fraud/anomaly detection, rule-based alerting, business process monitoring, and continuous ETL to name a few.</description><enclosure url="https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/featured.png" length="72929" type="image/png"/></item><item><title>Kafka Development with Docker - Part 11 Kafka Authorization</title><link>https://jaehyeon.me/blog/2023-07-20-kafka-development-with-docker-part-11/</link><pubDate>Thu, 20 Jul 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-07-20-kafka-development-with-docker-part-11/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In the previous posts, we discussed how to implement client authentication by TLS (SSL or TLS/SSL) and SASL authentication. One of the key benefits of client authentication is achieving user access control. Kafka ships with a pluggable, out-of-the-box authorization framework, which is configured with the authorizer.</description><enclosure url="https://jaehyeon.me/blog/2023-07-20-kafka-development-with-docker-part-11/featured.png" length="458848" type="image/png"/></item><item><title>Kafka Development with Docker - Part 10 SASL Authentication</title><link>https://jaehyeon.me/blog/2023-07-13-kafka-development-with-docker-part-10/</link><pubDate>Thu, 13 Jul 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-07-13-kafka-development-with-docker-part-10/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In the previous post, we discussed TLS (SSL or TLS/SSL) authentication to improve security. It enforces two-way verification where a client certificate is verified by Kafka brokers. Client authentication can also be enabled by Simple Authentication and Security Layer (SASL), and we will discuss how to implement SASL authentication with Java and Python client examples in this post.</description><enclosure url="https://jaehyeon.me/blog/2023-07-13-kafka-development-with-docker-part-10/featured.png" length="471947" type="image/png"/></item><item><title>Kafka Development with Docker - Part 9 SSL Authentication</title><link>https://jaehyeon.me/blog/2023-07-06-kafka-development-with-docker-part-9/</link><pubDate>Thu, 06 Jul 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-07-06-kafka-development-with-docker-part-9/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
In the previous post, we discussed how to configure TLS (SSL or TLS/SSL) encryption with Java and Python client examples. SSL encryption is a one-way verification process where a server certificate is verified by a client via an SSL handshake.</description><enclosure url="https://jaehyeon.me/blog/2023-07-06-kafka-development-with-docker-part-9/featured.png" length="471471" type="image/png"/></item><item><title>Kafka Connect for AWS Services Integration - Part 3 Deploy Camel DynamoDB Sink Connector</title><link>https://jaehyeon.me/blog/2023-07-03-kafka-connect-for-aws-part-3/</link><pubDate>Mon, 03 Jul 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-07-03-kafka-connect-for-aws-part-3/</guid><description>As part of investigating how to utilize Kafka Connect effectively for AWS services integration, I demonstrated how to develop the Camel DynamoDB sink connector using Docker in Part 2. Fake order data was generated using the MSK Data Generator source connector, and the sink connector was configured to consume the topic messages to ingest them into a DynamoDB table. In this post, I will illustrate how to deploy the data ingestion applications using Amazon MSK and MSK Connect.</description><enclosure url="https://jaehyeon.me/blog/2023-07-03-kafka-connect-for-aws-part-3/featured.png" length="76240" type="image/png"/></item><item><title>Kafka Development with Docker - Part 8 SSL Encryption</title><link>https://jaehyeon.me/blog/2023-06-29-kafka-development-with-docker-part-8/</link><pubDate>Thu, 29 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-29-kafka-development-with-docker-part-8/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
By default, Apache Kafka communicates in PLAINTEXT, which means that all data is sent without being encrypted. To secure communication, we can configure Kafka clients and other components to use Transport Layer Security (TLS) encryption.</description><enclosure url="https://jaehyeon.me/blog/2023-06-29-kafka-development-with-docker-part-8/featured.png" length="469311" type="image/png"/></item><item><title>Kafka Development with Docker - Part 7 Producer and Consumer with Glue Schema Registry</title><link>https://jaehyeon.me/blog/2023-06-22-kafka-development-with-docker-part-7/</link><pubDate>Thu, 22 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-22-kafka-development-with-docker-part-7/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
In Part 4, we developed Kafka producer and consumer applications using the kafka-python package. The Kafka messages are serialized as JSON, but they are not associated with a schema, as there was no integrated schema registry.</description><enclosure url="https://jaehyeon.me/blog/2023-06-22-kafka-development-with-docker-part-7/featured.png" length="57175" type="image/png"/></item><item><title>Kafka Development with Docker - Part 6 Kafka Connect with Glue Schema Registry</title><link>https://jaehyeon.me/blog/2023-06-15-kafka-development-with-docker-part-6/</link><pubDate>Thu, 15 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-15-kafka-development-with-docker-part-6/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
In Part 3, we developed a data ingestion pipeline with fake online order data using Kafka Connect source and sink connectors. Schemas were not enabled on either of them, as there was no integrated schema registry.</description><enclosure url="https://jaehyeon.me/blog/2023-06-15-kafka-development-with-docker-part-6/featured.png" length="60354" type="image/png"/></item><item><title>Kafka Development with Docker - Part 5 Glue Schema Registry</title><link>https://jaehyeon.me/blog/2023-06-08-kafka-development-with-docker-part-5/</link><pubDate>Thu, 08 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-08-kafka-development-with-docker-part-5/</guid><description>As described in the Confluent documentation, Schema Registry provides a centralized repository for managing and validating schemas for topic message data, and for serialization and deserialization of the data over the network. Producers and consumers to Kafka topics can use schemas to ensure data consistency and compatibility as schemas evolve. In AWS, the Glue Schema Registry supports features to manage and enforce schemas on data streaming applications using convenient integrations with Apache Kafka, Amazon Managed Streaming for Apache Kafka, Amazon Kinesis Data Streams, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda.</description><enclosure url="https://jaehyeon.me/blog/2023-06-08-kafka-development-with-docker-part-5/featured.png" length="51170" type="image/png"/></item><item><title>Kafka Connect for AWS Services Integration - Part 2 Develop Camel DynamoDB Sink Connector</title><link>https://jaehyeon.me/blog/2023-06-04-kafka-connect-for-aws-part-2/</link><pubDate>Sun, 04 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-04-kafka-connect-for-aws-part-2/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
In Part 1, we reviewed Kafka connectors focusing on AWS services integration. Among the available connectors, the suite of Apache Camel Kafka connectors and the Kinesis Kafka connector from AWS Labs can be effective for building data ingestion pipelines on AWS.</description><enclosure url="https://jaehyeon.me/blog/2023-06-04-kafka-connect-for-aws-part-2/featured.png" length="87044" type="image/png"/></item><item><title>Kafka Development with Docker - Part 4 Producer and Consumer</title><link>https://jaehyeon.me/blog/2023-06-01-kafka-development-with-docker-part-4/</link><pubDate>Thu, 01 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-01-kafka-development-with-docker-part-4/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
In the previous post, we discussed how to use Kafka Connect to stream data to/from a Kafka cluster. Kafka also includes the Producer/Consumer APIs that allow client applications to send/read streams of data to/from topics in a Kafka cluster.</description><enclosure url="https://jaehyeon.me/blog/2023-06-01-kafka-development-with-docker-part-4/featured.png" length="75255" type="image/png"/></item><item><title>Kafka Development with Docker - Part 3 Kafka Connect</title><link>https://jaehyeon.me/blog/2023-05-25-kafka-development-with-docker-part-3/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-05-25-kafka-development-with-docker-part-3/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
According to the documentation of Apache Kafka, Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka.</description><enclosure url="https://jaehyeon.me/blog/2023-05-25-kafka-development-with-docker-part-3/featured.png" length="69998" type="image/png"/></item><item><title>Kafka Development with Docker - Part 2 Management App</title><link>https://jaehyeon.me/blog/2023-05-18-kafka-development-with-docker-part-2/</link><pubDate>Thu, 18 May 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-05-18-kafka-development-with-docker-part-2/</guid><description>In the previous post, I illustrated how to create a topic and produce/consume messages using the command utilities provided by Apache Kafka. They are not convenient, however, when, for example, you consume serialised messages whose schemas are stored in a schema registry. Also, the utilities don&amp;rsquo;t support browsing or managing related resources such as connectors and schemas. Therefore, a Kafka management app can be a good companion for development, helping you monitor and manage resources on an easy-to-use user interface.</description><enclosure url="https://jaehyeon.me/blog/2023-05-18-kafka-development-with-docker-part-2/featured.png" length="59675" type="image/png"/></item><item><title>Kafka Development with Docker - Part 1 Cluster Setup</title><link>https://jaehyeon.me/blog/2023-05-04-kafka-development-with-docker-part-1/</link><pubDate>Thu, 04 May 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-05-04-kafka-development-with-docker-part-1/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
I&amp;rsquo;m teaching myself modern data streaming architectures on AWS, and Apache Kafka is one of the key technologies; it can be used for messaging, activity tracking, stream processing, and so on. While applications tend to be deployed to the cloud, it is much easier to develop and test them locally with Docker and Docker Compose.</description><enclosure url="https://jaehyeon.me/blog/2023-05-04-kafka-development-with-docker-part-1/featured.png" length="98355" type="image/png"/></item><item><title>Kafka Connect for AWS Services Integration - Part 1 Introduction</title><link>https://jaehyeon.me/blog/2023-05-03-kafka-connect-for-aws-part-1/</link><pubDate>Wed, 03 May 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-05-03-kafka-connect-for-aws-part-1/</guid><description>Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (MSK) are two managed streaming services offered by AWS. Many resources on the web indicate Kinesis Data Streams is better when it comes to integrating with AWS services. However, this is not necessarily the case with the help of Kafka Connect. According to the documentation of Apache Kafka, Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems.</description><enclosure url="https://jaehyeon.me/blog/2023-05-03-kafka-connect-for-aws-part-1/featured.png" length="22272" type="image/png"/></item><item><title>Integrate Glue Schema Registry with Your Python Kafka App</title><link>https://jaehyeon.me/blog/2023-04-12-integrate-glue-schema-registry/</link><pubDate>Wed, 12 Apr 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-04-12-integrate-glue-schema-registry/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
As Kafka producer and consumer apps are decoupled, they operate on Kafka topics rather than communicating with each other directly. As described in the Confluent documentation, Schema Registry provides a centralized repository for managing and validating schemas for topic message data, and for serialization and deserialization of the data over the network.</description><enclosure url="https://jaehyeon.me/blog/2023-04-12-integrate-glue-schema-registry/featured.png" length="46040" type="image/png"/></item><item><title>Simplify Streaming Ingestion on AWS – Part 2 MSK and Athena</title><link>https://jaehyeon.me/blog/2023-03-14-simplify-streaming-ingestion-athena/</link><pubDate>Tue, 14 Mar 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-03-14-simplify-streaming-ingestion-athena/</guid><description>In Part 1, we discussed a streaming ingestion solution using EventBridge, Lambda, MSK and Redshift Serverless. Athena provides the MSK connector to enable SQL queries on Apache Kafka topics directly, and it can also facilitate the extraction of insights without setting up an additional pipeline to store data into S3. In this post, we discuss how to update the streaming ingestion solution so that data in the Kafka topic can be queried by Athena instead of Redshift.</description><enclosure url="https://jaehyeon.me/blog/2023-03-14-simplify-streaming-ingestion-athena/featured.png" length="43403" type="image/png"/></item><item><title>Simplify Streaming Ingestion on AWS – Part 1 MSK and Redshift</title><link>https://jaehyeon.me/blog/2023-02-08-simplify-streaming-ingestion-redshift/</link><pubDate>Wed, 08 Feb 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-02-08-simplify-streaming-ingestion-redshift/</guid><description>Apache Kafka is a popular distributed event store and stream processing platform. Previously, loading data from Kafka into Redshift and Athena usually required Kafka connectors (e.g. Amazon Redshift Sink Connector and Amazon S3 Sink Connector). Recently, these AWS services have added features to ingest data from Kafka directly, which facilitates a simpler architecture that achieves low-latency and high-speed ingestion of streaming data. In Part 1 of the Simplify Streaming Ingestion on AWS series, we discuss how to develop an end-to-end streaming ingestion solution using EventBridge, Lambda, MSK and Redshift Serverless on AWS.</description><enclosure url="https://jaehyeon.me/blog/2023-02-08-simplify-streaming-ingestion-redshift/featured.png" length="32864" type="image/png"/></item><item><title>How to configure Kafka consumers to seek offsets by timestamp</title><link>https://jaehyeon.me/blog/2023-01-10-kafka-consumer-seek-offsets/</link><pubDate>Tue, 10 Jan 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-01-10-kafka-consumer-seek-offsets/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
Normally, we consume Kafka messages from the beginning/end of a topic or from the last committed offsets. For backfilling or troubleshooting, however, we occasionally need to consume messages from a certain timestamp. If we know which topic partition to choose e.</description><enclosure url="https://jaehyeon.me/blog/2023-01-10-kafka-consumer-seek-offsets/featured.png" length="47217" type="image/png"/></item><item><title>Use External Schema Registry with MSK Connect – Part 2 MSK Deployment</title><link>https://jaehyeon.me/blog/2022-04-03-schema-registry-part2/</link><pubDate>Sun, 03 Apr 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-04-03-schema-registry-part2/</guid><description>In the previous post, we discussed a Change Data Capture (CDC) solution with a schema registry. A local development environment is set up using Docker Compose. The Debezium and Confluent S3 connectors are deployed with the Confluent Avro converter and the Apicurio registry is used as the schema registry service. A quick example is shown to illustrate how schema evolution can be managed by the schema registry. In this post, we&amp;rsquo;ll build the solution on AWS using MSK, MSK Connect, Aurora PostgreSQL and ECS.</description><enclosure url="https://jaehyeon.me/blog/2022-04-03-schema-registry-part2/featured.png" length="59689" type="image/png"/></item><item><title>Use External Schema Registry with MSK Connect – Part 1 Local Development</title><link>https://jaehyeon.me/blog/2022-03-07-schema-registry-part1/</link><pubDate>Mon, 07 Mar 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-03-07-schema-registry-part1/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
When we discussed a Change Data Capture (CDC) solution in one of the earlier posts, we used the JSON converter that comes with Kafka Connect. We optionally enabled the key and value schemas, and the topic messages included those schemas together with the payload.</description><enclosure url="https://jaehyeon.me/blog/2022-03-07-schema-registry-part1/featured.png" length="59689" type="image/png"/></item><item><title>Data Lake Demo using Change Data Capture (CDC) on AWS – Part 3 Implement Data Lake</title><link>https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/</link><pubDate>Sun, 19 Dec 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/</guid><description>In the previous post, we created a VPC that has private and public subnets in 2 availability zones in order to build and deploy the data lake solution on AWS. NAT instances are created to forward outbound traffic to the internet and a VPN bastion host is set up to facilitate deployment. An Aurora PostgreSQL cluster is deployed to host the source database and a Python command line app is used to create the database.</description><enclosure url="https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/featured.png" length="164526" type="image/png"/></item><item><title>Data Lake Demo using Change Data Capture (CDC) on AWS – Part 2 Implement CDC</title><link>https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/</link><pubDate>Sun, 12 Dec 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/</guid><description>In the previous post, we discussed a data lake solution where data ingestion is performed using change data capture (CDC) and the output files are upserted to an Apache Hudi table. Being registered to Glue Data Catalog, it can be used for ad-hoc queries and report/dashboard creation. The Northwind database is used as the source database and, following the transactional outbox pattern, order-related changes are upserted to an outbox table by triggers.</description><enclosure url="https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/featured.png" length="164526" type="image/png"/></item><item><title>Data Lake Demo using Change Data Capture (CDC) on AWS – Part 1 Local Development</title><link>https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/</link><pubDate>Sun, 05 Dec 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/</guid><description>Change data capture (CDC) is a proven data integration pattern that has a wide range of applications. Among those, data replication to data lakes is a good use case in data engineering. Coupled with best-in-breed data lake formats such as Apache Hudi, we can build an efficient data replication solution. This is the first post of the data lake demo series. Over time, we&amp;rsquo;ll build a data lake that uses CDC.</description><enclosure url="https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/featured.png" length="164526" type="image/png"/></item></channel></rss>