<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Apache Flink on Jaehyeon Kim</title><link>https://jaehyeon.me/tags/apache-flink/</link><description>Recent content in Apache Flink on Jaehyeon Kim</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Copyright © 2023-2026 Jaehyeon Kim. All Rights Reserved.</copyright><lastBuildDate>Tue, 21 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://jaehyeon.me/tags/apache-flink/index.xml" rel="self" type="application/rss+xml"/><item><title>Building a Real-Time Industrial Digital Twin with Apache Flink and Online Machine Learning</title><link>https://jaehyeon.me/blog/2026-04-21-digital-twin-online-machine-learning/</link><pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2026-04-21-digital-twin-online-machine-learning/</guid><description>Overview Imagine using a rolling pin to flatten out a thick piece of dough. A Hot Strip Mill does the exact same thing, but with glowing red-hot steel slabs (often heated over 1000°C) and massive mechanical rollers. The steel is passed through a series of these rollers, crushing it down from a thick block into a long, thin sheet.
Calculating the exact Rolling Force required to crush the steel is critical.</description><enclosure url="https://jaehyeon.me/blog/2026-04-21-digital-twin-online-machine-learning/featured.png" length="130846" type="image/png"/></item><item><title>Productionizing an Online Product Recommender using Event Driven Architecture</title><link>https://jaehyeon.me/blog/2026-02-23-productionize-recommender-with-eda/</link><pubDate>Mon, 23 Feb 2026 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2026-02-23-productionize-recommender-with-eda/</guid><description><![CDATA[<p>In <a href="/blog/2026-01-29-prototype-recommender-with-python/"><strong>Part 1</strong></a>, we built a contextual bandit prototype using Python and <a href="https://github.com/fidelity/mab2rec" target="_blank" rel="noopener noreferrer"><code>Mab2Rec</code><i class="fas fa-external-link-square-alt ms-1"></i></a>. While effective for testing algorithms locally, a monolithic script cannot handle production scale. Real-world recommendation systems require low-latency inference for users and high-throughput training for model updates.</p>
<p>This post demonstrates how to decouple these concerns using an event-driven architecture with Apache Flink, Kafka, and Redis.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2026-02-23-productionize-recommender-with-eda/featured.gif" length="702786" type="image/gif"/></item><item><title>Stream Processing with Flink in Kotlin</title><link>https://jaehyeon.me/blog/2025-12-10-streaming-processing-with-flink-in-kotlin/</link><pubDate>Wed, 10 Dec 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-12-10-streaming-processing-with-flink-in-kotlin/</guid><description><![CDATA[<p>A couple of years ago, I read <a href="https://www.oreilly.com/library/view/stream-processing-with/9781491974285/" target="_blank" rel="noopener noreferrer">Stream Processing with Apache Flink<i class="fas fa-external-link-square-alt ms-1"></i></a> and worked through the examples using PyFlink. While the book offered a solid introduction to Flink, I frequently hit limitations with the Python API, as many features from the book weren&rsquo;t supported. This time, I decided to revisit the material, but using Kotlin. The experience has been much more rewarding and fun.</p>
<p>In porting the examples to Kotlin, I also took the opportunity to align the code with modern Flink practices. The complete source for this post is available in the <a href="https://github.com/jaehyeon-kim/flink-demos/tree/master/stream-processing-with-flink" target="_blank" rel="noopener noreferrer"><code>stream-processing-with-flink</code><i class="fas fa-external-link-square-alt ms-1"></i></a> directory of the <code>flink-demos</code> GitHub repository.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-12-10-streaming-processing-with-flink-in-kotlin/featured.png" length="31883" type="image/png"/></item><item><title>Self-service Data Platform via a Multi-tenant SQL Gateway</title><link>https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/</link><pubDate>Thu, 17 Jul 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/</guid><description>In the modern data stack, providing direct access to powerful engines like Apache Spark and Flink is a double-edged sword. While it empowers users, it often leads to chaos: resource contention from &amp;ldquo;noisy neighbors,&amp;rdquo; inconsistent security enforcement, and operational fragility. The core problem is the lack of a robust control plane between users and the raw compute power. The solution, therefore, isn&amp;rsquo;t to take power away from users, but to manage it through an intelligent intermediary.</description><enclosure url="https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/featured.png" length="55011" type="image/png"/></item><item><title>Meet the Streamhouse Trio - Paimon, Fluss, and Iceberg for Unified Data Architectures</title><link>https://jaehyeon.me/blog/2025-05-06-streamhouse-trio/</link><pubDate>Tue, 06 May 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-05-06-streamhouse-trio/</guid><description><![CDATA[<p>The world of data is converging. The traditional divide between batch processing for historical analytics and stream processing for real-time insights is becoming increasingly blurry. Businesses demand architectures that handle both seamlessly. Enter the &ldquo;Streamhouse&rdquo; - an evolution of the Lakehouse concept, designed with streaming as a first-class citizen.</p>
<p>Today, we&rsquo;ll introduce three key open-source technologies shaping this space: <a href="https://paimon.apache.org/" target="_blank" rel="noopener noreferrer"><strong>Apache Paimon™</strong><i class="fas fa-external-link-square-alt ms-1"></i></a>, <a href="https://alibaba.github.io/fluss-docs/" target="_blank" rel="noopener noreferrer"><strong>Fluss</strong><i class="fas fa-external-link-square-alt ms-1"></i></a>, and <a href="https://iceberg.apache.org/" target="_blank" rel="noopener noreferrer"><strong>Apache Iceberg</strong><i class="fas fa-external-link-square-alt ms-1"></i></a>. While each has unique strengths, their true power lies in how they can be integrated to build robust, flexible, and performant data platforms.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-05-06-streamhouse-trio/featured.png" length="288793" type="image/png"/></item><item><title>Run Flink SQL Cookbook in Docker</title><link>https://jaehyeon.me/blog/2025-04-15-sql-cookbook/</link><pubDate>Tue, 15 Apr 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-04-15-sql-cookbook/</guid><description><![CDATA[<p>The <a href="https://github.com/ververica/flink-sql-cookbook" target="_blank" rel="noopener noreferrer">Flink SQL Cookbook<i class="fas fa-external-link-square-alt ms-1"></i></a> by Ververica is a hands-on, example-rich guide to mastering <a href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/overview/" target="_blank" rel="noopener noreferrer">Apache Flink SQL<i class="fas fa-external-link-square-alt ms-1"></i></a> for real-time stream processing. It offers a wide range of self-contained recipes, from basic queries and table operations to more advanced use cases like windowed aggregations, complex joins, user-defined functions (UDFs), and pattern detection. These examples are designed to be run on the Ververica Platform, and as such, the cookbook doesn&rsquo;t include instructions for setting up a Flink cluster.</p>
<p>To help you run these recipes locally and explore Flink SQL without external dependencies, this post walks through setting up a fully functional local Flink cluster using Docker Compose. With this setup, you can experiment with the cookbook examples right on your machine.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-04-15-sql-cookbook/featured.gif" length="319243" type="image/gif"/></item><item><title>Apache Beam Python Examples - Part 10 Develop Streaming File Reader using Splittable DoFn</title><link>https://jaehyeon.me/blog/2024-12-19-beam-examples-10/</link><pubDate>Thu, 19 Dec 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-12-19-beam-examples-10/</guid><description><![CDATA[<p>In <a href="/blog/2024-12-05-beam-examples-9">Part 9</a>, we developed two Apache Beam pipelines using <a href="https://beam.apache.org/documentation/programming-guide/#splittable-dofns" target="_blank" rel="noopener noreferrer"><em>Splittable DoFn (SDF)</em><i class="fas fa-external-link-square-alt ms-1"></i></a>. One of them is a batch file reader, which reads a list of files in an input folder followed by processing them in parallel. We can extend the I/O connector so that, instead of listing files once at the beginning, it scans an input folder periodically for new files and processes whenever new files are created in the folder. The techniques used in this post can be quite useful as they can be applied to developing I/O connectors that target other unbounded (or streaming) data sources (eg Kafka) using the Python SDK.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-12-19-beam-examples-10/featured.png" length="305211" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 9 Develop Batch File Reader and PiSampler using Splittable DoFn</title><link>https://jaehyeon.me/blog/2024-12-05-beam-examples-9/</link><pubDate>Thu, 05 Dec 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-12-05-beam-examples-9/</guid><description><![CDATA[<p>A <a href="https://beam.apache.org/documentation/programming-guide/#splittable-dofns" target="_blank" rel="noopener noreferrer"><em>Splittable DoFn (SDF)</em><i class="fas fa-external-link-square-alt ms-1"></i></a> is a generalization of a <em>DoFn</em> that enables Apache Beam developers to create modular and composable I/O components. Also, it can be applied in advanced non-I/O scenarios such as Monte Carlo simulation. In this post, we develop two Apache Beam pipelines. The first pipeline is an I/O connector, and it reads a list of files in a folder followed by processing each of the file objects in parallel. The second pipeline estimates the value of $\pi$ by performing Monte Carlo simulation.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-12-05-beam-examples-9/featured.png" length="309371" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 8 Enhance Sport Activity Tracker with Runner Motivation</title><link>https://jaehyeon.me/blog/2024-11-21-beam-examples-8/</link><pubDate>Thu, 21 Nov 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-11-21-beam-examples-8/</guid><description>&lt;p>In &lt;a href="/blog/2024-08-01-beam-examples-3">Part 3&lt;/a>, we developed a Beam pipeline that tracks sport activities of users and outputs their speeds periodically. While reporting such values is useful for users on its own, we can provide more engaging information to users if we have a pipeline that reports pacing of their activities over periods. For example, we can send a message to encourage a user to work harder if he/she has a performance goal and is underperforming for some periods. In this post, we develop a new pipeline that tracks user activities and reports pacing details by comparing short term metrics to their long term counterparts.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2024-11-21-beam-examples-8/featured.png" length="402888" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 7 Separate Droppable Data into Side Output</title><link>https://jaehyeon.me/blog/2024-10-24-beam-examples-7/</link><pubDate>Thu, 24 Oct 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-10-24-beam-examples-7/</guid><description><![CDATA[<p>We develop an Apache Beam pipeline that separates <em>droppable</em> elements from the rest of the data. <em>Droppable</em> elements are those that come later when the watermark passes the window max timestamp plus allowed lateness. Using a timer in a <em>Stateful</em> DoFn, <em>droppable</em> data is separated from normal data and dispatched into a side output rather than being discarded silently, which is the default behaviour. Note that this pipeline works in a situation where <em>droppable</em> elements do not appear often, and thus the chance that a <em>droppable</em> element is delivered as the first element in a particular window is low.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-10-24-beam-examples-7/featured.png" length="214574" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 6 Call RPC Service in Batch with Defined Batch Size using Stateful DoFn</title><link>https://jaehyeon.me/blog/2024-10-02-beam-examples-6/</link><pubDate>Wed, 02 Oct 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-10-02-beam-examples-6/</guid><description><![CDATA[<p>In the <a href="/blog/2024-09-25-beam-examples-5">previous post</a>, we continued discussing an Apache Beam pipeline that arguments input data by calling a <strong>Remote Procedure Call (RPC)</strong> service. A pipeline was developed that makes a single RPC call for a bundle of elements. The bundle size is determined by the runner, however, we may encounter an issue e.g. if an RPC service becomes quite slower if many elements are included in a single request. We can improve the pipeline using stateful <code>DoFn</code> where the number elements to process and maximum wait seconds can be controlled by <em>state</em> and <em>timers</em>. Note that, although the stateful <code>DoFn</code> used in this post solves the data augmentation task well, in practice, we should use the built-in transforms such as <a href="https://beam.apache.org/documentation/transforms/python/aggregation/batchelements/" target="_blank" rel="noopener noreferrer">BatchElements<i class="fas fa-external-link-square-alt ms-1"></i></a> and <a href="https://beam.apache.org/documentation/transforms/python/aggregation/groupintobatches/" target="_blank" rel="noopener noreferrer">GroupIntoBatches<i class="fas fa-external-link-square-alt ms-1"></i></a> whenever possible.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-10-02-beam-examples-6/featured.png" length="99452" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 5 Call RPC Service in Batch using Stateless DoFn</title><link>https://jaehyeon.me/blog/2024-09-18-beam-examples-5/</link><pubDate>Wed, 18 Sep 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-09-18-beam-examples-5/</guid><description><![CDATA[<p>In the <a href="/blog/2024-08-15-beam-examples-4">previous post</a>, we developed an Apache Beam pipeline where the input data is augmented by a <strong>Remote Procedure Call (RPC)</strong> service. Each input element performs an RPC call and the output is enriched by the response. This is not an efficient way of accessing an external service provided that the service can accept more than one element. In this post, we discuss how to enhance the pipeline so that a single RPC call is made for a bundle of elements, which can save a significant amount time compared to making a call for each element.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-09-18-beam-examples-5/featured.png" length="95285" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 4 Call RPC Service for Data Augmentation</title><link>https://jaehyeon.me/blog/2024-08-15-beam-examples-4/</link><pubDate>Thu, 15 Aug 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-08-15-beam-examples-4/</guid><description>&lt;p>In this post, we develop an Apache Beam pipeline where the input data is augmented by a &lt;strong>Remote Procedure Call (RPC)&lt;/strong> service. Each input element performs an RPC call and the output is enriched by the response. This is not an efficient way of accessing an external service provided that the service can accept more than one element. In the subsequent two posts, we will discuss updated pipelines that make RPC calls more efficiently. We begin with illustrating how to manage development resources followed by demonstrating the RPC service that we use in this series. Finally, we develop a Beam pipeline that accesses the external service to augment the input elements.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2024-08-15-beam-examples-4/featured.png" length="93408" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 3 Build Sport Activity Tracker with/without SQL</title><link>https://jaehyeon.me/blog/2024-08-01-beam-examples-3/</link><pubDate>Thu, 01 Aug 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-08-01-beam-examples-3/</guid><description><![CDATA[<p>In this post, we develop two Apache Beam pipelines that track sport activities of users and output their speed periodically. The first pipeline uses native transforms and <a href="https://beam.apache.org/documentation/dsls/sql/overview/" target="_blank" rel="noopener noreferrer">Beam SQL<i class="fas fa-external-link-square-alt ms-1"></i></a> is used for the latter. While <em>Beam SQL</em> can be useful in some situations, its features in the Python SDK are not complete compared to the Java SDK. Therefore, we are not able to build the required tracking pipeline using it. We end up discussing potential improvements of <em>Beam SQL</em> so that it can be used for building competitive applications with the Python SDK.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-08-01-beam-examples-3/featured.png" length="94507" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 2 Calculate Average Word Length with/without Fixed Look back</title><link>https://jaehyeon.me/blog/2024-07-18-beam-examples-2/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-07-18-beam-examples-2/</guid><description>&lt;p>In this post, we develop two Apache Beam pipelines that calculate average word lengths from input texts that are ingested by a Kafka topic. They obtain the statistics in different angles. The first pipeline emits the global average lengths whenever a new input text arrives while the latter triggers those values in a sliding time window.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2024-07-18-beam-examples-2/featured.png" length="96924" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 1 Calculate K Most Frequent Words and Max Word Length</title><link>https://jaehyeon.me/blog/2024-07-04-beam-examples-1/</link><pubDate>Thu, 04 Jul 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-07-04-beam-examples-1/</guid><description><![CDATA[<p>In this series, we develop <a href="https://beam.apache.org/" target="_blank" rel="noopener noreferrer">Apache Beam<i class="fas fa-external-link-square-alt ms-1"></i></a> Python pipelines. The majority of them are from <a href="https://www.packtpub.com/en-us/product/building-big-data-pipelines-with-apache-beam-9781800564930" target="_blank" rel="noopener noreferrer">Building Big Data Pipelines with Apache Beam by Jan Lukavský<i class="fas fa-external-link-square-alt ms-1"></i></a>. Mainly relying on the Java SDK, the book teaches fundamentals of Apache Beam using hands-on tasks, and we convert those tasks using the Python SDK. We focus on streaming pipelines, and they are deployed on a local (or embedded) <a href="https://flink.apache.org/" target="_blank" rel="noopener noreferrer">Apache Flink<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster using the <a href="https://beam.apache.org/documentation/runners/flink/" target="_blank" rel="noopener noreferrer">Apache Flink Runner<i class="fas fa-external-link-square-alt ms-1"></i></a>. Beginning with setting up the development environment, we build two pipelines that obtain top K most frequent words and the word that has the longest word length in this post.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-07-04-beam-examples-1/featured.png" length="96881" type="image/png"/></item><item><title>Deploy Python Stream Processing App on Kubernetes - Part 2 Beam Pipeline on Flink Runner</title><link>https://jaehyeon.me/blog/2024-06-06-beam-deploy-2/</link><pubDate>Thu, 06 Jun 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-06-06-beam-deploy-2/</guid><description><![CDATA[<p>In this post, we develop an <a href="https://beam.apache.org/" target="_blank" rel="noopener noreferrer">Apache Beam<i class="fas fa-external-link-square-alt ms-1"></i></a> pipeline using the <a href="https://beam.apache.org/documentation/sdks/python/" target="_blank" rel="noopener noreferrer">Python SDK<i class="fas fa-external-link-square-alt ms-1"></i></a> and deploy it on an <a href="https://flink.apache.org/" target="_blank" rel="noopener noreferrer">Apache Flink<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster via the <a href="https://beam.apache.org/documentation/runners/flink/" target="_blank" rel="noopener noreferrer">Apache Flink Runner<i class="fas fa-external-link-square-alt ms-1"></i></a>. Same as <a href="/blog/2024-05-30-beam-deploy-1">Part I</a>, we deploy a Kafka cluster using the <a href="https://strimzi.io/" target="_blank" rel="noopener noreferrer">Strimzi Operator<i class="fas fa-external-link-square-alt ms-1"></i></a> on a <a href="https://minikube.sigs.k8s.io/docs/" target="_blank" rel="noopener noreferrer">minikube<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster as the pipeline uses <a href="https://kafka.apache.org/" target="_blank" rel="noopener noreferrer">Apache Kafka<i class="fas fa-external-link-square-alt ms-1"></i></a> topics for its data source and sink. Then, we develop the pipeline as a Python package and add the package to a custom Docker image so that Python user code can be executed externally. For deployment, we create a Flink session cluster via the <a href="https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/" target="_blank" rel="noopener noreferrer">Flink Kubernetes Operator<i class="fas fa-external-link-square-alt ms-1"></i></a>, and deploy the pipeline using a Kubernetes job. Finally, we check the output of the application by sending messages to the input Kafka topic using a Python producer application.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-06-06-beam-deploy-2/featured.png" length="58020" type="image/png"/></item><item><title>Deploy Python Stream Processing App on Kubernetes - Part 1 PyFlink Application</title><link>https://jaehyeon.me/blog/2024-05-30-beam-deploy-1/</link><pubDate>Thu, 30 May 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-05-30-beam-deploy-1/</guid><description><![CDATA[<p><a href="https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/" target="_blank" rel="noopener noreferrer">Flink Kubernetes Operator<i class="fas fa-external-link-square-alt ms-1"></i></a> acts as a control plane to manage the complete deployment lifecycle of Apache Flink applications. With the operator, we can simplify deployment and management of Python stream processing applications. In this series, we discuss how to deploy a PyFlink application and Python Apache Beam pipeline on the <a href="https://beam.apache.org/documentation/runners/flink/" target="_blank" rel="noopener noreferrer">Flink Runner<i class="fas fa-external-link-square-alt ms-1"></i></a> on Kubernetes. In Part 1, we first deploy a Kafka cluster on a <a href="https://minikube.sigs.k8s.io/docs/" target="_blank" rel="noopener noreferrer">minikube<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster as the source and sink of the PyFlink application are Kafka topics. Then, the application source is packaged in a custom Docker image and deployed on the minikube cluster using the Flink Kubernetes Operator. Finally, the output of the application is checked by sending messages to the input Kafka topic using a Python producer application.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-05-30-beam-deploy-1/featured.png" length="64457" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 4 Streaming Pipelines</title><link>https://jaehyeon.me/blog/2024-05-02-beam-local-dev-4/</link><pubDate>Thu, 02 May 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-05-02-beam-local-dev-4/</guid><description>In Part 3, we discussed the portability layer of Apache Beam as it helps understand (1) how Python pipelines run on the Flink Runner and (2) how multiple SDKs can be used in a single pipeline, followed by demonstrating local Flink and Kafka cluster creation for developing streaming pipelines. In this post, we build a streaming pipeline that aggregates page visits by user in a fixed time window of 20 seconds.</description><enclosure url="https://jaehyeon.me/blog/2024-05-02-beam-local-dev-4/featured.png" length="54556" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 3 Flink Runner</title><link>https://jaehyeon.me/blog/2024-04-18-beam-local-dev-3/</link><pubDate>Thu, 18 Apr 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-04-18-beam-local-dev-3/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In this series, we discuss local development of Apache Beam pipelines using Python. In the previous posts, we mainly talked about Batch pipelines with/without Beam SQL. Beam pipelines are portable between batch and streaming semantics, and we will discuss streaming pipeline development in this and the next posts.</description><enclosure url="https://jaehyeon.me/blog/2024-04-18-beam-local-dev-3/featured.png" length="262307" type="image/png"/></item><item><title>Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images</title><link>https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/</link><pubDate>Thu, 07 Dec 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Flink became generally available for Amazon EMR on EKS from the EMR 6.15.0 releases, and we are able to pull the Flink (as well as Spark) container images from the ECR Public Gallery.</description><enclosure url="https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/featured.png" length="133053" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 5 Write data to DynamoDB using Kafka Connect</title><link>https://jaehyeon.me/blog/2023-11-30-real-time-streaming-with-kafka-and-flink-6/</link><pubDate>Thu, 30 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-30-real-time-streaming-with-kafka-and-flink-6/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka.</description><enclosure url="https://jaehyeon.me/blog/2023-11-30-real-time-streaming-with-kafka-and-flink-6/featured.png" length="113252" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 4 Clean, Aggregate, and Enrich Events with Flink</title><link>https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/</link><pubDate>Thu, 23 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/</guid><description>The value of data can be maximised when it is used without delay. With Apache Flink, we can build streaming analytics applications that incorporate the latest events with low latency. In this lab, we will create a Pyflink application that writes accumulated taxi rides data into an OpenSearch cluster. It aggregates the number of trips/passengers and trip durations by vendor ID for a window of 5 seconds. The data is then used to create a chart that monitors the status of taxi rides in the OpenSearch Dashboard.</description><enclosure url="https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/featured.png" length="112340" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 3 Transform and write data to S3 from Kafka using Flink</title><link>https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/</link><pubDate>Thu, 16 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In this lab, we will create a Pyflink application that exports Kafka topic messages into a S3 bucket. The app enriches the records by adding a new column using a user defined function and writes them via the FileSystem SQL connector.</description><enclosure url="https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/featured.png" length="160359" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 2 Write data to Kafka from S3 using Flink</title><link>https://jaehyeon.me/blog/2023-11-09-real-time-streaming-with-kafka-and-flink-3/</link><pubDate>Thu, 09 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-09-real-time-streaming-with-kafka-and-flink-3/</guid><description>In this lab, we will create a Pyflink application that reads records from S3 and sends them into a Kafka topic. A custom pipeline Jar file will be created as the Kafka cluster is authenticated by IAM, and it will be demonstrated how to execute the app in a Flink cluster deployed on Docker as well as locally as a typical Python app. We can assume the S3 data is static metadata that needs to be joined into another stream, and this exercise can be useful for data enrichment.</description><enclosure url="https://jaehyeon.me/blog/2023-11-09-real-time-streaming-with-kafka-and-flink-3/featured.png" length="139114" type="image/png"/></item><item><title>Benefits and Opportunities of Stateful Stream Processing</title><link>https://jaehyeon.me/blog/2023-11-02-stateful-stream-processing/</link><pubDate>Thu, 02 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-02-stateful-stream-processing/</guid><description>Stream processing technology is becoming more and more popular with companies big and small because it provides superior solutions for many established use cases such as data analytics, ETL, and transactional applications, but also facilitates novel applications, software architectures, and business opportunities. Beginning with traditional data infrastructures and application/data development patterns, this post introduces stateful stream processing and demonstrates to what extent it can improve the traditional development patterns. A consulting company can partner with her clients on their journeys of adopting stateful stream processing, and it can bring huge opportunities.</description><enclosure url="https://jaehyeon.me/blog/2023-11-02-stateful-stream-processing/featured.png" length="244920" type="image/png"/></item><item><title>Building Apache Flink Applications in Python</title><link>https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/</link><pubDate>Thu, 19 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/</guid><description>Building Apache Flink Applications in Java is a course to introduce Apache Flink through a series of hands-on exercises, and it is provided by Confluent. Utilising the Flink DataStream API, the course develops three Flink applications that populate multiple source data sets, collect them into a standardised data set, and aggregate it to produce usage statistics. As part of learning the Flink DataStream API in Pyflink, I converted the Java apps into Python equivalent while performing the course exercises in Pyflink.</description><enclosure url="https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/featured.png" length="154736" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Introduction</title><link>https://jaehyeon.me/blog/2023-10-05-real-time-streaming-with-kafka-and-flink-1/</link><pubDate>Thu, 05 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-05-real-time-streaming-with-kafka-and-flink-1/</guid><description>Real Time Streaming with Amazon Kinesis is an AWS workshop that helps users build a streaming analytics application on AWS. Incoming events are stored in a number of streams of the Amazon Kinesis Data Streams service, and various other AWS services and tools are used to process and analyse data.
Apache Kafka is a popular distributed event store and stream processing platform, and it stores incoming events in topics. As part of learning real time streaming analytics on AWS, we can rebuild the analytics applications by replacing the Kinesis streams with Kafka topics.</description><enclosure url="https://jaehyeon.me/blog/2023-10-05-real-time-streaming-with-kafka-and-flink-1/featured.png" length="138141" type="image/png"/></item><item><title>Kafka, Flink and DynamoDB for Real Time Fraud Detection - Part 2 Deployment via AWS Managed Flink</title><link>https://jaehyeon.me/blog/2023-09-14-fraud-detection-part-2/</link><pubDate>Thu, 14 Sep 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-09-14-fraud-detection-part-2/</guid><description>This series aims to help those who are new to Apache Flink and Amazon Managed Service for Apache Flink by re-implementing a simple fraud detection application that is discussed in an AWS workshop titled AWS Kafka and DynamoDB for real time fraud detection. In part 1, I demonstrated how to develop the application locally, and the app will be deployed via Amazon Managed Service for Apache Flink in this post.</description><enclosure url="https://jaehyeon.me/blog/2023-09-14-fraud-detection-part-2/featured.png" length="66221" type="image/png"/></item><item><title>Getting Started with Pyflink on AWS - Part 2 Local Flink and MSK</title><link>https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/</link><pubDate>Mon, 28 Aug 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/</guid><description>In this series of posts, we discuss a Flink (Pyflink) application that reads/writes from/to Kafka topics. In part 1, an app that targets a local Kafka cluster was created. In this post, we will update the app by connecting a Kafka cluster on Amazon MSK. The Kafka cluster is authenticated by IAM and the app has additional jar dependency. As Amazon Managed Service for Apache Flink does not allow you to specify multiple pipeline jar files, we have to build a custom Uber Jar that combines multiple jar files.</description><enclosure url="https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/featured.png" length="64005" type="image/png"/></item><item><title>Getting Started with Pyflink on AWS - Part 1 Local Flink and Local Kafka</title><link>https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/</link><pubDate>Thu, 17 Aug 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Flink is an open-source, unified stream-processing and batch-processing framework. Its core is a distributed streaming data-flow engine that you can use to run real-time stream processing on high-throughput data sources. Currently, it is widely used to build applications for fraud/anomaly detection, rule-based alerting, business process monitoring, and continuous ETL to name a few.</description><enclosure url="https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/featured.png" length="55960" type="image/png"/></item><item><title>Kafka, Flink and DynamoDB for Real Time Fraud Detection - Part 1 Local Development</title><link>https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/</link><pubDate>Thu, 10 Aug 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Flink is an open-source, unified stream-processing and batch-processing framework. Its core is a distributed streaming data-flow engine that you can use to run real-time stream processing on high-throughput data sources. Currently, it is widely used to build applications for fraud/anomaly detection, rule-based alerting, business process monitoring, and continuous ETL to name a few.</description><enclosure url="https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/featured.png" length="72929" type="image/png"/></item></channel></rss>