<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Python on Jaehyeon Kim</title><link>https://jaehyeon.me/tags/python/</link><description>Recent content in Python on Jaehyeon Kim</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Copyright © 2023-2026 Jaehyeon Kim. All Rights Reserved.</copyright><lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://jaehyeon.me/tags/python/index.xml" rel="self" type="application/rss+xml"/><item><title>Building an Event-Driven Hybrid Digital Twin with dynamic-des</title><link>https://jaehyeon.me/blog/2026-04-28-digital-twin-dynamic-des/</link><pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2026-04-28-digital-twin-dynamic-des/</guid><description>Asynchronous Gap In Part 1, we established that a true Hybrid Digital Twin does more than just mirror reality. It actively forecasts the future by running a simulation against live operational states.
If you have ever tried to build one of these systems from scratch, you immediately hit a fundamental architectural clash.
Standard simulation clocks (like those in traditional SimPy implementations) are logically synchronous and not designed to handle high-frequency asynchronous I/O without explicit decoupling.</description><enclosure url="https://jaehyeon.me/blog/2026-04-28-digital-twin-dynamic-des/featured.png" length="177036" type="image/png"/></item><item><title>Why Digital Twins Are Rewiring Industry 4.0</title><link>https://jaehyeon.me/blog/2026-04-23-digital-twin-industry-4-0/</link><pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2026-04-23-digital-twin-industry-4-0/</guid><description>Beyond CAD Models There is a project by Dassault Systèmes called the Living Heart that illustrates the trajectory of this technology. Instead of relying on standard 2D scans, surgeons can pull up a high-fidelity 3D model of a patient&amp;rsquo;s heart that simulates blood flow, mechanics, and electricity based on imaging-derived reconstructions and population-based physiological calibration.
While the Living Heart is closer to a personalized, highly-parameterized simulation than a continuously streaming IoT system, it highlights the core philosophy of a modern digital twin: moving past static CAD files to create models that are fundamentally aligned with a specific, real-world physical instance.</description><enclosure url="https://jaehyeon.me/blog/2026-04-23-digital-twin-industry-4-0/featured.png" length="73776" type="image/png"/></item><item><title>Building a Real-Time Industrial Digital Twin with Apache Flink and Online Machine Learning</title><link>https://jaehyeon.me/blog/2026-04-21-digital-twin-online-machine-learning/</link><pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2026-04-21-digital-twin-online-machine-learning/</guid><description>Overview Imagine using a rolling pin to flatten out a thick piece of dough. A Hot Strip Mill does the exact same thing, but with glowing red-hot steel slabs (often heated over 1000°C) and massive mechanical rollers. The steel is passed through a series of these rollers, crushing it down from a thick block into a long, thin sheet.
Calculating the exact Rolling Force required to crush the steel is critical.</description><enclosure url="https://jaehyeon.me/blog/2026-04-21-digital-twin-online-machine-learning/featured.png" length="130846" type="image/png"/></item><item><title>Productionizing an Online Product Recommender using Event Driven Architecture</title><link>https://jaehyeon.me/blog/2026-02-23-productionize-recommender-with-eda/</link><pubDate>Mon, 23 Feb 2026 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2026-02-23-productionize-recommender-with-eda/</guid><description><![CDATA[<p>In <a href="/blog/2026-01-29-prototype-recommender-with-python/"><strong>Part 1</strong></a>, we built a contextual bandit prototype using Python and <a href="https://github.com/fidelity/mab2rec" target="_blank" rel="noopener noreferrer"><code>Mab2Rec</code><i class="fas fa-external-link-square-alt ms-1"></i></a>. While effective for testing algorithms locally, a monolithic script cannot handle production scale. Real-world recommendation systems require low-latency inference for users and high-throughput training for model updates.</p>
<p>This post demonstrates how to decouple these concerns using an event-driven architecture with Apache Flink, Kafka, and Redis.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2026-02-23-productionize-recommender-with-eda/featured.gif" length="702786" type="image/gif"/></item><item><title>Prototyping an Online Product Recommender in Python</title><link>https://jaehyeon.me/blog/2026-01-29-prototype-recommender-with-python/</link><pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2026-01-29-prototype-recommender-with-python/</guid><description>Overview Traditional recommendation approaches such as Collaborative Filtering remain widely adopted, yet they come with notable constraints. They are particularly vulnerable to the cold-start problem, where new users lack sufficient interaction history, and they depend heavily on long-term behavioral data. As a result, they frequently overlook real-time contextual signals, including time of day, device type, location, or session intent. This can prevent them from capturing situational preferences, such as someone preferring coffee in the morning but pizza in the evening.</description><enclosure url="https://jaehyeon.me/blog/2026-01-29-prototype-recommender-with-python/featured.gif" length="733023" type="image/gif"/></item><item><title>Guide to Building Integrated Web Applications with FastAPI and NiceGUI</title><link>https://jaehyeon.me/blog/2025-11-19-fastapi-nicegui-template/</link><pubDate>Wed, 19 Nov 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-11-19-fastapi-nicegui-template/</guid><description>&lt;p>The standard architecture for modern web applications involves a decoupled frontend, typically built with a JavaScript framework, and a backend API. This pattern is powerful but introduces complexity in managing two separate codebases, development environments, and the API contract between them.&lt;/p>
&lt;p>This article explores an alternative approach: an integrated architecture where the backend API and the frontend UI are served from a single, cohesive Python application.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2025-11-19-fastapi-nicegui-template/featured.gif" length="2611548" type="image/gif"/></item><item><title>Realtime Dashboard with FastAPI, Streamlit and Next.js - Part 2 Streamlit Dashboard</title><link>https://jaehyeon.me/blog/2025-02-25-realtime-dashboard-2/</link><pubDate>Tue, 25 Feb 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-02-25-realtime-dashboard-2/</guid><description><![CDATA[<p>In this post, we develop a real-time monitoring dashboard using <a href="https://streamlit.io/" target="_blank" rel="noopener noreferrer">Streamlit<i class="fas fa-external-link-square-alt ms-1"></i></a>, an open-source Python framework that allows data scientists and AI/ML engineers to create interactive data apps. The app connects to the WebSocket server we developed in <a href="/blog/2025-02-18-realtime-dashboard-1">Part 1</a> and continuously fetches data to visualize key metrics such as <strong>order counts</strong>, <strong>sales data</strong>, and <strong>revenue by traffic source and country</strong>. With interactive bar charts and dynamic metrics, users can monitor sales trends and other important business KPIs in real-time.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-02-25-realtime-dashboard-2/featured.gif" length="417216" type="image/gif"/></item><item><title>Realtime Dashboard with FastAPI, Streamlit and Next.js - Part 1 Data Producer</title><link>https://jaehyeon.me/blog/2025-02-18-realtime-dashboard-1/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-02-18-realtime-dashboard-1/</guid><description><![CDATA[<p>In this series, we develop real-time monitoring dashboard applications. A data generating app is created with Python, and it ingests the <a href="https://console.cloud.google.com/marketplace/product/bigquery-public-data/thelook-ecommerce" target="_blank" rel="noopener noreferrer">theLook eCommerce<i class="fas fa-external-link-square-alt ms-1"></i></a> data continuously into a PostgreSQL database. A WebSocket server, built by <a href="https://fastapi.tiangolo.com/" target="_blank" rel="noopener noreferrer">FastAPI<i class="fas fa-external-link-square-alt ms-1"></i></a>, periodically queries the data to serve its clients. The monitoring dashboards will be developed using <a href="https://streamlit.io/" target="_blank" rel="noopener noreferrer">Streamlit<i class="fas fa-external-link-square-alt ms-1"></i></a> and <a href="https://nextjs.org/" target="_blank" rel="noopener noreferrer">Next.js<i class="fas fa-external-link-square-alt ms-1"></i></a>, with <a href="https://echarts.apache.org/en/index.html" target="_blank" rel="noopener noreferrer">Apache ECharts<i class="fas fa-external-link-square-alt ms-1"></i></a> for visualization. In this post, we walk through the data generation app and backend API, while the monitoring dashboards will be discussed in later posts.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-02-18-realtime-dashboard-1/featured.gif" length="1207440" type="image/gif"/></item><item><title>Apache Beam Python Examples - Part 10 Develop Streaming File Reader using Splittable DoFn</title><link>https://jaehyeon.me/blog/2024-12-19-beam-examples-10/</link><pubDate>Thu, 19 Dec 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-12-19-beam-examples-10/</guid><description><![CDATA[<p>In <a href="/blog/2024-12-05-beam-examples-9">Part 9</a>, we developed two Apache Beam pipelines using <a href="https://beam.apache.org/documentation/programming-guide/#splittable-dofns" target="_blank" rel="noopener noreferrer"><em>Splittable DoFn (SDF)</em><i class="fas fa-external-link-square-alt ms-1"></i></a>. One of them is a batch file reader, which reads a list of files in an input folder followed by processing them in parallel. We can extend the I/O connector so that, instead of listing files once at the beginning, it scans an input folder periodically for new files and processes whenever new files are created in the folder. The techniques used in this post can be quite useful as they can be applied to developing I/O connectors that target other unbounded (or streaming) data sources (eg Kafka) using the Python SDK.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-12-19-beam-examples-10/featured.png" length="305211" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 9 Develop Batch File Reader and PiSampler using Splittable DoFn</title><link>https://jaehyeon.me/blog/2024-12-05-beam-examples-9/</link><pubDate>Thu, 05 Dec 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-12-05-beam-examples-9/</guid><description><![CDATA[<p>A <a href="https://beam.apache.org/documentation/programming-guide/#splittable-dofns" target="_blank" rel="noopener noreferrer"><em>Splittable DoFn (SDF)</em><i class="fas fa-external-link-square-alt ms-1"></i></a> is a generalization of a <em>DoFn</em> that enables Apache Beam developers to create modular and composable I/O components. Also, it can be applied in advanced non-I/O scenarios such as Monte Carlo simulation. In this post, we develop two Apache Beam pipelines. The first pipeline is an I/O connector, and it reads a list of files in a folder followed by processing each of the file objects in parallel. The second pipeline estimates the value of $\pi$ by performing Monte Carlo simulation.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-12-05-beam-examples-9/featured.png" length="309371" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 8 Enhance Sport Activity Tracker with Runner Motivation</title><link>https://jaehyeon.me/blog/2024-11-21-beam-examples-8/</link><pubDate>Thu, 21 Nov 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-11-21-beam-examples-8/</guid><description>&lt;p>In &lt;a href="/blog/2024-08-01-beam-examples-3">Part 3&lt;/a>, we developed a Beam pipeline that tracks sport activities of users and outputs their speeds periodically. While reporting such values is useful for users on its own, we can provide more engaging information to users if we have a pipeline that reports pacing of their activities over periods. For example, we can send a message to encourage a user to work harder if he/she has a performance goal and is underperforming for some periods. In this post, we develop a new pipeline that tracks user activities and reports pacing details by comparing short term metrics to their long term counterparts.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2024-11-21-beam-examples-8/featured.png" length="402888" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 7 Separate Droppable Data into Side Output</title><link>https://jaehyeon.me/blog/2024-10-24-beam-examples-7/</link><pubDate>Thu, 24 Oct 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-10-24-beam-examples-7/</guid><description><![CDATA[<p>We develop an Apache Beam pipeline that separates <em>droppable</em> elements from the rest of the data. <em>Droppable</em> elements are those that come later when the watermark passes the window max timestamp plus allowed lateness. Using a timer in a <em>Stateful</em> DoFn, <em>droppable</em> data is separated from normal data and dispatched into a side output rather than being discarded silently, which is the default behaviour. Note that this pipeline works in a situation where <em>droppable</em> elements do not appear often, and thus the chance that a <em>droppable</em> element is delivered as the first element in a particular window is low.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-10-24-beam-examples-7/featured.png" length="214574" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 6 Call RPC Service in Batch with Defined Batch Size using Stateful DoFn</title><link>https://jaehyeon.me/blog/2024-10-02-beam-examples-6/</link><pubDate>Wed, 02 Oct 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-10-02-beam-examples-6/</guid><description><![CDATA[<p>In the <a href="/blog/2024-09-25-beam-examples-5">previous post</a>, we continued discussing an Apache Beam pipeline that arguments input data by calling a <strong>Remote Procedure Call (RPC)</strong> service. A pipeline was developed that makes a single RPC call for a bundle of elements. The bundle size is determined by the runner, however, we may encounter an issue e.g. if an RPC service becomes quite slower if many elements are included in a single request. We can improve the pipeline using stateful <code>DoFn</code> where the number elements to process and maximum wait seconds can be controlled by <em>state</em> and <em>timers</em>. Note that, although the stateful <code>DoFn</code> used in this post solves the data augmentation task well, in practice, we should use the built-in transforms such as <a href="https://beam.apache.org/documentation/transforms/python/aggregation/batchelements/" target="_blank" rel="noopener noreferrer">BatchElements<i class="fas fa-external-link-square-alt ms-1"></i></a> and <a href="https://beam.apache.org/documentation/transforms/python/aggregation/groupintobatches/" target="_blank" rel="noopener noreferrer">GroupIntoBatches<i class="fas fa-external-link-square-alt ms-1"></i></a> whenever possible.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-10-02-beam-examples-6/featured.png" length="99452" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 5 Call RPC Service in Batch using Stateless DoFn</title><link>https://jaehyeon.me/blog/2024-09-18-beam-examples-5/</link><pubDate>Wed, 18 Sep 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-09-18-beam-examples-5/</guid><description><![CDATA[<p>In the <a href="/blog/2024-08-15-beam-examples-4">previous post</a>, we developed an Apache Beam pipeline where the input data is augmented by a <strong>Remote Procedure Call (RPC)</strong> service. Each input element performs an RPC call and the output is enriched by the response. This is not an efficient way of accessing an external service provided that the service can accept more than one element. In this post, we discuss how to enhance the pipeline so that a single RPC call is made for a bundle of elements, which can save a significant amount time compared to making a call for each element.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-09-18-beam-examples-5/featured.png" length="95285" type="image/png"/></item><item><title>Cache Data on Apache Beam Pipelines Using a Shared Object</title><link>https://jaehyeon.me/blog/2024-08-22-cache-using-shared-object/</link><pubDate>Thu, 22 Aug 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-08-22-cache-using-shared-object/</guid><description><![CDATA[<p>I recently contributed to Apache Beam by adding a common pipeline pattern - <a href="https://beam.apache.org/documentation/patterns/shared-class/" target="_blank" rel="noopener noreferrer"><em>Cache data using a shared object</em><i class="fas fa-external-link-square-alt ms-1"></i></a>. Both batch and streaming pipelines are introduced, and they utilise the <a href="https://beam.apache.org/releases/pydoc/current/_modules/apache_beam/utils/shared.html#Shared" target="_blank" rel="noopener noreferrer"><code>Shared</code> class<i class="fas fa-external-link-square-alt ms-1"></i></a> of the Python SDK to enrich <code>PCollection</code> elements. This pattern can be more memory-efficient than side inputs, simpler than a stateful <code>DoFn</code>, and more performant than calling an external service, because it does not have to access an external service for every element or bundle of elements. In this post, we discuss this pattern in more details with batch and streaming use cases. For the latter, we configure the cache gets refreshed periodically.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-08-22-cache-using-shared-object/featured.png" length="49574" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 4 Call RPC Service for Data Augmentation</title><link>https://jaehyeon.me/blog/2024-08-15-beam-examples-4/</link><pubDate>Thu, 15 Aug 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-08-15-beam-examples-4/</guid><description>&lt;p>In this post, we develop an Apache Beam pipeline where the input data is augmented by a &lt;strong>Remote Procedure Call (RPC)&lt;/strong> service. Each input element performs an RPC call and the output is enriched by the response. This is not an efficient way of accessing an external service provided that the service can accept more than one element. In the subsequent two posts, we will discuss updated pipelines that make RPC calls more efficiently. We begin with illustrating how to manage development resources followed by demonstrating the RPC service that we use in this series. Finally, we develop a Beam pipeline that accesses the external service to augment the input elements.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2024-08-15-beam-examples-4/featured.png" length="93408" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 3 Build Sport Activity Tracker with/without SQL</title><link>https://jaehyeon.me/blog/2024-08-01-beam-examples-3/</link><pubDate>Thu, 01 Aug 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-08-01-beam-examples-3/</guid><description><![CDATA[<p>In this post, we develop two Apache Beam pipelines that track sport activities of users and output their speed periodically. The first pipeline uses native transforms and <a href="https://beam.apache.org/documentation/dsls/sql/overview/" target="_blank" rel="noopener noreferrer">Beam SQL<i class="fas fa-external-link-square-alt ms-1"></i></a> is used for the latter. While <em>Beam SQL</em> can be useful in some situations, its features in the Python SDK are not complete compared to the Java SDK. Therefore, we are not able to build the required tracking pipeline using it. We end up discussing potential improvements of <em>Beam SQL</em> so that it can be used for building competitive applications with the Python SDK.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-08-01-beam-examples-3/featured.png" length="94507" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 2 Calculate Average Word Length with/without Fixed Look back</title><link>https://jaehyeon.me/blog/2024-07-18-beam-examples-2/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-07-18-beam-examples-2/</guid><description>&lt;p>In this post, we develop two Apache Beam pipelines that calculate average word lengths from input texts that are ingested by a Kafka topic. They obtain the statistics in different angles. The first pipeline emits the global average lengths whenever a new input text arrives while the latter triggers those values in a sliding time window.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2024-07-18-beam-examples-2/featured.png" length="96924" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 1 Calculate K Most Frequent Words and Max Word Length</title><link>https://jaehyeon.me/blog/2024-07-04-beam-examples-1/</link><pubDate>Thu, 04 Jul 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-07-04-beam-examples-1/</guid><description><![CDATA[<p>In this series, we develop <a href="https://beam.apache.org/" target="_blank" rel="noopener noreferrer">Apache Beam<i class="fas fa-external-link-square-alt ms-1"></i></a> Python pipelines. The majority of them are from <a href="https://www.packtpub.com/en-us/product/building-big-data-pipelines-with-apache-beam-9781800564930" target="_blank" rel="noopener noreferrer">Building Big Data Pipelines with Apache Beam by Jan Lukavský<i class="fas fa-external-link-square-alt ms-1"></i></a>. Mainly relying on the Java SDK, the book teaches fundamentals of Apache Beam using hands-on tasks, and we convert those tasks using the Python SDK. We focus on streaming pipelines, and they are deployed on a local (or embedded) <a href="https://flink.apache.org/" target="_blank" rel="noopener noreferrer">Apache Flink<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster using the <a href="https://beam.apache.org/documentation/runners/flink/" target="_blank" rel="noopener noreferrer">Apache Flink Runner<i class="fas fa-external-link-square-alt ms-1"></i></a>. Beginning with setting up the development environment, we build two pipelines that obtain top K most frequent words and the word that has the longest word length in this post.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-07-04-beam-examples-1/featured.png" length="96881" type="image/png"/></item><item><title>Deploy Python Stream Processing App on Kubernetes - Part 2 Beam Pipeline on Flink Runner</title><link>https://jaehyeon.me/blog/2024-06-06-beam-deploy-2/</link><pubDate>Thu, 06 Jun 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-06-06-beam-deploy-2/</guid><description><![CDATA[<p>In this post, we develop an <a href="https://beam.apache.org/" target="_blank" rel="noopener noreferrer">Apache Beam<i class="fas fa-external-link-square-alt ms-1"></i></a> pipeline using the <a href="https://beam.apache.org/documentation/sdks/python/" target="_blank" rel="noopener noreferrer">Python SDK<i class="fas fa-external-link-square-alt ms-1"></i></a> and deploy it on an <a href="https://flink.apache.org/" target="_blank" rel="noopener noreferrer">Apache Flink<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster via the <a href="https://beam.apache.org/documentation/runners/flink/" target="_blank" rel="noopener noreferrer">Apache Flink Runner<i class="fas fa-external-link-square-alt ms-1"></i></a>. Same as <a href="/blog/2024-05-30-beam-deploy-1">Part I</a>, we deploy a Kafka cluster using the <a href="https://strimzi.io/" target="_blank" rel="noopener noreferrer">Strimzi Operator<i class="fas fa-external-link-square-alt ms-1"></i></a> on a <a href="https://minikube.sigs.k8s.io/docs/" target="_blank" rel="noopener noreferrer">minikube<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster as the pipeline uses <a href="https://kafka.apache.org/" target="_blank" rel="noopener noreferrer">Apache Kafka<i class="fas fa-external-link-square-alt ms-1"></i></a> topics for its data source and sink. Then, we develop the pipeline as a Python package and add the package to a custom Docker image so that Python user code can be executed externally. For deployment, we create a Flink session cluster via the <a href="https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/" target="_blank" rel="noopener noreferrer">Flink Kubernetes Operator<i class="fas fa-external-link-square-alt ms-1"></i></a>, and deploy the pipeline using a Kubernetes job. Finally, we check the output of the application by sending messages to the input Kafka topic using a Python producer application.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-06-06-beam-deploy-2/featured.png" length="58020" type="image/png"/></item><item><title>Deploy Python Stream Processing App on Kubernetes - Part 1 PyFlink Application</title><link>https://jaehyeon.me/blog/2024-05-30-beam-deploy-1/</link><pubDate>Thu, 30 May 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-05-30-beam-deploy-1/</guid><description><![CDATA[<p><a href="https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/" target="_blank" rel="noopener noreferrer">Flink Kubernetes Operator<i class="fas fa-external-link-square-alt ms-1"></i></a> acts as a control plane to manage the complete deployment lifecycle of Apache Flink applications. With the operator, we can simplify deployment and management of Python stream processing applications. In this series, we discuss how to deploy a PyFlink application and Python Apache Beam pipeline on the <a href="https://beam.apache.org/documentation/runners/flink/" target="_blank" rel="noopener noreferrer">Flink Runner<i class="fas fa-external-link-square-alt ms-1"></i></a> on Kubernetes. In Part 1, we first deploy a Kafka cluster on a <a href="https://minikube.sigs.k8s.io/docs/" target="_blank" rel="noopener noreferrer">minikube<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster as the source and sink of the PyFlink application are Kafka topics. Then, the application source is packaged in a custom Docker image and deployed on the minikube cluster using the Flink Kubernetes Operator. Finally, the output of the application is checked by sending messages to the input Kafka topic using a Python producer application.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-05-30-beam-deploy-1/featured.png" length="64457" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 5 Testing Pipelines</title><link>https://jaehyeon.me/blog/2024-05-09-beam-local-dev-5/</link><pubDate>Thu, 09 May 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-05-09-beam-local-dev-5/</guid><description>We developed batch and streaming pipelines in Part 2 and Part 4. Often it is faster and simpler to identify and fix bugs on the pipeline code by performing local unit testing. Moreover, especially when it comes to creating a streaming pipeline, unit testing cases can facilitate development further by using TestStream as it allows us to advance watermarks or processing time according to different scenarios. In this post, we discuss how to perform unit testing of the batch and streaming pipelines that we developed earlier.</description><enclosure url="https://jaehyeon.me/blog/2024-05-09-beam-local-dev-5/featured.png" length="53603" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 4 Streaming Pipelines</title><link>https://jaehyeon.me/blog/2024-05-02-beam-local-dev-4/</link><pubDate>Thu, 02 May 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-05-02-beam-local-dev-4/</guid><description>In Part 3, we discussed the portability layer of Apache Beam as it helps understand (1) how Python pipelines run on the Flink Runner and (2) how multiple SDKs can be used in a single pipeline, followed by demonstrating local Flink and Kafka cluster creation for developing streaming pipelines. In this post, we build a streaming pipeline that aggregates page visits by user in a fixed time window of 20 seconds.</description><enclosure url="https://jaehyeon.me/blog/2024-05-02-beam-local-dev-4/featured.png" length="54556" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 3 Flink Runner</title><link>https://jaehyeon.me/blog/2024-04-18-beam-local-dev-3/</link><pubDate>Thu, 18 Apr 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-04-18-beam-local-dev-3/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In this series, we discuss local development of Apache Beam pipelines using Python. In the previous posts, we mainly talked about Batch pipelines with/without Beam SQL. Beam pipelines are portable between batch and streaming semantics, and we will discuss streaming pipeline development in this and the next posts.</description><enclosure url="https://jaehyeon.me/blog/2024-04-18-beam-local-dev-3/featured.png" length="262307" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 2 Batch Pipelines</title><link>https://jaehyeon.me/blog/2024-04-04-beam-local-dev-2/</link><pubDate>Thu, 04 Apr 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-04-04-beam-local-dev-2/</guid><description>In this series, we discuss local development of Apache Beam pipelines using Python. A basic Beam pipeline was introduced in Part 1, followed by demonstrating how to utilise Jupyter notebooks, Beam SQL and Beam DataFrames. In this post, we discuss Batch pipelines that aggregate website visit log by user and time. The pipelines are developed with and without Beam SQL. Additionally, each pipeline is implemented on a Jupyter notebook for demonstration.</description><enclosure url="https://jaehyeon.me/blog/2024-04-04-beam-local-dev-2/featured.png" length="55405" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 1 Pipeline, Notebook, SQL and DataFrame</title><link>https://jaehyeon.me/blog/2024-03-28-beam-local-dev-1/</link><pubDate>Thu, 28 Mar 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-03-28-beam-local-dev-1/</guid><description>Apache Beam and Apache Flink are open-source frameworks for parallel, distributed data processing at scale. Flink has DataStream and Table/SQL APIs and the former has more capacity to develop sophisticated data streaming applications. The DataStream API of PyFlink, Flink&amp;rsquo;s Python API, however, is not as complete as its Java counterpart, and it doesn&amp;rsquo;t provide enough capability to extend when there are missing features in Python. Recently I had a chance to look through Apache Beam and found it supports more possibility to extend and/or customise its features.</description><enclosure url="https://jaehyeon.me/blog/2024-03-28-beam-local-dev-1/featured.png" length="88260" type="image/png"/></item><item><title>Data Build Tool (dbt) Pizza Shop Demo - Part 6 ETL on Amazon Athena via Airflow</title><link>https://jaehyeon.me/blog/2024-03-14-dbt-pizza-shop-6/</link><pubDate>Thu, 14 Mar 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-03-14-dbt-pizza-shop-6/</guid><description>In Part 5, we developed a dbt project that that targets Apache Iceberg where transformations are performed on Amazon Athena. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. To improve query performance, the fact table is denormalized to pre-join records from the dimension tables using the array and struct data types.</description><enclosure url="https://jaehyeon.me/blog/2024-03-14-dbt-pizza-shop-6/featured.png" length="82921" type="image/png"/></item><item><title>Data Build Tool (dbt) Pizza Shop Demo - Part 5 Modelling on Amazon Athena</title><link>https://jaehyeon.me/blog/2024-03-07-dbt-pizza-shop-5/</link><pubDate>Thu, 07 Mar 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-03-07-dbt-pizza-shop-5/</guid><description>In Part 1 and Part 3, we developed data build tool (dbt) projects that target PostgreSQL and BigQuery using fictional pizza shop data. The data is modelled by SCD type 2 dimension tables and one transactional fact table. While the order records should be joined with dimension tables to get complete details for PostgreSQL, the fact table is denormalized using nested and repeated fields to improve query performance for BigQuery.</description><enclosure url="https://jaehyeon.me/blog/2024-03-07-dbt-pizza-shop-5/featured.png" length="61499" type="image/png"/></item><item><title>Data Build Tool (dbt) Pizza Shop Demo - Part 4 ETL on BigQuery via Airflow</title><link>https://jaehyeon.me/blog/2024-02-22-dbt-pizza-shop-4/</link><pubDate>Thu, 22 Feb 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-02-22-dbt-pizza-shop-4/</guid><description>In Part 3, we developed a dbt project that targets Google BigQuery with fictional pizza shop data. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. The fact table is denormalized using nested and repeated fields for improving query performance. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.</description><enclosure url="https://jaehyeon.me/blog/2024-02-22-dbt-pizza-shop-4/featured.png" length="89588" type="image/png"/></item><item><title>Data Build Tool (dbt) Pizza Shop Demo - Part 3 Modelling on BigQuery</title><link>https://jaehyeon.me/blog/2024-02-08-dbt-pizza-shop-3/</link><pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-02-08-dbt-pizza-shop-3/</guid><description>In this series, we discuss practical examples of data warehouse and lakehouse development where data transformation is performed by the data build tool (dbt) and ETL is managed by Apache Airflow. In Part 1, we developed a dbt project on PostgreSQL using fictional pizza shop data. At the end, the data sets are modelled by two SCD type 2 dimension tables and one transactional fact table. In this post, we create a new dbt project that targets Google BigQuery.</description><enclosure url="https://jaehyeon.me/blog/2024-02-08-dbt-pizza-shop-3/featured.png" length="70297" type="image/png"/></item><item><title>Data Build Tool (dbt) Pizza Shop Demo - Part 2 ETL on PostgreSQL via Airflow</title><link>https://jaehyeon.me/blog/2024-01-25-dbt-pizza-shop-2/</link><pubDate>Thu, 25 Jan 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-01-25-dbt-pizza-shop-2/</guid><description>In this series of posts, we discuss data warehouse/lakehouse examples using data build tool (dbt) including ETL orchestration with Apache Airflow. In Part 1, we developed a dbt project on PostgreSQL with fictional pizza shop data. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.</description><enclosure url="https://jaehyeon.me/blog/2024-01-25-dbt-pizza-shop-2/featured.png" length="77355" type="image/png"/></item><item><title>Data Build Tool (dbt) Pizza Shop Demo - Part 1 Modelling on PostgreSQL</title><link>https://jaehyeon.me/blog/2024-01-18-dbt-pizza-shop-1/</link><pubDate>Thu, 18 Jan 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-01-18-dbt-pizza-shop-1/</guid><description>The data build tool (dbt) is a popular data transformation tool for data warehouse development. Moreover, it can be used for data lakehouse development thanks to open table formats such as Apache Iceberg, Apache Hudi and Delta Lake. dbt supports key AWS analytics services and I wrote a series of posts that discuss how to utilise dbt with Redshift, Glue, EMR on EC2, EMR on EKS, and Athena. Those posts focus on platform integration, however, they do not show realistic ETL scenarios.</description><enclosure url="https://jaehyeon.me/blog/2024-01-18-dbt-pizza-shop-1/featured.png" length="85093" type="image/png"/></item><item><title>Kafka Development on Kubernetes - Part 3 Kafka Connect</title><link>https://jaehyeon.me/blog/2024-01-11-kafka-development-on-k8s-part-3/</link><pubDate>Thu, 11 Jan 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-01-11-kafka-development-on-k8s-part-3/</guid><description>Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. In this post, we discuss how to set up a data ingestion pipeline using Kafka connectors. Fake customer and order data is ingested into Kafka topics using the MSK Data Generator. Also, we use the Confluent S3 sink connector to save the messages of the topics into a S3 bucket.</description><enclosure url="https://jaehyeon.me/blog/2024-01-11-kafka-development-on-k8s-part-3/featured.png" length="97270" type="image/png"/></item><item><title>Kafka Development on Kubernetes - Part 2 Producer and Consumer</title><link>https://jaehyeon.me/blog/2024-01-04-kafka-development-on-k8s-part-2/</link><pubDate>Thu, 04 Jan 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-01-04-kafka-development-on-k8s-part-2/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Kafka has five core APIs, and we can develop applications to send/read streams of data to/from topics in a Kafka cluster using the producer and consumer APIs. While the main Kafka project maintains only the Java APIs, there are several open source projects that provide the Kafka client APIs in Python.</description><enclosure url="https://jaehyeon.me/blog/2024-01-04-kafka-development-on-k8s-part-2/featured.png" length="75889" type="image/png"/></item><item><title>Kafka Development on Kubernetes - Part 1 Cluster Setup</title><link>https://jaehyeon.me/blog/2023-12-21-kafka-development-on-k8s-part-1/</link><pubDate>Thu, 21 Dec 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-12-21-kafka-development-on-k8s-part-1/</guid><description>Apache Kafka is one of the key technologies for implementing data streaming architectures. Strimzi provides a way to run an Apache Kafka cluster and related resources on Kubernetes in various deployment configurations. In this series of posts, we will discuss how to create a Kafka cluster, to develop Kafka client applications in Python and to build a data pipeline using Kafka connectors on Kubernetes.
Part 1 Cluster Setup (this post) Part 2 Producer and Consumer Part 3 Kafka Connect Setup Kafka Cluster The Kafka cluster is deployed using the Strimzi Operator on a Minikube cluster.</description><enclosure url="https://jaehyeon.me/blog/2023-12-21-kafka-development-on-k8s-part-1/featured.png" length="108975" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 6 Consume data from Kafka using Lambda</title><link>https://jaehyeon.me/blog/2023-12-14-real-time-streaming-with-kafka-and-flink-7/</link><pubDate>Thu, 14 Dec 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-12-14-real-time-streaming-with-kafka-and-flink-7/</guid><description>Amazon MSK can be configured as an event source of a Lambda function. Lambda internally polls for new messages from the event source and then synchronously invokes the target Lambda function. With this feature, we can develop a Kafka consumer application in serverless environment where developers can focus on application logic. In this lab, we will discuss how to create a Kafka consumer using a Lambda function.
Introduction Lab 1 Produce data to Kafka using Lambda Lab 2 Write data to Kafka from S3 using Flink Lab 3 Transform and write data to S3 from Kafka using Flink Lab 4 Clean, Aggregate, and Enrich Events with Flink Lab 5 Write data to DynamoDB using Kafka Connect Lab 6 Consume data from Kafka using Lambda (this post) Architecture Fake taxi ride data is sent to a Kafka topic by the Kafka producer application that is discussed in Lab 1.</description><enclosure url="https://jaehyeon.me/blog/2023-12-14-real-time-streaming-with-kafka-and-flink-7/featured.png" length="138986" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 4 Clean, Aggregate, and Enrich Events with Flink</title><link>https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/</link><pubDate>Thu, 23 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/</guid><description>The value of data can be maximised when it is used without delay. With Apache Flink, we can build streaming analytics applications that incorporate the latest events with low latency. In this lab, we will create a Pyflink application that writes accumulated taxi rides data into an OpenSearch cluster. It aggregates the number of trips/passengers and trip durations by vendor ID for a window of 5 seconds. The data is then used to create a chart that monitors the status of taxi rides in the OpenSearch Dashboard.</description><enclosure url="https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/featured.png" length="112340" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 3 Transform and write data to S3 from Kafka using Flink</title><link>https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/</link><pubDate>Thu, 16 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In this lab, we will create a Pyflink application that exports Kafka topic messages into a S3 bucket. The app enriches the records by adding a new column using a user defined function and writes them via the FileSystem SQL connector.</description><enclosure url="https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/featured.png" length="160359" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 1 Produce data to Kafka using Lambda</title><link>https://jaehyeon.me/blog/2023-10-26-real-time-streaming-with-kafka-and-flink-2/</link><pubDate>Thu, 26 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-26-real-time-streaming-with-kafka-and-flink-2/</guid><description>In this lab, we will create a Kafka producer application using AWS Lambda, which sends fake taxi ride data into a Kafka topic on Amazon MSK. A configurable number of the producer Lambda function will be invoked by an Amazon EventBridge schedule rule. In this way we are able to generate test data concurrently based on the desired volume of messages.
Introduction Lab 1 Produce data to Kafka using Lambda (this post) Lab 2 Write data to Kafka from S3 using Flink Lab 3 Transform and write data to S3 from Kafka using Flink Lab 4 Clean, Aggregate, and Enrich Events with Flink Lab 5 Write data to DynamoDB using Kafka Connect Lab 6 Consume data from Kafka using Lambda [Update 2023-11-06] Initially I planned to deploy Pyflink applications on Amazon Managed Service for Apache Flink, but I changed the plan to use a local Flink cluster deployed on Docker.</description><enclosure url="https://jaehyeon.me/blog/2023-10-26-real-time-streaming-with-kafka-and-flink-2/featured.png" length="138560" type="image/png"/></item><item><title>Building Apache Flink Applications in Python</title><link>https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/</link><pubDate>Thu, 19 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/</guid><description>Building Apache Flink Applications in Java is a course to introduce Apache Flink through a series of hands-on exercises, and it is provided by Confluent. Utilising the Flink DataStream API, the course develops three Flink applications that populate multiple source data sets, collect them into a standardised data set, and aggregate it to produce usage statistics. As part of learning the Flink DataStream API in Pyflink, I converted the Java apps into Python equivalent while performing the course exercises in Pyflink.</description><enclosure url="https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/featured.png" length="154736" type="image/png"/></item><item><title>Kafka, Flink and DynamoDB for Real Time Fraud Detection - Part 2 Deployment via AWS Managed Flink</title><link>https://jaehyeon.me/blog/2023-09-14-fraud-detection-part-2/</link><pubDate>Thu, 14 Sep 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-09-14-fraud-detection-part-2/</guid><description>This series aims to help those who are new to Apache Flink and Amazon Managed Service for Apache Flink by re-implementing a simple fraud detection application that is discussed in an AWS workshop titled AWS Kafka and DynamoDB for real time fraud detection. In part 1, I demonstrated how to develop the application locally, and the app will be deployed via Amazon Managed Service for Apache Flink in this post.</description><enclosure url="https://jaehyeon.me/blog/2023-09-14-fraud-detection-part-2/featured.png" length="66221" type="image/png"/></item><item><title>Getting Started with Pyflink on AWS - Part 3 AWS Managed Flink and MSK</title><link>https://jaehyeon.me/blog/2023-09-04-getting-started-with-pyflink-on-aws-part-3/</link><pubDate>Mon, 04 Sep 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-09-04-getting-started-with-pyflink-on-aws-part-3/</guid><description>In this series of posts, we discuss a Flink (Pyflink) application that reads/writes from/to Kafka topics. In the previous posts, I demonstrated a Pyflink app that targets a local Kafka cluster as well as a Kafka cluster on Amazon MSK. The app was executed in a virtual environment as well as in a local Flink cluster for improved monitoring. In this post, the app will be deployed via Amazon Managed Service for Apache Flink, which is the easiest option to run Flink applications on AWS.</description><enclosure url="https://jaehyeon.me/blog/2023-09-04-getting-started-with-pyflink-on-aws-part-3/featured.png" length="74618" type="image/png"/></item><item><title>Getting Started with Pyflink on AWS - Part 2 Local Flink and MSK</title><link>https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/</link><pubDate>Mon, 28 Aug 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/</guid><description>In this series of posts, we discuss a Flink (Pyflink) application that reads/writes from/to Kafka topics. In part 1, an app that targets a local Kafka cluster was created. In this post, we will update the app by connecting a Kafka cluster on Amazon MSK. The Kafka cluster is authenticated by IAM and the app has additional jar dependency. As Amazon Managed Service for Apache Flink does not allow you to specify multiple pipeline jar files, we have to build a custom Uber Jar that combines multiple jar files.</description><enclosure url="https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/featured.png" length="64005" type="image/png"/></item><item><title>Getting Started with Pyflink on AWS - Part 1 Local Flink and Local Kafka</title><link>https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/</link><pubDate>Thu, 17 Aug 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Flink is an open-source, unified stream-processing and batch-processing framework. Its core is a distributed streaming data-flow engine that you can use to run real-time stream processing on high-throughput data sources. Currently, it is widely used to build applications for fraud/anomaly detection, rule-based alerting, business process monitoring, and continuous ETL to name a few.</description><enclosure url="https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/featured.png" length="55960" type="image/png"/></item><item><title>Kafka, Flink and DynamoDB for Real Time Fraud Detection - Part 1 Local Development</title><link>https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/</link><pubDate>Thu, 10 Aug 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Flink is an open-source, unified stream-processing and batch-processing framework. Its core is a distributed streaming data-flow engine that you can use to run real-time stream processing on high-throughput data sources. Currently, it is widely used to build applications for fraud/anomaly detection, rule-based alerting, business process monitoring, and continuous ETL to name a few.</description><enclosure url="https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/featured.png" length="72929" type="image/png"/></item><item><title>Kafka Development with Docker - Part 11 Kafka Authorization</title><link>https://jaehyeon.me/blog/2023-07-20-kafka-development-with-docker-part-11/</link><pubDate>Thu, 20 Jul 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-07-20-kafka-development-with-docker-part-11/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In the previous posts, we discussed how to implement client authentication by TLS (SSL or TLS/SSL) and SASL authentication. One of the key benefits of client authentication is achieving user access control. Kafka ships with a pluggable, out-of-the box authorization framework, which is configured with the authorizer.</description><enclosure url="https://jaehyeon.me/blog/2023-07-20-kafka-development-with-docker-part-11/featured.png" length="458848" type="image/png"/></item><item><title>Kafka Development with Docker - Part 10 SASL Authentication</title><link>https://jaehyeon.me/blog/2023-07-13-kafka-development-with-docker-part-10/</link><pubDate>Thu, 13 Jul 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-07-13-kafka-development-with-docker-part-10/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In the previous post, we discussed TLS (SSL or TLS/SSL) authentication to improve security. It enforces two-way verification where a client certificate is verified by Kafka brokers. Client authentication can also be enabled by Simple Authentication and Security Layer (SASL), and we will discuss how to implement SASL authentication with Java and Python client examples in this post.</description><enclosure url="https://jaehyeon.me/blog/2023-07-13-kafka-development-with-docker-part-10/featured.png" length="471947" type="image/png"/></item><item><title>Kafka Development with Docker - Part 9 SSL Authentication</title><link>https://jaehyeon.me/blog/2023-07-06-kafka-development-with-docker-part-9/</link><pubDate>Thu, 06 Jul 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-07-06-kafka-development-with-docker-part-9/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In the previous post, we discussed how to configure TLS (SSL or TLS/SSL) encryption with Java and Python client examples. SSL encryption is a one-way verification process where a server certificate is verified by a client via SSL Handshake.</description><enclosure url="https://jaehyeon.me/blog/2023-07-06-kafka-development-with-docker-part-9/featured.png" length="471471" type="image/png"/></item><item><title>Kafka Development with Docker - Part 8 SSL Encryption</title><link>https://jaehyeon.me/blog/2023-06-29-kafka-development-with-docker-part-8/</link><pubDate>Thu, 29 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-29-kafka-development-with-docker-part-8/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 By default, Apache Kafka communicates in PLAINTEXT, which means that all data is sent without being encrypted. To secure communication, we can configure Kafka clients and other components to use Transport Layer Security (TLS) encryption.</description><enclosure url="https://jaehyeon.me/blog/2023-06-29-kafka-development-with-docker-part-8/featured.png" length="469311" type="image/png"/></item><item><title>Kafka Development with Docker - Part 7 Producer and Consumer with Glue Schema Registry</title><link>https://jaehyeon.me/blog/2023-06-22-kafka-development-with-docker-part-7/</link><pubDate>Thu, 22 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-22-kafka-development-with-docker-part-7/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In Part 4, we developed Kafka producer and consumer applications using the kafka-python package. The Kafka messages are serialized as Json, but are not associated with a schema as there was not an integrated schema registry.</description><enclosure url="https://jaehyeon.me/blog/2023-06-22-kafka-development-with-docker-part-7/featured.png" length="57175" type="image/png"/></item><item><title>Kafka Development with Docker - Part 4 Producer and Consumer</title><link>https://jaehyeon.me/blog/2023-06-01-kafka-development-with-docker-part-4/</link><pubDate>Thu, 01 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-01-kafka-development-with-docker-part-4/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In the previous post, we discussed Kafka Connect to stream data to/from a Kafka cluster. Kafka also includes the Producer/Consumer APIs that allow client applications to send/read streams of data to/from topics in a Kafka cluster.</description><enclosure url="https://jaehyeon.me/blog/2023-06-01-kafka-development-with-docker-part-4/featured.png" length="75255" type="image/png"/></item><item><title>Integrate Glue Schema Registry with Your Python Kafka App</title><link>https://jaehyeon.me/blog/2023-04-12-integrate-glue-schema-registry/</link><pubDate>Wed, 12 Apr 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-04-12-integrate-glue-schema-registry/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 As Kafka producer and consumer apps are decoupled, they operate on Kafka topics rather than communicating with each other directly. As described in the Confluent document, Schema Registry provides a centralized repository for managing and validating schemas for topic message data, and for serialization and deserialization of the data over the network.</description><enclosure url="https://jaehyeon.me/blog/2023-04-12-integrate-glue-schema-registry/featured.png" length="46040" type="image/png"/></item><item><title>Simplify Streaming Ingestion on AWS – Part 2 MSK and Athena</title><link>https://jaehyeon.me/blog/2023-03-14-simplify-streaming-ingestion-athena/</link><pubDate>Tue, 14 Mar 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-03-14-simplify-streaming-ingestion-athena/</guid><description>In Part 1, we discussed a streaming ingestion solution using EventBridge, Lambda, MSK and Redshift Serverless. Athena provides the MSK connector to enable SQL queries on Apache Kafka topics directly, and it can also facilitate the extraction of insights without setting up an additional pipeline to store data into S3. In this post, we discuss how to update the streaming ingestion solution so that data in the Kafka topic can be queried by Athena instead of Redshift.</description><enclosure url="https://jaehyeon.me/blog/2023-03-14-simplify-streaming-ingestion-athena/featured.png" length="43403" type="image/png"/></item><item><title>Simplify Streaming Ingestion on AWS – Part 1 MSK and Redshift</title><link>https://jaehyeon.me/blog/2023-02-08-simplify-streaming-ingestion-redshift/</link><pubDate>Wed, 08 Feb 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-02-08-simplify-streaming-ingestion-redshift/</guid><description>Apache Kafka is a popular distributed event store and stream processing platform. Previously loading data from Kafka into Redshift and Athena usually required Kafka connectors (e.g. Amazon Redshift Sink Connector and Amazon S3 Sink Connector). Recently these AWS services provide features to ingest data from Kafka directly, which facilitates a simpler architecture that achieves low-latency and high-speed ingestion of streaming data. In part 1 of the simplify streaming ingestion on AWS series, we discuss how to develop an end-to-end streaming ingestion solution using EventBridge, Lambda, MSK and Redshift Serverless on AWS.</description><enclosure url="https://jaehyeon.me/blog/2023-02-08-simplify-streaming-ingestion-redshift/featured.png" length="32864" type="image/png"/></item><item><title>How to configure Kafka consumers to seek offsets by timestamp</title><link>https://jaehyeon.me/blog/2023-01-10-kafka-consumer-seek-offsets/</link><pubDate>Tue, 10 Jan 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-01-10-kafka-consumer-seek-offsets/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Normally we consume Kafka messages from the beginning/end of a topic or last committed offsets. For backfilling or troubleshooting, however, we need to consume messages from a certain timestamp occasionally. If we know which topic partition to choose e.</description><enclosure url="https://jaehyeon.me/blog/2023-01-10-kafka-consumer-seek-offsets/featured.png" length="47217" type="image/png"/></item><item><title>Revisit AWS Lambda Invoke Function Operator of Apache Airflow</title><link>https://jaehyeon.me/blog/2022-08-06-revisit-lambda-operator/</link><pubDate>Sat, 06 Aug 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-08-06-revisit-lambda-operator/</guid><description>Apache Airflow is a popular workflow management platform. A wide range of AWS services are integrated with the platform by Amazon AWS Operators. AWS Lambda is one of the integrated services, and it can be used to develop workflows efficiently. The current Lambda Operator, however, just invokes a Lambda function, and it can fail to report the invocation result of a function correctly and to record the exact error message from failure.</description><enclosure url="https://jaehyeon.me/blog/2022-08-06-revisit-lambda-operator/featured.png" length="24814" type="image/png"/></item><item><title>Serverless Application Model (SAM) for Data Professionals</title><link>https://jaehyeon.me/blog/2022-07-18-sam-for-data-professionals/</link><pubDate>Mon, 18 Jul 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-07-18-sam-for-data-professionals/</guid><description>AWS Lambda provides serverless computing capabilities, and it can be used for performing validation or light processing/transformation of data. Moreover, with its integration with more than 140 AWS services, it facilitates building complex systems employing event-driven architectures. There are many ways to build serverless applications and one of the most efficient ways is using specialised frameworks such as the AWS Serverless Application Model (SAM) and Serverless Framework. In this post, I’ll demonstrate how to build a serverless data processing application using SAM.</description><enclosure url="https://jaehyeon.me/blog/2022-07-18-sam-for-data-professionals/featured.png" length="22838" type="image/png"/></item><item><title>Data Warehousing ETL Demo with Apache Iceberg on EMR Local Environment</title><link>https://jaehyeon.me/blog/2022-06-26-iceberg-etl-demo/</link><pubDate>Sun, 26 Jun 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-06-26-iceberg-etl-demo/</guid><description>Unlike traditional Data Lake, new table formats (Iceberg, Hudi and Delta Lake) support features that can be used to apply data warehousing patterns, which can bring a way to be rescued from Data Swamp. In this post, we&amp;rsquo;ll discuss how to implement ETL using retail analytics data. It has two dimension data (user and product) and a single fact data (order). The dimension data sets have different ETL strategies depending on whether to track historical changes.</description><enclosure url="https://jaehyeon.me/blog/2022-06-26-iceberg-etl-demo/featured.png" length="43604" type="image/png"/></item><item><title>Local Development of AWS Glue 3.0 and Later</title><link>https://jaehyeon.me/blog/2021-11-14-glue-3-local-development/</link><pubDate>Sun, 14 Nov 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-11-14-glue-3-local-development/</guid><description>In an earlier post, I demonstrated how to set up a local development environment for AWS Glue 1.0 and 2.0 using a docker image that is published by the AWS Glue team and the Visual Studio Code Remote – Containers extension. Recently AWS Glue 3.0 was released, but a docker image for this version is not published. In this post, I&amp;rsquo;ll illustrate how to create a development environment for AWS Glue 3.</description><enclosure url="https://jaehyeon.me/blog/2021-11-14-glue-3-local-development/featured.png" length="30923" type="image/png"/></item><item><title>AWS Glue Local Development with Docker and Visual Studio Code</title><link>https://jaehyeon.me/blog/2021-08-20-glue-local-development/</link><pubDate>Fri, 20 Aug 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-08-20-glue-local-development/</guid><description>As described in the product page, AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. For development, a development endpoint is recommended, but it can be costly, inconvenient or unavailable (for Glue 2.0). The AWS Glue team published a Docker image that includes the AWS Glue binaries and all the dependencies packaged together. After inspecting it, I find some modifications are necessary in order to build a development environment on it.</description><enclosure url="https://jaehyeon.me/blog/2021-08-20-glue-local-development/featured.png" length="19535" type="image/png"/></item><item><title>Thoughts on Apache Airflow AWS Lambda Operator</title><link>https://jaehyeon.me/blog/2020-04-13-airflow-lambda-operator/</link><pubDate>Mon, 13 Apr 2020 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2020-04-13-airflow-lambda-operator/</guid><description>Apache Airflow is a popular open-source workflow management platform. Typically tasks run remotely by Celery workers for scalability. In AWS, however, scalability can also be achieved using serverless computing services in a simpler way. For example, the ECS Operator allows to run dockerized tasks and, with the Fargate launch type, they can run in a serverless environment.
The ECS Operator alone is not sufficient because it can take up to several minutes to pull a Docker image and to set up network interface (for the case of Fargate launch type).</description><enclosure url="https://jaehyeon.me/blog/2020-04-13-airflow-lambda-operator/featured.png" length="44994" type="image/png"/></item><item><title>Dynamic Routing and Centralized Auth with Traefik, Python and R Example</title><link>https://jaehyeon.me/blog/2019-11-29-traefik-example/</link><pubDate>Fri, 29 Nov 2019 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2019-11-29-traefik-example/</guid><description>Ingress in Kubernetes exposes HTTP and HTTPS routes from outside the cluster to services within the cluster. By setting rules, it routes requests to appropriate services (precisely requests are sent to individual Pods by Ingress Controller). Rules can be set up dynamically and I find it&amp;rsquo;s more efficient compared to traditional reverse proxy.
Traefik is a modern HTTP reverse proxy and load balancer and it can be used as a Kubernetes Ingress Controller.</description><enclosure url="https://jaehyeon.me/blog/2019-11-29-traefik-example/featured.png" length="139790" type="image/png"/></item><item><title>Distributed Task Queue with Python and R Example</title><link>https://jaehyeon.me/blog/2019-11-15-task-queue/</link><pubDate>Fri, 15 Nov 2019 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2019-11-15-task-queue/</guid><description>While I&amp;rsquo;m looking into Apache Airflow, a workflow management tool, I thought it would be beneficial to get some understanding of how Celery works. To do so, I built a simple web service that sends tasks to Celery workers and collects the results from them. FastAPI is used for developing the web service and Redis is used for the message broker and result backend. During the development, I thought it would be possible to implement similar functionality in R with Rserve.</description><enclosure url="https://jaehyeon.me/blog/2019-11-15-task-queue/featured.png" length="51615" type="image/png"/></item><item><title>Linux Dev Environment on Windows</title><link>https://jaehyeon.me/blog/2019-11-01-linux-on-windows/</link><pubDate>Fri, 01 Nov 2019 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2019-11-01-linux-on-windows/</guid><description>I use Linux containers a lot for development. Having Windows computers at home and work, I used to use Linux VMs on VirtualBox or VMWare Workstation. It&amp;rsquo;s not a bad option but it requires a lot of resources. Recently, after my home computer was updated, I was not able to start my hypervisor anymore. Also I didn&amp;rsquo;t like huge resource consumption of it so that I began to look for a different development environment.</description><enclosure url="https://jaehyeon.me/blog/2019-11-01-linux-on-windows/featured.png" length="187978" type="image/png"/></item><item><title>AWS Local Development with LocalStack</title><link>https://jaehyeon.me/blog/2019-07-20-aws-localstack/</link><pubDate>Sat, 20 Jul 2019 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2019-07-20-aws-localstack/</guid><description>LocalStack provides an easy-to-use test/mocking framework for developing AWS applications. In this post, I&amp;rsquo;ll demonstrate how to utilize LocalStack for development using a web service.
Specifically a simple web service built with Flask-RestPlus is used. It supports simple CRUD operations against a database table. It is set that SQS and Lambda are used for creating and updating a record. When a POST or PUT request is made, the service sends a message to a SQS queue and directly returns 204 reponse.</description><enclosure url="https://jaehyeon.me/blog/2019-07-20-aws-localstack/featured.png" length="164886" type="image/png"/></item><item><title>Serverless Data Product POC Backend Part IV - Serving R ML Model via S3</title><link>https://jaehyeon.me/blog/2017-04-17-serverless-data-product-4/</link><pubDate>Mon, 17 Apr 2017 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2017-04-17-serverless-data-product-4/</guid><description>In the previous posts, it is discussed how to package/deploy an R model with AWS Lambda and to expose the Lambda function via Amazon API Gateway. Main benefits of serverless architecture is cost-effectiveness and being hassle-free from provisioning/managing servers. While the API returns a predicted admission status value given GRE, GPA and Rank, there is an issue if it is served within a web application: Cross-Origin Resource Sharing (CORS). This post discusses how to resolve this issue by updating API configuration and the Lambda function handler with a simple web application.</description><enclosure url="https://jaehyeon.me/blog/2017-04-17-serverless-data-product-4/featured.png" length="225463" type="image/png"/></item><item><title>Serverless Data Product POC Backend Part III - Exposing R ML Model via APIG</title><link>https://jaehyeon.me/blog/2017-04-13-serverless-data-product-3/</link><pubDate>Thu, 13 Apr 2017 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2017-04-13-serverless-data-product-3/</guid><description>In Part I of this series, R and necessary libraries/packages together with a Lambda function handler are packaged and saved to Amazon S3. Then, in Part II, the package is deployed at AWS Lambda after creating and assigning a role to the Lambda function. Although the Lambda function can be called via the Invoke API, it&amp;rsquo;ll be much more useful if the function can be called as a web service (or API).</description><enclosure url="https://jaehyeon.me/blog/2017-04-13-serverless-data-product-3/featured.png" length="173293" type="image/png"/></item><item><title>Serverless Data Product POC Backend Part II - Deploying R ML Model via Lambda</title><link>https://jaehyeon.me/blog/2017-04-11-serverless-data-product-2/</link><pubDate>Tue, 11 Apr 2017 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2017-04-11-serverless-data-product-2/</guid><description>In the previous post, serverless event-driven application development is introduced. Also how to package R, necessary libraries/packages and a Lambda function handler is discussed. No need of provisioning/managing servers is one of the key benefits of the architecture. It is also a cost-effective way of delivering a data product as functions are executed on-demand rather than in servers that run 24/7. Furthermore AWS Lambda free tier includes 1M free requests per month and 400,000 GB-seconds of compute time per month, which is available to both existing and new AWS customers indefinitely.</description><enclosure url="https://jaehyeon.me/blog/2017-04-11-serverless-data-product-2/featured.png" length="139725" type="image/png"/></item><item><title>Serverless Data Product POC Backend Part I - Packaging R ML Model for Lambda</title><link>https://jaehyeon.me/blog/2017-04-08-serverless-data-product-1/</link><pubDate>Sat, 08 Apr 2017 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2017-04-08-serverless-data-product-1/</guid><description><![CDATA[Let say you&rsquo;ve got a prediction model built in R and you&rsquo;d like to productionize it, for example, by serving it in a web application. One way is exposing the model through an API that returns the predicted result as a web service. However there are many issues. Firstly R is not a language for API development although there may be some ways - eg the plumber package. More importantly developing an API is not the end of the story as the API can&rsquo;t be served in a production system if it is not deployed/managed/upgraded/patched/&hellip; appropriately in a server or if it is not scalable, protected via authentication/authorization and so on.]]></description><enclosure url="https://jaehyeon.me/blog/2017-04-08-serverless-data-product-1/featured.png" length="139725" type="image/png"/></item><item><title>Quick Test to Wrap Python in R</title><link>https://jaehyeon.me/blog/2015-11-21-quick-test-to-wrap-python-in-r/</link><pubDate>Sat, 21 Nov 2015 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2015-11-21-quick-test-to-wrap-python-in-r/</guid><description>As mentioned in an earlier post, things that are not easy in R can be relatively simple in other languages. Another example would be connecting to Amazon Web Services. In relation to s3, although there are a number of existing packages, many of them seem to be deprecated, premature or platform-dependent. (I consider the cloudyr project looks promising though.)
If there isn&amp;rsquo;t a comprehensive R-way of doing something yet, it may be necessary to create it from scratch.</description></item><item><title>Some Thoughts on Python for R Users</title><link>https://jaehyeon.me/blog/2015-08-09-some-thoughts-on-python-for-r-users/</link><pubDate>Sun, 09 Aug 2015 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2015-08-09-some-thoughts-on-python-for-r-users/</guid><description>There seem to be growing interest in Python in the R cummunity. While there can be a range of opinions about using R over Python (or vice versa) for exploratory data analysis, fitting statistical/machine learning algorithms and so on, I consider one of the strongest attractions of using Python comes from the fact that Python is a general purpose programming language. As more developers are involved in, it can provide a way to get jobs done easily, which can be tricky in R.</description></item><item><title>Some Thoughts on Python</title><link>https://jaehyeon.me/blog/2015-08-08-some-thoughts-on-python/</link><pubDate>Sat, 08 Aug 2015 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2015-08-08-some-thoughts-on-python/</guid><description>When I bagan to teach myself C# 3 years ago, I only had some experience in interactive analysis tools such as MATLAB and R - I didn&amp;rsquo;t consider R as a programming language at that time. The general purpose programming language shares some common features (data type, loop, if&amp;hellip;) but it is rather different in the way how code is written/organized, which is object oriented. Therefore, while it was not a problem to grap the common features, it took quite some time to understand and keep my code in an object oriented way.</description></item></channel></rss>