Apache Flink

Meet the Streamhouse Trio - Paimon, Fluss, and Iceberg for Unified Data Architectures

May 6, 20256 min read Data Architecture Data Streaming Apache Flink Apache Iceberg Apache Paimon Fluss

The world of data is converging. The traditional divide between batch processing for historical analytics and stream processing for real-time insights is becoming increasingly blurry. Businesses demand architectures that handle both seamlessly. Enter the “Streamhouse” - an evolution of the Lakehouse concept, designed with streaming as a first-class citizen.

Today, we’ll introduce three key open-source technologies shaping this space: Apache Paimon™, Fluss, and Apache Iceberg. While each has unique strengths, their true power lies in how they can be integrated to build robust, flexible, and performant data platforms.

April 15, 20255 min read Data Streaming Apache Flink Docker Docker Compose Flink SQL Flink SQL Client

The Flink SQL Cookbook by Ververica is a hands-on, example-rich guide to mastering Apache Flink SQL for real-time stream processing. It offers a wide range of self-contained recipes, from basic queries and table operations to more advanced use cases like windowed aggregations, complex joins, user-defined functions (UDFs), and pattern detection. These examples are designed to be run on the Ververica Platform, and as such, the cookbook doesn’t include instructions for setting up a Flink cluster.

To help you run these recipes locally and explore Flink SQL without external dependencies, this post walks through setting up a fully functional local Flink cluster using Docker Compose. With this setup, you can experiment with the cookbook examples right on your machine.

December 19, 202412 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Python Splittable DoFn

In Part 9, we developed two Apache Beam pipelines using Splittable DoFn (SDF). One of them is a batch file reader, which reads a list of files in an input folder followed by processing them in parallel. We can extend the I/O connector so that, instead of listing files once at the beginning, it scans an input folder periodically for new files and processes whenever new files are created in the folder. The techniques used in this post can be quite useful as they can be applied to developing I/O connectors that target other unbounded (or streaming) data sources (eg Kafka) using the Python SDK.

December 5, 202410 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Python Splittable DoFn

A Splittable DoFn (SDF) is a generalization of a DoFn that enables Apache Beam developers to create modular and composable I/O components. Also, it can be applied in advanced non-I/O scenarios such as Monte Carlo simulation. In this post, we develop two Apache Beam pipelines. The first pipeline is an I/O connector, and it reads a list of files in a folder followed by processing each of the file objects in parallel. The second pipeline estimates the value of $\pi$ by performing Monte Carlo simulation.

November 21, 202418 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka Python

In Part 3, we developed a Beam pipeline that tracks sport activities of users and outputs their speeds periodically. While reporting such values is useful for users on its own, we can provide more engaging information to users if we have a pipeline that reports pacing of their activities over periods. For example, we can send a message to encourage a user to work harder if he/she has a performance goal and is underperforming for some periods. In this post, we develop a new pipeline that tracks user activities and reports pacing details by comparing short term metrics to their long term counterparts.

October 24, 202418 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka Python

We develop an Apache Beam pipeline that separates droppable elements from the rest of the data. Droppable elements are those that come later when the watermark passes the window max timestamp plus allowed lateness. Using a timer in a Stateful DoFn, droppable data is separated from normal data and dispatched into a side output rather than being discarded silently, which is the default behaviour. Note that this pipeline works in a situation where droppable elements do not appear often, and thus the chance that a droppable element is delivered as the first element in a particular window is low.

October 2, 202414 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka GRPC Python

In the previous post, we continued discussing an Apache Beam pipeline that arguments input data by calling a Remote Procedure Call (RPC) service. A pipeline was developed that makes a single RPC call for a bundle of elements. The bundle size is determined by the runner, however, we may encounter an issue e.g. if an RPC service becomes quite slower if many elements are included in a single request. We can improve the pipeline using stateful DoFn where the number elements to process and maximum wait seconds can be controlled by state and timers. Note that, although the stateful DoFn used in this post solves the data augmentation task well, in practice, we should use the built-in transforms such as BatchElements and GroupIntoBatches whenever possible.

September 18, 202411 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka GRPC Python

In the previous post, we developed an Apache Beam pipeline where the input data is augmented by a Remote Procedure Call (RPC) service. Each input element performs an RPC call and the output is enriched by the response. This is not an efficient way of accessing an external service provided that the service can accept more than one element. In this post, we discuss how to enhance the pipeline so that a single RPC call is made for a bundle of elements, which can save a significant amount time compared to making a call for each element.

August 15, 202413 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka GRPC Python

In this post, we develop an Apache Beam pipeline where the input data is augmented by a Remote Procedure Call (RPC) service. Each input element performs an RPC call and the output is enriched by the response. This is not an efficient way of accessing an external service provided that the service can accept more than one element. In the subsequent two posts, we will discuss updated pipelines that make RPC calls more efficiently. We begin with illustrating how to manage development resources followed by demonstrating the RPC service that we use in this series. Finally, we develop a Beam pipeline that accesses the external service to augment the input elements.

August 1, 202419 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka Python

In this post, we develop two Apache Beam pipelines that track sport activities of users and output their speed periodically. The first pipeline uses native transforms and Beam SQL is used for the latter. While Beam SQL can be useful in some situations, its features in the Python SDK are not complete compared to the Java SDK. Therefore, we are not able to build the required tracking pipeline using it. We end up discussing potential improvements of Beam SQL so that it can be used for building competitive applications with the Python SDK.

Meet the Streamhouse Trio - Paimon, Fluss, and Iceberg for Unified Data Architectures

Run Flink SQL Cookbook in Docker

Apache Beam Python Examples - Part 10 Develop Streaming File Reader Using Splittable DoFn

Apache Beam Python Examples - Part 9 Develop Batch File Reader and PiSampler Using Splittable DoFn

Apache Beam Python Examples - Part 8 Enhance Sport Activity Tracker With Runner Motivation

Apache Beam Python Examples - Part 7 Separate Droppable Data Into Side Output

Apache Beam Python Examples - Part 6 Call RPC Service in Batch With Defined Batch Size Using Stateful DoFn

Apache Beam Python Examples - Part 5 Call RPC Service in Batch Using Stateless DoFn

Apache Beam Python Examples - Part 4 Call RPC Service for Data Augmentation

Apache Beam Python Examples - Part 3 Build Sport Activity Tracker With/Without SQL