In the last post, we explored the fine-grained control of Flink’s DataStream API. Now, we’ll approach the same problem from a higher level of abstraction using the Flink Table API. This post demonstrates how to build a declarative analytics pipeline that processes our continuous stream of Avro-formatted order events. We will define a Table on top of a DataStream and use SQL-like expressions to perform windowed aggregations. This example highlights the power and simplicity of the Table API for analytical tasks and showcases Flink’s seamless integration between its different API layers to handle complex requirements like late data.
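
To make that concrete, here is a minimal sketch of the pattern: a Table defined over an existing DataStream (reusing the stream’s own timestamps and watermarks) and a tumbling-window aggregation written with Table API expressions. The `Order` class, its field names (`supplierId`, `totalPrice`), and the five-second window are illustrative assumptions, not the post’s exact code.

```kotlin
import org.apache.flink.streaming.api.datastream.DataStream
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.table.api.Expressions.lit
import org.apache.flink.table.api.Expressions.`$`
import org.apache.flink.table.api.Schema
import org.apache.flink.table.api.Tumble
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment

// `Order` stands in for the Avro-generated order class used throughout this series.
fun supplierStats(env: StreamExecutionEnvironment, orderStream: DataStream<Order>) {
    val tableEnv = StreamTableEnvironment.create(env)

    // Define a Table on top of the DataStream, carrying over the stream's
    // event-time timestamps and watermarks (assumed to be assigned upstream).
    val orders = tableEnv.fromDataStream(
        orderStream,
        Schema.newBuilder()
            .columnByMetadata("rowtime", "TIMESTAMP_LTZ(3)")
            .watermark("rowtime", "SOURCE_WATERMARK()")
            .build()
    )

    // Declarative tumbling-window aggregation per supplier.
    val stats = orders
        .window(Tumble.over(lit(5).seconds()).on(`$`("rowtime")).`as`("w"))
        .groupBy(`$`("supplierId"), `$`("w"))
        .select(
            `$`("supplierId"),
            `$`("w").start().`as`("windowStart"),
            `$`("totalPrice").sum().`as`("totalSales"),
            `$`("supplierId").count().`as`("orderCount")
        )

    // Hand the result back to the DataStream API, e.g. for printing or sinking.
    tableEnv.toDataStream(stats).print()
}
```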

Building on our exploration of stream processing, we now transition from Kafka’s native library to Apache Flink, a powerful, general-purpose distributed processing engine. In this post, we’ll dive into Flink’s foundational DataStream API. We will tackle the same supplier statistics problem - analyzing a stream of Avro-formatted order events - but this time using Flink’s robust features for stateful computation. This example will highlight Flink’s sophisticated event-time processing with watermarks and its elegant, built-in mechanisms for handling late-arriving data through side outputs.
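
Before diving in, here is a rough Kotlin sketch of the core pattern the post builds on: bounded-out-of-orderness watermarks, keyed tumbling windows, and a side output that catches records arriving after the watermark has passed their window. The `Order` class, its fields (`supplierId`, a millisecond `createdAt` timestamp), and the five-second bounds are placeholders, not the post’s actual implementation.

```kotlin
import java.time.Duration
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.datastream.DataStream
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.OutputTag

// Minimal per-supplier count (a stand-in for the post's richer supplier statistics).
class CountPerSupplier : AggregateFunction<Order, Long, Long> {
    override fun createAccumulator() = 0L
    override fun add(value: Order, accumulator: Long) = accumulator + 1
    override fun getResult(accumulator: Long) = accumulator
    override fun merge(a: Long, b: Long) = a + b
}

fun supplierCounts(orders: DataStream<Order>) {
    // Late records are routed to this side output instead of being silently dropped.
    val lateTag = OutputTag("late-orders", TypeInformation.of(Order::class.java))

    val counts = orders
        .assignTimestampsAndWatermarks(
            WatermarkStrategy.forBoundedOutOfOrderness<Order>(Duration.ofSeconds(5))
                .withTimestampAssigner(SerializableTimestampAssigner<Order> { order, _ -> order.createdAt })
        )
        .keyBy { it.supplierId }
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .sideOutputLateData(lateTag)
        .aggregate(CountPerSupplier())

    counts.print()
    counts.getSideOutput(lateTag).print("LATE")
}
```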

In this post, we shift our focus from basic Kafka clients to real-time stream processing with Kafka Streams. We’ll explore a Kotlin application designed to analyze a continuous stream of Avro-formatted order events, calculate supplier statistics in tumbling windows, and intelligently handle late-arriving data. This example demonstrates the power of Kafka Streams for building lightweight, yet robust, stream processing applications directly within your Kafka ecosystem, leveraging event-time processing and custom logic.
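
For orientation before we get into the details, a stripped-down version of such a topology might look like the sketch below: re-key each order by supplier, count orders in one-minute tumbling windows with a grace period for late records, and write the results to an output topic. The topic names, the `Order` type, its `supplierId` field, and the injected Avro `Serde` are assumptions made for this illustration.

```kotlin
import java.time.Duration
import org.apache.kafka.common.serialization.Serde
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.KeyValue
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.Topology
import org.apache.kafka.streams.kstream.Consumed
import org.apache.kafka.streams.kstream.Grouped
import org.apache.kafka.streams.kstream.Produced
import org.apache.kafka.streams.kstream.TimeWindows

// `orderSerde` would typically be a SpecificAvroSerde<Order> configured with the
// Schema Registry URL; it is passed in here to keep the sketch self-contained.
fun buildTopology(orderSerde: Serde<Order>): Topology {
    val builder = StreamsBuilder()

    builder.stream("orders", Consumed.with(Serdes.String(), orderSerde))
        // Re-key by supplier so statistics are maintained per supplier.
        .groupBy({ _, order -> order.supplierId }, Grouped.with(Serdes.String(), orderSerde))
        // One-minute tumbling windows; the grace period admits late-arriving records.
        .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(1), Duration.ofSeconds(30)))
        .count()
        .toStream()
        // Flatten the windowed key into a plain string key for the output topic.
        .map { windowedKey, count ->
            KeyValue.pair("${windowedKey.key()}@${windowedKey.window().startTime()}", count)
        }
        .to("supplier-stats", Produced.with(Serdes.String(), Serdes.Long()))

    return builder.build()
}
```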

In this post, we’ll explore a practical example of building Kafka client applications using Kotlin, Apache Avro for data serialization, and Gradle for build management. We’ll walk through the setup of a Kafka producer that generates mock order data and a consumer that processes these orders. This example highlights best practices such as schema management with Avro, robust error handling, and graceful shutdown, providing a solid foundation for your own Kafka-based projects. We’ll dive into the build configuration, the Avro schema definition, utility functions for Kafka administration, and the core logic of both the producer and consumer applications.
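
As a quick preview of the producer side, here is a hedged sketch that serializes values with Confluent’s `KafkaAvroSerializer` against a Schema Registry. The broker and registry addresses, the `orders` topic, and the `Order` class (with the constructor shown) are illustrative placeholders rather than the project’s actual values.

```kotlin
import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerConfig
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

fun createProducer(): KafkaProducer<String, Order> {
    val props = Properties().apply {
        put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer::class.java.name)
        put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaAvroSerializer")
        put("schema.registry.url", "http://localhost:8081")
        put(ProducerConfig.ACKS_CONFIG, "all")                 // wait for all in-sync replicas
        put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")  // avoid duplicates on retries
    }
    return KafkaProducer(props)
}

fun main() {
    createProducer().use { producer ->
        // Hypothetical Avro-generated Order(orderId, supplierId, totalPrice).
        val order = Order("order-1", "supplier-1", 42.0)
        producer.send(ProducerRecord("orders", order.orderId, order)) { _, error ->
            if (error != null) println("Send failed: ${error.message}")
        }
        producer.flush()
    }
}
```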

This post explores a Kotlin-based Kafka project, meticulously detailing the construction and operation of both a Kafka producer application, responsible for generating and sending order data, and a Kafka consumer application, designed to receive and process these orders. We’ll delve into each component, from build configuration to message handling, to understand how they work together in an event-driven system.
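
On the consuming side, a minimal poll loop with the graceful-shutdown pattern (a shutdown hook calling `wakeup()`) might look roughly like the sketch below; the group id, topic name, and addresses are again illustrative assumptions, and values are deserialized with Confluent’s `KafkaAvroDeserializer`.

```kotlin
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.errors.WakeupException
import org.apache.kafka.common.serialization.StringDeserializer

fun main() {
    val props = Properties().apply {
        put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        put(ConsumerConfig.GROUP_ID_CONFIG, "orders-consumer")
        put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer::class.java.name)
        put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaAvroDeserializer")
        put("schema.registry.url", "http://localhost:8081")
        put("specific.avro.reader", "true")  // deserialize into the generated Order class
        put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    }

    val consumer = KafkaConsumer<String, Any>(props)
    val mainThread = Thread.currentThread()

    // Graceful shutdown: wakeup() makes the blocked poll() throw WakeupException,
    // and the finally block closes the consumer so the group can rebalance promptly.
    Runtime.getRuntime().addShutdownHook(Thread {
        consumer.wakeup()
        mainThread.join()
    })

    try {
        consumer.subscribe(listOf("orders"))
        while (true) {
            val records = consumer.poll(Duration.ofMillis(500))
            for (record in records) {
                println("partition=${record.partition()} offset=${record.offset()} value=${record.value()}")
            }
        }
    } catch (e: WakeupException) {
        // Expected during shutdown; nothing to do.
    } finally {
        consumer.close()
    }
}
```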

The world of data is converging. The traditional divide between batch processing for historical analytics and stream processing for real-time insights is becoming increasingly blurry. Businesses demand architectures that handle both seamlessly. Enter the “Streamhouse” - an evolution of the Lakehouse concept, designed with streaming as a first-class citizen.

Today, we’ll introduce three key open-source technologies shaping this space: Apache Paimon™, Fluss, and Apache Iceberg. While each has unique strengths, their true power lies in how they can be integrated to build robust, flexible, and performant data platforms.

The Flink SQL Cookbook by Ververica is a hands-on, example-rich guide to mastering Apache Flink SQL for real-time stream processing. It offers a wide range of self-contained recipes, from basic queries and table operations to more advanced use cases like windowed aggregations, complex joins, user-defined functions (UDFs), and pattern detection. These examples are designed to be run on the Ververica Platform, and as such, the cookbook doesn’t include instructions for setting up a Flink cluster.

To help you run these recipes locally and explore Flink SQL without external dependencies, this post walks through setting up a fully functional local Flink cluster using Docker Compose. With this setup, you can experiment with the cookbook examples right on your machine.

In Part 9, we developed two Apache Beam pipelines using Splittable DoFn (SDF). One of them is a batch file reader that lists the files in an input folder and then processes them in parallel. We can extend that I/O connector so that, instead of listing the files once at the start, it scans the input folder periodically and processes new files as they appear. The techniques used in this post are broadly applicable: they can be used to develop I/O connectors for other unbounded (or streaming) data sources (e.g. Kafka) with the Python SDK.

A Splittable DoFn (SDF) is a generalization of a DoFn that enables Apache Beam developers to create modular and composable I/O components. It can also be applied to advanced non-I/O scenarios such as Monte Carlo simulation. In this post, we develop two Apache Beam pipelines. The first is an I/O connector that lists the files in a folder and then processes each file object in parallel. The second estimates the value of $\pi$ using Monte Carlo simulation.
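
For context, a Monte Carlo estimate of $\pi$ typically uses the area-ratio argument: sample $N$ points $(x_i, y_i)$ uniformly from the unit square $[0, 1]^2$, and since the quarter circle covers a fraction $\pi/4$ of that square,

$$
\hat{\pi} = 4 \cdot \frac{\left|\{\, i : x_i^2 + y_i^2 \le 1 \,\}\right|}{N},
$$

with the estimate converging at the usual Monte Carlo rate of $O(1/\sqrt{N})$.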

In Part 3, we developed a Beam pipeline that tracks users’ sport activities and periodically outputs their speeds. While reporting those values is useful on its own, we can offer users more engaging feedback with a pipeline that reports the pacing of their activities over time. For example, we can send an encouraging message to a user who has a performance goal but is underperforming over certain periods. In this post, we develop a new pipeline that tracks user activities and reports pacing details by comparing short-term metrics with their long-term counterparts.