Apache Kafka

Flink Table API - Declarative Analytics for Supplier Stats in Real Time

June 17, 2025 · 19 min read · Data Streaming · Getting Started With Real-Time Streaming in Kotlin · Apache Kafka, Docker, Factor House Local, Kafka Streams, Kotlin, Kpow

In the last post, we explored the fine-grained control of Flink’s DataStream API. Now, we’ll approach the same problem from a higher level of abstraction using the Flink Table API. This post demonstrates how to build a declarative analytics pipeline that processes our continuous stream of Avro-formatted order events. We will define a Table on top of a DataStream and use SQL-like expressions to perform windowed aggregations. This example highlights the power and simplicity of the Table API for analytical tasks and showcases Flink’s seamless integration between its different API layers to handle complex requirements like late data.

June 10, 2025 · 17 min read · Data Streaming · Getting Started With Real-Time Streaming in Kotlin · Apache Kafka, Docker, Factor House Local, Kafka Streams, Kotlin, Kpow

Building on our exploration of stream processing, we now transition from Kafka’s native library to Apache Flink, a powerful, general-purpose distributed processing engine. In this post, we’ll dive into Flink’s foundational DataStream API. We will tackle the same supplier statistics problem - analyzing a stream of Avro-formatted order events - but this time using Flink’s robust features for stateful computation. This example will highlight Flink’s sophisticated event-time processing with watermarks and its elegant, built-in mechanisms for handling late-arriving data through side outputs.

June 3, 2025 · 18 min read · Data Streaming · Getting Started With Real-Time Streaming in Kotlin · Apache Kafka, Docker, Factor House Local, Kafka Streams, Kotlin, Kpow

In this post, we shift our focus from basic Kafka clients to real-time stream processing with Kafka Streams. We’ll explore a Kotlin application designed to analyze a continuous stream of Avro-formatted order events, calculate supplier statistics in tumbling windows, and intelligently handle late-arriving data. This example demonstrates the power of Kafka Streams for building lightweight, yet robust, stream processing applications directly within your Kafka ecosystem, leveraging event-time processing and custom logic.

May 27, 2025 · 15 min read · Data Streaming · Getting Started With Real-Time Streaming in Kotlin · Apache Kafka, Docker, Factor House Local, Kotlin, Kpow

In this post, we’ll explore a practical example of building Kafka client applications using Kotlin, Apache Avro for data serialization, and Gradle for build management. We’ll walk through the setup of a Kafka producer that generates mock order data and a consumer that processes these orders. This example highlights best practices such as schema management with Avro, robust error handling, and graceful shutdown, providing a solid foundation for your own Kafka-based projects. We’ll dive into the build configuration, the Avro schema definition, utility functions for Kafka administration, and the core logic of both the producer and consumer applications.

May 20, 2025 · 14 min read · Data Streaming · Getting Started With Real-Time Streaming in Kotlin · Apache Kafka, Docker, Factor House Local, Kotlin, Kpow

This post explores a Kotlin-based Kafka project, meticulously detailing the construction and operation of both a Kafka producer application, responsible for generating and sending order data, and a Kafka consumer application, designed to receive and process these orders. We’ll delve into each component, from build configuration to message handling, to understand how they work together in an event-driven system.

November 21, 2024 · 18 min read · Data Streaming · Apache Beam Python Examples · Apache Beam, Apache Flink, Apache Kafka, Python

In Part 3, we developed a Beam pipeline that tracks users' sport activities and periodically outputs their speeds. While reporting such values is useful on its own, we can provide more engaging information if the pipeline also reports the pacing of activities over time. For example, we can send a message encouraging a user to work harder if they have a performance goal and have been underperforming for some periods. In this post, we develop a new pipeline that tracks user activities and reports pacing details by comparing short-term metrics to their long-term counterparts.

October 24, 2024 · 18 min read · Data Streaming · Apache Beam Python Examples · Apache Beam, Apache Flink, Apache Kafka, Python

We develop an Apache Beam pipeline that separates droppable elements from the rest of the data. Droppable elements are those that arrive after the watermark has passed the window's maximum timestamp plus the allowed lateness. Using a timer in a stateful DoFn, droppable data is separated from normal data and dispatched into a side output rather than being silently discarded, which is the default behaviour. Note that this pipeline works in situations where droppable elements do not appear often, so the chance that a droppable element is delivered as the first element in a particular window is low.
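The definition of a droppable element can be illustrated with a minimal sketch in plain Python. It is not the Beam API and not the code from the post: the function name `is_droppable`, the millisecond units, and the choice of "window end minus one millisecond" as the window's maximum timestamp are all assumptions made for illustration.

```python
def is_droppable(window_end_ms: int, watermark_ms: int, allowed_lateness_ms: int) -> bool:
    """Return True if an element for this window would count as droppable.

    Assumption: the window covers [start, end) in milliseconds, so its
    maximum timestamp is taken to be end - 1.
    """
    max_timestamp_ms = window_end_ms - 1
    # Droppable once the watermark has passed max timestamp + allowed lateness.
    return watermark_ms > max_timestamp_ms + allowed_lateness_ms


# A one-minute window ending at 60_000 ms with 5 s of allowed lateness.
late_but_accepted = is_droppable(60_000, watermark_ms=62_000, allowed_lateness_ms=5_000)
too_late = is_droppable(60_000, watermark_ms=70_000, allowed_lateness_ms=5_000)
```

Here `late_but_accepted` is `False` because the watermark is still within the allowed lateness, while `too_late` is `True` and the element would be routed to the side output in a pipeline like the one described.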

October 2, 2024 · 14 min read · Data Streaming · Apache Beam Python Examples · Apache Beam, Apache Flink, Apache Kafka, GRPC, Python

In the previous post, we continued discussing an Apache Beam pipeline that augments input data by calling a Remote Procedure Call (RPC) service. A pipeline was developed that makes a single RPC call for a bundle of elements. The bundle size is determined by the runner, however, and we may encounter an issue if, for example, the RPC service becomes much slower when many elements are included in a single request. We can improve the pipeline using a stateful DoFn, where the number of elements to process and the maximum wait time in seconds are controlled by state and timers. Note that, although the stateful DoFn used in this post solves the data augmentation task well, in practice we should use built-in transforms such as BatchElements and GroupIntoBatches whenever possible.
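The idea of limiting a batch by both element count and maximum wait time can be sketched in plain Python. This is only an illustration of the control logic, not Beam's state and timer API and not the post's implementation: the `Batcher` class, its method names, and the explicit `now` argument (standing in for a timer firing) are hypothetical.

```python
from typing import List, Optional


class Batcher:
    """Buffers elements and releases them as one batch per RPC call."""

    def __init__(self, max_elements: int, max_wait_seconds: float) -> None:
        self.max_elements = max_elements
        self.max_wait_seconds = max_wait_seconds
        self._buffer: List[str] = []          # plays the role of buffered state
        self._deadline: Optional[float] = None  # plays the role of a timer

    def add(self, element: str, now: float) -> Optional[List[str]]:
        # The first element of a new batch starts the wait deadline.
        if not self._buffer:
            self._deadline = now + self.max_wait_seconds
        self._buffer.append(element)
        # Flush as soon as the batch reaches the size limit.
        if len(self._buffer) >= self.max_elements:
            return self._flush()
        return None

    def on_timer(self, now: float) -> Optional[List[str]]:
        # Flush a partial batch once the maximum wait has elapsed.
        if self._buffer and self._deadline is not None and now >= self._deadline:
            return self._flush()
        return None

    def _flush(self) -> List[str]:
        batch, self._buffer, self._deadline = self._buffer, [], None
        return batch


batcher = Batcher(max_elements=3, max_wait_seconds=5.0)
first = [batcher.add(e, now=float(i)) for i, e in enumerate(["a", "b", "c"])]
batcher.add("d", now=3.0)
partial = batcher.on_timer(now=8.0)
```

In this run the third `add` call returns the full batch `["a", "b", "c"]`, and the leftover element `"d"` is released by the timer once five seconds have passed since it was buffered. A real pipeline would make the RPC call with each returned batch.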

September 18, 2024 · 11 min read · Data Streaming · Apache Beam Python Examples · Apache Beam, Apache Flink, Apache Kafka, GRPC, Python

In the previous post, we developed an Apache Beam pipeline where the input data is augmented by a Remote Procedure Call (RPC) service. Each input element triggers its own RPC call, and the output is enriched by the response. This is not an efficient way of accessing an external service when the service can accept more than one element per request. In this post, we discuss how to enhance the pipeline so that a single RPC call is made for a bundle of elements, which can save a significant amount of time compared to making a call for each element.

August 15, 2024 · 13 min read · Data Streaming · Apache Beam Python Examples · Apache Beam, Apache Flink, Apache Kafka, GRPC, Python

In this post, we develop an Apache Beam pipeline where the input data is augmented by a Remote Procedure Call (RPC) service. Each input element triggers its own RPC call, and the output is enriched by the response. This is not an efficient way of accessing an external service when the service can accept more than one element per request. In the subsequent two posts, we will discuss updated pipelines that make RPC calls more efficiently. We begin by illustrating how to manage development resources, then demonstrate the RPC service that we use in this series. Finally, we develop a Beam pipeline that accesses the external service to augment the input elements.

Flink Table API - Declarative Analytics for Supplier Stats in Real Time

Flink DataStream API - Scalable Event Processing for Supplier Stats

Kafka Streams - Lightweight Real-Time Processing for Supplier Stats

Kafka Clients With Avro - Schema Registry and Order Events

Kafka Clients With JSON - Producing and Consuming Order Events

Apache Beam Python Examples - Part 8 Enhance Sport Activity Tracker With Runner Motivation

Apache Beam Python Examples - Part 7 Separate Droppable Data Into Side Output

Apache Beam Python Examples - Part 6 Call RPC Service in Batch With Defined Batch Size Using Stateful DoFn

Apache Beam Python Examples - Part 5 Call RPC Service in Batch Using Stateless DoFn

Apache Beam Python Examples - Part 4 Call RPC Service for Data Augmentation