Data Streaming

Deploy Python Stream Processing App on Kubernetes - Part 2 Beam Pipeline on Flink Runner

June 6, 202416 min read Data Streaming Kubernetes Deploy Python Stream Processing App on Kubernetes Apache Beam Apache Flink Apache Kafka Docker Kubernetes Python

In this post, we develop an Apache Beam pipeline using the Python SDK and deploy it on an Apache Flink cluster via the Apache Flink Runner. Same as Part I, we deploy a Kafka cluster using the Strimzi Operator on a minikube cluster as the pipeline uses Apache Kafka topics for its data source and sink. Then, we develop the pipeline as a Python package and add the package to a custom Docker image so that Python user code can be executed externally. For deployment, we create a Flink session cluster via the Flink Kubernetes Operator, and deploy the pipeline using a Kubernetes job. Finally, we check the output of the application by sending messages to the input Kafka topic using a Python producer application.

May 30, 202413 min read Data Streaming Kubernetes Deploy Python Stream Processing App on Kubernetes Apache Flink Apache Kafka Docker Kubernetes Python

Flink Kubernetes Operator acts as a control plane to manage the complete deployment lifecycle of Apache Flink applications. With the operator, we can simplify deployment and management of Python stream processing applications. In this series, we discuss how to deploy a PyFlink application and Python Apache Beam pipeline on the Flink Runner on Kubernetes. In Part 1, we first deploy a Kafka cluster on a minikube cluster as the source and sink of the PyFlink application are Kafka topics. Then, the application source is packaged in a custom Docker image and deployed on the minikube cluster using the Flink Kubernetes Operator. Finally, the output of the application is checked by sending messages to the input Kafka topic using a Python producer application.

May 9, 202412 min read Data Engineering Data Streaming Apache Beam Local Development With Python Apache Beam Python

We developed batch and streaming pipelines in Part 2 and Part 4. Often it is faster and simpler to identify and fix bugs on the pipeline code by performing local unit testing. Moreover, especially when it comes to creating a streaming pipeline, unit testing cases can facilitate development further by using TestStream as it allows us to advance watermarks or processing time according to different scenarios. In this post, we discuss how to perform unit testing of the batch and streaming pipelines that we developed earlier.

May 2, 202410 min read Data Streaming Apache Beam Local Development With Python Apache Beam Apache Flink Apache Kafka Python

In Part 3, we discussed the portability layer of Apache Beam as it helps understand (1) how Python pipelines run on the Flink Runner and (2) how multiple SDKs can be used in a single pipeline, followed by demonstrating local Flink and Kafka cluster creation for developing streaming pipelines. In this post, we develop a streaming pipeline that aggregates page visits by user in a fixed time window of 20 seconds. Two versions of the pipeline are created with/without relying on Beam SQL.

April 18, 202414 min read Data Streaming Apache Beam Local Development With Python Apache Beam Apache Flink Apache Kafka Python

Beam pipelines are portable between batch and streaming semantics but not every Runner is equally capable. The Apache Flink Runner supports Python, and it has good features that allow us to develop streaming pipelines effectively. We first discuss the portability layer of Apache Beam as it helps understand (1) how a pipeline developed by the Python SDK can be executed in the Flink Runner that only understands Java JAR and (2) how multiple SDKs can be used in a single pipeline. Then we move on to how to manage local Flink and Kafka clusters using bash scripts. Finally, we end up illustrating a simple streaming pipeline, which reads and writes website visit logs from and to Kafka topics.

January 11, 20247 min read Data Integration Data Streaming Kubernetes Kafka Development on Kubernetes Apache Kafka Docker Kafka Connect Kubernetes Minikube Python

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. In this post, we discuss how to set up a data ingestion pipeline using Kafka connectors. Fake customer and order data is ingested into Kafka topics using the MSK Data Generator. Also, we use the Confluent S3 sink connector to save the messages of the topics into a S3 bucket. The Kafka Connect servers and individual connectors are deployed using the custom resources of Strimzi on Kubernetes.

January 4, 20248 min read Data Streaming Kubernetes Kafka Development on Kubernetes Apache Kafka Docker Kubernetes Minikube Python Strimzi

Apache Kafka has five core APIs, and we can develop applications to send/read streams of data to/from topics in a Kafka cluster using the producer and consumer APIs. While the main Kafka project maintains only the Java APIs, there are several open source projects that provide the Kafka client APIs in Python. In this post, we discuss how to develop Kafka client applications using the kafka-python package on Kubernetes.

December 21, 20237 min read Data Streaming Kubernetes Kafka Development on Kubernetes Apache Kafka Docker Kubernetes Minikube Python Strimzi

Apache Kafka is one of the key technologies for implementing data streaming architectures. Strimzi provides a way to run an Apache Kafka cluster and related resources on Kubernetes in various deployment configurations. In this series of posts, we will discuss how to create a Kafka cluster, to develop Kafka client applications in Python and to build a data pipeline using Kafka connectors on Kubernetes.

December 14, 20236 min read Data Streaming Real Time Streaming With Kafka and Flink Amazon MSK Apache Kafka AWS AWS Lambda Kpow Python

Amazon MSK can be configured as an event source of a Lambda function. Lambda internally polls for new messages from the event source and then synchronously invokes the target Lambda function. With this feature, we can develop a Kafka consumer application in serverless environment where developers can focus on application logic. In this lab, we will discuss how to create a Kafka consumer using a Lambda function.

December 7, 202316 min read Data Engineering Data Streaming Development Amazon EMR Apache Flink Apache Kafka Apache Spark AWS Docker

Apache Flink became generally available for Amazon EMR on EKS from the EMR 6.15.0 releases. As it is integrated with the Glue Data Catalog, it can be particularly useful if we develop real time data ingestion/processing via Flink and build analytical queries using Spark (or any other tools or services that can access to the Glue Data Catalog). In this post, we will discuss how to set up a local development environment for Apache Flink and Spark using the EMR container images. After illustrating the environment setup, we will discuss a solution where data ingestion/processing is performed in real time using Apache Flink and the processed data is consumed by Apache Spark for analysis.

Deploy Python Stream Processing App on Kubernetes - Part 2 Beam Pipeline on Flink Runner

Deploy Python Stream Processing App on Kubernetes - Part 1 PyFlink Application

Apache Beam Local Development With Python - Part 5 Testing Pipelines

Apache Beam Local Development With Python - Part 4 Streaming Pipelines

Apache Beam Local Development With Python - Part 3 Flink Runner

Kafka Development on Kubernetes - Part 3 Kafka Connect

Kafka Development on Kubernetes - Part 2 Producer and Consumer

Kafka Development on Kubernetes - Part 1 Cluster Setup

Real Time Streaming With Kafka and Flink - Lab 6 Consume Data From Kafka Using Lambda

Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images