Apache Beam

Apache Beam Python Examples - Part 1 Calculate K Most Frequent Words and Max Word Length

July 4, 202422 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka Python

In this series, we develop Apache Beam Python pipelines. The majority of them are from Building Big Data Pipelines with Apache Beam by Jan Lukavský. Mainly relying on the Java SDK, the book teaches fundamentals of Apache Beam using hands-on tasks, and we convert those tasks using the Python SDK. We focus on streaming pipelines, and they are deployed on a local (or embedded) Apache Flink cluster using the Apache Flink Runner. Beginning with setting up the development environment, we build two pipelines that obtain top K most frequent words and the word that has the longest word length in this post.

June 6, 202416 min read Data Streaming Kubernetes Deploy Python Stream Processing App on Kubernetes Apache Beam Apache Flink Apache Kafka Docker Kubernetes Python

In this post, we develop an Apache Beam pipeline using the Python SDK and deploy it on an Apache Flink cluster via the Apache Flink Runner. Same as Part I, we deploy a Kafka cluster using the Strimzi Operator on a minikube cluster as the pipeline uses Apache Kafka topics for its data source and sink. Then, we develop the pipeline as a Python package and add the package to a custom Docker image so that Python user code can be executed externally. For deployment, we create a Flink session cluster via the Flink Kubernetes Operator, and deploy the pipeline using a Kubernetes job. Finally, we check the output of the application by sending messages to the input Kafka topic using a Python producer application.

May 9, 202412 min read Data Engineering Data Streaming Apache Beam Local Development With Python Apache Beam Python

We developed batch and streaming pipelines in Part 2 and Part 4. Often it is faster and simpler to identify and fix bugs on the pipeline code by performing local unit testing. Moreover, especially when it comes to creating a streaming pipeline, unit testing cases can facilitate development further by using TestStream as it allows us to advance watermarks or processing time according to different scenarios. In this post, we discuss how to perform unit testing of the batch and streaming pipelines that we developed earlier.

May 2, 202410 min read Data Streaming Apache Beam Local Development With Python Apache Beam Apache Flink Apache Kafka Python

In Part 3, we discussed the portability layer of Apache Beam as it helps understand (1) how Python pipelines run on the Flink Runner and (2) how multiple SDKs can be used in a single pipeline, followed by demonstrating local Flink and Kafka cluster creation for developing streaming pipelines. In this post, we develop a streaming pipeline that aggregates page visits by user in a fixed time window of 20 seconds. Two versions of the pipeline are created with/without relying on Beam SQL.

April 18, 202414 min read Data Streaming Apache Beam Local Development With Python Apache Beam Apache Flink Apache Kafka Python

Beam pipelines are portable between batch and streaming semantics but not every Runner is equally capable. The Apache Flink Runner supports Python, and it has good features that allow us to develop streaming pipelines effectively. We first discuss the portability layer of Apache Beam as it helps understand (1) how a pipeline developed by the Python SDK can be executed in the Flink Runner that only understands Java JAR and (2) how multiple SDKs can be used in a single pipeline. Then we move on to how to manage local Flink and Kafka clusters using bash scripts. Finally, we end up illustrating a simple streaming pipeline, which reads and writes website visit logs from and to Kafka topics.

April 4, 202414 min read Data Processing Apache Beam Local Development With Python Apache Beam Beam SQL Python

In this series, we discuss local development of Apache Beam pipelines using Python. A basic Beam pipeline was introduced in Part 1, followed by demonstrating how to utilise Jupyter notebooks, Beam SQL and Beam DataFrames. In this post, we discuss Batch pipelines that aggregate website visit log by user and time. The pipelines are developed with and without Beam SQL. Additionally, each pipeline is implemented on a Jupyter notebook for demonstration.

March 28, 202412 min read Data Processing Apache Beam Local Development With Python Apache Beam Beam SQL Jupyter Notebook Python

Apache Beam and Apache Flink are open-source frameworks for parallel, distributed data processing at scale. Flink has DataStream and Table/SQL APIs and the former has more capacity to develop sophisticated data streaming applications. The DataStream API of PyFlink, Flink’s Python API, however, is not as complete as its Java counterpart, and it doesn’t provide enough capability to extend when there are missing features in Python. On the other hand, Apache Beam supports more possibility to extend and/or customise its features. In this series of posts, we discuss local development of Apache Beam pipelines using Python. In Part 1, a basic Beam pipeline is introduced, followed by demonstrating how to utilise Jupyter notebooks for interactive development. It also covers Beam SQL and Beam DataFrames examples on notebooks. In subsequent posts, we will discuss batch and streaming pipeline development and concludes with illustrating unit testing of existing pipelines.

Apache Beam Python Examples - Part 1 Calculate K Most Frequent Words and Max Word Length

Deploy Python Stream Processing App on Kubernetes - Part 2 Beam Pipeline on Flink Runner

Apache Beam Local Development With Python - Part 5 Testing Pipelines

Apache Beam Local Development With Python - Part 4 Streaming Pipelines

Apache Beam Local Development With Python - Part 3 Flink Runner

Apache Beam Local Development With Python - Part 2 Batch Pipelines

Apache Beam Local Development With Python - Part 1 Pipeline, Notebook, SQL and DataFrame