Data Processing

Cache Data on Apache Beam Pipelines Using a Shared Object

August 22, 2024 · 8 min read · Data Processing Apache Beam Caching Data Enrichment Python

I recently contributed to Apache Beam by adding a common pipeline pattern - Cache data using a shared object. Both batch and streaming pipelines are introduced, and they utilise the Shared class of the Python SDK to enrich PCollection elements. This pattern can be more memory-efficient than side inputs, simpler than a stateful DoFn, and more performant than calling an external service, because it does not have to access an external service for every element or bundle of elements. In this post, we discuss this pattern in more detail with batch and streaming use cases. For the latter, we configure the cache to be refreshed periodically.

April 4, 2024 · 14 min read · Data Processing Apache Beam Local Development With Python Apache Beam Beam SQL Python

In this series, we discuss local development of Apache Beam pipelines using Python. A basic Beam pipeline was introduced in Part 1, followed by a demonstration of how to utilise Jupyter notebooks, Beam SQL and Beam DataFrames. In this post, we discuss batch pipelines that aggregate website visit logs by user and time. The pipelines are developed with and without Beam SQL. Additionally, each pipeline is implemented in a Jupyter notebook for demonstration.

March 28, 2024 · 12 min read · Data Processing Apache Beam Local Development With Python Apache Beam Beam SQL Jupyter Notebook Python

Apache Beam and Apache Flink are open-source frameworks for parallel, distributed data processing at scale. Flink has DataStream and Table/SQL APIs, and the former has more capacity for developing sophisticated data streaming applications. The DataStream API of PyFlink, Flink’s Python API, however, is not as complete as its Java counterpart, and it doesn’t provide enough capability to extend when features are missing in Python. Apache Beam, on the other hand, offers more ways to extend and/or customise its features. In this series of posts, we discuss local development of Apache Beam pipelines using Python. In Part 1, a basic Beam pipeline is introduced, followed by a demonstration of how to utilise Jupyter notebooks for interactive development. It also covers Beam SQL and Beam DataFrames examples in notebooks. In subsequent posts, we will discuss batch and streaming pipeline development and conclude by illustrating unit testing of existing pipelines.

Cache Data on Apache Beam Pipelines Using a Shared Object

Apache Beam Local Development With Python - Part 2 Batch Pipelines

Apache Beam Local Development With Python - Part 1 Pipeline, Notebook, SQL and DataFrame