Blogs

Apache Beam Python Examples - Part 7 Separate Droppable Data Into Side Output

October 24, 2024 · 18 min read · Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka Python

We develop an Apache Beam pipeline that separates droppable elements from the rest of the data. Droppable elements are those that arrive after the watermark has passed the window's maximum timestamp plus the allowed lateness. Using a timer in a stateful DoFn, droppable data is separated from normal data and dispatched into a side output rather than being silently discarded, which is the default behaviour. Note that this pipeline works when droppable elements appear rarely, so that the chance of a droppable element being delivered as the first element in a particular window is low.
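
The droppable condition described above can be sketched in plain Python. This is only an illustration of the comparison, not the stateful DoFn from the post: the function names, the use of plain numbers for timestamps, and the (value, window max timestamp) input shape are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def is_droppable(watermark: int, window_max_timestamp: int, allowed_lateness: int) -> bool:
    """Return True when the watermark has passed the window's max timestamp
    plus the allowed lateness (hypothetical helper, timestamps in seconds)."""
    return watermark > window_max_timestamp + allowed_lateness


def route(
    elements: Iterable[Tuple[str, int]], watermark: int, allowed_lateness: int
) -> Tuple[List[str], List[str]]:
    """Split (value, window_max_timestamp) pairs into (normal, droppable) lists,
    mimicking a main output and a side output."""
    normal: List[str] = []
    droppable: List[str] = []
    for value, window_max_timestamp in elements:
        if is_droppable(watermark, window_max_timestamp, allowed_lateness):
            droppable.append(value)  # would be discarded silently by default
        else:
            normal.append(value)
    return normal, droppable


if __name__ == "__main__":
    # "a" belongs to a window that expired at 10 + 10 = 20, which the watermark (30) has passed.
    print(route([("a", 10), ("b", 60)], watermark=30, allowed_lateness=10))
```

In the actual pipeline, the current watermark is not directly available to user code, which is presumably why the post relies on a timer in a stateful DoFn.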

October 2, 2024 · 14 min read · Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka GRPC Python

In the previous post, we continued discussing an Apache Beam pipeline that augments input data by calling a Remote Procedure Call (RPC) service. A pipeline was developed that makes a single RPC call for a bundle of elements. The bundle size is determined by the runner, however, and we may encounter an issue if, for example, the RPC service becomes considerably slower when many elements are included in a single request. We can improve the pipeline using a stateful DoFn where the number of elements to process and the maximum wait seconds are controlled by state and timers. Note that, although the stateful DoFn used in this post solves the data augmentation task well, in practice we should use built-in transforms such as BatchElements and GroupIntoBatches whenever possible.
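
To make the idea of a bounded batch size concrete, here is a minimal plain-Python sketch. It does not use Beam state or timers, and it leaves out the maximum wait time entirely; `batched`, `augment`, and `fake_rpc` are hypothetical names, and `fake_rpc` merely stands in for a service that accepts several elements per request.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def batched(elements: Iterable[str], max_batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most max_batch_size elements, preserving order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch: List[str] = []
    for element in elements:
        batch.append(element)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def augment(
    elements: Iterable[str],
    rpc_call: Callable[[List[str]], List[int]],
    max_batch_size: int,
) -> List[Tuple[str, int]]:
    """Make one call per batch and pair each element with its response."""
    results: List[Tuple[str, int]] = []
    for batch in batched(elements, max_batch_size):
        responses = rpc_call(batch)  # a single request for the whole batch
        results.extend(zip(batch, responses))
    return results


def fake_rpc(batch: List[str]) -> List[int]:
    """Stand-in for the remote service: returns the length of each element."""
    return [len(item) for item in batch]


if __name__ == "__main__":
    print(augment(["a", "bb", "ccc"], fake_rpc, max_batch_size=2))
```

With a batch size of 2, three elements result in two requests instead of three, and no request ever carries more elements than the service is known to handle well.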

September 18, 2024 · 11 min read · Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka GRPC Python

In the previous post, we developed an Apache Beam pipeline where the input data is augmented by a Remote Procedure Call (RPC) service. An RPC call is made for each input element, and the output is enriched by the response. This is not an efficient way of accessing an external service when the service can accept more than one element per request. In this post, we discuss how to enhance the pipeline so that a single RPC call is made for a bundle of elements, which can save a significant amount of time compared to making a call for each element.

September 13, 2024 · 23 min read · Data Engineering DBT Guide for Production BigQuery Continuous Delivery Continuous Integration Dbt GitHub Actions

In the previous post, we started discussing a continuous integration/continuous delivery (CI/CD) process for a dbt project by introducing two GitHub Actions workflows - slim-ci and deploy. The former is triggered when a pull request is created against the main branch, and it builds only the modified models and their first-order children in a ci dataset, followed by performing tests on them. The second workflow is triggered once a pull request is merged. Beginning with running unit tests, it packages the dbt project as a Docker container and publishes it to Artifact Registry. In this post, we focus on how to deploy a dbt project in multiple environments while walking through the entire CI/CD process step-by-step.

September 5, 2024 · 18 min read · Data Engineering DBT Guide for Production BigQuery Continuous Delivery Continuous Integration Dbt GitHub Actions

Continuous integration (CI) is the process of ensuring new code integrates with the larger code base, and it puts a great emphasis on testing automation to check that the application is not broken whenever new commits are integrated into the main branch. Continuous delivery (CD) is an extension of continuous integration since it automatically deploys all code changes to a testing and/or production environment after the build stage. CI/CD helps development teams avoid bugs and code failures while maintaining a continuous cycle of software development and updates. In this post, we discuss how to set up a CI/CD pipeline for a data build tool (dbt) project using GitHub Actions where BigQuery is used as the target data warehouse.

August 22, 2024 · 8 min read · Data Processing Apache Beam Caching Data Enrichment Python

I recently contributed to Apache Beam by adding a common pipeline pattern - Cache data using a shared object. Both batch and streaming pipelines are introduced, and they utilise the Shared class of the Python SDK to enrich PCollection elements. This pattern can be more memory-efficient than side inputs, simpler than a stateful DoFn, and more performant than calling an external service, because it does not have to access an external service for every element or bundle of elements. In this post, we discuss this pattern in more detail with batch and streaming use cases. For the latter, we configure the cache to be refreshed periodically.

August 15, 2024 · 13 min read · Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka GRPC Python

In this post, we develop an Apache Beam pipeline where the input data is augmented by a Remote Procedure Call (RPC) service. An RPC call is made for each input element, and the output is enriched by the response. This is not an efficient way of accessing an external service when the service can accept more than one element per request. In the subsequent two posts, we will discuss updated pipelines that make RPC calls more efficiently. We begin by illustrating how to manage development resources, followed by a demonstration of the RPC service that we use in this series. Finally, we develop a Beam pipeline that accesses the external service to augment the input elements.

August 1, 2024 · 19 min read · Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka Python

In this post, we develop two Apache Beam pipelines that track sport activities of users and output their speed periodically. The first pipeline uses native transforms, and the second uses Beam SQL. While Beam SQL can be useful in some situations, its features in the Python SDK are not complete compared to the Java SDK. Therefore, we are not able to build the required tracking pipeline using it. We conclude by discussing potential improvements to Beam SQL so that it can be used for building competitive applications with the Python SDK.

July 18, 2024 · 15 min read · Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka Python

In this post, we develop two Apache Beam pipelines that calculate average word lengths from input texts that are ingested from a Kafka topic. They obtain the statistics from different angles. The first pipeline emits the global average length whenever a new input text arrives, while the second emits the average within a sliding time window.
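
The difference between the two statistics can be illustrated without Beam. The sketch below is an assumption-laden simplification: words arrive one at a time rather than as texts from Kafka, timestamps are integer seconds, and the function names and the window size and period parameters are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def running_average_lengths(words: Iterable[str]) -> List[float]:
    """Emit the global average word length after each new word arrives."""
    total = 0
    averages: List[float] = []
    for count, word in enumerate(words, start=1):
        total += len(word)
        averages.append(total / count)
    return averages


def sliding_average_lengths(
    timed_words: Iterable[Tuple[int, str]], size: int, period: int
) -> Dict[int, float]:
    """Average word length per sliding window, keyed by window start.

    Each window covers [start, start + size) and a new window starts every
    `period` seconds, so one word can fall into several windows.
    """
    lengths: Dict[int, List[int]] = defaultdict(list)
    for timestamp, word in timed_words:
        start = timestamp - timestamp % period  # latest window containing the word
        while start > timestamp - size:
            lengths[start].append(len(word))
            start -= period
    return {start: sum(vals) / len(vals) for start, vals in sorted(lengths.items())}


if __name__ == "__main__":
    print(running_average_lengths(["ab", "abcd"]))
    print(sliding_average_lengths([(1, "ab"), (6, "abcd")], size=10, period=5))
```

The global average keeps growing its history forever, whereas each sliding window only reflects the words whose timestamps fall inside it.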

July 4, 2024 · 22 min read · Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka Python

In this series, we develop Apache Beam Python pipelines. The majority of them are from Building Big Data Pipelines with Apache Beam by Jan Lukavský. Mainly relying on the Java SDK, the book teaches the fundamentals of Apache Beam using hands-on tasks, and we convert those tasks using the Python SDK. We focus on streaming pipelines, and they are deployed on a local (or embedded) Apache Flink cluster using the Apache Flink Runner. Beginning with setting up the development environment, in this post we build two pipelines that obtain the top K most frequent words and the longest word.
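
As a rough picture of what the two tasks compute, here is a batch-style sketch in plain Python. It is not the streaming Beam pipeline from the post; the tokenisation rule and the function names are assumptions chosen for the example.

```python
import re
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case the text and extract words (assumed rule: letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def top_k_words(lines: Iterable[str], k: int) -> List[Tuple[str, int]]:
    """Return the k most frequent words with their counts."""
    counts = Counter(word for line in lines for word in tokenize(line))
    return counts.most_common(k)


def longest_word(lines: Iterable[str]) -> Optional[str]:
    """Return the longest word, or None when there are no words."""
    words = [word for line in lines for word in tokenize(line)]
    return max(words, key=len) if words else None


if __name__ == "__main__":
    lines = ["the cat and the hat", "the cat sat"]
    print(top_k_words(lines, 2))
    print(longest_word(["beam pipelines", "beam"]))
```

A streaming version has to decide when to emit results, which is where windowing and triggers come in.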

Apache Beam Python Examples - Part 7 Separate Droppable Data Into Side Output

Apache Beam Python Examples - Part 6 Call RPC Service in Batch With Defined Batch Size Using Stateful DoFn

Apache Beam Python Examples - Part 5 Call RPC Service in Batch Using Stateless DoFn

Guide to Running DBT in Production

DBT CI/CD Demo With BigQuery and GitHub Actions

Cache Data on Apache Beam Pipelines Using a Shared Object

Apache Beam Python Examples - Part 4 Call RPC Service for Data Augmentation

Apache Beam Python Examples - Part 3 Build Sport Activity Tracker With/Without SQL

Apache Beam Python Examples - Part 2 Calculate Average Word Length With/Without Fixed Look Back

Apache Beam Python Examples - Part 1 Calculate K Most Frequent Words and Max Word Length