Blogs

Realtime Dashboard With FastAPI, Streamlit and Next.js - Part 1 Data Producer

February 18, 2025 - 10 min read Development Realtime Dashboard With FastAPI, Streamlit and Next.js Docker FastAPI PostgreSQL Python WebSocket

In this series, we develop real-time monitoring dashboard applications. A data generating app is created with Python, and it continuously ingests the theLook eCommerce data into a PostgreSQL database. A WebSocket server, built with FastAPI, periodically queries the data to serve its clients. The monitoring dashboards will be developed using Streamlit and Next.js, with Apache ECharts for visualization. In this post, we walk through the data generation app and backend API, while the monitoring dashboards will be discussed in later posts.

December 19, 2024 - 12 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Python Splittable DoFn

In Part 9, we developed two Apache Beam pipelines using Splittable DoFn (SDF). One of them is a batch file reader, which lists the files in an input folder and then processes them in parallel. We can extend the I/O connector so that, instead of listing files once at the beginning, it scans the input folder periodically and processes new files whenever they are created. The techniques used in this post are useful because they can be applied to developing I/O connectors that target other unbounded (or streaming) data sources (e.g. Kafka) using the Python SDK.

December 5, 2024 - 10 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Python Splittable DoFn

A Splittable DoFn (SDF) is a generalization of a DoFn that enables Apache Beam developers to create modular and composable I/O components. It can also be applied in advanced non-I/O scenarios such as Monte Carlo simulation. In this post, we develop two Apache Beam pipelines. The first pipeline is an I/O connector that reads a list of files in a folder and then processes each of the file objects in parallel. The second pipeline estimates the value of $\pi$ by performing Monte Carlo simulation.
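To make the Monte Carlo idea concrete, here is a minimal plain-Python sketch. It is not the pipeline from the post and does not use Beam or SDF; the function name `estimate_pi` and its parameters are hypothetical. It samples points uniformly in the unit square and uses the fraction that falls inside the quarter circle of radius 1, which approaches $\pi/4$.

```python
import random
from typing import Optional


def estimate_pi(num_samples: int, seed: Optional[int] = None) -> float:
    """Estimate pi by sampling points uniformly in the unit square.

    Hypothetical helper for illustration only; not taken from the post.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # The point lies inside the quarter circle if x^2 + y^2 <= 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # Quarter-circle area / unit-square area = pi / 4.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    print(estimate_pi(100_000, seed=42))
```

Because each sample is independent, the work can be split into chunks and the per-chunk counts summed, which is presumably what makes this a natural fit for parallel processing.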

November 21, 2024 - 18 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka Python

In Part 3, we developed a Beam pipeline that tracks sport activities of users and outputs their speeds periodically. While reporting such values is useful on its own, we can provide more engaging information if we have a pipeline that reports the pacing of users' activities over time. For example, we can send a message to encourage a user to work harder if they have a performance goal and are underperforming for some periods. In this post, we develop a new pipeline that tracks user activities and reports pacing details by comparing short-term metrics to their long-term counterparts.

November 7, 2024 - 11 min read Data Integration Data Streaming Change Data Capture (CDC) Debezium GCP Pub/Sub PostgreSQL Pub/Sub Emulator

Change data capture (CDC) is a data integration pattern to track changes in a database so that actions can be taken using the changed data. Debezium is probably the most popular open source platform for CDC. It originally provided Kafka source connectors, and it now also offers a ready-to-use application called Debezium server. The standalone application can be used to stream change events to other messaging infrastructure such as Google Cloud Pub/Sub, Amazon Kinesis and Apache Pulsar. In this post, we develop a CDC solution locally using Docker. The source of the theLook eCommerce data is modified to generate data continuously, and the data is inserted into multiple tables of a PostgreSQL database. Two of those tables are tracked by the Debezium server, which pushes their row-level changes into Pub/Sub topics on the Pub/Sub emulator. Finally, messages from the topics are read by a Python application.

October 24, 2024 - 18 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka Python

We develop an Apache Beam pipeline that separates droppable elements from the rest of the data. Droppable elements are those that arrive after the watermark has passed the window max timestamp plus allowed lateness. Using a timer in a Stateful DoFn, droppable data is separated from normal data and dispatched into a side output rather than being discarded silently, which is the default behaviour. Note that this pipeline works in a situation where droppable elements do not appear often, and thus the chance that a droppable element is delivered as the first element in a particular window is low.
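The droppable condition described above can be sketched in plain Python. This is an illustration only, not the Stateful DoFn from the post and not Beam API: the function names are hypothetical, timestamps are assumed to be integer milliseconds, windows are assumed to be fixed-size, and the window max timestamp is assumed to be one millisecond before the window end.

```python
def window_max_timestamp(event_ts_ms: int, window_size_ms: int) -> int:
    """Return the max timestamp of the fixed window containing the event.

    Assumes fixed windows aligned to multiples of window_size_ms and a
    max timestamp one millisecond before the window end (an assumption
    made for this sketch).
    """
    window_start = event_ts_ms - (event_ts_ms % window_size_ms)
    return window_start + window_size_ms - 1


def is_droppable(
    event_ts_ms: int,
    watermark_ms: int,
    window_size_ms: int,
    allowed_lateness_ms: int,
) -> bool:
    """True if the watermark has passed window max timestamp + allowed lateness."""
    expiry = window_max_timestamp(event_ts_ms, window_size_ms) + allowed_lateness_ms
    return watermark_ms > expiry


if __name__ == "__main__":
    # Event at 3s falls in window [0s, 10s); with 2s allowed lateness the
    # window expires once the watermark exceeds 11_999 ms.
    print(is_droppable(3_000, 11_000, 10_000, 2_000))  # late but still accepted
    print(is_droppable(3_000, 12_000, 10_000, 2_000))  # droppable
```

In a pipeline, elements for which such a check is true would be routed to a side output instead of being discarded.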

October 2, 2024 - 14 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka GRPC Python

In the previous post, we continued discussing an Apache Beam pipeline that augments input data by calling a Remote Procedure Call (RPC) service. A pipeline was developed that makes a single RPC call for a bundle of elements. The bundle size is determined by the runner, however, and we may encounter an issue if, for example, the RPC service becomes much slower when many elements are included in a single request. We can improve the pipeline using a stateful DoFn where the number of elements to process and the maximum wait seconds can be controlled by state and timers. Note that, although the stateful DoFn used in this post solves the data augmentation task well, in practice, we should use the built-in transforms such as BatchElements and GroupIntoBatches whenever possible.
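The batching rule described above (flush when a maximum number of elements is buffered, or when a maximum wait has elapsed) can be sketched without Beam. The class below is a hypothetical plain-Python stand-in: the buffer plays the role of state, `on_timer` plays the role of a timer callback, and the caller supplies the current time explicitly. It is not the stateful DoFn from the post.

```python
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class SizeOrTimeBatcher(Generic[T]):
    """Buffer elements and release them by count or by elapsed time.

    Hypothetical illustration; names and behaviour are assumptions.
    """

    def __init__(self, max_elements: int, max_wait_seconds: float) -> None:
        self.max_elements = max_elements
        self.max_wait_seconds = max_wait_seconds
        self._buffer: List[T] = []
        self._first_seen: Optional[float] = None  # arrival time of oldest element

    def add(self, element: T, now: float) -> Optional[List[T]]:
        """Buffer an element; return a batch if the size limit is reached."""
        if self._first_seen is None:
            self._first_seen = now
        self._buffer.append(element)
        if len(self._buffer) >= self.max_elements:
            return self.flush()
        return None

    def on_timer(self, now: float) -> Optional[List[T]]:
        """Return a batch if the oldest element has waited long enough."""
        if self._first_seen is not None and now - self._first_seen >= self.max_wait_seconds:
            return self.flush()
        return None

    def flush(self) -> List[T]:
        """Return the buffered elements and reset the buffer."""
        batch, self._buffer = self._buffer, []
        self._first_seen = None
        return batch
```

Each returned batch would correspond to one RPC request, so the request size is bounded by `max_elements` and the added latency by `max_wait_seconds`.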

September 18, 2024 - 11 min read Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Apache Kafka GRPC Python

In the previous post, we developed an Apache Beam pipeline where the input data is augmented by a Remote Procedure Call (RPC) service. An RPC call is made for each input element, and the output is enriched by the response. This is not an efficient way of accessing an external service if the service can accept more than one element per request. In this post, we discuss how to enhance the pipeline so that a single RPC call is made for a bundle of elements, which can save a significant amount of time compared to making a call for each element.

September 13, 2024 - 23 min read Data Engineering DBT Guide for Production BigQuery Continuous Delivery Continuous Integration Dbt GitHub Actions

In the previous post, we started discussing a continuous integration/continuous delivery (CI/CD) process of a dbt project by introducing two GitHub Actions workflows - slim-ci and deploy. The former is triggered when a pull request is created to the main branch, and it builds only modified models and their first-order children in a ci dataset, followed by performing tests on them. The second workflow gets triggered once a pull request is merged. Beginning with running unit tests, it packages the dbt project as a Docker container and publishes it to Artifact Registry. In this post, we focus on how to deploy a dbt project in multiple environments while walking through the entire CI/CD process step-by-step.

September 5, 2024 - 18 min read Data Engineering DBT Guide for Production BigQuery Continuous Delivery Continuous Integration Dbt GitHub Actions

Continuous integration (CI) is the process of ensuring new code integrates with the larger code base, and it puts a great emphasis on testing automation to check that the application is not broken whenever new commits are integrated into the main branch. Continuous delivery (CD) is an extension of continuous integration since it automatically deploys all code changes to a testing and/or production environment after the build stage. CI/CD helps development teams avoid bugs and code failures while maintaining a continuous cycle of software development and updates. In this post, we discuss how to set up a CI/CD pipeline for a data build tool (dbt) project using GitHub Actions where BigQuery is used as the target data warehouse.

Realtime Dashboard With FastAPI, Streamlit and Next.js - Part 1 Data Producer

Apache Beam Python Examples - Part 10 Develop Streaming File Reader Using Splittable DoFn

Apache Beam Python Examples - Part 9 Develop Batch File Reader and PiSampler Using Splittable DoFn

Apache Beam Python Examples - Part 8 Enhance Sport Activity Tracker With Runner Motivation

Change Data Capture (CDC) Local Development With PostgreSQL, Debezium Server and Pub/Sub Emulator

Apache Beam Python Examples - Part 7 Separate Droppable Data Into Side Output

Apache Beam Python Examples - Part 6 Call RPC Service in Batch With Defined Batch Size Using Stateful DoFn

Apache Beam Python Examples - Part 5 Call RPC Service in Batch Using Stateless DoFn

Guide to Running DBT in Production

DBT CI/CD Demo With BigQuery and GitHub Actions