Blogs

Apache Beam Local Development With Python - Part 4 Streaming Pipelines

May 2, 202410 min read Data Streaming Apache Beam Local Development With Python Apache Beam Apache Flink Apache Kafka Python

In Part 3, we discussed the portability layer of Apache Beam as it helps understand (1) how Python pipelines run on the Flink Runner and (2) how multiple SDKs can be used in a single pipeline, followed by demonstrating local Flink and Kafka cluster creation for developing streaming pipelines. In this post, we develop a streaming pipeline that aggregates page visits by user in a fixed time window of 20 seconds. Two versions of the pipeline are created with/without relying on Beam SQL.

April 18, 202414 min read Data Streaming Apache Beam Local Development With Python Apache Beam Apache Flink Apache Kafka Python

Beam pipelines are portable between batch and streaming semantics but not every Runner is equally capable. The Apache Flink Runner supports Python, and it has good features that allow us to develop streaming pipelines effectively. We first discuss the portability layer of Apache Beam as it helps understand (1) how a pipeline developed by the Python SDK can be executed in the Flink Runner that only understands Java JAR and (2) how multiple SDKs can be used in a single pipeline. Then we move on to how to manage local Flink and Kafka clusters using bash scripts. Finally, we end up illustrating a simple streaming pipeline, which reads and writes website visit logs from and to Kafka topics.

April 4, 202414 min read Data Processing Apache Beam Local Development With Python Apache Beam Beam SQL Python

In this series, we discuss local development of Apache Beam pipelines using Python. A basic Beam pipeline was introduced in Part 1, followed by demonstrating how to utilise Jupyter notebooks, Beam SQL and Beam DataFrames. In this post, we discuss Batch pipelines that aggregate website visit log by user and time. The pipelines are developed with and without Beam SQL. Additionally, each pipeline is implemented on a Jupyter notebook for demonstration.

March 28, 202412 min read Data Processing Apache Beam Local Development With Python Apache Beam Beam SQL Jupyter Notebook Python

Apache Beam and Apache Flink are open-source frameworks for parallel, distributed data processing at scale. Flink has DataStream and Table/SQL APIs and the former has more capacity to develop sophisticated data streaming applications. The DataStream API of PyFlink, Flink’s Python API, however, is not as complete as its Java counterpart, and it doesn’t provide enough capability to extend when there are missing features in Python. On the other hand, Apache Beam supports more possibility to extend and/or customise its features. In this series of posts, we discuss local development of Apache Beam pipelines using Python. In Part 1, a basic Beam pipeline is introduced, followed by demonstrating how to utilise Jupyter notebooks for interactive development. It also covers Beam SQL and Beam DataFrames examples on notebooks. In subsequent posts, we will discuss batch and streaming pipeline development and concludes with illustrating unit testing of existing pipelines.

March 14, 20248 min read Data Engineering DBT Pizza Shop Demo Amazon Athena Apache Airflow AWS Dbt Docker Python

In Part 5, we developed a dbt project that that targets Apache Iceberg where transformations are performed on Amazon Athena. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. To improve query performance, the fact table is denormalized to pre-join records from the dimension tables using the array and struct data types. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.

March 7, 202416 min read Data Engineering DBT Pizza Shop Demo Amazon Athena AWS Dbt Docker Python

In Part 1 and Part 3, we developed data build tool (dbt) projects that target PostgreSQL and BigQuery using fictional pizza shop data. The data is modelled by SCD type 2 dimension tables and one transactional fact table. While the order records should be joined with dimension tables to get complete details for PostgreSQL, the fact table is denormalized using nested and repeated fields to improve query performance for BigQuery. Open Table Formats such as Apache Iceberg bring a new opportunity that implements data warehousing features in a data lake (i.e. data lakehouse) and Amazon Athena is probably the easiest way to perform such tasks on AWS. In this post, we create a new dbt project that targets Apache Iceberg where transformations are performed on Amazon Athena. Data modelling is similar to the BigQuery project where the dimension tables are modelled by the SCD type 2 approach and the fact table is denormalized using the array and struct data types.

February 22, 20249 min read Data Engineering DBT Pizza Shop Demo Apache Airflow BigQuery Dbt Docker GCP Python

In Part 3, we developed a dbt project that targets Google BigQuery with fictional pizza shop data. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. The fact table is denormalized using nested and repeated fields for improving query performance. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.

February 8, 202416 min read Data Engineering DBT Pizza Shop Demo BigQuery Dbt Docker GCP Python

In this series, we discuss practical examples of data warehouse and lakehouse development where data transformation is performed by the data build tool (dbt) and ETL is managed by Apache Airflow. In Part 1, we developed a dbt project on PostgreSQL using fictional pizza shop data. At the end, the data sets are modelled by two SCD type 2 dimension tables and one transactional fact table. In this post, we create a new dbt project that targets Google BigQuery. While the dimension tables are kept by the same SCD type 2 approach, the fact table is denormalized using nested and repeated fields, which potentially can improve query performance by pre-joining corresponding dimension records.

January 25, 20249 min read Data Engineering DBT Pizza Shop Demo Apache Airflow Dbt Docker PostgreSQL Python

In this series of posts, we discuss data warehouse/lakehouse examples using data build tool (dbt) including ETL orchestration with Apache Airflow. In Part 1, we developed a dbt project on PostgreSQL with fictional pizza shop data. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.

January 18, 202415 min read Data Engineering DBT Pizza Shop Demo Dbt Docker PostgreSQL Python

The data build tool (dbt) is a popular data transformation tool for data warehouse development. Moreover, it can be used for data lakehouse development thanks to open table formats such as Apache Iceberg, Apache Hudi and Delta Lake. dbt supports key AWS analytics services and I wrote a series of posts that discuss how to utilise dbt with Redshift, Glue, EMR on EC2, EMR on EKS, and Athena. Those posts focus on platform integration, however, they do not show realistic ETL scenarios. In this series of posts, we discuss practical data warehouse/lakehouse examples including ETL orchestration with Apache Airflow. As a starting point, we develop a dbt project on PostgreSQL using fictional pizza shop data in this post.

Apache Beam Local Development With Python - Part 4 Streaming Pipelines

Apache Beam Local Development With Python - Part 3 Flink Runner

Apache Beam Local Development With Python - Part 2 Batch Pipelines

Apache Beam Local Development With Python - Part 1 Pipeline, Notebook, SQL and DataFrame

Data Build Tool (Dbt) Pizza Shop Demo - Part 6 ETL on Amazon Athena via Airflow

Data Build Tool (Dbt) Pizza Shop Demo - Part 5 Modelling on Amazon Athena

Data Build Tool (Dbt) Pizza Shop Demo - Part 4 ETL on BigQuery via Airflow

Data Build Tool (Dbt) Pizza Shop Demo - Part 3 Modelling on BigQuery

Data Build Tool (Dbt) Pizza Shop Demo - Part 2 ETL on PostgreSQL via Airflow

Data Build Tool (Dbt) Pizza Shop Demo - Part 1 Modelling on PostgreSQL