Python

Data Build Tool (Dbt) Pizza Shop Demo - Part 3 Modelling on BigQuery

February 8, 2024 | 16 min read | Data Engineering DBT Pizza Shop Demo BigQuery Dbt Docker GCP Python

In this series, we discuss practical examples of data warehouse and lakehouse development where data transformation is performed by the data build tool (dbt) and ETL is managed by Apache Airflow. In Part 1, we developed a dbt project on PostgreSQL using fictional pizza shop data. At the end of that post, the data sets were modelled as two SCD type 2 dimension tables and one transactional fact table. In this post, we create a new dbt project that targets Google BigQuery. While the dimension tables are maintained using the same SCD type 2 approach, the fact table is denormalized using nested and repeated fields, which can potentially improve query performance by pre-joining the corresponding dimension records.
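The pre-joining idea can be illustrated with a minimal plain-Python sketch. It is not the dbt model from the post; the field names (`product_id`, `valid_from`, `valid_to`, `items`) and the sample rows are hypothetical and only show how dimension attributes valid at order time could be embedded into a nested, repeated field of the fact record.

```python
from datetime import datetime

# Hypothetical SCD type 2 product dimension: one row per version of a product.
products = [
    {"product_id": 1, "name": "Margherita", "price": 10.0,
     "valid_from": datetime(2024, 1, 1), "valid_to": datetime(2024, 2, 1)},
    {"product_id": 1, "name": "Margherita", "price": 12.0,
     "valid_from": datetime(2024, 2, 1), "valid_to": datetime(9999, 12, 31)},
]


def product_as_of(product_id, ts):
    """Return the dimension version that was valid at timestamp ts, or None."""
    for row in products:
        if row["product_id"] == product_id and row["valid_from"] <= ts < row["valid_to"]:
            return row
    return None


def denormalize(order):
    """Embed the matching product version into each order item (a repeated field)."""
    items = []
    for item in order["items"]:
        dim = product_as_of(item["product_id"], order["created_at"])
        items.append({
            "quantity": item["quantity"],
            # Nested record holding the pre-joined dimension attributes.
            "product": None if dim is None else {"name": dim["name"], "price": dim["price"]},
        })
    return {"order_id": order["order_id"], "created_at": order["created_at"], "items": items}


january_order = {"order_id": 100, "created_at": datetime(2024, 1, 15),
                 "items": [{"product_id": 1, "quantity": 2}]}
march_order = {"order_id": 101, "created_at": datetime(2024, 3, 1),
               "items": [{"product_id": 1, "quantity": 1}]}

january_fact = denormalize(january_order)
march_fact = denormalize(march_order)
```

Because each fact record already carries the product attributes that applied when the order was placed, a query over the fact table would not need to join back to the dimension table.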

January 25, 2024 | 9 min read | Data Engineering DBT Pizza Shop Demo Apache Airflow Dbt Docker PostgreSQL Python

In this series of posts, we discuss data warehouse/lakehouse examples using the data build tool (dbt), including ETL orchestration with Apache Airflow. In Part 1, we developed a dbt project on PostgreSQL with fictional pizza shop data. Two dimension tables that keep product and user records were created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table was built to keep pizza orders. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.
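As a rough illustration of what an SCD Type 2 table does, the sketch below versions a record in plain Python instead of overwriting it. The column names (`id`, `valid_from`, `valid_to`), the far-future end date and the `apply_change` helper are assumptions made for this example, not the schema or code used in the posts.

```python
from datetime import datetime

# Assumed convention: the current version of a record has this far-future end date.
OPEN_END = datetime(9999, 12, 31)


def apply_change(dimension, key, new_attrs, changed_at):
    """Close the current version of a record and append a new one if attributes changed."""
    current = next(
        (row for row in dimension if row["id"] == key and row["valid_to"] == OPEN_END),
        None,
    )
    if current is not None:
        if all(current.get(name) == value for name, value in new_attrs.items()):
            return dimension  # Nothing changed, so no new version is needed.
        current["valid_to"] = changed_at  # Expire the old version instead of overwriting it.
    dimension.append({"id": key, **new_attrs, "valid_from": changed_at, "valid_to": OPEN_END})
    return dimension


users = []
apply_change(users, 1, {"email": "old@example.com"}, datetime(2024, 1, 1))
apply_change(users, 1, {"email": "new@example.com"}, datetime(2024, 1, 10))
apply_change(users, 1, {"email": "new@example.com"}, datetime(2024, 1, 20))  # no-op
```

After these calls, `users` holds two rows for the same user: the expired version and the current one, so the history of the record is preserved.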

January 18, 2024 | 15 min read | Data Engineering DBT Pizza Shop Demo Dbt Docker PostgreSQL Python

The data build tool (dbt) is a popular data transformation tool for data warehouse development. Moreover, it can be used for data lakehouse development thanks to open table formats such as Apache Iceberg, Apache Hudi and Delta Lake. dbt supports key AWS analytics services, and I wrote a series of posts that discuss how to utilise dbt with Redshift, Glue, EMR on EC2, EMR on EKS, and Athena. Those posts focus on platform integration, however, and do not show realistic ETL scenarios. In this series of posts, we discuss practical data warehouse/lakehouse examples including ETL orchestration with Apache Airflow. As a starting point, we develop a dbt project on PostgreSQL using fictional pizza shop data in this post.

January 11, 2024 | 7 min read | Data Integration Data Streaming Kubernetes Kafka Development on Kubernetes Apache Kafka Docker Kafka Connect Kubernetes Minikube Python

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. In this post, we discuss how to set up a data ingestion pipeline using Kafka connectors. Fake customer and order data is ingested into Kafka topics using the MSK Data Generator. We also use the Confluent S3 sink connector to save the topic messages into an S3 bucket. The Kafka Connect servers and individual connectors are deployed using the custom resources of Strimzi on Kubernetes.

January 4, 2024 | 8 min read | Data Streaming Kubernetes Kafka Development on Kubernetes Apache Kafka Docker Kubernetes Minikube Python Strimzi

Apache Kafka has five core APIs, and we can develop applications to send/read streams of data to/from topics in a Kafka cluster using the producer and consumer APIs. While the main Kafka project maintains only the Java APIs, there are several open source projects that provide the Kafka client APIs in Python. In this post, we discuss how to develop Kafka client applications using the kafka-python package on Kubernetes.

December 21, 2023 | 7 min read | Data Streaming Kubernetes Kafka Development on Kubernetes Apache Kafka Docker Kubernetes Minikube Python Strimzi

Apache Kafka is one of the key technologies for implementing data streaming architectures. Strimzi provides a way to run an Apache Kafka cluster and related resources on Kubernetes in various deployment configurations. In this series of posts, we will discuss how to create a Kafka cluster, develop Kafka client applications in Python and build a data pipeline using Kafka connectors on Kubernetes.

December 14, 2023 | 6 min read | Data Streaming Real Time Streaming With Kafka and Flink Amazon MSK Apache Kafka AWS AWS Lambda Kpow Python

Amazon MSK can be configured as an event source of a Lambda function. Lambda internally polls for new messages from the event source and then synchronously invokes the target Lambda function. With this feature, we can develop a Kafka consumer application in a serverless environment where developers can focus on application logic. In this lab, we will discuss how to create a Kafka consumer using a Lambda function.

November 23, 2023 | 15 min read | Data Streaming Real Time Streaming With Kafka and Flink Apache Flink Apache Kafka OpenSearch Pyflink Python

The value of data can be maximised when it is used without delay. With Apache Flink, we can build streaming analytics applications that incorporate the latest events with low latency. In this lab, we will create a Pyflink application that writes aggregated taxi ride data into an OpenSearch cluster. It aggregates the number of trips, the number of passengers and the trip durations by vendor ID over a window of 5 seconds. The data is then used to create a chart that monitors the status of taxi rides in the OpenSearch Dashboard.
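The windowed aggregation can be sketched in plain Python without Flink. The sketch assumes a tumbling (non-overlapping) 5-second window and hypothetical record fields (`ts` in epoch seconds, `vendor_id`, `passengers`, `duration`); the actual lab uses Pyflink, and its window type and schema may differ.

```python
from collections import defaultdict

WINDOW_SECONDS = 5  # window size taken from the description above


def aggregate(rides):
    """Group rides by (window start, vendor ID) and accumulate simple totals."""
    totals = defaultdict(lambda: {"trips": 0, "passengers": 0, "duration": 0})
    for ride in rides:
        # Align the event time to the start of its tumbling window.
        window_start = ride["ts"] - ride["ts"] % WINDOW_SECONDS
        bucket = totals[(window_start, ride["vendor_id"])]
        bucket["trips"] += 1
        bucket["passengers"] += ride["passengers"]
        bucket["duration"] += ride["duration"]
    return dict(totals)


# Hypothetical sample events for illustration only.
rides = [
    {"ts": 0, "vendor_id": 1, "passengers": 2, "duration": 300},
    {"ts": 3, "vendor_id": 1, "passengers": 1, "duration": 600},
    {"ts": 6, "vendor_id": 1, "passengers": 4, "duration": 120},
    {"ts": 7, "vendor_id": 2, "passengers": 1, "duration": 240},
]
result = aggregate(rides)
```

Each key in `result` pairs a window start with a vendor ID, which is roughly the shape of record a dashboard chart could plot over time. Unlike a streaming engine, this batch sketch does not handle late or out-of-order events.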

November 16, 2023 | 16 min read | Data Streaming Real Time Streaming With Kafka and Flink Amazon S3 Apache Flink Apache Kafka AWS Kpow Pyflink Python

In this lab, we will create a Pyflink application that exports Kafka topic messages into an S3 bucket. The app enriches the records by adding a new column using a user defined function and writes them via the FileSystem SQL connector. This allows us to achieve a simpler architecture compared to the original lab, where the records are sent into Amazon Kinesis Data Firehose, enriched by a separate Lambda function and written to an S3 bucket afterwards. While the records are being written to the S3 bucket, a Glue table will be created to query them on Amazon Athena.

October 26, 2023 | 14 min read | Data Streaming Real Time Streaming With Kafka and Flink Amazon EventBridge Amazon MSK Apache Kafka AWS AWS Lambda Kpow Python

In this lab, we will create a Kafka producer application using AWS Lambda, which sends fake taxi ride data into a Kafka topic on Amazon MSK. A configurable number of instances of the producer Lambda function will be invoked by an Amazon EventBridge schedule rule. In this way, we are able to generate test data concurrently based on the desired volume of messages.

Data Build Tool (Dbt) Pizza Shop Demo - Part 3 Modelling on BigQuery

Data Build Tool (Dbt) Pizza Shop Demo - Part 2 ETL on PostgreSQL via Airflow

Data Build Tool (Dbt) Pizza Shop Demo - Part 1 Modelling on PostgreSQL

Kafka Development on Kubernetes - Part 3 Kafka Connect

Kafka Development on Kubernetes - Part 2 Producer and Consumer

Kafka Development on Kubernetes - Part 1 Cluster Setup

Real Time Streaming With Kafka and Flink - Lab 6 Consume Data From Kafka Using Lambda

Real Time Streaming With Kafka and Flink - Lab 4 Clean, Aggregate, and Enrich Events With Flink

Real Time Streaming With Kafka and Flink - Lab 3 Transform and Write Data to S3 From Kafka Using Flink

Real Time Streaming With Kafka and Flink - Lab 1 Produce Data to Kafka Using Lambda