Docker Compose

Apache Beam Python Examples - Part 3 Build Sport Activity Tracker With/Without SQL

August 1, 2024 19 min read Apache Beam Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Docker Docker Compose Python

In this post, we develop two Apache Beam pipelines that track users' sport activities and periodically output their speed. The first pipeline uses native transforms, and the second uses Beam SQL. While Beam SQL can be useful in some situations, its features in the Python SDK are incomplete compared to the Java SDK, so we are not able to build the required tracking pipeline with it. We conclude by discussing potential improvements to Beam SQL that would allow it to be used for building competitive applications with the Python SDK.

July 18, 2024 15 min read Apache Beam Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Docker Docker Compose Python

In this post, we develop two Apache Beam pipelines that calculate average word lengths from input texts ingested from a Kafka topic. They compute the statistics in different ways. The first pipeline emits the global average length whenever a new input text arrives, while the second emits the average over a sliding time window.

July 4, 2024 22 min read Apache Beam Data Streaming Apache Beam Python Examples Apache Beam Apache Flink Docker Docker Compose Python

In this series, we develop Apache Beam Python pipelines. The majority of them are from Building Big Data Pipelines with Apache Beam by Jan Lukavský. Relying mainly on the Java SDK, the book teaches the fundamentals of Apache Beam through hands-on tasks, and we convert those tasks to the Python SDK. We focus on streaming pipelines, which are deployed on a local (or embedded) Apache Flink cluster using the Apache Flink Runner. In this post, we begin by setting up the development environment and then build two pipelines: one obtains the top K most frequent words, and the other finds the longest word.

March 14, 2024 8 min read Data Engineering DBT Pizza Shop Demo Amazon Athena Apache Airflow Apache Iceberg AWS Data Build Tool (DBT) Docker Docker Compose Python

In Part 5, we developed a dbt project that targets Apache Iceberg, where transformations are performed on Amazon Athena. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. To improve query performance, the fact table is denormalized to pre-join records from the dimension tables using the array and struct data types. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.

February 22, 2024 9 min read Data Engineering DBT Pizza Shop Demo Apache Airflow BigQuery Data Build Tool (DBT) Docker Docker Compose GCP Python

In Part 3, we developed a dbt project that targets Google BigQuery with fictional pizza shop data. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. The fact table is denormalized using nested and repeated fields for improving query performance. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.

January 25, 2024 9 min read Data Engineering DBT Pizza Shop Demo Apache Airflow Data Build Tool (DBT) Docker Docker Compose PostgreSQL Python

In this series of posts, we discuss data warehouse/lakehouse examples using data build tool (dbt) including ETL orchestration with Apache Airflow. In Part 1, we developed a dbt project on PostgreSQL with fictional pizza shop data. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.

January 18, 2024 15 min read Data Engineering DBT Pizza Shop Demo Data Build Tool (DBT) Docker Docker Compose PostgreSQL Python

The data build tool (dbt) is a popular data transformation tool for data warehouse development. Moreover, it can be used for data lakehouse development thanks to open table formats such as Apache Iceberg, Apache Hudi and Delta Lake. dbt supports key AWS analytics services, and I wrote a series of posts that discuss how to utilise dbt with Redshift, Glue, EMR on EC2, EMR on EKS, and Athena. Those posts focus on platform integration; however, they do not show realistic ETL scenarios. In this series of posts, we discuss practical data warehouse/lakehouse examples including ETL orchestration with Apache Airflow. As a starting point, we develop a dbt project on PostgreSQL using fictional pizza shop data in this post.

December 14, 2023 6 min read Apache Kafka Data Streaming Real Time Streaming With Kafka and Flink Amazon MSK Apache Kafka AWS AWS Lambda Docker Docker Compose Python

Amazon MSK can be configured as an event source of a Lambda function. Lambda internally polls for new messages from the event source and then synchronously invokes the target Lambda function. With this feature, we can develop a Kafka consumer application in a serverless environment where developers can focus on application logic. In this lab, we will discuss how to create a Kafka consumer using a Lambda function.

December 7, 2023 16 min read Apache Flink Apache Spark Data Engineering Amazon EMR Apache Flink Apache Kafka Apache Spark Docker Docker Compose Pyflink PySpark Python

Apache Flink became generally available for Amazon EMR on EKS from the EMR 6.15.0 release. As it is integrated with the Glue Data Catalog, it can be particularly useful if we develop real time data ingestion/processing via Flink and build analytical queries using Spark (or any other tools or services that can access the Glue Data Catalog). In this post, we will discuss how to set up a local development environment for Apache Flink and Spark using the EMR container images. After illustrating the environment setup, we will discuss a solution where data ingestion/processing is performed in real time using Apache Flink and the processed data is consumed by Apache Spark for analysis.

November 30, 2023 9 min read Apache Kafka Data Streaming Real Time Streaming With Kafka and Flink Amazon DynamoDB Amazon MSK Amazon MSK Connect Apache Kafka AWS Docker Docker Compose Kafka Connect

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. In this lab, we will discuss how to create a data pipeline that ingests data from a Kafka topic into a DynamoDB table using the Camel DynamoDB sink connector.

Apache Beam Python Examples - Part 2 Calculate Average Word Length With/Without Fixed Look Back

Apache Beam Python Examples - Part 1 Calculate K Most Frequent Words and Max Word Length

Data Build Tool (Dbt) Pizza Shop Demo - Part 6 ETL on Amazon Athena via Airflow

Data Build Tool (Dbt) Pizza Shop Demo - Part 4 ETL on BigQuery via Airflow

Data Build Tool (Dbt) Pizza Shop Demo - Part 2 ETL on PostgreSQL via Airflow

Data Build Tool (Dbt) Pizza Shop Demo - Part 1 Modelling on PostgreSQL

Real Time Streaming With Kafka and Flink - Lab 6 Consume Data From Kafka Using Lambda

Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images

Real Time Streaming With Kafka and Flink - Lab 5 Write Data to DynamoDB Using Kafka Connect