The data build tool (dbt) is an effective data transformation tool that supports key AWS analytics services: Redshift, Glue, EMR and Athena. In part 4 of the dbt on AWS series, we discuss data transformation pipelines using dbt on Amazon EMR on EKS. Subsets of IMDb data are used as the source, and data models are developed in multiple layers following dbt best practices.
The data build tool (dbt) is an effective data transformation tool that supports key AWS analytics services: Redshift, Glue, EMR and Athena. In part 3 of the dbt on AWS series, we discuss data transformation pipelines using dbt on Amazon EMR. Subsets of IMDb data are used as the source, and data models are developed in multiple layers following dbt best practices.
The data build tool (dbt) is an effective data transformation tool that supports key AWS analytics services: Redshift, Glue, EMR and Athena. In part 2 of the dbt on AWS series, we discuss data transformation pipelines using dbt on AWS Glue. Subsets of IMDb data are used as the source, and data models are developed in multiple layers following dbt best practices.
The data build tool (dbt) is an effective data transformation tool that supports key AWS analytics services: Redshift, Glue, EMR and Athena. In part 1 of the dbt on AWS series, we discuss data transformation pipelines using dbt on Redshift Serverless. Subsets of IMDb data are used as the source, and data models are developed in multiple layers following dbt best practices.
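To give a flavour of how the layered models in this series are run, here is a minimal sketch of invoking dbt programmatically. It assumes dbt-core 1.5+ (which exposes the dbtRunner entry point) and a profile already configured for one of the adapters above; the layer selector names (staging, intermediate, marts) follow common dbt project conventions and are illustrative, not taken from the posts themselves.

```python
# A hedged sketch: run dbt models layer by layer using the programmatic
# entry point available in dbt-core 1.5+. The selector names below are
# conventional layer names and may differ from an actual project.
from dbt.cli.main import dbtRunner, dbtRunnerResult

runner = dbtRunner()

for layer in ["staging", "intermediate", "marts"]:
    # Equivalent to `dbt run --select <layer>` on the command line.
    result: dbtRunnerResult = runner.invoke(["run", "--select", layer])
    if not result.success:
        raise RuntimeError(f"dbt run failed for layer: {layer}")
```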
We will discuss how to set up a remote development environment on an EMR cluster deployed in a private subnet, using a VPN connection and the VS Code Remote - SSH extension. Typical Spark development examples are illustrated, including how to share the cluster among multiple users. Overall, this offers an effective way of developing Spark apps on EMR and significantly improves the developer experience.
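As a taste of the setup, the sketch below uses boto3 to look up the primary node of an EMR cluster in a private subnet (reachable directly over the VPN) and prints an SSH config stanza that the VS Code Remote - SSH extension can use. The cluster ID, key file path and host alias are hypothetical placeholders, not values from the post.

```python
# A minimal sketch: resolve the private DNS name of an EMR primary node
# and emit an SSH config entry for the VS Code Remote - SSH extension.
# The cluster ID, key file path and host alias are placeholders.
import boto3

emr = boto3.client("emr")

instances = emr.list_instances(
    ClusterId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    InstanceGroupTypes=["MASTER"],
)
primary_dns = instances["Instances"][0]["PrivateDnsName"]

# The node is reachable over the VPN, so no bastion/ProxyCommand is needed.
print(
    f"Host emr-dev\n"
    f"  HostName {primary_dns}\n"
    f"  User hadoop\n"
    f"  IdentityFile ~/.ssh/emr-dev.pem\n"
)
```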
We'll discuss how to provision and manage Spark jobs on EMR on EKS with Terraform. Amazon EKS Blueprints for Terraform will be used for provisioning the EKS cluster, the EMR virtual cluster and related resources. Also, Spark job autoscaling will be managed by Karpenter, and two Spark jobs, one with and one without Dynamic Resource Allocation (DRA), will be compared.
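To make the DRA comparison concrete, here is a hedged sketch of submitting one of the jobs with boto3. The virtual cluster ID, role ARN, script location and release label are placeholders; the Spark confs are the standard Dynamic Resource Allocation settings, with shuffle tracking enabled because Kubernetes has no external shuffle service.

```python
# A minimal sketch: submit a Spark job to an EMR virtual cluster with
# Dynamic Resource Allocation enabled. IDs, ARNs and paths are placeholders.
import boto3

emr_containers = boto3.client("emr-containers")

dra_conf = " ".join(
    [
        "--conf spark.dynamicAllocation.enabled=true",
        # No external shuffle service on Kubernetes, so shuffle tracking
        # is required for executors to be released safely.
        "--conf spark.dynamicAllocation.shuffleTracking.enabled=true",
        "--conf spark.dynamicAllocation.minExecutors=1",
        "--conf spark.dynamicAllocation.maxExecutors=10",
    ]
)

response = emr_containers.start_job_run(
    virtualClusterId="abcdefghijklmnop",  # placeholder
    name="spark-job-with-dra",
    executionRoleArn="arn:aws:iam::123456789012:role/emr-eks-job-role",
    releaseLabel="emr-6.10.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-bucket/scripts/etl.py",  # placeholder
            "sparkSubmitParameters": dra_conf,
        }
    },
)
print(response["id"])
```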
We'll discuss the limitations of Apache Airflow's Lambda invoke function operator and create a custom Lambda operator. The custom operator extends the existing one so that it reports a function's invocation result correctly and records the exact error message on failure.
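As a rough illustration of the idea (not the post's exact code), the sketch below checks the FunctionError field of the invocation response, which the Lambda API sets when the function itself fails even though the HTTP status is still 200. It assumes a recent Amazon provider package (which exposes LambdaHook), and the class and argument names are illustrative.

```python
# A hedged sketch of a custom Lambda operator that surfaces function errors.
# Class name and arguments are illustrative; assumes a recent
# apache-airflow-providers-amazon release.
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.lambda_function import LambdaHook


class VerifiedLambdaOperator(BaseOperator):
    """Invokes a Lambda function and fails the task on function errors."""

    template_fields = ("payload",)

    def __init__(self, *, function_name, payload="{}", aws_conn_id="aws_default", **kwargs):
        super().__init__(**kwargs)
        self.function_name = function_name
        self.payload = payload
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        hook = LambdaHook(aws_conn_id=self.aws_conn_id)
        response = hook.invoke_lambda(
            function_name=self.function_name, payload=self.payload
        )
        body = response["Payload"].read().decode()
        # 'FunctionError' is present only when the function itself raised
        # an error; the status code is still 200 in that case.
        if response.get("FunctionError") or response["StatusCode"] >= 300:
            raise AirflowException(f"Lambda invocation failed: {body}")
        return body
```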
We'll discuss how to implement data warehousing ETL using Iceberg for data storage/management and Spark for data processing. A PySpark ETL app will be used for demonstration in an EMR local environment. Finally, the ETL results will be queried with Athena for verification.
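For a flavour of the Iceberg/Spark combination, below is a minimal PySpark sketch. It assumes the Iceberg Spark runtime and AWS bundle jars are on the classpath (as they are on recent EMR releases); the catalog name, warehouse location and table names are placeholders.

```python
# A minimal sketch: an Iceberg-backed upsert with Spark SQL. The catalog
# name, warehouse location and table names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-etl")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse")
    .config("spark.sql.catalog.glue.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

# Upsert staged records into the target table; Iceberg's MERGE INTO
# applies updates and inserts atomically.
spark.sql("""
    MERGE INTO glue.demo.users AS t
    USING glue.demo.users_staging AS s
    ON t.user_id = s.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```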
We'll discuss how to create a Spark local development environment for EMR using Docker and/or VS Code. A range of Spark development examples are demonstrated, and Glue Data Catalog integration is illustrated as well.
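As a taste of the Glue Data Catalog integration, the snippet below shows the Hive client factory setting that EMR images use to point the Spark metastore at the Glue Data Catalog. It is a sketch that assumes you are running inside an EMR-style container or otherwise have the Glue catalog client jars available.

```python
# A minimal sketch: a local SparkSession whose Hive metastore client is
# swapped for the AWS Glue Data Catalog implementation. Assumes the Glue
# catalog client jars are on the classpath (e.g. an EMR Docker image).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("local-dev")
    .enableHiveSupport()
    .config(
        "spark.hadoop.hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .getOrCreate()
)

# Databases now come from the Glue Data Catalog rather than a local metastore.
spark.sql("SHOW DATABASES").show()
```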
EMR on EKS is a deployment option for EMR that allows you to automate the provisioning and management of open-source big data frameworks on EKS. It can be an effective way of running Spark jobs for big data (as well as non-big data) workloads. In this post, we'll discuss EMR on EKS with both simple and more elaborate examples.
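To make the deployment option concrete, here is a hedged boto3 sketch that registers an EKS namespace as an EMR virtual cluster, the step that lets EMR schedule Spark jobs onto the EKS cluster. The EKS cluster name and namespace are placeholders, and the namespace is assumed to already have the RBAC permissions EMR requires.

```python
# A minimal sketch: register an EKS namespace as an EMR virtual cluster.
# The EKS cluster name and namespace are placeholders; the namespace must
# already be set up with the RBAC permissions that EMR on EKS requires.
import boto3

emr_containers = boto3.client("emr-containers")

response = emr_containers.create_virtual_cluster(
    name="demo-virtual-cluster",
    containerProvider={
        "id": "demo-eks-cluster",  # existing EKS cluster name (placeholder)
        "type": "EKS",
        "info": {"eksInfo": {"namespace": "spark"}},
    },
)
print(response["id"])
```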