Data Engineering on Jaehyeon Kim

https://jaehyeon.me/categories/data-engineering/
Recent content in Data Engineering on Jaehyeon Kim
Copyright © 2023-2024 Jaehyeon Kim. All Rights Reserved.
Thu, 14 Mar 2024 00:00:00 +0000

Data Build Tool (dbt) Pizza Shop Demo - Part 6 ETL on Amazon Athena via Airflow
https://jaehyeon.me/blog/2024-03-14-dbt-pizza-shop-6/
Thu, 14 Mar 2024 00:00:00 +0000
In Part 5, we developed a dbt project that targets Apache Iceberg where transformations are performed on Amazon Athena. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. To improve query performance, the fact table is denormalized to pre-join records from the dimension tables using the array and struct data types.

Data Build Tool (dbt) Pizza Shop Demo - Part 5 Modelling on Amazon Athena
https://jaehyeon.me/blog/2024-03-07-dbt-pizza-shop-5/
Thu, 07 Mar 2024 00:00:00 +0000
In Part 1 and Part 3, we developed data build tool (dbt) projects that target PostgreSQL and BigQuery using fictional pizza shop data. The data is modelled by SCD type 2 dimension tables and one transactional fact table. While the order records should be joined with dimension tables to get complete details for PostgreSQL, the fact table is denormalized using nested and repeated fields to improve query performance for BigQuery.

Data Build Tool (dbt) Pizza Shop Demo - Part 4 ETL on BigQuery via Airflow
https://jaehyeon.me/blog/2024-02-22-dbt-pizza-shop-4/
Thu, 22 Feb 2024 00:00:00 +0000
In Part 3, we developed a dbt project that targets Google BigQuery with fictional pizza shop data.
Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. The fact table is denormalized using nested and repeated fields for improving query performance. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.

Data Build Tool (dbt) Pizza Shop Demo - Part 3 Modelling on BigQuery
https://jaehyeon.me/blog/2024-02-08-dbt-pizza-shop-3/
Thu, 08 Feb 2024 00:00:00 +0000
In this series, we discuss practical examples of data warehouse and lakehouse development where data transformation is performed by the data build tool (dbt) and ETL is managed by Apache Airflow. In Part 1, we developed a dbt project on PostgreSQL using fictional pizza shop data. At the end, the data sets are modelled by two SCD type 2 dimension tables and one transactional fact table. In this post, we create a new dbt project that targets Google BigQuery.

Data Build Tool (dbt) Pizza Shop Demo - Part 2 ETL on PostgreSQL via Airflow
https://jaehyeon.me/blog/2024-01-25-dbt-pizza-shop-2/
Thu, 25 Jan 2024 00:00:00 +0000
In this series of posts, we discuss data warehouse/lakehouse examples using data build tool (dbt) including ETL orchestration with Apache Airflow. In Part 1, we developed a dbt project on PostgreSQL with fictional pizza shop data. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders.
In this post, we discuss how to set up an ETL process on the project using Apache Airflow.

Data Build Tool (dbt) Pizza Shop Demo - Part 1 Modelling on PostgreSQL
https://jaehyeon.me/blog/2024-01-18-dbt-pizza-shop-1/
Thu, 18 Jan 2024 00:00:00 +0000
The data build tool (dbt) is a popular data transformation tool for data warehouse development. Moreover, it can be used for data lakehouse development thanks to open table formats such as Apache Iceberg, Apache Hudi and Delta Lake. dbt supports key AWS analytics services and I wrote a series of posts that discuss how to utilise dbt with Redshift, Glue, EMR on EC2, EMR on EKS, and Athena. Those posts focus on platform integration; however, they do not show realistic ETL scenarios.

Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images
https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/
Thu, 07 Dec 2023 00:00:00 +0000
Apache Flink became generally available for Amazon EMR on EKS from the EMR 6.15.0 releases, and we are able to pull the Flink (as well as Spark) container images from the ECR Public Gallery. As both of them can be integrated with the Glue Data Catalog, it can be particularly useful if we develop real time data ingestion/processing via Flink and build analytical queries using Spark (or any other tools or services that can access the Glue Data Catalog).

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 5 Athena
https://jaehyeon.me/blog/2022-12-06-dbt-on-aws-part-5-athena/
Tue, 06 Dec 2022 00:00:00 +0000
The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena.
In the previous posts, we discussed benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. Demo data projects that target Redshift Serverless, Glue, EMR on EC2 and EMR on EKS are illustrated as well.

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 4 EMR on EKS
https://jaehyeon.me/blog/2022-11-01-dbt-on-aws-part-4-emr-eks/
Tue, 01 Nov 2022 00:00:00 +0000
The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In the previous posts, we discussed benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. Demo data projects that target Redshift Serverless, Glue and EMR on EC2 are illustrated as well.

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 3 EMR on EC2
https://jaehyeon.me/blog/2022-10-19-dbt-on-aws-part-3-emr-ec2/
Wed, 19 Oct 2022 00:00:00 +0000
The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In the previous posts, we discussed benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse.
Demo data projects that target Redshift Serverless and Glue are illustrated as well.

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 2 Glue
https://jaehyeon.me/blog/2022-10-09-dbt-on-aws-part-2-glue/
Sun, 09 Oct 2022 00:00:00 +0000
The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In part 1, we discussed benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. A demo data project that targets Redshift Serverless is illustrated as well. In part 2 of the dbt on AWS series, we discuss data transformation pipelines using dbt on AWS Glue.

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 1 Redshift
https://jaehyeon.me/blog/2022-09-28-dbt-on-aws-part-1-redshift/
Wed, 28 Sep 2022 00:00:00 +0000
The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In part 1 of the dbt on AWS series, we discuss data transformation pipelines using dbt on Redshift Serverless. Subsets of IMDb data are used as source and data models are developed in multiple layers according to the dbt best practices.
Part 1 Redshift (this post), Part 2 Glue, Part 3 EMR on EC2, Part 4 EMR on EKS, Part 5 Athena. Motivation: In our experience delivering data solutions for our customers, we have observed a desire to move away from a centralised team function, responsible for the data collection, analysis and reporting, towards shifting this responsibility to an organisation’s lines of business (LOB) teams.

Develop and Test Apache Spark Apps for EMR Remotely Using Visual Studio Code
https://jaehyeon.me/blog/2022-09-07-emr-remote-dev/
Wed, 07 Sep 2022 00:00:00 +0000
When we develop a Spark application on EMR, we can use docker for local development or notebooks via EMR Studio (or EMR Notebooks). However, the local development option is not viable if the size of data is large. Also, I am not a fan of notebooks as it is not possible to utilise the features my editor supports such as syntax highlighting, autocomplete and code formatting. Moreover, it is not possible to organise code into modules and to perform unit testing properly with that option.

Manage EMR on EKS with Terraform
https://jaehyeon.me/blog/2022-08-26-emr-on-eks-with-terraform/
Fri, 26 Aug 2022 00:00:00 +0000
Amazon EMR on EKS is a deployment option for Amazon EMR that allows you to automate the provisioning and management of open-source big data frameworks on EKS. While eksctl is popular for working with Amazon EKS clusters, it has limitations when it comes to building infrastructure that integrates multiple AWS services. Also, it is not straightforward to update EKS cluster resources incrementally with it.
On the other hand, Terraform can be an effective tool for managing infrastructure that includes not only EKS and EMR virtual clusters but also other AWS resources.

Revisit AWS Lambda Invoke Function Operator of Apache Airflow
https://jaehyeon.me/blog/2022-08-06-revisit-lambda-operator/
Sat, 06 Aug 2022 00:00:00 +0000
Apache Airflow is a popular workflow management platform. A wide range of AWS services are integrated with the platform by Amazon AWS Operators. AWS Lambda is one of the integrated services, and it can be used to develop workflows efficiently. The current Lambda Operator, however, just invokes a Lambda function, and it can fail to report the invocation result of a function correctly and to record the exact error message from a failure.

Data Warehousing ETL Demo with Apache Iceberg on EMR Local Environment
https://jaehyeon.me/blog/2022-06-26-iceberg-etl-demo/
Sun, 26 Jun 2022 00:00:00 +0000
Unlike traditional data lakes, new table formats (Iceberg, Hudi and Delta Lake) support features that can be used to apply data warehousing patterns, which offers a way out of the data swamp. In this post, we’ll discuss how to implement ETL using retail analytics data. The data has two dimension data sets (user and product) and a single fact data set (order). The dimension data sets have different ETL strategies depending on whether historical changes are tracked.

Develop and Test Apache Spark Apps for EMR Locally Using Docker
https://jaehyeon.me/blog/2022-05-08-emr-local-dev/
Sun, 08 May 2022 00:00:00 +0000
[UPDATE 2023-12-07] I wrote a new post that simplifies the Spark configuration dramatically. Besides, the log configuration is based on Log4J2, which applies to newer Spark versions.
Moreover, the container is configured to run the Spark History Server, and it allows us to debug and diagnose completed and running Spark applications. I recommend referring to the new post. Amazon EMR is a managed service that simplifies running Apache Spark on AWS.

EMR on EKS by Example
https://jaehyeon.me/blog/2022-01-17-emr-on-eks-by-example/
Mon, 17 Jan 2022 00:00:00 +0000
EMR on EKS provides a deployment option for Amazon EMR that allows you to automate the provisioning and management of open-source big data frameworks on Amazon EKS. While a wide range of open source big data components are available in EMR on EC2, only Apache Spark is available in EMR on EKS. It is more flexible, however, in that applications of different EMR versions can be run in multiple availability zones on either EC2 or Fargate.

Data Lake Demo using Change Data Capture (CDC) on AWS – Part 3 Implement Data Lake
https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/
Sun, 19 Dec 2021 00:00:00 +0000
In the previous post, we created a VPC that has private and public subnets in 2 availability zones in order to build and deploy the data lake solution on AWS. NAT instances are created to forward outbound traffic to the internet and a VPN bastion host is set up to facilitate deployment. An Aurora PostgreSQL cluster is deployed to host the source database and a Python command line app is used to create the database.

Data Lake Demo using Change Data Capture (CDC) on AWS – Part 2 Implement CDC
https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/
Sun, 12 Dec 2021 00:00:00 +0000
In the previous post, we discussed a data lake solution where data ingestion is performed using change data capture (CDC) and the output files are upserted to an Apache Hudi table.
Once registered in the Glue Data Catalog, the table can be used for ad-hoc queries and report/dashboard creation. The Northwind database is used as the source database and, following the transactional outbox pattern, order-related changes are upserted to an outbox table by triggers.

Data Lake Demo using Change Data Capture (CDC) on AWS – Part 1 Local Development
https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/
Sun, 05 Dec 2021 00:00:00 +0000
Change data capture (CDC) is a proven data integration pattern that has a wide range of applications. Among those, data replication to data lakes is a good use case in data engineering. Coupled with best-of-breed data lake formats such as Apache Hudi, we can build an efficient data replication solution. This is the first post of the data lake demo series. Over time, we’ll build a data lake that uses CDC.

Local Development of AWS Glue 3.0 and Later
https://jaehyeon.me/blog/2021-11-14-glue-3-local-development/
Sun, 14 Nov 2021 00:00:00 +0000
In an earlier post, I demonstrated how to set up a local development environment for AWS Glue 1.0 and 2.0 using a docker image that is published by the AWS Glue team and the Visual Studio Code Remote – Containers extension. Recently AWS Glue 3.0 was released, but a docker image for this version is not published. In this post, I’ll illustrate how to create a development environment for AWS Glue 3.

AWS Glue Local Development with Docker and Visual Studio Code
https://jaehyeon.me/blog/2021-08-20-glue-local-development/
Fri, 20 Aug 2021 00:00:00 +0000
As described in the product page, AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
For development, a development endpoint is recommended, but it can be costly, inconvenient or unavailable (for Glue 2.0). The AWS Glue team published a Docker image that includes the AWS Glue binaries and all the dependencies packaged together. After inspecting it, I found that some modifications are necessary in order to build a development environment on it.

Thoughts on Apache Airflow AWS Lambda Operator
https://jaehyeon.me/blog/2020-04-13-airflow-lambda-operator/
Mon, 13 Apr 2020 00:00:00 +0000
Apache Airflow is a popular open-source workflow management platform. Typically, tasks are run remotely by Celery workers for scalability. In AWS, however, scalability can also be achieved using serverless computing services in a simpler way. For example, the ECS Operator allows us to run dockerized tasks and, with the Fargate launch type, they can run in a serverless environment. The ECS Operator alone is not sufficient because it can take up to several minutes to pull a Docker image and to set up a network interface (for the case of Fargate launch type).