AWS Page 3 - Tags - Jaehyeon Kim

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 2 Glue

October 9, 2022 | 18 min read | Data Engineering | DBT for Effective Data Transformation on AWS | Amazon QuickSight, Apache Spark, AWS, AWS Glue, Data Build Tool (DBT), Terraform

The data build tool (dbt) is an effective data transformation tool that supports key AWS analytics services: Redshift, Glue, EMR and Athena. In part 2 of the dbt on AWS series, we discuss data transformation pipelines using dbt on AWS Glue. Subsets of IMDb data are used as the source, and data models are developed in multiple layers according to dbt best practices.

September 28, 2022 | 19 min read | Data Engineering | DBT for Effective Data Transformation on AWS | Amazon Redshift, AWS, Data Build Tool (DBT), Terraform

The data build tool (dbt) is an effective data transformation tool that supports key AWS analytics services: Redshift, Glue, EMR and Athena. In part 1 of the dbt on AWS series, we discuss data transformation pipelines using dbt on Redshift Serverless. Subsets of IMDb data are used as the source, and data models are developed in multiple layers according to dbt best practices.

September 7, 2022 | 15 min read | Data Engineering | Amazon EMR, Apache Spark, AWS, PySpark, Terraform, Visual Studio Code

We will discuss how to set up a remote dev environment on an EMR cluster deployed in a private subnet, using a VPN and the VS Code Remote SSH extension. Typical Spark development examples are illustrated while the cluster is shared among multiple users. Overall, this provides an effective way of developing Spark apps on EMR and significantly improves the developer experience.

August 26, 2022 | 12 min read | Data Engineering | Amazon EKS, Amazon EMR, Apache Spark, AWS, Kubernetes, Terraform

We'll discuss how to provision and manage Spark jobs on EMR on EKS with Terraform. Amazon EKS Blueprints for Terraform will be used to provision EKS, an EMR virtual cluster and related resources. Spark job autoscaling will be managed by Karpenter, and two Spark jobs, with and without Dynamic Resource Allocation (DRA), will be compared.

August 6, 2022 | 14 min read | Data Engineering | Apache Airflow, AWS, AWS Lambda, Docker, Docker Compose, Python

We'll discuss limitations of the Lambda invoke function operator of Apache Airflow and create a custom Lambda operator. The custom operator extends the existing one so that it reports the invocation result of a function correctly and records the exact error message on failure.
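
As a rough illustration of the idea (not the operator from the post), the sketch below checks a boto3-style `invoke` response for the `FunctionError` field and surfaces the error message carried in the payload. The names `LambdaInvocationError` and `parse_invoke_response` are hypothetical, and the response is assumed to be a dict whose `Payload` exposes a `read()` method.

```python
import json


class LambdaInvocationError(Exception):
    """Hypothetical exception raised when the invoked function reports an error."""


def parse_invoke_response(response: dict) -> object:
    """Return the decoded payload, or raise if the function itself failed.

    Assumes a boto3-style response: a synchronous invocation can return
    StatusCode 200 even when the function raised, in which case the
    'FunctionError' key is present and the payload describes the error.
    """
    raw = response["Payload"].read()
    payload = json.loads(raw) if raw else None

    if "FunctionError" in response:
        # Error payloads are typically JSON objects with an 'errorMessage' field.
        if isinstance(payload, dict):
            message = payload.get("errorMessage", "unknown error")
        else:
            message = str(payload)
        raise LambdaInvocationError(f"{response['FunctionError']}: {message}")

    return payload
```

A custom operator could call a helper like this right after invoking the function, so that a failed function marks the task as failed and the message ends up in the task log.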

July 18, 2022 | 7 min read | Development | AWS, AWS Lambda, AWS SAM

We'll discuss how to build a serverless data processing application using the Serverless Application Model (SAM). A Lambda function is developed, which is triggered whenever an object is created in an S3 bucket. Third-party packages needed for data processing are made available through Lambda layers.
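
For orientation, a handler for such a trigger might look roughly like the sketch below. It is not the function from the post: `extract_s3_objects` and the return shape are hypothetical, and the only thing assumed about the input is the standard S3 event notification layout (`Records[].s3.bucket.name` and a URL-encoded `Records[].s3.object.key`).

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list:
    """Collect (bucket, key) pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded, e.g. spaces are sent as '+'.
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Hypothetical entry point: list the objects that triggered the call."""
    objects = extract_s3_objects(event)
    # Real data processing (using packages from a Lambda layer) would go here.
    return {"processed": [f"s3://{bucket}/{key}" for bucket, key in objects]}
```

Keeping the event parsing in a separate function makes it easy to test locally with a hand-written event before deploying with SAM.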

June 26, 2022 | 12 min read | Data Engineering | Amazon EMR, Apache Iceberg, Apache Spark, AWS, Docker, Docker Compose, ETL, PySpark, SCD, Slowly Changing Dimension, Visual Studio Code

We'll discuss how to implement data warehousing ETL using Iceberg for data storage/management and Spark for data processing. A PySpark ETL app will be used for demonstration in an EMR local environment. Finally, the ETL results will be queried with Athena for verification.

May 8, 2022 | 17 min read | Data Engineering | Amazon EMR, Apache Spark, AWS, Docker, Docker Compose, PySpark, Visual Studio Code

We'll discuss how to create a Spark local dev environment for EMR using Docker and/or VS Code. A range of Spark development examples is demonstrated, and Glue Catalog integration is illustrated as well.

April 3, 2022 | 7 min read | Data Streaming | Integrate Schema Registry With MSK Connect | Amazon ECS, Amazon MSK, Amazon MSK Connect, Apache Kafka, AWS, Docker, Docker Compose, Kafka Connect, Terraform

We'll continue the discussion of a Change Data Capture (CDC) solution with a schema registry and its deployment to AWS. All major resources are deployed in private subnets, and a VPN is used to access them in order to improve the developer experience. The Apicurio registry is used as the schema registry service and is deployed as an ECS service. So that the connectors can access the registry, the Confluent Avro converter is packaged together with the connector sources. The post ends by illustrating how schema evolution is managed by the schema registry.

March 7, 2022 | 10 min read | Data Streaming | Integrate Schema Registry With MSK Connect | Amazon MSK, Amazon MSK Connect, Apache Kafka, AWS, Docker, Docker Compose, Kafka Connect

We'll discuss a Change Data Capture (CDC) architecture with a schema registry. As a starting point, a local development environment is set up using Docker Compose. The Debezium and Confluent S3 connectors are deployed with the Confluent Avro converter, and the Apicurio registry is used as the schema registry service. A quick example illustrates how schema evolution can be managed by the schema registry.

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 1 Redshift

Develop and Test Apache Spark Apps for EMR Remotely Using Visual Studio Code

Manage EMR on EKS With Terraform

Revisit AWS Lambda Invoke Function Operator of Apache Airflow

Serverless Application Model (SAM) for Data Professionals

Data Warehousing ETL Demo With Apache Iceberg on EMR Local Environment

Develop and Test Apache Spark Apps for EMR Locally Using Docker

Use External Schema Registry With MSK Connect – Part 2 MSK Deployment

Use External Schema Registry With MSK Connect – Part 1 Local Development