Amazon EMR

Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images

December 7, 2023 | 16 min read | Data Engineering | Amazon EMR, Apache Flink, Apache Kafka, Apache Spark, Docker, Docker Compose, Pyflink, PySpark, Python

Apache Flink has been generally available on Amazon EMR on EKS since the EMR 6.15.0 release. Because it is integrated with the Glue Data Catalog, it is particularly useful when we develop real-time data ingestion/processing with Flink and build analytical queries with Spark (or any other tool or service that can access the Glue Data Catalog). In this post, we will discuss how to set up a local development environment for Apache Flink and Spark using the EMR container images. After illustrating the environment setup, we will discuss a solution where data ingestion/processing is performed in real time using Apache Flink and the processed data is consumed by Apache Spark for analysis.

November 1, 2022 | 19 min read | Data Engineering | DBT for Effective Data Transformation on AWS | Amazon EKS, Amazon EMR, Amazon QuickSight, Apache Spark, AWS, Data Build Tool (DBT), Terraform

The data build tool (dbt) is an effective data transformation tool, and it supports key AWS analytics services: Redshift, Glue, EMR and Athena. In part 4 of the dbt on AWS series, we discuss data transformation pipelines using dbt on Amazon EMR on EKS. Subsets of IMDb data are used as the source, and data models are developed in multiple layers according to dbt best practices.

October 19, 2022 | 19 min read | Data Engineering | DBT for Effective Data Transformation on AWS | Amazon EMR, Amazon QuickSight, Apache Spark, AWS, Data Build Tool (DBT), Terraform

The data build tool (dbt) is an effective data transformation tool, and it supports key AWS analytics services: Redshift, Glue, EMR and Athena. In part 3 of the dbt on AWS series, we discuss data transformation pipelines using dbt on Amazon EMR. Subsets of IMDb data are used as the source, and data models are developed in multiple layers according to dbt best practices.

September 7, 2022 | 15 min read | Data Engineering | Amazon EMR, Apache Spark, AWS, PySpark, Terraform, Visual Studio Code

We will discuss how to set up a remote dev environment on an EMR cluster deployed in a private subnet, using a VPN and the VS Code Remote SSH extension. Typical Spark development examples will be illustrated while the cluster is shared among multiple users. Overall, it offers an effective way of developing Spark apps on EMR and significantly improves the developer experience.

August 26, 2022 | 12 min read | Data Engineering | Amazon EKS, Amazon EMR, Apache Spark, AWS, Kubernetes, Terraform

We'll discuss how to provision and manage Spark jobs on EMR on EKS with Terraform. Amazon EKS Blueprints for Terraform will be used for provisioning EKS, an EMR virtual cluster and related resources. Spark job autoscaling will be managed by Karpenter, and two Spark jobs, one with and one without Dynamic Resource Allocation (DRA), will be compared.

June 26, 2022 | 12 min read | Data Engineering | Amazon EMR, Apache Iceberg, Apache Spark, AWS, Docker, Docker Compose, ETL, PySpark, SCD, Slowly Changing Dimension, Visual Studio Code

We'll discuss how to implement data warehousing ETL using Iceberg for data storage/management and Spark for data processing. A PySpark ETL app will be used for demonstration in an EMR local environment. Finally, the ETL results will be queried with Athena for verification.

May 8, 2022 | 17 min read | Data Engineering | Amazon EMR, Apache Spark, AWS, Docker, Docker Compose, PySpark, Visual Studio Code

We'll discuss how to create a local Spark dev environment for EMR using Docker and/or VS Code. A range of Spark development examples is demonstrated, and Glue Catalog integration is illustrated as well.

January 17, 2022 | 14 min read | Data Engineering | Amazon EKS, Amazon EMR, Apache Spark, AWS, Kubernetes

EMR on EKS is a deployment option in EMR that allows you to automate the provisioning and management of open-source big data frameworks on EKS. It can be an effective way of running Spark jobs to manage big data (as well as non-big data) workloads. In this post, we'll discuss EMR on EKS with simple and more elaborate examples.

December 19, 2021 | 11 min read | Data Engineering | Data Lake Demo Using Change Data Capture | Amazon EMR, Amazon MSK, Amazon MSK Connect, Apache Hudi, Apache Kafka, Apache Spark, AWS, Change Data Capture, Data Lake, Docker, Kafka Connect, Terraform

Change data capture (CDC) on Amazon MSK, combined with data ingestion using Apache Hudi on Amazon EMR, can be used to build an efficient data lake solution. In this post, we'll build a Hudi DeltaStreamer app on Amazon EMR and use the resulting Hudi table with Athena and QuickSight to build a dashboard.

December 12, 2021 | 17 min read | Data Engineering | Data Lake Demo Using Change Data Capture | Amazon EMR, Amazon MSK, Amazon MSK Connect, Apache Hudi, Apache Kafka, Apache Spark, AWS, Change Data Capture, Data Lake, Docker, Kafka Connect, Terraform

Change data capture (CDC) on Amazon MSK, combined with data ingestion using Apache Hudi on Amazon EMR, can be used to build an efficient data lake solution. In this post, we'll implement CDC with Amazon MSK and MSK Connect.

Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images

Data Build Tool (Dbt) for Effective Data Transformation on AWS – Part 4 EMR on EKS

Data Build Tool (Dbt) for Effective Data Transformation on AWS – Part 3 EMR on EC2

Develop and Test Apache Spark Apps for EMR Remotely Using Visual Studio Code

Manage EMR on EKS With Terraform

Data Warehousing ETL Demo With Apache Iceberg on EMR Local Environment

Develop and Test Apache Spark Apps for EMR Locally Using Docker

EMR on EKS by Example

Data Lake Demo Using Change Data Capture (CDC) on AWS – Part 3 Implement Data Lake

Data Lake Demo Using Change Data Capture (CDC) on AWS – Part 2 Implement CDC