Terraform

Integrate Glue Schema Registry With Your Python Kafka App

April 12, 2023 | 26 min read | Data Streaming | Amazon MSK, Apache Kafka, AWS, AWS Glue Schema Registry, AWS Lambda, AWS Serverless Application Model, Docker, Docker Compose, Python, Terraform

Glue Schema Registry provides a centralized repository for managing and validating schemas for topic message data. Its features can be utilized by many AWS services when building data streaming applications. In this post, we will discuss how to integrate Python Kafka producer and consumer apps in AWS Lambda with the Glue Schema Registry.

March 14, 2023 | 12 min read | Data Streaming | Simplify Streaming Ingestion on AWS | Amazon Athena, Amazon EventBridge, Amazon MSK, Apache Kafka, AWS, AWS Lambda, AWS SAM, Python, Terraform

Streaming ingestion from Kafka (MSK) into Redshift and Athena can be much simpler as they now support direct integration. In part 2, we discuss an end-to-end streaming ingestion solution using EventBridge, Lambda, MSK and Athena. We also use AWS SAM integrated with Terraform for developing the producer Lambda function locally.

February 8, 2023 | 18 min read | Data Streaming | Simplify Streaming Ingestion on AWS | Amazon EventBridge, Amazon MSK, Amazon Redshift, Apache Kafka, AWS, AWS Lambda, AWS SAM, Python, Terraform

Streaming ingestion from Kafka (MSK) into Redshift and Athena can be much simpler as they now support direct integration. In part 1, we discuss an end-to-end streaming ingestion solution using EventBridge, Lambda, MSK and Redshift. We also use AWS SAM integrated with Terraform for developing the producer Lambda function locally.

December 6, 2022 | 15 min read | Data Engineering | DBT for Effective Data Transformation on AWS | Amazon Athena, Amazon QuickSight, AWS, Data Build Tool (DBT), Terraform

The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In the last part of the dbt on AWS series, we discuss data transformation pipelines using dbt on Amazon Athena. Subsets of IMDb data are used as source and data models are developed in multiple layers according to the dbt best practices.

November 1, 2022 | 19 min read | Data Engineering | DBT for Effective Data Transformation on AWS | Amazon EKS, Amazon EMR, Amazon QuickSight, Apache Spark, AWS, Data Build Tool (DBT), Terraform

The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In part 4 of the dbt on AWS series, we discuss data transformation pipelines using dbt on Amazon EMR on EKS. Subsets of IMDb data are used as source and data models are developed in multiple layers according to the dbt best practices.

October 19, 2022 | 19 min read | Data Engineering | DBT for Effective Data Transformation on AWS | Amazon EMR, Amazon QuickSight, Apache Spark, AWS, Data Build Tool (DBT), Terraform

The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In part 3 of the dbt on AWS series, we discuss data transformation pipelines using dbt on Amazon EMR. Subsets of IMDb data are used as source and data models are developed in multiple layers according to the dbt best practices.

October 9, 2022 | 18 min read | Data Engineering | DBT for Effective Data Transformation on AWS | Amazon QuickSight, Apache Spark, AWS, AWS Glue, Data Build Tool (DBT), Terraform

The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In part 2 of the dbt on AWS series, we discuss data transformation pipelines using dbt on AWS Glue. Subsets of IMDb data are used as source and data models are developed in multiple layers according to the dbt best practices.

September 28, 2022 | 19 min read | Data Engineering | DBT for Effective Data Transformation on AWS | Amazon Redshift, AWS, Data Build Tool (DBT), Terraform

The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In part 1 of the dbt on AWS series, we discuss data transformation pipelines using dbt on Redshift Serverless. Subsets of IMDb data are used as source and data models are developed in multiple layers according to the dbt best practices.

September 7, 2022 | 15 min read | Data Engineering | Amazon EMR, Apache Spark, AWS, PySpark, Terraform, Visual Studio Code

We will discuss how to set up a remote dev environment on an EMR cluster deployed in a private subnet, using a VPN and the VS Code Remote SSH extension. Typical Spark development examples are illustrated while the cluster is shared among multiple users. Overall, this provides an effective way of developing Spark apps on EMR and significantly improves the developer experience.

August 26, 2022 | 12 min read | Data Engineering | Amazon EKS, Amazon EMR, Apache Spark, AWS, Kubernetes, Terraform

We'll discuss how to provision and manage Spark jobs on EMR on EKS with Terraform. Amazon EKS Blueprints for Terraform will be used to provision EKS, an EMR virtual cluster and related resources. Spark job autoscaling will be managed by Karpenter, and two Spark jobs, one with and one without Dynamic Resource Allocation (DRA), will be compared.

Integrate Glue Schema Registry With Your Python Kafka App

Simplify Streaming Ingestion on AWS – Part 2 MSK and Athena

Simplify Streaming Ingestion on AWS – Part 1 MSK and Redshift

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 5 Athena

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 4 EMR on EKS

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 3 EMR on EC2

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 2 Glue

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 1 Redshift

Develop and Test Apache Spark Apps for EMR Remotely Using Visual Studio Code

Manage EMR on EKS With Terraform