Blogs

Use External Schema Registry With MSK Connect – Part 2 MSK Deployment

April 3, 2022 · 7 min read · Data Integration, Data Streaming, Integrate Schema Registry With MSK Connect, Amazon ECS, Amazon MSK, Apache Kafka, Apicurio Registry, AWS, Change Data Capture (CDC), Debezium, Docker, Kafka Connect

We'll continue the discussion of a Change Data Capture (CDC) solution with a schema registry, this time covering its deployment to AWS. All major resources are deployed in private subnets, and a VPN is used to access them, which improves the developer experience. The Apicurio registry is used as the schema registry service and is deployed as an ECS service. So that the connectors can access the registry, the Confluent Avro Converter is packaged together with the connector sources. The post ends by illustrating how schema evolution is managed by the schema registry.

March 7, 2022 · 10 min read · Data Integration, Data Streaming, Integrate Schema Registry With MSK Connect, Apache Kafka, Apicurio Registry, AWS, Change Data Capture (CDC), Debezium, Docker, Kafka Connect

We'll discuss a Change Data Capture (CDC) architecture with a schema registry. As a starting point, a local development environment is set up using Docker Compose. The Debezium and Confluent S3 connectors are deployed with the Confluent Avro converter and the Apicurio registry is used as the schema registry service. A quick example is shown to illustrate how schema evolution can be managed by the schema registry.
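As a rough illustration of what managing schema evolution involves, the sketch below checks one common backward-compatibility rule for Avro record schemas: a field added in a new schema version needs a default so that records written with the old version can still be read. This is a simplified, assumed check written for illustration only; the function name and the sample schemas are hypothetical, and it is not the logic the Apicurio registry itself applies.

```python
from typing import Any


def added_fields_without_default(
    old_schema: dict[str, Any], new_schema: dict[str, Any]
) -> list[str]:
    """Return names of fields added in new_schema that lack a default.

    Simplified rule (assumption): a new reader schema can read old records
    only if every field it adds declares a default value. Type changes,
    aliases, and nested records are deliberately ignored here.
    """
    old_names = {field["name"] for field in old_schema["fields"]}
    return [
        field["name"]
        for field in new_schema["fields"]
        if field["name"] not in old_names and "default" not in field
    ]


# Hypothetical schemas, loosely modelled on a CDC record for a customer table.
customer_v1 = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
    ],
}

# Adds an optional field with a default: old records remain readable.
customer_v2_ok = {
    "type": "record",
    "name": "Customer",
    "fields": customer_v1["fields"]
    + [{"name": "email", "type": ["null", "string"], "default": None}],
}

# Adds a required field without a default: old records cannot fill it in.
customer_v2_bad = {
    "type": "record",
    "name": "Customer",
    "fields": customer_v1["fields"] + [{"name": "email", "type": "string"}],
}

if __name__ == "__main__":
    print(added_fields_without_default(customer_v1, customer_v2_ok))
    print(added_fields_without_default(customer_v1, customer_v2_bad))
```

A registry configured with a compatibility rule would typically reject a version like the second one at registration time, before any connector starts producing records with it.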

February 6, 2022 · 13 min read · Development, Amazon Aurora, AWS, PostgreSQL, SoftEther VPN, Terraform

We'll discuss how to set up a development infrastructure on AWS with Terraform. Terraform is used as an effective way of managing resources on AWS. An Aurora PostgreSQL cluster is created in a private subnet and SoftEther VPN is configured to access the database from the developer machine.

January 17, 2022 · 14 min read · Data Engineering, Amazon EKS, Amazon EMR, Apache Spark, AWS, EMR on EKS, Kubernetes

EMR on EKS is a deployment option in EMR that allows you to automate the provisioning and management of open-source big data frameworks on EKS. It can be an effective way of running Spark jobs for big data (as well as non-big data) workloads. In this post, we’ll discuss EMR on EKS with a simple example and a more elaborate one.

December 19, 2021 · 11 min read · Data Engineering, Data Integration, Data Streaming, Data Lake Demo Using Change Data Capture, Amazon EMR, Amazon MSK, Apache Hudi, Apache Kafka, AWS, Change Data Capture (CDC), Debezium, Kafka Connect

Change data capture (CDC) on Amazon MSK and ingesting data using Apache Hudi on Amazon EMR can be used to build an efficient data lake solution. In this post, we'll build a Hudi DeltaStreamer app on Amazon EMR and use the resulting Hudi table with Athena and QuickSight to build a dashboard.

December 12, 2021 · 17 min read · Data Engineering, Data Integration, Data Streaming, Data Lake Demo Using Change Data Capture, Amazon EMR, Amazon MSK, Apache Hudi, Apache Kafka, AWS, Change Data Capture (CDC), Debezium, Kafka Connect

Change data capture (CDC) on Amazon MSK and ingesting data using Apache Hudi on Amazon EMR can be used to build an efficient data lake solution. In this post, we'll implement the CDC part with Amazon MSK and MSK Connect.

December 5, 2021 · 18 min read · Data Engineering, Data Integration, Data Streaming, Data Lake Demo Using Change Data Capture, Amazon EMR, Amazon MSK, Apache Hudi, Apache Kafka, AWS, Change Data Capture (CDC), Debezium, Kafka Connect

Change data capture (CDC) on Amazon MSK and ingesting data using Apache Hudi on Amazon EMR can be used to build an efficient data lake solution. As a starting point, we’ll discuss the source database and CDC streaming infrastructure in the local environment.

November 14, 2021 · 8 min read · Data Engineering, AWS, AWS Glue, Docker, PySpark, Python

AWS Glue 3.0 was recently released, but a Docker image for this version has not been published. In this post, I’ll illustrate how to create a development environment for AWS Glue 3.0 (and later versions) by building a custom Docker image.

October 13, 2021 · 6 min read · Development, Amazon SQS, AWS, AWS Lambda, EventBridge, Node.js, Serverless Framework

Triggering a Lambda function with an EventBridge Events rule can serve as a serverless replacement for a cron job. Its highest frequency is one invocation per minute, however, so it cannot be used directly if you need to schedule a Lambda function more often than that. In this post, I’ll demonstrate another serverless solution for scheduling a Lambda function at a sub-minute frequency using Amazon SQS.
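One way such a scheme could work, sketched here under assumptions rather than taken from the post itself: a function invoked once per minute enqueues several SQS messages with staggered `DelaySeconds` values, and a second function consumes them as they become visible. The helper names below are hypothetical, and the example is written in Python for brevity even though the post is tagged Node.js.

```python
import json


def delay_offsets(interval_seconds: int, window_seconds: int = 60) -> list[int]:
    """Return the delays (in seconds) that split one window into equal ticks."""
    if not 0 < interval_seconds <= window_seconds:
        raise ValueError("interval must be between 1 and the window length")
    return list(range(0, window_seconds, interval_seconds))


def build_entries(interval_seconds: int) -> list[dict]:
    """Build SQS batch entries, one per tick, each delayed by its offset.

    The dictionary keys follow the shape expected by the `Entries` argument
    of boto3's `send_message_batch`. A single batch holds at most 10 entries,
    so intervals shorter than 6 seconds would need more than one batch.
    """
    return [
        {
            "Id": f"tick-{index}",
            "MessageBody": json.dumps({"offset": offset}),
            "DelaySeconds": offset,
        }
        for index, offset in enumerate(delay_offsets(interval_seconds))
    ]


if __name__ == "__main__":
    # A 10-second schedule yields six messages per one-minute invocation.
    for entry in build_entries(10):
        print(entry)
```

In a deployed version, the per-minute function would pass these entries to an SQS client, and the queue would be configured as the event source of the function that does the actual work.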

August 20, 2021 · 9 min read · Data Engineering, Apache Spark, AWS, AWS Glue, Docker, PySpark, Python

In this post, I'll demonstrate how to build development environments for AWS Glue 1.0 and 2.0 using a Docker image and the Visual Studio Code Remote - Containers extension.

Use External Schema Registry With MSK Connect – Part 2 MSK Deployment

Use External Schema Registry With MSK Connect – Part 1 Local Development

Simplify Your Development on AWS With Terraform

EMR on EKS by Example

Data Lake Demo Using Change Data Capture (CDC) on AWS – Part 3 Implement Data Lake

Data Lake Demo Using Change Data Capture (CDC) on AWS – Part 2 Implement CDC

Data Lake Demo Using Change Data Capture (CDC) on AWS – Part 1 Local Development

Local Development of AWS Glue 3.0 and Later

Yet Another Serverless Solution for Invoking AWS Lambda at a Sub-Minute Frequency

AWS Glue Local Development With Docker and Visual Studio Code