
In this series of posts, we discuss a Flink (PyFlink) application that reads from and writes to Kafka topics. In part 1, an app that targets a local Kafka cluster was created. In this post, we will update the app so that it connects to a Kafka cluster on Amazon MSK. The cluster uses IAM authentication, which gives the app an additional JAR dependency. As Amazon Managed Service for Apache Flink does not allow you to specify multiple pipeline JAR files, we have to build a custom Uber JAR that combines them. As in part 1, the app will be executed in a virtual environment as well as in a local Flink cluster for improved monitoring, this time with the updated pipeline JAR file.
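
As a rough sketch of this setup, the snippet below registers a single combined Uber JAR as the pipeline JAR and creates a Kafka source table with the IAM authentication client properties. The JAR path, bootstrap servers, topic and field names are placeholders, and the exact connector options depend on the Flink and connector versions.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Create a streaming TableEnvironment.
table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Only a single pipeline JAR can be passed to the managed service, so the
# Kafka connector and the MSK IAM auth library are combined into one Uber JAR
# (the path below is a placeholder).
table_env.get_config().get_configuration().set_string(
    "pipeline.jars", "file:///path/to/combined-uber-jar.jar"
)

# Kafka source table with the IAM authentication client properties
# (bootstrap servers, topic and fields are placeholders).
table_env.execute_sql("""
    CREATE TABLE orders_source (
        order_id STRING,
        price DOUBLE,
        created_at TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = '<msk-bootstrap-servers>',
        'properties.group.id' = 'orders-group',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset',
        'properties.security.protocol' = 'SASL_SSL',
        'properties.sasl.mechanism' = 'AWS_MSK_IAM',
        'properties.sasl.jaas.config' = 'software.amazon.msk.auth.iam.IAMLoginModule required;',
        'properties.sasl.client.callback.handler.class' = 'software.amazon.msk.auth.iam.IAMClientCallbackHandler'
    )
""")
```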


Apache Flink is widely used for building real-time stream processing applications. On AWS, Amazon Managed Service for Apache Flink is the easiest option for developing a Flink app as it provides the underlying infrastructure. Updating a guide from AWS, this series of posts discusses how to develop and deploy a Flink (PyFlink) application via the service (formerly Kinesis Data Analytics, KDA) where the data source and sink are Kafka topics. In part 1, the app will be developed locally, targeting a Kafka cluster created with Docker. Furthermore, it will be executed in a virtual environment as well as in a local Flink cluster for improved monitoring.
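
For reference, a minimal PyFlink Table API pipeline of the kind developed in this part might look like the sketch below, assuming a local Kafka broker on localhost:9093 and placeholder topic and field names.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source topic on the local Kafka cluster (broker address, topic and
# field names are placeholders).
table_env.execute_sql("""
    CREATE TABLE source_topic (
        message STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'input',
        'properties.bootstrap.servers' = 'localhost:9093',
        'properties.group.id' = 'flink-local',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Sink topic on the same cluster.
table_env.execute_sql("""
    CREATE TABLE sink_topic (
        message STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'output',
        'properties.bootstrap.servers' = 'localhost:9093',
        'format' = 'json'
    )
""")

# Pass records straight through from source to sink.
table_env.execute_sql("INSERT INTO sink_topic SELECT message FROM source_topic").wait()
```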


As part of investigating how to utilize Kafka Connect effectively for integration with AWS services, I demonstrated in Part 2 how to develop the Camel DynamoDB sink connector using Docker. Fake order data was generated using the MSK Data Generator source connector, and the sink connector was configured to consume the topic messages and ingest them into a DynamoDB table. In this post, I will illustrate how to deploy the data ingestion applications using Amazon MSK and MSK Connect.
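
To give a feel for the sink side, below is an illustrative Kafka Connect configuration for the Camel DynamoDB sink connector, written as a Python dictionary that can be handed to MSK Connect. The property keys follow the camel-aws-ddb-sink kamelet naming as I understand it and may differ between connector versions; the topic, table and region values are placeholders.

```python
import json

# Illustrative Kafka Connect configuration for the Camel DynamoDB sink
# connector; property keys may vary by connector version, and the topic,
# table and region values are placeholders.
camel_ddb_sink_config = {
    "connector.class": "org.apache.camel.kafkaconnector.awsddbsink.CamelAwsddbsinkSinkConnector",
    "tasks.max": "1",
    "topics": "order",
    "camel.kamelet.aws-ddb-sink.table": "orders",
    "camel.kamelet.aws-ddb-sink.region": "<aws-region>",
    "camel.kamelet.aws-ddb-sink.operation": "PutItem",
    "camel.kamelet.aws-ddb-sink.useDefaultCredentialsProvider": "true",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
}

# When deploying via MSK Connect, this map becomes the connector's
# connectorConfiguration value.
print(json.dumps(camel_ddb_sink_config, indent=2))
```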


The Glue Schema Registry provides a centralized repository for managing and validating schemas for topic message data, and its features can be utilized by many AWS services when building data streaming applications. In this post, we will discuss how to integrate the Glue Schema Registry with Python Kafka producer and consumer apps running on AWS Lambda.
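
As an indication of what the producer side can look like, the sketch below uses the third-party aws-glue-schema-registry package (module aws_schema_registry) together with kafka-python; the region, registry name, bootstrap servers, topic and schema are all placeholders, and the exact package API may differ between versions.

```python
import json

import boto3
from kafka import KafkaProducer
from aws_schema_registry import SchemaRegistryClient
from aws_schema_registry.avro import AvroSchema
from aws_schema_registry.adapter.kafka import KafkaSerializer

# Glue Schema Registry client backed by boto3 (region and registry name
# are placeholders).
glue_client = boto3.client("glue", region_name="<aws-region>")
registry_client = SchemaRegistryClient(glue_client, registry_name="<registry-name>")

# The serializer looks up/registers the schema and Avro-encodes each record.
producer = KafkaProducer(
    bootstrap_servers="<bootstrap-servers>",
    value_serializer=KafkaSerializer(registry_client),
)

# Placeholder Avro schema for the message value.
schema = AvroSchema(json.dumps({
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "price", "type": "double"},
    ],
}))

# Records are sent as (data, schema) pairs that the serializer understands.
producer.send("orders", value=({"order_id": "1", "price": 10.5}, schema))
producer.flush()
```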


Streaming ingestion from Kafka (MSK) into Redshift and Athena has become much simpler as both services now support direct integration. In part 2, we discuss an end-to-end streaming ingestion solution using EventBridge, Lambda, MSK and Athena. We also use AWS SAM integrated with Terraform to develop the producer Lambda function locally.
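
A minimal sketch of such a producer Lambda function is shown below, assuming kafka-python, placeholder environment variable and topic names, and with any MSK authentication settings omitted.

```python
import os
import json
import random
import datetime

from kafka import KafkaProducer

# Created outside the handler so the connection is reused across invocations.
# Environment variable and topic names are illustrative, and any MSK
# authentication settings are omitted.
producer = KafkaProducer(
    bootstrap_servers=os.environ["BOOTSTRAP_SERVERS"].split(","),
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)


def lambda_handler(event, context):
    """Invoked on a schedule by EventBridge; sends a small batch of fake orders."""
    topic = os.environ.get("TOPIC_NAME", "orders")
    for _ in range(100):
        order = {
            "order_id": str(random.randint(1, 1000)),
            "price": round(random.uniform(5, 100), 2),
            "created_at": datetime.datetime.utcnow(),
        }
        producer.send(topic, value=order)
    producer.flush()
    return {"records_sent": 100}
```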


Streaming ingestion from Kafka (MSK) into Redshift and Athena has become much simpler as both services now support direct integration. In part 1, we discuss an end-to-end streaming ingestion solution using EventBridge, Lambda, MSK and Redshift. We also use AWS SAM integrated with Terraform to develop the producer Lambda function locally.
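
As a rough illustration of the Redshift side, the sketch below issues the streaming-ingestion DDL through the Redshift Data API; the cluster, database, user, topic and view names are placeholders, and the exact DDL options (IAM role, how the Kafka payload is parsed) depend on the cluster setup.

```python
import boto3

# Issues the streaming-ingestion DDL through the Redshift Data API
# (cluster, database, user, role, topic and view names are placeholders).
client = boto3.client("redshift-data")

ddl_statements = [
    # External schema that maps the MSK cluster into Redshift.
    """
    CREATE EXTERNAL SCHEMA msk_orders
    FROM MSK
    IAM_ROLE default
    AUTHENTICATION iam
    CLUSTER_ARN '<msk-cluster-arn>';
    """,
    # Materialized view over the topic; the Kafka payload is parsed as JSON.
    """
    CREATE MATERIALIZED VIEW orders_view AUTO REFRESH YES AS
    SELECT refresh_time, JSON_PARSE(kafka_value) AS order_data
    FROM msk_orders."orders"
    WHERE CAN_JSON_PARSE(kafka_value);
    """,
]

for sql in ddl_statements:
    client.execute_statement(
        ClusterIdentifier="<redshift-cluster-id>",
        Database="<database>",
        DbUser="<db-user>",
        Sql=sql,
    )
```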


We'll continue the discussion of a Change Data Capture (CDC) solution with a schema registry and its deployment to AWS. All major resources are deployed in private subnets, and a VPN is used to access them in order to improve the developer experience. Apicurio Registry is used as the schema registry service and is deployed as an ECS service. In order for the connectors to have access to the registry, the Confluent Avro Converter is packaged together with the connector sources. The post ends by illustrating how schema evolution is managed by the schema registry.
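
For illustration, the converter settings below show how a connector could point the Confluent Avro Converter at the Apicurio registry through its Confluent-compatible API; the registry host is a placeholder and the compatibility path depends on the Apicurio version.

```python
import json

# Converter properties that a connector includes so that its records are
# serialized as Avro against the Apicurio registry through the registry's
# Confluent-compatible API; the host is a placeholder and the compat path
# depends on the Apicurio version.
avro_converter_props = {
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://<apicurio-host>:8080/apis/ccompat/v6",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://<apicurio-host>:8080/apis/ccompat/v6",
}

# These properties are merged into each connector's configuration, e.g. the
# Debezium source connector, before the connector is created.
print(json.dumps(avro_converter_props, indent=2))
```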


We'll discuss a Change Data Capture (CDC) architecture with a schema registry. As a starting point, a local development environment is set up using Docker Compose. The Debezium and Confluent S3 connectors are deployed with the Confluent Avro converter, and Apicurio Registry is used as the schema registry service. A quick example is shown to illustrate how schema evolution can be managed by the schema registry.
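
As a hypothetical example of such schema evolution, version 2 of an order schema below adds an optional field with a default value, a backward-compatible change that a registry configured with a backward compatibility rule will accept.

```python
import json

# Version 1 of a hypothetical order value schema.
order_v1 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "price", "type": "double"},
    ],
}

# Version 2 adds an optional field with a default value, so data written with
# either version can still be read, a backward-compatible change that a
# registry configured with a backward compatibility rule will accept.
order_v2 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "price", "type": "double"},
        {"name": "discount", "type": ["null", "double"], "default": None},
    ],
}

print(json.dumps(order_v2, indent=2))
```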


Change data capture (CDC) on Amazon MSK, combined with data ingestion using Apache Hudi on Amazon EMR, can be used to build an efficient data lake solution. In this post, we'll build a Hudi DeltaStreamer app on Amazon EMR and use the resulting Hudi table with Athena and QuickSight to build a dashboard.
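
A rough sketch of submitting a DeltaStreamer job as an EMR step is shown below; the cluster ID, JAR path, S3 locations and DeltaStreamer options are placeholders, and the exact arguments depend on the Hudi version and the Kafka source format.

```python
import boto3

# Submits a Hudi DeltaStreamer job as an EMR step; the cluster ID, JAR path,
# S3 locations and DeltaStreamer options are placeholders.
emr = boto3.client("emr")

step_args = [
    "spark-submit",
    "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
    "/usr/lib/hudi/hudi-utilities-bundle.jar",
    "--table-type", "COPY_ON_WRITE",
    "--source-class", "org.apache.hudi.utilities.sources.JsonKafkaSource",
    "--source-ordering-field", "created_at",
    "--target-base-path", "s3://<bucket>/hudi/orders",
    "--target-table", "orders",
    "--props", "s3://<bucket>/config/deltastreamer.properties",
    "--op", "UPSERT",
    "--continuous",
]

emr.add_job_flow_steps(
    JobFlowId="<emr-cluster-id>",
    Steps=[{
        "Name": "hudi-deltastreamer",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": step_args},
    }],
)
```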


Change data capture (CDC) on Amazon MSK, combined with data ingestion using Apache Hudi on Amazon EMR, can be used to build an efficient data lake solution. In this post, we'll build the CDC part of the pipeline with Amazon MSK and MSK Connect.