<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Data Streaming on Jaehyeon Kim</title><link>https://jaehyeon.me/categories/data-streaming/</link><description>Recent content in Data Streaming on Jaehyeon Kim</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Copyright © 2023-2026 Jaehyeon Kim. All Rights Reserved.</copyright><lastBuildDate>Wed, 10 Dec 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://jaehyeon.me/categories/data-streaming/index.xml" rel="self" type="application/rss+xml"/><item><title>Stream Processing with Flink in Kotlin</title><link>https://jaehyeon.me/blog/2025-12-10-streaming-processing-with-flink-in-kotlin/</link><pubDate>Wed, 10 Dec 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-12-10-streaming-processing-with-flink-in-kotlin/</guid><description><![CDATA[<p>A couple of years ago, I read <a href="https://www.oreilly.com/library/view/stream-processing-with/9781491974285/" target="_blank" rel="noopener noreferrer">Stream Processing with Apache Flink<i class="fas fa-external-link-square-alt ms-1"></i></a> and worked through the examples using PyFlink. While the book offered a solid introduction to Flink, I frequently hit limitations with the Python API, as many features from the book weren&rsquo;t supported. This time, I decided to revisit the material, but using Kotlin. The experience has been much more rewarding and fun.</p>
<p>In porting the examples to Kotlin, I also took the opportunity to align the code with modern Flink practices. The complete source for this post is available in the <a href="https://github.com/jaehyeon-kim/flink-demos/tree/master/stream-processing-with-flink" target="_blank" rel="noopener noreferrer"><code>stream-processing-with-flink</code><i class="fas fa-external-link-square-alt ms-1"></i></a> directory of the <code>flink-demos</code> GitHub repository.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-12-10-streaming-processing-with-flink-in-kotlin/featured.png" length="31883" type="image/png"/></item><item><title>Self-service Data Platform via a Multi-tenant SQL Gateway</title><link>https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/</link><pubDate>Thu, 17 Jul 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/</guid><description>In the modern data stack, providing direct access to powerful engines like Apache Spark and Flink is a double-edged sword. While it empowers users, it often leads to chaos: resource contention from &amp;ldquo;noisy neighbors,&amp;rdquo; inconsistent security enforcement, and operational fragility. The core problem is the lack of a robust control plane between users and the raw compute power. The solution, therefore, isn&amp;rsquo;t to take power away from users, but to manage it through an intelligent intermediary.</description><enclosure url="https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/featured.png" length="55011" type="image/png"/></item><item><title>Flink Table API - Declarative Analytics for Supplier Stats in Real Time</title><link>https://jaehyeon.me/blog/2025-06-17-kotlin-getting-started-flink-table/</link><pubDate>Tue, 17 Jun 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-06-17-kotlin-getting-started-flink-table/</guid><description><![CDATA[<p>In the last post, we explored the fine-grained control of Flink&rsquo;s DataStream API. Now, we&rsquo;ll approach the same problem from a higher level of abstraction using the <strong>Flink Table API</strong>. This post demonstrates how to build a declarative analytics pipeline that processes our continuous stream of Avro-formatted order events. We will define a <code>Table</code> on top of a <code>DataStream</code> and use SQL-like expressions to perform windowed aggregations. This example highlights the power and simplicity of the Table API for analytical tasks and showcases Flink&rsquo;s seamless integration between its different API layers to handle complex requirements like late data.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-06-17-kotlin-getting-started-flink-table/featured.png" length="144113" type="image/png"/></item><item><title>Flink DataStream API - Scalable Event Processing for Supplier Stats</title><link>https://jaehyeon.me/blog/2025-06-10-kotlin-getting-started-flink-datastream/</link><pubDate>Tue, 10 Jun 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-06-10-kotlin-getting-started-flink-datastream/</guid><description><![CDATA[<p>Building on our exploration of stream processing, we now transition from Kafka&rsquo;s native library to <strong>Apache Flink</strong>, a powerful, general-purpose distributed processing engine. In this post, we&rsquo;ll dive into Flink&rsquo;s foundational <strong>DataStream API</strong>. 
We will tackle the same supplier statistics problem - analyzing a stream of Avro-formatted order events - but this time using Flink&rsquo;s robust features for stateful computation. This example will highlight Flink&rsquo;s sophisticated event-time processing with watermarks and its elegant, built-in mechanisms for handling late-arriving data through side outputs.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-06-10-kotlin-getting-started-flink-datastream/featured.png" length="142918" type="image/png"/></item><item><title>Kafka Streams - Lightweight Real-Time Processing for Supplier Stats</title><link>https://jaehyeon.me/blog/2025-06-03-kotlin-getting-started-kafka-streams/</link><pubDate>Tue, 03 Jun 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-06-03-kotlin-getting-started-kafka-streams/</guid><description><![CDATA[<p>In this post, we shift our focus from basic Kafka clients to real-time stream processing with <strong>Kafka Streams</strong>. We&rsquo;ll explore a Kotlin application designed to analyze a continuous stream of Avro-formatted order events, calculate supplier statistics in tumbling windows, and intelligently handle late-arriving data. This example demonstrates the power of Kafka Streams for building lightweight, yet robust, stream processing applications directly within your Kafka ecosystem, leveraging event-time processing and custom logic.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-06-03-kotlin-getting-started-kafka-streams/featured.png" length="131804" type="image/png"/></item><item><title>Kafka Clients with Avro - Schema Registry and Order Events</title><link>https://jaehyeon.me/blog/2025-05-27-kotlin-getting-started-kafka-avro-clients/</link><pubDate>Tue, 27 May 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-05-27-kotlin-getting-started-kafka-avro-clients/</guid><description><![CDATA[<p>In this post, we&rsquo;ll explore a practical example of building Kafka client applications using Kotlin, Apache Avro for data serialization, and Gradle for build management. We&rsquo;ll walk through the setup of a Kafka producer that generates mock order data and a consumer that processes these orders. This example highlights best practices such as schema management with Avro, robust error handling, and graceful shutdown, providing a solid foundation for your own Kafka-based projects. We&rsquo;ll dive into the build configuration, the Avro schema definition, utility functions for Kafka administration, and the core logic of both the producer and consumer applications.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-05-27-kotlin-getting-started-kafka-avro-clients/featured.png" length="73988" type="image/png"/></item><item><title>Kafka Clients with JSON - Producing and Consuming Order Events</title><link>https://jaehyeon.me/blog/2025-05-20-kotlin-getting-started-kafka-json-clients/</link><pubDate>Tue, 20 May 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-05-20-kotlin-getting-started-kafka-json-clients/</guid><description>&lt;p>This post explores a Kotlin-based Kafka project, meticulously detailing the construction and operation of both a Kafka producer application, responsible for generating and sending order data, and a Kafka consumer application, designed to receive and process these orders. 
We&amp;rsquo;ll delve into each component, from build configuration to message handling, to understand how they work together in an event-driven system.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2025-05-20-kotlin-getting-started-kafka-json-clients/featured.png" length="97922" type="image/png"/></item><item><title>Meet the Streamhouse Trio - Paimon, Fluss, and Iceberg for Unified Data Architectures</title><link>https://jaehyeon.me/blog/2025-05-06-streamhouse-trio/</link><pubDate>Tue, 06 May 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-05-06-streamhouse-trio/</guid><description><![CDATA[<p>The world of data is converging. The traditional divide between batch processing for historical analytics and stream processing for real-time insights is becoming increasingly blurry. Businesses demand architectures that handle both seamlessly. Enter the &ldquo;Streamhouse&rdquo; - an evolution of the Lakehouse concept, designed with streaming as a first-class citizen.</p>
<p>Today, we&rsquo;ll introduce three key open-source technologies shaping this space: <a href="https://paimon.apache.org/" target="_blank" rel="noopener noreferrer"><strong>Apache Paimon™</strong><i class="fas fa-external-link-square-alt ms-1"></i></a>, <a href="https://alibaba.github.io/fluss-docs/" target="_blank" rel="noopener noreferrer"><strong>Fluss</strong><i class="fas fa-external-link-square-alt ms-1"></i></a>, and <a href="https://iceberg.apache.org/" target="_blank" rel="noopener noreferrer"><strong>Apache Iceberg</strong><i class="fas fa-external-link-square-alt ms-1"></i></a>. While each has unique strengths, their true power lies in how they can be integrated to build robust, flexible, and performant data platforms.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-05-06-streamhouse-trio/featured.png" length="288793" type="image/png"/></item><item><title>Run Flink SQL Cookbook in Docker</title><link>https://jaehyeon.me/blog/2025-04-15-sql-cookbook/</link><pubDate>Tue, 15 Apr 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-04-15-sql-cookbook/</guid><description><![CDATA[<p>The <a href="https://github.com/ververica/flink-sql-cookbook" target="_blank" rel="noopener noreferrer">Flink SQL Cookbook<i class="fas fa-external-link-square-alt ms-1"></i></a> by Ververica is a hands-on, example-rich guide to mastering <a href="https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/overview/" target="_blank" rel="noopener noreferrer">Apache Flink SQL<i class="fas fa-external-link-square-alt ms-1"></i></a> for real-time stream processing. It offers a wide range of self-contained recipes, from basic queries and table operations to more advanced use cases like windowed aggregations, complex joins, user-defined functions (UDFs), and pattern detection. These examples are designed to be run on the Ververica Platform, and as such, the cookbook doesn&rsquo;t include instructions for setting up a Flink cluster.</p>
<p>To help you run these recipes locally and explore Flink SQL without external dependencies, this post walks through setting up a fully functional local Flink cluster using Docker Compose. With this setup, you can experiment with the cookbook examples right on your machine.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-04-15-sql-cookbook/featured.gif" length="319243" type="image/gif"/></item><item><title>Apache Beam Python Examples - Part 10 Develop Streaming File Reader using Splittable DoFn</title><link>https://jaehyeon.me/blog/2024-12-19-beam-examples-10/</link><pubDate>Thu, 19 Dec 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-12-19-beam-examples-10/</guid><description><![CDATA[<p>In <a href="/blog/2024-12-05-beam-examples-9">Part 9</a>, we developed two Apache Beam pipelines using <a href="https://beam.apache.org/documentation/programming-guide/#splittable-dofns" target="_blank" rel="noopener noreferrer"><em>Splittable DoFn (SDF)</em><i class="fas fa-external-link-square-alt ms-1"></i></a>. One of them is a batch file reader, which reads a list of files in an input folder followed by processing them in parallel. We can extend the I/O connector so that, instead of listing files once at the beginning, it scans an input folder periodically for new files and processes them whenever new files are created in the folder. The techniques used in this post can be quite useful as they can be applied to developing I/O connectors that target other unbounded (or streaming) data sources (e.g. Kafka) using the Python SDK.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-12-19-beam-examples-10/featured.png" length="305211" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 9 Develop Batch File Reader and PiSampler using Splittable DoFn</title><link>https://jaehyeon.me/blog/2024-12-05-beam-examples-9/</link><pubDate>Thu, 05 Dec 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-12-05-beam-examples-9/</guid><description><![CDATA[<p>A <a href="https://beam.apache.org/documentation/programming-guide/#splittable-dofns" target="_blank" rel="noopener noreferrer"><em>Splittable DoFn (SDF)</em><i class="fas fa-external-link-square-alt ms-1"></i></a> is a generalization of a <em>DoFn</em> that enables Apache Beam developers to create modular and composable I/O components. Also, it can be applied in advanced non-I/O scenarios such as Monte Carlo simulation. In this post, we develop two Apache Beam pipelines. The first pipeline is an I/O connector, and it reads a list of files in a folder followed by processing each of the file objects in parallel. The second pipeline estimates the value of $\pi$ by performing Monte Carlo simulation.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-12-05-beam-examples-9/featured.png" length="309371" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 8 Enhance Sport Activity Tracker with Runner Motivation</title><link>https://jaehyeon.me/blog/2024-11-21-beam-examples-8/</link><pubDate>Thu, 21 Nov 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-11-21-beam-examples-8/</guid><description>&lt;p>In &lt;a href="/blog/2024-08-01-beam-examples-3">Part 3&lt;/a>, we developed a Beam pipeline that tracks sport activities of users and outputs their speeds periodically. While reporting such values is useful for users on its own, we can provide more engaging information to users if we have a pipeline that reports pacing of their activities over periods.
For example, we can send a message to encourage a user to work harder if he/she has a performance goal and is underperforming for some periods. In this post, we develop a new pipeline that tracks user activities and reports pacing details by comparing short-term metrics to their long-term counterparts.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2024-11-21-beam-examples-8/featured.png" length="402888" type="image/png"/></item><item><title>Change Data Capture (CDC) Local Development with PostgreSQL, Debezium Server and Pub/Sub Emulator</title><link>https://jaehyeon.me/blog/2024-11-07-cdc-local-dev/</link><pubDate>Thu, 07 Nov 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-11-07-cdc-local-dev/</guid><description><![CDATA[<p><em>Change data capture</em> (CDC) is a data integration pattern to track changes in a database so that actions can be taken using the changed data. <a href="https://debezium.io/" target="_blank" rel="noopener noreferrer"><em>Debezium</em><i class="fas fa-external-link-square-alt ms-1"></i></a> is probably the most popular open source platform for CDC. Originally providing Kafka source connectors, it also supports a ready-to-use application called <a href="https://debezium.io/documentation/reference/stable/operations/debezium-server.html" target="_blank" rel="noopener noreferrer">Debezium server<i class="fas fa-external-link-square-alt ms-1"></i></a>. The standalone application can be used to stream change events to other messaging infrastructure such as Google Cloud Pub/Sub, Amazon Kinesis and Apache Pulsar. In this post, we develop a CDC solution locally using Docker. The source of the <a href="https://console.cloud.google.com/marketplace/product/bigquery-public-data/thelook-ecommerce" target="_blank" rel="noopener noreferrer">theLook eCommerce<i class="fas fa-external-link-square-alt ms-1"></i></a> dataset is modified to generate data continuously, and the data is inserted into multiple tables of a PostgreSQL database. Among those tables, two of them are tracked by the Debezium server, and it pushes row-level changes of those tables into Pub/Sub topics on the <a href="https://cloud.google.com/pubsub/docs/emulator" target="_blank" rel="noopener noreferrer">Pub/Sub emulator<i class="fas fa-external-link-square-alt ms-1"></i></a>. Finally, messages of the topics are read by a Python application.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-11-07-cdc-local-dev/featured.png" length="83605" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 7 Separate Droppable Data into Side Output</title><link>https://jaehyeon.me/blog/2024-10-24-beam-examples-7/</link><pubDate>Thu, 24 Oct 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-10-24-beam-examples-7/</guid><description><![CDATA[<p>We develop an Apache Beam pipeline that separates <em>droppable</em> elements from the rest of the data. <em>Droppable</em> elements are those that arrive after the watermark passes the window max timestamp plus allowed lateness. Using a timer in a <em>Stateful</em> DoFn, <em>droppable</em> data is separated from normal data and dispatched into a side output rather than being discarded silently, which is the default behaviour.
Note that this pipeline works in a situation where <em>droppable</em> elements do not appear often, and thus the chance that a <em>droppable</em> element is delivered as the first element in a particular window is low.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-10-24-beam-examples-7/featured.png" length="214574" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 6 Call RPC Service in Batch with Defined Batch Size using Stateful DoFn</title><link>https://jaehyeon.me/blog/2024-10-02-beam-examples-6/</link><pubDate>Wed, 02 Oct 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-10-02-beam-examples-6/</guid><description><![CDATA[<p>In the <a href="/blog/2024-09-25-beam-examples-5">previous post</a>, we continued discussing an Apache Beam pipeline that augments input data by calling a <strong>Remote Procedure Call (RPC)</strong> service. A pipeline was developed that makes a single RPC call for a bundle of elements. The bundle size is determined by the runner; however, we may encounter an issue, e.g. an RPC service may become considerably slower when many elements are included in a single request. We can improve the pipeline using a stateful <code>DoFn</code> where the number of elements to process and the maximum wait seconds can be controlled by <em>state</em> and <em>timers</em>. Note that, although the stateful <code>DoFn</code> used in this post solves the data augmentation task well, in practice, we should use the built-in transforms such as <a href="https://beam.apache.org/documentation/transforms/python/aggregation/batchelements/" target="_blank" rel="noopener noreferrer">BatchElements<i class="fas fa-external-link-square-alt ms-1"></i></a> and <a href="https://beam.apache.org/documentation/transforms/python/aggregation/groupintobatches/" target="_blank" rel="noopener noreferrer">GroupIntoBatches<i class="fas fa-external-link-square-alt ms-1"></i></a> whenever possible.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-10-02-beam-examples-6/featured.png" length="99452" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 5 Call RPC Service in Batch using Stateless DoFn</title><link>https://jaehyeon.me/blog/2024-09-18-beam-examples-5/</link><pubDate>Wed, 18 Sep 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-09-18-beam-examples-5/</guid><description><![CDATA[<p>In the <a href="/blog/2024-08-15-beam-examples-4">previous post</a>, we developed an Apache Beam pipeline where the input data is augmented by a <strong>Remote Procedure Call (RPC)</strong> service. Each input element performs an RPC call and the output is enriched by the response. This is not an efficient way of accessing an external service provided that the service can accept more than one element.
In this post, we discuss how to enhance the pipeline so that a single RPC call is made for a bundle of elements, which can save a significant amount of time compared to making a call for each element.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-09-18-beam-examples-5/featured.png" length="95285" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 4 Call RPC Service for Data Augmentation</title><link>https://jaehyeon.me/blog/2024-08-15-beam-examples-4/</link><pubDate>Thu, 15 Aug 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-08-15-beam-examples-4/</guid><description>&lt;p>In this post, we develop an Apache Beam pipeline where the input data is augmented by a &lt;strong>Remote Procedure Call (RPC)&lt;/strong> service. Each input element performs an RPC call and the output is enriched by the response. This is not an efficient way of accessing an external service provided that the service can accept more than one element. In the subsequent two posts, we will discuss updated pipelines that make RPC calls more efficiently. We begin with illustrating how to manage development resources, followed by demonstrating the RPC service that we use in this series. Finally, we develop a Beam pipeline that accesses the external service to augment the input elements.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2024-08-15-beam-examples-4/featured.png" length="93408" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 3 Build Sport Activity Tracker with/without SQL</title><link>https://jaehyeon.me/blog/2024-08-01-beam-examples-3/</link><pubDate>Thu, 01 Aug 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-08-01-beam-examples-3/</guid><description><![CDATA[<p>In this post, we develop two Apache Beam pipelines that track sport activities of users and output their speed periodically. The first pipeline uses native transforms and <a href="https://beam.apache.org/documentation/dsls/sql/overview/" target="_blank" rel="noopener noreferrer">Beam SQL<i class="fas fa-external-link-square-alt ms-1"></i></a> is used for the second. While <em>Beam SQL</em> can be useful in some situations, its features in the Python SDK are not complete compared to the Java SDK. Therefore, we are not able to build the required tracking pipeline using it. We end up discussing potential improvements of <em>Beam SQL</em> so that it can be used for building competitive applications with the Python SDK.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-08-01-beam-examples-3/featured.png" length="94507" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 2 Calculate Average Word Length with/without Fixed Look back</title><link>https://jaehyeon.me/blog/2024-07-18-beam-examples-2/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-07-18-beam-examples-2/</guid><description>&lt;p>In this post, we develop two Apache Beam pipelines that calculate average word lengths from input texts that are ingested via a Kafka topic. They obtain the statistics from different angles.
The first pipeline emits the global average lengths whenever a new input text arrives while the latter triggers those values in a sliding time window.&lt;/p></description><enclosure url="https://jaehyeon.me/blog/2024-07-18-beam-examples-2/featured.png" length="96924" type="image/png"/></item><item><title>Apache Beam Python Examples - Part 1 Calculate K Most Frequent Words and Max Word Length</title><link>https://jaehyeon.me/blog/2024-07-04-beam-examples-1/</link><pubDate>Thu, 04 Jul 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-07-04-beam-examples-1/</guid><description><![CDATA[<p>In this series, we develop <a href="https://beam.apache.org/" target="_blank" rel="noopener noreferrer">Apache Beam<i class="fas fa-external-link-square-alt ms-1"></i></a> Python pipelines. The majority of them are from <a href="https://www.packtpub.com/en-us/product/building-big-data-pipelines-with-apache-beam-9781800564930" target="_blank" rel="noopener noreferrer">Building Big Data Pipelines with Apache Beam by Jan Lukavský<i class="fas fa-external-link-square-alt ms-1"></i></a>. Mainly relying on the Java SDK, the book teaches fundamentals of Apache Beam using hands-on tasks, and we convert those tasks using the Python SDK. We focus on streaming pipelines, and they are deployed on a local (or embedded) <a href="https://flink.apache.org/" target="_blank" rel="noopener noreferrer">Apache Flink<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster using the <a href="https://beam.apache.org/documentation/runners/flink/" target="_blank" rel="noopener noreferrer">Apache Flink Runner<i class="fas fa-external-link-square-alt ms-1"></i></a>. Beginning with setting up the development environment, we build two pipelines that obtain top K most frequent words and the word that has the longest word length in this post.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-07-04-beam-examples-1/featured.png" length="96881" type="image/png"/></item><item><title>Deploy Python Stream Processing App on Kubernetes - Part 2 Beam Pipeline on Flink Runner</title><link>https://jaehyeon.me/blog/2024-06-06-beam-deploy-2/</link><pubDate>Thu, 06 Jun 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-06-06-beam-deploy-2/</guid><description><![CDATA[<p>In this post, we develop an <a href="https://beam.apache.org/" target="_blank" rel="noopener noreferrer">Apache Beam<i class="fas fa-external-link-square-alt ms-1"></i></a> pipeline using the <a href="https://beam.apache.org/documentation/sdks/python/" target="_blank" rel="noopener noreferrer">Python SDK<i class="fas fa-external-link-square-alt ms-1"></i></a> and deploy it on an <a href="https://flink.apache.org/" target="_blank" rel="noopener noreferrer">Apache Flink<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster via the <a href="https://beam.apache.org/documentation/runners/flink/" target="_blank" rel="noopener noreferrer">Apache Flink Runner<i class="fas fa-external-link-square-alt ms-1"></i></a>. 
Same as <a href="/blog/2024-05-30-beam-deploy-1">Part I</a>, we deploy a Kafka cluster using the <a href="https://strimzi.io/" target="_blank" rel="noopener noreferrer">Strimzi Operator<i class="fas fa-external-link-square-alt ms-1"></i></a> on a <a href="https://minikube.sigs.k8s.io/docs/" target="_blank" rel="noopener noreferrer">minikube<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster as the pipeline uses <a href="https://kafka.apache.org/" target="_blank" rel="noopener noreferrer">Apache Kafka<i class="fas fa-external-link-square-alt ms-1"></i></a> topics for its data source and sink. Then, we develop the pipeline as a Python package and add the package to a custom Docker image so that Python user code can be executed externally. For deployment, we create a Flink session cluster via the <a href="https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/" target="_blank" rel="noopener noreferrer">Flink Kubernetes Operator<i class="fas fa-external-link-square-alt ms-1"></i></a>, and deploy the pipeline using a Kubernetes job. Finally, we check the output of the application by sending messages to the input Kafka topic using a Python producer application.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-06-06-beam-deploy-2/featured.png" length="58020" type="image/png"/></item><item><title>Deploy Python Stream Processing App on Kubernetes - Part 1 PyFlink Application</title><link>https://jaehyeon.me/blog/2024-05-30-beam-deploy-1/</link><pubDate>Thu, 30 May 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-05-30-beam-deploy-1/</guid><description><![CDATA[<p><a href="https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/" target="_blank" rel="noopener noreferrer">Flink Kubernetes Operator<i class="fas fa-external-link-square-alt ms-1"></i></a> acts as a control plane to manage the complete deployment lifecycle of Apache Flink applications. With the operator, we can simplify deployment and management of Python stream processing applications. In this series, we discuss how to deploy a PyFlink application and Python Apache Beam pipeline on the <a href="https://beam.apache.org/documentation/runners/flink/" target="_blank" rel="noopener noreferrer">Flink Runner<i class="fas fa-external-link-square-alt ms-1"></i></a> on Kubernetes. In Part 1, we first deploy a Kafka cluster on a <a href="https://minikube.sigs.k8s.io/docs/" target="_blank" rel="noopener noreferrer">minikube<i class="fas fa-external-link-square-alt ms-1"></i></a> cluster as the source and sink of the PyFlink application are Kafka topics. Then, the application source is packaged in a custom Docker image and deployed on the minikube cluster using the Flink Kubernetes Operator. Finally, the output of the application is checked by sending messages to the input Kafka topic using a Python producer application.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-05-30-beam-deploy-1/featured.png" length="64457" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 5 Testing Pipelines</title><link>https://jaehyeon.me/blog/2024-05-09-beam-local-dev-5/</link><pubDate>Thu, 09 May 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-05-09-beam-local-dev-5/</guid><description>We developed batch and streaming pipelines in Part 2 and Part 4. Often it is faster and simpler to identify and fix bugs on the pipeline code by performing local unit testing. 
Moreover, especially when it comes to creating a streaming pipeline, unit test cases can facilitate development further by using TestStream, as it allows us to advance watermarks or processing time according to different scenarios. In this post, we discuss how to perform unit testing of the batch and streaming pipelines that we developed earlier.</description><enclosure url="https://jaehyeon.me/blog/2024-05-09-beam-local-dev-5/featured.png" length="53603" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 4 Streaming Pipelines</title><link>https://jaehyeon.me/blog/2024-05-02-beam-local-dev-4/</link><pubDate>Thu, 02 May 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-05-02-beam-local-dev-4/</guid><description>In Part 3, we discussed the portability layer of Apache Beam as it helps understand (1) how Python pipelines run on the Flink Runner and (2) how multiple SDKs can be used in a single pipeline, followed by demonstrating local Flink and Kafka cluster creation for developing streaming pipelines. In this post, we build a streaming pipeline that aggregates page visits by user in a fixed time window of 20 seconds.</description><enclosure url="https://jaehyeon.me/blog/2024-05-02-beam-local-dev-4/featured.png" length="54556" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 3 Flink Runner</title><link>https://jaehyeon.me/blog/2024-04-18-beam-local-dev-3/</link><pubDate>Thu, 18 Apr 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-04-18-beam-local-dev-3/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In this series, we discuss local development of Apache Beam pipelines using Python. In the previous posts, we mainly talked about batch pipelines with/without Beam SQL. Beam pipelines are portable between batch and streaming semantics, and we will discuss streaming pipeline development in this and the next post.</description><enclosure url="https://jaehyeon.me/blog/2024-04-18-beam-local-dev-3/featured.png" length="262307" type="image/png"/></item><item><title>Kafka Development on Kubernetes - Part 3 Kafka Connect</title><link>https://jaehyeon.me/blog/2024-01-11-kafka-development-on-k8s-part-3/</link><pubDate>Thu, 11 Jan 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-01-11-kafka-development-on-k8s-part-3/</guid><description>Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. In this post, we discuss how to set up a data ingestion pipeline using Kafka connectors. Fake customer and order data is ingested into Kafka topics using the MSK Data Generator. Also, we use the Confluent S3 sink connector to save the messages of the topics into an S3 bucket.</description><enclosure url="https://jaehyeon.me/blog/2024-01-11-kafka-development-on-k8s-part-3/featured.png" length="97270" type="image/png"/></item><item><title>Kafka Development on Kubernetes - Part 2 Producer and Consumer</title><link>https://jaehyeon.me/blog/2024-01-04-kafka-development-on-k8s-part-2/</link><pubDate>Thu, 04 Jan 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-01-04-kafka-development-on-k8s-part-2/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Kafka has five core APIs, and we can develop applications to send/read streams of data to/from topics in a Kafka cluster using the producer and consumer APIs. While the main Kafka project maintains only the Java APIs, there are several open source projects that provide the Kafka client APIs in Python.</description><enclosure url="https://jaehyeon.me/blog/2024-01-04-kafka-development-on-k8s-part-2/featured.png" length="75889" type="image/png"/></item><item><title>Kafka Development on Kubernetes - Part 1 Cluster Setup</title><link>https://jaehyeon.me/blog/2023-12-21-kafka-development-on-k8s-part-1/</link><pubDate>Thu, 21 Dec 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-12-21-kafka-development-on-k8s-part-1/</guid><description>Apache Kafka is one of the key technologies for implementing data streaming architectures. Strimzi provides a way to run an Apache Kafka cluster and related resources on Kubernetes in various deployment configurations. In this series of posts, we will discuss how to create a Kafka cluster, to develop Kafka client applications in Python and to build a data pipeline using Kafka connectors on Kubernetes.
Part 1 Cluster Setup (this post)
Part 2 Producer and Consumer
Part 3 Kafka Connect
Setup Kafka Cluster
The Kafka cluster is deployed using the Strimzi Operator on a Minikube cluster.</description><enclosure url="https://jaehyeon.me/blog/2023-12-21-kafka-development-on-k8s-part-1/featured.png" length="108975" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 6 Consume data from Kafka using Lambda</title><link>https://jaehyeon.me/blog/2023-12-14-real-time-streaming-with-kafka-and-flink-7/</link><pubDate>Thu, 14 Dec 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-12-14-real-time-streaming-with-kafka-and-flink-7/</guid><description>Amazon MSK can be configured as an event source of a Lambda function. Lambda internally polls for new messages from the event source and then synchronously invokes the target Lambda function. With this feature, we can develop a Kafka consumer application in a serverless environment where developers can focus on application logic. In this lab, we will discuss how to create a Kafka consumer using a Lambda function.
Introduction
Lab 1 Produce data to Kafka using Lambda
Lab 2 Write data to Kafka from S3 using Flink
Lab 3 Transform and write data to S3 from Kafka using Flink
Lab 4 Clean, Aggregate, and Enrich Events with Flink
Lab 5 Write data to DynamoDB using Kafka Connect
Lab 6 Consume data from Kafka using Lambda (this post)
Architecture
Fake taxi ride data is sent to a Kafka topic by the Kafka producer application that is discussed in Lab 1.</description><enclosure url="https://jaehyeon.me/blog/2023-12-14-real-time-streaming-with-kafka-and-flink-7/featured.png" length="138986" type="image/png"/></item><item><title>Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images</title><link>https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/</link><pubDate>Thu, 07 Dec 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Flink became generally available for Amazon EMR on EKS from the EMR 6.15.0 release, and we are able to pull the Flink (as well as Spark) container images from the ECR Public Gallery.</description><enclosure url="https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/featured.png" length="133053" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 5 Write data to DynamoDB using Kafka Connect</title><link>https://jaehyeon.me/blog/2023-11-30-real-time-streaming-with-kafka-and-flink-6/</link><pubDate>Thu, 30 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-30-real-time-streaming-with-kafka-and-flink-6/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka.</description><enclosure url="https://jaehyeon.me/blog/2023-11-30-real-time-streaming-with-kafka-and-flink-6/featured.png" length="113252" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 4 Clean, Aggregate, and Enrich Events with Flink</title><link>https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/</link><pubDate>Thu, 23 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/</guid><description>The value of data can be maximised when it is used without delay. With Apache Flink, we can build streaming analytics applications that incorporate the latest events with low latency. In this lab, we will create a Pyflink application that writes accumulated taxi rides data into an OpenSearch cluster. It aggregates the number of trips/passengers and trip durations by vendor ID for a window of 5 seconds. The data is then used to create a chart that monitors the status of taxi rides in the OpenSearch Dashboard.</description><enclosure url="https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/featured.png" length="112340" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 3 Transform and write data to S3 from Kafka using Flink</title><link>https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/</link><pubDate>Thu, 16 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In this lab, we will create a Pyflink application that exports Kafka topic messages into an S3 bucket. The app enriches the records by adding a new column using a user-defined function and writes them via the FileSystem SQL connector.</description><enclosure url="https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/featured.png" length="160359" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 2 Write data to Kafka from S3 using Flink</title><link>https://jaehyeon.me/blog/2023-11-09-real-time-streaming-with-kafka-and-flink-3/</link><pubDate>Thu, 09 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-09-real-time-streaming-with-kafka-and-flink-3/</guid><description>In this lab, we will create a Pyflink application that reads records from S3 and sends them into a Kafka topic. A custom pipeline Jar file will be created as the Kafka cluster is authenticated by IAM, and it will be demonstrated how to execute the app in a Flink cluster deployed on Docker as well as locally as a typical Python app. We can assume the S3 data is static metadata that needs to be joined into another stream, and this exercise can be useful for data enrichment.</description><enclosure url="https://jaehyeon.me/blog/2023-11-09-real-time-streaming-with-kafka-and-flink-3/featured.png" length="139114" type="image/png"/></item><item><title>Benefits and Opportunities of Stateful Stream Processing</title><link>https://jaehyeon.me/blog/2023-11-02-stateful-stream-processing/</link><pubDate>Thu, 02 Nov 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-11-02-stateful-stream-processing/</guid><description>Stream processing technology is becoming more and more popular with companies big and small because it provides superior solutions for many established use cases such as data analytics, ETL, and transactional applications, but also facilitates novel applications, software architectures, and business opportunities. Beginning with traditional data infrastructures and application/data development patterns, this post introduces stateful stream processing and demonstrates to what extent it can improve the traditional development patterns. A consulting company can partner with its clients on their journeys of adopting stateful stream processing, and it can bring huge opportunities.</description><enclosure url="https://jaehyeon.me/blog/2023-11-02-stateful-stream-processing/featured.png" length="244920" type="image/png"/></item><item><title>Kafka Connect for AWS Services Integration - Part 5 Deploy Aiven OpenSearch Sink Connector</title><link>https://jaehyeon.me/blog/2023-10-30-kafka-connect-for-aws-part-5/</link><pubDate>Mon, 30 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-30-kafka-connect-for-aws-part-5/</guid><description>In the previous post, we discussed how to develop a data pipeline from Apache Kafka into OpenSearch locally using Docker. The pipeline will be deployed on AWS using Amazon MSK, Amazon MSK Connect and Amazon OpenSearch Service using Terraform in this post. First, the infrastructure will be deployed that covers a Virtual Private Cloud (VPC), Virtual Private Network (VPN) server, MSK Cluster and OpenSearch domain.
Then, Kafka source and sink connectors will be deployed on MSK Connect, followed by a quick data analysis.</description><enclosure url="https://jaehyeon.me/blog/2023-10-30-kafka-connect-for-aws-part-5/featured.png" length="85575" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Lab 1 Produce data to Kafka using Lambda</title><link>https://jaehyeon.me/blog/2023-10-26-real-time-streaming-with-kafka-and-flink-2/</link><pubDate>Thu, 26 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-26-real-time-streaming-with-kafka-and-flink-2/</guid><description>In this lab, we will create a Kafka producer application using AWS Lambda, which sends fake taxi ride data into a Kafka topic on Amazon MSK. A configurable number of producer Lambda functions will be invoked by an Amazon EventBridge schedule rule. In this way, we are able to generate test data concurrently based on the desired volume of messages.
Introduction
Lab 1 Produce data to Kafka using Lambda (this post)
Lab 2 Write data to Kafka from S3 using Flink
Lab 3 Transform and write data to S3 from Kafka using Flink
Lab 4 Clean, Aggregate, and Enrich Events with Flink
Lab 5 Write data to DynamoDB using Kafka Connect
Lab 6 Consume data from Kafka using Lambda
[Update 2023-11-06] Initially I planned to deploy Pyflink applications on Amazon Managed Service for Apache Flink, but I changed the plan to use a local Flink cluster deployed on Docker.</description><enclosure url="https://jaehyeon.me/blog/2023-10-26-real-time-streaming-with-kafka-and-flink-2/featured.png" length="138560" type="image/png"/></item><item><title>Kafka Connect for AWS Services Integration - Part 4 Develop Aiven OpenSearch Sink Connector</title><link>https://jaehyeon.me/blog/2023-10-23-kafka-connect-for-aws-part-4/</link><pubDate>Mon, 23 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-23-kafka-connect-for-aws-part-4/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 OpenSearch is a popular search and analytics engine and its use cases cover log analytics, real-time application monitoring, and clickstream analysis. OpenSearch can be deployed on its own or via Amazon OpenSearch Service.</description><enclosure url="https://jaehyeon.me/blog/2023-10-23-kafka-connect-for-aws-part-4/featured.png" length="61820" type="image/png"/></item><item><title>Building Apache Flink Applications in Python</title><link>https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/</link><pubDate>Thu, 19 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/</guid><description>Building Apache Flink Applications in Java is a course to introduce Apache Flink through a series of hands-on exercises, and it is provided by Confluent. Utilising the Flink DataStream API, the course develops three Flink applications that populate multiple source data sets, collect them into a standardised data set, and aggregate it to produce usage statistics. As part of learning the Flink DataStream API in Pyflink, I converted the Java apps into Python equivalent while performing the course exercises in Pyflink.</description><enclosure url="https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/featured.png" length="154736" type="image/png"/></item><item><title>Real Time Streaming with Kafka and Flink - Introduction</title><link>https://jaehyeon.me/blog/2023-10-05-real-time-streaming-with-kafka-and-flink-1/</link><pubDate>Thu, 05 Oct 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-10-05-real-time-streaming-with-kafka-and-flink-1/</guid><description>Real Time Streaming with Amazon Kinesis is an AWS workshop that helps users build a streaming analytics application on AWS. Incoming events are stored in a number of streams of the Amazon Kinesis Data Streams service, and various other AWS services and tools are used to process and analyse data.
Apache Kafka is a popular distributed event store and stream processing platform, and it stores incoming events in topics. As part of learning real time streaming analytics on AWS, we can rebuild the analytics applications by replacing the Kinesis streams with Kafka topics.</description><enclosure url="https://jaehyeon.me/blog/2023-10-05-real-time-streaming-with-kafka-and-flink-1/featured.png" length="138141" type="image/png"/></item><item><title>Kafka, Flink and DynamoDB for Real Time Fraud Detection - Part 2 Deployment via AWS Managed Flink</title><link>https://jaehyeon.me/blog/2023-09-14-fraud-detection-part-2/</link><pubDate>Thu, 14 Sep 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-09-14-fraud-detection-part-2/</guid><description>This series aims to help those who are new to Apache Flink and Amazon Managed Service for Apache Flink by re-implementing a simple fraud detection application that is discussed in an AWS workshop titled AWS Kafka and DynamoDB for real time fraud detection. In part 1, I demonstrated how to develop the application locally, and the app will be deployed via Amazon Managed Service for Apache Flink in this post.</description><enclosure url="https://jaehyeon.me/blog/2023-09-14-fraud-detection-part-2/featured.png" length="66221" type="image/png"/></item><item><title>Getting Started with Pyflink on AWS - Part 3 AWS Managed Flink and MSK</title><link>https://jaehyeon.me/blog/2023-09-04-getting-started-with-pyflink-on-aws-part-3/</link><pubDate>Mon, 04 Sep 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-09-04-getting-started-with-pyflink-on-aws-part-3/</guid><description>In this series of posts, we discuss a Flink (Pyflink) application that reads/writes from/to Kafka topics. In the previous posts, I demonstrated a Pyflink app that targets a local Kafka cluster as well as a Kafka cluster on Amazon MSK. The app was executed in a virtual environment as well as in a local Flink cluster for improved monitoring. In this post, the app will be deployed via Amazon Managed Service for Apache Flink, which is the easiest option to run Flink applications on AWS.</description><enclosure url="https://jaehyeon.me/blog/2023-09-04-getting-started-with-pyflink-on-aws-part-3/featured.png" length="74618" type="image/png"/></item><item><title>Getting Started with Pyflink on AWS - Part 2 Local Flink and MSK</title><link>https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/</link><pubDate>Mon, 28 Aug 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/</guid><description>In this series of posts, we discuss a Flink (Pyflink) application that reads/writes from/to Kafka topics. In part 1, an app that targets a local Kafka cluster was created. In this post, we will update the app by connecting to a Kafka cluster on Amazon MSK. The Kafka cluster is authenticated by IAM, and the app has an additional jar dependency.
As Amazon Managed Service for Apache Flink does not allow you to specify multiple pipeline jar files, we have to build a custom Uber Jar that combines multiple jar files.</description><enclosure url="https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/featured.png" length="64005" type="image/png"/></item><item><title>Getting Started with Pyflink on AWS - Part 1 Local Flink and Local Kafka</title><link>https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/</link><pubDate>Thu, 17 Aug 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Flink is an open-source, unified stream-processing and batch-processing framework. Its core is a distributed streaming data-flow engine that you can use to run real-time stream processing on high-throughput data sources. Currently, it is widely used to build applications for fraud/anomaly detection, rule-based alerting, business process monitoring, and continuous ETL to name a few.</description><enclosure url="https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/featured.png" length="55960" type="image/png"/></item><item><title>Kafka, Flink and DynamoDB for Real Time Fraud Detection - Part 1 Local Development</title><link>https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/</link><pubDate>Thu, 10 Aug 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Flink is an open-source, unified stream-processing and batch-processing framework. Its core is a distributed streaming data-flow engine that you can use to run real-time stream processing on high-throughput data sources. Currently, it is widely used to build applications for fraud/anomaly detection, rule-based alerting, business process monitoring, and continuous ETL to name a few.</description><enclosure url="https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/featured.png" length="72929" type="image/png"/></item><item><title>Kafka Development with Docker - Part 11 Kafka Authorization</title><link>https://jaehyeon.me/blog/2023-07-20-kafka-development-with-docker-part-11/</link><pubDate>Thu, 20 Jul 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-07-20-kafka-development-with-docker-part-11/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In the previous posts, we discussed how to implement client authentication by TLS (SSL or TLS/SSL) and SASL authentication. One of the key benefits of client authentication is achieving user access control. Kafka ships with a pluggable, out-of-the-box authorization framework, which is configured with the authorizer.</description><enclosure url="https://jaehyeon.me/blog/2023-07-20-kafka-development-with-docker-part-11/featured.png" length="458848" type="image/png"/></item><item><title>Kafka Development with Docker - Part 10 SASL Authentication</title><link>https://jaehyeon.me/blog/2023-07-13-kafka-development-with-docker-part-10/</link><pubDate>Thu, 13 Jul 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-07-13-kafka-development-with-docker-part-10/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 In the previous post, we discussed TLS (SSL or TLS/SSL) authentication to improve security. It enforces two-way verification where a client certificate is verified by Kafka brokers. Client authentication can also be enabled by Simple Authentication and Security Layer (SASL), and we will discuss how to implement SASL authentication with Java and Python client examples in this post.</description><enclosure url="https://jaehyeon.me/blog/2023-07-13-kafka-development-with-docker-part-10/featured.png" length="471947" type="image/png"/></item><item><title>Kafka Development with Docker - Part 9 SSL Authentication</title><link>https://jaehyeon.me/blog/2023-07-06-kafka-development-with-docker-part-9/</link><pubDate>Thu, 06 Jul 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-07-06-kafka-development-with-docker-part-9/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
In the previous post, we discussed how to configure TLS (SSL or TLS/SSL) encryption with Java and Python client examples. SSL encryption is a one-way verification process where a server certificate is verified by a client via an SSL handshake.</description><enclosure url="https://jaehyeon.me/blog/2023-07-06-kafka-development-with-docker-part-9/featured.png" length="471471" type="image/png"/></item><item><title>Kafka Connect for AWS Services Integration - Part 3 Deploy Camel DynamoDB Sink Connector</title><link>https://jaehyeon.me/blog/2023-07-03-kafka-connect-for-aws-part-3/</link><pubDate>Mon, 03 Jul 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-07-03-kafka-connect-for-aws-part-3/</guid><description>As part of investigating how to utilize Kafka Connect effectively for AWS services integration, I demonstrated how to develop the Camel DynamoDB sink connector using Docker in Part 2. Fake order data was generated using the MSK Data Generator source connector, and the sink connector was configured to consume the topic messages to ingest them into a DynamoDB table. In this post, I will illustrate how to deploy the data ingestion applications using Amazon MSK and MSK Connect.</description><enclosure url="https://jaehyeon.me/blog/2023-07-03-kafka-connect-for-aws-part-3/featured.png" length="76240" type="image/png"/></item><item><title>Kafka Development with Docker - Part 8 SSL Encryption</title><link>https://jaehyeon.me/blog/2023-06-29-kafka-development-with-docker-part-8/</link><pubDate>Thu, 29 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-29-kafka-development-with-docker-part-8/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
By default, Apache Kafka communicates in PLAINTEXT, which means that all data is sent without being encrypted. To secure communication, we can configure Kafka clients and other components to use Transport Layer Security (TLS) encryption.</description><enclosure url="https://jaehyeon.me/blog/2023-06-29-kafka-development-with-docker-part-8/featured.png" length="469311" type="image/png"/></item><item><title>Kafka Development with Docker - Part 7 Producer and Consumer with Glue Schema Registry</title><link>https://jaehyeon.me/blog/2023-06-22-kafka-development-with-docker-part-7/</link><pubDate>Thu, 22 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-22-kafka-development-with-docker-part-7/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
In Part 4, we developed Kafka producer and consumer applications using the kafka-python package. The Kafka messages are serialized as JSON, but they are not associated with a schema, as there was no integrated schema registry.</description><enclosure url="https://jaehyeon.me/blog/2023-06-22-kafka-development-with-docker-part-7/featured.png" length="57175" type="image/png"/></item><item><title>Kafka Development with Docker - Part 6 Kafka Connect with Glue Schema Registry</title><link>https://jaehyeon.me/blog/2023-06-15-kafka-development-with-docker-part-6/</link><pubDate>Thu, 15 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-15-kafka-development-with-docker-part-6/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
In Part 3, we developed a data ingestion pipeline with fake online order data using Kafka Connect source and sink connectors. Schemas were not enabled on either of them, as there was no integrated schema registry.</description><enclosure url="https://jaehyeon.me/blog/2023-06-15-kafka-development-with-docker-part-6/featured.png" length="60354" type="image/png"/></item><item><title>Kafka Development with Docker - Part 5 Glue Schema Registry</title><link>https://jaehyeon.me/blog/2023-06-08-kafka-development-with-docker-part-5/</link><pubDate>Thu, 08 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-08-kafka-development-with-docker-part-5/</guid><description>As described in the Confluent documentation, Schema Registry provides a centralized repository for managing and validating schemas for topic message data, and for serialization and deserialization of the data over the network. Producers and consumers to Kafka topics can use schemas to ensure data consistency and compatibility as schemas evolve. In AWS, the Glue Schema Registry supports features to manage and enforce schemas on data streaming applications using convenient integrations with Apache Kafka, Amazon Managed Streaming for Apache Kafka, Amazon Kinesis Data Streams, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda.</description><enclosure url="https://jaehyeon.me/blog/2023-06-08-kafka-development-with-docker-part-5/featured.png" length="51170" type="image/png"/></item><item><title>Kafka Connect for AWS Services Integration - Part 2 Develop Camel DynamoDB Sink Connector</title><link>https://jaehyeon.me/blog/2023-06-04-kafka-connect-for-aws-part-2/</link><pubDate>Sun, 04 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-04-kafka-connect-for-aws-part-2/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
In Part 1, we reviewed Kafka connectors focusing on AWS services integration. Among the available connectors, the suite of Apache Camel Kafka connectors and the Kinesis Kafka connector from AWS Labs can be effective for building data ingestion pipelines on AWS.</description><enclosure url="https://jaehyeon.me/blog/2023-06-04-kafka-connect-for-aws-part-2/featured.png" length="87044" type="image/png"/></item><item><title>Kafka Development with Docker - Part 4 Producer and Consumer</title><link>https://jaehyeon.me/blog/2023-06-01-kafka-development-with-docker-part-4/</link><pubDate>Thu, 01 Jun 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-06-01-kafka-development-with-docker-part-4/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
In the previous post, we discussed how to use Kafka Connect to stream data to/from a Kafka cluster. Kafka also includes the Producer/Consumer APIs that allow client applications to send/read streams of data to/from topics in a Kafka cluster.</description><enclosure url="https://jaehyeon.me/blog/2023-06-01-kafka-development-with-docker-part-4/featured.png" length="75255" type="image/png"/></item><item><title>Kafka Development with Docker - Part 3 Kafka Connect</title><link>https://jaehyeon.me/blog/2023-05-25-kafka-development-with-docker-part-3/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-05-25-kafka-development-with-docker-part-3/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
According to the documentation of Apache Kafka, Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka.</description><enclosure url="https://jaehyeon.me/blog/2023-05-25-kafka-development-with-docker-part-3/featured.png" length="69998" type="image/png"/></item><item><title>Kafka Development with Docker - Part 2 Management App</title><link>https://jaehyeon.me/blog/2023-05-18-kafka-development-with-docker-part-2/</link><pubDate>Thu, 18 May 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-05-18-kafka-development-with-docker-part-2/</guid><description>In the previous post, I illustrated how to create a topic and produce/consume messages using the command utilities provided by Apache Kafka. They are not convenient, however, when, for example, you consume serialised messages whose schemas are stored in a schema registry. Also, the utilities don&amp;rsquo;t support browsing or managing related resources such as connectors and schemas. Therefore, a Kafka management app can be a good companion for development, helping you monitor and manage resources on an easy-to-use user interface.</description><enclosure url="https://jaehyeon.me/blog/2023-05-18-kafka-development-with-docker-part-2/featured.png" length="59675" type="image/png"/></item><item><title>Kafka Development with Docker - Part 1 Cluster Setup</title><link>https://jaehyeon.me/blog/2023-05-04-kafka-development-with-docker-part-1/</link><pubDate>Thu, 04 May 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-05-04-kafka-development-with-docker-part-1/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
I&amp;rsquo;m teaching myself modern data streaming architectures on AWS, and Apache Kafka is one of the key technologies; it can be used for messaging, activity tracking, stream processing, and so on. While applications tend to be deployed to the cloud, it is much easier to develop and test them locally with Docker and Docker Compose.</description><enclosure url="https://jaehyeon.me/blog/2023-05-04-kafka-development-with-docker-part-1/featured.png" length="98355" type="image/png"/></item><item><title>Kafka Connect for AWS Services Integration - Part 1 Introduction</title><link>https://jaehyeon.me/blog/2023-05-03-kafka-connect-for-aws-part-1/</link><pubDate>Wed, 03 May 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-05-03-kafka-connect-for-aws-part-1/</guid><description>Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (MSK) are two managed streaming services offered by AWS. Many resources on the web indicate Kinesis Data Streams is better when it comes to integrating with AWS services. However, this is not necessarily the case with the help of Kafka Connect. According to the documentation of Apache Kafka, Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems.</description><enclosure url="https://jaehyeon.me/blog/2023-05-03-kafka-connect-for-aws-part-1/featured.png" length="22272" type="image/png"/></item><item><title>Integrate Glue Schema Registry with Your Python Kafka App</title><link>https://jaehyeon.me/blog/2023-04-12-integrate-glue-schema-registry/</link><pubDate>Wed, 12 Apr 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-04-12-integrate-glue-schema-registry/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
As Kafka producer and consumer apps are decoupled, they operate on Kafka topics rather than communicating with each other directly. As described in the Confluent documentation, Schema Registry provides a centralized repository for managing and validating schemas for topic message data, and for serialization and deserialization of the data over the network.</description><enclosure url="https://jaehyeon.me/blog/2023-04-12-integrate-glue-schema-registry/featured.png" length="46040" type="image/png"/></item><item><title>Simplify Streaming Ingestion on AWS – Part 2 MSK and Athena</title><link>https://jaehyeon.me/blog/2023-03-14-simplify-streaming-ingestion-athena/</link><pubDate>Tue, 14 Mar 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-03-14-simplify-streaming-ingestion-athena/</guid><description>In Part 1, we discussed a streaming ingestion solution using EventBridge, Lambda, MSK and Redshift Serverless. Athena provides the MSK connector to enable SQL queries on Apache Kafka topics directly, and it can also facilitate the extraction of insights without setting up an additional pipeline to store data into S3. In this post, we discuss how to update the streaming ingestion solution so that data in the Kafka topic can be queried by Athena instead of Redshift.</description><enclosure url="https://jaehyeon.me/blog/2023-03-14-simplify-streaming-ingestion-athena/featured.png" length="43403" type="image/png"/></item><item><title>Simplify Streaming Ingestion on AWS – Part 1 MSK and Redshift</title><link>https://jaehyeon.me/blog/2023-02-08-simplify-streaming-ingestion-redshift/</link><pubDate>Wed, 08 Feb 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-02-08-simplify-streaming-ingestion-redshift/</guid><description>Apache Kafka is a popular distributed event store and stream processing platform. Previously, loading data from Kafka into Redshift and Athena usually required Kafka connectors (e.g. Amazon Redshift Sink Connector and Amazon S3 Sink Connector). Recently, these AWS services have added features to ingest data from Kafka directly, which facilitates a simpler architecture that achieves low-latency and high-speed ingestion of streaming data. In Part 1 of the Simplify Streaming Ingestion on AWS series, we discuss how to develop an end-to-end streaming ingestion solution using EventBridge, Lambda, MSK and Redshift Serverless on AWS.</description><enclosure url="https://jaehyeon.me/blog/2023-02-08-simplify-streaming-ingestion-redshift/featured.png" length="32864" type="image/png"/></item><item><title>How to configure Kafka consumers to seek offsets by timestamp</title><link>https://jaehyeon.me/blog/2023-01-10-kafka-consumer-seek-offsets/</link><pubDate>Tue, 10 Jan 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-01-10-kafka-consumer-seek-offsets/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
Normally, we consume Kafka messages from the beginning/end of a topic or from the last committed offsets. For backfilling or troubleshooting, however, we occasionally need to consume messages from a certain timestamp. If we know which topic partition to choose e.</description><enclosure url="https://jaehyeon.me/blog/2023-01-10-kafka-consumer-seek-offsets/featured.png" length="47217" type="image/png"/></item><item><title>Use External Schema Registry with MSK Connect – Part 2 MSK Deployment</title><link>https://jaehyeon.me/blog/2022-04-03-schema-registry-part2/</link><pubDate>Sun, 03 Apr 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-04-03-schema-registry-part2/</guid><description>In the previous post, we discussed a Change Data Capture (CDC) solution with a schema registry. A local development environment is set up using Docker Compose. The Debezium and Confluent S3 connectors are deployed with the Confluent Avro converter and the Apicurio registry is used as the schema registry service. A quick example is shown to illustrate how schema evolution can be managed by the schema registry. In this post, we&amp;rsquo;ll build the solution on AWS using MSK, MSK Connect, Aurora PostgreSQL and ECS.</description><enclosure url="https://jaehyeon.me/blog/2022-04-03-schema-registry-part2/featured.png" length="59689" type="image/png"/></item><item><title>Use External Schema Registry with MSK Connect – Part 1 Local Development</title><link>https://jaehyeon.me/blog/2022-03-07-schema-registry-part1/</link><pubDate>Mon, 07 Mar 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-03-07-schema-registry-part1/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1
bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0
bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0
When we discussed a Change Data Capture (CDC) solution in one of the earlier posts, we used the JSON converter that comes with Kafka Connect. We optionally enabled the key and value schemas, and the topic messages included those schemas together with the payload.</description><enclosure url="https://jaehyeon.me/blog/2022-03-07-schema-registry-part1/featured.png" length="59689" type="image/png"/></item><item><title>Data Lake Demo using Change Data Capture (CDC) on AWS – Part 3 Implement Data Lake</title><link>https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/</link><pubDate>Sun, 19 Dec 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/</guid><description>In the previous post, we created a VPC that has private and public subnets in 2 availability zones in order to build and deploy the data lake solution on AWS. NAT instances are created to forward outbound traffic to the internet and a VPN bastion host is set up to facilitate deployment. An Aurora PostgreSQL cluster is deployed to host the source database and a Python command line app is used to create the database.</description><enclosure url="https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/featured.png" length="164526" type="image/png"/></item><item><title>Data Lake Demo using Change Data Capture (CDC) on AWS – Part 2 Implement CDC</title><link>https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/</link><pubDate>Sun, 12 Dec 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/</guid><description>In the previous post, we discussed a data lake solution where data ingestion is performed using change data capture (CDC) and the output files are upserted to an Apache Hudi table. Being registered to Glue Data Catalog, it can be used for ad-hoc queries and report/dashboard creation. The Northwind database is used as the source database and, following the transactional outbox pattern, order-related changes are upserted to an outbox table by triggers.</description><enclosure url="https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/featured.png" length="164526" type="image/png"/></item><item><title>Data Lake Demo using Change Data Capture (CDC) on AWS – Part 1 Local Development</title><link>https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/</link><pubDate>Sun, 05 Dec 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/</guid><description>Change data capture (CDC) is a proven data integration pattern that has a wide range of applications. Among those, data replication to data lakes is a good use case in data engineering. Coupled with best-in-breed data lake formats such as Apache Hudi, we can build an efficient data replication solution. This is the first post of the data lake demo series. Over time, we&amp;rsquo;ll build a data lake that uses CDC.</description><enclosure url="https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/featured.png" length="164526" type="image/png"/></item></channel></rss>