Data Streaming

Deploy Python Stream Processing App on Kubernetes - Part 2 Beam Pipeline on Flink Runner

June 6, 2024 | 16 min read | Data Streaming | Deploy Python Stream Processing App on Kubernetes | Apache Beam, Apache Flink, Apache Kafka, Docker, Kubernetes, Minikube, Python

In this post, we develop an Apache Beam pipeline using the Python SDK and deploy it on an Apache Flink cluster via the Apache Flink Runner. As in Part 1, we deploy a Kafka cluster using the Strimzi Operator on a minikube cluster because the pipeline uses Apache Kafka topics for its data source and sink. Then, we develop the pipeline as a Python package and add the package to a custom Docker image so that Python user code can be executed externally. For deployment, we create a Flink session cluster via the Flink Kubernetes Operator and deploy the pipeline using a Kubernetes Job. Finally, we check the output of the application by sending messages to the input Kafka topic using a Python producer application.

May 30, 2024 | 13 min read | Data Streaming | Deploy Python Stream Processing App on Kubernetes | Apache Flink, Apache Kafka, Docker, Kubernetes, Minikube, Python

The Flink Kubernetes Operator acts as a control plane to manage the complete deployment lifecycle of Apache Flink applications. With the operator, we can simplify the deployment and management of Python stream processing applications. In this series, we discuss how to deploy a PyFlink application and a Python Apache Beam pipeline (via the Flink Runner) on Kubernetes. In Part 1, we first deploy a Kafka cluster on a minikube cluster because the source and sink of the PyFlink application are Kafka topics. Then, the application source is packaged in a custom Docker image and deployed on the minikube cluster using the Flink Kubernetes Operator. Finally, the output of the application is checked by sending messages to the input Kafka topic using a Python producer application.

December 14, 2023 | 6 min read | Data Streaming | Real Time Streaming With Kafka and Flink | Amazon MSK, Apache Kafka, AWS, AWS Lambda, Docker, Docker Compose, Python

Amazon MSK can be configured as an event source of a Lambda function. Lambda internally polls for new messages from the event source and then synchronously invokes the target Lambda function. With this feature, we can develop a Kafka consumer application in a serverless environment where developers can focus on application logic. In this lab, we will discuss how to create a Kafka consumer using a Lambda function.
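
As a rough illustration of what such a consumer looks like, the sketch below decodes the batch that Lambda passes to the function. With an MSK event source, records arrive grouped under keys of the form `<topic>-<partition>`, with keys and values base64-encoded. The helper name `decode_records`, the returned dictionary shape, and the assumption that message values are JSON are illustrative only and are not taken from the lab.

```python
import base64
import json


def decode_records(event):
    """Flatten an MSK event into a list of decoded messages."""
    messages = []
    # "records" maps "<topic>-<partition>" to a list of Kafka records.
    for records in event.get("records", {}).values():
        for record in records:
            key = record.get("key")
            messages.append(
                {
                    "topic": record["topic"],
                    "partition": record["partition"],
                    "offset": record["offset"],
                    # Keys are optional; values are assumed to be JSON here.
                    "key": base64.b64decode(key).decode("utf-8") if key else None,
                    "value": json.loads(base64.b64decode(record["value"])),
                }
            )
    return messages


def lambda_handler(event, context):
    """Entry point invoked by Lambda with a batch of Kafka records."""
    messages = decode_records(event)
    for message in messages:
        print(message)  # application logic would go here
    return {"processed": len(messages)}
```

Keeping the decoding in a separate function makes it possible to test the logic locally with a hand-built event, without deploying the function.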

November 30, 2023 | 9 min read | Data Streaming | Real Time Streaming With Kafka and Flink | Amazon DynamoDB, Amazon MSK, Amazon MSK Connect, Apache Kafka, AWS, Docker, Docker Compose, Kafka Connect

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. In this lab, we will discuss how to create a data pipeline that ingests data from a Kafka topic into a DynamoDB table using the Camel DynamoDB sink connector.

November 23, 2023 | 15 min read | Data Streaming | Real Time Streaming With Kafka and Flink | Amazon MSK, Amazon OpenSearch Service, Apache Flink, Apache Kafka, AWS, Docker, Docker Compose, OpenSearch, Pyflink, Python

The value of data can be maximised when it is used without delay. With Apache Flink, we can build streaming analytics applications that incorporate the latest events with low latency. In this lab, we will create a PyFlink application that writes aggregated taxi ride data into an OpenSearch cluster. It aggregates the number of trips, the number of passengers and trip durations by vendor ID over a window of 5 seconds. The data is then used to create a chart that monitors the status of taxi rides in the OpenSearch Dashboard.
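
To make the windowing idea concrete, here is a minimal plain-Python sketch of a 5-second tumbling-window aggregation keyed by vendor ID. It is not the PyFlink code used in the lab, and it ignores watermarks and late events. The field names (`pickup_ts` as integer epoch seconds, `vendor_id`, `passenger_count`, `trip_duration`) are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 5


def aggregate_by_window(rides, window_seconds=WINDOW_SECONDS):
    """Group rides into tumbling windows and aggregate them per vendor."""
    totals = defaultdict(lambda: {"trips": 0, "passengers": 0, "duration": 0})
    for ride in rides:
        # Align the event time to the start of its tumbling window.
        window_start = ride["pickup_ts"] // window_seconds * window_seconds
        bucket = totals[(window_start, ride["vendor_id"])]
        bucket["trips"] += 1
        bucket["passengers"] += ride["passenger_count"]
        bucket["duration"] += ride["trip_duration"]
    return dict(totals)


rides = [
    {"pickup_ts": 0, "vendor_id": 1, "passenger_count": 2, "trip_duration": 300},
    {"pickup_ts": 4, "vendor_id": 1, "passenger_count": 1, "trip_duration": 120},
    {"pickup_ts": 5, "vendor_id": 1, "passenger_count": 3, "trip_duration": 600},
    {"pickup_ts": 3, "vendor_id": 2, "passenger_count": 1, "trip_duration": 240},
]
result = aggregate_by_window(rides)
```

Each key of `result` is a `(window_start, vendor_id)` pair, so the first two rides fall into the window starting at 0 for vendor 1, while the third opens a new window starting at 5.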

November 16, 2023 | 16 min read | Data Streaming | Real Time Streaming With Kafka and Flink | Amazon Athena, Amazon MSK, Amazon S3, Apache Flink, Apache Kafka, AWS, Docker, Docker Compose, Pyflink, Python

In this lab, we will create a PyFlink application that exports Kafka topic messages into an S3 bucket. The app enriches the records by adding a new column using a user-defined function and writes them via the FileSystem SQL connector. This allows us to achieve a simpler architecture than the original lab, where the records are sent into Amazon Kinesis Data Firehose, enriched by a separate Lambda function and written to an S3 bucket afterwards. While the records are being written to the S3 bucket, a Glue table will be created to query them on Amazon Athena.

November 9, 2023 | 15 min read | Data Streaming | Real Time Streaming With Kafka and Flink | Amazon MSK, Apache Flink, Apache Kafka, AWS, Docker, Docker Compose, Pyflink, Python

In this lab, we will create a PyFlink application that reads records from S3 and sends them into a Kafka topic. Because the Kafka cluster is authenticated by IAM, we will create a custom pipeline Jar file, and we will demonstrate how to execute the app in a Flink cluster deployed on Docker as well as locally as a typical Python app. We can assume the S3 data is static metadata that needs to be joined into another stream, so this exercise can be useful for data enrichment.

November 2, 2023 | 7 min read | Data Streaming | Apache Flink, Apache Kafka, Data Pipeline, Event Driven Architecture, Stateful Stream Processing, Streaming Analytics

Stream processing technology is becoming more and more popular with companies big and small because it provides superior solutions for many established use cases such as data analytics, ETL, and transactional applications, and it also facilitates novel applications, software architectures, and business opportunities. Beginning with traditional data infrastructures and application/data development patterns, this post introduces stateful stream processing and demonstrates to what extent it can improve on the traditional development patterns. A consulting company can partner with its clients on their journeys of adopting stateful stream processing, which can bring huge opportunities. Those opportunities are summarised at the end.

October 30, 2023 | 18 min read | Data Streaming | Kafka Connect for AWS Services Integration | Amazon MSK, Amazon OpenSearch Service, Apache Kafka, AWS, Docker, Docker Compose, Kafka Connect, MSK Connect, OpenSearch

In the previous post, we discussed how to develop a data pipeline from Apache Kafka into OpenSearch locally using Docker. In this post, the pipeline will be deployed on AWS with Terraform, using Amazon MSK, Amazon MSK Connect and Amazon OpenSearch Service. First, the infrastructure will be deployed, covering a VPC, VPN server, MSK cluster and OpenSearch domain. Then Kafka source and sink connectors will be deployed on MSK Connect, followed by a quick data analysis.

October 26, 2023 | 14 min read | Data Streaming | Real Time Streaming With Kafka and Flink | Amazon MSK, Apache Kafka, AWS, AWS Lambda, Docker, Docker Compose, Python

In this lab, we will create a Kafka producer application using AWS Lambda, which sends fake taxi ride data into a Kafka topic on Amazon MSK. A configurable number of producer Lambda function invocations will be triggered by an Amazon EventBridge schedule rule. In this way, we can generate test data concurrently, scaled to the desired volume of messages.

Deploy Python Stream Processing App on Kubernetes - Part 2 Beam Pipeline on Flink Runner

Deploy Python Stream Processing App on Kubernetes - Part 1 PyFlink Application

Real Time Streaming With Kafka and Flink - Lab 6 Consume Data From Kafka Using Lambda

Real Time Streaming With Kafka and Flink - Lab 5 Write Data to DynamoDB Using Kafka Connect

Real Time Streaming With Kafka and Flink - Lab 4 Clean, Aggregate, and Enrich Events With Flink

Real Time Streaming With Kafka and Flink - Lab 3 Transform and Write Data to S3 From Kafka Using Flink

Real Time Streaming With Kafka and Flink - Lab 2 Write Data to Kafka From S3 Using Flink

Benefits and Opportunities of Stateful Stream Processing

Kafka Connect for AWS Services Integration - Part 5 Deploy Aiven OpenSearch Sink Connector

Real Time Streaming With Kafka and Flink - Lab 1 Produce Data to Kafka Using Lambda