In this lab, we will create a Pyflink application that reads records from S3 and sends them into a Kafka topic. Because the Kafka cluster is authenticated by IAM, a custom pipeline Jar file will be created, and I will demonstrate how to execute the app both in a Flink cluster deployed on Docker and locally as a typical Python app. We can assume the S3 data is static metadata that needs to be joined into another stream, which makes this exercise useful for data enrichment.
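A minimal sketch of such a pipeline with the Pyflink Table API is shown below. It reads the static metadata from S3 with the filesystem connector and inserts it into an IAM-authenticated Kafka topic; the bucket path, topic name, broker address, and jar path are placeholders for illustration, not values from the lab.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register the custom pipeline jar that bundles the Kafka connector and the
# MSK IAM auth library (path is a placeholder).
t_env.get_config().get_configuration().set_string(
    "pipeline.jars", "file:///opt/flink/lib/pipeline-dependencies.jar"
)

# Static metadata kept in S3, read with the filesystem connector
# (requires the flink-s3-fs-hadoop plugin).
t_env.execute_sql("""
    CREATE TABLE metadata (
        id STRING,
        name STRING
    ) WITH (
        'connector' = 'filesystem',
        'path' = 's3://my-bucket/metadata/',
        'format' = 'csv'
    )
""")

# Sink topic on a Kafka cluster that uses IAM authentication.
t_env.execute_sql("""
    CREATE TABLE metadata_topic (
        id STRING,
        name STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'metadata',
        'properties.bootstrap.servers' = 'b-1.mycluster.kafka.eu-west-2.amazonaws.com:9098',
        'properties.security.protocol' = 'SASL_SSL',
        'properties.sasl.mechanism' = 'AWS_MSK_IAM',
        'properties.sasl.jaas.config' = 'software.amazon.msk.auth.iam.IAMLoginModule required;',
        'properties.sasl.client.callback.handler.class' = 'software.amazon.msk.auth.iam.IAMClientCallbackHandler',
        'format' = 'json'
    )
""")

t_env.execute_sql("INSERT INTO metadata_topic SELECT id, name FROM metadata")
```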
Stream processing technology is becoming more and more popular with companies big and small because it provides superior solutions for many established use cases such as data analytics, ETL, and transactional applications, while also facilitating novel applications, software architectures, and business opportunities. Beginning with traditional data infrastructures and application/data development patterns, this post introduces stateful stream processing and demonstrates to what extent it can improve those traditional patterns. A consulting company can partner with its clients on their journeys of adopting stateful stream processing, and doing so can bring huge opportunities; those opportunities are summarised at the end.
Building Apache Flink Applications in Java by Confluent is a course that introduces Apache Flink through a series of hands-on exercises. Utilising the Flink DataStream API, the course develops three Flink applications, from ingesting source data to calculating usage statistics. To learn the Flink DataStream API in Pyflink, I converted the Java apps into their Python equivalents while performing the course exercises. This post summarises the conversion and shows the final output.
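To give a flavour of the conversion, below is a minimal sketch of a keyed aggregation in the Pyflink DataStream API, the kind of pattern the usage-statistics app builds on; the airline names and numbers are illustrative, not values from the course.

```python
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A keyed aggregation over in-memory records; the records below are made up
# for illustration.
ds = env.from_collection(
    collection=[("SkyOne", 120), ("Sunset", 95), ("SkyOne", 45)],
    type_info=Types.TUPLE([Types.STRING(), Types.INT()]),
)

# In Java this would be keyBy(...) with a KeySelector and reduce(...) with a
# ReduceFunction class; Pyflink accepts plain lambdas instead.
usage = ds.key_by(lambda r: r[0]).reduce(lambda a, b: (a[0], a[1] + b[1]))
usage.print()

env.execute("usage_statistics_sketch")
```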
This series updates a real-time analytics app based on Amazon Kinesis from an AWS workshop. Data is ingested from multiple sources into a Kafka cluster instead, and Flink (Pyflink) apps are used extensively for data ingestion and processing. As an introduction, this post compares the original architecture with the new architecture; the app will be implemented in subsequent posts.
This series aims to help those who are new to Apache Flink and Amazon Managed Service for Apache Flink by re-implementing a simple fraud detection application that is discussed in an AWS workshop titled AWS Kafka and DynamoDB for real time fraud detection. In part 1, I demonstrated how to develop the application locally; in this post, the app will be deployed via Amazon Managed Service for Apache Flink.
In this series of posts, we discuss a Flink (Pyflink) application that reads from and writes to Kafka topics. In the previous posts, I demonstrated a Pyflink app that targets a local Kafka cluster as well as a Kafka cluster on Amazon MSK. The app was executed in a virtual environment as well as in a local Flink cluster for improved monitoring. In this post, the app will be deployed via Amazon Managed Service for Apache Flink.
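For reference, creating the application can be scripted with boto3 roughly as in the sketch below; every name, ARN, and key is a placeholder, and the kinesis.analytics.flink.run.options property group tells the Managed Flink runtime which Python script and pipeline jar inside the zipped package to execute.

```python
import boto3

client = boto3.client("kinesisanalyticsv2")

client.create_application(
    ApplicationName="pyflink-kafka-app",  # placeholder
    RuntimeEnvironment="FLINK-1_15",
    ServiceExecutionRole="arn:aws:iam::123456789012:role/pyflink-app-role",  # placeholder
    ApplicationConfiguration={
        "ApplicationCodeConfiguration": {
            "CodeContent": {
                "S3ContentLocation": {
                    "BucketARN": "arn:aws:s3:::my-app-bucket",  # placeholder
                    "FileKey": "package/app.zip",  # zipped Pyflink app plus pipeline jar
                }
            },
            "CodeContentType": "ZIPFILE",
        },
        "EnvironmentProperties": {
            "PropertyGroups": [
                {
                    # Tells the runtime which script and jar inside the zip to run.
                    "PropertyGroupId": "kinesis.analytics.flink.run.options",
                    "PropertyMap": {
                        "python": "processor.py",  # hypothetical entry point
                        "jarfile": "lib/pipeline-dependencies.jar",  # hypothetical jar
                    },
                }
            ]
        },
    },
)
```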
In this series of posts, we discuss a Flink (Pyflink) application that reads from and writes to Kafka topics. In part 1, an app that targets a local Kafka cluster was created. In this post, we will update the app to connect to a Kafka cluster on Amazon MSK. The Kafka cluster is authenticated by IAM, so the app has an additional jar dependency. As Amazon Managed Service for Apache Flink does not allow you to specify multiple pipeline jar files, we have to build a custom Uber Jar that combines them. As in part 1, the app will be executed in a virtual environment as well as in a local Flink cluster for improved monitoring, this time with the updated pipeline jar file.
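Once built, the Uber Jar is the only jar the app needs to register, roughly as below; the jar name and path are assumptions for illustration.

```python
import os
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Managed Flink accepts a single pipeline jar, so the Kafka connector and the
# MSK IAM auth library are combined into one Uber Jar ahead of time.
PIPELINE_JAR = "pipeline-dependencies-uber.jar"  # hypothetical name
t_env.get_config().get_configuration().set_string(
    "pipeline.jars",
    "file://"
    + os.path.join(os.path.dirname(os.path.realpath(__file__)), PIPELINE_JAR),
)
```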
Apache Flink is widely used for building real-time stream processing applications. On AWS, Amazon Managed Service for Apache Flink is the easiest option to develop a Flink app as it provides the underlying infrastructure. Updating a guide from AWS, this series of posts discusses how to develop and deploy a Flink (Pyflink) application via KDA (Kinesis Data Analytics, the service's former name) where the data source and sink are Kafka topics. In part 1, the app will be developed locally, targeting a Kafka cluster created by Docker. Furthermore, it will be executed in a virtual environment as well as in a local Flink cluster for improved monitoring.
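A typical local setup looks something like the sketch below; the topic name, schema, and the Docker-exposed broker address (localhost:29092) are assumptions for illustration. A print sink makes it easy to inspect records while developing.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
# The Kafka connector jar still needs to be registered via 'pipeline.jars'
# (omitted here for brevity).

# Source topic on the local Docker Kafka cluster - no authentication needed.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        price DOUBLE,
        ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:29092',
        'properties.group.id' = 'orders-reader',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Print sink for inspecting records during local development.
t_env.execute_sql("""
    CREATE TABLE print_sink (
        order_id STRING,
        price DOUBLE,
        ts TIMESTAMP(3)
    ) WITH ('connector' = 'print')
""")

t_env.execute_sql("INSERT INTO print_sink SELECT * FROM orders").wait()
```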
Apache Flink is widely used for building real-time stream processing applications. On AWS, Amazon Managed Service for Apache Flink is the easiest option to develop a Flink app as it provides the underlying infrastructure. Re-implementing a solution from an AWS workshop, this series of posts discusses how to develop and deploy a fraud detection app using Kafka, Flink and DynamoDB. Part 1 covers local development using Docker, while deployment via KDA (Kinesis Data Analytics, the service's former name) will be discussed in part 2.