Apache Flink

Run Flink SQL Cookbook in Docker

April 15, 2025 · 5 min read Apache Flink Apache Flink Docker Docker Compose

The Flink SQL Cookbook by Ververica is a hands-on, example-rich guide to mastering Apache Flink SQL for real-time stream processing. It offers a wide range of self-contained recipes, from basic queries and table operations to more advanced use cases like windowed aggregations, complex joins, user-defined functions (UDFs), and pattern detection. These examples are designed to be run on the Ververica Platform, and as such, the cookbook doesn’t include instructions for setting up a Flink cluster.

To help you run these recipes locally and explore Flink SQL without external dependencies, this post walks through setting up a fully functional local Flink cluster using Docker Compose. With this setup, you can experiment with the cookbook examples right on your machine.

May 30, 2024 · 13 min read Apache Flink Data Streaming Deploy Python Stream Processing App on Kubernetes Apache Flink Apache Kafka Docker Kubernetes Minikube Python

The Flink Kubernetes Operator acts as a control plane that manages the complete deployment lifecycle of Apache Flink applications. With the operator, we can simplify the deployment and management of Python stream processing applications. In this series, we discuss how to deploy a PyFlink application and a Python Apache Beam pipeline on the Flink Runner on Kubernetes. In Part 1, we first deploy a Kafka cluster on a minikube cluster because the source and sink of the PyFlink application are Kafka topics. Then, the application source is packaged in a custom Docker image and deployed on the minikube cluster using the Flink Kubernetes Operator. Finally, the output of the application is checked by sending messages to the input Kafka topic using a Python producer application.

December 7, 2023 · 16 min read Apache Flink Apache Spark Data Engineering Amazon EMR Apache Flink Apache Kafka Apache Spark Docker Docker Compose Pyflink PySpark Python

Apache Flink became generally available for Amazon EMR on EKS as of the EMR 6.15.0 release. As it is integrated with the Glue Data Catalog, it can be particularly useful if we develop real-time data ingestion/processing via Flink and build analytical queries using Spark (or any other tool or service that can access the Glue Data Catalog). In this post, we will discuss how to set up a local development environment for Apache Flink and Spark using the EMR container images. After illustrating the environment setup, we will discuss a solution where data ingestion/processing is performed in real time using Apache Flink and the processed data is consumed by Apache Spark for analysis.

November 23, 2023 · 15 min read Apache Flink Data Streaming Real Time Streaming With Kafka and Flink Amazon MSK Amazon OpenSearch Service Apache Flink Apache Kafka AWS Docker Docker Compose OpenSearch Pyflink Python

The value of data can be maximised when it is used without delay. With Apache Flink, we can build streaming analytics applications that incorporate the latest events with low latency. In this lab, we will create a PyFlink application that writes aggregated taxi ride data into an OpenSearch cluster. It aggregates the number of trips and passengers and the trip durations by vendor ID over a 5-second window. The data is then used to create a chart that monitors the status of taxi rides in the OpenSearch Dashboard.
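The windowed aggregation described in this summary can be illustrated with a minimal plain-Python sketch. It is not the PyFlink code from the lab: it assumes fixed (tumbling) 5-second windows, and the function name `aggregate_by_window` and the record keys (`vendor_id`, `event_time`, `passengers`, `duration`) are hypothetical names chosen for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 5  # window size stated in the post summary


def aggregate_by_window(rides, window_seconds=WINDOW_SECONDS):
    """Group rides into tumbling windows and aggregate them per vendor.

    Each ride is a dict with hypothetical keys: vendor_id, event_time
    (epoch seconds), passengers, and duration (seconds).
    """
    totals = defaultdict(lambda: {"trips": 0, "passengers": 0, "duration": 0})
    for ride in rides:
        # A tumbling window starts at the largest multiple of the window
        # size that is not greater than the event time.
        window_start = ride["event_time"] - ride["event_time"] % window_seconds
        bucket = totals[(window_start, ride["vendor_id"])]
        bucket["trips"] += 1
        bucket["passengers"] += ride["passengers"]
        bucket["duration"] += ride["duration"]
    return dict(totals)


# Made-up sample records, for illustration only.
rides = [
    {"vendor_id": "1", "event_time": 0, "passengers": 1, "duration": 300},
    {"vendor_id": "1", "event_time": 4, "passengers": 2, "duration": 600},
    {"vendor_id": "2", "event_time": 3, "passengers": 1, "duration": 120},
    {"vendor_id": "1", "event_time": 5, "passengers": 3, "duration": 900},
]
result = aggregate_by_window(rides)
```

The result is keyed by `(window_start, vendor_id)`, so the ride at second 5 lands in a new window rather than being merged with the rides at seconds 0 and 4. In Flink, the same grouping would be expressed with a window operator instead of manual bucketing.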

November 16, 2023 · 16 min read Apache Flink Apache Kafka Data Streaming Real Time Streaming With Kafka and Flink Amazon Athena Amazon MSK Amazon S3 Apache Flink Apache Kafka AWS Docker Docker Compose Pyflink Python

In this lab, we will create a PyFlink application that exports Kafka topic messages into an S3 bucket. The app enriches the records by adding a new column using a user-defined function and writes them via the FileSystem SQL connector. This gives us a simpler architecture than the original lab, where the records are sent to Amazon Kinesis Data Firehose, enriched by a separate Lambda function, and written to an S3 bucket afterwards. While the records are being written to the S3 bucket, a Glue table will be created to query them on Amazon Athena.

November 9, 2023 · 15 min read Apache Flink Apache Kafka Data Streaming Real Time Streaming With Kafka and Flink Amazon MSK Apache Flink Apache Kafka AWS Docker Docker Compose Pyflink Python

In this lab, we will create a PyFlink application that reads records from S3 and sends them to a Kafka topic. A custom pipeline Jar file will be created because the Kafka cluster is authenticated by IAM, and we will demonstrate how to execute the app in a Flink cluster deployed on Docker as well as locally as a typical Python app. We can assume the S3 data is static metadata that needs to be joined with another stream, so this exercise can be useful for data enrichment.

November 2, 2023 · 7 min read Apache Flink Data Streaming Apache Flink Apache Kafka Data Pipeline Event Driven Architecture Stateful Stream Processing Streaming Analytics

Stream processing technology is becoming more and more popular with companies big and small because it provides superior solutions for many established use cases such as data analytics, ETL, and transactional applications, and also facilitates novel applications, software architectures, and business opportunities. Beginning with traditional data infrastructures and application/data development patterns, this post introduces stateful stream processing and demonstrates to what extent it can improve on those traditional patterns. A consulting company can partner with its clients on their journeys of adopting stateful stream processing, which can bring huge opportunities. Those opportunities are summarised at the end.

October 19, 2023 · 6 min read Apache Flink Data Streaming Apache Flink Apache Kafka Docker Docker Compose Pyflink Python

Building Apache Flink Applications in Java by Confluent is a course that introduces Apache Flink through a series of hands-on exercises. Utilising the Flink DataStream API, the course develops three Flink applications, from ingesting source data to calculating usage statistics. As part of learning the Flink DataStream API in PyFlink, I converted the Java apps into their Python equivalents while performing the course exercises in PyFlink. This post summarises the progress of the conversion and shows the final output.

October 5, 2023 · 6 min read Apache Flink Apache Kafka Data Streaming Real Time Streaming With Kafka and Flink Amazon Athena Amazon DynamoDB Amazon MSK Amazon MSK Connect Amazon OpenSearch Service Amazon S3 Apache Camel Apache Flink Apache Kafka AWS AWS Glue AWS Lambda Docker Docker Compose Kafka Connect OpenSearch Pyflink Python

This series updates a real-time analytics app based on Amazon Kinesis from an AWS workshop. Data is instead ingested from multiple sources into a Kafka cluster, and Flink (PyFlink) apps are used extensively for data ingestion and processing. As an introduction, this post compares the original architecture with the new architecture; the app will be implemented in subsequent posts.

September 14, 2023 · 17 min read Apache Flink Apache Kafka Data Streaming Kafka, Flink and DynamoDB for Real Time Fraud Detection Amazon DynamoDB Amazon Managed Flink Amazon Managed Service for Apache Flink Amazon MSK Amazon MSK Connect Apache Flink Apache Kafka Fraud Detection Kafka Connect Pyflink Python

This series aims to help those who are new to Apache Flink and Amazon Managed Service for Apache Flink by re-implementing a simple fraud detection application that is discussed in an AWS workshop titled AWS Kafka and DynamoDB for real time fraud detection. In Part 1, I demonstrated how to develop the application locally. In this post, the app will be deployed via Amazon Managed Service for Apache Flink.

Run Flink SQL Cookbook in Docker

Deploy Python Stream Processing App on Kubernetes - Part 1 PyFlink Application

Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images

Real Time Streaming With Kafka and Flink - Lab 4 Clean, Aggregate, and Enrich Events With Flink

Real Time Streaming With Kafka and Flink - Lab 3 Transform and Write Data to S3 From Kafka Using Flink

Real Time Streaming With Kafka and Flink - Lab 2 Write Data to Kafka From S3 Using Flink

Benefits and Opportunities of Stateful Stream Processing

Building Apache Flink Applications in Python

Real Time Streaming With Kafka and Flink - Introduction

Kafka, Flink and DynamoDB for Real Time Fraud Detection - Part 2 Deployment via AWS Managed Flink