Apache Flink on Jaehyeon Kim

https://jaehyeon.me/tags/apache-flink/
Recent content in Apache Flink on Jaehyeon Kim
Copyright © 2023-2024 Jaehyeon Kim. All Rights Reserved.
Thu, 09 May 2024 00:00:00 +0000

Apache Beam Local Development with Python - Part 5 Testing Pipelines
https://jaehyeon.me/blog/2024-05-09-beam-local-dev-5/
Thu, 09 May 2024 00:00:00 +0000
We developed batch and streaming pipelines in Part 2 and Part 4. It is often faster and simpler to identify and fix bugs in the pipeline code by performing local unit testing. Moreover, especially when creating a streaming pipeline, unit test cases can further facilitate development by using TestStream, as it allows us to advance watermarks or processing time according to different scenarios. In this post, we discuss how to perform unit testing of the batch and streaming pipelines that we developed earlier.

Apache Beam Local Development with Python - Part 4 Streaming Pipelines
https://jaehyeon.me/blog/2024-05-02-beam-local-dev-4/
Thu, 02 May 2024 00:00:00 +0000
In Part 3, we discussed the portability layer of Apache Beam, as it helps us understand (1) how Python pipelines run on the Flink Runner and (2) how multiple SDKs can be used in a single pipeline, followed by demonstrating local Flink and Kafka cluster creation for developing streaming pipelines. In this post, we build a streaming pipeline that aggregates page visits by user in a fixed time window of 20 seconds.

Apache Beam Local Development with Python - Part 3 Flink Runner
https://jaehyeon.me/blog/2024-04-18-beam-local-dev-3/
Thu, 18 Apr 2024 00:00:00 +0000
In this series, we discuss local development of Apache Beam pipelines using Python. In the previous posts, we mainly talked about batch pipelines with/without Beam SQL.
Beam pipelines are portable between batch and streaming semantics, and we will discuss streaming pipeline development in this and the next posts. While there are multiple Beam Runners, not every Runner supports Python, and some Runners have too limited a feature set for streaming semantics - see the Beam Capability Matrix for details.

Apache Beam Local Development with Python - Part 2 Batch Pipelines
https://jaehyeon.me/blog/2024-04-04-beam-local-dev-2/
Thu, 04 Apr 2024 00:00:00 +0000
In this series, we discuss local development of Apache Beam pipelines using Python. A basic Beam pipeline was introduced in Part 1, followed by demonstrating how to utilise Jupyter notebooks, Beam SQL and Beam DataFrames. In this post, we discuss batch pipelines that aggregate website visit logs by user and time. The pipelines are developed with and without Beam SQL. Additionally, each pipeline is implemented on a Jupyter notebook for demonstration.

Apache Beam Local Development with Python - Part 1 Pipeline, Notebook, SQL and DataFrame
https://jaehyeon.me/blog/2024-03-28-beam-local-dev-1/
Thu, 28 Mar 2024 00:00:00 +0000
Apache Beam and Apache Flink are open-source frameworks for parallel, distributed data processing at scale. Flink has DataStream and Table/SQL APIs, and the former has more capacity to develop sophisticated data streaming applications. The DataStream API of PyFlink, Flink’s Python API, however, is not as complete as its Java counterpart, and it doesn’t provide enough capability to extend when there are missing features in Python.
Recently I had a chance to look through Apache Beam and found that it offers more possibilities to extend and/or customise its features.

Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images
https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/
Thu, 07 Dec 2023 00:00:00 +0000
Apache Flink became generally available for Amazon EMR on EKS from the EMR 6.15.0 releases, and we are able to pull the Flink (as well as Spark) container images from the ECR Public Gallery. As both of them can be integrated with the Glue Data Catalog, it can be particularly useful if we develop real time data ingestion/processing via Flink and build analytical queries using Spark (or any other tools or services that can access the Glue Data Catalog).

Real Time Streaming with Kafka and Flink - Lab 4 Clean, Aggregate, and Enrich Events with Flink
https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/
Thu, 23 Nov 2023 00:00:00 +0000
The value of data can be maximised when it is used without delay. With Apache Flink, we can build streaming analytics applications that incorporate the latest events with low latency. In this lab, we will create a Pyflink application that writes accumulated taxi ride data into an OpenSearch cluster. It aggregates the number of trips/passengers and trip durations by vendor ID for a window of 5 seconds.
The data is then used to create a chart that monitors the status of taxi rides in the OpenSearch Dashboard.

Real Time Streaming with Kafka and Flink - Lab 3 Transform and write data to S3 from Kafka using Flink
https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/
Thu, 16 Nov 2023 00:00:00 +0000
In this lab, we will create a Pyflink application that exports Kafka topic messages into an S3 bucket. The app enriches the records by adding a new column using a user-defined function and writes them via the FileSystem SQL connector. This allows us to achieve a simpler architecture compared to the original lab, where the records are sent into Amazon Kinesis Data Firehose, enriched by a separate Lambda function and written to an S3 bucket afterwards.

Real Time Streaming with Kafka and Flink - Lab 2 Write data to Kafka from S3 using Flink
https://jaehyeon.me/blog/2023-11-09-real-time-streaming-with-kafka-and-flink-3/
Thu, 09 Nov 2023 00:00:00 +0000
In this lab, we will create a Pyflink application that reads records from S3 and sends them into a Kafka topic. A custom pipeline Jar file will be created as the Kafka cluster is authenticated by IAM, and it will be demonstrated how to execute the app in a Flink cluster deployed on Docker as well as locally as a typical Python app.
We can assume the S3 data is static metadata that needs to be joined into another stream, and this exercise can be useful for data enrichment.

Benefits and Opportunities of Stateful Stream Processing
https://jaehyeon.me/blog/2023-11-02-stateful-stream-processing/
Thu, 02 Nov 2023 00:00:00 +0000
Stream processing technology is becoming more and more popular with companies big and small because it provides superior solutions for many established use cases such as data analytics, ETL, and transactional applications, but also facilitates novel applications, software architectures, and business opportunities. Beginning with traditional data infrastructures and application/data development patterns, this post introduces stateful stream processing and demonstrates to what extent it can improve the traditional development patterns. A consulting company can partner with its clients on their journeys of adopting stateful stream processing, and it can bring huge opportunities.

Building Apache Flink Applications in Python
https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/
Thu, 19 Oct 2023 00:00:00 +0000
Building Apache Flink Applications in Java is a course to introduce Apache Flink through a series of hands-on exercises, and it is provided by Confluent. Utilising the Flink DataStream API, the course develops three Flink applications that populate multiple source data sets, collect them into a standardised data set, and aggregate it to produce usage statistics.
As part of learning the Flink DataStream API in Pyflink, I converted the Java apps into their Python equivalents while performing the course exercises in Pyflink.

Real Time Streaming with Kafka and Flink - Introduction
https://jaehyeon.me/blog/2023-10-05-real-time-streaming-with-kafka-and-flink-1/
Thu, 05 Oct 2023 00:00:00 +0000
Real Time Streaming with Amazon Kinesis is an AWS workshop that helps users build a streaming analytics application on AWS. Incoming events are stored in a number of streams of the Amazon Kinesis Data Streams service, and various other AWS services and tools are used to process and analyse data. Apache Kafka is a popular distributed event store and stream processing platform, and it stores incoming events in topics. As part of learning real time streaming analytics on AWS, we can rebuild the analytics applications by replacing the Kinesis streams with Kafka topics.

Kafka, Flink and DynamoDB for Real Time Fraud Detection - Part 2 Deployment via AWS Managed Flink
https://jaehyeon.me/blog/2023-09-14-fraud-detection-part-2/
Thu, 14 Sep 2023 00:00:00 +0000
This series aims to help those who are new to Apache Flink and Amazon Managed Service for Apache Flink by re-implementing a simple fraud detection application that is discussed in an AWS workshop titled AWS Kafka and DynamoDB for real time fraud detection. In part 1, I demonstrated how to develop the application locally, and the app will be deployed via Amazon Managed Service for Apache Flink in this post.

Getting Started with Pyflink on AWS - Part 3 AWS Managed Flink and MSK
https://jaehyeon.me/blog/2023-09-04-getting-started-with-pyflink-on-aws-part-3/
Mon, 04 Sep 2023 00:00:00 +0000
In this series of posts, we discuss a Flink (Pyflink) application that reads/writes from/to Kafka topics.
In the previous posts, I demonstrated a Pyflink app that targets a local Kafka cluster as well as a Kafka cluster on Amazon MSK. The app was executed in a virtual environment as well as in a local Flink cluster for improved monitoring. In this post, the app will be deployed via Amazon Managed Service for Apache Flink, which is the easiest option to run Flink applications on AWS.

Getting Started with Pyflink on AWS - Part 2 Local Flink and MSK
https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/
Mon, 28 Aug 2023 00:00:00 +0000
In this series of posts, we discuss a Flink (Pyflink) application that reads/writes from/to Kafka topics. In part 1, an app that targets a local Kafka cluster was created. In this post, we will update the app by connecting to a Kafka cluster on Amazon MSK. The Kafka cluster is authenticated by IAM and the app has an additional jar dependency. As Amazon Managed Service for Apache Flink does not allow you to specify multiple pipeline jar files, we have to build a custom Uber Jar that combines multiple jar files.

Getting Started with Pyflink on AWS - Part 1 Local Flink and Local Kafka
https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/
Thu, 17 Aug 2023 00:00:00 +0000
Apache Flink is an open-source, unified stream-processing and batch-processing framework. Its core is a distributed streaming data-flow engine that you can use to run real-time stream processing on high-throughput data sources. Currently, it is widely used to build applications for fraud/anomaly detection, rule-based alerting, business process monitoring, and continuous ETL, to name a few.
On AWS, we can deploy a Flink application via Amazon Kinesis Data Analytics (KDA), Amazon EMR and Amazon EKS.

Kafka, Flink and DynamoDB for Real Time Fraud Detection - Part 1 Local Development
https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/
Thu, 10 Aug 2023 00:00:00 +0000
Apache Flink is an open-source, unified stream-processing and batch-processing framework. Its core is a distributed streaming data-flow engine that you can use to run real-time stream processing on high-throughput data sources. Currently, it is widely used to build applications for fraud/anomaly detection, rule-based alerting, business process monitoring, and continuous ETL to name a few. On AWS, we can deploy a Flink application via Amazon Kinesis Data Analytics (KDA), Amazon EMR and Amazon EKS.