Docker on Jaehyeon Kim

https://jaehyeon.me/tags/docker/
Recent content in Docker on Jaehyeon Kim
Copyright © 2023-2024 Jaehyeon Kim. All Rights Reserved.
Thu, 14 Mar 2024 00:00:00 +0000

Data Build Tool (dbt) Pizza Shop Demo - Part 6 ETL on Amazon Athena via Airflow
https://jaehyeon.me/blog/2024-03-14-dbt-pizza-shop-6/
Thu, 14 Mar 2024 00:00:00 +0000
In Part 5, we developed a dbt project that targets Apache Iceberg where transformations are performed on Amazon Athena. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. To improve query performance, the fact table is denormalized to pre-join records from the dimension tables using the array and struct data types.

Data Build Tool (dbt) Pizza Shop Demo - Part 4 ETL on BigQuery via Airflow
https://jaehyeon.me/blog/2024-02-22-dbt-pizza-shop-4/
Thu, 22 Feb 2024 00:00:00 +0000
In Part 3, we developed a dbt project that targets Google BigQuery with fictional pizza shop data. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. The fact table is denormalized using nested and repeated fields to improve query performance. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.

Data Build Tool (dbt) Pizza Shop Demo - Part 2 ETL on PostgreSQL via Airflow
https://jaehyeon.me/blog/2024-01-25-dbt-pizza-shop-2/
Thu, 25 Jan 2024 00:00:00 +0000
In this series of posts, we discuss data warehouse/lakehouse examples using data build tool (dbt) including ETL orchestration with Apache Airflow.
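The dbt entries in this listing describe dimension tables built as SCD Type 2, where each change to a record closes the current row and adds a new one so that history is preserved. The following is only a minimal in-memory sketch of that idea in plain Python, not the dbt implementation used in the posts. The `ProductVersion` class, its field names and the helper functions are all hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class ProductVersion:
    """One row of a hypothetical SCD Type 2 product dimension."""
    product_id: int
    name: str
    price: float
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None marks the current version


def apply_change(rows: List[ProductVersion], product_id: int, name: str,
                 price: float, changed_at: datetime) -> List[ProductVersion]:
    """Close the current version of a product (if any) and append a new one."""
    updated = []
    for row in rows:
        if row.product_id == product_id and row.valid_to is None:
            if (row.name, row.price) == (name, price):
                return list(rows)  # attributes unchanged, keep history as is
            row = replace(row, valid_to=changed_at)  # expire the old version
        updated.append(row)
    updated.append(ProductVersion(product_id, name, price, changed_at))
    return updated


def version_at(rows: List[ProductVersion], product_id: int,
               ts: datetime) -> Optional[ProductVersion]:
    """Return the version of a product that was valid at timestamp ts."""
    for row in rows:
        if row.product_id != product_id:
            continue
        if row.valid_from <= ts and (row.valid_to is None or ts < row.valid_to):
            return row
    return None
```

A fact record can then be joined to the dimension version that was valid at the order timestamp by calling `version_at` with that timestamp, which is the property that makes Type 2 dimensions useful for historical reporting.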
In Part 1, we developed a dbt project on PostgreSQL with fictional pizza shop data. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.

Data Build Tool (dbt) Pizza Shop Demo - Part 1 Modelling on PostgreSQL
https://jaehyeon.me/blog/2024-01-18-dbt-pizza-shop-1/
Thu, 18 Jan 2024 00:00:00 +0000
The data build tool (dbt) is a popular data transformation tool for data warehouse development. Moreover, it can be used for data lakehouse development thanks to open table formats such as Apache Iceberg, Apache Hudi and Delta Lake. dbt supports key AWS analytics services and I wrote a series of posts that discuss how to utilise dbt with Redshift, Glue, EMR on EC2, EMR on EKS, and Athena. Those posts focus on platform integration; however, they do not show realistic ETL scenarios.

Kafka Development on Kubernetes - Part 3 Kafka Connect
https://jaehyeon.me/blog/2024-01-11-kafka-development-on-k8s-part-3/
Thu, 11 Jan 2024 00:00:00 +0000
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. In this post, we discuss how to set up a data ingestion pipeline using Kafka connectors. Fake customer and order data is ingested into Kafka topics using the MSK Data Generator.
Also, we use the Confluent S3 sink connector to save the messages of the topics into an S3 bucket.

Kafka Development on Kubernetes - Part 2 Producer and Consumer
https://jaehyeon.me/blog/2024-01-04-kafka-development-on-k8s-part-2/
Thu, 04 Jan 2024 00:00:00 +0000
Apache Kafka has five core APIs, and we can develop applications to send/read streams of data to/from topics in a Kafka cluster using the producer and consumer APIs. While the main Kafka project maintains only the Java APIs, there are several open source projects that provide the Kafka client APIs in Python. In this post, we discuss how to develop Kafka client applications using the kafka-python package on Kubernetes.
Part 1 Cluster Setup
Part 2 Producer and Consumer (this post)
Part 3 Kafka Connect
Kafka Client Apps
We create Kafka producer and consumer apps using the kafka-python package.

Kafka Development on Kubernetes - Part 1 Cluster Setup
https://jaehyeon.me/blog/2023-12-21-kafka-development-on-k8s-part-1/
Thu, 21 Dec 2023 00:00:00 +0000
Apache Kafka is one of the key technologies for implementing data streaming architectures. Strimzi provides a way to run an Apache Kafka cluster and related resources on Kubernetes in various deployment configurations. In this series of posts, we will discuss how to create a Kafka cluster, to develop Kafka client applications in Python and to build a data pipeline using Kafka connectors on Kubernetes.
Part 1 Cluster Setup (this post)
Part 2 Producer and Consumer
Part 3 Kafka Connect
Setup Kafka Cluster
The Kafka cluster is deployed using the Strimzi Operator on a Minikube cluster.

Real Time Streaming with Kafka and Flink - Lab 6 Consume data from Kafka using Lambda
https://jaehyeon.me/blog/2023-12-14-real-time-streaming-with-kafka-and-flink-7/
Thu, 14 Dec 2023 00:00:00 +0000
Amazon MSK can be configured as an event source of a Lambda function. Lambda internally polls for new messages from the event source and then synchronously invokes the target Lambda function. With this feature, we can develop a Kafka consumer application in a serverless environment where developers can focus on application logic. In this lab, we will discuss how to create a Kafka consumer using a Lambda function.
Introduction
Lab 1 Produce data to Kafka using Lambda
Lab 2 Write data to Kafka from S3 using Flink
Lab 3 Transform and write data to S3 from Kafka using Flink
Lab 4 Clean, Aggregate, and Enrich Events with Flink
Lab 5 Write data to DynamoDB using Kafka Connect
Lab 6 Consume data from Kafka using Lambda (this post)
Architecture
Fake taxi ride data is sent to a Kafka topic by the Kafka producer application that is discussed in Lab 1.

Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images
https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/
Thu, 07 Dec 2023 00:00:00 +0000
Apache Flink became generally available for Amazon EMR on EKS from the EMR 6.15.0 releases, and we are able to pull the Flink (as well as Spark) container images from the ECR Public Gallery.
As both of them can be integrated with the Glue Data Catalog, it can be particularly useful if we develop real time data ingestion/processing via Flink and build analytical queries using Spark (or any other tools or services that can access the Glue Data Catalog).

Real Time Streaming with Kafka and Flink - Lab 5 Write data to DynamoDB using Kafka Connect
https://jaehyeon.me/blog/2023-11-30-real-time-streaming-with-kafka-and-flink-6/
Thu, 30 Nov 2023 00:00:00 +0000
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. In this lab, we will discuss how to create a data pipeline that ingests data from a Kafka topic into a DynamoDB table using the Camel DynamoDB sink connector.
Introduction
Lab 1 Produce data to Kafka using Lambda
Lab 2 Write data to Kafka from S3 using Flink
Lab 3 Transform and write data to S3 from Kafka using Flink
Lab 4 Clean, Aggregate, and Enrich Events with Flink
Lab 5 Write data to DynamoDB using Kafka Connect (this post)
Lab 6 Consume data from Kafka using Lambda
Architecture
Fake taxi ride data is sent to a Kafka topic by the Kafka producer application that is discussed in Lab 1.

Real Time Streaming with Kafka and Flink - Lab 4 Clean, Aggregate, and Enrich Events with Flink
https://jaehyeon.me/blog/2023-11-23-real-time-streaming-with-kafka-and-flink-5/
Thu, 23 Nov 2023 00:00:00 +0000
The value of data can be maximised when it is used without delay. With Apache Flink, we can build streaming analytics applications that incorporate the latest events with low latency. In this lab, we will create a Pyflink application that writes accumulated taxi rides data into an OpenSearch cluster.
It aggregates the number of trips/passengers and trip durations by vendor ID for a window of 5 seconds. The data is then used to create a chart that monitors the status of taxi rides in the OpenSearch Dashboard.

Real Time Streaming with Kafka and Flink - Lab 3 Transform and write data to S3 from Kafka using Flink
https://jaehyeon.me/blog/2023-11-16-real-time-streaming-with-kafka-and-flink-4/
Thu, 16 Nov 2023 00:00:00 +0000
In this lab, we will create a Pyflink application that exports Kafka topic messages into an S3 bucket. The app enriches the records by adding a new column using a user defined function and writes them via the FileSystem SQL connector. This allows us to achieve a simpler architecture compared to the original lab where the records are sent into Amazon Kinesis Data Firehose, enriched by a separate Lambda function and written to an S3 bucket afterwards.

Real Time Streaming with Kafka and Flink - Lab 2 Write data to Kafka from S3 using Flink
https://jaehyeon.me/blog/2023-11-09-real-time-streaming-with-kafka-and-flink-3/
Thu, 09 Nov 2023 00:00:00 +0000
In this lab, we will create a Pyflink application that reads records from S3 and sends them into a Kafka topic. A custom pipeline Jar file will be created as the Kafka cluster is authenticated by IAM, and it will be demonstrated how to execute the app in a Flink cluster deployed on Docker as well as locally as a typical Python app.
We can assume the S3 data is static metadata that needs to be joined into another stream, and this exercise can be useful for data enrichment.

Kafka Connect for AWS Services Integration - Part 5 Deploy Aiven OpenSearch Sink Connector
https://jaehyeon.me/blog/2023-10-30-kafka-connect-for-aws-part-5/
Mon, 30 Oct 2023 00:00:00 +0000
In the previous post, we discussed how to develop a data pipeline from Apache Kafka into OpenSearch locally using Docker. The pipeline will be deployed on AWS using Amazon MSK, Amazon MSK Connect and Amazon OpenSearch Service using Terraform in this post. First the infrastructure will be deployed that covers a Virtual Private Cloud (VPC), Virtual Private Network (VPN) server, MSK Cluster and OpenSearch domain. Then Kafka source and sink connectors will be deployed on MSK Connect, followed by performing quick data analysis.

Real Time Streaming with Kafka and Flink - Lab 1 Produce data to Kafka using Lambda
https://jaehyeon.me/blog/2023-10-26-real-time-streaming-with-kafka-and-flink-2/
Thu, 26 Oct 2023 00:00:00 +0000
In this lab, we will create a Kafka producer application using AWS Lambda, which sends fake taxi ride data into a Kafka topic on Amazon MSK. A configurable number of producer Lambda functions will be invoked by an Amazon EventBridge schedule rule. In this way we are able to generate test data concurrently based on the desired volume of messages.
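The Lab 1 summary above describes scaling test data volume by running several producers concurrently. As a rough local stand-in for that fan-out (not the Lambda and EventBridge setup from the post), the sketch below uses a thread pool where each worker builds its own batch of fake ride messages. The record fields, function names and the choice of threads are assumptions made for illustration, and nothing is sent to Kafka.

```python
import json
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List


def make_rides(worker_id: int, count: int) -> List[str]:
    """Generate `count` fake ride records as JSON strings (hypothetical fields)."""
    rng = random.Random(worker_id)  # seeded per worker for repeatable output
    return [
        json.dumps({
            "id": f"{worker_id}-{i}",
            "vendor_id": rng.randint(1, 2),
            "passenger_count": rng.randint(1, 6),
        })
        for i in range(count)
    ]


def generate(num_workers: int, per_worker: int) -> List[str]:
    """Fan out to `num_workers` concurrent workers and collect their messages."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        batches = pool.map(make_rides, range(num_workers), [per_worker] * num_workers)
        return [msg for batch in batches for msg in batch]
```

The total volume is simply `num_workers * per_worker`, so either number can be raised to reach a desired message count, which mirrors the idea of invoking a configurable number of producers.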
Introduction
Lab 1 Produce data to Kafka using Lambda (this post)
Lab 2 Write data to Kafka from S3 using Flink
Lab 3 Transform and write data to S3 from Kafka using Flink
Lab 4 Clean, Aggregate, and Enrich Events with Flink
Lab 5 Write data to DynamoDB using Kafka Connect
Lab 6 Consume data from Kafka using Lambda
[Update 2023-11-06] Initially I planned to deploy Pyflink applications on Amazon Managed Service for Apache Flink, but I changed the plan to use a local Flink cluster deployed on Docker.

Kafka Connect for AWS Services Integration - Part 4 Develop Aiven OpenSearch Sink Connector
https://jaehyeon.me/blog/2023-10-23-kafka-connect-for-aws-part-4/
Mon, 23 Oct 2023 00:00:00 +0000
OpenSearch is a popular search and analytics engine and its use cases cover log analytics, real-time application monitoring, and clickstream analysis. OpenSearch can be deployed on its own or via Amazon OpenSearch Service. Apache Kafka is a distributed event store and stream-processing platform, and it aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. On AWS, Apache Kafka can be deployed via Amazon Managed Streaming for Apache Kafka (MSK).

Building Apache Flink Applications in Python
https://jaehyeon.me/blog/2023-10-19-build-pyflink-apps/
Thu, 19 Oct 2023 00:00:00 +0000
Building Apache Flink Applications in Java is a course to introduce Apache Flink through a series of hands-on exercises, and it is provided by Confluent. Utilising the Flink DataStream API, the course develops three Flink applications that populate multiple source data sets, collect them into a standardised data set, and aggregate it to produce usage statistics.
As part of learning the Flink DataStream API in Pyflink, I converted the Java apps into Python equivalents while performing the course exercises in Pyflink.

Real Time Streaming with Kafka and Flink - Introduction
https://jaehyeon.me/blog/2023-10-05-real-time-streaming-with-kafka-and-flink-1/
Thu, 05 Oct 2023 00:00:00 +0000
Real Time Streaming with Amazon Kinesis is an AWS workshop that helps users build a streaming analytics application on AWS. Incoming events are stored in a number of streams of the Amazon Kinesis Data Streams service, and various other AWS services and tools are used to process and analyse data. Apache Kafka is a popular distributed event store and stream processing platform, and it stores incoming events in topics. As part of learning real time streaming analytics on AWS, we can rebuild the analytics applications by replacing the Kinesis streams with Kafka topics.

Getting Started with Pyflink on AWS - Part 3 AWS Managed Flink and MSK
https://jaehyeon.me/blog/2023-09-04-getting-started-with-pyflink-on-aws-part-3/
Mon, 04 Sep 2023 00:00:00 +0000
In this series of posts, we discuss a Flink (Pyflink) application that reads/writes from/to Kafka topics. In the previous posts, I demonstrated a Pyflink app that targets a local Kafka cluster as well as a Kafka cluster on Amazon MSK. The app was executed in a virtual environment as well as in a local Flink cluster for improved monitoring.
In this post, the app will be deployed via Amazon Managed Service for Apache Flink, which is the easiest option to run Flink applications on AWS.

Getting Started with Pyflink on AWS - Part 2 Local Flink and MSK
https://jaehyeon.me/blog/2023-08-28-getting-started-with-pyflink-on-aws-part-2/
Mon, 28 Aug 2023 00:00:00 +0000
In this series of posts, we discuss a Flink (Pyflink) application that reads/writes from/to Kafka topics. In part 1, an app that targets a local Kafka cluster was created. In this post, we will update the app by connecting it to a Kafka cluster on Amazon MSK. The Kafka cluster is authenticated by IAM and the app has an additional jar dependency. As Amazon Managed Service for Apache Flink does not allow you to specify multiple pipeline jar files, we have to build a custom Uber Jar that combines multiple jar files.

Getting Started with Pyflink on AWS - Part 1 Local Flink and Local Kafka
https://jaehyeon.me/blog/2023-08-17-getting-started-with-pyflink-on-aws-part-1/
Thu, 17 Aug 2023 00:00:00 +0000
Apache Flink is an open-source, unified stream-processing and batch-processing framework. Its core is a distributed streaming data-flow engine that you can use to run real-time stream processing on high-throughput data sources. Currently, it is widely used to build applications for fraud/anomaly detection, rule-based alerting, business process monitoring, and continuous ETL to name a few. On AWS, we can deploy a Flink application via Amazon Kinesis Data Analytics (KDA), Amazon EMR and Amazon EKS.

Kafka, Flink and DynamoDB for Real Time Fraud Detection - Part 1 Local Development
https://jaehyeon.me/blog/2023-08-10-fraud-detection-part-1/
Thu, 10 Aug 2023 00:00:00 +0000
Apache Flink is an open-source, unified stream-processing and batch-processing framework.
Its core is a distributed streaming data-flow engine that you can use to run real-time stream processing on high-throughput data sources. Currently, it is widely used to build applications for fraud/anomaly detection, rule-based alerting, business process monitoring, and continuous ETL to name a few. On AWS, we can deploy a Flink application via Amazon Kinesis Data Analytics (KDA), Amazon EMR and Amazon EKS.

Kafka Development with Docker - Part 11 Kafka Authorization
https://jaehyeon.me/blog/2023-07-20-kafka-development-with-docker-part-11/
Thu, 20 Jul 2023 00:00:00 +0000
In the previous posts, we discussed how to implement client authentication by TLS (SSL or TLS/SSL) and SASL authentication. One of the key benefits of client authentication is achieving user access control. Kafka ships with a pluggable, out-of-the-box authorization framework, which is configured with the authorizer.class.name property in the server configuration and stores Access Control Lists (ACLs) in the cluster metadata (either Zookeeper or the KRaft metadata log). In this post, we will discuss how to configure Kafka authorization with Java and Python client examples while SASL is kept for client authentication.

Kafka Development with Docker - Part 10 SASL Authentication
https://jaehyeon.me/blog/2023-07-13-kafka-development-with-docker-part-10/
Thu, 13 Jul 2023 00:00:00 +0000
In the previous post, we discussed TLS (SSL or TLS/SSL) authentication to improve security. It enforces two-way verification where a client certificate is verified by Kafka brokers. Client authentication can also be enabled by Simple Authentication and Security Layer (SASL), and we will discuss how to implement SASL authentication with Java and Python client examples in this post.
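For the Python side of SASL authentication over an encrypted connection, a client built with the kafka-python package takes its security settings as keyword arguments. The helper below is a hedged sketch that only assembles such a settings dictionary. The function name is hypothetical, and the SCRAM-SHA-256 mechanism, broker address and file path in the usage comment are assumptions, since the summary does not state which mechanism the post uses.

```python
from typing import Dict


def sasl_client_config(bootstrap_servers: str, username: str,
                       password: str, ca_file: str) -> Dict[str, str]:
    """Build keyword arguments for a kafka-python producer or consumer."""
    return {
        "bootstrap_servers": bootstrap_servers,
        "security_protocol": "SASL_SSL",    # SASL authentication over TLS
        "sasl_mechanism": "SCRAM-SHA-256",  # assumption, not taken from the post
        "sasl_plain_username": username,
        "sasl_plain_password": password,
        "ssl_cafile": ca_file,              # CA certificate used to verify brokers
    }


# Usage with kafka-python installed (not executed here):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(
#       **sasl_client_config("localhost:9093", "client", "secret", "ca-root.pem")
#   )
```

Keeping the settings in one place means the same dictionary can be passed to both `KafkaProducer` and `KafkaConsumer`, so the two clients cannot drift apart in their security configuration.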
Part 1 Cluster Setup
Part 2 Management App
Part 3 Kafka Connect
Part 4 Producer and Consumer
Part 5 Glue Schema Registry
Part 6 Kafka Connect with Glue Schema Registry
Part 7 Producer and Consumer with Glue Schema Registry
Part 8 SSL Encryption
Part 9 SSL Authentication
Part 10 SASL Authentication (this post)
Part 11 Kafka Authorization
Certificate Setup
As we will keep Kafka communication encrypted, we need to keep the components for SSL encryption.

Kafka Development with Docker - Part 9 SSL Authentication
https://jaehyeon.me/blog/2023-07-06-kafka-development-with-docker-part-9/
Thu, 06 Jul 2023 00:00:00 +0000
In the previous post, we discussed how to configure TLS (SSL or TLS/SSL) encryption with Java and Python client examples. SSL encryption is a one-way verification process where a server certificate is verified by a client via SSL Handshake. To improve security, we can add client authentication either by enforcing two-way verification, where a client certificate is verified by Kafka brokers (SSL authentication), or by choosing a separate authentication mechanism, which is typically Simple Authentication and Security Layer (SASL).

Kafka Development with Docker - Part 8 SSL Encryption
https://jaehyeon.me/blog/2023-06-29-kafka-development-with-docker-part-8/
Thu, 29 Jun 2023 00:00:00 +0000
By default, Apache Kafka communicates in PLAINTEXT, which means that all data is sent without being encrypted. To secure communication, we can configure Kafka clients and other components to use Transport Layer Security (TLS) encryption. Note that TLS is also referred to as Secure Sockets Layer (SSL) or TLS/SSL. SSL is the predecessor of TLS, and has been deprecated since June 2015.
However, it is used in configuration and code instead of TLS for historical reasons.

Kafka Development with Docker - Part 7 Producer and Consumer with Glue Schema Registry
https://jaehyeon.me/blog/2023-06-22-kafka-development-with-docker-part-7/
Thu, 22 Jun 2023 00:00:00 +0000
In Part 4, we developed Kafka producer and consumer applications using the kafka-python package. The Kafka messages are serialized as JSON, but are not associated with a schema as there was not an integrated schema registry. Later we discussed how producers and consumers of Kafka topics can use schemas to ensure data consistency and compatibility as schemas evolve in Part 5. In this post, I’ll demonstrate how to enhance the existing applications by integrating AWS Glue Schema Registry.

Kafka Development with Docker - Part 6 Kafka Connect with Glue Schema Registry
https://jaehyeon.me/blog/2023-06-15-kafka-development-with-docker-part-6/
Thu, 15 Jun 2023 00:00:00 +0000
In Part 3, we developed a data ingestion pipeline with fake online order data using Kafka Connect source and sink connectors. Schemas are not enabled on either of them as there was not an integrated schema registry. Later we discussed how producers and consumers of Kafka topics can use schemas to ensure data consistency and compatibility as schemas evolve in Part 5. In this post, I’ll demonstrate how to enhance the existing data ingestion pipeline by integrating AWS Glue Schema Registry.

Kafka Connect for AWS Services Integration - Part 2 Develop Camel DynamoDB Sink Connector
https://jaehyeon.me/blog/2023-06-04-kafka-connect-for-aws-part-2/
Sun, 04 Jun 2023 00:00:00 +0000
In Part 1, we reviewed Kafka connectors focusing on AWS services integration.
Among the available connectors, the suite of Apache Camel Kafka connectors and the Kinesis Kafka connector from the AWS Labs can be effective for building data ingestion pipelines on AWS. In this post, I will illustrate how to develop the Camel DynamoDB sink connector using Docker. Fake order data will be generated using the MSK Data Generator source connector, and the sink connector will be configured to consume the topic messages to ingest them into a DynamoDB table.

Kafka Development with Docker - Part 4 Producer and Consumer
https://jaehyeon.me/blog/2023-06-01-kafka-development-with-docker-part-4/
Thu, 01 Jun 2023 00:00:00 +0000
In the previous post, we discussed Kafka Connect to stream data to/from a Kafka cluster. Kafka also includes the Producer/Consumer APIs that allow client applications to send/read streams of data to/from topics in a Kafka cluster. While the main Kafka project maintains only the Java clients, there are several open source projects that provide the Kafka client APIs in Python. In this post, I’ll demonstrate how to develop producer/consumer applications using the kafka-python package.

Kafka Development with Docker - Part 3 Kafka Connect
https://jaehyeon.me/blog/2023-05-25-kafka-development-with-docker-part-3/
Thu, 25 May 2023 00:00:00 +0000
According to the documentation of Apache Kafka, Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. Kafka Connect supports two types of connectors - source and sink.
Source connectors are used to ingest messages from external systems into Kafka topics while messages are ingested into external systems from Kafka topics with sink connectors.

Kafka Development with Docker - Part 2 Management App
https://jaehyeon.me/blog/2023-05-18-kafka-development-with-docker-part-2/
Thu, 18 May 2023 00:00:00 +0000
In the previous post, I illustrated how to create a topic and to produce/consume messages using the command utilities provided by Apache Kafka. They are not convenient, however, when, for example, you consume serialised messages whose schemas are stored in a schema registry. Also, the utilities don’t support browsing or managing related resources such as connectors and schemas. Therefore, a Kafka management app can be a good companion for development, which helps monitor and manage resources on an easy-to-use user interface.

Kafka Development with Docker - Part 1 Cluster Setup
https://jaehyeon.me/blog/2023-05-04-kafka-development-with-docker-part-1/
Thu, 04 May 2023 00:00:00 +0000
I’m teaching myself modern data streaming architectures on AWS, and Apache Kafka is one of the key technologies, which can be used for messaging, activity tracking, stream processing and so on. While applications tend to be deployed to the cloud, it can be much easier if we develop and test those with Docker and Docker Compose locally. As the series title indicates, I plan to publish articles that demonstrate Kafka and related tools in Dockerized environments.

Integrate Glue Schema Registry with Your Python Kafka App
https://jaehyeon.me/blog/2023-04-12-integrate-glue-schema-registry/
Wed, 12 Apr 2023 00:00:00 +0000
As Kafka producer and consumer apps are decoupled, they operate on Kafka topics rather than communicating with each other directly.
As described in the Confluent document, Schema Registry provides a centralized repository for managing and validating schemas for topic message data, and for serialization and deserialization of the data over the network. Producers and consumers to Kafka topics can use schemas to ensure data consistency and compatibility as schemas evolve. In AWS, the Glue Schema Registry supports features to manage and enforce schemas on data streaming applications using convenient integrations with Apache Kafka, Amazon Managed Streaming for Apache Kafka, Amazon Kinesis Data Streams, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda.

How to configure Kafka consumers to seek offsets by timestamp
https://jaehyeon.me/blog/2023-01-10-kafka-consumer-seek-offsets/
Tue, 10 Jan 2023 00:00:00 +0000
Normally we consume Kafka messages from the beginning/end of a topic or last committed offsets. For backfilling or troubleshooting, however, we need to consume messages from a certain timestamp occasionally. If we know which topic partition to choose e.g. by assigning a topic partition, we can easily override the fetch offset to a specific timestamp. When we deploy multiple consumer instances together, however, we make them subscribe to a topic and topic partitions are dynamically assigned, which means we cannot determine which fetch offset to use for an instance.

Revisit AWS Lambda Invoke Function Operator of Apache Airflow
https://jaehyeon.me/blog/2022-08-06-revisit-lambda-operator/
Sat, 06 Aug 2022 00:00:00 +0000
Apache Airflow is a popular workflow management platform. A wide range of AWS services are integrated with the platform by Amazon AWS Operators. AWS Lambda is one of the integrated services, and it can be used to develop workflows efficiently.
The current Lambda Operator, however, just invokes a Lambda function, and it can fail to report the invocation result of a function correctly and to record the exact error message from failure.

Data Warehousing ETL Demo with Apache Iceberg on EMR Local Environment
https://jaehyeon.me/blog/2022-06-26-iceberg-etl-demo/
Sun, 26 Jun 2022 00:00:00 +0000
Unlike traditional data lakes, new table formats (Iceberg, Hudi and Delta Lake) support features that can be used to apply data warehousing patterns, which offers a way out of the data swamp. In this post, we’ll discuss how to implement ETL using retail analytics data. The data has two dimension data sets (user and product) and a single fact data set (order). The dimension data sets have different ETL strategies depending on whether to track historical changes.

Develop and Test Apache Spark Apps for EMR Locally Using Docker
https://jaehyeon.me/blog/2022-05-08-emr-local-dev/
Sun, 08 May 2022 00:00:00 +0000
[UPDATE 2023-12-07] I wrote a new post that simplifies the Spark configuration dramatically. Besides, the log configuration is based on Log4J2, which applies to newer Spark versions. Moreover, the container is configured to run the Spark History Server, and it allows us to debug and diagnose completed and running Spark applications. I recommend referring to the new post.
Amazon EMR is a managed service that simplifies running Apache Spark on AWS.

Use External Schema Registry with MSK Connect – Part 2 MSK Deployment
https://jaehyeon.me/blog/2022-04-03-schema-registry-part2/
Sun, 03 Apr 2022 00:00:00 +0000
In the previous post, we discussed a Change Data Capture (CDC) solution with a schema registry. A local development environment is set up using Docker Compose.
The Debezium and Confluent S3 connectors are deployed with the Confluent Avro converter and the Apicurio registry is used as the schema registry service. A quick example is shown to illustrate how schema evolution can be managed by the schema registry. In this post, we’ll build the solution on AWS using MSK, MSK Connect, Aurora PostgreSQL and ECS.

Use External Schema Registry with MSK Connect – Part 1 Local Development
https://jaehyeon.me/blog/2022-03-07-schema-registry-part1/
Mon, 07 Mar 2022 00:00:00 +0000
When we discussed a Change Data Capture (CDC) solution in one of the earlier posts, we used the JSON converter that comes with Kafka Connect. We optionally enabled the key and value schemas and the topic messages include those schemas together with payload. It seems to be convenient at first as the messages are saved into S3 on their own. However, it became cumbersome when we tried to use the DeltaStreamer utility.

Data Lake Demo using Change Data Capture (CDC) on AWS – Part 3 Implement Data Lake
https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/
Sun, 19 Dec 2021 00:00:00 +0000
In the previous post, we created a VPC that has private and public subnets in 2 availability zones in order to build and deploy the data lake solution on AWS. NAT instances are created to forward outbound traffic to the internet and a VPN bastion host is set up to facilitate deployment.
An Aurora PostgreSQL cluster is deployed to host the source database and a Python command line app is used to create the database.

Data Lake Demo using Change Data Capture (CDC) on AWS – Part 2 Implement CDC
https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/
Sun, 12 Dec 2021 00:00:00 +0000
In the previous post, we discussed a data lake solution where data ingestion is performed using change data capture (CDC) and the output files are upserted to an Apache Hudi table. Being registered to Glue Data Catalog, it can be used for ad-hoc queries and report/dashboard creation. The Northwind database is used as the source database and, following the transactional outbox pattern, order-related changes are upserted to an outbox table by triggers.

Data Lake Demo using Change Data Capture (CDC) on AWS – Part 1 Local Development
https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/
Sun, 05 Dec 2021 00:00:00 +0000
Change data capture (CDC) is a proven data integration pattern that has a wide range of applications. Among those, data replication to data lakes is a good use case in data engineering. Coupled with best-in-breed data lake formats such as Apache Hudi, we can build an efficient data replication solution. This is the first post of the data lake demo series. Over time, we’ll build a data lake that uses CDC.

Local Development of AWS Glue 3.0 and Later
https://jaehyeon.me/blog/2021-11-14-glue-3-local-development/
Sun, 14 Nov 2021 00:00:00 +0000
In an earlier post, I demonstrated how to set up a local development environment for AWS Glue 1.0 and 2.0 using a docker image that is published by the AWS Glue team and the Visual Studio Code Remote – Containers extension. Recently AWS Glue 3.0 was released, but a docker image for this version is not published.
In this post, I’ll illustrate how to create a development environment for AWS Glue 3.AWS Glue Local Development with Docker and Visual Studio Codehttps://jaehyeon.me/blog/2021-08-20-glue-local-development/Fri, 20 Aug 2021 00:00:00 +0000https://jaehyeon.me/blog/2021-08-20-glue-local-development/As described on the product page, AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. For development, a development endpoint is recommended, but it can be costly, inconvenient or unavailable (for Glue 2.0). The AWS Glue team published a Docker image that includes the AWS Glue binaries and all the dependencies packaged together. After inspecting it, I found that some modifications are necessary in order to build a development environment on it.Thoughts on Apache Airflow AWS Lambda Operatorhttps://jaehyeon.me/blog/2020-04-13-airflow-lambda-operator/Mon, 13 Apr 2020 00:00:00 +0000https://jaehyeon.me/blog/2020-04-13-airflow-lambda-operator/Apache Airflow is a popular open-source workflow management platform. Typically tasks are run remotely by Celery workers for scalability. In AWS, however, scalability can also be achieved in a simpler way using serverless computing services. For example, the ECS Operator allows running dockerized tasks and, with the Fargate launch type, they can run in a serverless environment. The ECS Operator alone is not sufficient because it can take up to several minutes to pull a Docker image and to set up a network interface (in the case of the Fargate launch type).Dynamic Routing and Centralized Auth with Traefik, Python and R Examplehttps://jaehyeon.me/blog/2019-11-29-traefik-example/Fri, 29 Nov 2019 00:00:00 +0000https://jaehyeon.me/blog/2019-11-29-traefik-example/Ingress in Kubernetes exposes HTTP and HTTPS routes from outside the cluster to services within the cluster. 
By setting rules, it routes requests to appropriate services (more precisely, requests are sent to individual Pods by the Ingress Controller). Rules can be set up dynamically and I find this more efficient than a traditional reverse proxy. Traefik is a modern HTTP reverse proxy and load balancer and it can be used as a Kubernetes Ingress Controller.Distributed Task Queue with Python and R Examplehttps://jaehyeon.me/blog/2019-11-15-task-queue/Fri, 15 Nov 2019 00:00:00 +0000https://jaehyeon.me/blog/2019-11-15-task-queue/While I was looking into Apache Airflow, a workflow management tool, I thought it would be beneficial to get some understanding of how Celery works. To do so, I built a simple web service that sends tasks to Celery workers and collects the results from them. FastAPI is used for developing the web service and Redis is used for the message broker and result backend. During the development, I thought it would be possible to implement similar functionality in R with Rserve.Linux Dev Environment on Windowshttps://jaehyeon.me/blog/2019-11-01-linux-on-windows/Fri, 01 Nov 2019 00:00:00 +0000https://jaehyeon.me/blog/2019-11-01-linux-on-windows/I use Linux containers a lot for development. Having Windows computers at home and work, I used to use Linux VMs on VirtualBox or VMware Workstation. It’s not a bad option but it requires a lot of resources. Recently, after my home computer was updated, I was not able to start my hypervisor anymore. I also didn’t like its heavy resource consumption, so I began to look for a different development environment.AWS Local Development with LocalStackhttps://jaehyeon.me/blog/2019-07-20-aws-localstack/Sat, 20 Jul 2019 00:00:00 +0000https://jaehyeon.me/blog/2019-07-20-aws-localstack/LocalStack provides an easy-to-use test/mocking framework for developing AWS applications. In this post, I’ll demonstrate how to utilize LocalStack for development using a web service. Specifically, a simple web service built with Flask-RestPlus is used. 
It supports simple CRUD operations against a database table. SQS and Lambda are used for creating and updating a record. When a POST or PUT request is made, the service sends a message to an SQS queue and immediately returns a 204 response.Cronicle Multi Server Setuphttps://jaehyeon.me/blog/2019-07-19-cronicle-multi-server-setup/Fri, 19 Jul 2019 00:00:00 +0000https://jaehyeon.me/blog/2019-07-19-cronicle-multi-server-setup/According to the project GitHub repository, Cronicle is a multi-server task scheduler and runner, with a web based front-end UI. It handles both scheduled, repeating and on-demand jobs, targeting any number of slave servers, with real-time stats and live log viewer. By default, Cronicle is configured to launch a single master server, which controls task scheduling. For high availability, it is important that another server takes the role of master when the existing master server fails.