<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Data Engineering on Jaehyeon Kim</title><link>https://jaehyeon.me/categories/data-engineering/</link><description>Recent content in Data Engineering on Jaehyeon Kim</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Copyright © 2023-2026 Jaehyeon Kim. All Rights Reserved.</copyright><lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://jaehyeon.me/categories/data-engineering/index.xml" rel="self" type="application/rss+xml"/><item><title>Building an Event-Driven Hybrid Digital Twin with dynamic-des</title><link>https://jaehyeon.me/blog/2026-04-28-digital-twin-dynamic-des/</link><pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2026-04-28-digital-twin-dynamic-des/</guid><description>Asynchronous Gap In Part 1, we established that a true Hybrid Digital Twin does more than just mirror reality. It actively forecasts the future by running a simulation against live operational states.
If you have ever tried to build one of these systems from scratch, you immediately hit a fundamental architectural clash.
Standard simulation clocks (like those in traditional SimPy implementations) are logically synchronous and not designed to handle high-frequency asynchronous I/O without explicit decoupling.</description><enclosure url="https://jaehyeon.me/blog/2026-04-28-digital-twin-dynamic-des/featured.png" length="177036" type="image/png"/></item><item><title>Self-service Data Platform via a Multi-tenant SQL Gateway</title><link>https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/</link><pubDate>Thu, 17 Jul 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/</guid><description>In the modern data stack, providing direct access to powerful engines like Apache Spark and Flink is a double-edged sword. While it empowers users, it often leads to chaos: resource contention from &amp;ldquo;noisy neighbors,&amp;rdquo; inconsistent security enforcement, and operational fragility. The core problem is the lack of a robust control plane between users and the raw compute power. The solution, therefore, isn&amp;rsquo;t to take power away from users, but to manage it through an intelligent intermediary.</description><enclosure url="https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/featured.png" length="55011" type="image/png"/></item><item><title>Meet the Streamhouse Trio - Paimon, Fluss, and Iceberg for Unified Data Architectures</title><link>https://jaehyeon.me/blog/2025-05-06-streamhouse-trio/</link><pubDate>Tue, 06 May 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-05-06-streamhouse-trio/</guid><description><![CDATA[<p>The world of data is converging. The traditional divide between batch processing for historical analytics and stream processing for real-time insights is becoming increasingly blurry. Businesses demand architectures that handle both seamlessly. 
Enter the &ldquo;Streamhouse&rdquo; - an evolution of the Lakehouse concept, designed with streaming as a first-class citizen.</p>
<p>Today, we&rsquo;ll introduce three key open-source technologies shaping this space: <a href="https://paimon.apache.org/" target="_blank" rel="noopener noreferrer"><strong>Apache Paimon™</strong><i class="fas fa-external-link-square-alt ms-1"></i></a>, <a href="https://alibaba.github.io/fluss-docs/" target="_blank" rel="noopener noreferrer"><strong>Fluss</strong><i class="fas fa-external-link-square-alt ms-1"></i></a>, and <a href="https://iceberg.apache.org/" target="_blank" rel="noopener noreferrer"><strong>Apache Iceberg</strong><i class="fas fa-external-link-square-alt ms-1"></i></a>. While each has unique strengths, their true power lies in how they can be integrated to build robust, flexible, and performant data platforms.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2025-05-06-streamhouse-trio/featured.png" length="288793" type="image/png"/></item><item><title>Guide to Running DBT in Production</title><link>https://jaehyeon.me/blog/2024-09-13-dbt-guide/</link><pubDate>Fri, 13 Sep 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-09-13-dbt-guide/</guid><description><![CDATA[<p>In the <a href="/blog/2024-09-05-dbt-cicd-demo">previous post</a>, we started discussing a <em>continuous integration/continuous delivery (CI/CD)</em> process of a <em>dbt</em> project by introducing two GitHub Actions workflows - <code>slim-ci</code> and <code>deploy</code>. The former is triggered when a pull request is opened against the main branch, and it builds only modified models and their first-order children in a <em>ci</em> dataset, followed by performing tests on them. The second workflow gets triggered once a pull request is merged. Beginning with running unit tests, it packages the <em>dbt</em> project as a Docker container and publishes it to <em>Artifact Registry</em>. 
In this post, we focus on how to deploy a <em>dbt</em> project in multiple environments while walking through the entire CI/CD process step-by-step.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-09-13-dbt-guide/featured.png" length="71185" type="image/png"/></item><item><title>DBT CI/CD Demo with BigQuery and GitHub Actions</title><link>https://jaehyeon.me/blog/2024-09-05-dbt-cicd-demo/</link><pubDate>Thu, 05 Sep 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-09-05-dbt-cicd-demo/</guid><description><![CDATA[<p>Continuous integration (CI) is the process of ensuring new code integrates with the larger code base, and it puts a great emphasis on testing automation to check that the application is not broken whenever new commits are integrated into the main branch. Continuous delivery (CD) is an extension of continuous integration since it automatically deploys all code changes to a testing and/or production environment after the build stage. CI/CD helps development teams avoid bugs and code failures while maintaining a continuous cycle of software development and updates. 
In this post, we discuss how to set up a CI/CD pipeline for a <a href="https://www.getdbt.com/" target="_blank" rel="noopener noreferrer">data build tool (<em>dbt</em>)<i class="fas fa-external-link-square-alt ms-1"></i></a> project using <a href="https://github.com/features/actions" target="_blank" rel="noopener noreferrer">GitHub Actions<i class="fas fa-external-link-square-alt ms-1"></i></a> where <a href="https://cloud.google.com/bigquery?hl=en" target="_blank" rel="noopener noreferrer">BigQuery<i class="fas fa-external-link-square-alt ms-1"></i></a> is used as the target data warehouse.</p>]]></description><enclosure url="https://jaehyeon.me/blog/2024-09-05-dbt-cicd-demo/featured.png" length="60835" type="image/png"/></item><item><title>Apache Beam Local Development with Python - Part 5 Testing Pipelines</title><link>https://jaehyeon.me/blog/2024-05-09-beam-local-dev-5/</link><pubDate>Thu, 09 May 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-05-09-beam-local-dev-5/</guid><description>We developed batch and streaming pipelines in Part 2 and Part 4. Often it is faster and simpler to identify and fix bugs on the pipeline code by performing local unit testing. Moreover, especially when it comes to creating a streaming pipeline, unit testing cases can facilitate development further by using TestStream as it allows us to advance watermarks or processing time according to different scenarios. 
In this post, we discuss how to perform unit testing of the batch and streaming pipelines that we developed earlier.</description><enclosure url="https://jaehyeon.me/blog/2024-05-09-beam-local-dev-5/featured.png" length="53603" type="image/png"/></item><item><title>Data Build Tool (dbt) Pizza Shop Demo - Part 6 ETL on Amazon Athena via Airflow</title><link>https://jaehyeon.me/blog/2024-03-14-dbt-pizza-shop-6/</link><pubDate>Thu, 14 Mar 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-03-14-dbt-pizza-shop-6/</guid><description>In Part 5, we developed a dbt project that targets Apache Iceberg where transformations are performed on Amazon Athena. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. To improve query performance, the fact table is denormalized to pre-join records from the dimension tables using the array and struct data types.</description><enclosure url="https://jaehyeon.me/blog/2024-03-14-dbt-pizza-shop-6/featured.png" length="82921" type="image/png"/></item><item><title>Data Build Tool (dbt) Pizza Shop Demo - Part 5 Modelling on Amazon Athena</title><link>https://jaehyeon.me/blog/2024-03-07-dbt-pizza-shop-5/</link><pubDate>Thu, 07 Mar 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-03-07-dbt-pizza-shop-5/</guid><description>In Part 1 and Part 3, we developed data build tool (dbt) projects that target PostgreSQL and BigQuery using fictional pizza shop data. The data is modelled by SCD type 2 dimension tables and one transactional fact table. 
While the order records should be joined with dimension tables to get complete details for PostgreSQL, the fact table is denormalized using nested and repeated fields to improve query performance for BigQuery.</description><enclosure url="https://jaehyeon.me/blog/2024-03-07-dbt-pizza-shop-5/featured.png" length="61499" type="image/png"/></item><item><title>Data Build Tool (dbt) Pizza Shop Demo - Part 4 ETL on BigQuery via Airflow</title><link>https://jaehyeon.me/blog/2024-02-22-dbt-pizza-shop-4/</link><pubDate>Thu, 22 Feb 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-02-22-dbt-pizza-shop-4/</guid><description>In Part 3, we developed a dbt project that targets Google BigQuery with fictional pizza shop data. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. The fact table is denormalized using nested and repeated fields for improving query performance. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.</description><enclosure url="https://jaehyeon.me/blog/2024-02-22-dbt-pizza-shop-4/featured.png" length="89588" type="image/png"/></item><item><title>Data Build Tool (dbt) Pizza Shop Demo - Part 3 Modelling on BigQuery</title><link>https://jaehyeon.me/blog/2024-02-08-dbt-pizza-shop-3/</link><pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-02-08-dbt-pizza-shop-3/</guid><description>In this series, we discuss practical examples of data warehouse and lakehouse development where data transformation is performed by the data build tool (dbt) and ETL is managed by Apache Airflow. In Part 1, we developed a dbt project on PostgreSQL using fictional pizza shop data. At the end, the data sets are modelled by two SCD type 2 dimension tables and one transactional fact table. 
In this post, we create a new dbt project that targets Google BigQuery.</description><enclosure url="https://jaehyeon.me/blog/2024-02-08-dbt-pizza-shop-3/featured.png" length="70297" type="image/png"/></item><item><title>Data Build Tool (dbt) Pizza Shop Demo - Part 2 ETL on PostgreSQL via Airflow</title><link>https://jaehyeon.me/blog/2024-01-25-dbt-pizza-shop-2/</link><pubDate>Thu, 25 Jan 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-01-25-dbt-pizza-shop-2/</guid><description>In this series of posts, we discuss data warehouse/lakehouse examples using data build tool (dbt) including ETL orchestration with Apache Airflow. In Part 1, we developed a dbt project on PostgreSQL with fictional pizza shop data. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.</description><enclosure url="https://jaehyeon.me/blog/2024-01-25-dbt-pizza-shop-2/featured.png" length="77355" type="image/png"/></item><item><title>Data Build Tool (dbt) Pizza Shop Demo - Part 1 Modelling on PostgreSQL</title><link>https://jaehyeon.me/blog/2024-01-18-dbt-pizza-shop-1/</link><pubDate>Thu, 18 Jan 2024 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2024-01-18-dbt-pizza-shop-1/</guid><description>The data build tool (dbt) is a popular data transformation tool for data warehouse development. Moreover, it can be used for data lakehouse development thanks to open table formats such as Apache Iceberg, Apache Hudi and Delta Lake. dbt supports key AWS analytics services and I wrote a series of posts that discuss how to utilise dbt with Redshift, Glue, EMR on EC2, EMR on EKS, and Athena. 
Those posts focus on platform integration; however, they do not show realistic ETL scenarios.</description><enclosure url="https://jaehyeon.me/blog/2024-01-18-dbt-pizza-shop-1/featured.png" length="85093" type="image/png"/></item><item><title>Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images</title><link>https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/</link><pubDate>Thu, 07 Dec 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Flink became generally available for Amazon EMR on EKS from the EMR 6.15.0 releases, and we are able to pull the Flink (as well as Spark) container images from the ECR Public Gallery.</description><enclosure url="https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/featured.png" length="133053" type="image/png"/></item><item><title>Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 5 Athena</title><link>https://jaehyeon.me/blog/2022-12-06-dbt-on-aws-part-5-athena/</link><pubDate>Tue, 06 Dec 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-12-06-dbt-on-aws-part-5-athena/</guid><description>The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In the previous posts, we discussed benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. Demo data projects that target Redshift Serverless, Glue, EMR on EC2 and EMR on EKS are illustrated as well.</description><enclosure url="https://jaehyeon.me/blog/2022-12-06-dbt-on-aws-part-5-athena/featured.png" length="91796" type="image/png"/></item><item><title>Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 4 EMR on EKS</title><link>https://jaehyeon.me/blog/2022-11-01-dbt-on-aws-part-4-emr-eks/</link><pubDate>Tue, 01 Nov 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-11-01-dbt-on-aws-part-4-emr-eks/</guid><description>The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. 
In the previous posts, we discussed benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. Demo data projects that target Redshift Serverless, Glue and EMR on EC2 are illustrated as well.</description><enclosure url="https://jaehyeon.me/blog/2022-11-01-dbt-on-aws-part-4-emr-eks/featured.png" length="91067" type="image/png"/></item><item><title>Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 3 EMR on EC2</title><link>https://jaehyeon.me/blog/2022-10-19-dbt-on-aws-part-3-emr-ec2/</link><pubDate>Wed, 19 Oct 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-10-19-dbt-on-aws-part-3-emr-ec2/</guid><description>The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In the previous posts, we discussed benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. Demo data projects that target Redshift Serverless and Glue are illustrated as well.</description><enclosure url="https://jaehyeon.me/blog/2022-10-19-dbt-on-aws-part-3-emr-ec2/featured.png" length="91067" type="image/png"/></item><item><title>Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 2 Glue</title><link>https://jaehyeon.me/blog/2022-10-09-dbt-on-aws-part-2-glue/</link><pubDate>Sun, 09 Oct 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-10-09-dbt-on-aws-part-2-glue/</guid><description>The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In part 1, we discussed benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. 
A demo data project that targets Redshift Serverless is illustrated as well. In part 2 of the dbt on AWS series, we discuss data transformation pipelines using dbt on AWS Glue.</description><enclosure url="https://jaehyeon.me/blog/2022-10-09-dbt-on-aws-part-2-glue/featured.png" length="90647" type="image/png"/></item><item><title>Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 1 Redshift</title><link>https://jaehyeon.me/blog/2022-09-28-dbt-on-aws-part-1-redshift/</link><pubDate>Wed, 28 Sep 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-09-28-dbt-on-aws-part-1-redshift/</guid><description>The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In part 1 of the dbt on AWS series, we discuss data transformation pipelines using dbt on Redshift Serverless. Subsets of IMDb data are used as source and data models are developed in multiple layers according to the dbt best practices.
Part 1 Redshift (this post) Part 2 Glue Part 3 EMR on EC2 Part 4 EMR on EKS Part 5 Athena Motivation In our experience delivering data solutions for our customers, we have observed a desire to move away from a centralised team function, responsible for the data collection, analysis and reporting, towards shifting this responsibility to an organisation&amp;rsquo;s lines of business (LOB) teams.</description><enclosure url="https://jaehyeon.me/blog/2022-09-28-dbt-on-aws-part-1-redshift/featured.png" length="97234" type="image/png"/></item><item><title>Develop and Test Apache Spark Apps for EMR Remotely Using Visual Studio Code</title><link>https://jaehyeon.me/blog/2022-09-07-emr-remote-dev/</link><pubDate>Wed, 07 Sep 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-09-07-emr-remote-dev/</guid><description>When we develop a Spark application on EMR, we can use docker for local development or notebooks via EMR Studio (or EMR Notebooks). However, the local development option is not viable if the size of data is large. Also, I am not a fan of notebooks as it is not possible to utilise the features my editor supports such as syntax highlighting, autocomplete and code formatting. Moreover, it is not possible to organise code into modules and to perform unit testing properly with that option.</description><enclosure url="https://jaehyeon.me/blog/2022-09-07-emr-remote-dev/featured.png" length="72448" type="image/png"/></item><item><title>Manage EMR on EKS with Terraform</title><link>https://jaehyeon.me/blog/2022-08-26-emr-on-eks-with-terraform/</link><pubDate>Fri, 26 Aug 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-08-26-emr-on-eks-with-terraform/</guid><description>Amazon EMR on EKS is a deployment option for Amazon EMR that allows you to automate the provisioning and management of open-source big data frameworks on EKS. 
While eksctl is popular for working with Amazon EKS clusters, it has limitations when it comes to building infrastructure that integrates multiple AWS services. Also, it is not straightforward to update EKS cluster resources incrementally with it. On the other hand, Terraform can be an effective tool for managing infrastructure that includes not only EKS and EMR virtual clusters but also other AWS resources.</description><enclosure url="https://jaehyeon.me/blog/2022-08-26-emr-on-eks-with-terraform/featured.png" length="67936" type="image/png"/></item><item><title>Revisit AWS Lambda Invoke Function Operator of Apache Airflow</title><link>https://jaehyeon.me/blog/2022-08-06-revisit-lambda-operator/</link><pubDate>Sat, 06 Aug 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-08-06-revisit-lambda-operator/</guid><description>Apache Airflow is a popular workflow management platform. A wide range of AWS services are integrated with the platform by Amazon AWS Operators. AWS Lambda is one of the integrated services, and it can be used to develop workflows efficiently. The current Lambda Operator, however, just invokes a Lambda function, and it can fail to report the invocation result of a function correctly and to record the exact error message from failure.</description><enclosure url="https://jaehyeon.me/blog/2022-08-06-revisit-lambda-operator/featured.png" length="24814" type="image/png"/></item><item><title>Data Warehousing ETL Demo with Apache Iceberg on EMR Local Environment</title><link>https://jaehyeon.me/blog/2022-06-26-iceberg-etl-demo/</link><pubDate>Sun, 26 Jun 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-06-26-iceberg-etl-demo/</guid><description>Unlike a traditional data lake, new table formats (Iceberg, Hudi and Delta Lake) support features that can be used to apply data warehousing patterns, which offers a way out of the data swamp. 
In this post, we&amp;rsquo;ll discuss how to implement ETL using retail analytics data. It has two dimension data sets (user and product) and a single fact data set (order). The dimension data sets have different ETL strategies depending on whether historical changes are tracked.</description><enclosure url="https://jaehyeon.me/blog/2022-06-26-iceberg-etl-demo/featured.png" length="43604" type="image/png"/></item><item><title>Develop and Test Apache Spark Apps for EMR Locally Using Docker</title><link>https://jaehyeon.me/blog/2022-05-08-emr-local-dev/</link><pubDate>Sun, 08 May 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-05-08-emr-local-dev/</guid><description>[UPDATE 2023-12-07]
I wrote a new post that simplifies the Spark configuration dramatically. Besides, the log configuration is based on Log4J2, which applies to newer Spark versions. Moreover, the container is configured to run the Spark History Server, and it allows us to debug and diagnose completed and running Spark applications. I recommend referring to the new post. [UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository.</description><enclosure url="https://jaehyeon.me/blog/2022-05-08-emr-local-dev/featured.png" length="25693" type="image/png"/></item><item><title>EMR on EKS by Example</title><link>https://jaehyeon.me/blog/2022-01-17-emr-on-eks-by-example/</link><pubDate>Mon, 17 Jan 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-01-17-emr-on-eks-by-example/</guid><description>EMR on EKS provides a deployment option for Amazon EMR that allows you to automate the provisioning and management of open-source big data frameworks on Amazon EKS. While a wide range of open source big data components are available in EMR on EC2, only Apache Spark is available in EMR on EKS. It is more flexible, however, that applications of different EMR versions can be run in multiple availability zones on either EC2 or Fargate.</description><enclosure url="https://jaehyeon.me/blog/2022-01-17-emr-on-eks-by-example/featured.png" length="76740" type="image/png"/></item><item><title>Data Lake Demo using Change Data Capture (CDC) on AWS – Part 3 Implement Data Lake</title><link>https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/</link><pubDate>Sun, 19 Dec 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/</guid><description>In the previous post, we created a VPC that has private and public subnets in 2 availability zones in order to build and deploy the data lake solution on AWS. NAT instances are created to forward outbound traffic to the internet and a VPN bastion host is set up to facilitate deployment. 
An Aurora PostgreSQL cluster is deployed to host the source database and a Python command line app is used to create the database.</description><enclosure url="https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/featured.png" length="164526" type="image/png"/></item><item><title>Data Lake Demo using Change Data Capture (CDC) on AWS – Part 2 Implement CDC</title><link>https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/</link><pubDate>Sun, 12 Dec 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/</guid><description>In the previous post, we discussed a data lake solution where data ingestion is performed using change data capture (CDC) and the output files are upserted to an Apache Hudi table. Being registered to Glue Data Catalog, it can be used for ad-hoc queries and report/dashboard creation. The Northwind database is used as the source database and, following the transactional outbox pattern, order-related changes are upserted to an outbox table by triggers.</description><enclosure url="https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/featured.png" length="164526" type="image/png"/></item><item><title>Data Lake Demo using Change Data Capture (CDC) on AWS – Part 1 Local Development</title><link>https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/</link><pubDate>Sun, 05 Dec 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/</guid><description>Change data capture (CDC) is a proven data integration pattern that has a wide range of applications. Among those, data replication to data lakes is a good use case in data engineering. Coupled with best-of-breed data lake formats such as Apache Hudi, we can build an efficient data replication solution. This is the first post of the data lake demo series. 
Over time, we&amp;rsquo;ll build a data lake that uses CDC.</description><enclosure url="https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/featured.png" length="164526" type="image/png"/></item><item><title>Local Development of AWS Glue 3.0 and Later</title><link>https://jaehyeon.me/blog/2021-11-14-glue-3-local-development/</link><pubDate>Sun, 14 Nov 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-11-14-glue-3-local-development/</guid><description>In an earlier post, I demonstrated how to set up a local development environment for AWS Glue 1.0 and 2.0 using a docker image that is published by the AWS Glue team and the Visual Studio Code Remote – Containers extension. Recently AWS Glue 3.0 was released, but a docker image for this version is not published. In this post, I&amp;rsquo;ll illustrate how to create a development environment for AWS Glue 3.</description><enclosure url="https://jaehyeon.me/blog/2021-11-14-glue-3-local-development/featured.png" length="30923" type="image/png"/></item><item><title>AWS Glue Local Development with Docker and Visual Studio Code</title><link>https://jaehyeon.me/blog/2021-08-20-glue-local-development/</link><pubDate>Fri, 20 Aug 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-08-20-glue-local-development/</guid><description>As described in the product page, AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. For development, a development endpoint is recommended, but it can be costly, inconvenient or unavailable (for Glue 2.0). The AWS Glue team published a Docker image that includes the AWS Glue binaries and all the dependencies packaged together. 
After inspecting it, I find some modifications are necessary in order to build a development environment on it.</description><enclosure url="https://jaehyeon.me/blog/2021-08-20-glue-local-development/featured.png" length="19535" type="image/png"/></item><item><title>Thoughts on Apache Airflow AWS Lambda Operator</title><link>https://jaehyeon.me/blog/2020-04-13-airflow-lambda-operator/</link><pubDate>Mon, 13 Apr 2020 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2020-04-13-airflow-lambda-operator/</guid><description>Apache Airflow is a popular open-source workflow management platform. Typically, tasks are run remotely by Celery workers for scalability. In AWS, however, scalability can also be achieved using serverless computing services in a simpler way. For example, the ECS Operator allows us to run dockerized tasks and, with the Fargate launch type, they can run in a serverless environment.
The ECS Operator alone is not sufficient because it can take up to several minutes to pull a Docker image and to set up a network interface (in the case of the Fargate launch type).</description><enclosure url="https://jaehyeon.me/blog/2020-04-13-airflow-lambda-operator/featured.png" length="44994" type="image/png"/></item><item><title>Boost SparkR with Hive</title><link>https://jaehyeon.me/blog/2016-04-30-boost-sparkr-with-hive/</link><pubDate>Sat, 30 Apr 2016 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2016-04-30-boost-sparkr-with-hive/</guid><description>In the previous post, we demonstrated how to start SparkR in local and cluster mode. While SparkR is in active development, it is yet to fully support Spark&amp;rsquo;s key libraries such as MLlib and Spark Streaming. Even as a data processing engine, this R API is still limited, as RDDs cannot be manipulated directly but only via the Spark SQL/DataFrame API. As can be checked in the API doc, SparkR rebuilds many existing R functions to work with Spark DataFrame and notably it borrows some functions from the dplyr package.</description></item><item><title>Quick Start SparkR in Local and Cluster Mode</title><link>https://jaehyeon.me/blog/2016-03-02-quick-start-sparkr-in-local-and-cluster-mode/</link><pubDate>Wed, 02 Mar 2016 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2016-03-02-quick-start-sparkr-in-local-and-cluster-mode/</guid><description>In the previous post, a Spark cluster is set up using 2 VirtualBox Ubuntu guests. While this is a viable option for many, it is not always so for others. For those who find setting up such a cluster inconvenient, there&amp;rsquo;s still another option: relying on the local mode of Spark. 
In this post, a BitBucket repository is introduced, which is an R project that includes Spark 1.6.0 Pre-built for Hadoop 2.</description></item><item><title>Spark Cluster Setup on VirtualBox</title><link>https://jaehyeon.me/blog/2016-02-22-spark-cluster-setup-on-virtualbox/</link><pubDate>Mon, 22 Feb 2016 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2016-02-22-spark-cluster-setup-on-virtualbox/</guid><description>We discuss how to set up a Spark cluster between 2 Ubuntu guests. First, we begin with machine preparation. Once a machine is baked, its image file (VDI) is copied for the second one. Then we discuss how to launch a cluster in standalone mode. Let&amp;rsquo;s get started.
Machine preparation If you haven&amp;rsquo;t read the previous post, I recommend reading it, as it introduces PuTTY as well. Also, as Spark needs the Java Development Kit (JDK), you may need to install it with apt-get first - see this tutorial for further details.</description></item></channel></rss>