<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Apache Spark on Jaehyeon Kim</title><link>https://jaehyeon.me/tags/apache-spark/</link><description>Recent content in Apache Spark on Jaehyeon Kim</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Copyright © 2023-2026 Jaehyeon Kim. All Rights Reserved.</copyright><lastBuildDate>Thu, 17 Jul 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://jaehyeon.me/tags/apache-spark/index.xml" rel="self" type="application/rss+xml"/><item><title>Self-service Data Platform via a Multi-tenant SQL Gateway</title><link>https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/</link><pubDate>Thu, 17 Jul 2025 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/</guid><description>In the modern data stack, providing direct access to powerful engines like Apache Spark and Flink is a double-edged sword. While it empowers users, it often leads to chaos: resource contention from &amp;ldquo;noisy neighbors,&amp;rdquo; inconsistent security enforcement, and operational fragility. The core problem is the lack of a robust control plane between users and the raw compute power. The solution, therefore, isn&amp;rsquo;t to take power away from users, but to manage it through an intelligent intermediary.</description><enclosure url="https://jaehyeon.me/blog/2025-07-17-self-service-data-platform-via-sql-gateway/featured.png" length="55011" type="image/png"/></item><item><title>Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images</title><link>https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/</link><pubDate>Thu, 07 Dec 2023 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/</guid><description>[UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository. To ensure continued access and compatibility, please update your Docker image references accordingly.
For example:
bitnami/kafka:2.8.1 → bitnamilegacy/kafka:2.8.1 bitnami/zookeeper:3.7.0 → bitnamilegacy/zookeeper:3.7.0 bitnami/python:3.9.0 → bitnamilegacy/python:3.9.0 Apache Flink became generally available for Amazon EMR on EKS from the EMR 6.15.0 releases, and we are able to pull the Flink (as well as Spark) container images from the ECR Public Gallery.</description><enclosure url="https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/featured.png" length="133053" type="image/png"/></item><item><title>Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 4 EMR on EKS</title><link>https://jaehyeon.me/blog/2022-11-01-dbt-on-aws-part-4-emr-eks/</link><pubDate>Tue, 01 Nov 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-11-01-dbt-on-aws-part-4-emr-eks/</guid><description>The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In the previous posts, we discussed benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. Demo data projects that target Redshift Serverless, Glue and EMR on EC2 are illustrated as well.</description><enclosure url="https://jaehyeon.me/blog/2022-11-01-dbt-on-aws-part-4-emr-eks/featured.png" length="91067" type="image/png"/></item><item><title>Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 3 EMR on EC2</title><link>https://jaehyeon.me/blog/2022-10-19-dbt-on-aws-part-3-emr-ec2/</link><pubDate>Wed, 19 Oct 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-10-19-dbt-on-aws-part-3-emr-ec2/</guid><description>The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In the previous posts, we discussed benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. Demo data projects that target Redshift Serverless and Glue are illustrated as well.</description><enclosure url="https://jaehyeon.me/blog/2022-10-19-dbt-on-aws-part-3-emr-ec2/featured.png" length="91067" type="image/png"/></item><item><title>Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 2 Glue</title><link>https://jaehyeon.me/blog/2022-10-09-dbt-on-aws-part-2-glue/</link><pubDate>Sun, 09 Oct 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-10-09-dbt-on-aws-part-2-glue/</guid><description>The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In part 1, we discussed benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. A demo data project that targets Redshift Serverless is illustrated as well. In part 2 of the dbt on AWS series, we discuss data transformation pipelines using dbt on AWS Glue.</description><enclosure url="https://jaehyeon.me/blog/2022-10-09-dbt-on-aws-part-2-glue/featured.png" length="90647" type="image/png"/></item><item><title>Develop and Test Apache Spark Apps for EMR Remotely Using Visual Studio Code</title><link>https://jaehyeon.me/blog/2022-09-07-emr-remote-dev/</link><pubDate>Wed, 07 Sep 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-09-07-emr-remote-dev/</guid><description>When we develop a Spark application on EMR, we can use docker for local development or notebooks via EMR Studio (or EMR Notebooks). However, the local development option is not viable if the size of data is large. Also, I am not a fan of notebooks as it is not possible to utilise the features my editor supports such as syntax highlighting, autocomplete and code formatting. Moreover, it is not possible to organise code into modules and to perform unit testing properly with that option.</description><enclosure url="https://jaehyeon.me/blog/2022-09-07-emr-remote-dev/featured.png" length="72448" type="image/png"/></item><item><title>Manage EMR on EKS with Terraform</title><link>https://jaehyeon.me/blog/2022-08-26-emr-on-eks-with-terraform/</link><pubDate>Fri, 26 Aug 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-08-26-emr-on-eks-with-terraform/</guid><description>Amazon EMR on EKS is a deployment option for Amazon EMR that allows you to automate the provisioning and management of open-source big data frameworks on EKS. While eksctl is popular for working with Amazon EKS clusters, it has limitations when it comes to building infrastructure that integrates multiple AWS services. Also, it is not straightforward to update EKS cluster resources incrementally with it. On the other hand Terraform can be an effective tool for managing infrastructure that includes not only EKS and EMR virtual clusters but also other AWS resources.</description><enclosure url="https://jaehyeon.me/blog/2022-08-26-emr-on-eks-with-terraform/featured.png" length="67936" type="image/png"/></item><item><title>Data Warehousing ETL Demo with Apache Iceberg on EMR Local Environment</title><link>https://jaehyeon.me/blog/2022-06-26-iceberg-etl-demo/</link><pubDate>Sun, 26 Jun 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-06-26-iceberg-etl-demo/</guid><description>Unlike traditional Data Lake, new table formats (Iceberg, Hudi and Delta Lake) support features that can be used to apply data warehousing patterns, which can bring a way to be rescued from Data Swamp. In this post, we&amp;rsquo;ll discuss how to implement ETL using retail analytics data. It has two dimension data (user and product) and a single fact data (order). The dimension data sets have different ETL strategies depending on whether to track historical changes.</description><enclosure url="https://jaehyeon.me/blog/2022-06-26-iceberg-etl-demo/featured.png" length="43604" type="image/png"/></item><item><title>Develop and Test Apache Spark Apps for EMR Locally Using Docker</title><link>https://jaehyeon.me/blog/2022-05-08-emr-local-dev/</link><pubDate>Sun, 08 May 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-05-08-emr-local-dev/</guid><description>[UPDATE 2023-12-07]
I wrote a new post that simplifies the Spark configuration dramatically. Besides, the log configuration is based on Log4J2, which applies to newer Spark versions. Moreover, the container is configured to run the Spark History Server, and it allows us to debug and diagnose completed and running Spark applications. I recommend referring to the new post. [UPDATE 2025-10-01]
Bitnami&amp;rsquo;s public Docker images have been moved to the Bitnami Legacy repository.</description><enclosure url="https://jaehyeon.me/blog/2022-05-08-emr-local-dev/featured.png" length="25693" type="image/png"/></item><item><title>EMR on EKS by Example</title><link>https://jaehyeon.me/blog/2022-01-17-emr-on-eks-by-example/</link><pubDate>Mon, 17 Jan 2022 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2022-01-17-emr-on-eks-by-example/</guid><description>EMR on EKS provides a deployment option for Amazon EMR that allows you to automate the provisioning and management of open-source big data frameworks on Amazon EKS. While a wide range of open source big data components are available in EMR on EC2, only Apache Spark is available in EMR on EKS. It is more flexible, however, that applications of different EMR versions can be run in multiple availability zones on either EC2 or Fargate.</description><enclosure url="https://jaehyeon.me/blog/2022-01-17-emr-on-eks-by-example/featured.png" length="76740" type="image/png"/></item><item><title>AWS Glue Local Development with Docker and Visual Studio Code</title><link>https://jaehyeon.me/blog/2021-08-20-glue-local-development/</link><pubDate>Fri, 20 Aug 2021 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2021-08-20-glue-local-development/</guid><description>As described in the product page, AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. For development, a development endpoint is recommended, but it can be costly, inconvenient or unavailable (for Glue 2.0). The AWS Glue team published a Docker image that includes the AWS Glue binaries and all the dependencies packaged together. After inspecting it, I find some modifications are necessary in order to build a development environment on it.</description><enclosure url="https://jaehyeon.me/blog/2021-08-20-glue-local-development/featured.png" length="19535" type="image/png"/></item><item><title>Boost SparkR with Hive</title><link>https://jaehyeon.me/blog/2016-04-30-boost-sparkr-with-hive/</link><pubDate>Sat, 30 Apr 2016 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2016-04-30-boost-sparkr-with-hive/</guid><description>In the previous post, it is demonstrated how to start SparkR in local and cluster mode. While SparkR is in active development, it is yet to fully support Spark&amp;rsquo;s key libraries such as MLlib and Spark Streaming. Even, as a data processing engine, this R API is still limited as it is not possible to manipulate RDDs directly but only via Spark SQL/DataFrame API. As can be checked in the API doc, SparkR rebuilds many existing R functions to work with Spark DataFrame and notably it borrows some functions from the dplyr package.</description></item><item><title>Quick Start SparkR in Local and Cluster Mode</title><link>https://jaehyeon.me/blog/2016-03-02-quick-start-sparkr-in-local-and-cluster-mode/</link><pubDate>Wed, 02 Mar 2016 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2016-03-02-quick-start-sparkr-in-local-and-cluster-mode/</guid><description>In the previous post, a Spark cluster is set up using 2 VirtualBox Ubuntu guests. While this is a viable option for many, it is not always for others. For those who find setting-up such a cluster is not convenient, there&amp;rsquo;s still another option, which is relying on the local mode of Spark. In this post, a BitBucket repository is introduced, which is a R project that includes Spark 1.6.0 Pre-built for Hadoop 2.</description></item><item><title>Spark Cluster Setup on VirtualBox</title><link>https://jaehyeon.me/blog/2016-02-22-spark-cluster-setup-on-virtualbox/</link><pubDate>Mon, 22 Feb 2016 00:00:00 +0000</pubDate><guid>https://jaehyeon.me/blog/2016-02-22-spark-cluster-setup-on-virtualbox/</guid><description>We discuss how to set up a Spark cluser between 2 Ubuntu guests. Firstly it begins with machine preparation. Once a machine is baked, its image file (VDI) is be copied for the second one. Then how to launch a cluster by standalone mode is discussed. Let&amp;rsquo;s get started.
Machine preparation If you haven&amp;rsquo;t read the previous post, I recommend reading as it introduces Putty as well. Also, as Spark need Java Development Kit (JDK), you may need to apt-get it first - see this tutorial for further details.</description></item></channel></rss>