Apache Spark on Jaehyeon Kim

https://jaehyeon.me/tags/apache-spark/
Recent content in Apache Spark on Jaehyeon Kim
Copyright © 2023-2024 Jaehyeon Kim. All Rights Reserved.
Thu, 07 Dec 2023 00:00:00 +0000

Setup Local Development Environment for Apache Flink and Spark Using EMR Container Images
https://jaehyeon.me/blog/2023-12-07-flink-spark-local-dev/
Thu, 07 Dec 2023 00:00:00 +0000
Apache Flink became generally available for Amazon EMR on EKS from the EMR 6.15.0 release, and we are able to pull the Flink (as well as Spark) container images from the ECR Public Gallery. As both of them can be integrated with the Glue Data Catalog, it can be particularly useful if we develop real-time data ingestion/processing via Flink and build analytical queries using Spark (or any other tools or services that can access the Glue Data Catalog).

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 4 EMR on EKS
https://jaehyeon.me/blog/2022-11-01-dbt-on-aws-part-4-emr-eks/
Tue, 01 Nov 2022 00:00:00 +0000
The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services: Redshift, Glue, EMR and Athena. In the previous posts, we discussed the benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. Demo data projects that target Redshift Serverless, Glue and EMR on EC2 are illustrated as well.

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 3 EMR on EC2
https://jaehyeon.me/blog/2022-10-19-dbt-on-aws-part-3-emr-ec2/
Wed, 19 Oct 2022 00:00:00 +0000
The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services: Redshift, Glue, EMR and Athena.
In the previous posts, we discussed the benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. Demo data projects that target Redshift Serverless and Glue are illustrated as well.

Data Build Tool (dbt) for Effective Data Transformation on AWS – Part 2 Glue
https://jaehyeon.me/blog/2022-10-09-dbt-on-aws-part-2-glue/
Sun, 09 Oct 2022 00:00:00 +0000
The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services: Redshift, Glue, EMR and Athena. In part 1, we discussed the benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. A demo data project that targets Redshift Serverless is illustrated as well. In part 2 of the dbt on AWS series, we discuss data transformation pipelines using dbt on AWS Glue.

Develop and Test Apache Spark Apps for EMR Remotely Using Visual Studio Code
https://jaehyeon.me/blog/2022-09-07-emr-remote-dev/
Wed, 07 Sep 2022 00:00:00 +0000
When we develop a Spark application on EMR, we can use Docker for local development or notebooks via EMR Studio (or EMR Notebooks). However, the local development option is not viable if the size of data is large. Also, I am not a fan of notebooks as it is not possible to utilise the features my editor supports such as syntax highlighting, autocomplete and code formatting.
Moreover, it is not possible to organise code into modules and to perform unit testing properly with that option.

Manage EMR on EKS with Terraform
https://jaehyeon.me/blog/2022-08-26-emr-on-eks-with-terraform/
Fri, 26 Aug 2022 00:00:00 +0000
Amazon EMR on EKS is a deployment option for Amazon EMR that allows you to automate the provisioning and management of open-source big data frameworks on EKS. While eksctl is popular for working with Amazon EKS clusters, it has limitations when it comes to building infrastructure that integrates multiple AWS services. Also, it is not straightforward to update EKS cluster resources incrementally with it. On the other hand, Terraform can be an effective tool for managing infrastructure that includes not only EKS and EMR virtual clusters but also other AWS resources.

Data Warehousing ETL Demo with Apache Iceberg on EMR Local Environment
https://jaehyeon.me/blog/2022-06-26-iceberg-etl-demo/
Sun, 26 Jun 2022 00:00:00 +0000
Unlike a traditional data lake, new table formats (Iceberg, Hudi and Delta Lake) support features that can be used to apply data warehousing patterns, which offers a way out of the data swamp. In this post, we’ll discuss how to implement ETL using retail analytics data. It has two dimension data sets (user and product) and a single fact data set (order). The dimension data sets have different ETL strategies depending on whether historical changes are tracked.

Develop and Test Apache Spark Apps for EMR Locally Using Docker
https://jaehyeon.me/blog/2022-05-08-emr-local-dev/
Sun, 08 May 2022 00:00:00 +0000
[UPDATE 2023-12-07] I wrote a new post that simplifies the Spark configuration dramatically. Besides, the log configuration is based on Log4J2, which applies to newer Spark versions.
Moreover, the container is configured to run the Spark History Server, which allows us to debug and diagnose completed and running Spark applications. I recommend referring to the new post. Amazon EMR is a managed service that simplifies running Apache Spark on AWS.

EMR on EKS by Example
https://jaehyeon.me/blog/2022-01-17-emr-on-eks-by-example/
Mon, 17 Jan 2022 00:00:00 +0000
EMR on EKS provides a deployment option for Amazon EMR that allows you to automate the provisioning and management of open-source big data frameworks on Amazon EKS. While a wide range of open-source big data components are available in EMR on EC2, only Apache Spark is available in EMR on EKS. It is more flexible, however, in that applications of different EMR versions can be run in multiple availability zones on either EC2 or Fargate.

Data Lake Demo using Change Data Capture (CDC) on AWS – Part 3 Implement Data Lake
https://jaehyeon.me/blog/2021-12-19-datalake-demo-part3/
Sun, 19 Dec 2021 00:00:00 +0000
In the previous post, we created a VPC that has private and public subnets in 2 availability zones in order to build and deploy the data lake solution on AWS. NAT instances are created to forward outbound traffic to the internet and a VPN bastion host is set up to facilitate deployment. An Aurora PostgreSQL cluster is deployed to host the source database and a Python command line app is used to create the database.

Data Lake Demo using Change Data Capture (CDC) on AWS – Part 2 Implement CDC
https://jaehyeon.me/blog/2021-12-12-datalake-demo-part2/
Sun, 12 Dec 2021 00:00:00 +0000
In the previous post, we discussed a data lake solution where data ingestion is performed using change data capture (CDC) and the output files are upserted to an Apache Hudi table.
Registered in the Glue Data Catalog, the table can be used for ad-hoc queries and report/dashboard creation. The Northwind database is used as the source database and, following the transactional outbox pattern, order-related changes are upserted to an outbox table by triggers.

Data Lake Demo using Change Data Capture (CDC) on AWS – Part 1 Local Development
https://jaehyeon.me/blog/2021-12-05-datalake-demo-part1/
Sun, 05 Dec 2021 00:00:00 +0000
Change data capture (CDC) is a proven data integration pattern that has a wide range of applications. Among those, data replication to data lakes is a good use case in data engineering. Coupled with best-of-breed data lake formats such as Apache Hudi, we can build an efficient data replication solution. This is the first post of the data lake demo series. Over time, we’ll build a data lake that uses CDC.

Local Development of AWS Glue 3.0 and Later
https://jaehyeon.me/blog/2021-11-14-glue-3-local-development/
Sun, 14 Nov 2021 00:00:00 +0000
In an earlier post, I demonstrated how to set up a local development environment for AWS Glue 1.0 and 2.0 using a Docker image that is published by the AWS Glue team and the Visual Studio Code Remote – Containers extension. Recently AWS Glue 3.0 was released, but a Docker image for this version is not published. In this post, I’ll illustrate how to create a development environment for AWS Glue 3.

AWS Glue Local Development with Docker and Visual Studio Code
https://jaehyeon.me/blog/2021-08-20-glue-local-development/
Fri, 20 Aug 2021 00:00:00 +0000
As described in the product page, AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
For development, a development endpoint is recommended, but it can be costly, inconvenient or unavailable (for Glue 2.0). The AWS Glue team published a Docker image that includes the AWS Glue binaries and all the dependencies packaged together. After inspecting it, I found that some modifications are necessary in order to build a development environment on it.

Boost SparkR with Hive
https://jaehyeon.me/blog/2016-04-30-boost-sparkr-with-hive/
Sat, 30 Apr 2016 00:00:00 +0000
The previous post demonstrated how to start SparkR in local and cluster mode. While SparkR is in active development, it is yet to fully support Spark’s key libraries such as MLlib and Spark Streaming. Even as a data processing engine, this R API is still limited, as RDDs cannot be manipulated directly but only via the Spark SQL/DataFrame API. As can be checked in the API doc, SparkR rebuilds many existing R functions to work with Spark DataFrame and notably it borrows some functions from the dplyr package.

Quick Start SparkR in Local and Cluster Mode
https://jaehyeon.me/blog/2016-03-02-quick-start-sparkr-in-local-and-cluster-mode/
Wed, 02 Mar 2016 00:00:00 +0000
In the previous post, a Spark cluster is set up using 2 VirtualBox Ubuntu guests. While this is a viable option for many, it is not for everyone. For those who find setting up such a cluster inconvenient, there’s still another option, which is relying on the local mode of Spark. In this post, a BitBucket repository is introduced, which is an R project that includes Spark 1.6.0 Pre-built for Hadoop 2.

Spark Cluster Setup on VirtualBox
https://jaehyeon.me/blog/2016-02-22-spark-cluster-setup-on-virtualbox/
Mon, 22 Feb 2016 00:00:00 +0000
We discuss how to set up a Spark cluster between 2 Ubuntu guests.
Firstly, it begins with machine preparation. Once a machine is baked, its image file (VDI) is copied for the second one. Then how to launch a cluster in standalone mode is discussed. Let’s get started.

Machine preparation
If you haven’t read the previous post, I recommend reading it as it introduces PuTTY as well. Also, as Spark needs the Java Development Kit (JDK), you may need to apt-get it first - see this tutorial for further details.