In an earlier post, I demonstrated how to set up a local development environment for AWS Glue 1.0 and 2.0 using a Docker image that is published by the AWS Glue team and the Visual Studio Code Remote – Containers extension. Recently AWS Glue 3.0 was released, but a Docker image for this version has not been published. In this post, I’ll illustrate how to create a development environment for AWS Glue 3.0 (and later versions) by building a custom Docker image.

Glue Base Docker Image

The Glue base image is built by following the official AWS Glue Python local development documentation. For example, the latest image, which targets Glue 3.0, is built on top of the official Python image for the latest stable Debian release (python:3.7.12-bullseye). After the utilities (zip and AWS CLI v2) are installed, OpenJDK 8 is installed. Then Maven, Spark and the Glue Python libraries (aws-glue-libs) are added to the /opt directory, and the Glue dependencies are downloaded by sourcing glue-setup.sh. The image then installs the default Python packages and updates the GLUE_HOME and PYTHONPATH environment variables. The Dockerfile is shown below, and it can also be found in the project GitHub repository.

## glue-base/3.0/Dockerfile
FROM python:3.7.12-bullseye

## Install utils
RUN apt-get update && apt-get install -y zip

RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \
    && unzip awscliv2.zip && ./aws/install

## Install Open JDK 8
RUN apt-get update \
  && apt-get install -y software-properties-common \
  && apt-add-repository 'deb http://security.debian.org/debian-security stretch/updates main' \
  && apt-get update \
  && apt-get install -y openjdk-8-jdk

## Create environment variables
ENV M2_HOME=/opt/apache-maven-3.6.0
ENV SPARK_HOME=/opt/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3
ENV PATH="${PATH}:${M2_HOME}/bin"

## Add Maven, Spark and AWS Glue Libs to /opt
RUN curl -SsL https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz \
    | tar -C /opt --warning=no-unknown-keyword -xzf -
RUN curl -SsL https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz \
    | tar -C /opt --warning=no-unknown-keyword -xzf -
RUN curl -SsL https://github.com/awslabs/aws-glue-libs/archive/refs/tags/v3.0.tar.gz \
    | tar -C /opt --warning=no-unknown-keyword -xzf -

## Install Glue dependencies
RUN cd /opt/aws-glue-libs-3.0/bin/ \
    && bash -c "source glue-setup.sh"

## Add default Python packages
COPY ./requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt

## Update Python path
ENV GLUE_HOME=/opt/aws-glue-libs-3.0
ENV PYTHONPATH=$GLUE_HOME:$SPARK_HOME/python/lib/pyspark.zip:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$SPARK_HOME/python

EXPOSE 4040

CMD ["bash"]

The image is published to the glue-base repository of Cevo Australia’s public ECR registry with the following tags, and later versions of the Glue base image will be published with the relevant tags. A sketch of the build and push commands is shown after the tag list.

  • public.ecr.aws/cevoaustralia/glue-base:latest
  • public.ecr.aws/cevoaustralia/glue-base:3.0
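For reference, the image can be built and pushed along the lines below. This is only a minimal sketch: the registry alias and repository name simply mirror the tags above, so replace them with your own public ECR alias if you replicate the setup.

## Build the Glue base image (build context is glue-base/3.0, where requirements.txt lives)
docker build -t glue-base:3.0 ./glue-base/3.0

## Authenticate to the public ECR registry (ecr-public is only available in us-east-1)
aws ecr-public get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin public.ecr.aws

## Tag and push - replace "cevoaustralia" with your own registry alias
docker tag glue-base:3.0 public.ecr.aws/cevoaustralia/glue-base:3.0
docker tag glue-base:3.0 public.ecr.aws/cevoaustralia/glue-base:latest
docker push public.ecr.aws/cevoaustralia/glue-base:3.0
docker push public.ecr.aws/cevoaustralia/glue-base:latest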

Usage

The Glue base image can be used for running a PySpark shell or submitting a Spark application, as shown below. For the Spark application, I assume the project repository is mapped to the container’s /tmp/glue-vscode folder. The Glue Python libraries also support Pytest, which will be discussed later in the post.

docker run --rm -it \
  -v $HOME/.aws:/root/.aws \
  public.ecr.aws/cevoaustralia/glue-base bash -c "/opt/aws-glue-libs-3.0/bin/gluepyspark"

docker run --rm -it \
  -v $HOME/.aws:/root/.aws \
  -v $PWD:/tmp/glue-vscode \
  public.ecr.aws/cevoaustralia/glue-base bash -c "/opt/aws-glue-libs-3.0/bin/gluesparksubmit /tmp/glue-vscode/example.py"

Extend Glue Base Image

We can extend the Glue base image using the Visual Studio Code Dev Containers extension. The configuration for the extension is kept in the .devcontainer folder, which includes the Dockerfile for the development Docker image and the remote container configuration file (devcontainer.json). The other contents are the source of the Glue base image and materials for the PySpark, spark-submit and Pytest demonstrations, which are illustrated below.

.
├── .devcontainer
│   ├── pkgs
│   │   └── dev.txt
│   ├── Dockerfile
│   └── devcontainer.json
├── .gitignore
├── README.md
├── example.py
├── execute.sh
├── glue-base
│   └── 3.0
│       ├── Dockerfile
│       └── requirements.txt
├── src
│   └── utils.py
└── tests
    ├── __init__.py
    ├── conftest.py
    └── test_utils.py

Development Docker Image

The Glue base Docker image runs as the root user, which is not convenient for writing code. Therefore, a non-root user is created whose username corresponds to the logged-in user’s username - the USERNAME argument is set accordingly in devcontainer.json. Next, the sudo program is installed and the non-root user is granted passwordless sudo access. Note that the Glue Python library’s executables are configured to run as the root user, so sudo is necessary to run those executables. Finally, additional development Python packages are installed.

## .devcontainer/Dockerfile
FROM public.ecr.aws/i0m5p1b5/glue-base:3.0

ARG USERNAME
ARG USER_UID
ARG USER_GID

## Create non-root user
RUN groupadd --gid $USER_GID $USERNAME \
    && useradd --uid $USER_UID --gid $USER_GID -m $USERNAME

## Add sudo support in case we need to install software after connecting
RUN apt-get update \
    && apt-get install -y sudo nano \
    && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME \
    && chmod 0440 /etc/sudoers.d/$USERNAME

## Install Python packages
COPY ./pkgs /tmp/pkgs
RUN pip install -r /tmp/pkgs/dev.txt

Container Configuration

The development container will be created by building an image from the Dockerfile illustrated above. The logged-in user’s username is provided to create the non-root user, and the container is configured to run as that user. Two Visual Studio Code extensions are installed: Python and Prettier. The current folder is mounted to the container’s workspace folder, and two additional folders are mounted to share AWS credentials and SSH keys. Note that the AWS credentials are mounted to /root/.aws because the Glue Python library’s executables are run as the root user. Port 4040 is forwarded for the Spark UI, and additional editor settings are added at the end.

// .devcontainer/devcontainer.json
{
  "name": "glue",
  "build": {
    "dockerfile": "Dockerfile",
    "args": {
      "USERNAME": "${localEnv:USER}",
      "USER_UID": "1000",
      "USER_GID": "1000"
    }
  },
  "containerUser": "${localEnv:USER}",
  "extensions": [
    "ms-python.python",
    "esbenp.prettier-vscode"
  ],
  "workspaceMount": "source=${localWorkspaceFolder},target=${localEnv:HOME}/glue-vscode,type=bind,consistency=cached",
  "workspaceFolder": "${localEnv:HOME}/glue-vscode",
  "forwardPorts": [4040],
  "mounts": [
    "source=${localEnv:HOME}/.aws,target=/root/.aws,type=bind,consistency=cached",
    "source=${localEnv:HOME}/.ssh,target=${localEnv:HOME}/.ssh,type=bind,consistency=cached"
  ],
  "settings": {
    "terminal.integrated.profiles.linux": {
      "bash": {
        "path": "/bin/bash"
      }
    },
    "terminal.integrated.defaultProfile.linux": "bash",
    "editor.formatOnSave": true,
    "editor.defaultFormatter": "esbenp.prettier-vscode",
    "editor.tabSize": 2,
    "python.testing.pytestEnabled": true,
    "python.linting.enabled": true,
    "python.linting.pylintEnabled": false,
    "python.linting.flake8Enabled": false,
    "python.formatting.provider": "black",
    "python.formatting.blackPath": "black",
    "python.formatting.blackArgs": ["--line-length", "100"],
    "[python]": {
      "editor.tabSize": 4,
      "editor.defaultFormatter": "ms-python.python"
    }
  }
}

Launch Container

The development container can be run by executing the following command in the command palette.

  • Remote-Containers: Open Folder in Container…

Once the development container is ready, the workspace folder will be opened inside the container.
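Alternatively, if you prefer working from a terminal, the Dev Containers CLI can build and start the same container. This is only a sketch and assumes the CLI (the @devcontainers/cli npm package) is installed; it is not part of this project.

## Build and start the container defined in .devcontainer from the project root
devcontainer up --workspace-folder .

## Open a shell inside the running development container
devcontainer exec --workspace-folder . bash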

Examples

I’ve created a script (execute.sh) to run the executables easily. The first argument indicates which executable to run, and the possible values are pyspark, spark-submit and pytest. Some example commands are shown below.

./execute.sh pyspark # pyspark
./execute.sh spark-submit example.py # spark submit
./execute.sh pytest -svv # pytest
#!/usr/bin/env bash
## execute.sh

## keep the execution type and remove it from the argument list
execution=$1
echo "execution type - $execution"

shift 1
echo "$@"

## run the matching Glue executable as root
## note: "$*" joins the remaining arguments into a single command string for su -c
if [ "$execution" == 'pyspark' ]; then
  sudo su -c "$GLUE_HOME/bin/gluepyspark"
elif [ "$execution" == 'spark-submit' ]; then
  sudo su -c "$GLUE_HOME/bin/gluesparksubmit $*"
elif [ "$execution" == 'pytest' ]; then
  sudo su -c "$GLUE_HOME/bin/gluepytest $*"
else
  echo "unsupported execution type - $execution"
  exit 1
fi

Pyspark

Using the script above, we can launch the PySpark shell as shown below.

./execute.sh pyspark
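As a quick smoke test, the lines below can be typed into the shell once it starts. This is only a minimal sketch that exercises the GlueContext and its underlying SparkSession; it does not touch the Glue Data Catalog or any AWS resources.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

## SparkContext.getOrCreate() reuses the context the shell has already created
glueContext = GlueContext(SparkContext.getOrCreate())

## create a small Spark data frame and print it
df = glueContext.spark_session.createDataFrame([("glue", 3), ("spark", 1)], ["name", "value"])
df.show()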

Spark Submit

The script below is based on one of the Python samples in the Glue documentation. It pulls three data sets from a Glue Data Catalog database called legislators, joins them to create a history data set (l_history) and saves the result to S3.

./execute.sh spark-submit example.py
## example.py
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

DATABASE = "legislators"
OUTPUT_PATH = "s3://glue-python-samples-fbe445ee/output_dir"

## create dynamic frames from data catalog
persons: DynamicFrame = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE, table_name="persons_json"
)

memberships: DynamicFrame = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE, table_name="memberships_json"
)

orgs: DynamicFrame = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE, table_name="organizations_json"
)

## manipulate data
orgs = (
    orgs.drop_fields(["other_names", "identifiers"])
    .rename_field("id", "org_id")
    .rename_field("name", "org_name")
)

l_history: DynamicFrame = Join.apply(
    orgs, Join.apply(persons, memberships, "id", "person_id"), "org_id", "organization_id"
)
l_history = l_history.drop_fields(["person_id", "org_id"])

l_history.printSchema()

## write to s3
glueContext.write_dynamic_frame.from_options(
    frame=l_history,
    connection_type="s3",
    connection_options={"path": f"{OUTPUT_PATH}/legislator_history"},
    format="parquet",
)

When the execution completes, we can see that the joined data set has been stored as Parquet files in the output S3 bucket.
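For example, a quick way to confirm this is to list the output prefix with the AWS CLI. The bucket and prefix below simply mirror the OUTPUT_PATH used in example.py, so replace them with your own values.

aws s3 ls s3://glue-python-samples-fbe445ee/output_dir/legislator_history/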

Note that we can monitor and inspect Spark job executions in the Spark UI on port 4040.

Pytest

We can also test functions that work with a DynamicFrame. Below is a test case for a simple function that filters a DynamicFrame based on a column value.

./execute.sh pytest -svv
## src/utils.py
from awsglue.dynamicframe import DynamicFrame

def filter_dynamic_frame(dyf: DynamicFrame, column_name: str, value: int):
    return dyf.filter(f=lambda x: x[column_name] > value)

## tests/conftest.py
from pyspark.context import SparkContext
from awsglue.context import GlueContext
import pytest

@pytest.fixture(scope="session")
def glueContext():
    sparkContext = SparkContext()
    glueContext = GlueContext(sparkContext)
    yield glueContext
    sparkContext.stop()


## tests/test_utils.py
from typing import List
from awsglue.dynamicframe import DynamicFrame
import pandas as pd
from src.utils import filter_dynamic_frame

def _get_sorted_data_frame(pdf: pd.DataFrame, columns_list: List[str] = None):
    if columns_list is None:
        columns_list = list(pdf.columns.values)
    return pdf.sort_values(columns_list).reset_index(drop=True)


def test_filter_dynamic_frame_by_value(glueContext):
    spark = glueContext.spark_session

    input = spark.createDataFrame(
        [("charly", 15), ("fabien", 18), ("sam", 21), ("sam", 25), ("nick", 19), ("nick", 40)],
        ["name", "age"],
    )

    expected_output = spark.createDataFrame(
        [("sam", 25), ("sam", 21), ("nick", 40)],
        ["name", "age"],
    )

    real_output = filter_dynamic_frame(DynamicFrame.fromDF(input, glueContext, "output"), "age", 20)

    pd.testing.assert_frame_equal(
        _get_sorted_data_frame(real_output.toDF().toPandas(), ["name", "age"]),
        _get_sorted_data_frame(expected_output.toPandas(), ["name", "age"]),
        check_like=True,
    )

Conclusion

In this post, I demonstrated how to build a local development environment for AWS Glue 3.0 and later using a custom Docker image and the Visual Studio Code Remote - Containers extension. I then showed examples of launching a PySpark shell, submitting a Spark application and running tests with Pytest. I hope this post is useful for developing and testing Glue ETL scripts locally.