Data Build Tool (dbt) for Effective Data Transformation on AWS

The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In part 2 of the dbt on AWS series, we discuss data transformation pipelines using dbt on AWS Glue. Subsets of IMDb data are used as source and data models are developed in multiple layers according to the dbt best practices.

The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In part 1, we discussed benefits of a common data transformation tool and the potential of dbt to cover a wide range of data projects from data warehousing to data lake to data lakehouse. A demo data project that targets Redshift Serverless is illustrated as well. In part 2 of the dbt on AWS series, we discuss data transformation pipelines using dbt on AWS Glue. Subsets of IMDb data are used as source and data models are developed in multiple layers according to the dbt best practices. A list of posts of this series can be found below.

Below shows an overview diagram of the scope of this dbt on AWS series. Glue is highlighted as it is discussed in this post.

Infrastructure

The infrastructure hosting this solution leverages AWS Glue Data Catalog, AWS Glue Crawlers and a S3 bucket. We also need a runtime IAM role for AWS Glue interactive sessions for data transformation. They are deployed using Terraform and the source can be found in the GitHub repository of this post.

Glue Databases

We have two Glue databases. The source tables and the tables of the staging and intermediate layers are kept in the imdb database. The tables of the marts layer are stored in the _imdb_analytics _database.

 1# dbt-on-aws/glue/infra/main.tf
 2resource "aws_glue_catalog_database" "imdb_db" {
 3  name        = "imdb"
 4  description = "Database that contains IMDb staging/intermediate model datasets"
 5}
 6
 7resource "aws_glue_catalog_database" "imdb_db_marts" {
 8  name        = "imdb_analytics"
 9  description = "Database that contains IMDb marts model datasets"
10}

Glue Crawlers

We use Glue crawlers to create source tables in the imdb database. We can create a single crawler for the seven source tables but it was not satisfactory, especially header detection. Instead, a dedicated crawler is created for each of the tables with its own custom classifier where it includes header columns specifically. The Terraform count meta-argument is used to create the crawlers and classifiers recursively.

 1# dbt-on-aws/glue/infra/main.tf
 2resource "aws_glue_crawler" "imdb_crawler" {
 3  count = length(local.glue.tables)
 4
 5  name          = local.glue.tables[count.index].name
 6  database_name = aws_glue_catalog_database.imdb_db.name
 7  role          = aws_iam_role.imdb_crawler.arn
 8  classifiers   = [aws_glue_classifier.imdb_crawler[count.index].id]
 9
10  s3_target {
11    path = "s3://${local.default_bucket.name}/${local.glue.tables[count.index].name}"
12  }
13
14  tags = local.tags
15}
16
17resource "aws_glue_classifier" "imdb_crawler" {
18  count = length(local.glue.tables)
19
20  name = local.glue.tables[count.index].name
21
22  csv_classifier {
23    contains_header = "PRESENT"
24    delimiter       = "\t"
25    header          = local.glue.tables[count.index].header
26  }
27}
28
29# dbt-on-aws/glue/infra/variables.tf
30locals {
31  name        = basename(path.cwd) == "infra" ? basename(dirname(path.cwd)) : basename(path.cwd)
32  region      = data.aws_region.current.name
33  environment = "dev"
34
35  default_bucket = {
36    name = "${local.name}-${data.aws_caller_identity.current.account_id}-${local.region}"
37  }
38
39  glue = {
40    tables = [
41      { name = "name_basics", header = ["nconst", "primaryName", "birthYear", "deathYear", "primaryProfession", "knownForTitles"] },
42      { name = "title_akas", header = ["titleId", "ordering", "title", "region", "language", "types", "attributes", "isOriginalTitle"] },
43      { name = "title_basics", header = ["tconst", "titleType", "primaryTitle", "originalTitle", "isAdult", "startYear", "endYear", "runtimeMinutes", "genres"] },
44      { name = "title_crew", header = ["tconst", "directors", "writers"] },
45      { name = "title_episode", header = ["tconst", "parentTconst", "seasonNumber", "episodeNumber"] },
46      { name = "title_principals", header = ["tconst", "ordering", "nconst", "category", "job", "characters"] },
47      { name = "title_ratings", header = ["tconst", "averageRating", "numVotes"] }
48    ]
49  }
50
51  tags = {
52    Name        = local.name
53    Environment = local.environment
54  }
55}

Glue Runtime Role for Interactive Sessions

We need a runtime (or service) role for Glue interaction sessions. AWS Glue uses this role to run statements in a session, and it is required to generate a profile by the dbt-glue adapter. Two policies are attached to the runtime role - the former is related to managing Glue interactive sessions while the latter is for actual data transformation by Glue.

  1# dbt-on-aws/glue/infra/main.tf
  2resource "aws_iam_role" "glue_interactive_session" {
  3  name = "${local.name}-glue-interactive-session"
  4
  5  assume_role_policy = data.aws_iam_policy_document.glue_interactive_session_assume_role_policy.json
  6  managed_policy_arns = [
  7    aws_iam_policy.glue_interactive_session.arn,
  8    aws_iam_policy.glue_dbt.arn
  9  ]
 10
 11  tags = local.tags
 12}
 13
 14data "aws_iam_policy_document" "glue_interactive_session_assume_role_policy" {
 15  statement {
 16    actions = ["sts:AssumeRole"]
 17
 18    principals {
 19      type = "Service"
 20      identifiers = [
 21        "lakeformation.amazonaws.com",
 22        "glue.amazonaws.com"
 23      ]
 24    }
 25  }
 26}
 27
 28resource "aws_iam_policy" "glue_interactive_session" {
 29  name   = "${local.name}-glue-interactive-session"
 30  path   = "/"
 31  policy = data.aws_iam_policy_document.glue_interactive_session.json
 32  tags   = local.tags
 33}
 34
 35resource "aws_iam_policy" "glue_dbt" {
 36  name   = "${local.name}-glue-dbt"
 37  path   = "/"
 38  policy = data.aws_iam_policy_document.glue_dbt.json
 39  tags   = local.tags
 40}
 41
 42data "aws_iam_policy_document" "glue_interactive_session" {
 43  statement {
 44    sid = "AllowStatementInASessionToAUser"
 45
 46    actions = [
 47      "glue:ListSessions",
 48      "glue:GetSession",
 49      "glue:ListStatements",
 50      "glue:GetStatement",
 51      "glue:RunStatement",
 52      "glue:CancelStatement",
 53      "glue:DeleteSession"
 54    ]
 55
 56    resources = [
 57      "arn:aws:glue:${local.region}:${data.aws_caller_identity.current.account_id}:session/*",
 58    ]
 59  }
 60
 61  statement {
 62    actions = ["glue:CreateSession"]
 63
 64    resources = ["*"]
 65  }
 66
 67  statement {
 68    actions = ["iam:PassRole"]
 69
 70    resources = ["arn:aws:iam::*:role/${local.name}-glue-interactive-session*"]
 71
 72    condition {
 73      test     = "StringLike"
 74      variable = "iam:PassedToService"
 75
 76      values = ["glue.amazonaws.com"]
 77    }
 78  }
 79
 80  statement {
 81    actions = ["iam:PassRole"]
 82
 83    resources = ["arn:aws:iam::*:role/service-role/${local.name}-glue-interactive-session*"]
 84
 85    condition {
 86      test     = "StringLike"
 87      variable = "iam:PassedToService"
 88
 89      values = ["glue.amazonaws.com"]
 90    }
 91  }
 92}
 93
 94data "aws_iam_policy_document" "glue_dbt" {
 95  statement {
 96    actions = [
 97      "glue:SearchTables",
 98      "glue:BatchCreatePartition",
 99      "glue:CreatePartitionIndex",
100      "glue:DeleteDatabase",
101      "glue:GetTableVersions",
102      "glue:GetPartitions",
103      "glue:DeleteTableVersion",
104      "glue:UpdateTable",
105      "glue:DeleteTable",
106      "glue:DeletePartitionIndex",
107      "glue:GetTableVersion",
108      "glue:UpdateColumnStatisticsForTable",
109      "glue:CreatePartition",
110      "glue:UpdateDatabase",
111      "glue:CreateTable",
112      "glue:GetTables",
113      "glue:GetDatabases",
114      "glue:GetTable",
115      "glue:GetDatabase",
116      "glue:GetPartition",
117      "glue:UpdateColumnStatisticsForPartition",
118      "glue:CreateDatabase",
119      "glue:BatchDeleteTableVersion",
120      "glue:BatchDeleteTable",
121      "glue:DeletePartition",
122      "glue:GetUserDefinedFunctions"
123    ]
124
125    resources = [
126      "arn:aws:glue:${local.region}:${data.aws_caller_identity.current.account_id}:catalog",
127      "arn:aws:glue:${local.region}:${data.aws_caller_identity.current.account_id}:table/*/*",
128      "arn:aws:glue:${local.region}:${data.aws_caller_identity.current.account_id}:database/*",
129    ]
130  }
131
132  statement {
133    actions = [
134      "lakeformation:UpdateResource",
135      "lakeformation:ListResources",
136      "lakeformation:BatchGrantPermissions",
137      "lakeformation:GrantPermissions",
138      "lakeformation:GetDataAccess",
139      "lakeformation:GetTableObjects",
140      "lakeformation:PutDataLakeSettings",
141      "lakeformation:RevokePermissions",
142      "lakeformation:ListPermissions",
143      "lakeformation:BatchRevokePermissions",
144      "lakeformation:UpdateTableObjects"
145    ]
146
147    resources = ["*"]
148  }
149
150  statement {
151    actions = [
152      "s3:GetBucketLocation",
153      "s3:ListBucket"
154    ]
155
156    resources = ["arn:aws:s3:::${local.default_bucket.name}"]
157  }
158
159  statement {
160    actions = [
161      "s3:PutObject",
162      "s3:PutObjectAcl",
163      "s3:GetObject",
164      "s3:DeleteObject"
165    ]
166
167    resources = ["arn:aws:s3:::${local.default_bucket.name}/*"]
168  }
169}

Project

We build a data transformation pipeline using subsets of IMDb data - seven titles and names related datasets are provided as gzipped, tab-separated-values (TSV) formatted files. This results in three tables that can be used for reporting and analysis.

Save Data to S3

The Axel download accelerator is used to download the data files locally followed by decompressing with the gzip utility. Note that simple retry logic is added as I see download failure from time to time. Finally, the decompressed files are saved into the project S3 bucket using the S3 sync command.

 1# dbt-on-aws/glue/upload-data.sh
 2#!/usr/bin/env bash
 3
 4s3_bucket=$(terraform -chdir=./infra output --raw default_bucket_name)
 5hostname="datasets.imdbws.com"
 6declare -a file_names=(
 7  "name.basics.tsv.gz" \
 8  "title.akas.tsv.gz" \
 9  "title.basics.tsv.gz" \
10  "title.crew.tsv.gz" \
11  "title.episode.tsv.gz" \
12  "title.principals.tsv.gz" \
13  "title.ratings.tsv.gz"
14  )
15
16rm -rf imdb-data
17
18for fn in "${file_names[@]}"
19do
20  download_url="https://$hostname/$fn"
21  prefix=$(echo ${fn::-7} | tr '.' '_')
22  echo "download imdb-data/$prefix/$fn from $download_url"
23  while true;
24  do
25    mkdir -p imdb-data/$prefix
26    axel -n 32 -a -o imdb-data/$prefix/$fn $download_url
27    gzip -d imdb-data/$prefix/$fn
28    num_files=$(ls imdb-data/$prefix | wc -l)
29    if [ $num_files == 1 ]; then
30      break
31    fi
32    rm -rf imdb-data/$prefix
33  done
34done
35
36aws s3 sync ./imdb-data s3://$s3_bucket

Run Glue Crawlers

The Glue crawlers for the seven source tables are executed as shown below.

 1# dbt-on-aws/glue/start-crawlers.sh
 2#!/usr/bin/env bash
 3
 4declare -a crawler_names=(
 5  "name_basics" \
 6  "title_akas" \
 7  "title_basics" \
 8  "title_crew" \
 9  "title_episode" \
10  "title_principals" \
11  "title_ratings"
12  )
13
14for cn in "${crawler_names[@]}"
15do
16  echo "start crawler $cn ..."
17  aws glue start-crawler --name $cn
18done

Note that the header rows of the source tables are not detected properly by the Glue crawlers, and they have to be filtered out in the stage models of the dbt project.

Setup dbt Project

We need the dbt-core and dbt-glue packages for the main data transformation project. Also the boto3 and aws-glue-sessions packages are necessary for setting up interactive sessions locally. The dbt Glue adapter doesn’t support creating a profile with the dbt init command so that profile creation is skipped when initialising the project.

1$ pip install --no-cache-dir --upgrade boto3 aws-glue-sessions dbt-core dbt-glue
2$ dbt init --skip-profile-setup
310:04:00  Running with dbt=1.2.1
4Enter a name for your project (letters, digits, underscore): dbt_glue_proj

The project profile is manually created as shown below. The attributes are self-explanatory and their details can be checked further in the GitHub repository of the dbt-glue adapter.

 1# dbt-on-aws/glue/set-profile.sh
 2#!/usr/bin/env bash
 3
 4dbt_role_arn=$(terraform -chdir=./infra output --raw glue_interactive_session_role_arn)
 5dbt_s3_location=$(terraform -chdir=./infra output --raw default_bucket_name)
 6
 7cat << EOF > ~/.dbt/profiles.yml
 8dbt_glue_proj:
 9  outputs:
10    dev:
11      type: glue
12      role_arn: "${dbt_role_arn}"
13      region: ap-southeast-2
14      workers: 3
15      worker_type: G.1X
16      schema: imdb
17      database: imdb
18      session_provisioning_timeout_in_seconds: 240
19      location: "s3://${dbt_s3_location}"
20      query_timeout_in_seconds: 300
21      idle_timeout: 60
22      glue_version: "3.0"
23  target: dev
24EOF

dbt initialises a project in a folder that matches to the project name and generates project boilerplate as shown below. Some of the main objects are dbt_project.yml, and the model folder. The former is required because dbt doesn’t know if a folder is a dbt project without it. Also, it contains information that tells dbt how to operate on the project. The latter is for including dbt models, which is basically a set of SQL select statements. See dbt documentation for more details.

 1$ tree glue/dbt_glue_proj/ -L 1
 2glue/dbt_glue_proj/
 3├── README.md
 4├── analyses
 5├── dbt_packages
 6├── dbt_project.yml
 7├── logs
 8├── macros
 9├── models
10├── packages.yml
11├── seeds
12├── snapshots
13├── target
14└── tests

We can check Glue interactive session connection with the dbt debug command as shown below.

 1$ dbt debug
 208:50:58  Running with dbt=1.2.1
 3dbt version: 1.2.1
 4python version: 3.8.10
 5python path: <path-to-python-path>
 6os info: Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.29
 7Using profiles.yml file at /home/<username>/.dbt/profiles.yml
 8Using dbt_project.yml file at <path-to-dbt-project>/dbt_project.yml
 9
10Configuration:
11  profiles.yml file [OK found and valid]
12  dbt_project.yml file [OK found and valid]
13
14Required dependencies:
15 - git [OK found]
16
17Connection:
18  role_arn: <glue-interactive-session-role-arn>
19  region: ap-southeast-2
20  session_id: None
21  workers: 3
22  worker_type: G.1X
23  session_provisioning_timeout_in_seconds: 240
24  database: imdb
25  schema: imdb
26  location: s3://<s3-bucket-name>
27  extra_jars: None
28  idle_timeout: 60
29  query_timeout_in_seconds: 300
30  glue_version: 3.0
31  security_configuration: None
32  connections: None
33  conf: None
34  extra_py_files: None
35  delta_athena_prefix: None
36  tags: None
37  Connection test: [OK connection ok]
38
39All checks passed!

After initialisation, the model configuration is updated. The project materialisation is specified as view although it is the default materialisation. Also tags are added to the entire model folder as well as folders of specific layers - staging, intermediate and marts. As shown below, tags can simplify model execution.

 1# glue/dbt_glue_proj/dbt_project.yml
 2name: "dbt_glue_proj"
 3
 4...
 5
 6models:
 7  dbt_glue_proj:
 8    +materialized: view
 9    +tags:
10      - "imdb"
11    staging:
12      +tags:
13        - "staging"
14    intermediate:
15      +tags:
16        - "intermediate"
17    marts:
18      +tags:
19        - "marts"

The dbt_utils package is installed for adding tests to the final marts models. The packages can be installed by the dbt deps command.

1# glue/dbt_glue_proj/packages.yml
2packages:
3  - package: dbt-labs/dbt_utils
4    version: 0.9.2

Create dbt Models

The models for this post are organised into three layers according to the dbt best practices - staging, intermediate and marts.

Staging

The seven tables that are loaded from S3 are dbt source tables and their details are declared in a YAML file (_imdb_sources.yml). By doing so, we are able to refer to the source tables with the {{ source() }} function. Also we can add tests to source tables. For example below two tests (unique, not_null) are added to the tconst column of the title_basics table below and these tests can be executed by the dbt test command.

 1# glue/dbt_glue_proj/models/staging/imdb/_imdb__sources.yml
 2version: 2
 3
 4sources:
 5  - name: imdb
 6    description: Subsets of IMDb data, which are available for access to customers for personal and non-commercial use
 7    tables:
 8      - name: title_basics
 9        description: Table that contains basic information of titles
10        columns:
11          - name: tconst
12            description: alphanumeric unique identifier of the title
13            tests:
14              - unique
15              - not_null
16          - name: titletype
17            description: the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
18          - name: primarytitle
19            description: the more popular title / the title used by the filmmakers on promotional materials at the point of release
20          - name: originaltitle
21            description: original title, in the original language
22          - name: isadult
23            description: flag that indicates whether it is an adult title or not
24          - name: startyear
25            description: represents the release year of a title. In the case of TV Series, it is the series start year
26          - name: endyear
27            description: TV Series end year. NULL for all other title types
28          - name: runtime minutes
29            description: primary runtime of the title, in minutes
30          - name: genres
31            description: includes up to three genres associated with the title
32      ...

Based on the source tables, staging models are created. They are created as views, which is the project’s default materialisation. In the SQL statements, column names and data types are modified mainly.

 1# glue/dbt_glue_proj/models/staging/imdb/stg_imdb__title_basics.sql
 2with source as (
 3
 4    select * from {{ source('imdb', 'title_basics') }}
 5
 6),
 7
 8renamed as (
 9
10    select
11        tconst as title_id,
12        titletype as title_type,
13        primarytitle as primary_title,
14        originaltitle as original_title,
15        cast(isadult as boolean) as is_adult,
16        cast(startyear as int) as start_year,
17        cast(endyear as int) as end_year,
18        cast(runtimeminutes as int) as runtime_minutes,
19        genres
20    from source
21    where tconst <> 'tconst'
22
23)
24
25select * from renamed

Below shows the file tree of the staging models. The staging models can be executed using the dbt run command. As we’ve added tags to the staging layer models, we can limit to execute only this layer by dbt run --select staging.

 1$ tree glue/dbt_glue_proj/models/staging/
 2glue/dbt_glue_proj/models/staging/
 3└── imdb
 4    ├── _imdb__models.yml
 5    ├── _imdb__sources.yml
 6    ├── stg_imdb__name_basics.sql
 7    ├── stg_imdb__title_akas.sql
 8    ├── stg_imdb__title_basics.sql
 9    ├── stg_imdb__title_crews.sql
10    ├── stg_imdb__title_episodes.sql
11    ├── stg_imdb__title_principals.sql
12    └── stg_imdb__title_ratings.sql

Note that the model materialisation of the staging and intermediate models is view and the dbt project creates VIRTUAL_VIEW tables. Although we are able to reference those tables in other models, they cannot be queried by Athena.

 1$ aws glue get-tables --database imdb \
 2 --query "TableList[?Name=='stg_imdb__title_basics'].[Name, TableType, StorageDescriptor.Columns]" --output yaml
 3- - stg_imdb__title_basics
 4  - VIRTUAL_VIEW
 5  - - Name: title_id
 6      Type: string
 7    - Name: title_type
 8      Type: string
 9    - Name: primary_title
10      Type: string
11    - Name: original_title
12      Type: string
13    - Name: is_adult
14      Type: boolean
15    - Name: start_year
16      Type: int
17    - Name: end_year
18      Type: int
19    - Name: runtime_minutes
20      Type: int
21    - Name: genres
22      Type: string

Instead we can use Glue Studio notebooks to query the tables, which is a bit inconvenient.

Intermediate

We can keep intermediate results in this layer so that the models of the final marts layer can be simplified. The source data includes columns where array values are kept as comma separated strings. For example, the genres column of the stg_imdb__title_basics model includes up to three genre values as shown in the previous screenshot. A total of seven columns in three models are columns of comma-separated strings and it is better to flatten them in the intermediate layer. Also, in order to avoid repetition, a dbt macro (f_latten_fields_) is created to share the column-flattening logic.

1# glue/dbt_glue_proj/macros/flatten_fields.sql
2{% macro flatten_fields(model, field_name, id_field_name) %}
3    select
4        {{ id_field_name }} as id,
5        explode(split({{ field_name }}, ',')) as field
6    from {{ model }}
7{% endmacro %}

The macro function can be added inside a common table expression (CTE) by specifying the relevant model, field name to flatten and id field name.

 1-- glue/dbt_glue_proj/models/intermediate/title/int_genres_flattened_from_title_basics.sql
 2with flattened as (
 3    {{ flatten_fields(ref('stg_imdb__title_basics'), 'genres', 'title_id') }}
 4)
 5
 6select
 7    id as title_id,
 8    field as genre
 9from flattened
10order by id

The intermediate models are also materialised as views and we can check the array columns are flattened as expected.

Below shows the file tree of the intermediate models. Similar to the staging models, the intermediate models can be executed by dbt run --select intermediate.

 1$ tree glue/dbt_glue_proj/models/intermediate/ glue/dbt_glue_proj/macros/
 2glue/dbt_glue_proj/models/intermediate/
 3├── name
 4│   ├── _int_name__models.yml
 5│   ├── int_known_for_titles_flattened_from_name_basics.sql
 6│   └── int_primary_profession_flattened_from_name_basics.sql
 7└── title
 8    ├── _int_title__models.yml
 9    ├── int_directors_flattened_from_title_crews.sql
10    ├── int_genres_flattened_from_title_basics.sql
11    └── int_writers_flattened_from_title_crews.sql
12
13glue/dbt_glue_proj/macros/
14└── flatten_fields.sql

Marts

The models in the marts layer are configured to be materialised as tables in a custom schema. Their materialisation is set to table and the custom schema is specified as analytics while taking _parquet _as the file format. Note that the custom schema name becomes imdb_analytics according to the naming convention of dbt custom schemas. Models of both the staging and intermediate layers are used to create final models to be used for reporting and analytics.

 1-- glue/dbt_glue_proj/models/marts/analytics/titles.sql
 2{{
 3    config(
 4        schema='analytics',
 5        materialized='table',
 6        file_format='parquet'
 7    )
 8}}
 9
10with titles as (
11
12    select * from {{ ref('stg_imdb__title_basics') }}
13
14),
15
16principals as (
17
18    select
19        title_id,
20        count(name_id) as num_principals
21    from {{ ref('stg_imdb__title_principals') }}
22    group by title_id
23
24),
25
26names as (
27
28    select
29        title_id,
30        count(name_id) as num_names
31    from {{ ref('int_known_for_titles_flattened_from_name_basics') }}
32    group by title_id
33
34),
35
36ratings as (
37
38    select
39        title_id,
40        average_rating,
41        num_votes
42    from {{ ref('stg_imdb__title_ratings') }}
43
44),
45
46episodes as (
47
48    select
49        parent_title_id,
50        count(title_id) as num_episodes
51    from {{ ref('stg_imdb__title_episodes') }}
52    group by parent_title_id
53
54),
55
56distributions as (
57
58    select
59        title_id,
60        count(title) as num_distributions
61    from {{ ref('stg_imdb__title_akas') }}
62    group by title_id
63
64),
65
66final as (
67
68    select
69        t.title_id,
70        t.title_type,
71        t.primary_title,
72        t.original_title,
73        t.is_adult,
74        t.start_year,
75        t.end_year,
76        t.runtime_minutes,
77        t.genres,
78        p.num_principals,
79        n.num_names,
80        r.average_rating,
81        r.num_votes,
82        e.num_episodes,
83        d.num_distributions
84    from titles as t
85    left join principals as p on t.title_id = p.title_id
86    left join names as n on t.title_id = n.title_id
87    left join ratings as r on t.title_id = r.title_id
88    left join episodes as e on t.title_id = e.parent_title_id
89    left join distributions as d on t.title_id = d.title_id
90
91)
92
93select * from final

The details of the three models can be found in a YAML file (_analytics__models.yml). We can add tests to models and below we see tests of row count matching to their corresponding staging models.

 1# glue/dbt_glue_proj/models/marts/analytics/_analytics__models.yml
 2version: 2
 3
 4models:
 5  - name: names
 6    description: Table that contains all names with additional details
 7    tests:
 8      - dbt_utils.equal_rowcount:
 9          compare_model: ref('stg_imdb__name_basics')
10  - name: titles
11    description: Table that contains all titles with additional details
12    tests:
13      - dbt_utils.equal_rowcount:
14          compare_model: ref('stg_imdb__title_basics')
15  - name: genre_titles
16    description: Table that contains basic title details after flattening genres

The models of the marts layer can be tested using the dbt test command as shown below.

 1$ dbt test --select marts
 209:11:51  Running with dbt=1.2.1
 309:11:51  Found 15 models, 17 tests, 0 snapshots, 0 analyses, 521 macros, 0 operations, 0 seed files, 7 sources, 0 exposures, 0 metrics
 409:11:51  
 509:12:28  Concurrency: 1 threads (target='dev')
 609:12:28  
 709:12:28  1 of 2 START test dbt_utils_equal_rowcount_names_ref_stg_imdb__name_basics_ .... [RUN]
 809:12:53  1 of 2 PASS dbt_utils_equal_rowcount_names_ref_stg_imdb__name_basics_ .......... [PASS in 24.94s]
 909:12:53  2 of 2 START test dbt_utils_equal_rowcount_titles_ref_stg_imdb__title_basics_ .. [RUN]
1009:13:04  2 of 2 PASS dbt_utils_equal_rowcount_titles_ref_stg_imdb__title_basics_ ........ [PASS in 11.63s]
1109:13:07  
1209:13:07  Finished running 2 tests in 0 hours 1 minutes and 15.74 seconds (75.74s).
1309:13:07  
1409:13:07  Completed successfully
1509:13:07  
1609:13:07  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

Below shows the file tree of the marts models. As with the other layers, the marts models can be executed by dbt run –select marts.

$ tree glue/dbt_glue_proj/models/marts/
glue/dbt_glue_proj/models/marts/
└── analytics
    ├── _analytics__models.yml
    ├── genre_titles.sql
    ├── names.sql
    └── titles.sql

Build Dashboard

The models of the marts layer can be consumed by external tools such as Amazon QuickSight. Below shows an example dashboard. The two pie charts on top show proportions of genre and title type. The box plots at the bottom show dispersion of the number of votes and average rating by title type.

Generate dbt Documentation

A nice feature of dbt is documentation. It provides information about the project and the data warehouse, and it facilitates consumers as well as other developers to discover and understand the datasets better. We can generate the project documents and start a document server as shown below.

1$ dbt docs generate
2$ dbt docs serve

A very useful element of dbt documentation is data lineage, which provides an overall view about how data is transformed and consumed. Below we can see that the final titles model consumes all title-related stating models and an intermediate model from the name basics staging model.

Summary

In this post, we discussed how to build data transformation pipelines using dbt on AWS Glue. Subsets of IMDb data are used as source and data models are developed in multiple layers according to the dbt best practices. dbt can be used as an effective tool for data transformation in a wide range of data projects from data warehousing to data lake to data lakehouse and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. More examples of using dbt will be discussed in subsequent posts.

Data Build Tool (Dbt) for Effective Data Transformation on AWS – Part 2 Glue

Contents

Infrastructure

Glue Databases

Glue Crawlers

Glue Runtime Role for Interactive Sessions

Project

Save Data to S3

Run Glue Crawlers

Setup dbt Project

Create dbt Models

Staging

Intermediate

Marts

Build Dashboard

Generate dbt Documentation

Summary

Comments

Data Build Tool (Dbt) for Effective Data Transformation on AWS – Part 2 Glue

Contents

Infrastructure

Glue Databases

Glue Crawlers

Glue Runtime Role for Interactive Sessions

Project

Save Data to S3

Run Glue Crawlers

Setup dbt Project

Create dbt Models

Staging

Intermediate

Marts

Build Dashboard

Generate dbt Documentation

Summary

Related Posts

Comments