The data build tool (dbt) is an effective data transformation tool and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. In part 1 of the dbt on AWS series, we discuss data transformation pipelines using dbt on Redshift Serverless. Subsets of IMDb data are used as source and data models are developed in multiple layers according to the dbt best practices.


In our experience delivering data solutions for our customers, we have observed a desire to move away from a centralised team function, responsible for the data collection, analysis and reporting, towards shifting this responsibility to an organisation’s lines of business (LOB) teams. The key driver for this comes from the recognition that LOBs retain the deep data knowledge and business understanding for their respective data domain; which improves the speed with which these teams can develop data solutions and gain customer insights. This shift away from centralised data engineering to LOBs exposed a skills and tooling gap.

Let’s assume as a starting point that the central data engineering team has chosen a project that migrates an on-premise data warehouse into a data lake (spark + iceberg + redshift) on AWS, to provide a cost-effective way to serve data consumers thanks to iceberg’s ACID transaction features. The LOB data engineers are new to spark, and they have a little bit of experience in python while the majority of their work is based on SQL. Thanks to their expertise in SQL, however, they are able to get started building data transformation logic on jupyter notebooks using Pyspark. However, they soon find the codebase gets quite bigger even during the minimum valuable product (MVP) phase, which would only amplify the issue as they extend it to cover the entire data warehouse. Additionally the use of notebooks makes development challenging mainly due to lack of modularity and fai,ling to incorporate testing. Upon contacting the central data engineering team for assistance they are advised that the team uses scala and many other tools (e.g. Metorikku) that are successful for them, however cannot be used directly by the engineers of the LOB. Moreover, the engineering team don’t even have a suitable data transformation framework that supports iceberg. The LOB data engineering team understand that the data democratisation plan of the enterprise can be more effective if there is a tool or framework that:

  • can be shared across LOBs although they can have different technology stack and practices,
  • fits into various project types from traditional data warehousing to data lakehouse projects, and
  • supports more than a notebook environment by facilitating code modularity and incorporating testing.

The data build tool (dbt) is an open-source command line tool, and it does the T in ELT (Extract, Load, Transform) processes well. It supports a wide range of data platforms and the following key AWS analytics services are covered - Redshift, Glue, EMR and Athena. It is one of the most popular tools in the modern data stack that originally covers data warehousing projects. Its scope is extended to data lake projects by the addition of the dbt-spark and dbt-glue adapter where we can develop data lakes with spark SQL. Recently the spark adapter added open source table formats (hudi, iceberg and delta lake) as the supported file formats, and it allows you to work on data lake house projects with it. As discussed in this blog post, dbt has clear advantages compared to spark in terms of

  • low learning curve as SQL is easier than spark
  • better code organisation as there is no correct way of organising transformation pipeline with spark

On the other hand, its weaknesses are

  • lack of expressiveness as Jinja is quite heavy and verbose, not very readable, and unit-testing is rather tedious
  • limitation of SQL as some logic is much easier to implement with user defined functions rather than SQL

Those weaknesses can be overcome by Python models as it allows you to apply transformations as DataFrame operations. Unfortunately the beta feature is not available on any of the AWS services, however it is available on Snowflake, Databricks and BigQuery. Hopefully we can use this feature on Redshift, Glue and EMR in the near future.

Finally, the following areas are supported by spark, however not supported by DBT:

  • E and L of ELT processes
  • real time data processing

Overall dbt can be used as an effective tool for data transformation in a wide range of data projects from data warehousing to data lake to data lakehouse. Also it can be more powerful with spark by its Python models feature. Below shows an overview diagram of the scope of this dbt on AWS series. Redshift is highlighted as it is discussed in this post.


A VPC with 3 public and private subnets is created using the AWS VPC Terraform module. The following Redshift serverless resources are deployed to work with the dbt project. As explained in the Redshift user guide, a namespace is a collection of database objects and users and a workgroup is a collection of compute resources. We also need a Redshift-managed VPC endpoint for a private connection from a client tool. The source can be found in the GitHub repository of this post.

 1# redshift-sls/infra/
 2resource "aws_redshiftserverless_namespace" "namespace" {
 3  namespace_name = "${}-namespace"
 5  admin_username       = local.redshift.admin_username
 6  admin_user_password  = local.secrets.redshift_admin_password
 7  db_name              = local.redshift.db_name
 8  default_iam_role_arn = aws_iam_role.redshift_serverless_role.arn
 9  iam_roles            = [aws_iam_role.redshift_serverless_role.arn]
11  tags = local.tags
14resource "aws_redshiftserverless_workgroup" "workgroup" {
15  namespace_name =
16  workgroup_name = "${}-workgroup"
18  base_capacity      = local.redshift.base_capacity # 128 
19  subnet_ids         = module.vpc.private_subnets
20  security_group_ids = []
22  tags = local.tags
25resource "aws_redshiftserverless_endpoint_access" "endpoint_access" {
26  endpoint_name = "${}-endpoint"
28  workgroup_name =
29  subnet_ids     = module.vpc.private_subnets

As in the previous post, we connect to Redshift via SoftEther VPN to improve developer experience significantly by accessing the database directly from the developer machine. Instead of providing VPN related secrets as Terraform variables in the earlier post, they are created internally and stored to AWS Secrets Manager. Also, the Redshift admin username and password are included so that the secrets can be accessed securely. The details can be found in redshift-sls/infra/ and the secret string can be retrieved as shown below.

1$ aws secretsmanager get-secret-value --secret-id redshift-sls-all-secrets --query "SecretString" --output text
2  {
3    "vpn_pre_shared_key": "<vpn-pre-shared-key>",
4    "vpn_admin_password": "<vpn-admin-password>",
5    "redshift_admin_username": "master",
6    "redshift_admin_password": "<redshift-admin-password>"
7  }

The previous post demonstrates how to create a VPN user and to establish connection in detail. An example of a successful connection is shown below.


We build a data transformation pipeline using subsets of IMDb data - seven titles and names related datasets are provided as gzipped, tab-separated-values (TSV) formatted files. This results in three tables that can be used for reporting and analysis.

Create Database Objects

The majority of data transformation is performed in the imdb schema, which is configured as the dbt target schema. We create the final three tables in a custom schema named imdb_analytics. Note that its name is according to the naming convention of the dbt custom schema, which is <target_schema><custom_schema>_. After creating the database schemas, we create a development user (dbt) and a group that the user belongs to, followed by granting necessary permissions of the new schemas to the new group and reassigning schema ownership to the new user.

 1-- redshift-sls/setup-redshift.sql
 2-- // create db schemas
 3create schema if not exists imdb;
 4create schema if not exists imdb_analytics;
 6-- // create db user and group
 7create user dbt with password '<password>';
 8create group dbt with user dbt;
10-- // grant permissions to new schemas
11grant usage on schema imdb to group dbt;
12grant create on schema imdb to group dbt;
13grant all on all tables in schema imdb to group dbt;
15grant usage on schema imdb_analytics to group dbt;
16grant create on schema imdb_analytics to group dbt;
17grant all on all tables in schema imdb_analytics to group dbt;
19-- reassign schema ownership to dbt
20alter schema imdb owner to dbt;
21alter schema imdb_analytics owner to dbt;

Save Data to S3

The Axel download accelerator is used to download the data files locally followed by decompressing with the gzip utility. Note that simple retry logic is added as I see download failure from time to time. Finally the decompressed files are saved into the project S3 bucket using the S3 sync command.,

 1# redshift-sls/
 2#!/usr/bin/env bash
 6declare -a file_names=(
 7  "name.basics.tsv.gz" \
 8  "title.akas.tsv.gz" \
 9  "title.basics.tsv.gz" \
10  "title.crew.tsv.gz" \
11  "title.episode.tsv.gz" \
12  "title.principals.tsv.gz" \
13  "title.ratings.tsv.gz"
14  )
16rm -rf imdb-data
18for fn in "${file_names[@]}"
20  download_url="https://$hostname/$fn"
21  prefix=$(echo ${fn::-7} | tr '.' '_')
22  echo "download imdb-data/$prefix/$fn from $download_url"
23  # download can fail, retry after removing temporary files if failed
24  while true;
25  do
26    mkdir -p imdb-data/$prefix
27    axel -n 32 -a -o imdb-data/$prefix/$fn $download_url
28    gzip -d imdb-data/$prefix/$fn
29    num_files=$(ls imdb-data/$prefix | wc -l)
30    if [ $num_files == 1 ]; then
31      break
32    fi
33    rm -rf imdb-data/$prefix
34  done
37aws s3 sync ./imdb-data s3://$s3_bucket

Copy Data

The data files in S3 are loaded into Redshift using the COPY command as shown below.

  1-- redshift-sls/setup-redshift.sql
  2-- // copy data to tables
  3-- name_basics
  4drop table if exists imdb.name_basics;
  5create table imdb.name_basics (
  6    nconst text,
  7    primary_name text,
  8    birth_year text,
  9    death_year text,
 10    primary_profession text,
 11    known_for_titles text
 14copy imdb.name_basics
 15from 's3://<s3-bucket-name>/name_basics'
 16iam_role default
 17delimiter '\t'
 18region 'ap-southeast-2'
 19ignoreheader 1;
 21-- title_akas
 22drop table if exists imdb.title_akas;
 23create table imdb.title_akas (
 24    title_id text,
 25    ordering int,
 26    title varchar(max),
 27    region text,
 28    language text,
 29    types text,
 30    attributes text,
 31    is_original_title boolean
 34copy imdb.title_akas
 35from 's3://<s3-bucket-name>/title_akas'
 36iam_role default
 37delimiter '\t'
 38region 'ap-southeast-2'
 39ignoreheader 1;
 41-- title_basics
 42drop table if exists imdb.title_basics;
 43create table imdb.title_basics (
 44    tconst text,
 45    title_type text,
 46    primary_title varchar(max),
 47    original_title varchar(max),
 48    is_adult boolean,
 49    start_year text,
 50    end_year text,
 51    runtime_minutes text,
 52    genres text
 55copy imdb.title_basics
 56from 's3://<s3-bucket-name>/title_basics'
 57iam_role default
 58delimiter '\t'
 59region 'ap-southeast-2'
 60ignoreheader 1;
 62-- title_crews
 63drop table if exists imdb.title_crews;
 64create table imdb.title_crews (
 65    tconst text,
 66    directors varchar(max),
 67    writers varchar(max)
 70copy imdb.title_crews
 71from 's3://<s3-bucket-name>/title_crew'
 72iam_role default
 73delimiter '\t'
 74region 'ap-southeast-2'
 75ignoreheader 1;
 77-- title_episodes
 78drop table if exists imdb.title_episodes;
 79create table imdb.title_episodes (
 80    tconst text,
 81    parent_tconst text,
 82    season_number int,
 83    episode_number int
 86copy imdb.title_episodes
 87from 's3://<s3-bucket-name>/title_episode'
 88iam_role default
 89delimiter '\t'
 90region 'ap-southeast-2'
 91ignoreheader 1;
 93-- title_principals
 94drop table if exists imdb.title_principals;
 95create table imdb.title_principals (
 96    tconst text,
 97    ordering int,
 98    nconst text,
 99    category text,
100    job varchar(max),
101    characters varchar(max)
104copy imdb.title_principals
105from 's3://<s3-bucket-name>/title_principals'
106iam_role default
107delimiter '\t'
108region 'ap-southeast-2'
109ignoreheader 1;
111-- title_ratings
112drop table if exists imdb.title_ratings;
113create table imdb.title_ratings (
114    tconst text,
115    average_rating float,
116    num_votes int
119copy imdb.title_ratings
120from 's3://<s3-bucket-name>/title_ratings'
121iam_role default
122delimiter '\t'
123region 'ap-southeast-2'
124ignoreheader 1;

Initialise dbt Project

We need the dbt-core and dbt-redshift. Once installed, we can initialise a dbt project with the dbt init command. We are required to specify project details such as project name, database adapter and database connection info. Note dbt creates the project profile to .dbt/profile.yml of the user home directory by default.

 1$ dbt init
 207:07:16  Running with dbt=1.2.1
 3Enter a name for your project (letters, digits, underscore): dbt_redshift_sls
 4Which database would you like to use?
 5[1] postgres
 6[2] redshift
 8(Don't see the one you want?
10Enter a number: 2
11host ( <redshift-endpoint-url>
12port [5439]:
13user (dev username): dbt
14[1] password
15[2] iam
16Desired authentication method option (enter a number): 1
17password (dev password):
18dbname (default database that dbt will build objects in): main
19schema (default schema that dbt will build objects in): gdelt
20threads (1 or more) [1]: 4
2107:08:13  Profile dbt_redshift_sls written to /home/<username>/.dbt/profiles.yml using target's profile_template.yml and your supplied values. Run 'dbt debug' to validate the connection.
23Your new dbt project "dbt_redshift_sls" was created!
25For more information on how to configure the profiles.yml file,
26please consult the dbt documentation here:
30One more thing:
32Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:
36Happy modeling!

dbt initialises a project in a folder that matches to the project name and generates project boilerplate as shown below. Some of the main objects are dbt_project.yml, and the model folder. The former is required because dbt doesn’t know if a folder is a dbt project without it. Also it contains information that tells dbt how to operate on the project. The latter is for including dbt models, which is basically a set of SQL select statements. See dbt documentation for more details.

 1$ tree dbt_redshift_sls/ -L 1
 4├── analyses
 5├── dbt_project.yml
 6├── macros
 7├── models
 8├── seeds
 9├── snapshots
10└── tests

We can check the database connection with the dbt debug command. Do not forget to connect to VPN as mentioned earlier.

 1$ dbt debug
 203:50:58  Running with dbt=1.2.1
 3dbt version: 1.2.1
 4python version: 3.8.10
 5python path: <path-to-python-path>
 6os info: Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.29
 7Using profiles.yml file at /home/<username>/.dbt/profiles.yml
 8Using dbt_project.yml file at <path-to-dbt-project>/dbt_project.yml
11  profiles.yml file [OK found and valid]
12  dbt_project.yml file [OK found and valid]
14Required dependencies:
15 - git [OK found]
18  host: <redshift-endpoint-url>
19  port: 5439
20  user: dbt
21  database: main
22  schema: imdb
23  search_path: None
24  keepalives_idle: 240
25  sslmode: None
26  method: database
27  cluster_id: None
28  iam_profile: None
29  iam_duration_seconds: 900
30  Connection test: [OK connection ok]
32All checks passed!

After initialisation, the model configuration is updated. The project materialisation is specified as view although it is the default materialisation. Also tags are added to the entire model folder as well as folders of specific layers - staging, intermediate and marts. As shown below, tags can simplify model execution.

 1# redshift-sls/dbt_redshift_sls/dbt_project.yml
 2name: "dbt_redshift_sls"
 6  dbt_redshift_sls:
 7    +materialized: view
 8    +tags:
 9      - "imdb"
10    staging:
11      +tags:
12        - "staging"
13    intermediate:
14      +tags:
15        - "intermediate"
16    marts:
17      +tags:
18        - "marts"

Two dbt packages are used in this project. The dbt-labs/codegen is used to save typing when generating the source and base models while dbt-labda/dbt_utils for adding tests to the final marts models. The packages can be installed by the dbt deps command.

1# redshift-sls/dbt_redshift_sls/packages.yml
3  - package: dbt-labs/codegen
4    version: 0.8.0
5  - package: dbt-labs/dbt_utils
6    version: 0.9.2

Create dbt Models

The models for this post are organised into three layers according to the dbt best practices - staging, intermediate and marts.


The seven tables that are loaded from S3 are dbt source tables and their details are declared in a YAML file (_imdb_sources.yml). By doing so, we are able to refer to the source tables with the {{ source() }} function. Also we can add tests to source tables. For example below two tests (unique, not_null) are added to the tconst column of the title_basics table below and these tests can be executed by the dbt test command.

 1# redshift-sls/dbt_redshift_sls/models/staging/imdb/_imdb__sources.yml
 2version: 2
 5  - name: imdb
 6    description: Subsets of IMDb data, which are available for access to customers for personal and non-commercial use
 7    tables:
 8      ...
 9      - name: title_basics
10        description: Table that contains basic information of titles
11        columns:
12          - name: tconst
13            description: alphanumeric unique identifier of the title
14            tests:
15              - unique
16              - not_null
17          - name: title_type
18            description: the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
19          - name: primary_title
20            description: the more popular title / the title used by the filmmakers on promotional materials at the point of release
21          - name: original_title
22            description: original title, in the original language
23          - name: is_adult
24            description: flag that indicates whether it is an adult title or not
25          - name: start_year
26            description: represents the release year of a title. In the case of TV Series, it is the series start year
27          - name: end_year
28            description: TV Series end year. NULL for all other title types
29          - name: runtime_minutes
30            description: primary runtime of the title, in minutes
31          - name: genres
32            description: includes up to three genres associated with the title
33      ...

Based on the source tables, staging models are created. They are created as views, which is the project’s default materialisation. In the SQL statements, column names and data types are modified mainly.

 1-- // redshift-sls/dbt_redshift_sls/models/staging/imdb/stg_imdb__title_basics.sql
 2with source as (
 4    select * from {{ source('imdb', 'title_basics') }}
 8renamed as (
10    select
11        tconst as title_id,
12        title_type,
13        primary_title,
14        original_title,
15        is_adult,
16        start_year::int as start_year,
17        end_year::int as end_year,
18        runtime_minutes::int as runtime_minutes,
19        genres
20    from source
24select * from renamed

Below shows the file tree of the staging models. The staging models can be executed using the dbt run command. As we’ve added tags to the staging layer models, we can limit to execute only this layer by dbt run --select staging.

 2└── imdb
 3    ├── _imdb__models.yml
 4    ├── _imdb__sources.yml
 5    ├── stg_imdb__name_basics.sql
 6    ├── stg_imdb__title_akas.sql
 7    ├── stg_imdb__title_basics.sql
 8    ├── stg_imdb__title_crews.sql
 9    ├── stg_imdb__title_episodes.sql
10    ├── stg_imdb__title_principals.sql
11    └── stg_imdb__title_ratings.sql


We can keep intermediate results in this layer so that the models of the final marts layer can be simplified. The source data includes columns where array values are kept as comma separated strings. For example, the genres column of the stg_imdb__title_basics model includes up to 3 genre values as shown below.

A total of seven columns in three models are columns of comma-separated strings and it is better to flatten them in the intermediate layer. Also, in order to avoid repetition, a dbt macro (f_latten_fields_) is created to share the column-flattening logic.

 1# redshift-sls/dbt_redshift_sls/macros/flatten_fields.sql
 2{% macro flatten_fields(model, field_name, id_field_name) %}
 3    with subset as (
 4        select
 5            {{ id_field_name }} as id,
 6            regexp_count({{ field_name }}, ',') + 1 AS num_fields,
 7            {{ field_name }} as fields
 8        from {{ model }}
 9    )
10    select
11        id,
12        1 as idx,
13        split_part(fields, ',', 1) as field
14    from subset
15    union all
16    select
18        idx + 1 as idx,
19        split_part(s.fields, ',', idx + 1)
20    from subset s
21    join cte on =
22    where idx < num_fields
23{% endmacro %}

The macro function can be added inside a recursive cte by specifying the relevant model, field name to flatten and ID field name.

 1-- dbt_redshift_sls/models/intermediate/title/int_genres_flattened_from_title_basics.sql
 2with recursive cte (id, idx, field) as (
 3    {{ flatten_fields(ref('stg_imdb__title_basics'), 'genres', 'title_id') }}
 7    id as title_id,
 8    field as genre
 9from cte
10order by id

The intermediate models are also materialised as views, and we can check the array columns are flattened as expected.

Below shows the file tree of the intermediate models. Similar to the staging models, the intermediate models can be executed by dbt run --select intermediate.

 2├── name
 3│   ├── _int_name__models.yml
 4│   ├── int_known_for_titles_flattened_from_name_basics.sql
 5│   └── int_primary_profession_flattened_from_name_basics.sql
 6└── title
 7    ├── _int_title__models.yml
 8    ├── int_directors_flattened_from_title_crews.sql
 9    ├── int_genres_flattened_from_title_basics.sql
10    └── int_writers_flattened_from_title_crews.sql
13└── flatten_fields.sql


The models in the marts layer are configured to be materialised as tables in a custom schema. Their materialisation is set to table and the custom schema is specified as analytics. Note that the custom schema name becomes imdb_analytics according to the naming convention of dbt custom schemas. Models of both the staging and intermediate layers are used to create final models to be used for reporting and analytics.

 1-- redshift-sls/dbt_redshift_sls/models/marts/analytics/titles.sql
 3    config(
 4        schema='analytics',
 5        materialized='table',
 6        sort='title_id',
 7        dist='title_id'
 8    )
11with titles as (
13    select * from {{ ref('stg_imdb__title_basics') }}
17principals as (
19    select
20        title_id,
21        count(name_id) as num_principals
22    from {{ ref('stg_imdb__title_principals') }}
23    group by title_id
27names as (
29    select
30        title_id,
31        count(name_id) as num_names
32    from {{ ref('int_known_for_titles_flattened_from_name_basics') }}
33    group by title_id
37ratings as (
39    select
40        title_id,
41        average_rating,
42        num_votes
43    from {{ ref('stg_imdb__title_ratings') }}
47episodes as (
49    select
50        parent_title_id,
51        count(title_id) as num_episodes
52    from {{ ref('stg_imdb__title_episodes') }}
53    group by parent_title_id
57distributions as (
59    select
60        title_id,
61        count(title) as num_distributions
62    from {{ ref('stg_imdb__title_akas') }}
63    group by title_id
67final as (
69    select
70        t.title_id,
71        t.title_type,
72        t.primary_title,
73        t.original_title,
74        t.is_adult,
75        t.start_year,
76        t.end_year,
77        t.runtime_minutes,
78        t.genres,
79        p.num_principals,
80        n.num_names,
81        r.average_rating,
82        r.num_votes,
83        e.num_episodes,
84        d.num_distributions
85    from titles as t
86    left join principals as p on t.title_id = p.title_id
87    left join names as n on t.title_id = n.title_id
88    left join ratings as r on t.title_id = r.title_id
89    left join episodes as e on t.title_id = e.parent_title_id
90    left join distributions as d on t.title_id = d.title_id
94select * from final

The details of the three models can be found in a YAML file (_analytics__models.yml). We can add tests to models and below we see tests of row count matching to their corresponding staging models.

 1# redshift-sls/dbt_redshift_sls/models/marts/analytics/_analytics__models.yml
 2version: 2
 5  - name: names
 6    description: Table that contains all names with additional details
 7    tests:
 8      - dbt_utils.equal_rowcount:
 9          compare_model: ref('stg_imdb__name_basics')
10  - name: titles
11    description: Table that contains all titles with additional details
12    tests:
13      - dbt_utils.equal_rowcount:
14          compare_model: ref('stg_imdb__title_basics')
15  - name: genre_titles
16    description: Table that contains basic title details after flattening genres

Below shows the file tree of the marts models. As with the other layers, the marts models can be executed by dbt run –select marts.

2└── analytics
3    ├── _analytics__models.yml
4    ├── genre_titles.sql
5    ├── names.sql
6    └── titles.sql

Using the Redshift query editor v2, we can quickly create charts with the final models. The example below shows a pie chart and we see about 50% of titles are from the top 5 genres.

Generate dbt Documentation

A nice feature of dbt is documentation. It provides information about the project and the data warehouse, and it facilitates consumers as well as other developers to discover and understand the datasets better. We can generate the project documents and start a document server as shown below.

1$ dbt docs generate
2$ dbt docs serve

A very useful element of dbt documentation is data lineage, which provides an overall view about how data is transformed and consumed. Below we can see that the final titles model consumes all title-related stating models and an intermediate model from the name basics staging model.


In this post, we discussed how to build data transformation pipelines using dbt on Redshift Serverless. Subsets of IMDb data are used as source and data models are developed in multiple layers according to the dbt best practices. dbt can be used as an effective tool for data transformation in a wide range of data projects from data warehousing to data lake to data lakehouse, and it supports key AWS analytics services - Redshift, Glue, EMR and Athena. More examples of using dbt will be discussed in subsequent posts.