Providing direct access to big data engines like Spark and Flink often creates chaos. A gateway-centric architecture solves this by introducing a robust control plane. This article presents a detailed blueprint using Apache Kyuubi, a multi-tenant SQL gateway, to provision and manage on-demand Spark, Flink, and Trino engines. Learn how this model delivers true self-service analytics with centralized governance, finally resolving the conflict between user empowerment and platform stability.
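
To give a flavour of what the self-service experience looks like from a user's seat, here is a minimal sketch of a client session against a Kyuubi gateway over its HiveServer2-compatible Thrift endpoint, using PyHive. The host, username, port, and the engine-type session configuration are illustrative assumptions rather than a prescribed setup.

```python
# A minimal sketch, assuming a Kyuubi gateway reachable at kyuubi.internal:10009
# and PyHive as the client library; hostnames and config values are placeholders.
from pyhive import hive

conn = hive.connect(
    host="kyuubi.internal",   # hypothetical gateway address
    port=10009,               # Kyuubi's default Thrift frontend port
    username="analyst",
    configuration={
        # Ask the gateway to provision a Flink SQL engine for this session;
        # a Spark or Trino engine could be requested the same way.
        "kyuubi.engine.type": "FLINK_SQL",
    },
)

cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
```

Because every session goes through the gateway, quotas, authentication, and engine lifecycle policies can be enforced in one place instead of per cluster.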

The world of data is converging. The traditional divide between batch processing for historical analytics and stream processing for real-time insights is becoming increasingly blurry. Businesses demand architectures that handle both seamlessly. Enter the “Streamhouse” - an evolution of the Lakehouse concept, designed with streaming as a first-class citizen.

Today, we’ll introduce three key open-source technologies shaping this space: Apache Paimon™, Fluss, and Apache Iceberg. While each has unique strengths, their true power lies in how they can be integrated to build robust, flexible, and performant data platforms.

In the previous post, we started discussing a continuous integration/continuous delivery (CI/CD) process for a dbt project by introducing two GitHub Actions workflows - slim-ci and deploy. The former is triggered when a pull request is created against the main branch, and it builds only the modified models and their first-order children in a ci dataset, followed by performing tests on them. The second workflow is triggered once a pull request is merged. Beginning with running unit tests, it packages the dbt project as a Docker container and publishes it to Artifact Registry. In this post, we focus on how to deploy a dbt project in multiple environments while walking through the entire CI/CD process step-by-step.
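
The posts implement this selection logic inside a GitHub Actions workflow; purely as an illustration of the slim-CI idea, the sketch below drives it through dbt's programmatic entry point (available since dbt-core 1.5). The selector, target name, and artifact path are assumptions, not the workflow's actual values.

```python
# A rough sketch of the slim-CI build step using dbt's programmatic runner;
# the selector, state path and target below are illustrative placeholders.
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()

# Build only modified models and their first-order children, comparing the
# current project against the manifest produced by the last production run.
res: dbtRunnerResult = dbt.invoke([
    "build",
    "--select", "state:modified+1",
    "--state", "prod-artifacts",   # hypothetical path to the production manifest
    "--target", "ci",              # hypothetical target writing to the ci dataset
])

if not res.success:
    raise SystemExit("slim CI build failed")
```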

Continuous integration (CI) is the process of ensuring that new code integrates with the larger code base, and it places great emphasis on test automation to check that the application is not broken whenever new commits are integrated into the main branch. Continuous delivery (CD) is an extension of continuous integration, as it automatically deploys all code changes to a testing and/or production environment after the build stage. CI/CD helps development teams avoid bugs and code failures while maintaining a continuous cycle of software development and updates. In this post, we discuss how to set up a CI/CD pipeline for a data build tool (dbt) project using GitHub Actions, where BigQuery is used as the target data warehouse.

We developed batch and streaming pipelines in Part 2 and Part 4. It is often faster and simpler to identify and fix bugs in the pipeline code by performing local unit testing. Moreover, especially when it comes to creating a streaming pipeline, unit test cases can facilitate development further by using TestStream, as it allows us to advance watermarks or processing time according to different scenarios. In this post, we discuss how to perform unit testing of the batch and streaming pipelines that we developed earlier.
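
As a taste of what such a test case looks like, below is a minimal sketch of a streaming unit test built around TestStream. The windowed count pipeline, element values, and timestamps are made up for illustration; they are not the pipelines developed in the earlier parts.

```python
# A minimal sketch of a streaming unit test with TestStream; the windowed
# count pipeline and timestamps are illustrative, not the series' pipelines.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.test_stream import TestStream
from apache_beam.testing.util import assert_that, equal_to
from apache_beam.transforms.window import FixedWindows, TimestampedValue


def test_windowed_order_count():
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    # Two orders land inside the first 60-second window, then the watermark
    # advances past the window boundary so its result is emitted.
    stream = (
        TestStream()
        .add_elements([TimestampedValue("order", 0), TimestampedValue("order", 30)])
        .advance_watermark_to(60)
        .advance_watermark_to_infinity()
    )

    with TestPipeline(options=options) as p:
        counts = (
            p
            | stream
            | beam.WindowInto(FixedWindows(60))
            | beam.CombineGlobally(beam.combiners.CountCombineFn()).without_defaults()
        )
        assert_that(counts, equal_to([2]))
```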

In Part 5, we developed a dbt project that targets Apache Iceberg, where transformations are performed on Amazon Athena. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. To improve query performance, the fact table is denormalized to pre-join records from the dimension tables using the array and struct data types. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.
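
To give a flavour of the orchestration, here is a minimal sketch of an Airflow DAG that runs and then tests the dbt project with BashOperator. The DAG id, project path, and schedule are placeholders; the post's actual DAG may be structured differently.

```python
# A minimal sketch of orchestrating the dbt project with Airflow; paths,
# schedule and task names are placeholders, not the post's configuration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_PROJECT_DIR = "/opt/airflow/pizza_shop"  # hypothetical project location

with DAG(
    dag_id="dbt_pizza_shop_athena",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_PROJECT_DIR} --profiles-dir {DBT_PROJECT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_PROJECT_DIR} --profiles-dir {DBT_PROJECT_DIR}",
    )

    dbt_run >> dbt_test  # run the models first, then test them
```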

In Part 1 and Part 3, we developed data build tool (dbt) projects that target PostgreSQL and BigQuery using fictional pizza shop data. The data is modelled by SCD type 2 dimension tables and one transactional fact table. While the order records need to be joined with the dimension tables to get complete details in PostgreSQL, the fact table is denormalized using nested and repeated fields to improve query performance in BigQuery. Open table formats such as Apache Iceberg bring a new opportunity to implement data warehousing features in a data lake (i.e. a data lakehouse), and Amazon Athena is probably the easiest way to perform such tasks on AWS. In this post, we create a new dbt project that targets Apache Iceberg, where transformations are performed on Amazon Athena. Data modelling is similar to the BigQuery project: the dimension tables are modelled with the SCD type 2 approach, and the fact table is denormalized using the array and struct data types.
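
As a sketch of what the array/struct denormalisation looks like in Athena's Trino-based SQL, the snippet below packs order lines into an array of row values. The table and column names are invented, and awswrangler is used here only to submit the query for illustration; in the post itself the model is built with dbt.

```python
# A sketch of array/struct denormalisation in Athena SQL; table and column
# names are hypothetical, and awswrangler just submits the query.
import awswrangler as wr

SQL = """
SELECT o.order_id,
       o.ordered_at,
       ARRAY_AGG(
         CAST(ROW(p.name, p.price, i.quantity)
              AS ROW(name VARCHAR, price DOUBLE, quantity INTEGER))
       ) AS order_lines  -- one struct per order line, packed into an array
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
JOIN dim_products p ON p.product_key = i.product_key
GROUP BY o.order_id, o.ordered_at
"""

# Run against a hypothetical Glue database holding the Iceberg tables.
df = wr.athena.read_sql_query(SQL, database="pizza_shop", ctas_approach=False)
print(df.head())
```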

In Part 3, we developed a dbt project that targets Google BigQuery with fictional pizza shop data. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. The fact table is denormalized using nested and repeated fields to improve query performance. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.

In this series, we discuss practical examples of data warehouse and lakehouse development where data transformation is performed by the data build tool (dbt) and ETL is managed by Apache Airflow. In Part 1, we developed a dbt project on PostgreSQL using fictional pizza shop data. At the end, the data sets are modelled by two SCD type 2 dimension tables and one transactional fact table. In this post, we create a new dbt project that targets Google BigQuery. While the dimension tables are modelled with the same SCD type 2 approach, the fact table is denormalized using nested and repeated fields, which can potentially improve query performance by pre-joining the corresponding dimension records.
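
To illustrate the nested-and-repeated idea in BigQuery standard SQL, the sketch below pre-joins order lines into an ARRAY of STRUCTs. The dataset, table, and column names are invented; the post builds the equivalent model with dbt, while the snippet simply runs a comparable query through the BigQuery Python client.

```python
# A sketch of pre-joining dimension records into nested/repeated fields;
# dataset, table and column names are hypothetical.
from google.cloud import bigquery

SQL = """
SELECT o.order_id,
       o.ordered_at,
       ARRAY_AGG(STRUCT(p.name, p.price, i.quantity)) AS order_lines
FROM `pizza_shop.orders` AS o
JOIN `pizza_shop.order_items` AS i ON i.order_id = o.order_id
JOIN `pizza_shop.dim_products` AS p ON p.product_key = i.product_key
GROUP BY o.order_id, o.ordered_at
"""

client = bigquery.Client()
for row in client.query(SQL).result():
    # Each row carries its order lines as a repeated STRUCT field.
    print(row.order_id, row.order_lines[:1])
```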

In this series of posts, we discuss data warehouse/lakehouse examples using the data build tool (dbt), including ETL orchestration with Apache Airflow. In Part 1, we developed a dbt project on PostgreSQL with fictional pizza shop data. Two dimension tables that keep product and user records are created as Type 2 slowly changing dimension (SCD Type 2) tables, and one transactional fact table is built to keep pizza orders. In this post, we discuss how to set up an ETL process on the project using Apache Airflow.