
The world of data is converging. The traditional divide between batch processing for historical analytics and stream processing for real-time insights is becoming increasingly blurry. Businesses demand architectures that handle both seamlessly. Enter the “Streamhouse” - an evolution of the Lakehouse concept, designed with streaming as a first-class citizen.

Today, we’ll introduce three key open-source technologies shaping this space: Apache Paimon™, Fluss, and Apache Iceberg. While each has unique strengths, their true power lies in how they can be integrated to build robust, flexible, and performant data platforms.
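To make the idea of streaming as a first-class citizen slightly more concrete, here is a minimal, illustrative sketch (not taken from the posts themselves) that uses Flink's Table API to register a Paimon catalog and declare a primary-key table that can be streamed into and batch-queried later. The catalog name, warehouse path, and schema are hypothetical, and it assumes the Paimon Flink connector jar is on the classpath.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PaimonCatalogSketch {
    public static void main(String[] args) {
        // Streaming-mode table environment; Paimon tables support both streaming and batch access.
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Register a Paimon catalog backed by a warehouse path (hypothetical local location).
        tEnv.executeSql(
            "CREATE CATALOG paimon_catalog WITH ("
                + " 'type' = 'paimon',"
                + " 'warehouse' = 'file:///tmp/paimon'"
                + ")");
        tEnv.executeSql("USE CATALOG paimon_catalog");

        // A primary-key table: streaming jobs can upsert into it, batch queries read a consistent snapshot.
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS orders ("
                + " order_id BIGINT,"
                + " customer_id BIGINT,"
                + " amount DECIMAL(10, 2),"
                + " updated_at TIMESTAMP(3),"
                + " PRIMARY KEY (order_id) NOT ENFORCED"
                + ")");
    }
}
```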


Change data capture (CDC) with Amazon MSK and data ingestion with Apache Hudi on Amazon EMR can be combined to build an efficient data lake solution. In this post, we'll build a Hudi DeltaStreamer application on Amazon EMR and use the resulting Hudi table with Amazon Athena and Amazon QuickSight to build a dashboard.
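The post drives ingestion through Hudi's DeltaStreamer utility, which is configuration-driven and launched with spark-submit rather than hand-written code. Purely to give a flavor of the underlying Hudi write path, here is a rough Java sketch of an equivalent upsert through the Spark datasource API; the bucket paths, table name, and field names are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HudiUpsertSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hudi-upsert-sketch")
                // Hudi recommends Kryo serialization for Spark jobs.
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();

        // Hypothetical CDC events already landed as Parquet on S3.
        Dataset<Row> cdcEvents = spark.read().parquet("s3://my-bucket/raw/orders/");

        // Upsert into a Hudi table keyed by order_id, deduplicated on updated_at.
        cdcEvents.write()
                .format("hudi")
                .option("hoodie.table.name", "orders")
                .option("hoodie.datasource.write.recordkey.field", "order_id")
                .option("hoodie.datasource.write.precombine.field", "updated_at")
                .option("hoodie.datasource.write.operation", "upsert")
                .mode(SaveMode.Append)
                .save("s3://my-bucket/hudi/orders/");

        spark.stop();
    }
}
```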


Change data capture (CDC) with Amazon MSK and data ingestion with Apache Hudi on Amazon EMR can be combined to build an efficient data lake solution. In this post, we'll build the CDC pipeline with Amazon MSK and MSK Connect.
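A quick way to sanity-check that change events are actually flowing is to point a plain Kafka consumer at one of the CDC topics. The sketch below is illustrative only: the bootstrap brokers, consumer group, and topic name are hypothetical, and it assumes a Debezium-style source connector is producing JSON change envelopes.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CdcTopicPeek {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical MSK bootstrap brokers; replace with the cluster's client endpoints.
        props.put("bootstrap.servers", "b-1.example.kafka.us-east-1.amazonaws.com:9092");
        props.put("group.id", "cdc-peek");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Debezium topics are typically named <server>.<database>.<table>; this one is hypothetical.
            consumer.subscribe(Collections.singletonList("msk.demo.orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each value is a JSON envelope with "before", "after", and "op" (c/u/d) fields.
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```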


Change data capture (CDC) with Amazon MSK and data ingestion with Apache Hudi on Amazon EMR can be combined to build an efficient data lake solution. As a starting point, we'll walk through the source database and CDC streaming infrastructure in a local environment.