Data Modelling in the Big Data World

RAJAT BHATHEJA
2 min read · Aug 18, 2023

Differences in Data Modeling: Big Data vs. Traditional RDBMS

Before going deeper into modeling techniques, let us understand how modern big data platforms, data lakes in particular, differ from traditional RDBMS-based data warehouses.

1. Foundational Principles:

  • Traditional RDBMS: Governed by strict ACID properties (Atomicity, Consistency, Isolation, Durability). Primarily manages structured data. Emphasis on transactional consistency and integrity.
  • Big Data: Embraces BASE properties (Basically Available, Soft state, Eventually consistent). Handles vast volumes of diverse data, including structured, semi-structured, and unstructured. Emphasis on scalability, distributed processing, and analytical workloads (a brief illustration follows).
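
To make the contrast concrete, here is a minimal sketch of the RDBMS side using Python's built-in sqlite3 module and a hypothetical accounts table. A BASE-style system would instead accept each write independently and converge to a consistent state over time.

```python
import sqlite3

# A toy illustration of RDBMS-style ACID atomicity using Python's
# built-in sqlite3 module (the accounts table is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")

# The connection used as a context manager wraps a transaction:
# both updates commit together, and any exception inside the block
# would roll both of them back.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")

print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 50), (2, 50)]
```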

2. Modeling Techniques:

  • Traditional RDBMS: Focuses on normalized schemas (denormalized in the data warehouse case, though still row-oriented rather than columnar) to reduce redundancy and maintain relational integrity.
  • Big Data: Leans towards denormalized structures and columnar storage to optimize analytical processing, retain flexibility, and manage diverse data types (see the sketch below).
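
As a small, hypothetical sketch of the big data side, the snippet below builds a denormalized (pre-joined) table in pandas and writes it to Parquet, a columnar format, so analytical queries can scan only the columns they need. The table and column names are illustrative, and pyarrow (or fastparquet) is assumed to be installed.

```python
import pandas as pd

# Denormalized: customer attributes are repeated on every order row,
# trading redundancy for join-free analytical reads.
orders = pd.DataFrame({
    "order_id":      [101, 102, 103],
    "customer_id":   [1, 1, 2],
    "customer_name": ["Ada", "Ada", "Grace"],    # repeated, not looked up
    "customer_tier": ["gold", "gold", "silver"],
    "amount":        [120.0, 80.0, 200.0],
})

# Parquet is columnar: a query touching only a few columns reads
# those columns, not whole rows.
orders.to_parquet("orders_denormalized.parquet", index=False)

# Columnar read: load just the columns the aggregate needs.
df = pd.read_parquet("orders_denormalized.parquet",
                     columns=["customer_tier", "amount"])
print(df.groupby("customer_tier")["amount"].sum())
```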

3. System Characteristics:

  • Traditional RDBMS: Limited, mostly vertical scalability; primarily transactional databases.
  • Big Data: Distributed data frameworks designed for horizontal scaling and high-speed data processing.

Contrasting Data Modeling Techniques: Traditional Data Warehouse vs. Modern Data Lake/Lakehouse

1. Dimensional Modeling (Star & Snowflake Schema):

  • Traditional DW: Commonly used because it is optimized for query performance. The star schema provides denormalized views, facilitating faster queries; the snowflake schema is used less frequently because its normalized nature can slow down queries, though it ensures data integrity.
  • Data Lake/Lakehouse: While still applicable, these schemas might be part of a broader ecosystem, often coexisting with raw, untransformed data. Used in the structured zone or curated layers of the data lake/lakehouse (a minimal sketch follows).
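
A minimal, hypothetical star-schema sketch in pandas: one fact table of numeric measures keyed to small dimension tables, joined at query time. All table and column names are illustrative.

```python
import pandas as pd

# Dimension tables: descriptive attributes, one row per member.
dim_date = pd.DataFrame({
    "date_key": [20230101, 20230102],
    "month":    ["Jan", "Jan"],
})
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "category":    ["books", "games"],
})

# Fact table: measures plus foreign keys into the dimensions.
fact_sales = pd.DataFrame({
    "date_key":    [20230101, 20230101, 20230102],
    "product_key": [1, 2, 1],
    "revenue":     [30.0, 60.0, 45.0],
})

# Star-schema query: join facts to dimensions, then aggregate.
report = (fact_sales
          .merge(dim_date, on="date_key")
          .merge(dim_product, on="product_key")
          .groupby(["month", "category"])["revenue"].sum())
print(report)
```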

2. Data Mesh:

  • Traditional DW: Not a common approach; centralized models were the norm with consolidated data marts for specific business units.
  • Data Lake/Lakehouse: More relevant as data ownership becomes decentralized. Domains or teams own their data products, promoting flexibility and scalability (a small sketch of a data-product contract follows).
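
Data mesh is chiefly an organizational pattern, but the idea of a domain-owned "data product" can be sketched as a simple contract. The descriptor below is entirely hypothetical, just to make the ownership boundary concrete.

```python
from dataclasses import dataclass

# Hypothetical data-product contract: each domain team publishes
# and owns its dataset at a well-known location with a declared schema.
@dataclass(frozen=True)
class DataProduct:
    domain: str       # owning team / bounded context
    name: str
    output_path: str  # where consumers read it from
    schema: dict      # column name -> type: the published contract

orders_product = DataProduct(
    domain="sales",
    name="orders",
    output_path="s3://lake/sales/orders/",
    schema={"order_id": "long", "customer_id": "long", "amount": "double"},
)
print(f"{orders_product.domain} owns {orders_product.name}")
```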

3. Data Vault:

  • Traditional DW: Employed to capture and store data from different sources in a consistent and auditable manner. Provides flexibility in handling changing data sources.
  • Data Lake/Lakehouse: Can be used, especially when integrating with existing data warehousing solutions or when historical tracking across multiple sources is crucial. Serves as a structured or curated layer within the data lake/lakehouse (see the hub-and-satellite sketch below).
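
A minimal, hypothetical Data Vault sketch: a hub holds the business key, a satellite holds its change-tracked attributes, and hash keys tie them together (links, which relate hubs to each other, are omitted for brevity). Structures and names are illustrative.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(business_key: str) -> str:
    """Deterministic surrogate key, a common Data Vault convention."""
    return hashlib.md5(business_key.encode()).hexdigest()

now = datetime.now(timezone.utc)

# Hub: exactly one row per business key, ever.
hub_customer = [
    {"customer_hk": hash_key("CUST-001"), "customer_id": "CUST-001",
     "load_ts": now, "record_source": "crm"},
]

# Satellite: descriptive attributes, one row per detected change,
# giving an auditable history across sources.
sat_customer = [
    {"customer_hk": hash_key("CUST-001"), "name": "Ada", "tier": "gold",
     "load_ts": now, "record_source": "crm"},
]
```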

4. Normalized Modeling:

  • Traditional DW: Used primarily in OLTP systems and sometimes in staging areas of the data warehouse before transformation into dimensional models. Minimizes data redundancy and preserves integrity.
  • Data Lake/Lakehouse: While raw data might be stored in its native format, normalized models might exist in curated or processed zones. Useful for ensuring data consistency before analytics or reporting (a brief sketch follows).
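
As a hypothetical sketch of normalization in a curated zone, the snippet below splits the repeated customer attributes out of a raw order feed into their own table, removing the redundancy before downstream reporting. The data and names are illustrative.

```python
import pandas as pd

# Raw zone: ingested as-is, customer attributes repeated per order.
raw_orders = pd.DataFrame({
    "order_id":      [101, 102, 103],
    "customer_id":   [1, 1, 2],
    "customer_name": ["Ada", "Ada", "Grace"],
    "amount":        [120.0, 80.0, 200.0],
})

# Curated zone: normalize into a customers table (one row each)
# and an orders table that keeps only the foreign key.
customers = (raw_orders[["customer_id", "customer_name"]]
             .drop_duplicates()
             .reset_index(drop=True))
orders = raw_orders[["order_id", "customer_id", "amount"]]

print(customers)  # two unique customers instead of three repeated rows
```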

In essence, while traditional data warehousing models were more rigid and structured, data lakes and lakehouses offer flexibility, accommodating various modeling techniques based on specific analytical needs and the nature of the data.
