Semantic Layer Architectures Explained: Warehouse-Native vs dbt vs Cube (2025)

Typedef Team

Semantic Layer Architectures (2020–2025): Snowflake/Databricks Warehouse-Native, dbt MetricFlow, and Cube.dev OLAP—A Decision Guide for BI & LLMs

Introduction: The Semantic Layer Problem Space

Every data team eventually hits the same wall: metrics defined differently across tools, dashboards showing conflicting numbers, and business users losing trust in data. The CFO asks "What was Q3 revenue?" and gets three different answers from three different dashboards.

This isn't a new problem. For decades, the industry tried to solve it with OLAP cubes, BI tool semantic layers, and metric catalogs. But the modern cloud data warehouse era demanded a fresh approach. Between 2020 and 2025, as organizations migrated to Snowflake, Databricks, and BigQuery, a new generation of semantic layer solutions emerged—each taking radically different architectural approaches to the same fundamental challenge.

The Core Challenge: Where Should Semantic Logic Live?

The semantic layer problem breaks down into several interconnected questions:

The Modeling Question: How do you define metrics once in a way that's reusable across all tools?

The Execution Question: When a user queries "revenue by region," who computes it—the BI tool, a middleware layer, or the database itself?

The Performance Question: Do you pre-aggregate everything (fast but inflexible), query raw data every time (flexible but slow), or find a middle ground?

The Governance Question: How do you enforce who can see what metrics, and ensure definitions don't drift as teams change them?

Between 2020 and 2025, the industry converged on three distinct architectural patterns, each making different trade-offs on these questions:

  1. Warehouse-Native: Store semantic metadata as database objects (Snowflake, Databricks)
  2. Transformation-Layer: Semantic models as code in the data transformation layer (dbt MetricFlow)
  3. OLAP-Acceleration: Intelligent caching layer with pre-aggregations (Cube.dev)

This article dissects how each approach works internally, the engineering decisions that shaped them, and the trade-offs teams face when choosing between them.


Approach 1: Warehouse-Native Semantic Layers

The Design Philosophy: Metadata Living With Data

In 2024-2025, both Snowflake and Databricks made a controversial bet: the semantic layer should live inside the data warehouse, not as external tooling. This wasn't obvious. For years, the industry assumed semantic layers were middleware—Looker, Tableau, and SSAS all sat between users and databases, translating business logic into SQL.

But Snowflake and Databricks asked a provocative question: What if semantic definitions were database objects, just like tables and views?

The reasoning:

  • Governance is simpler when there's one security model (the warehouse's RBAC)
  • Performance is better when there's no network hop to a separate service
  • Operations are simpler with zero additional infrastructure to manage
  • Lineage is native since the warehouse already tracks object dependencies

This led to fundamentally different architectures than middleware-based semantic layers.


Snowflake Semantic Views: Architecture Deep Dive

Launch: Public Preview Q1 2025, GA August 2025

When Snowflake announced Semantic Views at Summit 2025, they introduced a new database object type: SEMANTIC VIEW. This sits alongside tables, standard views, and materialized views in Snowflake's metadata catalog.

How It Works: The Logical-to-Physical Mapping

A Snowflake Semantic View defines a logical data model that abstracts physical tables:

sql
CREATE SEMANTIC VIEW sales_analytics AS
  LOGICAL TABLES
    orders (
      FACTS order_amount, tax_amount
      DIMENSIONS customer_id, product_id, order_date
      PRIMARY KEY order_id
    ),
    customers (
      DIMENSIONS customer_id, region, segment
      PRIMARY KEY customer_id
    ),
    products (
      DIMENSIONS product_id, category, brand
      PRIMARY KEY product_id
    )
  RELATIONSHIPS
    orders.customer_id = customers.customer_id,
    orders.product_id = products.product_id
  METRICS
    total_revenue AS SUM(order_amount),
    avg_order_value AS SUM(order_amount) / COUNT(DISTINCT order_id)

What's happening under the hood:

  1. Metadata storage: The semantic view definition is stored in Snowflake's Information Schema, just like table schemas. SHOW SEMANTIC METRICS and INFORMATION_SCHEMA.SEMANTIC_METRICS expose the definitions.

  2. Query rewriting: When you query a semantic view:

    sql
    SELECT * FROM SEMANTIC_VIEW(
      sales_analytics
      DIMENSIONS customers.region, products.category
      METRICS total_revenue
    )

    Snowflake's query planner rewrites this into standard SQL:

    sql
    SELECT
      c.region,
      p.category,
      SUM(o.order_amount) AS total_revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN products p ON o.product_id = p.product_id
    GROUP BY c.region, p.category
  3. Optimization: The rewritten query goes through Snowflake's standard optimizer:

    • Micro-partition pruning (skip files that don't contain relevant data)
    • Join order optimization
    • Predicate pushdown
    • Result caching (identical queries return instantly if data unchanged)

The key architectural insight: Semantic Views add zero execution overhead. The generated SQL is identical to what a skilled analyst would write by hand.

Aggregations and Drill-Down: Flexible Grouping at Query Time

Snowflake's approach to aggregations is pure query-time flexibility. The semantic view defines measures (what to aggregate) and dimensions (what to group by), but doesn't lock them together.

Example: The total_revenue metric is defined as SUM(order_amount), but you can query it by:

  • Region: DIMENSIONS customers.region
  • Product category: DIMENSIONS products.category
  • Month: DIMENSIONS DATE_TRUNC('month', order_date)
  • All three: DIMENSIONS customers.region, products.category, DATE_TRUNC('month', order_date)

The engine generates the appropriate GROUP BY clause each time. This is drill-down by design—you're not pre-computing aggregates at every level, you're dynamically aggregating to the requested grain.
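
For example, combining all three dimensions from the list above is a single query against the same view, and the planner emits the matching GROUP BY. A conceptual sketch:

sql
-- Same metric at a finer grain: the dimension list drives the grouping
SELECT * FROM SEMANTIC_VIEW(
  sales_analytics
  DIMENSIONS customers.region, products.category, DATE_TRUNC('month', orders.order_date)
  METRICS total_revenue
)

-- Conceptual SQL produced by the planner
SELECT
  c.region,
  p.category,
  DATE_TRUNC('month', o.order_date) AS order_month,
  SUM(o.order_amount) AS total_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
GROUP BY 1, 2, 3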

Why this works efficiently: Snowflake's columnar storage and query optimizer handle on-demand aggregation well. For a 500M row fact table:

  • Simple metric by single dimension: 2-3 seconds (Medium warehouse)
  • Complex metric across 3-hop join: 8-12 seconds (Large warehouse)
  • Warm cache: <1 second

The trade-off: You can't query specific dimensions faster by pre-aggregating them (unless you create materialized views separately). But the flexibility often outweighs this for exploratory analytics.

Complex Formulas and Calculated Measures

Snowflake handles complex calculations by storing formula definitions in metadata and computing them at query time with correct aggregation semantics.

Ratio metrics are a classic pitfall. Computing revenue / orders naively:

sql
-- WRONG: This computes ratio per row, then sums ratios
SELECT SUM(revenue / orders) ...  -- Mathematically incorrect!

Snowflake's semantic layer prevents this. When you define:

sql
METRICS
  avg_order_value AS SUM(order_amount) / COUNT(DISTINCT order_id)

The engine computes the numerator and denominator separately, then divides:

sql
-- CORRECT: Generated by Snowflake
SELECT
  SUM(order_amount) / COUNT(DISTINCT order_id) AS avg_order_value
FROM ...
GROUP BY ...

This extends to arbitrary complexity:

Churn rate example:

sql
METRICS
  churn_rate AS
    COUNT(DISTINCT CASE WHEN last_order_date < CURRENT_DATE - 90 THEN customer_id END)
    / COUNT(DISTINCT customer_id)

Derived metrics (metrics built from other metrics):

sql
METRICS
  gross_revenue AS SUM(order_amount),
  discounts AS SUM(discount_amount),
  net_revenue AS gross_revenue - discounts

The semantic view expands net_revenue into:

sql
SUM(order_amount) - SUM(discount_amount) AS net_revenue

This composability is critical for maintainability—change gross_revenue once, and net_revenue updates automatically.

Multidimensional Modeling: Star Schemas Made Explicit

Snowflake's logical tables and relationships formalize star schema modeling in metadata.

Traditional approach (before semantic views):

  • Data engineers create fact and dimension tables
  • BI developers remember (or guess) how to join them
  • Each dashboard rebuilds the same join logic
  • Mistakes happen (Cartesian products, fan-out issues)

Semantic Views approach:

  • Define relationships once: orders.customer_id = customers.customer_id
  • The warehouse enforces cardinality (one-to-many, many-to-one)
  • Every query through the semantic view uses correct joins

Multi-hop joins are supported transparently. Example:

Orders → Stores → Regions

Query "total_revenue by region_name":

  1. Semantic view knows: Orders links to Stores via store_id, Stores links to Regions via region_id
  2. Generates: FROM orders o JOIN stores s ON o.store_id = s.store_id JOIN regions r ON s.region_id = r.region_id
  3. User never writes JOIN syntax

This prevents fan-out bugs. If you naively did FROM orders JOIN regions without the intermediate store join, you'd get incorrect results (multiple rows per order if regions aren't properly scoped). The semantic view's relationship definitions prevent this.
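
A sketch of that flow, assuming the store_id and region_id keys described above are declared as relationships in the semantic view:

sql
-- Semantic query: the user names only the dimension and the metric
SELECT * FROM SEMANTIC_VIEW(
  sales_analytics
  DIMENSIONS regions.region_name
  METRICS total_revenue
)

-- Conceptual generated SQL: the relationship graph supplies the intermediate hop
SELECT
  r.region_name,
  SUM(o.order_amount) AS total_revenue
FROM orders o
JOIN stores s ON o.store_id = s.store_id
JOIN regions r ON s.region_id = r.region_id
GROUP BY r.region_name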

Security and Governance: Native Integration

The killer feature of warehouse-native semantic layers: you get the warehouse's security model for free.

Access control:

sql
GRANT SELECT ON SEMANTIC VIEW sales_analytics TO ROLE analyst_role;

Users with analyst_role can query metrics, but don't need access to underlying tables:

  • Raw table: raw.orders (contains PII, restricted)
  • Semantic view: sales_analytics (exposes only aggregated metrics)
  • Analysts: SELECT on semantic view, no access to raw.orders

This is governed self-service—explore metrics without seeing sensitive data.
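
In practice, a session holding only analyst_role sees something like this (a sketch based on the grants above):

sql
-- Direct access to the base table fails: no privilege granted on raw.orders
SELECT * FROM raw.orders;

-- The same user can still retrieve aggregated metrics through the semantic view
SELECT * FROM SEMANTIC_VIEW(
  sales_analytics
  DIMENSIONS customers.region
  METRICS total_revenue
);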

Row-level security flows through automatically. If raw.orders has a row access policy:

sql
CREATE ROW ACCESS POLICY orders_region_filter
AS (region STRING) RETURNS BOOLEAN ->
  region = CURRENT_USER_REGION()

Then queries via the semantic view inherit that filter. A user in the EMEA region only sees EMEA orders in any metric, with zero additional configuration.
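
For the filter to take effect, the policy also has to be attached to the base table. A minimal sketch, assuming raw.orders carries a region column:

sql
-- Attach the policy so it applies on every query path, including the semantic view
ALTER TABLE raw.orders
  ADD ROW ACCESS POLICY orders_region_filter ON (region);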

Audit logging: All semantic view queries appear in ACCOUNT_USAGE.QUERY_HISTORY:

sql
SELECT user_name, query_text, execution_time
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_text ILIKE '%SEMANTIC_VIEW(sales_analytics%'

Track who queries which metrics, how often, and which dimensions they analyze.

Performance Optimization: Leveraging Warehouse Features

Snowflake Semantic Views don't have separate caching—they leverage Snowflake's existing performance infrastructure:

Result cache: Query total_revenue by region twice → second query is instant (if data unchanged).

Warehouse cache: Frequently accessed micro-partitions stay in local SSD cache.

Clustering: If orders is clustered by order_date, queries filtered by date prune partitions efficiently.
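
Defining the clustering key is a one-line change on the base table (a sketch, assuming raw.orders is the fact table behind the semantic view):

sql
ALTER TABLE raw.orders CLUSTER BY (order_date);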

For extreme performance needs, combine semantic views with materialized views:

sql
-- Pre-aggregate expensive metric at common grain
CREATE MATERIALIZED VIEW daily_revenue_by_region AS
SELECT
  DATE_TRUNC('day', order_date) AS day,
  customer_region,
  SUM(order_amount) AS revenue
FROM raw.orders
JOIN raw.customers USING (customer_id)
GROUP BY 1, 2;

-- Point semantic view's logical table at the materialized view
CREATE SEMANTIC VIEW sales_analytics AS
  LOGICAL TABLES
    daily_summary (
      BASE_TABLE daily_revenue_by_region
      METRICS revenue
      DIMENSIONS day, customer_region
    )
  ...

Now queries hit the pre-aggregated table. But this sacrifices flexibility (you're locked into day + region grain).

Snowflake's design philosophy: Query raw data by default (leverage warehouse power), pre-aggregate selectively when needed.

Trade-Offs of the Warehouse-Native Approach

✅ Advantages:

  • Zero infrastructure overhead: No separate service to deploy or maintain
  • Native performance: Fully optimized by Snowflake's engine
  • Unified governance: One security model for data and metrics
  • Operational simplicity: Metrics managed like any other database object

❌ Limitations:

  • Snowflake-only: Can't use metrics on data in BigQuery, Databricks, etc.
  • Limited BI tool support (initially): Most tools can't natively "discover" semantic views yet (requires custom SQL or waiting for OSI integrations)
  • No native version control: YAML export/import exists, but no git integration like dbt
  • Proprietary syntax: SEMANTIC_VIEW() is Snowflake-specific SQL

Slowly Changing Dimensions (SCD): Not automatically handled. If you have Type-2 dimensions (multiple records per entity over time), you must model them correctly in your joins:

sql
-- Manual SCD handling in semantic view join
RELATIONSHIPS
  orders.customer_id = customers.customer_id
  AND orders.order_date BETWEEN customers.effective_start_date
                            AND customers.effective_end_date

Snowflake doesn't generate this logic—you define it.

Sparse data management: Non-issue for Snowflake Semantic Views because they query on-demand. Only data that exists is processed. For time-series gaps (months with zero sales), use a date dimension table:

sql
-- Time spine to fill gaps
LOGICAL TABLES
  date_dim (
    DIMENSIONS date, year, month, quarter
  )
RELATIONSHIPS
  orders.order_date = date_dim.date (LEFT JOIN)

The LEFT JOIN ensures all dates appear, with NULL/zero for missing facts.


Databricks Unity Catalog Metric Views: Architecture Deep Dive

Launch: Public Preview June 2025, GA late 2025

Databricks entered the semantic layer space later than Snowflake, but with a unique advantage: Unity Catalog, their governance layer, was already battle-tested. Metric Views extend Unity Catalog to cover metrics, not just tables.

The architectural philosophy: Unify data science, ML, and BI under one semantic layer. Unlike BI-focused semantic layers, Databricks designed Metric Views to work equally well in:

  • SQL dashboards
  • Python/R notebooks
  • ML feature engineering
  • LLM-powered AI agents

How It Works: The MEASURE() Function and Spark Query Rewriting

Databricks Metric Views are defined in YAML or SQL DDL:

YAML definition:

yaml
metric_view:
  name: sales_metrics
  source_table: main.sales.orders
  joins:
    - table: main.sales.customers
      on: orders.customer_id = customers.customer_id
    - table: main.sales.products
      on: orders.product_id = products.product_id

  measures:
    - name: total_revenue
      expr: SUM(order_amount)
    - name: avg_order_value
      expr: SUM(order_amount) / COUNT(DISTINCT order_id)

  dimensions:
    - name: customer_region
      expr: customers.region
    - name: product_category
      expr: products.category

Register in Unity Catalog:

sql
CREATE METRIC VIEW main.sales.sales_metrics
FROM 'dbfs:/metrics/sales_metrics.yaml';

Querying uses a special MEASURE() function:

sql
SELECT
  customer_region,
  MEASURE(total_revenue),
  MEASURE(avg_order_value)
FROM main.sales.sales_metrics
WHERE product_category = 'Electronics'
GROUP BY customer_region;

Query compilation (Spark SQL + Catalyst optimizer):

  1. Parse the MEASURE() syntax
  2. Expand measures into their definitions:
    • MEASURE(total_revenue) → SUM(order_amount)
    • MEASURE(avg_order_value) → SUM(order_amount) / COUNT(DISTINCT order_id)
  3. Inject joins from metric view spec:
    • orders → customers on customer_id
    • orders → products on product_id
  4. Optimize via Catalyst:
    • Predicate pushdown: WHERE product_category = 'Electronics' → pushed to products table scan
    • Join reordering: Small dimension tables broadcast to executors
    • Partition pruning: Skip irrelevant Delta Lake files
  5. Execute with Photon (vectorized engine for analytical queries)

Generated physical plan (conceptual):

sql
SELECT
  c.region AS customer_region,
  SUM(o.order_amount) AS total_revenue,
  SUM(o.order_amount) / COUNT(DISTINCT o.order_id) AS avg_order_value
FROM main.sales.orders o
JOIN main.sales.customers c ON o.customer_id = c.customer_id
JOIN main.sales.products p ON o.product_id = p.product_id
WHERE p.category = 'Electronics'
GROUP BY c.region;

Like Snowflake, no execution overhead—the metric view generates efficient native SQL.

Aggregations and Drill-Down: Dynamic Grouping with Star Schema Support

Databricks' approach mirrors Snowflake's: define measures and dimensions separately, combine at query time.

Key innovation: Multi-level joins for dimensional hierarchies.

Example star schema:

Orders (fact) → Stores → Regions → Countries

Metric view YAML:

yaml
joins:
  - table: dim_stores
    on: orders.store_id = dim_stores.store_id
  - table: dim_regions
    on: dim_stores.region_id = dim_regions.region_id
  - table: dim_countries
    on: dim_regions.country_id = dim_countries.country_id

Query "total_revenue by country":

sql
SELECT
  country_name,
  MEASURE(total_revenue)
FROM sales_metrics
GROUP BY country_name;

Databricks traces the join path:

orders → stores → regions → countries

Generates three-hop join automatically. This prevents fan-out (the common mistake of joining facts directly to distant dimensions, causing duplicate rows).
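
Conceptually, the country-level query above expands to a chained join; a sketch based on the join spec:

sql
-- Conceptual SQL generated for "total_revenue by country"
SELECT
  co.country_name,
  SUM(o.order_amount) AS total_revenue
FROM orders o
JOIN dim_stores s ON o.store_id = s.store_id
JOIN dim_regions r ON s.region_id = r.region_id
JOIN dim_countries co ON r.country_id = co.country_id
GROUP BY co.country_name;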

Drill-down is implicit. Start with country-level:

sql
SELECT country_name, MEASURE(total_revenue)
FROM sales_metrics
GROUP BY country_name;

Drill into regions within a country:

sql
SELECT country_name, region_name, MEASURE(total_revenue)
FROM sales_metrics
WHERE country_name = 'United States'
GROUP BY country_name, region_name;

Further drill into stores:

sql
SELECT region_name, store_name, MEASURE(total_revenue)
FROM sales_metrics
WHERE country_name = 'United States'
GROUP BY region_name, store_name;

Each query dynamically aggregates to the requested grain. No pre-defined drill paths—the semantic graph supports any traversal.

Complex Formulas: Handling Ratios, Distinct Counts, and Derived Measures

Databricks explicitly designed Metric Views to handle "complex measures like ratios and distinct counts that cannot be safely re-aggregated".

The problem with naive views:

sql
-- Traditional view (WRONG for re-aggregation)
CREATE VIEW revenue_per_customer AS
SELECT
  region,
  SUM(revenue) / COUNT(DISTINCT customer_id) AS rpc
FROM orders
GROUP BY region;

-- Now try to roll up to country level:
SELECT country, SUM(rpc) FROM revenue_per_customer ...
-- WRONG! You're summing ratios, not recomputing ratio of sums

Metric Views solution: Separate measure definitions from aggregation grain.

yaml
measures:
  - name: total_revenue
    expr: SUM(revenue)
  - name: unique_customers
    expr: COUNT(DISTINCT customer_id)
  - name: revenue_per_customer
    expr: total_revenue / unique_customers  # Derived measure

Query at region level:

sql
SELECT region, MEASURE(revenue_per_customer)
FROM sales_metrics
GROUP BY region;
-- Computes: SUM(revenue) / COUNT(DISTINCT customer_id) per region

Roll up to country level:

sql
SELECT country, MEASURE(revenue_per_customer)
FROM sales_metrics
GROUP BY country;
-- Computes: SUM(revenue) / COUNT(DISTINCT customer_id) per country
-- NOT: SUM(region_ratios)

The engine recomputes base measures at the requested grain, then applies the formula.
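
Conceptually, the country-level query expands to something like the following sketch, with both base measures recomputed at the country grain before the division happens:

sql
-- Conceptual SQL for MEASURE(revenue_per_customer) grouped by country
SELECT
  c.country,
  SUM(o.revenue) / COUNT(DISTINCT o.customer_id) AS revenue_per_customer
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.country;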

Distinct counts across joins are handled correctly:

yaml
measures:
  - name: unique_products_sold
    expr: COUNT(DISTINCT product_id)

When querying by customer region (requiring orders → customers join):

sql
SELECT customer_region, MEASURE(unique_products_sold)
FROM sales_metrics
GROUP BY customer_region;

Databricks ensures the distinct count happens after the join, not before:

sql
-- Generated SQL
SELECT
  c.region,
  COUNT(DISTINCT o.product_id)  -- Distinct after join
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region;

If distinct count happened before the join, you'd count products globally, not per region.

Complex time-based calculations are supported via Spark SQL expressions:

yaml
measures:
  - name: rolling_30day_revenue
    expr: SUM(revenue) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)

That said, window functions in measures are less common; rolling calculations are typically handled in dimensions or computed after the query.

Security and Governance: Unity Catalog Integration

Unity Catalog is Databricks' governance foundation. Metric Views inherit all governance features:

Access control:

sql
GRANT SELECT ON METRIC VIEW main.sales.sales_metrics TO `marketing_team`;

Fine-grained: Users need permission on the metric view, not on underlying tables.

Row-level security (via Unity Catalog policies):

sql
-- Row filters in Unity Catalog are SQL UDFs attached to the base table
-- (current_user_region() stands in for your own region-lookup UDF)
CREATE FUNCTION main.sales.region_filter(region STRING)
RETURNS BOOLEAN
RETURN region = current_user_region();

ALTER TABLE main.sales.orders
SET ROW FILTER main.sales.region_filter ON (region);

Queries via metric views automatically apply the filter. A user in APAC sees only APAC orders in all metrics.

Column masking:

sql
-- Column masks are UDFs as well; mask_email here is a placeholder masking function
ALTER TABLE main.sales.customers
ALTER COLUMN customer_email SET MASK main.sales.mask_email;

If a metric or dimension references customer_email, non-privileged users see masked values.

Certification: Mark metrics as production-ready:

sql
ALTER METRIC VIEW main.sales.sales_metrics
SET TBLPROPERTIES ('certified' = 'true', 'owner' = 'finance_team');

In Databricks UI, certified metrics display a trust badge. Analysts know these are approved for reporting.

Tagging (for discovery and compliance):

sql
ALTER METRIC VIEW main.sales.sales_metrics
SET TAGS ('domain' = 'finance', 'pii' = 'false', 'refresh_frequency' = 'daily');

Query metrics by tag:

sql
-- Find all finance domain metrics
SELECT metric_view_name, tags
FROM system.information_schema.metric_views
WHERE tags.domain = 'finance';

Lineage: Unity Catalog UI shows end-to-end lineage:

S3 raw data → Bronze tables → Silver tables → Gold tables → Metric View → Tableau Dashboard

Click on sales_metrics metric view:

  • Upstream dependencies: main.sales.orders, main.sales.customers, main.sales.products
  • Downstream consumers: Finance Dashboard, Marketing Notebook, Churn Model

Audit logging:

sql
SELECT user_name, request_params.metric_view, query_text
FROM system.access.audit
WHERE request_params.metric_view = 'main.sales.sales_metrics'
ORDER BY event_time DESC;

Track every metric query: who, when, which measures/dimensions, performance.

Performance Optimization: Spark SQL + Photon + Delta Lake

Databricks Metric Views execute on Databricks SQL warehouses, combining:

  • Spark SQL: Distributed query engine with Catalyst optimizer
  • Photon: Vectorized execution (2-3× faster than standard Spark)
  • Delta Lake: Columnar storage with ACID, time travel, and Z-ordering

Performance strategies:

1. Data skipping: Delta Lake tracks min/max/null count stats per file (Parquet). Queries with filters skip irrelevant files:

sql
WHERE order_date >= '2025-01-01'
-- Delta skips all files with max(order_date) < 2025-01-01

2. Z-ordering: Optimize file layout for common filter columns:

sql
OPTIMIZE main.sales.orders ZORDER BY (order_date, customer_region);

Queries filtered by date or region read fewer files.

3. Broadcast joins: Small dimension tables (<10MB) broadcast to all executors:

sql
-- Spark detects customers table is small
-- Broadcasts it to all nodes instead of shuffling
orders JOIN customers ...

4. Adaptive Query Execution (AQE): Spark adjusts plan mid-execution based on runtime stats:

  • Convert sort-merge join to broadcast join if table is smaller than expected
  • Coalesce small shuffle partitions to reduce overhead
  • Optimize skew joins (if one partition is 100× larger, split it further)

5. Delta cache: Local SSD cache on warehouse nodes. Frequently accessed data (dimension tables) stays cached, reducing cloud storage reads.

6. Query result cache: Identical queries return instantly if data unchanged (like Snowflake).

Real-world performance (Databricks internal benchmarks):

  • 10M row fact table, 100K dimension rows: 1-2 seconds (Small warehouse)
  • 500M row fact table, 5M dimension rows: 5-8 seconds (Medium warehouse)
  • 10B row fact table, 100M dimension rows: 30-60 seconds (Large warehouse with Photon)

Pre-aggregation option: For very expensive metrics queried frequently, create materialized views:

sql
CREATE MATERIALIZED VIEW monthly_revenue AS
SELECT
  DATE_TRUNC('month', order_date) AS month,
  customer_region,
  MEASURE(total_revenue) AS revenue
FROM main.sales.sales_metrics
GROUP BY month, customer_region;

But this sacrifices flexibility (locked into month + region grain).

Databricks' design philosophy: Query raw data by default (leverage Spark + Delta's power), pre-aggregate only for proven hotspots.

Trade-Offs of the Warehouse-Native Lakehouse Approach

✅ Advantages:

  • Unified data science + BI + ML: Same metrics in SQL dashboards, notebooks, and feature stores
  • Unity Catalog governance: Enterprise-grade RBAC, lineage, certification
  • High performance: Photon + Delta Lake handle billion-row datasets efficiently
  • Domain-driven organization: Group metrics by business domain (Finance, Marketing, Sales)
  • No additional infrastructure: Metrics are just catalog objects

❌ Limitations:

  • Databricks-only: Metrics don't work outside Databricks (can't query from Snowflake)
  • MEASURE() syntax: Required function (can't treat metric views like normal tables)
  • Join limitations: Can't freely join metric views to other tables in one query (must use CTEs or include tables in metric view definition)
  • Unity Catalog requirement: Must migrate from legacy Hive Metastore
  • Time grain handling: No automatic time dimension shortcuts (must manually define order_month, order_quarter dimensions)
  • New feature, limited docs: Best practices still emerging (as of late 2025)

Slowly Changing Dimensions: Like Snowflake, not automatically handled. Model Type-2 joins manually:

yaml
joins:
  - table: dim_customers_history
    on: |
      orders.customer_id = dim_customers_history.customer_id
      AND orders.order_date BETWEEN dim_customers_history.effective_start_date
                                AND dim_customers_history.effective_end_date

Sparse data management: Query-time filtering handles sparsity naturally (only existing data is processed). For time-series gaps, use a date dimension:

yaml
dimensions:
  - name: calendar_date
    expr: date_dim.date
joins:
  - table: dim_dates
    on: orders.order_date = dim_dates.date
    type: left  # Ensures all dates appear, even with zero orders

Warehouse-Native Summary: What These Architectures Share

Both Snowflake and Databricks converged on similar principles:

  1. Metadata as database objects: Semantic definitions live in the catalog (Information Schema / Unity Catalog)
  2. Query-time rewriting: Translate semantic queries into optimized native SQL/Spark SQL
  3. No separate cache: Rely on warehouse caching (result cache, local SSD cache)
  4. Governance through the warehouse: Leverage existing RBAC, row filters, audit logs
  5. Performance through the engine: Trust the warehouse optimizer + columnar storage

Why this approach won mindshare (2024-2025):

  • Simplicity: No separate service to deploy, secure, or scale
  • Trust: Metrics computed directly on source data (no stale cache issues)
  • Governance: One security model to learn and audit
  • Cost: No additional license (semantic layer is included with warehouse)

When this approach struggles:

  • Multi-cloud: If data lives in multiple warehouses (Snowflake + BigQuery), warehouse-native doesn't span them
  • BI tool limitations: Many tools can't yet natively "discover" semantic views (requires custom SQL or waiting for integrations)
  • Advanced caching needs: If you need aggressive pre-aggregation for extreme performance, warehouse-native alone may not suffice (though you can layer materialized views on top)

Approach 2: Transformation-Layer Semantic Models (dbt MetricFlow)

The Design Philosophy: Metrics as Code in the Transformation Layer

While Snowflake and Databricks embedded semantics into the warehouse, dbt Labs took a radically different approach: define metrics as version-controlled code in the transformation layer.

The reasoning: Data teams already use dbt to transform raw data into analytics-ready tables. Why introduce a separate tool for metrics? Instead, extend dbt's existing workflow to include metric definitions alongside table transformations.

Launch: MetricFlow joined dbt via the Transform acquisition in early 2023; the revamped dbt Semantic Layer reached GA in October 2023

Key architectural decision: Metrics live in YAML files in your dbt project, version-controlled in Git alongside your transformation SQL.

How It Works: YAML Definitions + SQL Generation

A dbt semantic model defines entities, dimensions, and measures:

yaml
# models/metrics/sales_semantic_model.yml
semantic_models:
  - name: orders
    model: ref('fct_orders')
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign

    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
      - name: customer_region
        expr: dim_customers.region

    measures:
      - name: order_count
        agg: count
        expr: order_id
        create_metric: true
      - name: total_revenue
        agg: sum
        expr: order_amount
        create_metric: true

# Ratios are defined as metrics (not measure aggregations) in MetricFlow
metrics:
  - name: avg_order_value
    type: ratio
    type_params:
      numerator: total_revenue
      denominator: order_count

Query via dbt Semantic Layer API:

python
# Python SDK (dbt-sl-sdk package)
from dbtsl import SemanticLayerClient

client = SemanticLayerClient(
    environment_id=123,                      # your dbt Cloud environment ID
    auth_token="<service-token>",
    host="semantic-layer.cloud.getdbt.com",
)

with client.session():
    result = client.query(
        metrics=["total_revenue", "avg_order_value"],
        group_by=["customer_region", "order_date__month"],
    )

What happens under the hood:

  1. Request parsing: API receives metrics + dimensions
  2. SQL generation: MetricFlow generates optimized SQL:
sql
-- Generated by MetricFlow
SELECT
  c.region AS customer_region,
  DATE_TRUNC('month', o.order_date) AS order_date__month,
  SUM(o.order_amount) AS total_revenue,
  SUM(o.order_amount) / COUNT(*) AS avg_order_value
FROM analytics.fct_orders o
JOIN analytics.dim_customers c ON o.customer_id = c.customer_id
GROUP BY 1, 2
  3. Execution: SQL runs on your data warehouse (Snowflake, BigQuery, Databricks, Redshift)
  4. Result return: Data flows back through the API to the requesting tool

Critical difference from warehouse-native: The semantic layer is middleware. It sits between BI tools and the warehouse, translating metric requests into SQL.

The Git-Based Governance Model

dbt MetricFlow's killer feature: metrics are code, which means:

Version control:

bash
git log models/metrics/sales_semantic_model.yml
# See complete history of metric definition changes

Code review:

yaml
# Pull request: Change revenue metric definition
- measures:
-   - name: total_revenue
-     agg: sum
-     expr: order_amount
+ measures:
+   - name: total_revenue
+     agg: sum
+     expr: order_amount - discount_amount  # Now net of discounts

Analytics engineer reviews: "This changes Q4 revenue by $2M, approved after confirming with finance team."

CI/CD integration:

yaml
# .github/workflows/dbt_ci.yml
- name: Test metric changes
  run: |
    dbt parse  # Validates YAML syntax
    dbt test   # Runs data quality tests
    dbt-sl compare-metrics --base main --head feature-branch
    # Shows how metric values change with new definitions

Automatic documentation:

bash
dbt docs generate
# Creates searchable catalog of all metrics with:
# - Definition (SQL)
# - Lineage (which models it depends on)
# - Ownership (who maintains it)
# - Change log (git history)

Why this matters: Metrics aren't hidden in a database. They're transparent, auditable code that follows software engineering best practices.

Integration with dbt's Transformation Pipeline

The transformation layer approach creates a unified workflow:

Step 1: Transform raw data (traditional dbt):

sql
-- models/staging/stg_orders.sql
SELECT
  order_id,
  customer_id,
  order_date,
  order_amount
FROM raw.orders
WHERE order_status != 'cancelled'

Step 2: Build fact table (traditional dbt):

sql
-- models/marts/fct_orders.sql
SELECT
  o.order_id,
  o.customer_id,
  o.order_date,
  o.order_amount,
  c.region AS customer_region
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('dim_customers') }} c USING (customer_id)

Step 3: Define metrics on fact table (MetricFlow):

yaml
semantic_models:
  - name: orders
    model: ref('fct_orders')  # Points to Step 2
    measures:
      - name: total_revenue
        agg: sum
        expr: order_amount

One DAG, one pipeline: Metrics inherit dbt's dependency graph. If stg_orders changes, both fct_orders and downstream metrics are refreshed.

Query-Time Aggregation and BI Tool Integration

dbt MetricFlow queries are always aggregated on-demand (like warehouse-native approaches, unlike pre-aggregated OLAP cubes).

BI tool integration happens via:

  1. Tableau via Semantic Layer API: Tableau connects to dbt Cloud API endpoint, discovers metrics
  2. Looker Studio: Custom connector queries MetricFlow programmatically
  3. Hex/Mode: Python SDK in notebooks
  4. Excel/Spreadsheets: REST API endpoints

Example REST API request:

bash
# Simplified sketch: the GraphQL API wraps requests in a createQuery mutation
curl -X POST https://semantic-layer.cloud.getdbt.com/api/graphql \
  -H "Authorization: Bearer $DBT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { createQuery(environmentId: 123, metrics: [{name: \"total_revenue\"}], groupBy: [{name: \"customer_region\"}]) { queryId } }"}'

Performance characteristics:

  • Latency: API overhead (~100-300ms) + warehouse query time
  • Caching: dbt Cloud caches results for 24 hours by default (configurable)
  • Scaling: Limited by your warehouse capacity (MetricFlow generates SQL, warehouse executes it)

Trade-Offs of the Transformation-Layer Approach

✅ Advantages:

  • Warehouse-agnostic: Works on Snowflake, BigQuery, Databricks, Redshift, Postgres
  • Git-based governance: Version control, code review, CI/CD for metrics
  • Unified workflow: Metrics live in the same repo as transformations
  • dbt ecosystem: Leverage existing dbt tests, documentation, packages
  • No vendor lock-in: Metrics are portable YAML files

❌ Limitations:

  • API dependency: BI tools must integrate with dbt Cloud API (or self-host dbt Semantic Layer server)
  • Query latency: Network hop adds 100-300ms vs warehouse-native
  • Limited caching control: Can't fine-tune caching per metric (without enterprise features)
  • dbt Cloud cost: Semantic Layer requires Team or Enterprise plan
  • Learning curve: Teams must learn both dbt and MetricFlow YAML spec

Approach 3: OLAP-Style Semantic Acceleration (Cube.dev)

The Design Philosophy: Intelligent Caching Layer

Cube.dev (formerly Cube.js) represents the third architectural paradigm: an intelligent caching and pre-aggregation layer between BI tools and data warehouses.

Unlike warehouse-native (compute in database) or transformation-layer (generate SQL via API), Cube adds a dedicated stateful service that:

  • Stores pre-aggregated data
  • Routes queries to cache or warehouse
  • Manages incremental refresh

The core bet: For interactive dashboards, on-demand aggregation is too slow. Pre-aggregating common query patterns delivers sub-second response times at scale.

Launch: Open-sourced 2019, Cube Cloud GA 2021

How It Works: Data Modeling + Pre-Aggregation Engine

Cube models are defined in JavaScript:

javascript
// schema/Orders.js
cube('Orders', {
  sql: `SELECT * FROM analytics.fct_orders`,

  joins: {
    Customers: {
      sql: `${Orders}.customer_id = ${Customers}.customer_id`,
      relationship: 'belongsTo'
    }
  },

  measures: {
    totalRevenue: {
      sql: 'order_amount',
      type: 'sum'
    },
    avgOrderValue: {
      sql: 'order_amount',
      type: 'avg'
    }
  },

  dimensions: {
    customerRegion: {
      sql: `${Customers}.region`,
      type: 'string'
    },
    orderDate: {
      sql: 'order_date',
      type: 'time'
    }
  },

  preAggregations: {
    dailyByRegion: {
      measures: [Orders.totalRevenue, Orders.avgOrderValue],
      dimensions: [Orders.customerRegion],
      timeDimension: Orders.orderDate,
      granularity: 'day',
      refreshKey: {
        every: '1 hour'
      }
    }
  }
});

Pre-aggregation logic:

  1. Build phase: Cube runs SQL on the warehouse to create aggregated tables:
sql
-- Generated pre-aggregation query
SELECT
  DATE_TRUNC('day', order_date) AS order_date_day,
  c.region AS customer_region,
  SUM(order_amount) AS total_revenue,
  AVG(order_amount) AS avg_order_value,
  COUNT(*) AS record_count
FROM analytics.fct_orders o
JOIN analytics.dim_customers c ON o.customer_id = c.customer_id
GROUP BY 1, 2
  2. Storage: Results stored in:

    • Cube Store (Cube's embedded columnar database)
    • OR warehouse tables (e.g., Snowflake tables managed by Cube)
    • OR external cache (Redis)
  3. Query routing: When a dashboard queries "total revenue by region for last 30 days":

    • Cube checks if dailyByRegion pre-agg covers it → Yes, serve from cache (10-50ms)
    • If not covered → Generate SQL, query warehouse (2-10 seconds)
  4. Incremental refresh: Every hour, Cube runs:

sql
SELECT ... WHERE order_date >= (SELECT MAX(order_date) FROM pre_agg_table)

Only fetches new data, appends to pre-aggregation.

The Query Planning Engine: Cache vs Warehouse Decisions

Cube's query planner determines whether to use cache or hit the warehouse:

Decision tree:

  1. Does a pre-aggregation cover the requested measures + dimensions + time range?
  2. Is the pre-aggregation fresh (last refresh within refreshKey interval)?
  3. If yes → Serve from cache
  4. If no → Query warehouse directly

Example scenarios:

| Query | Pre-Agg Exists? | Covers Query? | Result |
| --- | --- | --- | --- |
| Revenue by region, last 30 days | ✅ Yes (dailyByRegion) | ✅ Yes | Cache hit (20ms) |
| Revenue by region + product category | ✅ Yes (dailyByRegion) | ❌ No (missing product dimension) | Warehouse query (5s) |
| Revenue for single order ID | ✅ Yes | ❌ No (grain too coarse) | Warehouse query (2s) |
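
To make the routing concrete, here is a sketch of the two paths for the first two scenarios above. The rollup table name is hypothetical (Cube manages it internally), and analytics.dim_products is an assumed dimension table:

sql
-- Covered query: served from the dailyByRegion rollup in Cube Store
SELECT customer_region, SUM(total_revenue) AS total_revenue
FROM pre_agg_orders_daily_by_region   -- hypothetical materialized rollup
WHERE order_date_day >= CURRENT_DATE - 30
GROUP BY customer_region;

-- Uncovered query (extra product dimension): falls through to the warehouse
SELECT c.region, p.category, SUM(o.order_amount) AS total_revenue
FROM analytics.fct_orders o
JOIN analytics.dim_customers c ON o.customer_id = c.customer_id
JOIN analytics.dim_products p ON o.product_id = p.product_id
WHERE o.order_date >= CURRENT_DATE - 30
GROUP BY c.region, p.category;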

Hybrid queries: If a query is partially covered, Cube can:

  • Fetch aggregates from cache
  • Join with dimension tables from warehouse
  • Return combined result

Cube Store: The Custom Columnar Engine

Cube Store is Cube's embedded OLAP engine (written in Rust):

Architecture:

  • Columnar format: Apache Parquet files
  • Distributed: Horizontal scaling with data partitioning
  • Query engine: Vectorized execution (similar to DuckDB/ClickHouse)

Why build a custom engine? Warehouse pre-aggregation tables have limitations:

  • Cost: Storing many rollup tables in Snowflake is expensive
  • Query speed: Even querying small aggregated tables has 500ms-1s overhead
  • Concurrency: Warehouse queries queue under high load

Cube Store delivers:

  • Sub-50ms queries for pre-aggregated data
  • No warehouse cost for cached queries
  • Higher concurrency (thousands of queries/second)

Performance comparison (10M row dataset, querying daily aggregates):

  • Snowflake table: 800ms average
  • Cube Store: 25ms average
  • Cost: $0.03/query (Snowflake) vs $0.00/query (Cube Store)

Real-Time Data and Streaming Integrations

Cube supports streaming pre-aggregations for real-time dashboards:

javascript
preAggregations: {
  realtimeSales: {
    measures: [Orders.totalRevenue],
    dimensions: [Orders.customerRegion],
    timeDimension: Orders.orderDate,
    granularity: 'minute',
    refreshKey: {
      every: '10 seconds',  // Near-real-time refresh
      incremental: true      // Only fetch last 10 seconds of data
    }
  }
}

Integration with Kafka/streaming platforms:

  • Cube can ingest from Kafka topics directly
  • OR query change data capture (CDC) tables in warehouse
  • Append to rolling window pre-aggregations

Use case: Live operations dashboard showing last-hour metrics, refreshing every 10 seconds.

Trade-Offs of the OLAP-Acceleration Approach

✅ Advantages:

  • Extreme performance: Sub-second queries for common patterns
  • Cost savings: Reduce warehouse compute by 60-90% (cached queries free)
  • High concurrency: Serve thousands of users without warehouse scaling
  • Warehouse-agnostic: Works with any SQL database
  • Developer-friendly: JavaScript-based modeling

❌ Limitations:

  • Additional infrastructure: Must deploy and maintain Cube service
  • Pre-agg management: Requires tuning which aggregations to create
  • Stale data risk: Cache can be minutes/hours behind (depending on refresh interval)
  • Storage overhead: Pre-aggregations consume disk space
  • Modeling complexity: Must define measures, dimensions, joins, AND pre-aggs

Hybrid Approaches: Looker and Omni

Looker: LookML + Database Computation

Looker sits between transformation-layer and warehouse-native approaches:

Architecture:

  • LookML: Proprietary modeling language (similar to dbt YAML)
  • Query execution: Generates SQL, runs in warehouse (Looker is stateless)
  • Caching: Optional persistent derived tables (PDTs) in warehouse

Key features:

  • Modeling layer: Define metrics, dimensions, joins in LookML
  • Explore UI: Business users drag-and-drop to build queries
  • Derived tables: Pre-aggregate hot paths, stored as warehouse tables

Trade-offs:

  • ✅ Tight integration with Looker BI tool
  • ✅ No separate infrastructure (query warehouse directly)
  • ❌ Vendor lock-in (LookML only works in Looker)
  • ❌ Google-owned (synergy with BigQuery, but multi-cloud concerns)

Omni: Embedded Semantic Layer in BI Tool

Omni is a newer BI platform with a built-in semantic layer:

Architecture:

  • Omni data models: YAML-based metric definitions (inspired by dbt)
  • Query execution: Native SQL generation, warehouse execution
  • Integration: Tightly coupled with Omni's visualization layer

Positioning: "dbt MetricFlow + Looker" combined into one product.

Trade-offs:

  • ✅ Unified BI + semantic layer (no separate tools)
  • ✅ Git-based workflow (models stored in version control)
  • ❌ Omni-only (can't use metrics outside Omni BI)
  • ❌ Younger product (smaller ecosystem than Looker/Tableau)

Architectural Trade-Offs: Decision Matrix

| Criteria | Warehouse-Native (Snowflake/Databricks) | Transformation-Layer (dbt MetricFlow) | OLAP-Acceleration (Cube.dev) |
| --- | --- | --- | --- |
| Performance | Good (2-10s for complex queries) | Good (warehouse-dependent) | Excellent (sub-second cached) |
| Cost | Moderate (warehouse compute costs) | Moderate (warehouse compute costs) | Low (cached queries free) |
| Freshness | Real-time (query latest data) | Real-time (query latest data) | Delayed (refresh interval) |
| Infrastructure | None (built into warehouse) | Minimal (API server) | Significant (Cube cluster + storage) |
| Warehouse Support | Single vendor | Multi-warehouse | Multi-warehouse |
| Governance | Native (warehouse RBAC/audit) | Git-based + dbt Cloud | Application-level RBAC |
| Version Control | Limited (export/import YAML) | Native (Git-first) | Native (JS models in Git) |
| BI Tool Support | Emerging (limited integrations) | API-based (requires connector) | Broad (REST/GraphQL APIs) |
| Developer Experience | SQL-based (familiar) | YAML + dbt ecosystem | JavaScript (flexible) |
| Learning Curve | Low (if you know SQL) | Moderate (learn dbt + MetricFlow) | Moderate (learn Cube schema) |
| Best For | Single-warehouse shops, governance-heavy | Multi-cloud, dbt-native teams | High-concurrency dashboards, cost optimization |

Conclusion: Choosing Your Semantic Layer Architecture

The semantic layer market in 2025 is multi-paradigm. There's no one-size-fits-all answer—the right choice depends on your architecture, team, and use cases.

Decision Framework

Choose Warehouse-Native (Snowflake/Databricks) if:

  • ✅ You're all-in on one warehouse platform
  • ✅ Governance and security are top priorities
  • ✅ You want zero operational overhead
  • ✅ You need real-time data (no caching delays)
  • ✅ Your team prefers SQL over YAML/JavaScript

Choose Transformation-Layer (dbt MetricFlow) if:

  • ✅ You're already using dbt for transformations
  • ✅ You need multi-warehouse support (Snowflake + BigQuery)
  • ✅ Git-based workflows are essential
  • ✅ You want vendor portability (avoid lock-in)
  • ✅ Your team values code review and CI/CD for metrics

Choose OLAP-Acceleration (Cube.dev) if:

  • ✅ Sub-second query performance is critical
  • ✅ You have high-concurrency dashboards (100+ concurrent users)
  • ✅ Warehouse costs are a major concern
  • ✅ You're willing to manage additional infrastructure
  • ✅ Stale data (minutes/hours lag) is acceptable for most queries

Hybrid Strategies

Many teams use combinations:

Example 1: dbt + Snowflake Semantic Views

  • Use dbt for transformations and data quality tests
  • Define metrics in Snowflake Semantic Views (for zero-latency governance)
  • Export dbt metric definitions → Snowflake (via CI/CD)

Example 2: dbt + Cube

  • dbt builds fact/dimension tables
  • Cube pre-aggregates for dashboards
  • dbt MetricFlow for ad-hoc SQL queries

Example 3: Databricks + Looker

  • Store data in Delta Lake
  • Define metrics in Databricks Unity Catalog
  • Query via Looker (using Databricks SQL connector)

The 2026 Outlook

The semantic layer space is rapidly evolving. Trends to watch:

  1. Standardization: The Open Semantic Interchange (OSI) initiative aims to make metric definitions portable across tools
  2. AI integration: LLMs querying semantic layers in natural language ("What was Q3 revenue in EMEA?")
  3. Real-time convergence: Streaming semantic layers for operational analytics
  4. Cost pressure: More teams adopting pre-aggregation to control warehouse costs

The bottom line: The "semantic layer problem" isn't solved by one architecture. It's solved by choosing the architecture that aligns with your data platform, team skills, and business requirements.

The industry has built three viable paths. Your job is to pick the one that fits your journey.
