Semantic Layer Architectures (2020–2025): Snowflake/Databricks Warehouse-Native, dbt MetricFlow, and Cube.dev OLAP—A Decision Guide for BI & LLMs
Introduction: The Semantic Layer Problem Space
Every data team eventually hits the same wall: metrics defined differently across tools, dashboards showing conflicting numbers, and business users losing trust in data. The CFO asks "What was Q3 revenue?" and gets three different answers from three different dashboards.
This isn't a new problem. For decades, the industry tried to solve it with OLAP cubes, BI tool semantic layers, and metric catalogs. But the modern cloud data warehouse era demanded a fresh approach. Between 2020 and 2025, as organizations migrated to Snowflake, Databricks, and BigQuery, a new generation of semantic layer solutions emerged—each taking radically different architectural approaches to the same fundamental challenge.
The Core Challenge: Where Should Semantic Logic Live?
The semantic layer problem breaks down into several interconnected questions:
The Modeling Question: How do you define metrics once in a way that's reusable across all tools?
The Execution Question: When a user queries "revenue by region," who computes it—the BI tool, a middleware layer, or the database itself?
The Performance Question: Do you pre-aggregate everything (fast but inflexible), query raw data every time (flexible but slow), or find a middle ground?
The Governance Question: How do you enforce who can see what metrics, and ensure definitions don't drift as teams change them?
Between 2020 and 2025, the industry converged on three distinct architectural patterns, each making different trade-offs on these questions:
- Warehouse-Native: Store semantic metadata as database objects (Snowflake, Databricks)
- Transformation-Layer: Semantic models as code in the data transformation layer (dbt MetricFlow)
- OLAP-Acceleration: Intelligent caching layer with pre-aggregations (Cube.dev)
This article dissects how each approach works internally, the engineering decisions that shaped them, and the trade-offs teams face when choosing between them.
Approach 1: Warehouse-Native Semantic Layers
The Design Philosophy: Metadata Living With Data
In 2024-2025, both Snowflake and Databricks made a controversial bet: the semantic layer should live inside the data warehouse, not as external tooling. This wasn't obvious. For years, the industry assumed semantic layers were middleware—Looker, Tableau, and SSAS all sat between users and databases, translating business logic into SQL.
But Snowflake and Databricks asked a provocative question: What if semantic definitions were database objects, just like tables and views?
The reasoning:
- Governance is simpler when there's one security model (the warehouse's RBAC)
- Performance is better when there's no network hop to a separate service
- Operations are simpler with zero additional infrastructure to manage
- Lineage is native since the warehouse already tracks object dependencies
This led to fundamentally different architectures than middleware-based semantic layers.
Snowflake Semantic Views: Architecture Deep Dive
Launch: Public Preview Q1 2025, GA August 2025
When Snowflake announced Semantic Views at Summit 2025, they introduced a new database object type: SEMANTIC VIEW. This sits alongside tables, standard views, and materialized views in Snowflake's metadata catalog.
How It Works: The Logical-to-Physical Mapping
A Snowflake Semantic View defines a logical data model that abstracts physical tables:
```sql
CREATE SEMANTIC VIEW sales_analytics AS
  LOGICAL TABLES
    orders (
      FACTS order_amount, tax_amount
      DIMENSIONS customer_id, product_id, order_date
      PRIMARY KEY order_id
    ),
    customers (
      DIMENSIONS customer_id, region, segment
      PRIMARY KEY customer_id
    ),
    products (
      DIMENSIONS product_id, category, brand
      PRIMARY KEY product_id
    )
  RELATIONSHIPS
    orders.customer_id = customers.customer_id,
    orders.product_id = products.product_id
  METRICS
    total_revenue AS SUM(order_amount),
    avg_order_value AS SUM(order_amount) / COUNT(DISTINCT order_id)
```
What's happening under the hood:
1. Metadata storage: The semantic view definition is stored in Snowflake's Information Schema, just like table schemas. `SHOW SEMANTIC METRICS` and `INFORMATION_SCHEMA.SEMANTIC_METRICS` expose the definitions.
2. Query rewriting: When you query a semantic view:
```sql
SELECT * FROM SEMANTIC_VIEW(
  sales_analytics
  DIMENSIONS customers.region, products.category
  METRICS total_revenue
)
```
Snowflake's query planner rewrites this into standard SQL:
```sql
SELECT
  c.region,
  p.category,
  SUM(o.order_amount) AS total_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
GROUP BY c.region, p.category
```
3. Optimization: The rewritten query goes through Snowflake's standard optimizer:
- Micro-partition pruning (skip files that don't contain relevant data)
- Join order optimization
- Predicate pushdown
- Result caching (identical queries return instantly if data unchanged)
The key architectural insight: Semantic Views add zero execution overhead. The generated SQL is identical to what a skilled analyst would write by hand.
Aggregations and Drill-Down: Flexible Grouping at Query Time
Snowflake's approach to aggregations is pure query-time flexibility. The semantic view defines measures (what to aggregate) and dimensions (what to group by), but doesn't lock them together.
Example: The total_revenue metric is defined as SUM(order_amount), but you can query it by:
- Region: `DIMENSIONS customers.region`
- Product category: `DIMENSIONS products.category`
- Month: `DIMENSIONS DATE_TRUNC('month', order_date)`
- All three: `DIMENSIONS customers.region, products.category, DATE_TRUNC('month', order_date)`
The engine generates the appropriate GROUP BY clause each time. This is drill-down by design—you're not pre-computing aggregates at every level, you're dynamically aggregating to the requested grain.
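For example, regrouping the same metric at month grain is just another query. A sketch following the SEMANTIC_VIEW syntax shown earlier; the rewritten SQL is conceptual:
```sql
-- Same metric, different grain: the engine builds the GROUP BY on demand
SELECT * FROM SEMANTIC_VIEW(
  sales_analytics
  DIMENSIONS DATE_TRUNC('month', orders.order_date)
  METRICS total_revenue
);

-- Conceptual rewrite produced by the planner
SELECT
  DATE_TRUNC('month', o.order_date) AS order_month,
  SUM(o.order_amount) AS total_revenue
FROM orders o
GROUP BY 1;
```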
Why this works efficiently: Snowflake's columnar storage and query optimizer handle on-demand aggregation well. For a 500M row fact table:
- Simple metric by single dimension: 2-3 seconds (Medium warehouse)
- Complex metric across 3-hop join: 8-12 seconds (Large warehouse)
- Warm cache: <1 second
The trade-off: You can't query specific dimensions faster by pre-aggregating them (unless you create materialized views separately). But the flexibility often outweighs this for exploratory analytics.
Complex Formulas and Calculated Measures
Snowflake handles complex calculations by storing formula definitions in metadata and computing them at query time with correct aggregation semantics.
Ratio metrics are a classic pitfall. Computing revenue / orders naively:
```sql
-- WRONG: This computes the ratio per row, then sums the ratios
SELECT SUM(revenue / orders) ...  -- Mathematically incorrect!
```
Snowflake's semantic layer prevents this. When you define:
```sql
METRICS avg_order_value AS SUM(order_amount) / COUNT(DISTINCT order_id)
```
The engine computes the numerator and denominator separately, then divides:
```sql
-- CORRECT: Generated by Snowflake
SELECT SUM(order_amount) / COUNT(DISTINCT order_id) AS avg_order_value
FROM ...
GROUP BY ...
```
This extends to arbitrary complexity:
Churn rate example:
```sql
METRICS churn_rate AS
  COUNT(DISTINCT CASE WHEN last_order_date < CURRENT_DATE - 90 THEN customer_id END)
  / COUNT(DISTINCT customer_id)
```
Derived metrics (metrics built from other metrics):
```sql
METRICS
  gross_revenue AS SUM(order_amount),
  discounts AS SUM(discount_amount),
  net_revenue AS gross_revenue - discounts
```
The semantic view expands net_revenue into:
```sql
SUM(order_amount) - SUM(discount_amount) AS net_revenue
```
This composability is critical for maintainability—change gross_revenue once, and net_revenue updates automatically.
Multidimensional Modeling: Star Schemas Made Explicit
Snowflake's logical tables and relationships formalize star schema modeling in metadata.
Traditional approach (before semantic views):
- Data engineers create fact and dimension tables
- BI developers remember (or guess) how to join them
- Each dashboard rebuilds the same join logic
- Mistakes happen (Cartesian products, fan-out issues)
Semantic Views approach:
- Define relationships once: `orders.customer_id = customers.customer_id`
- The warehouse enforces cardinality (one-to-many, many-to-one)
- Every query through the semantic view uses correct joins
Multi-hop joins are supported transparently. Example:
Orders → Stores → Regions
Query "total_revenue by region_name":
- The semantic view knows: Orders links to Stores via `store_id`, and Stores links to Regions via `region_id`
- It generates: `FROM orders o JOIN stores s ON o.store_id = s.store_id JOIN regions r ON s.region_id = r.region_id`
- The user never writes JOIN syntax
This prevents fan-out bugs. If you naively did FROM orders JOIN regions without the intermediate store join, you'd get incorrect results (multiple rows per order if regions aren't properly scoped). The semantic view's relationship definitions prevent this.
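Concretely, the request and the join path it expands to might look like this (a sketch that assumes a semantic view declaring the stores and regions logical tables and the relationships above; the generated SQL is conceptual):
```sql
-- What the analyst writes
SELECT * FROM SEMANTIC_VIEW(
  sales_analytics
  DIMENSIONS regions.region_name
  METRICS total_revenue
);

-- What the engine generates: the declared relationships force the full join path,
-- never a direct orders-to-regions join that could fan out rows
SELECT
  r.region_name,
  SUM(o.order_amount) AS total_revenue
FROM orders o
JOIN stores s ON o.store_id = s.store_id
JOIN regions r ON s.region_id = r.region_id
GROUP BY r.region_name;
```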
Security and Governance: Native Integration
The killer feature of warehouse-native semantic layers: you get the warehouse's security model for free.
Access control:
```sql
GRANT SELECT ON SEMANTIC VIEW sales_analytics TO ROLE analyst_role;
```
Users with analyst_role can query metrics, but don't need access to underlying tables:
- Raw table: `raw.orders` (contains PII, restricted)
- Semantic view: `sales_analytics` (exposes only aggregated metrics)
- Analysts: `SELECT` on the semantic view, no access to `raw.orders`
This is governed self-service—explore metrics without seeing sensitive data.
Row-level security flows through automatically. If raw.orders has a row access policy:
```sql
CREATE ROW ACCESS POLICY orders_region_filter AS (region STRING)
RETURNS BOOLEAN -> region = CURRENT_USER_REGION()
```
Then queries via the semantic view inherit that filter. A user in the EMEA region only sees EMEA orders in any metric, with zero additional configuration.
Audit logging: All semantic view queries appear in ACCOUNT_USAGE.QUERY_HISTORY:
```sql
SELECT user_name, query_text, execution_time
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_text ILIKE '%SEMANTIC_VIEW(sales_analytics%'
```
Track who queries which metrics, how often, and which dimensions they analyze.
Performance Optimization: Leveraging Warehouse Features
Snowflake Semantic Views don't have separate caching—they leverage Snowflake's existing performance infrastructure:
Result cache: Query total_revenue by region twice → second query is instant (if data unchanged).
Warehouse cache: Frequently accessed micro-partitions stay in local SSD cache.
Clustering: If orders is clustered by order_date, queries filtered by date prune partitions efficiently.
For extreme performance needs, combine semantic views with materialized views:
```sql
-- Pre-aggregate expensive metric at common grain
CREATE MATERIALIZED VIEW daily_revenue_by_region AS
SELECT
  DATE_TRUNC('day', order_date) AS day,
  customer_region,
  SUM(order_amount) AS revenue
FROM raw.orders
JOIN raw.customers USING (customer_id)
GROUP BY 1, 2;

-- Point the semantic view's logical table at the materialized view
CREATE SEMANTIC VIEW sales_analytics AS
  LOGICAL TABLES
    daily_summary (
      BASE_TABLE daily_revenue_by_region
      METRICS revenue
      DIMENSIONS day, customer_region
    )
  ...
```
Now queries hit the pre-aggregated table. But this sacrifices flexibility (you're locked into day + region grain).
Snowflake's design philosophy: Query raw data by default (leverage warehouse power), pre-aggregate selectively when needed.
Trade-Offs of the Warehouse-Native Approach
✅ Advantages:
- Zero infrastructure overhead: No separate service to deploy or maintain
- Native performance: Fully optimized by Snowflake's engine
- Unified governance: One security model for data and metrics
- Operational simplicity: Metrics managed like any other database object
❌ Limitations:
- Snowflake-only: Can't use metrics on data in BigQuery, Databricks, etc.
- Limited BI tool support (initially): Most tools can't natively "discover" semantic views yet (requires custom SQL or waiting for OSI integrations)
- No native version control: YAML export/import exists, but no git integration like dbt
- Proprietary syntax: `SEMANTIC_VIEW()` is Snowflake-specific SQL
Slowly Changing Dimensions (SCD): Not automatically handled. If you have Type-2 dimensions (multiple records per entity over time), you must model them correctly in your joins:
```sql
-- Manual SCD handling in semantic view join
RELATIONSHIPS
  orders.customer_id = customers.customer_id
  AND orders.order_date BETWEEN customers.effective_start_date
                            AND customers.effective_end_date
```
Snowflake doesn't generate this logic—you define it.
Sparse data management: Non-issue for Snowflake Semantic Views because they query on-demand. Only data that exists is processed. For time-series gaps (months with zero sales), use a date dimension table:
```sql
-- Time spine to fill gaps
LOGICAL TABLES
  date_dim (
    DIMENSIONS date, year, month, quarter
  )
RELATIONSHIPS
  orders.order_date = date_dim.date (LEFT JOIN)
```
The LEFT JOIN ensures all dates appear, with NULL/zero for missing facts.
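Conceptually, a "total_revenue by month" query then keeps every month from the spine, with gaps surfacing as zeros. An illustrative sketch of the rewritten SQL, where COALESCE turns the NULLs produced by the LEFT JOIN into explicit zeros:
```sql
SELECT
  DATE_TRUNC('month', d.date)      AS month,
  COALESCE(SUM(o.order_amount), 0) AS total_revenue
FROM date_dim d
LEFT JOIN orders o ON o.order_date = d.date  -- spine rows preserved, facts optional
GROUP BY 1
ORDER BY 1;
```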
Databricks Unity Catalog Metric Views: Architecture Deep Dive
Launch: Public Preview June 2025, GA late 2025
Databricks entered the semantic layer space later than Snowflake, but with a unique advantage: Unity Catalog, their governance layer, was already battle-tested. Metric Views extend Unity Catalog to cover metrics, not just tables.
The architectural philosophy: Unify data science, ML, and BI under one semantic layer. Unlike BI-focused semantic layers, Databricks designed Metric Views to work equally well in:
- SQL dashboards
- Python/R notebooks
- ML feature engineering
- LLM-powered AI agents
How It Works: The MEASURE() Function and Spark Query Rewriting
Databricks Metric Views are defined in YAML or SQL DDL:
YAML definition:
```yaml
metric_view:
  name: sales_metrics
  source_table: main.sales.orders
  joins:
    - table: main.sales.customers
      on: orders.customer_id = customers.customer_id
    - table: main.sales.products
      on: orders.product_id = products.product_id
  measures:
    - name: total_revenue
      expr: SUM(order_amount)
    - name: avg_order_value
      expr: SUM(order_amount) / COUNT(DISTINCT order_id)
  dimensions:
    - name: customer_region
      expr: customers.region
    - name: product_category
      expr: products.category
```
Register in Unity Catalog:
```sql
CREATE METRIC VIEW main.sales.sales_metrics
FROM 'dbfs:/metrics/sales_metrics.yaml';
```
Querying uses a special MEASURE() function:
```sql
SELECT
  customer_region,
  MEASURE(total_revenue),
  MEASURE(avg_order_value)
FROM main.sales.sales_metrics
WHERE product_category = 'Electronics'
GROUP BY customer_region;
```
Query compilation (Spark SQL + Catalyst optimizer):
1. Parse the `MEASURE()` syntax
2. Expand measures into their definitions:
   - `MEASURE(total_revenue)` → `SUM(order_amount)`
   - `MEASURE(avg_order_value)` → `SUM(order_amount) / COUNT(DISTINCT order_id)`
3. Inject joins from the metric view spec:
   - `orders → customers` on `customer_id`
   - `orders → products` on `product_id`
4. Optimize via Catalyst:
   - Predicate pushdown: `WHERE product_category = 'Electronics'` is pushed to the `products` table scan
   - Join reordering: small dimension tables are broadcast to executors
   - Partition pruning: skip irrelevant Delta Lake files
5. Execute with Photon (vectorized engine for analytical queries)
```sql
SELECT
  c.region AS customer_region,
  SUM(o.order_amount) AS total_revenue,
  SUM(o.order_amount) / COUNT(DISTINCT o.order_id) AS avg_order_value
FROM main.sales.orders o
JOIN main.sales.customers c ON o.customer_id = c.customer_id
JOIN main.sales.products p ON o.product_id = p.product_id
WHERE p.category = 'Electronics'
GROUP BY c.region;
```
Like Snowflake, no execution overhead—the metric view generates efficient native SQL.
Aggregations and Drill-Down: Dynamic Grouping with Star Schema Support
Databricks' approach mirrors Snowflake's: define measures and dimensions separately, combine at query time.
Key innovation: Multi-level joins for dimensional hierarchies.
Example star schema:
Orders (fact) → Stores → Regions → Countries
Metric view YAML:
```yaml
joins:
  - table: dim_stores
    on: orders.store_id = dim_stores.store_id
  - table: dim_regions
    on: dim_stores.region_id = dim_regions.region_id
  - table: dim_countries
    on: dim_regions.country_id = dim_countries.country_id
```
Query "total_revenue by country":
```sql
SELECT country_name, MEASURE(total_revenue)
FROM sales_metrics
GROUP BY country_name;
```
Databricks traces the join path:
orders → stores → regions → countries
Generates three-hop join automatically. This prevents fan-out (the common mistake of joining facts directly to distant dimensions, causing duplicate rows).
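A conceptual sketch of the SQL this expands to (illustrative only; the actual physical plan comes from Catalyst and Photon, and the dimension and column names are assumed from the join spec above):
```sql
SELECT
  co.country_name,
  SUM(o.order_amount) AS total_revenue
FROM main.sales.orders o
JOIN dim_stores s     ON o.store_id   = s.store_id
JOIN dim_regions r    ON s.region_id  = r.region_id
JOIN dim_countries co ON r.country_id = co.country_id
GROUP BY co.country_name;
```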
Drill-down is implicit. Start with country-level:
```sql
SELECT country_name, MEASURE(total_revenue)
FROM sales_metrics
GROUP BY country_name;
```
Drill into regions within a country:
```sql
SELECT country_name, region_name, MEASURE(total_revenue)
FROM sales_metrics
WHERE country_name = 'United States'
GROUP BY country_name, region_name;
```
Further drill into stores:
```sql
SELECT region_name, store_name, MEASURE(total_revenue)
FROM sales_metrics
WHERE country_name = 'United States'
GROUP BY region_name, store_name;
```
Each query dynamically aggregates to the requested grain. No pre-defined drill paths—the semantic graph supports any traversal.
Complex Formulas: Handling Ratios, Distinct Counts, and Derived Measures
Databricks explicitly designed Metric Views to handle "complex measures like ratios and distinct counts that cannot be safely re-aggregated".
The problem with naive views:
```sql
-- Traditional view (WRONG for re-aggregation)
CREATE VIEW revenue_per_customer AS
SELECT region, SUM(revenue) / COUNT(DISTINCT customer_id) AS rpc
FROM orders
GROUP BY region;

-- Now try to roll up to country level:
SELECT country, SUM(rpc) FROM revenue_per_customer ...
-- WRONG! You're summing ratios, not recomputing the ratio of sums
```
Metric Views solution: Separate measure definitions from aggregation grain.
```yaml
measures:
  - name: total_revenue
    expr: SUM(revenue)
  - name: unique_customers
    expr: COUNT(DISTINCT customer_id)
  - name: revenue_per_customer
    expr: total_revenue / unique_customers  # Derived measure
```
Query at region level:
```sql
SELECT region, MEASURE(revenue_per_customer)
FROM sales_metrics
GROUP BY region;
-- Computes: SUM(revenue) / COUNT(DISTINCT customer_id) per region
```
Roll up to country level:
```sql
SELECT country, MEASURE(revenue_per_customer)
FROM sales_metrics
GROUP BY country;
-- Computes: SUM(revenue) / COUNT(DISTINCT customer_id) per country
-- NOT: SUM(region_ratios)
```
The engine recomputes base measures at the requested grain, then applies the formula.
Distinct counts across joins are handled correctly:
```yaml
measures:
  - name: unique_products_sold
    expr: COUNT(DISTINCT product_id)
```
When querying by customer region (requiring orders → customers join):
```sql
SELECT customer_region, MEASURE(unique_products_sold)
FROM sales_metrics
GROUP BY customer_region;
```
Databricks ensures the distinct count happens after the join, not before:
```sql
-- Generated SQL
SELECT c.region, COUNT(DISTINCT o.product_id)  -- Distinct after join
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region;
```
If distinct count happened before the join, you'd count products globally, not per region.
Complex time-based calculations are supported via Spark SQL expressions:
```yaml
measures:
  - name: rolling_30day_revenue
    expr: SUM(revenue) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)
```
Though window functions in measures are less common (typically done in dimensions or post-query).
Security and Governance: Unity Catalog Integration
Unity Catalog is Databricks' governance foundation. Metric Views inherit all governance features:
Access control:
```sql
GRANT SELECT ON METRIC VIEW main.sales.sales_metrics TO `marketing_team`;
```
Fine-grained: Users need permission on the metric view, not on underlying tables.
Row-level security (via Unity Catalog policies):
```sql
-- Apply a row filter to the base table: a boolean UDF bound via ALTER TABLE
CREATE FUNCTION region_filter(region STRING)
RETURN region = current_user_region();  -- current_user_region() is illustrative

ALTER TABLE main.sales.orders SET ROW FILTER region_filter ON (region);
```
Queries via metric views automatically apply the filter. A user in APAC sees only APAC orders in all metrics.
Column masking:
```sql
-- email_mask is a masking UDF (e.g., returning a hash of the value for non-privileged users)
ALTER TABLE main.sales.customers
  ALTER COLUMN customer_email SET MASK email_mask;
```
If a metric or dimension references customer_email, non-privileged users see hashed values.
Certification: Mark metrics as production-ready:
```sql
ALTER METRIC VIEW main.sales.sales_metrics
SET TBLPROPERTIES ('certified' = 'true', 'owner' = 'finance_team');
```
In Databricks UI, certified metrics display a trust badge. Analysts know these are approved for reporting.
Tagging (for discovery and compliance):
```sql
ALTER METRIC VIEW main.sales.sales_metrics
SET TAGS ('domain' = 'finance', 'pii' = 'false', 'refresh_frequency' = 'daily');
```
Query metrics by tag:
```sql
-- Find all finance domain metrics
SELECT metric_view_name, tags
FROM system.information_schema.metric_views
WHERE tags.domain = 'finance';
```
Lineage: Unity Catalog UI shows end-to-end lineage:
S3 raw data → Bronze tables → Silver tables → Gold tables → Metric View → Tableau Dashboard
Click on sales_metrics metric view:
- Upstream dependencies: `main.sales.orders`, `main.sales.customers`, `main.sales.products`
- Downstream consumers: Finance Dashboard, Marketing Notebook, Churn Model
Audit logging:
```sql
SELECT user_name, request_params.metric_view, query_text
FROM system.access.audit
WHERE request_params.metric_view = 'main.sales.sales_metrics'
ORDER BY event_time DESC;
```
Track every metric query: who, when, which measures/dimensions, performance.
Performance Optimization: Spark SQL + Photon + Delta Lake
Databricks Metric Views execute on Databricks SQL warehouses, combining:
- Spark SQL: Distributed query engine with Catalyst optimizer
- Photon: Vectorized execution (2-3× faster than standard Spark)
- Delta Lake: Columnar storage with ACID, time travel, and Z-ordering
Performance strategies:
1. Data skipping: Delta Lake tracks min/max/null count stats per file (Parquet). Queries with filters skip irrelevant files:
```sql
WHERE order_date >= '2025-01-01'
-- Delta skips all files with max(order_date) < 2025-01-01
```
2. Z-ordering: Optimize file layout for common filter columns:
```sql
OPTIMIZE main.sales.orders ZORDER BY (order_date, customer_region);
```
Queries filtered by date or region read fewer files.
3. Broadcast joins: Small dimension tables (<10MB) broadcast to all executors:
```sql
-- Spark detects that the customers table is small
-- and broadcasts it to all nodes instead of shuffling
orders JOIN customers ...
```
4. Adaptive Query Execution (AQE): Spark adjusts plan mid-execution based on runtime stats:
- Convert sort-merge join to broadcast join if table is smaller than expected
- Coalesce small shuffle partitions to reduce overhead
- Optimize skew joins (if one partition is 100× larger, split it further)
5. Delta cache: Local SSD cache on warehouse nodes. Frequently accessed data (dimension tables) stays cached, reducing cloud storage reads.
6. Query result cache: Identical queries return instantly if data unchanged (like Snowflake).
Real-world performance (Databricks internal benchmarks):
- 10M row fact table, 100K dimension rows: 1-2 seconds (Small warehouse)
- 500M row fact table, 5M dimension rows: 5-8 seconds (Medium warehouse)
- 10B row fact table, 100M dimension rows: 30-60 seconds (Large warehouse with Photon)
Pre-aggregation option: For very expensive metrics queried frequently, create materialized views:
```sql
CREATE MATERIALIZED VIEW monthly_revenue AS
SELECT
  DATE_TRUNC('month', order_date) AS month,
  customer_region,
  MEASURE(total_revenue) AS revenue
FROM main.sales.sales_metrics
GROUP BY month, customer_region;
```
But this sacrifices flexibility (locked into month + region grain).
Databricks' design philosophy: Query raw data by default (leverage Spark + Delta's power), pre-aggregate only for proven hotspots.
Trade-Offs of the Warehouse-Native Lakehouse Approach
✅ Advantages:
- Unified data science + BI + ML: Same metrics in SQL dashboards, notebooks, and feature stores
- Unity Catalog governance: Enterprise-grade RBAC, lineage, certification
- High performance: Photon + Delta Lake handle billion-row datasets efficiently
- Domain-driven organization: Group metrics by business domain (Finance, Marketing, Sales)
- No additional infrastructure: Metrics are just catalog objects
❌ Limitations:
- Databricks-only: Metrics don't work outside Databricks (can't query from Snowflake)
- MEASURE() syntax: Required function (can't treat metric views like normal tables)
- Join limitations: Can't freely join metric views to other tables in one query (must use CTEs or include tables in metric view definition)
- Unity Catalog requirement: Must migrate from legacy Hive Metastore
- Time grain handling: No automatic time dimension shortcuts (must manually define `order_month`, `order_quarter` dimensions)
- New feature, limited docs: Best practices still emerging (as of late 2025)
Slowly Changing Dimensions: Like Snowflake, not automatically handled. Model Type-2 joins manually:
```yaml
joins:
  - table: dim_customers_history
    on: |
      orders.customer_id = dim_customers_history.customer_id
      AND orders.order_date BETWEEN dim_customers_history.effective_start_date
                                AND dim_customers_history.effective_end_date
```
Sparse data management: Query-time filtering handles sparsity naturally (only existing data is processed). For time-series gaps, use a date dimension:
```yaml
dimensions:
  - name: calendar_date
    expr: dim_dates.date
joins:
  - table: dim_dates
    on: orders.order_date = dim_dates.date
    type: left  # Ensures all dates appear, even with zero orders
```
Warehouse-Native Summary: What These Architectures Share
Both Snowflake and Databricks converged on similar principles:
- Metadata as database objects: Semantic definitions live in the catalog (Information Schema / Unity Catalog)
- Query-time rewriting: Translate semantic queries into optimized native SQL/Spark SQL
- No separate cache: Rely on warehouse caching (result cache, local SSD cache)
- Governance through the warehouse: Leverage existing RBAC, row filters, audit logs
- Performance through the engine: Trust the warehouse optimizer + columnar storage
Why this approach won mindshare (2024-2025):
- Simplicity: No separate service to deploy, secure, or scale
- Trust: Metrics computed directly on source data (no stale cache issues)
- Governance: One security model to learn and audit
- Cost: No additional license (semantic layer is included with warehouse)
When this approach struggles:
- Multi-cloud: If data lives in multiple warehouses (Snowflake + BigQuery), warehouse-native doesn't span them
- BI tool limitations: Many tools can't yet natively "discover" semantic views (requires custom SQL or waiting for integrations)
- Advanced caching needs: If you need aggressive pre-aggregation for extreme performance, warehouse-native alone may not suffice (though you can layer materialized views on top)
Approach 2: Transformation-Layer Semantic Models (dbt MetricFlow)
The Design Philosophy: Metrics as Code in the Transformation Layer
While Snowflake and Databricks embedded semantics into the warehouse, dbt Labs took a radically different approach: define metrics as version-controlled code in the transformation layer.
The reasoning: Data teams already use dbt to transform raw data into analytics-ready tables. Why introduce a separate tool for metrics? Instead, extend dbt's existing workflow to include metric definitions alongside table transformations.
Launch: dbt Semantic Layer powered by MetricFlow, public preview mid-2023, GA October 2023
Key architectural decision: Metrics live in YAML files in your dbt project, version-controlled in Git alongside your transformation SQL.
How It Works: YAML Definitions + SQL Generation
A dbt semantic model defines entities, dimensions, and measures; metrics (including ratios) are then defined on top of those measures:
```yaml
# models/metrics/sales_semantic_model.yml
semantic_models:
  - name: orders
    model: ref('fct_orders')
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
      - name: customer_region
        expr: dim_customers.region
    measures:
      - name: order_count
        agg: count
        expr: order_id
      - name: total_revenue
        agg: sum
        expr: order_amount

metrics:
  - name: avg_order_value
    type: ratio
    type_params:
      numerator: total_revenue
      denominator: order_count
```
Query via dbt Semantic Layer API:
```python
# Python SDK
from dbt_sl import DbtSemanticLayerClient

client = DbtSemanticLayerClient(environment="prod")

result = client.query(
    metrics=["total_revenue", "avg_order_value"],
    group_by=["customer_region", "order_date__month"]
)
```
What happens under the hood:
- Request parsing: API receives metrics + dimensions
- SQL generation: MetricFlow generates optimized SQL:
```sql
-- Generated by MetricFlow
SELECT
  c.region AS customer_region,
  DATE_TRUNC('month', o.order_date) AS order_date__month,
  SUM(o.order_amount) AS total_revenue,
  SUM(o.order_amount) / COUNT(*) AS avg_order_value
FROM analytics.fct_orders o
JOIN analytics.dim_customers c ON o.customer_id = c.customer_id
GROUP BY 1, 2
```
- Execution: SQL runs on your data warehouse (Snowflake, BigQuery, Databricks, Redshift)
- Result return: Data flows back through the API to the requesting tool
Critical difference from warehouse-native: The semantic layer is middleware. It sits between BI tools and the warehouse, translating metric requests into SQL.
The Git-Based Governance Model
dbt MetricFlow's killer feature: metrics are code, which means:
Version control:
```bash
git log models/metrics/sales_semantic_model.yml
# See complete history of metric definition changes
```
Code review:
```diff
# Pull request: Change revenue metric definition
- measures:
-   - name: total_revenue
-     agg: sum
-     expr: order_amount
+ measures:
+   - name: total_revenue
+     agg: sum
+     expr: order_amount - discount_amount  # Now net of discounts
```
Analytics engineer reviews: "This changes Q4 revenue by $2M, approved after confirming with finance team."
CI/CD integration:
```yaml
# .github/workflows/dbt_ci.yml
- name: Test metric changes
  run: |
    dbt parse  # Validates YAML syntax
    dbt test   # Runs data quality tests
    dbt-sl compare-metrics --base main --head feature-branch
    # Shows how metric values change with new definitions
```
Automatic documentation:
```bash
dbt docs generate
# Creates searchable catalog of all metrics with:
# - Definition (SQL)
# - Lineage (which models it depends on)
# - Ownership (who maintains it)
# - Change log (git history)
```
Why this matters: Metrics aren't hidden in a database. They're transparent, auditable code that follows software engineering best practices.
Integration with dbt's Transformation Pipeline
The transformation layer approach creates a unified workflow:
Step 1: Transform raw data (traditional dbt):
```sql
-- models/staging/stg_orders.sql
SELECT
  order_id,
  customer_id,
  order_date,
  order_amount
FROM raw.orders
WHERE order_status != 'cancelled'
```
Step 2: Build fact table (traditional dbt):
```sql
-- models/marts/fct_orders.sql
SELECT
  o.order_id,
  o.customer_id,
  o.order_date,
  o.order_amount,
  c.region AS customer_region
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('dim_customers') }} c USING (customer_id)
```
Step 3: Define metrics on fact table (MetricFlow):
```yaml
semantic_models:
  - name: orders
    model: ref('fct_orders')  # Points to Step 2
    measures:
      - name: total_revenue
        agg: sum
        expr: order_amount
```
One DAG, one pipeline: Metrics inherit dbt's dependency graph. If stg_orders changes, both fct_orders and downstream metrics are refreshed.
Query-Time Aggregation and BI Tool Integration
dbt MetricFlow queries are always aggregated on-demand (like warehouse-native approaches, unlike pre-aggregated OLAP cubes).
BI tool integration happens via:
- Tableau via Semantic Layer API: Tableau connects to dbt Cloud API endpoint, discovers metrics
- Looker Studio: Custom connector queries MetricFlow programmatically
- Hex/Mode: Python SDK in notebooks
- Excel/Spreadsheets: REST API endpoints
Example REST API request:
```bash
curl -X POST https://semantic-layer.cloud.getdbt.com/api/graphql \
  -H "Authorization: Bearer $DBT_API_KEY" \
  -d '{
    "metrics": ["total_revenue"],
    "groupBy": ["customer_region"],
    "where": "order_date >= 2025-01-01"
  }'
```
Performance characteristics:
- Latency: API overhead (~100-300ms) + warehouse query time
- Caching: dbt Cloud caches results for 24 hours by default (configurable)
- Scaling: Limited by your warehouse capacity (MetricFlow generates SQL, warehouse executes it)
Trade-Offs of the Transformation-Layer Approach
✅ Advantages:
- Warehouse-agnostic: Works on Snowflake, BigQuery, Databricks, Redshift, Postgres
- Git-based governance: Version control, code review, CI/CD for metrics
- Unified workflow: Metrics live in the same repo as transformations
- dbt ecosystem: Leverage existing dbt tests, documentation, packages
- No vendor lock-in: Metrics are portable YAML files
❌ Limitations:
- API dependency: BI tools must integrate with dbt Cloud API (or self-host dbt Semantic Layer server)
- Query latency: Network hop adds 100-300ms vs warehouse-native
- Limited caching control: Can't fine-tune caching per metric (without enterprise features)
- dbt Cloud cost: Semantic Layer requires Team or Enterprise plan
- Learning curve: Teams must learn both dbt and MetricFlow YAML spec
Approach 3: OLAP-Style Semantic Acceleration (Cube.dev)
The Design Philosophy: Intelligent Caching Layer
Cube.dev (formerly Cube.js) represents the third architectural paradigm: an intelligent caching and pre-aggregation layer between BI tools and data warehouses.
Unlike warehouse-native (compute in database) or transformation-layer (generate SQL via API), Cube adds a dedicated stateful service that:
- Stores pre-aggregated data
- Routes queries to cache or warehouse
- Manages incremental refresh
The core bet: For interactive dashboards, on-demand aggregation is too slow. Pre-aggregating common query patterns delivers sub-second response times at scale.
Launch: Open-sourced 2019, Cube Cloud GA 2021
How It Works: Data Modeling + Pre-Aggregation Engine
Cube models are defined in JavaScript:
```javascript
// schema/Orders.js
cube('Orders', {
  sql: `SELECT * FROM analytics.fct_orders`,

  joins: {
    Customers: {
      sql: `${Orders}.customer_id = ${Customers}.customer_id`,
      relationship: 'belongsTo'
    }
  },

  measures: {
    totalRevenue: {
      sql: 'order_amount',
      type: 'sum'
    },
    avgOrderValue: {
      sql: 'order_amount',
      type: 'avg'
    }
  },

  dimensions: {
    customerRegion: {
      sql: `${Customers}.region`,
      type: 'string'
    },
    orderDate: {
      sql: 'order_date',
      type: 'time'
    }
  },

  preAggregations: {
    dailyByRegion: {
      measures: [Orders.totalRevenue, Orders.avgOrderValue],
      dimensions: [Orders.customerRegion],
      timeDimension: Orders.orderDate,
      granularity: 'day',
      refreshKey: {
        every: '1 hour'
      }
    }
  }
});
```
Pre-aggregation logic:
1. Build phase: Cube runs SQL on the warehouse to create aggregated tables:
```sql
-- Generated pre-aggregation query
SELECT
  DATE_TRUNC('day', order_date) AS order_date_day,
  c.region AS customer_region,
  SUM(order_amount) AS total_revenue,
  AVG(order_amount) AS avg_order_value,
  COUNT(*) AS record_count
FROM analytics.fct_orders o
JOIN analytics.dim_customers c ON o.customer_id = c.customer_id
GROUP BY 1, 2
```
2. Storage: Results are stored in:
   - Cube Store (Cube's embedded columnar database)
   - OR warehouse tables (e.g., Snowflake tables managed by Cube)
   - OR an external cache (Redis)
3. Query routing: When a dashboard queries "total revenue by region for last 30 days":
   - Cube checks whether the `dailyByRegion` pre-agg covers it → yes, serve from cache (10-50ms)
   - If not covered → generate SQL, query the warehouse (2-10 seconds)
4. Incremental refresh: Every hour, Cube runs:
```sql
SELECT ... WHERE order_date >= (SELECT MAX(order_date) FROM pre_agg_table)
```
Only new data is fetched and appended to the pre-aggregation.
The Query Planning Engine: Cache vs Warehouse Decisions
Cube's query planner determines whether to use cache or hit the warehouse:
Decision tree:
- Does a pre-aggregation cover the requested measures + dimensions + time range?
- Is the pre-aggregation fresh (last refresh within the `refreshKey` interval)?
- If no → Query warehouse directly
Example scenarios:
| Query | Pre-Agg Exists? | Covers Query? | Result |
|---|---|---|---|
| Revenue by region, last 30 days | ✅ Yes (dailyByRegion) | ✅ Yes | Cache hit (20ms) |
| Revenue by region + product category | ✅ Yes (dailyByRegion) | ❌ No (missing product dimension) | Warehouse query (5s) |
| Revenue for single order ID | ✅ Yes | ❌ No (grain too coarse) | Warehouse query (2s) |
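When a covered query asks for a coarser grain than the stored rollup (say, monthly revenue by region against the daily pre-aggregation), Cube re-aggregates the rollup instead of touching the fact table. A conceptual sketch, assuming the rollup is materialized as a table named pre_agg_daily_by_region (a hypothetical name): additive measures are summed up, while the average is rebuilt from the stored sums and counts rather than by averaging daily averages:
```sql
SELECT
  DATE_TRUNC('month', order_date_day) AS month,
  customer_region,
  SUM(total_revenue)                  AS total_revenue,
  -- recompute the ratio at the new grain; averaging avg_order_value would be wrong
  SUM(total_revenue) / SUM(record_count) AS avg_order_value
FROM pre_agg_daily_by_region
GROUP BY 1, 2;
```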
Hybrid queries: If a query is partially covered, Cube can:
- Fetch aggregates from cache
- Join with dimension tables from warehouse
- Return combined result
Cube Store: The Custom Columnar Engine
Cube Store is Cube's embedded OLAP engine (written in Rust):
Architecture:
- Columnar format: Apache Parquet files
- Distributed: Horizontal scaling with data partitioning
- Query engine: Vectorized execution (similar to DuckDB/ClickHouse)
Why build a custom engine? Warehouse pre-aggregation tables have limitations:
- Cost: Storing many rollup tables in Snowflake is expensive
- Query speed: Even querying small aggregated tables has 500ms-1s overhead
- Concurrency: Warehouse queries queue under high load
Cube Store delivers:
- Sub-50ms queries for pre-aggregated data
- No warehouse cost for cached queries
- Higher concurrency (thousands of queries/second)
Performance comparison (10M row dataset, querying daily aggregates):
- Snowflake table: 800ms average
- Cube Store: 25ms average
- Cost: $0.03/query (Snowflake) vs $0.00/query (Cube Store)
Real-Time Data and Streaming Integrations
Cube supports streaming pre-aggregations for real-time dashboards:
```javascript
preAggregations: {
  realtimeSales: {
    measures: [Orders.totalRevenue],
    dimensions: [Orders.customerRegion],
    timeDimension: Orders.orderDate,
    granularity: 'minute',
    refreshKey: {
      every: '10 seconds',  // Near-real-time refresh
      incremental: true     // Only fetch the last 10 seconds of data
    }
  }
}
```
Integration with Kafka/streaming platforms:
- Cube can ingest from Kafka topics directly
- OR query change data capture (CDC) tables in warehouse
- Append to rolling window pre-aggregations
Use case: Live operations dashboard showing last-hour metrics, refreshing every 10 seconds.
Trade-Offs of the OLAP-Acceleration Approach
✅ Advantages:
- Extreme performance: Sub-second queries for common patterns
- Cost savings: Reduce warehouse compute by 60-90% (cached queries free)
- High concurrency: Serve thousands of users without warehouse scaling
- Warehouse-agnostic: Works with any SQL database
- Developer-friendly: JavaScript-based modeling
❌ Limitations:
- Additional infrastructure: Must deploy and maintain Cube service
- Pre-agg management: Requires tuning which aggregations to create
- Stale data risk: Cache can be minutes/hours behind (depending on refresh interval)
- Storage overhead: Pre-aggregations consume disk space
- Modeling complexity: Must define measures, dimensions, joins, AND pre-aggs
Hybrid Approaches: Looker and Omni
Looker: LookML + Database Computation
Looker sits between transformation-layer and warehouse-native approaches:
Architecture:
- LookML: Proprietary modeling language (similar to dbt YAML)
- Query execution: Generates SQL, runs in warehouse (Looker is stateless)
- Caching: Optional persistent derived tables (PDTs) in warehouse
Key features:
- Modeling layer: Define metrics, dimensions, joins in LookML
- Explore UI: Business users drag-and-drop to build queries
- Derived tables: Pre-aggregate hot paths, stored as warehouse tables (sketched below)
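A persistent derived table (PDT) is, in effect, a rollup table that Looker builds and refreshes in your warehouse on a schedule. A minimal sketch of what one might materialize (illustrative SQL, not LookML; the scratch schema name is hypothetical):
```sql
CREATE TABLE looker_scratch.daily_revenue_pdt AS
SELECT
  DATE_TRUNC('day', order_date) AS order_day,
  customer_region,
  SUM(order_amount)             AS total_revenue
FROM analytics.fct_orders
GROUP BY 1, 2;
```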
Trade-offs:
- ✅ Tight integration with Looker BI tool
- ✅ No separate infrastructure (query warehouse directly)
- ❌ Vendor lock-in (LookML only works in Looker)
- ❌ Google-owned (synergy with BigQuery, but multi-cloud concerns)
Omni: Embedded Semantic Layer in BI Tool
Omni is a newer BI platform with a built-in semantic layer:
Architecture:
- Omni data models: YAML-based metric definitions (inspired by dbt)
- Query execution: Native SQL generation, warehouse execution
- Integration: Tightly coupled with Omni's visualization layer
Positioning: "dbt MetricFlow + Looker" combined into one product.
Trade-offs:
- ✅ Unified BI + semantic layer (no separate tools)
- ✅ Git-based workflow (models stored in version control)
- ❌ Omni-only (can't use metrics outside Omni BI)
- ❌ Younger product (smaller ecosystem than Looker/Tableau)
Architectural Trade-Offs: Decision Matrix
| Criteria | Warehouse-Native (Snowflake/Databricks) | Transformation-Layer (dbt MetricFlow) | OLAP-Acceleration (Cube.dev) |
|---|---|---|---|
| Performance | Good (2-10s for complex queries) | Good (warehouse-dependent) | Excellent (sub-second cached) |
| Cost | Moderate (warehouse compute costs) | Moderate (warehouse compute costs) | Low (cached queries free) |
| Freshness | Real-time (query latest data) | Real-time (query latest data) | Delayed (refresh interval) |
| Infrastructure | None (built into warehouse) | Minimal (API server) | Significant (Cube cluster + storage) |
| Warehouse Support | Single vendor | Multi-warehouse | Multi-warehouse |
| Governance | Native (warehouse RBAC/audit) | Git-based + dbt Cloud | Application-level RBAC |
| Version Control | Limited (export/import YAML) | Native (Git-first) | Native (JS models in Git) |
| BI Tool Support | Emerging (limited integrations) | API-based (requires connector) | Broad (REST/GraphQL APIs) |
| Developer Experience | SQL-based (familiar) | YAML + dbt ecosystem | JavaScript (flexible) |
| Learning Curve | Low (if you know SQL) | Moderate (learn dbt + MetricFlow) | Moderate (learn Cube schema) |
| Best For | Single-warehouse shops, governance-heavy | Multi-cloud, dbt-native teams | High-concurrency dashboards, cost optimization |
Conclusion: Choosing Your Semantic Layer Architecture
The semantic layer market in 2025 is multi-paradigm. There's no one-size-fits-all answer—the right choice depends on your architecture, team, and use cases.
Decision Framework
Choose Warehouse-Native (Snowflake/Databricks) if:
- ✅ You're all-in on one warehouse platform
- ✅ Governance and security are top priorities
- ✅ You want zero operational overhead
- ✅ You need real-time data (no caching delays)
- ✅ Your team prefers SQL over YAML/JavaScript
Choose Transformation-Layer (dbt MetricFlow) if:
- ✅ You're already using dbt for transformations
- ✅ You need multi-warehouse support (Snowflake + BigQuery)
- ✅ Git-based workflows are essential
- ✅ You want vendor portability (avoid lock-in)
- ✅ Your team values code review and CI/CD for metrics
Choose OLAP-Acceleration (Cube.dev) if:
- ✅ Sub-second query performance is critical
- ✅ You have high-concurrency dashboards (100+ concurrent users)
- ✅ Warehouse costs are a major concern
- ✅ You're willing to manage additional infrastructure
- ✅ Stale data (minutes/hours lag) is acceptable for most queries
Hybrid Strategies
Many teams use combinations:
Example 1: dbt + Snowflake Semantic Views
- Use dbt for transformations and data quality tests
- Define metrics in Snowflake Semantic Views (for zero-latency governance)
- Export dbt metric definitions → Snowflake (via CI/CD)
Example 2: dbt + Cube
- dbt builds fact/dimension tables
- Cube pre-aggregates for dashboards
- dbt MetricFlow for ad-hoc SQL queries
Example 3: Databricks + Looker
- Store data in Delta Lake
- Define metrics in Databricks Unity Catalog
- Query via Looker (using Databricks SQL connector)
The 2026 Outlook
The semantic layer space is rapidly evolving. Trends to watch:
- Standardization: The Open Semantic Interchange (OSI) initiative aims to make metric definitions portable across tools
- AI integration: LLMs querying semantic layers in natural language ("What was Q3 revenue in EMEA?")
- Real-time convergence: Streaming semantic layers for operational analytics
- Cost pressure: More teams adopting pre-aggregation to control warehouse costs
The bottom line: The "semantic layer problem" isn't solved by one architecture. It's solved by choosing the architecture that aligns with your data platform, team skills, and business requirements.
The industry has built three viable paths. Your job is to pick the one that fits your journey.

