Traditional OLAP systems excel at structured queries but struggle when faced with PDFs, transcripts, markdown documents, and other unstructured formats. The gap between relational operations and semantic processing creates operational bottlenecks that prevent organizations from extracting value from their most critical data assets.
This guide shows how to enable OLAP-style analytical capabilities for unstructured data using inference-first architectures and semantic operators that treat LLM operations as first-class DataFrame primitives.
## The Infrastructure Gap in Unstructured Data Analytics
Traditional data platforms were architected for SQL queries, batch ETL processes, and structured schemas. When teams attempt to process unstructured data through these systems, they encounter fundamental limitations:
**Impedance mismatch between systems:** Moving data between custom LLM scripts, warehouses, and inference infrastructure creates duplication and chaos. Each handoff introduces latency and potential failure points.

**Brittle integration patterns:** Organizations resort to UDFs, hacky microservices, and fragile glue code that query optimizers cannot inspect or optimize. These implementations lock in decisions at development time rather than adapting execution strategies dynamically.

**Lack of semantic understanding:** Standard DataFrame operations work on exact matches and numeric calculations. They cannot evaluate meaning, determine similarity, or extract structured information from natural language.
The solution requires purpose-built infrastructure that brings semantic processing natively into the analytical layer, enabling deterministic workflows on non-deterministic models.
## Semantic Operators as DataFrame Primitives
Fenic extends the DataFrame abstraction with semantic operators that understand meaning rather than just values. These operators function as first-class primitives within the query engine, enabling the same composability and optimization that made traditional DataFrames indispensable.
### Core Semantic Operations
**Structured extraction with `semantic.extract`:** Transform unstructured text into typed data using Pydantic schemas, eliminating brittle prompt engineering:
```python
import fenic as fc
from pydantic import BaseModel
from typing import List, Literal

class Issue(BaseModel):
    category: Literal["bug", "feature_request", "question"]
    severity: Literal["low", "medium", "high", "critical"]
    description: str

class Ticket(BaseModel):
    customer_tier: Literal["free", "pro", "enterprise"]
    region: Literal["us", "eu", "apac"]
    issues: List[Issue]

tickets = (
    df
    .with_column("extracted", fc.semantic.extract(fc.col("raw_ticket"), Ticket))
    .unnest("extracted")
    .filter(fc.col("region") == "apac")
    .explode("issues")
)

bugs = tickets.filter(fc.col("issues.category") == "bug")
```
The schema makes extraction type-safe: define the desired output structure once and get validated results consistently.
**Natural language filtering with `semantic.predicate`:** Apply content-based filters that evaluate meaning rather than exact string matches:
```python
applicants = df.filter(
    (fc.col("yoe") > 5)
    & fc.semantic.predicate(
        "Has MCP Protocol experience? Resume: {{resume}}",
        resume=fc.col("resume"),
    )
)
```
This combines traditional boolean logic with semantic understanding. The query engine can optimize both together, potentially filtering on cheap boolean conditions before invoking expensive LLM predicates.
**Semantic joins for meaning-based matching:** Join DataFrames based on semantic similarity rather than exact values:
```python
prompt = """
Is this candidate a good fit for the job?

Candidate Background: {{left_on}}
Job Requirements: {{right_on}}

Use the following criteria to make your decision:
- Technical skills alignment
- Experience level appropriateness
- Domain knowledge overlap
"""

joined = (
    applicants.semantic.join(
        other=jobs,
        predicate=prompt,
        left_on=fc.col("resume"),
        right_on=fc.col("job_description"),
    )
    .order_by("application_date")
    .limit(5)
)
```
Unlike fuzzy string matching that measures character similarity, semantic joins understand domain-specific criteria and make contextual decisions.
## Inference-First Architecture for OLAP Workloads
The architectural difference between retrofitted and purpose-built systems determines whether OLAP-style operations on unstructured data succeed at scale.
### Query Optimization for Semantic Operations
Traditional platforms treat LLM calls as external black-box UDFs that query optimizers cannot inspect or optimize. Inference-first architectures embed LLM operations directly into the query engine as first-class citizens.
When the query optimizer sees `semantic.extract()` or `semantic.join()`, it understands this is an inference operation with specific characteristics: high latency, token costs, batching benefits, and caching opportunities.
The optimizer can (see the sketch after this list):
- Reorder operations to minimize data processed by expensive inference
- Batch requests across rows to amortize fixed costs
- Cache aggressively, since repeated calls with identical inputs can reuse prior results
- Parallelize intelligently across multiple providers or models
- Estimate costs accurately before execution
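To make the reordering point concrete, here is a minimal sketch (the data, column names, and `tier` values are illustrative, and it assumes a `session` configured with a `SemanticConfig` as in the pipeline examples later in this guide): the cheap boolean filter is written after the semantic predicate, but an engine that understands inference costs is free to evaluate it first so only surviving rows reach the model.

```python
import fenic as fc

# Assumes `session` was created with a SemanticConfig,
# as shown in the pipeline examples later in this guide.
df = session.create_dataframe({
    "tier": ["free", "enterprise", "enterprise"],
    "ticket": [
        "App crashes when exporting reports",
        "Please add SSO support for Okta",
        "Dashboard loads slowly for large workspaces",
    ],
})

# Written with the expensive predicate first; an inference-aware planner
# can still push the cheap `tier` comparison ahead of the LLM call.
enterprise_bugs = df.filter(
    fc.semantic.predicate(
        "Does this ticket describe a bug? Ticket: {{ticket}}",
        ticket=fc.col("ticket"),
    )
    & (fc.col("tier") == "enterprise")
)
```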
Research demonstrates that properly optimized semantic operators deliver substantial performance improvements over naive implementations through intelligent batching, model cascading, and proxy scoring. Benchmarks show speedups of several hundred times while maintaining statistical accuracy guarantees.
### Statistical Accuracy Guarantees
Production OLAP systems require confidence in result quality. Each semantic operator specifies behavior through a reference algorithm with configurable precision and recall targets plus probabilistic bounds.
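One common way to state such a bound (the notation below is illustrative rather than lifted from Fenic's documentation): given a recall target $\tau$ and an acceptable failure probability $\delta$, any optimized plan $\hat{S}$ substituted for the reference algorithm must satisfy

$$
\Pr\big[\,\mathrm{recall}(\hat{S}) \ge \tau\,\big] \ge 1 - \delta,
$$

with an analogous bound for precision. Cheaper implementations are only admissible when the bound holds.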
This formal foundation enables:
- Validation that optimized implementations maintain accuracy compared to known-good baselines
- Transparent tradeoffs between speed, cost, and quality
- Auditable outputs critical for regulatory compliance
These assurances that results meet quality standards address a major barrier to adoption for organizations deploying semantic processing to production.
## AI-Native Data Types for Structured Processing
OLAP-style analysis requires specialized operations for different content formats. Fenic provides first-class support for AI-native data types beyond standard strings and numerics.
### MarkdownType for Document Structure
Process markdown with structural awareness rather than treating it as plain text:
```python
df = (
    df
    .with_column("raw_blog", fc.col("blog").cast(fc.MarkdownType))
    .with_column(
        "chunks",
        fc.markdown.extract_header_chunks(fc.col("raw_blog"), header_level=2),
    )
    .with_column("title", fc.json.jq(fc.markdown.to_json(fc.col("raw_blog")), ".title"))
    .explode("chunks")
    .with_column(
        "embeddings",
        fc.semantic.embed(fc.col("chunks.content")),
    )
)
```
The `markdown.extract_header_chunks` function preserves document structure, producing semantically meaningful chunks instead of naive character-count splits. This maintains context boundaries and avoids mid-sentence splits.
### TranscriptType for Speaker-Aware Analysis
Handle transcripts with native understanding of speakers and timestamps:
```python
from pydantic import BaseModel, Field

class SegmentSchema(BaseModel):
    speaker: str = Field(description="Who is talking in this segment")
    start_time: float = Field(description="Start time (seconds)")
    end_time: float = Field(description="End time (seconds)")
    key_points: list[str] = Field(description="Bullet points for this segment")

processed = (
    df.select(
        "*",
        fc.text.recursive_token_chunk(
            fc.col("transcript"), chunk_size=1200, chunk_overlap_percentage=0
        ).alias("chunks"),
    )
    .explode("chunks")
    .select(
        fc.col("chunks").alias("chunk"),
        fc.semantic.extract(fc.col("chunk"), SegmentSchema).alias("segment"),
    )
)
```
Speaker identity and timestamps persist through transformations, enabling speaker-aware aggregations and conversation flow analysis without manual parsing.
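As a small illustration (a sketch that builds on the `processed` frame above and assumes standard `count`/`sum` aggregates alongside column arithmetic), per-speaker statistics fall out of ordinary group-bys once the extracted fields exist:

```python
# Per-speaker segment counts and talk time, computed from extracted fields
speaker_stats = (
    processed
    .with_column(
        "duration",
        fc.col("segment.end_time") - fc.col("segment.start_time"),
    )
    .group_by(fc.col("segment.speaker"))
    .agg(
        fc.count("chunk").alias("segments"),
        fc.sum("duration").alias("total_seconds"),
    )
)
```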
### JSONType for Nested Data Manipulation
Process nested JSON with JQ expressions for elegant data extraction:
```python
df = (
    df
    .with_column("author", fc.json.jq(fc.col("metadata"), ".author.name"))
    .with_column("tags", fc.json.jq(fc.col("metadata"), ".tags[]"))
)
```
This eliminates verbose dictionary navigation code and handles missing keys gracefully.
## Building OLAP-Style Analytical Pipelines
Production analytical workloads require composing multiple operations into reliable, optimized pipelines. Here's a complete example demonstrating advanced patterns:
```python
import fenic as fc
from pydantic import BaseModel

class EpisodeSummary(BaseModel):
    title: str
    guests: list[str]
    main_topics: list[str]
    actionable_insights: list[str]

# Configure session with model aliases
config = fc.SessionConfig(
    app_name="content_analysis",
    semantic=fc.SemanticConfig(
        language_models={
            "cheap": fc.OpenAILanguageModel("gpt-4o-mini", rpm=500, tpm=200_000),
            "fast": fc.GoogleVertexLanguageModel("gemini-2.0-flash-lite", rpm=300, tpm=1_000_000),
            "powerful": fc.AnthropicLanguageModel(
                "claude-opus-4-0", rpm=100, input_tpm=100_000, output_tpm=100_000
            ),
        },
        default_language_model="fast",
    ),
)
session = fc.Session.get_or_create(config)

# Load and process data (raw_text and meta_json are prepared upstream)
df = session.create_dataframe({"content": [raw_text], "metadata": [meta_json]})

# SegmentSchema is the Pydantic model defined in the transcript section above
processed = (
    df.select(
        "*",
        fc.semantic.extract(fc.col("metadata"), EpisodeSummary, model_alias="cheap").alias("episode"),
        fc.text.recursive_token_chunk(fc.col("content"), chunk_size=1200).alias("chunks"),
    )
    .explode("chunks")
    .select(
        fc.col("chunks").alias("chunk"),
        fc.semantic.extract(fc.col("chunk"), SegmentSchema, model_alias="fast").alias("segment"),
    )
)

# Aggregate with semantic operations
final = (
    processed
    .select(
        "*",
        fc.semantic.map(
            "Summarize this segment in 2 sentences:\n{{chunk}}",
            chunk=fc.col("chunk"),
            model_alias="cheap",
        ).alias("summary")
    )
    .group_by(fc.col("segment.speaker"))
    .agg(
        fc.semantic.reduce(
            "Combine these summaries into one clear paragraph",
            fc.col("summary"),
            model_alias="fast",
        ).alias("speaker_summary")
    )
)

final.write.parquet("analysis_results.parquet")
```
This pipeline demonstrates six composability patterns:
- Schema-driven extraction: Pydantic models define output structure
- Intelligent chunking: Semantic-aware text splitting respects context
- Explode for row multiplication: Transform single documents into multiple segment rows
- Nested structure access: Reference nested fields naturally
- Semantic aggregation: Group data and apply LLM operations across groups
- Mixed operations: Combine semantic and traditional DataFrame operations
## Performance Optimization Strategies
OLAP workloads processing millions to billions of rows require aggressive optimization to remain cost-effective.
### Model Cascading for Cost Reduction
Use lightweight proxy scorers to filter data before expensive model invocations:
```python
# Fast fuzzy scoring for blocking
candidates = (
    left_df.join(right_df)  # Cross join
    .with_column(
        "fuzzy_score",
        fc.text.compute_fuzzy_ratio(
            fc.col("company_name"), fc.col("business_name"),
            method="jaro_winkler"
        )
    )
    .filter(fc.col("fuzzy_score") > 80)
)

# Then expensive semantic matching, only on the surviving candidate pairs
final = candidates.filter(
    fc.semantic.predicate(
        "Are these the same company? Left: {{left}}, Right: {{right}}",
        left=fc.col("company_description"),
        right=fc.col("business_description"),
    )
)
```
This hybrid approach cuts costs by orders of magnitude compared to semantic joins on full cross-products.
### Multi-Provider Model Selection
Configure different models for different task complexities:
```python
semantic=fc.SemanticConfig(
    language_models={
        "nano": fc.OpenAILanguageModel("gpt-4o-nano"),       # Simple classification
        "mini": fc.OpenAILanguageModel("gpt-4o-mini"),       # Extraction
        "full": fc.AnthropicLanguageModel("claude-opus-4"),  # Complex reasoning
    },
)

# Route operations to appropriate models
df.semantic.classify(col, classes, model_alias="nano")
df.semantic.extract(col, schema, model_alias="mini")
df.semantic.map(complex_instruction, model_alias="full")
```
Cost differences between models often reach 10-100x. Strategic model selection can reduce costs by 80% or more while maintaining quality on the tasks each model is suited for.
### Intelligent Caching
Cache expensive operations to avoid redundant processing:
```python
# Deduplicate before embedding
unique_content = df.select("content").distinct()

embedded = unique_content.with_column(
    "embedding",
    fc.semantic.embed("content")
).cache()

# Join embeddings back to full dataset
df_with_embeddings = df.join(embedded, on="content")
```
This pattern ensures you embed distinct content once rather than processing duplicate text repeatedly.
## Production Deployment Considerations
Moving from prototype to production requires addressing reliability, observability, and cost management.
### Automatic Batching and Rate Limiting
The framework automatically respects provider rate limits using the configured `rpm` and `tpm` values:
```python
config = fc.SessionConfig(
    semantic=fc.SemanticConfig(
        language_models={
            "nano": fc.OpenAILanguageModel(
                "gpt-4o-nano",
                rpm=500,
                tpm=200_000,
            ),
        },
    ),
)
```
The engine tracks token usage in real-time, self-throttles when approaching limits, and maximizes throughput with async I/O and concurrent request batching. Built-in retry logic handles transient failures automatically.
### Row-Level Lineage for Debugging
Every column and row has traceable origins, even when values come from model output:
```python
# Run the pipeline, then ask for lineage on the materialized result
result = df.filter(...).semantic.extract(...).collect()
lineage = result.lineage()
```
Lineage tracks individual row processing history through every transformation. When specific extractions fail or produce unexpected results, you can trace back through the pipeline to identify where issues originated.
### Comprehensive Metrics and Cost Tracking
Built-in observability provides first-class visibility into LLM operations:
```python
result = query.collect()
metrics = result.metrics()

print(f"Total tokens: {metrics.lm_metrics.total_tokens}")
print(f"Total cost: ${metrics.lm_metrics.total_cost}")
print(f"Execution time: {metrics.execution_time}s")

# Operator-level metrics
for op_metric in metrics.operator_metrics:
    if op_metric.cost > 10.0:
        print(f"Expensive operator: {op_metric.name}, Cost: ${op_metric.cost}")
```
Operator-level metrics show where time and money are spent, enabling targeted optimization.
### Lakehouse-Native Architecture
Pure compute with no proprietary storage layer enables seamless integration with existing infrastructure:
```python
# Read from existing lakehouse
df = session.read.parquet("s3://data-lake/raw/*.parquet")

# Process with semantic operators
processed = df.semantic.extract(...).filter(...)

# Write back to lakehouse
processed.write.parquet("s3://data-lake/processed/")
```
Full compatibility with Parquet, Iceberg, Delta Lake, and Lance means processed data works with Spark, Polars, DuckDB, and pandas without data movement.
## Real-World Application: RudderStack Case Study
RudderStack implemented OLAP-style analysis of unstructured support tickets, sales calls, and product documentation using Typedef's semantic infrastructure.
**The challenge:** Evidence existed across PRDs, strategy docs, and tickets scattered across multiple systems. Unstructured inputs dominated at roughly 90%, overwhelming generic workflows. Mapping to a wide, evolving taxonomy was slow and error-prone.
**The implementation:**
- Ingest and normalize: Pull support ticket threads, sales call transcripts, usage data from warehouse; product docs and PRDs from Notion
- Semantic context model: Infer and enrich product taxonomy from docs; map tickets, issues, PRDs, and strategy to taxonomy; create semantic links with citations
- Analytical queries: Expose retrieval via MCP tools; classify new items to taxonomy; surface related duplicates and strategy references; propose prioritize/monitor/decline decisions with rationale
- Persistence: Store mappings, rationales, and decisions in warehouse for analytics
**Results achieved:**
- 95% reduction in PM time per triage
- 90% first-pass category acceptance
- Citations surfaced prospect and community signals directly
- Faster follow-ups, with reach, impacted accounts, and volumes visible
The warehouse-native approach kept the taxonomy current with doc changes. Semantic links provided explainability, higher accuracy, and faster reviews.
## Implementation Roadmap
Organizations should adopt OLAP-style semantic analysis incrementally:
### Phase 1: Foundation (Weeks 1-2)
- Install Fenic and configure one LLM provider
- Identify a small, representative dataset (100-1000 rows)
- Implement a basic extraction or classification pipeline (see the sketch after this list)
- Validate results and measure costs
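A Phase 1 pilot can be as small as the sketch below (the model choice, rate limits, file path, and schema fields are placeholders to adapt to your data):

```python
import fenic as fc
from pydantic import BaseModel

class TicketLabel(BaseModel):
    category: str
    severity: str

config = fc.SessionConfig(
    app_name="phase1_pilot",
    semantic=fc.SemanticConfig(
        language_models={
            "default": fc.OpenAILanguageModel("gpt-4o-mini", rpm=100, tpm=100_000),
        },
        default_language_model="default",
    ),
)
session = fc.Session.get_or_create(config)

# A few hundred representative rows are enough to validate quality and cost.
sample = session.read.parquet("s3://data-lake/raw/tickets_sample.parquet")
labeled = sample.with_column(
    "label", fc.semantic.extract(fc.col("body"), TicketLabel)
)

result = labeled.collect()
print(result.metrics().lm_metrics.total_cost)  # measure cost before scaling up
```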
### Phase 2: Optimization (Weeks 3-4)
- Add intelligent caching for expensive operations
- Implement model cascading with cheap proxy scorers
- Configure rate limits and batching parameters
- Add metrics collection and cost tracking
### Phase 3: Production (Weeks 5-8)
- Scale to full dataset with monitoring
- Implement row-level lineage for debugging
- Set up automated pipeline execution
- Create dashboards for key metrics
### Phase 4: Advanced Patterns (Ongoing)
- Add semantic joins for complex matching scenarios
- Implement semantic aggregations for summarization
- Deploy multi-provider strategies for cost optimization
- Integrate with existing BI and analytical tools
## Key Takeaways
OLAP-style analysis of unstructured data requires purpose-built infrastructure that treats semantic operations as first-class primitives:
Semantic operators extend familiar DataFrame operations with AI-native capabilities like extraction, classification, and meaning-based joins. This provides the composability and reliability that made traditional DataFrames indispensable.
Inference-first architectures optimize LLM operations the same way databases optimize CPU and memory operations. Query engines can batch requests, cache results, and estimate costs when they understand semantic operations as first-class citizens.
Statistical accuracy guarantees and row-level lineage provide the reliability and debuggability required for production systems. Organizations can deploy semantic processing with confidence in result quality.
Strategic model selection and intelligent caching reduce costs by 80% or more while maintaining quality. Use cheap models for simple tasks, expensive models only for complex reasoning, and eliminate redundant processing through deduplication.
Lakehouse-native integration ensures semantic processing works with existing infrastructure. Read from and write to standard formats without data movement or proprietary storage layers.
The convergence of OLAP databases and semantic processing represents a fundamental shift in analytical infrastructure. Organizations adopting inference-first architectures gain the ability to run sophisticated analytical queries across both structured and unstructured data using familiar DataFrame abstractions enhanced with semantic intelligence.
Ready to implement OLAP-style queries on your unstructured data? Explore Fenic's documentation, join the community on Discord, or review implementation examples to get started.

