Traditional OLAP systems excel at structured queries but struggle when faced with PDFs, transcripts, markdown documents, and other unstructured formats. The gap between relational operations and semantic processing creates operational bottlenecks that prevent organizations from extracting value from their most critical data assets.
This guide shows how to enable OLAP-style analytical capabilities for unstructured data using inference-first architectures and semantic operators that treat LLM operations as first-class DataFrame primitives.
## The Infrastructure Gap in Unstructured Data Analytics
Traditional data platforms were architected for SQL queries, batch ETL processes, and structured schemas. When teams attempt to process unstructured data through these systems, they encounter fundamental limitations:
**Impedance mismatch between systems:** Moving data between custom LLM scripts, warehouses, and inference infrastructure creates duplication and chaos. Each handoff introduces latency and potential failure points.

**Brittle integration patterns:** Organizations resort to UDFs, hacky microservices, and fragile glue code that query optimizers cannot inspect or optimize. These implementations lock in decisions at development time rather than adapting execution strategies dynamically.

**Lack of semantic understanding:** Standard DataFrame operations work on exact matches and numeric calculations. They cannot evaluate meaning, determine similarity, or extract structured information from natural language.
The solution requires purpose-built infrastructure that brings semantic processing natively into the analytical layer, enabling deterministic workflows on non-deterministic models.
## Semantic Operators as DataFrame Primitives
Fenic extends the DataFrame abstraction with semantic operators that understand meaning rather than just values. These operators function as first-class primitives within the query engine, enabling the same composability and optimization that made traditional DataFrames indispensable.
### Core Semantic Operations
**Structured extraction with `semantic.extract`:** Transform unstructured text into typed data using Pydantic schemas, eliminating brittle prompt engineering:
```python
import fenic as fc
from pydantic import BaseModel
from typing import List, Literal

class Issue(BaseModel):
    category: Literal["bug", "feature_request", "question"]
    severity: Literal["low", "medium", "high", "critical"]
    description: str

class Ticket(BaseModel):
    customer_tier: Literal["free", "pro", "enterprise"]
    region: Literal["us", "eu", "apac"]
    issues: List[Issue]

tickets = (
    df
    .with_column("extracted", fc.semantic.extract(fc.col("raw_ticket"), Ticket))
    .unnest("extracted")
    .filter(fc.col("region") == "apac")
    .explode("issues")
)

bugs = tickets.filter(fc.col("issues.category") == "bug")
```
The schema makes extraction type-safe: define the desired output structure once and get validated results consistently.
**Natural language filtering with `semantic.predicate`:** Apply content-based filters that evaluate meaning rather than exact string matches:
```python
applicants = df.filter(
    (fc.col("yoe") > 5)
    & fc.semantic.predicate(
        "Has MCP Protocol experience? Resume: {{resume}}",
        resume=fc.col("resume"),
    )
)
```
This combines traditional boolean logic with semantic understanding. The query engine can optimize both together, potentially filtering on cheap boolean conditions before invoking expensive LLM predicates.
**Semantic joins for meaning-based matching:** Join DataFrames based on semantic similarity rather than exact values:
```python
prompt = """
Is this candidate a good fit for the job?

Candidate Background: {{left_on}}
Job Requirements: {{right_on}}

Use the following criteria to make your decision:
- Technical skills alignment
- Experience level appropriateness
- Domain knowledge overlap
"""

joined = (
    applicants.semantic.join(
        other=jobs,
        predicate=prompt,
        left_on=fc.col("resume"),
        right_on=fc.col("job_description"),
    )
    .order_by("application_date")
    .limit(5)
)
```
Unlike fuzzy string matching that measures character similarity, semantic joins understand domain-specific criteria and make contextual decisions.
## Inference-First Architecture for OLAP Workloads
The architectural difference between retrofitted and purpose-built systems determines whether OLAP-style operations on unstructured data succeed at scale.
### Query Optimization for Semantic Operations
Traditional platforms treat LLM calls as external black-box UDFs that query optimizers cannot inspect or optimize. Inference-first architectures embed LLM operations directly into the query engine as first-class citizens.
When the query optimizer sees `semantic.extract()` or `semantic.join()`, it understands this is an inference operation with specific characteristics: high latency, token costs, batching benefits, and caching opportunities.
The optimizer can (see the sketch after this list):
- Reorder operations to minimize data processed by expensive inference
- Batch requests across rows to amortize fixed costs
- Cache aggressively, since repeated calls with identical inputs can reuse prior results
- Parallelize intelligently across multiple providers or models
- Estimate costs accurately before execution
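To make the reordering point concrete, here is a minimal sketch (the data, column names, and `tier` values are illustrative, and it assumes a `session` configured with a `SemanticConfig` as in the pipeline examples later in this guide): the cheap boolean filter is written after the semantic predicate, but an engine that understands inference costs is free to evaluate it first so only surviving rows reach the model.

```python
import fenic as fc

# Assumes `session` was created with a SemanticConfig,
# as shown in the pipeline examples later in this guide.
df = session.create_dataframe({
    "tier": ["free", "enterprise", "enterprise"],
    "ticket": [
        "App crashes when exporting reports",
        "Please add SSO support for Okta",
        "Dashboard loads slowly for large workspaces",
    ],
})

# Written with the expensive predicate first; an inference-aware planner
# can still push the cheap `tier` comparison ahead of the LLM call.
enterprise_bugs = df.filter(
    fc.semantic.predicate(
        "Does this ticket describe a bug? Ticket: {{ticket}}",
        ticket=fc.col("ticket"),
    )
    & (fc.col("tier") == "enterprise")
)
```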
Research demonstrates that properly optimized semantic operators deliver substantial performance improvements over naive implementations through intelligent batching, model cascading, and proxy scoring. Benchmarks show speedups of several hundred times while maintaining statistical accuracy guarantees.
### Statistical Accuracy Guarantees
Production OLAP systems require confidence in result quality. Each semantic operator specifies behavior through a reference algorithm with configurable precision and recall targets plus probabilistic bounds.
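One common way to state such a bound (the notation below is illustrative rather than lifted from Fenic's documentation): given a recall target $\tau$ and an acceptable failure probability $\delta$, any optimized plan $\hat{S}$ substituted for the reference algorithm must satisfy

$$
\Pr\big[\,\mathrm{recall}(\hat{S}) \ge \tau\,\big] \ge 1 - \delta,
$$

with an analogous bound for precision. Cheaper implementations are only admissible when the bound holds.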
This formal foundation enables:
- Validation that optimized implementations maintain accuracy compared to known-good baselines
- Transparent tradeoffs between speed, cost, and quality
- Auditable outputs critical for regulatory compliance
These assurances that results meet quality standards address a major barrier to adoption for organizations deploying semantic processing to production.
## AI-Native Data Types for Structured Processing
OLAP-style analysis requires specialized operations for different content formats. Fenic provides first-class support for AI-native data types beyond standard strings and numerics.
### MarkdownType for Document Structure
Process markdown with structural awareness rather than treating it as plain text:
```python
df = (
    df
    .with_column("raw_blog", fc.col("blog").cast(fc.MarkdownType))
    .with_column(
        "chunks",
        fc.markdown.extract_header_chunks(fc.col("raw_blog"), header_level=2),
    )
    .with_column("title", fc.json.jq(fc.markdown.to_json(fc.col("raw_blog")), ".title"))
    .explode("chunks")
    .with_column(
        "embeddings",
        fc.semantic.embed(fc.col("chunks.content")),
    )
)
```
The `markdown.extract_header_chunks` function preserves document structure, producing semantically meaningful chunks instead of naive character-count splits. This maintains context boundaries and avoids mid-sentence splits.
### TranscriptType for Speaker-Aware Analysis
Handle transcripts with native understanding of speakers and timestamps:
```python
from pydantic import BaseModel, Field

class SegmentSchema(BaseModel):
    speaker: str = Field(description="Who is talking in this segment")
    start_time: float = Field(description="Start time (seconds)")
    end_time: float = Field(description="End time (seconds)")
    key_points: list[str] = Field(description="Bullet points for this segment")

processed = (
    df.select(
        "*",
        fc.text.recursive_token_chunk(
            fc.col("transcript"), chunk_size=1200, chunk_overlap_percentage=0
        ).alias("chunks"),
    )
    .explode("chunks")
    .select(
        fc.col("chunks").alias("chunk"),
        fc.semantic.extract(fc.col("chunk"), SegmentSchema).alias("segment"),
    )
)
```
Speaker identity and timestamps persist through transformations, enabling speaker-aware aggregations and conversation flow analysis without manual parsing.
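As a small illustration (a sketch that builds on the `processed` frame above and assumes standard `count`/`sum` aggregates alongside column arithmetic), per-speaker statistics fall out of ordinary group-bys once the extracted fields exist:

```python
# Per-speaker segment counts and talk time, computed from extracted fields
speaker_stats = (
    processed
    .with_column(
        "duration",
        fc.col("segment.end_time") - fc.col("segment.start_time"),
    )
    .group_by(fc.col("segment.speaker"))
    .agg(
        fc.count("chunk").alias("segments"),
        fc.sum("duration").alias("total_seconds"),
    )
)
```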
### JSONType for Nested Data Manipulation
Process nested JSON with JQ expressions for elegant data extraction:
```python
df = (
    df
    .with_column("author", fc.json.jq(fc.col("metadata"), ".author.name"))
    .with_column("tags", fc.json.jq(fc.col("metadata"), ".tags[]"))
)
```
This eliminates verbose dictionary navigation code and handles missing keys gracefully.
## Building OLAP-Style Analytical Pipelines
Production analytical workloads require composing multiple operations into reliable, optimized pipelines. Here's a complete example demonstrating advanced patterns:
```python
import fenic as fc
from pydantic import BaseModel

class EpisodeSummary(BaseModel):
    title: str
    guests: list[str]
    main_topics: list[str]
    actionable_insights: list[str]

# Configure session with model aliases
config = fc.SessionConfig(
    app_name="content_analysis",
    semantic=fc.SemanticConfig(
        language_models={
            "cheap": fc.OpenAILanguageModel("gpt-4o-mini", rpm=500, tpm=200_000),
            "fast": fc.GoogleVertexLanguageModel("gemini-2.0-flash-lite", rpm=300, tpm=1_000_000),
            "powerful": fc.AnthropicLanguageModel(
                "claude-opus-4-0", rpm=100, input_tpm=100_000, output_tpm=100_000
            ),
        },
        default_language_model="fast",
    ),
)
session = fc.Session.get_or_create(config)

# Load and process data (raw_text and meta_json are prepared upstream)
df = session.create_dataframe({"content": [raw_text], "metadata": [meta_json]})

# SegmentSchema is the Pydantic model defined in the transcript section above
processed = (
    df.select(
        "*",
        fc.semantic.extract(fc.col("metadata"), EpisodeSummary, model_alias="cheap").alias("episode"),
        fc.text.recursive_token_chunk(fc.col("content"), chunk_size=1200).alias("chunks"),
    )
    .explode("chunks")
    .select(
        fc.col("chunks").alias("chunk"),
        fc.semantic.extract(fc.col("chunk"), SegmentSchema, model_alias="fast").alias("segment"),
    )
)

# Aggregate with semantic operations
final = (
    processed
    .select(
        "*",
        fc.semantic.map(
            "Summarize this segment in 2 sentences:\n{{chunk}}",
            chunk=fc.col("chunk"),
            model_alias="cheap",
        ).alias("summary")
    )
    .group_by(fc.col("segment.speaker"))
    .agg(
        fc.semantic.reduce(
            "Combine these summaries into one clear paragraph",
            fc.col("summary"),
            model_alias="fast",
        ).alias("speaker_summary")
    )
)

final.write.parquet("analysis_results.parquet")
```
This pipeline demonstrates six composability patterns:
- Schema-driven extraction: Pydantic models define output structure
- Intelligent chunking: Semantic-aware text splitting respects context
- Explode for row multiplication: Transform single documents into multiple segment rows
- Nested structure access: Reference nested fields naturally
- Semantic aggregation: Group data and apply LLM operations across groups
- Mixed operations: Combine semantic and traditional DataFrame operations
## Performance Optimization Strategies
OLAP workloads processing millions to billions of rows require aggressive optimization to remain cost-effective.
### Model Cascading for Cost Reduction
Use lightweight proxy scorers to filter data before expensive model invocations:
```python
# Fast fuzzy scoring for blocking
candidates = (
    left_df.join(right_df)  # Cross join
    .with_column(
        "fuzzy_score",
        fc.text.compute_fuzzy_ratio(
            fc.col("company_name"), fc.col("business_name"),
            method="jaro_winkler"
        )
    )
    .filter(fc.col("fuzzy_score") > 80)
)

# Then expensive semantic matching, only on the surviving candidate pairs
final = candidates.filter(
    fc.semantic.predicate(
        "Are these the same company? Left: {{left}}, Right: {{right}}",
        left=fc.col("company_description"),
        right=fc.col("business_description"),
    )
)
```
This hybrid approach cuts costs by orders of magnitude compared to semantic joins on full cross-products.
### Multi-Provider Model Selection
Configure different models for different task complexities:
```python
semantic=fc.SemanticConfig(
    language_models={
        "nano": fc.OpenAILanguageModel("gpt-4o-nano"),       # Simple classification
        "mini": fc.OpenAILanguageModel("gpt-4o-mini"),       # Extraction
        "full": fc.AnthropicLanguageModel("claude-opus-4"),  # Complex reasoning
    },
)

# Route operations to appropriate models
df.semantic.classify(col, classes, model_alias="nano")
df.semantic.extract(col, schema, model_alias="mini")
df.semantic.map(complex_instruction, model_alias="full")
```
Cost differences between models often reach 10-100x. Strategic model selection can reduce costs by 80% or more while maintaining quality on the tasks each model is suited for.
### Intelligent Caching
Cache expensive operations to avoid redundant processing:
```python
# Deduplicate before embedding
unique_content = df.select("content").distinct()

embedded = unique_content.with_column(
    "embedding",
    fc.semantic.embed("content")
).cache()

# Join embeddings back to full dataset
df_with_embeddings = df.join(embedded, on="content")
```
This pattern ensures you embed distinct content once rather than processing duplicate text repeatedly.
## Production Deployment Considerations
Moving from prototype to production requires addressing reliability, observability, and cost management.
### Automatic Batching and Rate Limiting
The framework automatically respects provider rate limits using the configured `rpm` and `tpm` values:
```python
config = fc.SessionConfig(
    semantic=fc.SemanticConfig(
        language_models={
            "nano": fc.OpenAILanguageModel(
                "gpt-4o-nano",
                rpm=500,
                tpm=200_000,
            ),
        },
    ),
)
```
The engine tracks token usage in real-time, self-throttles when approaching limits, and maximizes throughput with async I/O and concurrent request batching. Built-in retry logic handles transient failures automatically.
### Row-Level Lineage for Debugging
Every column and row has traceable origins, even when values come from model output:
```python
# Run the pipeline, then ask for lineage on the materialized result
result = df.filter(...).semantic.extract(...).collect()
lineage = result.lineage()
```
Lineage tracks individual row processing history through every transformation. When specific extractions fail or produce unexpected results, you can trace back through the pipeline to identify where issues originated.
### Comprehensive Metrics and Cost Tracking
Built-in observability provides first-class visibility into LLM operations:
```python
result = query.collect()
metrics = result.metrics()

print(f"Total tokens: {metrics.lm_metrics.total_tokens}")
print(f"Total cost: ${metrics.lm_metrics.total_cost}")
print(f"Execution time: {metrics.execution_time}s")

# Operator-level metrics
for op_metric in metrics.operator_metrics:
    if op_metric.cost > 10.0:
        print(f"Expensive operator: {op_metric.name}, Cost: ${op_metric.cost}")
```
Operator-level metrics show where time and money are spent, enabling targeted optimization.
### Lakehouse-Native Architecture
Pure compute with no proprietary storage layer enables seamless integration with existing infrastructure:
```python
# Read from existing lakehouse
df = session.read.parquet("s3://data-lake/raw/*.parquet")

# Process with semantic operators
processed = df.semantic.extract(...).filter(...)

# Write back to lakehouse
processed.write.parquet("s3://data-lake/processed/")
```
Full compatibility with Parquet, Iceberg, Delta Lake, and Lance means processed data works with Spark, Polars, DuckDB, and pandas without data movement.
## Real-World Application: RudderStack Case Study
RudderStack implemented OLAP-style analysis of unstructured support tickets, sales calls, and product documentation using Typedef's semantic infrastructure.
**The challenge:** Evidence existed across PRDs, strategy docs, and tickets scattered across multiple systems. Unstructured inputs dominated at roughly 90%, overwhelming generic workflows. Mapping to a wide, evolving taxonomy was slow and error-prone.
**The implementation:**
- Ingest and normalize: Pull support ticket threads, sales call transcripts, usage data from warehouse; product docs and PRDs from Notion
- Semantic context model: Infer and enrich product taxonomy from docs; map tickets, issues, PRDs, and strategy to taxonomy; create semantic links with citations
- Analytical queries: Expose retrieval via MCP tools; classify new items to taxonomy; surface related duplicates and strategy references; propose prioritize/monitor/decline decisions with rationale
- Persistence: Store mappings, rationales, and decisions in warehouse for analytics
**Results achieved:**
- 95% reduction in PM time per triage
- 90% first-pass category acceptance
- Citations surfaced prospect and community signals directly
- Faster follow-ups, with reach, impacted accounts, and volumes visible
The warehouse-native approach kept the taxonomy current with doc changes. Semantic links provided explainability, higher accuracy, and faster reviews.
## Implementation Roadmap
Organizations should adopt OLAP-style semantic analysis incrementally:
### Phase 1: Foundation (Weeks 1-2)
- Install Fenic and configure one LLM provider
- Identify a small, representative dataset (100-1000 rows)
- Implement a basic extraction or classification pipeline (see the sketch after this list)
- Validate results and measure costs
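A Phase 1 pilot can be as small as the sketch below (the model choice, rate limits, file path, and schema fields are placeholders to adapt to your data):

```python
import fenic as fc
from pydantic import BaseModel

class TicketLabel(BaseModel):
    category: str
    severity: str

config = fc.SessionConfig(
    app_name="phase1_pilot",
    semantic=fc.SemanticConfig(
        language_models={
            "default": fc.OpenAILanguageModel("gpt-4o-mini", rpm=100, tpm=100_000),
        },
        default_language_model="default",
    ),
)
session = fc.Session.get_or_create(config)

# A few hundred representative rows are enough to validate quality and cost.
sample = session.read.parquet("s3://data-lake/raw/tickets_sample.parquet")
labeled = sample.with_column(
    "label", fc.semantic.extract(fc.col("body"), TicketLabel)
)

result = labeled.collect()
print(result.metrics().lm_metrics.total_cost)  # measure cost before scaling up
```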
### Phase 2: Optimization (Weeks 3-4)
- Add intelligent caching for expensive operations
- Implement model cascading with cheap proxy scorers
- Configure rate limits and batching parameters
- Add metrics collection and cost tracking
### Phase 3: Production (Weeks 5-8)
- Scale to full dataset with monitoring
- Implement row-level lineage for debugging
- Set up automated pipeline execution
- Create dashboards for key metrics
### Phase 4: Advanced Patterns (Ongoing)
- Add semantic joins for complex matching scenarios
- Implement semantic aggregations for summarization
- Deploy multi-provider strategies for cost optimization
- Integrate with existing BI and analytical tools
## Key Takeaways
OLAP-style analysis of unstructured data requires purpose-built infrastructure that treats semantic operations as first-class primitives:
Semantic operators extend familiar DataFrame operations with AI-native capabilities like extraction, classification, and meaning-based joins. This provides the composability and reliability that made traditional DataFrames indispensable.
Inference-first architectures optimize LLM operations the same way databases optimize CPU and memory operations. Query engines can batch requests, cache results, and estimate costs when they understand semantic operations as first-class citizens.
Statistical accuracy guarantees and row-level lineage provide the reliability and debuggability required for production systems. Organizations can deploy semantic processing with confidence in result quality.
Strategic model selection and intelligent caching reduce costs by 80% or more while maintaining quality. Use cheap models for simple tasks, expensive models only for complex reasoning, and eliminate redundant processing through deduplication.
Lakehouse-native integration ensures semantic processing works with existing infrastructure. Read from and write to standard formats without data movement or proprietary storage layers.
The convergence of OLAP databases and semantic processing represents a fundamental shift in analytical infrastructure. Organizations adopting inference-first architectures gain the ability to run sophisticated analytical queries across both structured and unstructured data using familiar DataFrame abstractions enhanced with semantic intelligence.
Ready to implement OLAP-style queries on your unstructured data? Explore Fenic's documentation, join the community on Discord, or review implementation examples to get started.

