How to Create Composable Semantic Operators for Data Transformation

Typedef Team

Semantic operators are transforming how developers build AI pipelines. Typedef's Fenic framework treats LLM inference as a first-class operation within a familiar DataFrame API, enabling production-ready AI workflows with the same composability and reliability that made pandas indispensable. Rather than cobbling together brittle glue code and external API calls, developers can now build deterministic pipelines on non-deterministic models using declarative operations that the query engine fully understands and optimizes.

This represents a fundamental shift in AI infrastructure. Legacy data platforms were not designed for inference-first workloads. Fenic's query engine is built from the ground up for AI, with semantic operators as first-class citizens: inference is embedded directly into query operators, so the engine can optimize it just as it optimizes CPU or memory operations. The result: cleaner code, better performance, and infrastructure that finally matches the demands of production AI.

Let's explore how to leverage composable semantic operators for data transformation, from basic concepts to advanced production patterns.

Understanding Composable Semantic Operators in Fenic

Semantic operators are DataFrame operations that understand meaning, not just values. Unlike traditional DataFrame operations that work on exact matches and numeric calculations, semantic operators use LLMs to transform, filter, join, and aggregate data based on semantic understanding.

Fenic provides nine semantic operators as first-class DataFrame primitives:

  • semantic.extract transforms unstructured text into structured data using Pydantic schemas. This eliminates brittle prompt engineering by defining schemas once and getting validated results every time. Perfect for extracting entities, attributes, and relationships from documents.
  • semantic.map applies natural language transformations to data, enabling text generation, rewriting, translation, and summarization with simple instructions like "Summarize {text} in 2 sentences."
  • semantic.classify categorizes text with few-shot examples, providing consistent category assignments across datasets without training custom models.
  • semantic.join joins DataFrames on meaning rather than exact values. Instead of fuzzy string matching, it uses natural language predicates to determine if rows should match—ideal for matching job descriptions with resumes or linking related documents.
  • semantic.predicate creates natural language filters for row selection, enabling queries like "Does this feedback mention UI problems?" without complex regex patterns.
  • semantic.with_cluster_labels clusters rows by semantic similarity using embeddings, automatically grouping related content without predefined categories.
  • semantic.reduce aggregates grouped data with LLM operations, enabling semantic summarization across groups rather than just counting or averaging.
  • semantic.analyze_sentiment provides built-in sentiment analysis without external services.
  • semantic.embed generates embeddings for text columns.

The power comes from composability—these operators chain naturally with standard DataFrame operations. The query engine understands both traditional and semantic operations, enabling unified optimization across your entire pipeline.
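
Here is a minimal sketch of that composability, assuming a hypothetical feedback DataFrame with comment and stars columns: a semantic classification, a filter that mixes a boolean condition with a natural language predicate, and a semantic aggregation chained with an ordinary group-by.

python
import fenic as fc

report = (
    feedback  # hypothetical DataFrame with "comment" and "stars" columns
    # Assign each comment to a fixed category
    .with_column(
        "category",
        fc.semantic.classify(fc.col("comment"), ["praise", "bug report", "feature request"]),
    )
    # Mix a cheap boolean condition with a natural-language predicate
    .filter(
        (fc.col("stars") <= 3)
        & fc.semantic.predicate(
            "Does this comment describe a problem with the product? {{comment}}",
            comment=fc.col("comment"),
        )
    )
    # Standard group-by with a semantic aggregation
    .group_by("category")
    .agg(
        fc.semantic.reduce(
            "Summarize the recurring themes in these comments",
            fc.col("comment"),
        ).alias("themes")
    )
)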

Setting Up Fenic for Semantic Operations

Installation requires Python 3.10, 3.11, or 3.12:

bash
pip install fenic

Configure at least one LLM provider with environment variables:

bash
export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
export GOOGLE_API_KEY="your-google-api-key"
export COHERE_API_KEY="your-cohere-api-key"

Initialize a session with semantic configuration:

python
import fenic as fc

config = fc.SessionConfig(
    app_name="my_app",
    semantic=fc.SemanticConfig(
        language_models={
            "nano": fc.OpenAILanguageModel(
                "gpt-4.1-nano",
                rpm=500,
                tpm=200_000
            ),
            "flash": fc.GoogleVertexLanguageModel(
                "gemini-2.0-flash-lite",
                rpm=300,
                tpm=150_000
            ),
        },
        default_language_model="flash",
    ),
)

session = fc.Session.get_or_create(config)

This configuration defines model aliases that abstract provider-specific details. Using aliases like "nano" and "flash" makes it trivial to swap models or providers without changing pipeline code—critical for production systems that need to optimize cost and performance dynamically.
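
As a sketch of what that looks like in practice, pipeline code references only the alias; switching to "flash", or to a newly configured provider, is purely a configuration change. The text column here is illustrative.

python
# The pipeline names an alias, not a provider or model version.
summaries = df.with_column(
    "summary",
    fc.semantic.map(
        "Summarize {{text}} in 2 sentences.",
        text=fc.col("text"),
        model_alias="nano",  # swap to "flash" without touching this code
    ),
)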

Rate limiting parameters (rpm and tpm) prevent throttling by your provider. Fenic automatically batches requests, implements retry logic, and self-throttles to stay within limits while maximizing throughput with async I/O and concurrent request batching.

Creating Type-Safe Structured Extraction with semantic.extract

The semantic.extract operator transforms unstructured text into structured data using Pydantic schemas. This provides type-safe extraction where you define your desired output structure once and get validated results consistently.

Consider support ticket processing:

python
from pydantic import BaseModel, Field
from typing import List, Literal

class Issue(BaseModel):
    category: Literal["bug", "feature_request", "question"]
    severity: Literal["low", "medium", "high", "critical"]
    description: str

class Ticket(BaseModel):
    customer_tier: Literal["free", "pro", "enterprise"]
    region: Literal["us", "eu", "apac"]
    issues: List[Issue]

tickets = (
    df
    .with_column("extracted", fc.semantic.extract("raw_ticket", Ticket))
    .unnest("extracted")
    .filter(fc.col("region") == "apac")
    .explode("issues")
)

bugs = tickets.filter(fc.col("issues").category == "bug")

The schema guides LLM extraction with clear structure and constraints. Pydantic's Field descriptions provide additional context:

python
class SegmentSchema(BaseModel):
    speaker: str = Field(description="Who is talking in this segment")
    start_time: float = Field(description="Start time (seconds)")
    end_time: float = Field(description="End time (seconds)")
    key_points: list[str] = Field(description="Bullet points for this segment")

After extraction, the data becomes structured and you can filter, aggregate, and transform using standard DataFrame operations. The unnest operation flattens nested Pydantic models into columns, while explode expands lists into separate rows—enabling seamless transitions between nested and flat representations.

This pattern eliminates the "extract-then-parse-JSON-then-validate" dance common in traditional LLM pipelines. Fenic handles schema validation, error handling, and retries automatically, with row-level lineage for debugging when extractions fail.

Implementing Semantic Filtering with Natural Language Predicates

Semantic predicates enable filtering with natural language conditions instead of complex boolean logic:

python
applicants = df.filter(
    (fc.col("yoe") > 5) &
    fc.semantic.predicate(
        "Has MCP Protocol experience? Resume: {{resume}}",
        resume=fc.col("resume"),
    )
)

This combines traditional boolean logic with semantic understanding. The first condition uses standard column comparison, while the semantic predicate evaluates natural language criteria against unstructured text. The query engine optimizes both together—potentially filtering on the cheap boolean condition first before invoking the expensive LLM predicate.

Predicates accept template variables using Jinja syntax (double curly braces). The Jinja templating is Rust-powered as of version 0.3.0, providing dynamic, data-aware prompts with loops, conditionals, and arrays:

python
fc.semantic.predicate(
    """
    Does this feedback mention {{ search_term }}?
    {% if priority == "high" %}
    Only return true if it's a critical issue.
    {% endif %}

    Feedback: {{ feedback_text }}
    """,
    search_term=fc.lit("UI problems"),
    priority=fc.col("priority"),
    feedback_text=fc.col("raw_feedback")
)

The template is evaluated per row, allowing row-specific logic in your prompts. This enables sophisticated filtering that adapts to each record's characteristics while maintaining the declarative DataFrame abstraction.

Building Semantic Joins for Meaning-Based Matching

Traditional joins match on exact values or use fuzzy string similarity. Semantic joins determine matches based on meaning:

python
prompt = """
Is this candidate a good fit for the job?
Candidate Background: {{left_on}}
Job Requirements: {{right_on}}

Use the following criteria to make your decision:
- Technical skills alignment
- Experience level appropriateness
- Domain knowledge overlap
"""

joined = (
    applicants.semantic.join(
        other=jobs,
        predicate=prompt,
        left_on=fc.col("resume"),
        right_on=fc.col("job_description"),
    )
    .order_by("application_date")
    .limit(5)
)

The predicate receives both left and right row data as context, enabling sophisticated matching logic that considers multiple factors. Unlike fuzzy string matching that measures character similarity, semantic joins understand domain-specific criteria and make nuanced decisions.

Fenic optimizes semantic joins by batching LLM calls across candidate pairs, caching decisions for repeated comparisons, and potentially using embeddings for initial filtering before applying the expensive LLM predicate to top candidates.

This pattern works exceptionally well for:

  • Matching documents to queries in RAG systems (sketched after this list)
  • Linking related records across databases
  • Finding similar but not identical content
  • Deduplication based on semantic similarity rather than string distance
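
As a minimal sketch of the first case, assuming hypothetical queries and docs DataFrames with question and passage columns, the same join API shown above matches questions to passages on meaning:

python
# Hypothetical "queries" and "docs" DataFrames with "question" and "passage" columns
matched = queries.semantic.join(
    other=docs,
    predicate="""
    Does this passage contain information that answers the question?
    Question: {{left_on}}
    Passage: {{right_on}}
    """,
    left_on=fc.col("question"),
    right_on=fc.col("passage"),
)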

Processing AI-Native Data Types with Specialized Operators

Fenic goes beyond standard data types with first-class support for AI-native formats: MarkdownType, TranscriptType, JSONType, HTMLType, and EmbeddingType. These aren't just metadata tags—they unlock specialized operations.

Working with MarkdownType

python
df = (
    df
    .with_column("raw_blog", fc.col("blog").cast(fc.MarkdownType))
    .with_column(
        "chunks",
        fc.markdown.extract_header_chunks("raw_blog", header_level=2)
    )
    .with_column("title", fc.json.jq("raw_blog", ".title"))
    .explode("chunks")
    .with_column(
        "embeddings",
        fc.semantic.embed(fc.col("chunks").content)
    )
)

The markdown.extract_header_chunks function leverages document structure (sections, paragraphs, headings) for semantically meaningful chunks instead of naive character-count splitting. This dramatically improves RAG quality by preserving context boundaries and avoiding splits mid-sentence.

Processing TranscriptType with Speaker Awareness

TranscriptType handles SRT, WebVTT, and generic transcript formats with native understanding of speakers and timestamps:

python
from pathlib import Path

# Load a transcript file (SRT, WebVTT, or generic format)
transcript_text = Path("data/transcript.json").read_text()
df = session.create_dataframe({"transcript": [transcript_text]})

processed = (
    df.select(
        "*",
        fc.text.recursive_token_chunk("transcript", chunk_size=1200, chunk_overlap_percentage=0).alias("chunks"),
    )
    .explode("chunks")
    .select(
        fc.col("chunks").alias("chunk"),
        fc.semantic.extract("chunk", SegmentSchema, model_alias="mini").alias("segment"),
    )
)

Fenic preserves speaker identity and timestamps through transformations, enabling speaker-aware analysis without manual parsing. You can aggregate by speaker, analyze conversation flows, or extract speaker-specific insights.

Nested JSON Manipulation with JQ Expressions

JSONType supports JQ expressions for elegant nested data manipulation:

python
.with_column("author", fc.json.jq("metadata", ".author.name"))
.with_column("tags", fc.json.jq("metadata", ".tags[]"))

This eliminates verbose Python dictionary navigation code and handles missing keys gracefully.

Orchestrating Complete Data Transformation Pipelines

Real production pipelines combine multiple semantic operators with traditional DataFrame operations. Here's a complete podcast processing pipeline demonstrating advanced patterns:

python
from pathlib import Path
from pydantic import BaseModel, Field

import fenic as fc

class SegmentSchema(BaseModel):
    speaker: str = Field(description="Who is talking in this segment")
    start_time: float = Field(description="Start time (seconds)")
    end_time: float = Field(description="End time (seconds)")
    key_points: list[str] = Field(description="Bullet points for this segment")

class EpisodeSummary(BaseModel):
    title: str
    guests: list[str]
    main_topics: list[str]
    actionable_insights: list[str]

# Initialize session with model alias
config = fc.SessionConfig(
    app_name="podcast_analysis",
    semantic=fc.SemanticConfig(
        language_models={
            "mini": fc.OpenAILanguageModel(
                model_name="gpt-4o-mini",
                rpm=300,
                tpm=150_000
            )
        }
    ),
)
session = fc.Session.get_or_create(config)

# Load raw data
data_dir = Path("data")
transcript_text = (data_dir / "transcript.json").read_text()
meta_text = (data_dir / "meta.json").read_text()
df = session.create_dataframe({"meta": [meta_text], "transcript": [transcript_text]})

# Extract metadata and segment transcript
processed = (
    df.select(
        "*",
        fc.semantic.extract("meta", EpisodeSummary, model_alias="mini").alias("episode"),
        fc.text.recursive_token_chunk("transcript", chunk_size=1200, chunk_overlap_percentage=0).alias("chunks"),
    )
    .explode("chunks")
    .select(
        fc.col("chunks").alias("chunk"),
        fc.semantic.extract("chunk", SegmentSchema, model_alias="mini").alias("segment"),
    )
)

# Create abstracts per segment and aggregate by speaker
final = (
    processed
    .select(
        "*",
        fc.semantic.map(
            "Summarize this segment in 2 sentences:\n{{chunk}}",
            chunk=fc.col("chunk"),
            model_alias="mini"
        ).alias("segment_summary")
    )
    .group_by(fc.col("segment.speaker"))
    .agg(
        fc.semantic.reduce(
            "Combine these summaries into one clear paragraph",
            fc.col("segment_summary"),
            model_alias="mini"
        ).alias("speaker_summary")
    )
)

final.show(truncate=120)
final.write.parquet("podcast_summaries.parquet")
session.stop()

This pipeline demonstrates six key composability patterns:

Pattern 1: Schema-driven extraction - Pydantic models define output structure for consistent parsing.

Pattern 2: Intelligent chunking - Semantic-aware text splitting respects structure and context.

Pattern 3: Explode for row multiplication - Transform single transcript into multiple segment rows.

Pattern 4: Nested structure access - Reference nested fields like segment.speaker naturally.

Pattern 5: Semantic aggregation - Group data and apply LLM operations across groups.

Pattern 6: Mixed operations - Combine semantic and traditional DataFrame operations in one pipeline.

The pipeline reads raw text, extracts structure, transforms content, aggregates semantically, and writes results—all declaratively expressed with automatic optimization, batching, and error handling.

Configuring Advanced Model Profiles and Multi-Provider Strategies

Production systems need flexibility in model selection. Fenic supports model profiles that configure the same model with different settings:

python
config = fc.SessionConfig(
    semantic=fc.SemanticConfig(
        language_models={
            "claude": fc.AnthropicLanguageModel(
                model_name="claude-opus-4-0",
                rpm=100,
                input_tpm=100_000,
                output_tpm=100_000,
                profiles={
                    "thinking_disabled": fc.AnthropicLanguageModel.Profile(),
                    "fast": fc.AnthropicLanguageModel.Profile(
                        thinking_token_budget=1024
                    ),
                    "thorough": fc.AnthropicLanguageModel.Profile(
                        thinking_token_budget=4096
                    )
                },
                default_profile="fast"
            )
        },
        default_language_model="claude"
    )
)

# Use default "fast" profile
fc.semantic.map(
    "Construct a formal proof of the {{hypothesis}}.",
    hypothesis=fc.col("hypothesis"),
    model_alias="claude"
)

# Override to use "thorough" profile for complex reasoning
fc.semantic.map(
    "Construct a formal proof of the {{hypothesis}}.",
    hypothesis=fc.col("hypothesis"),
    model_alias=fc.ModelAlias(name="claude", profile="thorough")
)

For Anthropic models, profiles configure thinking token budgets (1024-8192 tokens). For OpenAI o-series models, profiles set reasoning effort (low, medium, high). This enables dynamic model selection based on task complexity without changing pipeline code.

Multi-provider strategies optimize cost and reliability:

python
semantic=fc.SemanticConfig(
    language_models={
        "cheap": fc.OpenAILanguageModel("gpt-4o-mini", rpm=500, tpm=200_000),
        "fast": fc.GoogleVertexLanguageModel("gemini-2.0-flash-lite", rpm=300),
        "powerful": fc.AnthropicLanguageModel("claude-opus-4-0", rpm=100),
    },
    default_language_model="cheap",
)

Use cheap models for simple classification, fast models for bulk processing, and powerful models for complex reasoning. The query optimizer can theoretically route operations to appropriate models based on complexity, though explicit selection gives you full control.

Leveraging Declarative Tools for Optimization and Debugging

Fenic's declarative API enables automatic optimization that imperative code cannot achieve. When operations are expressed declaratively, the query engine sees the entire pipeline and optimizes globally.

Query Optimization Patterns

Lazy evaluation defers execution until you call show(), collect(), or write operations. This allows the optimizer to:

  • Reorder operations to minimize expensive LLM calls
  • Push filters down to reduce data volume before inference
  • Batch LLM calls across rows efficiently
  • Cache repeated operations automatically
  • Estimate costs before execution

Traditional imperative LLM code processes one row at a time with explicit API calls. Fenic's declarative approach enables the engine to batch hundreds of requests, dramatically reducing latency and improving throughput.
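
As a rough sketch of the lazy model, reusing the Ticket schema from the extraction example above and an illustrative raw source column, the pipeline below is only planned until collect() runs; at that point the engine executes it with batched inference and reports metrics.

python
# Nothing executes while the plan is being built
pipeline = (
    df
    .filter(fc.col("source") == "email")  # cheap filter, applied before any inference
    .with_column("ticket", fc.semantic.extract("raw_ticket", Ticket))  # batched LLM calls
)

result = pipeline.collect()  # execution (and inference) happens here
print(f"Cost: ${result.metrics().lm_metrics.total_cost}")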

Row-Level Lineage for Debugging

Every column and row has traceable origins, even from model output:

python
result = df.filter(...).semantic.extract(...).collect()
lineage = result.lineage()

Lineage tracks individual row processing history through every transformation, critical for debugging when specific extractions fail or produce unexpected results. You can trace back through the pipeline to identify where issues originated.

Metrics and Observability

python
result = query.collect()
metrics = result.metrics()

print(f"Total tokens: {metrics.lm_metrics.total_tokens}")
print(f"Total cost: ${metrics.lm_metrics.total_cost}")
print(f"Execution time: {metrics.execution_time}s")

Built-in token counting and cost tracking provide first-class observability into LLM operations. Operator-level metrics show where time and money are spent, enabling targeted optimization.

The declarative model plus rich metrics creates a feedback loop for optimization that's impossible with imperative LLM code scattered across microservices.

Architecting Production-Ready Semantic Pipelines

Production AI systems require reliability, observability, and cost management. Fenic provides infrastructure built for production from day one.

Separation of Batch Preprocessing and Real-Time Agents

Fenic excels at heavy lifting in batch pipelines that prepare clean, structured data for real-time agents:

python
# Batch preprocessing pipeline (runs offline)
enriched_data = (
    raw_documents
    .with_column("raw_md", fc.col("content").cast(fc.MarkdownType))
    .with_column("chunks", fc.markdown.extract_header_chunks("raw_md", header_level=2))
    .explode("chunks")
    .with_column("embedding", fc.semantic.embed(fc.col("chunks").content))
    .with_column(
        "metadata",
        fc.semantic.extract("chunks", DocumentMetadata, model_alias="cheap")
    )
)

enriched_data.write.parquet("s3://my-bucket/enriched/")

Agents then query the enriched data without expensive inference at request time. This architecture provides:

  • More predictable and responsive agents - No LLM latency in user-facing paths
  • Better resource utilization - Batch processing amortizes fixed costs
  • Cleaner separation - Planning/orchestration decoupled from execution
  • Easier debugging - Preprocessing happens once, can be validated offline

Rate Limiting and Self-Throttling

Fenic automatically respects provider rate limits with configured rpm (requests per minute) and tpm (tokens per minute):

python
"nano": fc.OpenAILanguageModel(
    "gpt-4.1-nano",
    rpm=500,
    tpm=200_000
)

The engine tracks token usage in real-time and self-throttles when approaching limits. Async I/O with concurrent request batching maximizes throughput while staying within constraints. Built-in retry logic handles transient failures automatically.

Intelligent Caching Strategies

Cache expensive operations explicitly:

python
df_cached = df.filter(...).semantic.extract(...).cache()

# Subsequent operations use cached results
result1 = df_cached.filter(condition1).collect()
result2 = df_cached.filter(condition2).collect()

The engine also caches identical inference calls automatically within a session, preventing redundant API calls when the same prompt with the same input appears multiple times.

Lakehouse-Native Architecture

Fenic is pure compute with no proprietary storage layer. Read from and write to existing lakehouses without data movement:

python
df = session.read.parquet("s3://data-lake/raw/*.parquet")

processed = df.semantic.extract(...).filter(...)

processed.write.parquet("s3://data-lake/processed/")

Full compatibility with Parquet, Iceberg, Delta Lake, and Lance enables seamless integration with existing infrastructure. Built on Apache Arrow for ecosystem interoperability—processed data works with Spark, Polars, DuckDB, and pandas.
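
For instance, output written by Fenic is plain Parquet, so Arrow-native tools can pick it up directly; the tools and file path below are illustrative.

python
import duckdb
import polars as pl

# Query Fenic's Parquet output directly with DuckDB...
top = duckdb.sql("SELECT * FROM 'podcast_summaries.parquet' LIMIT 10").df()

# ...or load it into Polars (or pandas) for downstream processing
summaries = pl.read_parquet("podcast_summaries.parquet")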

Best Practices for Semantic Operator Development

Choose Appropriate Model Sizes Strategically

Use smaller, cheaper models for simple tasks and reserve expensive models for complex reasoning:

python
semantic=fc.SemanticConfig(
    language_models={
        "nano": fc.OpenAILanguageModel("gpt-4o-nano"),      # Classification
        "mini": fc.OpenAILanguageModel("gpt-4o-mini"),      # Extraction
        "full": fc.AnthropicLanguageModel("claude-opus-4")  # Complex reasoning
    }
)

# Simple classification
.semantic.classify(col, classes, model_alias="nano")

# Structured extraction
.semantic.extract(col, schema, model_alias="mini")

# Multi-step reasoning
.semantic.map(complex_instruction, model_alias="full")

The cost difference between models is often 10-100x. Strategic model selection can reduce costs by 80% while maintaining quality for appropriate tasks.

Design Pydantic Schemas with Clear Descriptions

Schema field descriptions guide LLM extraction:

python
class Transaction(BaseModel):
    merchant: str = Field(description="The business name where transaction occurred")
    category: Literal["grocery", "dining", "transport", "entertainment", "other"] = Field(
        description="Transaction category based on merchant type and purchase details"
    )
    amount: float = Field(description="Transaction amount in USD")
    is_recurring: bool = Field(
        description="True if this appears to be a recurring/subscription charge"
    )

Clear descriptions with examples and constraints improve extraction accuracy significantly. Literal types constrain outputs to valid categories, reducing hallucination.

Test Pipelines Incrementally with Small Datasets

Develop and test with small representative samples before scaling:

python
# Development: 100 rows
df_sample = df.limit(100)
result = df_sample.semantic.extract(...).collect()
print(f"Cost for 100 rows: ${result.metrics().lm_metrics.total_cost}")

# Validate results, then scale
df_full.semantic.extract(...).write.parquet("output/")

Fenic's lazy evaluation and metrics make it trivial to estimate costs and validate logic before processing millions of rows.

Leverage Fuzzy Matching for Preprocessing

Version 0.3.0 added built-in fuzzy string matching with six algorithms (Levenshtein, Jaro-Winkler, and others). Use fuzzy matching for initial candidate selection before expensive semantic joins:

python
# Fast fuzzy scoring for blocking
candidates = (
    left_df.join(right_df)  # Cross join
    .with_column(
        "fuzzy_score",
        fc.text.compute_fuzzy_ratio(
            fc.col("company_name"),
            fc.col("business_name"),
            method="jaro_winkler"
        )
    )
    .filter(fc.col("fuzzy_score") > 80)  # Score is 0-100
)

# Then expensive semantic matching on candidates
final = candidates.semantic.join(
    predicate="Are these the same company? Left: {{left_name}}, Right: {{right_name}}",
    left_on=fc.col("company_description"),
    right_on=fc.col("business_description")
)

This hybrid approach reduces costs by orders of magnitude compared to semantic joins on full cross-products.

Monitor and Optimize Based on Metrics

Regularly review pipeline metrics to identify optimization opportunities:

python
result = pipeline.collect()
metrics = result.metrics()

# Identify expensive operators
for op_metric in metrics.operator_metrics:
    if op_metric.cost > 10.0:  # $10+ operators
        print(f"Expensive operator: {op_metric.name}, Cost: ${op_metric.cost}")

# Check model usage distribution
print(f"Mini usage: {metrics.lm_metrics['mini'].total_tokens} tokens")
print(f"Full usage: {metrics.lm_metrics['full'].total_tokens} tokens")

Use insights to shift operations to cheaper models, add caching, or restructure pipelines for efficiency.

Integration Patterns with Existing Infrastructure

Seamless Lakehouse Integration

Fenic reads and writes standard formats without requiring data movement or proprietary storage:

python
# Read from Delta Lake
df = session.read.format("delta").load("s3://lake/raw_data")

# Process with semantic operators
processed = df.semantic.extract(...).filter(...)

# Write back to Iceberg
processed.write.format("iceberg").mode("append").save("s3://lake/processed")

The lakehouse-native architecture means Fenic works with your existing infrastructure—no data duplication, no new storage systems, no lock-in.

Agent Preprocessing Pipelines

Use Fenic to create structured context for agents that eliminates heavy inference from request paths:

python
agent_context = (
    documents
    .with_column("extracted", fc.semantic.extract(content_col, StructuredMetadata))
    .with_column("embedding", fc.semantic.embed("processed_content"))
    .with_column(
        "summary",
        fc.semantic.map(
            "Summarize in 100 words: {{content}}",
            content=fc.col("content"),
            model_alias="mini"
        )
    )
)

agent_context.write.parquet("agent_knowledge_base/")

# Online: Agent queries preprocessed data
# (No expensive inference in user-facing path)

This separation enables predictable, responsive agents while leveraging Fenic's strength in batch processing.

Hybrid Pipelines with Traditional ETL

Mix Fenic with existing Spark/Airflow workflows:

python
# Spark preprocessing
spark_df.write.parquet("s3://interim/")

# Fenic semantic enrichment
enriched = (
    session.read.parquet("s3://interim/")
    .semantic.extract(...)
    .semantic.classify(...)
)
enriched.write.parquet("s3://final/")

# Continue with downstream Spark/dbt processing

Fenic excels at semantic operations while standard tools handle traditional ETL. Use the right tool for each step.

The Inference-First Architecture Advantage

Traditional data platforms treat LLM calls as external black-box UDFs that query optimizers cannot inspect or optimize. Fenic's inference-first approach embeds LLM operations directly into the query engine as first-class citizens.

When the query optimizer sees semantic.extract() or semantic.join(), it understands this is an inference operation with specific characteristics: high latency, token costs, batching benefits, and caching opportunities. The optimizer can:

  • Reorder operations to minimize data processed by expensive inference
  • Batch requests across rows to amortize fixed costs
  • Cache aggressively since deterministic operations with same inputs produce same outputs
  • Parallelize intelligently across multiple providers or models
  • Estimate costs accurately before execution

This is impossible when LLM calls are hidden in UDFs or microservices. Fenic's semantic operators make inference visible to the optimizer, enabling optimizations that dramatically improve performance and reduce costs.

The declarative API also provides auditability and reproducibility. Every operation is explicitly defined with inputs, prompts, and model configurations tracked automatically. Row-level lineage traces data flow through transformations, critical for debugging and compliance.

Combined with native AI data types (Markdown, Transcript, JSON with structure awareness), automatic batch optimization, multi-provider support, and production-grade error handling, Fenic represents infrastructure purpose-built for AI workloads rather than retrofitted legacy systems.

Conclusion: Composability Enables Production AI at Scale

Composable semantic operators transform LLM pipelines from brittle glue code into robust, optimizable data transformations. By treating inference as a first-class operation within a familiar DataFrame API, Fenic enables developers to build production AI systems with the same rigor and reliability that data engineers have applied to traditional pipelines for decades.

The key principles: declarative operations enable optimization, type-safe schemas eliminate brittle prompts, intelligent batching reduces costs, and row-level lineage makes debugging tractable. When semantic operators compose naturally with traditional DataFrame operations, you stop choosing between structured and unstructured data—you build unified pipelines that handle both.

For teams building semantic extraction, content classification, RAG systems, or agent preprocessing pipelines, composable semantic operators provide the foundation for reliable, scalable, cost-effective AI infrastructure. The DataFrame abstraction proves as powerful for AI workloads as it was for traditional data engineering, now extended with semantic intelligence.

Start with simple operations like semantic.extract or semantic.classify on small datasets, validate results and costs, then scale to production with confidence that the infrastructure handles batching, optimization, error handling, and observability automatically. That's the power of composability—building complex systems from simple, reliable primitives that work together seamlessly.

Learn more about Typedef | Explore Fenic on GitHub | Read the Fenic announcement
