
How to Manage Multiple LLM Providers in a Single Inference Framework

Typedef Team

Production AI applications need flexibility. You might want OpenAI's GPT-5 for complex reasoning, Claude for long-context analysis, Gemini for cost-effective batch processing, or OpenRouter for dynamic provider routing. Managing these providers separately creates operational overhead, but Fenic's unified inference framework lets you configure, route, and optimize across providers from a single codebase.

This guide shows you how to configure multiple LLM providers in Fenic, implement rate limiting, switch models dynamically, and build production-grade pipelines that leverage the strengths of different providers.

Why Use Multiple Providers

Different providers excel at different tasks:

  • OpenAI's GPT-5 and o-series models deliver strong reasoning capabilities
  • Anthropic's Claude handles extended context windows and complex analysis
  • Google's Gemini offers fast, cost-effective inference with native file processing
  • OpenRouter provides automatic failover and cost/latency-based routing across providers

A single provider limits your options. Multiple providers let you optimize for cost, latency, quality, and reliability per workload.

Setting Up Your First Multi-Provider Configuration

Fenic manages all provider configurations through a SemanticConfig object within your session. Start by configuring multiple language models with their rate limits.

python
import fenic as fc
from fenic.api.session.config import (
    SessionConfig,
    SemanticConfig,
    OpenAILanguageModel,
    AnthropicLanguageModel,
    GoogleDeveloperLanguageModel
)

config = SessionConfig(
    app_name="multi_provider_app",
    semantic=SemanticConfig(
        language_models={
            "gpt4": OpenAILanguageModel(
                model_name="gpt-4.1-nano",
                rpm=100,
                tpm=100
            ),
            "claude": AnthropicLanguageModel(
                model_name="claude-3-5-haiku-latest",
                rpm=100,
                input_tpm=100,
                output_tpm=100
            ),
            "gemini": GoogleDeveloperLanguageModel(
                model_name="gemini-2.0-flash",
                rpm=100,
                tpm=1000
            )
        },
        default_language_model="gpt4"
    )
)

session = fc.Session.get_or_create(config)

Each model receives a string alias ("gpt4", "claude", "gemini") that you'll use to reference it in semantic operations. The default_language_model determines which model runs when you don't explicitly specify one.
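
For example, an operation that omits model_alias runs on the default model. A minimal sketch, assuming a small in-memory DataFrame created with create_dataframe (the sample data is illustrative):

python
# No model_alias: Fenic routes this call to the default model ("gpt4" above)
df = session.create_dataframe({"review": ["Great product", "Slow support response"]})

tones = df.select(
    fc.semantic.map(
        "Classify the tone of this review as positive or negative: {{ review }}",
        review=fc.col("review")
    ).alias("tone")
)
tones.show()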

Rate Limiting Across Providers

Provider rate limits vary by API tier. Fenic enforces these limits at the framework level, preventing 429 errors and ensuring reliable pipeline execution.

Provider-Specific Rate Limit Patterns

Different providers use different rate limit structures:

OpenAI and Google use combined token limits:

python
OpenAILanguageModel(
    model_name="gpt-4.1-nano",
    rpm=100,      # Requests per minute
    tpm=10000     # Tokens per minute (input + output)
)

Anthropic separates input and output token limits:

python
AnthropicLanguageModel(
    model_name="claude-opus-4-0",
    rpm=100,
    input_tpm=50000,   # Input tokens per minute
    output_tpm=25000   # Output tokens per minute
)

Fenic tracks token usage automatically and throttles requests to stay within your configured limits. This prevents quota exhaustion and ensures predictable pipeline behavior. According to the Fenic 0.5.0 release notes, the framework includes quota handling guardrails that fail fast with clear errors rather than silently retrying.

Switching Models Per Operation

Reference any configured model by its alias in semantic operations. This lets you route different workloads to different providers based on their strengths.

python
# Use GPT-4 for complex extraction
structured_data = df.select(
    fc.semantic.extract(
        fc.col("document"),
        response_format=ComplexSchema,
        model_alias="gpt4"
    )
)

# Use Gemini for cost-effective summarization
summaries = df.select(
    fc.semantic.summarize(
        fc.col("content"),
        model_alias="gemini"
    )
)

# Use Claude for sentiment analysis
sentiment = df.select(
    fc.semantic.analyze_sentiment(
        fc.col("feedback"),
        model_alias="claude"
    )
)
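
The response_format schemas referenced here (ComplexSchema, DocumentMetadata) are Pydantic models that semantic.extract uses to shape its structured output. A minimal sketch of what such a schema might look like (the fields are purely illustrative):

python
from pydantic import BaseModel, Field

# Hypothetical stand-in for the ComplexSchema used above
class ComplexSchema(BaseModel):
    title: str = Field(description="Document title")
    topics: list[str] = Field(description="Main topics covered")
    summary: str = Field(description="One-sentence summary")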

The HN research agent built by Typedef demonstrates this pattern: it uses one model for batch summarization and another for synthesis, letting each provider handle what it does best.

Model Profiles for Fine-Grained Control

Profiles let you configure multiple execution modes for the same model without duplicating configuration. This is useful for reasoning models where you want to balance speed versus quality.

python
config = SemanticConfig(
    language_models={
        "o4": OpenAILanguageModel(
            model_name="o4-mini",
            rpm=1000,
            tpm=1000000,
            profiles={
                "fast": OpenAILanguageModel.Profile(reasoning_effort="low"),
                "thorough": OpenAILanguageModel.Profile(reasoning_effort="high")
            },
            default_profile="fast"
        ),
        "claude": AnthropicLanguageModel(
            model_name="claude-opus-4-0",
            rpm=100,
            input_tpm=100000,
            output_tpm=50000,
            profiles={
                "thinking_disabled": AnthropicLanguageModel.Profile(),
                "fast": AnthropicLanguageModel.Profile(thinking_token_budget=1024),
                "deep": AnthropicLanguageModel.Profile(thinking_token_budget=4096)
            },
            default_profile="fast"
        )
    },
    default_language_model="o4"
)

session = fc.Session.get_or_create(config)

Use profiles by passing a ModelAlias object instead of a string:

python
import fenic as fc
from fenic.core.types.semantic import ModelAlias

# Use the default "fast" profile
result = df.select(
    fc.semantic.map(
        "Analyze {{ text }}",
        text=fc.col("content"),
        model_alias="o4"
    )
)

# Override with the "thorough" profile for specific operations
detailed_result = df.select(
    fc.semantic.map(
        "Analyze {{ text }}",
        text=fc.col("content"),
        model_alias=ModelAlias(name="o4", profile="thorough")
    )
)

Profiles reduce configuration duplication and make it easy to experiment with different parameter combinations.

OpenRouter for Dynamic Provider Routing

OpenRouter aggregates multiple providers behind a single API. Fenic supports OpenRouter with provider-aware routing, letting you specify whether requests should route based on price, latency, or specific provider constraints.

python
from fenic.api.session.config import OpenRouterLanguageModel

config = SemanticConfig(
    language_models={
        "cheap": OpenRouterLanguageModel(
            model_name="openai/gpt-oss-20b",
            profiles={
                "default": OpenRouterLanguageModel.Profile(
                    provider=OpenRouterLanguageModel.Provider(
                        sort="price"  # Route to cheapest provider
                    )
                )
            },
            default_profile="default"
        ),
        "fast": OpenRouterLanguageModel(
            model_name="anthropic/claude-sonnet-4-0-latest",
            profiles={
                "default": OpenRouterLanguageModel.Profile(
                    provider=OpenRouterLanguageModel.Provider(
                        sort="latency",  # Route to fastest provider
                        only=["Anthropic"]  # Ensure direct Anthropic routing
                    )
                )
            },
            default_profile="default"
        )
    },
    default_language_model="cheap"
)

OpenRouter's provider routing lets you optimize for different dimensions without changing your code. The Fenic 0.5.0 release added full OpenRouter support with structured outputs and flexible provider configuration.

Setting Environment Variables

Fenic reads API keys from environment variables. Set the keys for all providers you plan to use:

bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=...
export COHERE_API_KEY=...
export OPENROUTER_API_KEY=sk-or-...

The Fenic 0.4.0 release introduced provider key validation at session creation. If keys are missing or invalid, Fenic fails immediately with clear errors rather than waiting until runtime.
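
If you want an explicit pre-flight check in your own deployment scripts, a minimal sketch (check only the variables your configuration actually uses):

python
import os

# Hypothetical startup check; session creation below also validates configured keys
required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing API keys: {', '.join(missing)}")

session = fc.Session.get_or_create(config)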

Batch Processing with Concurrent Inference

Fenic handles batch inference automatically through its DataFrame API. When you apply semantic operations to a column, Fenic batches requests, manages rate limits, and processes rows concurrently.

python
# Read documents
docs = session.read.docs("data/*.md", content_type="markdown")

# Extract structured data with automatic batching
results = docs.select(
    fc.col("file_path"),
    fc.semantic.extract(
        fc.col("content"),
        response_format=DocumentMetadata,
        model_alias="gpt4"
    ).alias("metadata")
)

# Fenic automatically:
# - Batches API requests
# - Respects rate limits
# - Handles retries
# - Manages concurrency
results.show()

For custom async operations, use async UDFs with configurable concurrency:

python
import aiohttp
from fenic.api.functions import async_udf
from fenic.core.types import IntegerType

@async_udf(
    return_type=IntegerType,
    max_concurrency=20,  # Control parallelism
    timeout_seconds=5,
    num_retries=2
)
async def fetch_score(user_id: int) -> int:
    async with aiohttp.ClientSession() as session:
        async with session.get(f"https://api.example.com/score/{user_id}") as resp:
            data = await resp.json()
            return data["score"]

df = df.select(
    fc.col("user_id"),
    fetch_score(fc.col("user_id")).alias("score")
)

The async UDF feature in Fenic 0.4.0 enables controlled concurrency for I/O-bound workloads while maintaining DataFrame semantics.

Monitoring Inference Costs and Performance

Fenic tracks inference metrics locally. Query the built-in metrics table to analyze costs, latency, and token usage per model.

python
metrics = session.table("fenic_system.query_metrics")

# View recent queries with their costs
metrics.select(
    "session_id",
    "execution_id",
    "total_lm_cost",
    "total_lm_requests",
    "execution_time_ms",
    "end_ts"
).show()

# Analyze total costs over time
cost_analysis = session.sql("""
    SELECT
        CAST(SUM(total_lm_cost) AS DOUBLE) AS total_cost,
        CAST(SUM(total_lm_requests) AS DOUBLE) AS total_requests,
        COUNT(*) AS query_count
    FROM {df}
    WHERE total_lm_cost > 0
""", df=metrics)
cost_analysis.show()

This visibility helps you optimize provider selection based on actual performance data.
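
For example, you can rank executions by model cost to find the workloads worth moving to a cheaper provider. A sketch using the same metrics table and columns shown above:

python
# Surface the most expensive executions first
expensive = session.sql("""
    SELECT execution_id, total_lm_cost, total_lm_requests, execution_time_ms
    FROM {df}
    WHERE total_lm_cost > 0
    ORDER BY total_lm_cost DESC
    LIMIT 10
""", df=metrics)
expensive.show()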

Provider-Specific Capabilities

Different providers offer unique features. Fenic exposes these through the semantic API.

PDF Processing with Gemini

Google's Gemini models support native file processing. Use semantic.parse_pdf to convert PDFs to markdown:

python
from fenic.api.session.config import (
    SessionConfig,
    SemanticConfig,
    GoogleDeveloperLanguageModel
)

session = fc.Session.get_or_create(
    SessionConfig(
        app_name="pdf_processing",
        semantic=SemanticConfig(
            language_models={
                "gemini": GoogleDeveloperLanguageModel(
                    model_name="gemini-2.0-flash",
                    rpm=100,
                    tpm=1000
                )
            },
            default_language_model="gemini"
        )
    )
)

pdfs = session.read.pdf_metadata("data/**/*.pdf", recursive=True)
markdown = pdfs.select(
    fc.col("file_path"),
    fc.semantic.parse_pdf(
        fc.col("file_path"),
        page_separator="--- PAGE {page} ---",
        describe_images=True
    ).alias("markdown")
)

The PDF parsing capability in Fenic 0.5.0 uses Gemini's native file API for efficient processing.

Structured Output Strategies

Configure how Fenic requests structured outputs:

python
OpenRouterLanguageModel(
    model_name="anthropic/claude-sonnet-4-0-latest",
    structured_output_strategy="prefer_response_format",  # or "prefer_tools"
    profiles={...}
)

This controls whether Fenic uses the provider's native response format API or tool-calling for structured outputs.

Real-World Pattern: Research Agent with Multiple Providers

The HN research agent demonstrates production patterns for multi-provider inference:

python
# Configure providers for different workloads
config = SemanticConfig(
    language_models={
        "summarizer": GoogleDeveloperLanguageModel(
            model_name="gemini-2.0-flash",  # Fast, cheap for batch summaries
            rpm=100,
            tpm=10000
        ),
        "synthesizer": AnthropicLanguageModel(
            model_name="claude-opus-4-0",  # Deep reasoning for synthesis
            rpm=50,
            input_tpm=50000,
            output_tpm=25000
        )
    },
    default_language_model="summarizer"
)

# Batch summarization with Gemini
summaries = stories.select(
    fc.semantic.map(
        "Summarize: {{ text }}",
        text=fc.col("text"),
        model_alias="summarizer"
    ).alias("summary")
)

# Final synthesis with Claude
report = summaries.select(
    fc.semantic.map(
        "Synthesize these findings: {{ summary }}",
        summary=fc.col("summary"),
        model_alias="synthesizer"
    ).alias("report")
)

This pattern offloads bulk inference to a cost-effective provider while reserving expensive models for high-value synthesis tasks.

Best Practices

Set appropriate rate limits: Configure RPM and TPM based on your API tier. Setting them too low throttles your pipeline unnecessarily; setting them above your actual tier risks 429 errors from the provider.

Use model aliases consistently: Pick descriptive names like "fast-summarizer" or "complex-reasoner" rather than provider names. This makes it easier to swap providers later.
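
For example, a configuration keyed by role rather than vendor, reusing the model classes from earlier (rate limits here are illustrative):

python
from fenic.api.session.config import (
    SemanticConfig,
    AnthropicLanguageModel,
    GoogleDeveloperLanguageModel
)

# Swapping the provider behind "fast-summarizer" later only touches this config
semantic = SemanticConfig(
    language_models={
        "fast-summarizer": GoogleDeveloperLanguageModel(
            model_name="gemini-2.0-flash", rpm=100, tpm=10000
        ),
        "complex-reasoner": AnthropicLanguageModel(
            model_name="claude-opus-4-0", rpm=50, input_tpm=50000, output_tpm=25000
        )
    },
    default_language_model="fast-summarizer"
)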

Monitor metrics regularly: Check the metrics table to identify cost spikes or latency issues. Use this data to adjust provider selection.

Validate API keys at startup: Fenic validates keys during session creation. Run a quick health check before production deployments to catch configuration issues early.

Leverage profiles for experimentation: Test different reasoning efforts or thinking budgets without creating separate model configurations.

Match providers to workloads: Use fast, cheap models for bulk processing. Reserve expensive models for complex reasoning or synthesis tasks.

Consider OpenRouter for flexibility: If you need dynamic routing or want to avoid vendor lock-in, configure OpenRouter with appropriate routing policies.

Error Handling and Reliability

Fenic includes several reliability features:

Token capacity guardrails: Fenic validates that requests fit within model limits before sending them, preventing mid-pipeline failures.

Automatic retries: Transient errors trigger exponential backoff. Configure retry behavior through async UDF parameters.

Fail-fast on quota exhaustion: OpenAI 429 "quota exhausted" errors fail immediately with clear messages rather than retrying indefinitely.

Batch client stability: Provider-specific fixes ensure predictable behavior across OpenAI, Anthropic, Google, and OpenRouter clients.

These features are documented in the Fenic 0.5.0 release notes.

Going Further

Multi-provider management in Fenic extends beyond configuration. The framework supports:

  • Embedding models from multiple providers alongside language models (see the sketch after this list)
  • Cloud execution for distributed processing with cloud configuration
  • MCP server integration for exposing Fenic tools to agents and assistants
  • Declarative tool creation for building agent tools backed by DataFrame queries
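
Embedding models are configured next to language models in the same SemanticConfig. A sketch, assuming the OpenAIEmbeddingModel config class and the embedding_models/default_embedding_model fields (model names and limits are illustrative):

python
from fenic.api.session.config import (
    SemanticConfig,
    OpenAILanguageModel,
    OpenAIEmbeddingModel
)

# Assumes embedding_models / default_embedding_model on SemanticConfig
config = SemanticConfig(
    language_models={
        "gpt4": OpenAILanguageModel(model_name="gpt-4.1-nano", rpm=100, tpm=10000)
    },
    embedding_models={
        "embed-small": OpenAIEmbeddingModel(
            model_name="text-embedding-3-small", rpm=100, tpm=100000
        )
    },
    default_embedding_model="embed-small"
)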

Summary

Fenic's multi-provider support gives you the flexibility to optimize every workload. Configure providers through SemanticConfig, enforce rate limits automatically, switch models with aliases, and track performance through built-in metrics.

Start with a simple multi-provider setup, use the right model for each task, and let Fenic handle the operational complexity. Install Fenic with pip install fenic and check the official documentation to get started.
