How to Tackle Chunking and Context Windows in LLM Data Pipelines

Typedef Team

Context windows and chunking represent two of the most practical constraints in building LLM data pipelines. Every model has a finite token limit, and every document eventually exceeds it. The question isn't whether you'll hit these limits but how you'll handle them when you do.

This guide walks through concrete strategies for managing these constraints using Fenic, a DataFrame framework built for AI applications. You'll learn when to chunk, how to chunk intelligently, and how to work within context window limits without sacrificing data quality or pipeline reliability.

The Context Window Constraint

Context windows define the maximum number of tokens a model can process in a single request. GPT-4 handles 128K tokens. Claude processes 200K. Gemini can manage 2M tokens. These numbers sound large until you try to process a 500-page technical manual or a month of support tickets.

The constraint manifests in three ways:

  • Input limits cap how much text you can send to a model. A single API call fails if your prompt exceeds the model's context window, regardless of how well-crafted your prompt might be
  • Output constraints restrict generation length. Models allocate tokens between input and output. A prompt consuming 100K tokens leaves little room for generation, even if the model theoretically supports 128K total tokens (see the budget sketch after this list)
  • Cost and latency scale with token count. Processing 100K tokens costs more and takes longer than processing 10K tokens. Unnecessarily large inputs waste resources even when they fit within limits
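
To make the budget arithmetic concrete, here is a back-of-the-envelope check. The tiktoken tokenizer and the manual.txt file are stand-ins for illustration; inside a Fenic pipeline, the fc.text.count_tokens function shown later plays the same role.

python
import tiktoken

CONTEXT_WINDOW = 128_000      # model's total token limit
RESERVED_OUTPUT = 4_000       # tokens reserved for the model's response
PROMPT_OVERHEAD = 500         # rough size of system prompt and instructions

enc = tiktoken.get_encoding("cl100k_base")
document_tokens = len(enc.encode(open("manual.txt").read()))  # hypothetical file

# Input must leave room for the prompt template and the reserved output
input_budget = CONTEXT_WINDOW - RESERVED_OUTPUT - PROMPT_OVERHEAD
print(f"Document: {document_tokens} tokens, input budget: {input_budget} tokens")
print("Fits in one call" if document_tokens <= input_budget else "Needs chunking")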

The Fenic framework addresses these constraints by treating token management as a first-class data operation, not an afterthought bolted onto traditional data processing tools.

Chunking Strategies for Different Content Types

Fenic provides multiple chunking approaches through its text processing functions. The right strategy depends on your content structure and downstream operations.

Character-Based Chunking

Character chunking splits text into fixed-size segments by character count. Use this when dealing with unstructured content where semantic boundaries don't matter or when you need predictable chunk sizes.

python
import fenic as fc

# Chunk documents into 1000-character segments with 20% overlap
df = session.create_dataframe({
    "doc_id": ["doc1", "doc2"],
    "content": ["long text...", "more text..."]
})

chunked = df.select(
    fc.col("doc_id"),
    fc.text.character_chunk(
        fc.col("content"),
        chunk_size=1000,
        chunk_overlap_percentage=20
    ).alias("chunks")
)

The overlap parameter helps preserve context across chunk boundaries. A 20% overlap means adjacent chunks share 200 characters when chunk_size is 1000, reducing the risk of splitting important information across boundaries where neither chunk has complete context.
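
The chunking functions return an array of chunks per input row. A common follow-up, used throughout the examples below, is to explode that array so each chunk becomes its own row; a minimal sketch:

python
# One row per chunk; downstream operators then work chunk-by-chunk
chunk_rows = chunked.explode("chunks").select(
    fc.col("doc_id"),
    fc.col("chunks").alias("chunk_text")
)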

Token-Based Chunking

Token chunking respects model tokenization, making it the preferred approach when working directly with LLM APIs. Token counts vary by model and tokenizer, but token-based chunking ensures your chunks won't exceed model limits.

python
# Chunk by tokens with explicit control over size
chunked = df.select(
    fc.col("doc_id"),
    fc.text.token_chunk(
        fc.col("content"),
        chunk_size=512,
        chunk_overlap_percentage=10
    ).alias("chunks")
)

Set chunk sizes below your model's context window to leave room for prompts and system instructions. A 512-token chunk in a 4K context window leaves 3,500+ tokens for your prompt template and generation.
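
To verify that headroom explicitly, you can measure each chunk with the same fc.text.count_tokens function used later in this guide; a minimal sketch:

python
# Sanity-check the budget: explode to one row per chunk and measure each one
oversized = (
    chunked
    .explode("chunks")
    .with_column("chunk_tokens", fc.text.count_tokens(fc.col("chunks")))
    .filter(fc.col("chunk_tokens") > 512)
)
oversized.show()  # ideally empty; counts can drift slightly across tokenizers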

Word-Based Chunking

Word chunking provides a middle ground between character and token approaches. It's faster than token counting while more semantically aware than character splitting.

python
# Chunk by word count
chunked = df.select(
    fc.col("doc_id"),
    fc.text.word_chunk(
        fc.col("content"),
        chunk_size=200,
        chunk_overlap_percentage=15
    ).alias("chunks")
)

Word chunking works well for content where word boundaries align with meaningful breaks, like articles or support tickets. It's less useful for code or highly technical content where word boundaries might split identifiers or syntax.

Structure-Aware Chunking for Markdown

For structured content like documentation, header-based chunking preserves semantic boundaries while respecting context limits. This approach splits documents at heading levels, keeping related content together.

python
# Split markdown by heading level
docs = session.read.docs("docs/**/*.md", content_type="markdown", recursive=True)

chunked = docs.select(
    fc.col("file_path"),
    fc.markdown.extract_header_chunks(
        fc.col("content"),
        header_level=2
    ).alias("chunks")
).explode("chunks")

# Access chunk components
processed = chunked.select(
    fc.col("file_path"),
    fc.col("chunks").heading.alias("section_title"),
    fc.col("chunks").content.alias("section_text"),
    fc.col("chunks").full_path.alias("breadcrumb")
)

Header-based chunking includes parent heading context automatically. A chunk from a level-3 heading knows its level-2 parent, providing hierarchical context without manual tracking. The full_path field gives you a breadcrumb trail showing where each chunk sits in the document structure.

Managing Context Windows at Scale

Chunking solves the input size problem, but you still need strategies for processing chunks efficiently while staying within token limits.

Deduplicate Before Processing

Identical or near-identical content wastes tokens and money. Deduplicate before sending anything to a model.

python
# Deduplicate on chunk content so each unique chunk is processed only once
# (for very long chunks, deduplicating on a content hash works the same way)
unique_chunks = chunked.drop_duplicates(["content"])

In the log clustering example covered in Typedef's documentation, deduplication reduced processing volume by 60% on production workloads. Similar errors generate similar log messages, and you only need to process each unique pattern once.

Batch by Token Budget

Group chunks into batches that respect your model's context window. Fenic's group operations make this straightforward.

python
# Calculate token counts per chunk
with_tokens = unique_chunks.with_column(
    "token_count",
    fc.text.count_tokens(fc.col("content"))
)

# Process in batches that fit within context window
MAX_TOKENS = 4000  # Leave room for prompt

batched = (
    with_tokens
    .with_column("running_tokens", fc.sum("token_count").over(
        fc.Window.order_by("doc_id")
    ))
    .with_column("batch_id", (fc.col("running_tokens") / MAX_TOKENS).cast(fc.IntegerType))
    .group_by("batch_id")
)

This pattern keeps each batch near the token budget while maximizing throughput; a chunk that straddles a boundary can push its batch slightly over, so leave headroom in MAX_TOKENS. Fenic's window functions handle the running token calculation efficiently across large datasets.
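
Because group_by returns a grouped frame, the usual next step is an aggregation. A minimal sketch that collapses each batch with semantic.reduce (covered in the next section):

python
# Collapse each batch with a single aggregation; semantic.reduce is one way
# to consolidate the grouped chunks into a single LLM call per batch
batch_summaries = batched.agg(
    fc.semantic.reduce(
        "Summarize the key points from these chunks",
        fc.col("content")
    ).alias("batch_summary")
)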

Use Semantic Reduction for Aggregation

When you need to consolidate information from multiple chunks, semantic reduction combines them intelligently while respecting output token limits.

python
# Reduce chunks within token budget
summaries = (
    chunked
    .group_by("doc_id")
    .agg(
        fc.semantic.reduce(
            "Summarize the key points from these document sections",
            fc.col("chunks").content,
            order_by=fc.col("chunks").level,
            max_output_tokens=512
        ).alias("summary")
    )
)

The semantic.reduce operator handles token management internally. It sorts chunks by the order_by parameter, fits as many as possible within the context window, and generates a coherent summary without exceeding max_output_tokens.

Handling PDF Documents and Long-Form Content

PDFs present unique challenges for context window management. Page counts don't correlate directly with token counts, and layout affects parsing efficiency.

Parse PDFs with Page Control

The semantic.parse_pdf function converts PDFs to markdown while giving you explicit control over page segmentation.

python
# Load PDF metadata first
pdfs = session.read.pdf_metadata("reports/**/*.pdf", recursive=True)

# Filter by page count to avoid massive documents
reasonable_pdfs = pdfs.filter(fc.col("page_count") < 100)

# Parse with page separators
markdown = reasonable_pdfs.select(
    fc.col("file_path"),
    fc.semantic.parse_pdf(
        fc.col("file_path"),
        page_separator="--- PAGE {page} ---",
        describe_images=True
    ).alias("markdown_content")
)

The page_separator parameter inserts markers at page boundaries, enabling you to split parsed content into manageable chunks without losing page context. The {page} placeholder gets replaced with actual page numbers, making it easy to track which chunk came from which page.
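
If you want one row per page rather than per document, you can split the parsed markdown on that marker. A minimal sketch, assuming a regex-capable split function along the lines of fc.text.split; check the Fenic text API for the exact name:

python
# Split parsed markdown at the page markers, then explode to one row per page
# NOTE: fc.text.split is an assumption; substitute whichever split function
# your Fenic version provides
pages = markdown.select(
    fc.col("file_path"),
    fc.text.split(fc.col("markdown_content"), r"--- PAGE \d+ ---").alias("pages")
).explode("pages")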

Chunk Parsed Content Intelligently

After parsing, apply structure-aware chunking to split the content at natural boundaries.

python
# Split parsed PDFs into logical sections
sections = markdown.select(
    fc.col("file_path"),
    fc.markdown.extract_header_chunks(
        fc.col("markdown_content"),
        header_level=2
    ).alias("sections")
).explode("sections")

# Process sections that fit within context window
TOKEN_LIMIT = 4000

processable = (
    sections
    .with_column("token_count", fc.text.count_tokens(
        fc.col("sections").content
    ))
    .filter(fc.col("token_count") < TOKEN_LIMIT)
)

This approach combines PDF parsing with intelligent chunking. You parse once, then split the structured output at semantic boundaries that respect both document structure and context window constraints.

Optimizing Token Usage with Embeddings

Embeddings provide a token-efficient alternative for many operations. Instead of sending full text to an LLM repeatedly, generate embeddings once and use them for similarity comparisons, clustering, and retrieval.

Generate Embeddings for Chunks

Create vector representations of your chunks to enable semantic operations without consuming context windows.

python
# Generate embeddings for all chunks
embedded = chunked.select(
    fc.col("doc_id"),
    fc.col("chunk_id"),
    fc.col("content"),
    fc.semantic.embed(fc.col("content")).alias("embedding")
)

Embeddings consume tokens once during generation but enable unlimited comparisons afterward. Whether a chunk is 100 tokens or 1,000, it produces a fixed-size embedding vector, and comparing embeddings costs essentially nothing.

Cluster Similar Content

Use embeddings to group similar chunks, reducing redundant LLM processing.

python
# Cluster chunks by semantic similarity
clustered = embedded.semantic.with_cluster_labels(
    by=fc.col("embedding"),
    num_clusters=50,
    label_column="cluster_id",
    centroid_column="centroid"
)

# Keep one representative chunk per cluster: the chunk most similar to its centroid
representatives = (
    clustered
    .with_column("centroid_similarity", fc.embedding.compute_similarity(
        fc.col("embedding"),
        fc.col("centroid"),
        metric="cosine"
    ))
    .sort(fc.col("centroid_similarity").desc())
    .drop_duplicates(["cluster_id"])
)

This pattern appeared in the log triage example. Instead of processing 10,000 similar log messages, you cluster them into 100 groups and process only representative messages, cutting token usage by 99%.

Retrieve Relevant Context

When working with large document collections, use embeddings to retrieve only relevant chunks before sending anything to an LLM.

python
# Generate query embedding
query_df = session.create_dataframe({
    "query": ["How do I configure authentication?"]
})

query_embedded = query_df.select(
    fc.semantic.embed(fc.col("query")).alias("query_embedding")
)

# Find similar chunks
relevant = (
    embedded
    .join(query_embedded, how="cross")
    .with_column("similarity", fc.embedding.compute_similarity(
        fc.col("embedding"),
        fc.col("query_embedding"),
        metric="cosine"
    ))
    .sort(fc.col("similarity").desc())
    .limit(10)
)

This retrieval-augmented approach sends only the most relevant chunks to your LLM, staying well within context limits while maintaining answer quality. Ten carefully selected chunks provide better results than hundreds of marginally relevant chunks, even when both fit within the context window.

Cost and Latency Considerations

Token management directly impacts both costs and pipeline performance. Fenic's architecture optimizes both through automatic batching and intelligent request planning.

Configure Rate Limits Appropriately

Set token-per-minute (TPM) and request-per-minute (RPM) limits based on your provider tier and workload characteristics.

python
config = fc.SessionConfig(
    app_name="doc_processor",
    semantic=fc.SemanticConfig(
        language_models={
            "gpt4": fc.OpenAILanguageModel(
                model_name="gpt-4.1-nano",
                rpm=500,
                tpm=200_000
            ),
            "gemini": fc.GoogleDeveloperLanguageModel(
                model_name="gemini-2.0-flash",
                rpm=100,
                tpm=1_000_000
            )
        },
        default_language_model="gemini"
    )
)

session = fc.Session.get_or_create(config)

Fenic automatically throttles requests to stay within these limits, preventing rate limit errors and optimizing throughput. The framework batches requests intelligently, grouping operations that can share context to reduce total token consumption.

Monitor Token Usage

Track token consumption across your pipeline to identify optimization opportunities.

python
# Process with metrics tracking
result = df.select(
    fc.semantic.extract(fc.col("content"), YourSchema)
).collect()

# Per-query metrics are also recorded in a queryable system table
metrics_df = session.table("fenic_system.query_metrics")

# The result returned by collect() carries aggregate LLM metrics
print(f"Total input tokens: {result.lm_metrics.total_input_tokens}")
print(f"Total output tokens: {result.lm_metrics.total_output_tokens}")
print(f"Total cost: ${result.lm_metrics.estimated_cost}")

The metrics system provides granular visibility into token consumption at the operator level. You can see exactly which operations consume the most tokens and optimize accordingly.

Cache Expensive Operations

Persist intermediate results to avoid recomputing expensive operations during development and iteration.

python
# Cache embeddings for reuse
embedded = (
    chunked
    .select(fc.semantic.embed(fc.col("content")).alias("embedding"))
    .persist()
)

# Reuse the cached embeddings across multiple operations
clusters = embedded.semantic.with_cluster_labels(by="embedding", num_clusters=10)
# similar = embedded.filter(...)  # e.g., keep rows above a similarity threshold

The persist() method caches results after first computation. Subsequent operations use cached values instead of regenerating embeddings, dramatically reducing token consumption during iterative development.

Practical Patterns for Production Pipelines

These patterns combine chunking and context window management into production-ready workflows.

Progressive Processing Pipeline

Handle documents of varying sizes by routing them through appropriate processing paths.

python
# Classify documents by token count
docs = session.read.docs("data/**/*.md", content_type="markdown", recursive=True)

with_tokens = docs.with_column(
    "token_count",
    fc.text.count_tokens(fc.col("content"))
)

# Route based on size
small_docs = with_tokens.filter(fc.col("token_count") < 4000)
large_docs = with_tokens.filter(fc.col("token_count") >= 4000)

# Process small docs directly
small_results = small_docs.select(
    fc.semantic.extract(fc.col("content"), ExtractSchema)
)

# Chunk large docs first
large_chunked = large_docs.select(
    fc.markdown.extract_header_chunks("content", header_level=2).alias("chunks")
).explode("chunks")

large_results = large_chunked.select(
    fc.semantic.extract(fc.col("chunks").content, ExtractSchema)
)

This routing pattern processes simple cases efficiently while handling large cases correctly. Small documents skip unnecessary chunking overhead, while large documents get split intelligently before processing.
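
Once both branches produce the same output schema, they can be recombined for downstream steps. A minimal sketch, assuming Fenic DataFrames expose a union operation:

python
# Recombine the routed branches once their schemas match
# NOTE: union is an assumption; adjust to the combine operation your
# Fenic version exposes if the name differs
all_results = small_results.union(large_results)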

Hierarchical Summarization

Summarize long documents by processing chunks independently, then aggregating summaries.

python
# Chunk document
chunked = df.select(
    fc.col("doc_id"),
    fc.markdown.extract_header_chunks("content", header_level=2).alias("chunks")
).explode("chunks")

# Summarize each chunk
chunk_summaries = chunked.select(
    fc.col("doc_id"),
    fc.semantic.summarize(
        fc.col("chunks").content,
        format=fc.KeyPoints(max_points=5)
    ).alias("chunk_summary")
)

# Aggregate chunk summaries into document summary
doc_summaries = (
    chunk_summaries
    .group_by("doc_id")
    .agg(
        fc.semantic.reduce(
            "Combine these section summaries into a coherent document summary",
            fc.col("chunk_summary"),
            max_output_tokens=1024
        ).alias("final_summary")
    )
)

Hierarchical summarization respects context windows at each level while producing high-quality summaries. Each chunk summary fits comfortably within model limits, and the final aggregation combines them without exceeding output constraints.

Semantic Search with Context Assembly

Build retrieval pipelines that find relevant chunks and assemble them into coherent context for generation.

python
# Index documents
indexed = docs.select(
    fc.col("doc_id"),
    fc.col("content"),
    fc.semantic.embed(fc.col("content")).alias("embedding")
).persist()

# Query processing function
def search_and_generate(query: str, max_context_tokens: int = 6000):
    # Embed query
    query_vec = session.create_dataframe({"q": [query]}).select(
        fc.semantic.embed(fc.col("q")).alias("qvec")
    ).collect()[0]["qvec"]

    # Find relevant chunks
    relevant = (
        indexed
        .with_column("similarity", fc.embedding.compute_similarity(
            fc.col("embedding"),
            fc.lit(query_vec),
            metric="cosine"
        ))
        .sort(fc.col("similarity").desc())
        .with_column("tokens", fc.text.count_tokens(fc.col("content")))
        .with_column("running_tokens", fc.sum("tokens").over(
            fc.Window.order_by(fc.col("similarity").desc())
        ))
        .filter(fc.col("running_tokens") <= max_context_tokens)
    )

    # Generate answer using assembled context
    return relevant.select(
        fc.semantic.map(
            "Answer this question using the provided context: {{query}}\n\nContext: {{context}}",
            query=fc.lit(query),
            context=fc.col("content")
        )
    )

This pattern retrieves only what fits within your context budget, ordered by relevance. The running token calculation ensures you never exceed limits while maximizing the useful context sent to the model.

When to Use Each Approach

Different scenarios require different chunking and context management strategies:

Use character chunking when:

  • Processing unstructured text where semantic boundaries don't exist
  • You need precisely controlled chunk sizes for downstream processing that expects fixed-size inputs

Use token chunking when:

  • Working directly with LLM APIs where precise token counts matter for cost calculation and rate limiting
  • You need guaranteed fit within specific model context windows

Use word chunking when:

  • You need a fast approximation and semantic boundaries align with word breaks
  • Exact token counts aren't critical, which makes it useful for initial prototyping before optimizing for production

Use structure-aware chunking when:

  • Documents have clear hierarchical structure like markdown, HTML, or JSON
  • Preserving semantic relationships improves downstream task quality

Use embeddings when:

  • You need to compare, cluster, or retrieve content across large datasets without repeatedly consuming context windows
  • Building retrieval-augmented generation systems

Use semantic reduction when:

  • You need to consolidate information from multiple chunks while respecting output token limits
  • Building hierarchical summarization or aggregation tasks

Getting Started

Start building token-efficient pipelines with Fenic:

Install Fenic:

bash
pip install fenic

Set up your environment:

bash
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GOOGLE_API_KEY=...

Create your first chunking pipeline:

python
import fenic as fc

session = fc.Session.get_or_create(
    fc.SessionConfig(app_name="chunking_demo")
)

# Load and chunk documents
docs = session.read.docs("data/**/*.md", content_type="markdown", recursive=True)

chunked = docs.select(
    fc.col("file_path"),
    fc.markdown.extract_header_chunks("content", header_level=2).alias("chunks")
).explode("chunks")

# Generate embeddings
embedded = chunked.select(
    fc.col("file_path"),
    fc.col("chunks").content.alias("text"),
    fc.semantic.embed(fc.col("chunks").content).alias("embedding")
)

embedded.show()

For more examples and detailed documentation, explore Typedef's resources.
