Context windows and chunking represent two of the most practical constraints in building LLM data pipelines. Every model has a finite token limit, and every document eventually exceeds it. The question isn't whether you'll hit these limits but how you'll handle them when you do.
This guide walks through concrete strategies for managing these constraints using Fenic, a DataFrame framework built for AI applications. You'll learn when to chunk, how to chunk intelligently, and how to work within context window limits without sacrificing data quality or pipeline reliability.
The Context Window Constraint
Context windows define the maximum number of tokens a model can process in a single request. GPT-4 handles 128K tokens. Claude processes 200K. Gemini can manage 2M tokens. These numbers sound large until you try to process a 500-page technical manual or a month of support tickets.
The constraint manifests in three ways:
- Input limits cap how much text you can send to a model. A single API call fails if your prompt exceeds the model's context window, regardless of how well-crafted your prompt might be
- Output constraints restrict generation length. Models allocate tokens between input and output. A prompt consuming 100K tokens leaves little room for generation, even if the model theoretically supports 128K total tokens
- Cost and latency scale with token count. Processing 100K tokens costs more and takes longer than processing 10K tokens. Unnecessarily large inputs waste resources even when they fit within limits
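The arithmetic behind these limits is worth checking before every call. Here is a minimal pre-flight budget check in plain Python; it is illustrative rather than part of Fenic, and it assumes a cl100k-style tokenizer via tiktoken.

```python
import tiktoken

def fits_context_window(prompt: str, context_window: int = 128_000,
                        max_output_tokens: int = 4_096) -> bool:
    """Return True if the prompt plus the reserved output budget fits the window."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: cl100k-style tokenizer
    input_tokens = len(enc.encode(prompt))
    return input_tokens + max_output_tokens <= context_window
```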
The Fenic framework addresses these constraints by treating token management as a first-class data operation, not an afterthought bolted onto traditional data processing tools.
Chunking Strategies for Different Content Types
Fenic provides multiple chunking approaches through its text processing functions. The right strategy depends on your content structure and downstream operations.
Character-Based Chunking
Character chunking splits text into fixed-size segments by character count. Use this when dealing with unstructured content where semantic boundaries don't matter or when you need predictable chunk sizes.
```python
import fenic as fc

# Chunk documents into 1000-character segments with 20% overlap
df = session.create_dataframe({
    "doc_id": ["doc1", "doc2"],
    "content": ["long text...", "more text..."]
})

chunked = df.select(
    fc.col("doc_id"),
    fc.text.character_chunk(
        fc.col("content"),
        chunk_size=1000,
        chunk_overlap_percentage=20
    ).alias("chunks")
)
```
The overlap parameter helps preserve context across chunk boundaries. A 20% overlap means adjacent chunks share 200 characters when chunk_size is 1000, reducing the risk of splitting important information across boundaries where neither chunk has complete context.
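To make the overlap arithmetic concrete, here is a plain-Python sketch of fixed-size character chunking with percentage overlap. It is not Fenic's implementation, just the underlying idea:

```python
def character_chunks(text: str, chunk_size: int = 1000, overlap_percentage: int = 20) -> list[str]:
    """Fixed-size chunks where adjacent chunks share overlap_percentage of chunk_size."""
    overlap = chunk_size * overlap_percentage // 100  # 1000 chars * 20% = 200 shared chars
    stride = chunk_size - overlap                     # each chunk starts 800 chars after the previous one
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), stride)]
```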
Token-Based Chunking
Token chunking respects model tokenization, making it the preferred approach when working directly with LLM APIs. Token counts vary by model and tokenizer, but token-based chunking ensures your chunks won't exceed model limits.
```python
# Chunk by tokens with explicit control over size
chunked = df.select(
    fc.col("doc_id"),
    fc.text.token_chunk(
        fc.col("content"),
        chunk_size=512,
        chunk_overlap_percentage=10
    ).alias("chunks")
)
```
Set chunk sizes below your model's context window to leave room for prompts and system instructions. A 512-token chunk in a 4K context window leaves 3,500+ tokens for your prompt template and generation.
Word-Based Chunking
Word chunking provides a middle ground between character and token approaches: it's faster than token counting and more semantically aware than character splitting.
```python
# Chunk by word count
chunked = df.select(
    fc.col("doc_id"),
    fc.text.word_chunk(
        fc.col("content"),
        chunk_size=200,
        chunk_overlap_percentage=15
    ).alias("chunks")
)
```
Word chunking works well for content where word boundaries align with meaningful breaks, like articles or support tickets. It's less useful for code or highly technical content where word boundaries might split identifiers or syntax.
Structure-Aware Chunking for Markdown
For structured content like documentation, header-based chunking preserves semantic boundaries while respecting context limits. This approach splits documents at heading levels, keeping related content together.
```python
# Split markdown by heading level
docs = session.read.docs("docs/**/*.md", content_type="markdown", recursive=True)

chunked = docs.select(
    fc.col("file_path"),
    fc.markdown.extract_header_chunks(
        fc.col("content"),
        header_level=2
    ).alias("chunks")
).explode("chunks")

# Access chunk components
processed = chunked.select(
    fc.col("file_path"),
    fc.col("chunks").heading.alias("section_title"),
    fc.col("chunks").content.alias("section_text"),
    fc.col("chunks").full_path.alias("breadcrumb")
)
```
Header-based chunking includes parent heading context automatically. A chunk from a level-3 heading knows its level-2 parent, providing hierarchical context without manual tracking. The full_path field gives you a breadcrumb trail showing where each chunk sits in the document structure.
Managing Context Windows at Scale
Chunking solves the input size problem, but you still need strategies for processing chunks efficiently while staying within token limits.
Deduplicate Before Processing
Identical or near-identical content wastes tokens and money. Deduplicate before sending anything to a model.
```python
# Deduplicate on the content column before any LLM calls
# (for near-duplicate detection, normalize or hash the text first with a UDF)
unique_chunks = chunked.drop_duplicates(["content"])
```
In the log clustering example covered in Typedef's documentation, deduplication reduced processing volume by 60% in production workloads. Similar errors generate similar log messages, and you only need to process each unique pattern once.
Batch by Token Budget
Group chunks into batches that respect your model's context window. Fenic's group operations make this straightforward.
```python
# Calculate token counts per chunk
with_tokens = unique_chunks.with_column(
    "token_count",
    fc.text.count_tokens(fc.col("content"))
)

# Process in batches that fit within context window
MAX_TOKENS = 4000  # Leave room for prompt

batched = (
    with_tokens
    .with_column("running_tokens", fc.sum("token_count").over(
        fc.Window.order_by("doc_id")
    ))
    .with_column("batch_id",
        (fc.col("running_tokens") / MAX_TOKENS).cast(fc.IntegerType))
    .group_by("batch_id")
)
```
This pattern ensures each batch stays within token limits while maximizing throughput. Fenic's window functions handle the running token calculation efficiently across large datasets.
Use Semantic Reduction for Aggregation
When you need to consolidate information from multiple chunks, semantic reduction combines them intelligently while respecting output token limits.
```python
# Reduce chunks within token budget
summaries = (
    chunked
    .group_by("doc_id")
    .agg(
        fc.semantic.reduce(
            "Summarize the key points from these document sections",
            fc.col("chunks").content,
            order_by=fc.col("chunks").level,
            max_output_tokens=512
        ).alias("summary")
    )
)
```
The semantic.reduce operator handles token management internally. It sorts chunks by the order_by parameter, fits as many as possible within the context window, and generates a coherent summary without exceeding max_output_tokens.
Handling PDF Documents and Long-Form Content
PDFs present unique challenges for context window management. Page counts don't correlate directly with token counts, and layout affects parsing efficiency.
Parse PDFs with Page Control
The semantic.parse_pdf function converts PDFs to markdown while giving you explicit control over page segmentation.
```python
# Load PDF metadata first
pdfs = session.read.pdf_metadata("reports/**/*.pdf", recursive=True)

# Filter by page count to avoid massive documents
reasonable_pdfs = pdfs.filter(fc.col("page_count") < 100)

# Parse with page separators
markdown = reasonable_pdfs.select(
    fc.col("file_path"),
    fc.semantic.parse_pdf(
        fc.col("file_path"),
        page_separator="--- PAGE {page} ---",
        describe_images=True
    ).alias("markdown_content")
)
```
The page_separator parameter inserts markers at page boundaries, enabling you to split parsed content into manageable chunks without losing page context. The {page} placeholder gets replaced with actual page numbers, making it easy to track which chunk came from which page.
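Once the markers are in place, downstream code can recover per-page text. A minimal plain-Python sketch, assuming the separator format shown above:

```python
import re

def split_by_page(markdown_content: str) -> dict[int, str]:
    """Split parse_pdf output on '--- PAGE {page} ---' markers, keyed by page number."""
    parts = re.split(r"--- PAGE (\d+) ---", markdown_content)
    # re.split with a capture group yields [preamble, page_no, page_text, page_no, page_text, ...]
    return {int(num): text.strip() for num, text in zip(parts[1::2], parts[2::2])}
```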
Chunk Parsed Content Intelligently
After parsing, apply structure-aware chunking to split the content at natural boundaries.
```python
# Split parsed PDFs into logical sections
sections = markdown.select(
    fc.col("file_path"),
    fc.markdown.extract_header_chunks(
        fc.col("markdown_content"),
        header_level=2
    ).alias("sections")
).explode("sections")

# Process sections that fit within context window
TOKEN_LIMIT = 4000

processable = (
    sections
    .with_column("token_count", fc.text.count_tokens(
        fc.col("sections").content
    ))
    .filter(fc.col("token_count") < TOKEN_LIMIT)
)
```
This approach combines PDF parsing with intelligent chunking. You parse once, then split the structured output at semantic boundaries that respect both document structure and context window constraints.
Optimizing Token Usage with Embeddings
Embeddings provide a token-efficient alternative for many operations. Instead of sending full text to an LLM repeatedly, generate embeddings once and use them for similarity comparisons, clustering, and retrieval.
Generate Embeddings for Chunks
Create vector representations of your chunks to enable semantic operations without consuming context windows.
```python
# Generate embeddings for all chunks
embedded = chunked.select(
    fc.col("doc_id"),
    fc.col("chunk_id"),
    fc.col("content"),
    fc.semantic.embed(fc.col("content")).alias("embedding")
)
```
Embeddings consume tokens once during generation but enable unlimited comparisons afterward. Whether a chunk is 100 tokens or 1,000, it produces a fixed-size embedding vector, and comparing embeddings costs essentially nothing.
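The comparison itself is plain vector math that runs locally with no API call. A minimal NumPy sketch of cosine similarity, independent of Fenic's compute_similarity:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors; no tokens consumed."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```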
Cluster Similar Content
Use embeddings to group similar chunks, reducing redundant LLM processing.
```python
# Cluster chunks by semantic similarity
clustered = embedded.semantic.with_cluster_labels(
    by=fc.col("embedding"),
    num_clusters=50,
    label_column="cluster_id",
    centroid_column="centroid"
)

# Process one representative chunk per cluster:
# keep the chunk with the highest cosine similarity to its cluster centroid
representatives = (
    clustered
    .with_column("centroid_similarity", fc.embedding.compute_similarity(
        fc.col("embedding"),
        fc.col("centroid"),
        metric="cosine"
    ))
    .sort(fc.col("centroid_similarity").desc())
    .drop_duplicates(["cluster_id"])
)
```
This pattern appeared in the log triage example. Instead of processing 10,000 similar log messages, you cluster them into 100 groups and process only representative messages, cutting token usage by 99%.
Retrieve Relevant Context
When working with large document collections, use embeddings to retrieve only relevant chunks before sending anything to an LLM.
```python
# Generate query embedding
query_df = session.create_dataframe({
    "query": ["How do I configure authentication?"]
})

query_embedded = query_df.select(
    fc.semantic.embed(fc.col("query")).alias("query_embedding")
)

# Find similar chunks
relevant = (
    embedded
    .join(query_embedded, how="cross")
    .with_column("similarity", fc.embedding.compute_similarity(
        fc.col("embedding"),
        fc.col("query_embedding"),
        metric="cosine"
    ))
    .sort(fc.col("similarity").desc())
    .limit(10)
)
```
This retrieval-augmented approach sends only the most relevant chunks to your LLM, staying well within context limits while maintaining answer quality. Ten carefully selected chunks provide better results than hundreds of marginally relevant chunks, even when both fit within the context window.
Cost and Latency Considerations
Token management directly impacts both costs and pipeline performance. Fenic's architecture optimizes both through automatic batching and intelligent request planning.
Configure Rate Limits Appropriately
Set token-per-minute (TPM) and request-per-minute (RPM) limits based on your provider tier and workload characteristics.
```python
config = fc.SessionConfig(
    app_name="doc_processor",
    semantic=fc.SemanticConfig(
        language_models={
            "gpt4": fc.OpenAILanguageModel(
                model_name="gpt-4.1-nano",
                rpm=500,
                tpm=200_000
            ),
            "gemini": fc.GoogleDeveloperLanguageModel(
                model_name="gemini-2.0-flash",
                rpm=100,
                tpm=1_000_000
            )
        },
        default_language_model="gemini"
    )
)

session = fc.Session.get_or_create(config)
```
Fenic automatically throttles requests to stay within these limits, preventing rate limit errors and optimizing throughput. The framework batches requests intelligently, grouping operations that can share context to reduce total token consumption.
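Conceptually, a TPM limit is a rolling token budget. The sketch below illustrates the general idea of such a throttle; it is not Fenic's internal scheduler, just a simplified model of the behavior.

```python
import time

class TokenBudgetThrottle:
    """Simplified tokens-per-minute throttle: block until the current minute's budget has room."""

    def __init__(self, tpm: int):
        self.tpm = tpm
        self.window_start = time.monotonic()
        self.tokens_used = 0

    def acquire(self, request_tokens: int) -> None:
        now = time.monotonic()
        if now - self.window_start >= 60:  # a new minute: reset the budget
            self.window_start, self.tokens_used = now, 0
        if self.tokens_used + request_tokens > self.tpm:
            time.sleep(60 - (now - self.window_start))  # wait for the window to roll over
            self.window_start, self.tokens_used = time.monotonic(), 0
        self.tokens_used += request_tokens
```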
Monitor Token Usage
Track token consumption across your pipeline to identify optimization opportunities.
```python
# Process with metrics tracking
result = df.select(
    fc.semantic.extract(fc.col("content"), YourSchema)
).collect()

# Access metrics after execution via the metrics table
metrics_df = session.table("fenic_system.query_metrics")
# Query this DataFrame to analyze token usage over time

# The QueryResult also carries metrics for this execution
print(f"Total input tokens: {result.lm_metrics.total_input_tokens}")
print(f"Total output tokens: {result.lm_metrics.total_output_tokens}")
print(f"Total cost: ${result.lm_metrics.estimated_cost}")
```
The metrics system provides granular visibility into token consumption at the operator level. You can see exactly which operations consume the most tokens and optimize accordingly.
Cache Expensive Operations
Persist intermediate results to avoid recomputing expensive operations during development and iteration.
```python
# Cache embeddings for reuse
embedded = (
    chunked
    .select(fc.semantic.embed(fc.col("content")).alias("embedding"))
    .persist()
)

# Use cached embeddings in multiple operations
clusters = embedded.semantic.with_cluster_labels(by="embedding", num_clusters=10)
# similar = embedded.filter(...)  # apply whatever similarity condition you need
```
The persist() method caches results after first computation. Subsequent operations use cached values instead of regenerating embeddings, dramatically reducing token consumption during iterative development.
Practical Patterns for Production Pipelines
These patterns combine chunking and context window management into production-ready workflows.
Progressive Processing Pipeline
Handle documents of varying sizes by routing them through appropriate processing paths.
```python
# Classify documents by token count
docs = session.read.docs("data/**/*.md", content_type="markdown", recursive=True)

with_tokens = docs.with_column(
    "token_count",
    fc.text.count_tokens(fc.col("content"))
)

# Route based on size
small_docs = with_tokens.filter(fc.col("token_count") < 4000)
large_docs = with_tokens.filter(fc.col("token_count") >= 4000)

# Process small docs directly
small_results = small_docs.select(
    fc.semantic.extract(fc.col("content"), ExtractSchema)
)

# Chunk large docs first
large_chunked = large_docs.select(
    fc.markdown.extract_header_chunks("content", header_level=2).alias("chunks")
).explode("chunks")

large_results = large_chunked.select(
    fc.semantic.extract(fc.col("chunks").content, ExtractSchema)
)
```
This routing pattern processes simple cases efficiently while handling large cases correctly. Small documents skip unnecessary chunking overhead, while large documents get split intelligently before processing.
Hierarchical Summarization
Summarize long documents by processing chunks independently, then aggregating summaries.
```python
# Chunk document
chunked = df.select(
    fc.col("doc_id"),
    fc.markdown.extract_header_chunks("content", header_level=2).alias("chunks")
).explode("chunks")

# Summarize each chunk
chunk_summaries = chunked.select(
    fc.col("doc_id"),
    fc.semantic.summarize(
        fc.col("chunks").content,
        format=fc.KeyPoints(max_points=5)
    ).alias("chunk_summary")
)

# Aggregate chunk summaries into document summary
doc_summaries = (
    chunk_summaries
    .group_by("doc_id")
    .agg(
        fc.semantic.reduce(
            "Combine these section summaries into a coherent document summary",
            fc.col("chunk_summary"),
            max_output_tokens=1024
        ).alias("final_summary")
    )
)
```
Hierarchical summarization respects context windows at each level while producing high-quality summaries. Each chunk summary fits comfortably within model limits, and the final aggregation combines them without exceeding output constraints.
Semantic Search with Context Assembly
Build retrieval pipelines that find relevant chunks and assemble them into coherent context for generation.
```python
# Index documents
indexed = docs.select(
    fc.col("doc_id"),
    fc.col("content"),
    fc.semantic.embed(fc.col("content")).alias("embedding")
).persist()

# Query processing function
def search_and_generate(query: str, max_context_tokens: int = 6000):
    # Embed query
    query_vec = session.create_dataframe({"q": [query]}).select(
        fc.semantic.embed(fc.col("q")).alias("qvec")
    ).collect()[0]["qvec"]

    # Find relevant chunks
    relevant = (
        indexed
        .with_column("similarity", fc.embedding.compute_similarity(
            fc.col("embedding"),
            fc.lit(query_vec),
            metric="cosine"
        ))
        .sort(fc.col("similarity").desc())
        .with_column("tokens", fc.text.count_tokens(fc.col("content")))
        .with_column("running_tokens", fc.sum("tokens").over(
            fc.Window.order_by(fc.col("similarity").desc())
        ))
        .filter(fc.col("running_tokens") <= max_context_tokens)
    )

    # Generate answer using assembled context
    return relevant.select(
        fc.semantic.map(
            "Answer this question using the provided context: {{query}}\n\nContext: {{context}}",
            query=fc.lit(query),
            context=fc.col("content")
        )
    )
```
This pattern retrieves only what fits within your context budget, ordered by relevance. The running token calculation ensures you never exceed limits while maximizing the useful context sent to the model.
When to Use Each Approach
Different scenarios require different chunking and context management strategies:
Use character chunking when:
- Processing unstructured text where semantic boundaries don't exist
- You need precisely controlled chunk sizes for downstream processing that expects fixed-size inputs
Use token chunking when:
- Working directly with LLM APIs where precise token counts matter for cost calculation and rate limiting
- You need guaranteed fit within specific model context windows
Use word chunking when:
- You need a fast approximation and semantic boundaries align with word breaks
- Exact token counts aren't critical, particularly useful for initial prototyping before optimizing for production
Use structure-aware chunking when:
- Documents have clear hierarchical structure like markdown, HTML, or JSON
- Preserving semantic relationships improves downstream task quality
Use embeddings when:
- You need to compare, cluster, or retrieve content across large datasets without repeatedly consuming context windows
- Building retrieval-augmented generation systems
Use semantic reduction when:
- You need to consolidate information from multiple chunks while respecting output token limits
- Building hierarchical summarization or aggregation tasks
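To turn these guidelines into code, a small dispatch helper can choose the chunking expression per content type. This sketch reuses only the Fenic functions shown earlier; the content-type labels and default sizes are illustrative assumptions.

```python
import fenic as fc

def chunk_expr(content_type: str, column: str = "content"):
    """Pick a chunking expression based on content structure (illustrative defaults)."""
    if content_type == "markdown":
        # Structure-aware: split at level-2 headings
        return fc.markdown.extract_header_chunks(fc.col(column), header_level=2)
    if content_type == "llm_input":
        # Token-accurate: guaranteed fit within model limits
        return fc.text.token_chunk(fc.col(column), chunk_size=512, chunk_overlap_percentage=10)
    if content_type == "prose":
        # Fast approximation on word boundaries
        return fc.text.word_chunk(fc.col(column), chunk_size=200, chunk_overlap_percentage=15)
    # Fallback: fixed-size character chunks
    return fc.text.character_chunk(fc.col(column), chunk_size=1000, chunk_overlap_percentage=20)

# Example usage:
# chunked = df.select(fc.col("doc_id"), chunk_expr("markdown").alias("chunks")).explode("chunks")
```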
Getting Started
Start building token-efficient pipelines with Fenic:
Install Fenic:
```bash
pip install fenic
```
Set up your environment:
```bash
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GOOGLE_API_KEY=...
```
Create your first chunking pipeline:
```python
import fenic as fc

session = fc.Session.get_or_create(
    fc.SessionConfig(app_name="chunking_demo")
)

# Load and chunk documents
docs = session.read.docs("data/**/*.md", content_type="markdown", recursive=True)

chunked = docs.select(
    fc.col("file_path"),
    fc.markdown.extract_header_chunks("content", header_level=2).alias("chunks")
).explode("chunks")

# Generate embeddings
embedded = chunked.select(
    fc.col("file_path"),
    fc.col("chunks").content.alias("text"),
    fc.semantic.embed(fc.col("chunks").content).alias("embedding")
)

embedded.show()
```
For more examples and detailed documentation, explore Typedef's resources:
- Fenic open source announcement
- Log clustering and triage guide
- Building reliable AI pipelines
- Latest release notes

