Most organizations store valuable data in PDFs and HTML pages. Product documentation, customer support tickets, meeting transcripts, legal contracts, and research papers exist in these formats. Traditional data infrastructure treats them as afterthoughts, forcing teams to build fragile preprocessing pipelines before extracting meaningful insights.
The problem extends beyond format conversion. When PDFs and HTML become second-class citizens in your data stack, you end up with brittle glue code, manual parsing steps, and separate inference infrastructure. Each document requires OCR models, custom extraction scripts, and constant data movement between systems.
This guide demonstrates how to process HTML and PDFs as native data types in production pipelines using Fenic, the open-source DataFrame framework from Typedef. You'll learn to read, transform, and extract structured information from documents using DataFrame operations enhanced with semantic intelligence.
First-Class Data Types in Fenic
Traditional data frameworks retrofit document processing as external operations. You call a UDF, parse the output, and hope the schema matches. Fenic takes a different approach: it extends the DataFrame abstraction with specialized types for text-heavy workloads.
The framework includes native support for:
- MarkdownType: Structured text with headers, lists, and formatting
- HtmlType: Raw HTML markup with DOM structure
- JsonType: Nested data with JQ query support
- TranscriptType: Time-stamped speaker content
- EmbeddingType: Vector representations for semantic operations
These types come with dedicated column functions that recognize document structure. When you cast a column to MarkdownType, you gain access to operations like markdown.extract_header_chunks() that preserve hierarchical relationships. Cast to HtmlType and you can manipulate DOM elements directly in your pipeline.
The practical benefit is immediate. Instead of writing custom parsers and maintaining separate preprocessing code, you use declarative operations that compose naturally with other DataFrame transformations.
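For example, a minimal sketch of what that looks like in practice (html_df and its content column are placeholder names, not part of any specific dataset):

```python
import fenic as fc

# Assumes an html_df DataFrame with a raw "content" column.
# Cast it to MarkdownType and split it into header-level chunks
# without writing any custom parsing code.
sections = (
    html_df
    .with_column("markdown", fc.col("content").cast(fc.MarkdownType))
    .with_column(
        "sections",
        fc.markdown.extract_header_chunks("markdown", header_level=2)
    )
    .explode("sections")
)
```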
Processing PDFs with Parse and Metadata Extraction
Fenic 0.5.0 introduced production-ready PDF processing through two complementary functions: semantic.parse_pdf() for content extraction and read.pdf_metadata() for document properties.
Reading PDF Metadata for Document Discovery
Before parsing content, inspect document properties to filter, route, or batch process files efficiently. The pdf_metadata reader returns a DataFrame with size, page count, author, creation dates, encryption status, signatures, and image counts.
```python
import fenic as fc

session = fc.Session.get_or_create(
    fc.SessionConfig(
        app_name="document_pipeline",
        semantic=fc.SemanticConfig(
            language_models={
                "gemini": fc.GoogleDeveloperLanguageModel(
                    model_name="gemini-2.0-flash",
                    rpm=100,
                    tpm=1000
                )
            },
            default_language_model="gemini"
        )
    )
)

# Read metadata from all PDFs recursively
pdf_meta = session.read.pdf_metadata(
    "data/documents/**/*.pdf",
    recursive=True
)

# Filter for processing based on document properties
processable = pdf_meta.filter(
    (fc.col("page_count") < 50) & (fc.col("encrypted") == False)
)

processable.show()
```
This metadata-first approach prevents unnecessary parsing of encrypted, oversized, or irrelevant documents. Route technical manuals differently than legal contracts based on page count thresholds, or skip processing password-protected files entirely.
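A minimal routing sketch, assuming the pdf_meta DataFrame from above (the page-count thresholds are illustrative, not recommendations):

```python
# Skip encrypted files entirely, then route short documents to a
# lightweight path and long manuals to a chunked path.
unencrypted = pdf_meta.filter(fc.col("encrypted") == False)

short_docs = unencrypted.filter(fc.col("page_count") <= 20)
long_manuals = unencrypted.filter(fc.col("page_count") > 20)
```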
Parsing PDFs to Markdown
The semantic.parse_pdf() function extracts text and converts it to clean markdown with configurable page separators and image descriptions. This capability includes optimized token accounting for better batching and cost management.
```python
from pydantic import BaseModel, Field
from typing import List

# Parse PDFs into markdown with page markers
markdown_docs = pdf_meta.select(
    fc.col("file_path"),
    fc.semantic.parse_pdf(
        fc.col("file_path"),
        page_separator="--- PAGE {page} ---",
        describe_images=True
    ).alias("content")
)

# Extract structured data from parsed content
class PolicySection(BaseModel):
    section_title: str = Field(description="Title of the policy section")
    requirements: List[str] = Field(description="List of requirements")
    effective_date: str = Field(description="When this policy takes effect")

policies = markdown_docs.with_column(
    "structured",
    fc.semantic.extract(
        fc.col("content"),
        PolicySection
    )
).unnest("structured")

policies.show()
```
Page separators improve throughput by fitting more content per LLM request while maintaining document context. Image descriptions are valuable when processing technical documentation or product catalogs where visual content carries critical information.
Combining Metadata and Content Processing
Production pipelines combine metadata filtering with content extraction to process only relevant documents. This pattern reduces costs and improves pipeline reliability.
```python
# Complete workflow: filter by metadata, then parse content
relevant_docs = (
    session.read.pdf_metadata("data/contracts/**/*.pdf", recursive=True)
    .filter(
        (fc.col("page_count") >= 5) &
        (fc.col("page_count") <= 100) &
        (fc.col("creation_date") > "2023-01-01")
    )
    .select(
        fc.col("file_path"),
        fc.col("page_count"),
        fc.semantic.parse_pdf(
            fc.col("file_path"),
            page_separator="\n\n--- PAGE {page} ---\n\n"
        ).alias("markdown_content")
    )
)

# Extract contract terms using semantic operations
contract_terms = relevant_docs.with_column(
    "terms",
    fc.semantic.extract(
        fc.col("markdown_content"),
        ContractTerms  # Your Pydantic schema
    )
)
```
This approach scales to thousands of documents while maintaining type safety and lineage tracking throughout the pipeline.
Working with HTML as a Native Data Type
HTML processing follows the same pattern as PDFs. Cast content to HtmlType and apply specialized operations without leaving DataFrame semantics.
```python
# Read HTML files from directory
html_docs = session.read.docs(
    "data/web_scrapes/",
    content_type="html",
    recursive=True
)

# HTML is now a first-class column type
html_docs.show()
```
Converting HTML to Markdown for Semantic Processing
HTML often contains navigation, styling, and scripts that obscure content structure. Converting to markdown strips presentation layer overhead while preserving semantic meaning.
```python
# Cast HTML to markdown for cleaner semantic operations
clean_content = html_docs.with_column(
    "markdown",
    fc.col("content").cast(fc.MarkdownType)
)

# Extract header-based chunks from markdown
chunked = clean_content.with_column(
    "sections",
    fc.markdown.extract_header_chunks(
        "markdown",
        header_level=2
    )
).explode("sections")

# Each chunk maintains header hierarchy
chunked.select(
    fc.col("file_path"),
    fc.col("sections").content.alias("section_content"),
    fc.col("sections").heading.alias("section_heading"),
    fc.col("sections").full_path.alias("section_path")
).show()
```
The extract_header_chunks() function respects markdown structure. When you split a document at headers, each chunk includes its hierarchical context (h1 > h2 > h3), making downstream semantic operations more accurate.
Extracting Structured Data from HTML Content
After converting HTML to markdown, use semantic extraction to pull structured information based on Pydantic schemas.
```python
from pydantic import BaseModel
from typing import List

class ProductInfo(BaseModel):
    name: str
    price: float
    features: List[str]
    availability: str

# Extract product information from HTML content
products = (
    html_docs
    .with_column("markdown", fc.col("content").cast(fc.MarkdownType))
    .with_column(
        "product_data",
        fc.semantic.extract("markdown", ProductInfo)
    )
    .unnest("product_data")
)

# Filter and analyze structured results
available_products = products.filter(
    fc.col("availability") == "in_stock"
)
```
This pattern works for any HTML source: scraped web pages, exported documentation, saved email newsletters, or archived blog posts. The semantic extraction handles variations in HTML structure as long as the information exists in the content.
Converting Between Data Types
Fenic's type system enables seamless conversion between formats. Cast operations preserve content while adapting it for different processing needs.
Common Type Conversions
```python
# HTML to Markdown (most common)
df = df.with_column(
    "markdown",
    fc.col("html_content").cast(fc.MarkdownType)
)

# String to JSON for structured parsing
df = df.with_column(
    "json_data",
    fc.col("raw_json_string").cast(fc.JsonType)
)

# Markdown to String for simple text operations
df = df.with_column(
    "plain_text",
    fc.col("markdown").cast(fc.StringType)
)
```
Working with JSON Columns
JSON data requires different operations than markdown or HTML. Use JQ expressions to query nested structures directly in DataFrame operations.
```python
# Extract nested fields from JSON using JQ
df = df.with_column(
    "title",
    fc.json.jq("json_column", ".metadata.title")
).with_column(
    "tags",
    fc.json.jq("json_column", ".tags[]")
)
```
The json.jq() function brings JQ query capabilities to DataFrame columns. Traverse nested structures, filter arrays, and transform data without writing custom Python code.
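As a slightly richer sketch, assuming a json_column whose documents contain nested tag and author objects (this shape is an illustration, not a schema the framework requires):

```python
# Keep only published tag names and pull nested author names.
# Assumes json_column looks roughly like:
# {"tags": [{"name": ..., "published": ...}], "authors": [{"name": ...}]}
df = df.with_column(
    "published_tags",
    fc.json.jq("json_column", "[.tags[] | select(.published == true) | .name]")
).with_column(
    "author_names",
    fc.json.jq("json_column", "[.authors[].name]")
)
```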
Building Production Document Pipelines
Production pipelines require more than extraction. They need batching, error handling, cost management, and monitoring. Fenic provides these capabilities through its inference-first architecture.
Multi-Stage Processing with Type Casting
Documents often require multiple transformation stages. Fenic's type system makes these pipelines explicit and testable.
```python
pipeline = (
    session.read.pdf_metadata("data/reports/**/*.pdf", recursive=True)
    # Stage 1: Filter by metadata
    .filter(fc.col("page_count") > 5)
    # Stage 2: Parse to markdown
    .select(
        fc.col("file_path"),
        fc.semantic.parse_pdf(fc.col("file_path")).alias("markdown")
    )
    # Stage 3: Extract sections
    .with_column(
        "sections",
        fc.markdown.extract_header_chunks("markdown", header_level=2)
    )
    .explode("sections")
    # Stage 4: Semantic extraction
    .with_column(
        "insights",
        fc.semantic.extract(
            fc.col("sections").content,
            ReportInsights
        )
    )
    .unnest("insights")
)

results = pipeline.collect()
```
Each stage transforms data while maintaining lineage. If extraction fails on specific sections, trace back through the pipeline to identify which PDF and page caused the issue.
Handling Rate Limits and Batch Inference
Fenic automatically batches LLM calls and manages rate limits across providers. Configure model profiles to control throughput and costs.
```python
config = fc.SessionConfig(
    app_name="production_pipeline",
    semantic=fc.SemanticConfig(
        language_models={
            "fast": fc.OpenAILanguageModel(
                "gpt-4o-mini",
                rpm=500,
                tpm=200000
            ),
            "accurate": fc.AnthropicLanguageModel(
                "claude-sonnet-4.5",
                rpm=100,
                tpm=80000
            )
        },
        default_language_model="fast"
    )
)

session = fc.Session.get_or_create(config)

# Fast model for classification
classified = df.with_column(
    "category",
    fc.semantic.classify(
        fc.col("content"),
        classes=["technical", "business", "legal"],
        model_alias="fast"
    )
)

# Accurate model for extraction
extracted = classified.with_column(
    "structured_data",
    fc.semantic.extract(
        fc.col("content"),
        ComplexSchema,
        model_alias="accurate"
    )
)
```
The framework handles retries, exponential backoff, and concurrent request batching. Define transformations while Fenic manages reliability.
Cost Tracking and Monitoring
Track inference costs and latency through the built-in metrics table. This feature was introduced in Fenic 0.4.0 for production observability.
```python
# Query metrics after pipeline execution
metrics = session.table("fenic_system.query_metrics")

cost_analysis = (
    metrics
    .select("model", "latency_ms", "cost_usd", "tokens_used")
    .order_by("cost_usd", ascending=False)
)

cost_analysis.show()
```
This telemetry helps optimize model selection. If classification costs exceed extraction costs, switch to a cheaper model for that operation or adjust batching parameters.
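As a sketch of per-model cost roll-ups, assuming Fenic exposes a PySpark-style group_by/agg pattern with fc.sum and fc.avg aggregates (verify the exact aggregate helpers against your version's API):

```python
# Roll up total cost and average latency per model; the group_by/agg
# pattern and aggregate functions here are assumptions about the API.
per_model_costs = (
    session.table("fenic_system.query_metrics")
    .group_by("model")
    .agg(
        fc.sum(fc.col("cost_usd")).alias("total_cost_usd"),
        fc.avg(fc.col("latency_ms")).alias("avg_latency_ms"),
    )
    .order_by("total_cost_usd", ascending=False)
)

per_model_costs.show()
```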
Implementation Patterns for Production
The following patterns demonstrate how organizations process HTML and PDFs at scale using Fenic.
Content Classification Pipeline
Process documents from multiple sources, classify by topic, and route for specialized handling.
```python
from pydantic import BaseModel
from typing import Literal, List

class DocumentCategory(BaseModel):
    primary_topic: Literal["technical", "sales", "support", "legal"]
    confidence: float
    key_entities: List[str]

# Unified pipeline for PDFs and HTML
documents = (
    # Load PDFs
    session.read.pdf_metadata("data/pdfs/**/*.pdf", recursive=True)
    .select(
        fc.col("file_path"),
        fc.lit("pdf").alias("source_type"),
        fc.semantic.parse_pdf(fc.col("file_path")).alias("content")
    )
    .union(
        # Load HTML documents
        session.read.docs("data/html/**/*.html", content_type="html", recursive=True)
        .select(
            fc.col("file_path"),
            fc.lit("html").alias("source_type"),
            fc.col("content").cast(fc.MarkdownType).alias("content")
        )
    )
)

# Classify all documents
classified = documents.with_column(
    "classification",
    fc.semantic.extract("content", DocumentCategory)
).unnest("classification")

# Route by category
technical_docs = classified.filter(
    fc.col("primary_topic") == "technical"
)
```
This pattern enables processing heterogeneous document collections through a single pipeline while maintaining type safety and semantic consistency.
Knowledge Base Construction
Extract structured information from documentation to build searchable knowledge bases.
```python
from pydantic import BaseModel
from typing import List

class KBEntry(BaseModel):
    title: str
    summary: str
    key_concepts: List[str]
    related_topics: List[str]

kb_pipeline = (
    session.read.pdf_metadata("docs/**/*.pdf", recursive=True)
    .select(
        fc.col("file_path"),
        fc.semantic.parse_pdf(
            fc.col("file_path"),
            describe_images=True
        ).alias("markdown")
    )
    .with_column(
        "chunks",
        fc.markdown.extract_header_chunks("markdown", header_level=2)
    )
    .explode("chunks")
    .with_column(
        "kb_entry",
        fc.semantic.extract(
            fc.col("chunks").content,
            KBEntry
        )
    )
    .unnest("kb_entry")
    .with_column(
        "embedding",
        fc.semantic.embed(fc.col("summary"))
    )
)

# Persist for semantic search
kb_pipeline.write.save_as_table("knowledge_base")
```
The resulting table supports semantic search through embedding similarity while maintaining structured metadata for filtering and faceting.
Compliance Document Processing
Extract clauses and requirements from legal documents with citation tracking. Typedef demonstrated similar capabilities in their RudderStack case study, where the platform processed unstructured inputs for product triage.
```python
from pydantic import BaseModel
from typing import List

class ComplianceClause(BaseModel):
    clause_type: str
    requirement: str
    applies_to: List[str]
    effective_date: str
    page_reference: int

compliance_pipeline = (
    session.read.pdf_metadata("contracts/**/*.pdf", recursive=True)
    .select(
        fc.col("file_path"),
        fc.semantic.parse_pdf(
            fc.col("file_path"),
            page_separator="\n--- PAGE {page} ---\n"
        ).alias("content")
    )
    .with_column(
        "clauses",
        fc.semantic.extract("content", ComplianceClause)
    )
    .unnest("clauses")
)

# Filter high-priority requirements
critical_clauses = compliance_pipeline.filter(
    fc.col("clause_type").isin(["data_retention", "security", "audit"])
)
```
Page separators in the parse operation enable accurate page references in extracted clauses, meeting audit requirements.
Best Practices for Document Processing at Scale
Successful production deployments follow these patterns for reliability and performance.
Start with Metadata Filtering
Always read PDF metadata before parsing content. This operation can reduce processing costs by 80% or more by eliminating irrelevant documents early.
```python
# Correct: Filter before parsing
relevant = (
    session.read.pdf_metadata("data/**/*.pdf", recursive=True)
    .filter(
        (fc.col("page_count") <= 50) &
        (fc.col("encrypted") == False) &
        (fc.col("creation_date") > "2024-01-01")
    )
    .select(
        fc.col("file_path"),
        fc.semantic.parse_pdf(fc.col("file_path")).alias("content")
    )
)

# Incorrect: Parse everything then filter
all_parsed = (
    session.read.pdf_metadata("data/**/*.pdf", recursive=True)
    .select(
        fc.col("file_path"),
        fc.semantic.parse_pdf(fc.col("file_path")).alias("content")
    )
    .filter(fc.col("file_path").contains("relevant"))  # Too late
)
```
Use Appropriate Chunk Sizes
Document chunking affects both cost and accuracy. Smaller chunks cost less but may lose context. Larger chunks preserve context but approach context window limits.
```python
# Technical documentation: smaller chunks (more granular headers)
technical = df.with_column(
    "sections",
    fc.markdown.extract_header_chunks("content", header_level=3)
)

# Narrative content: larger chunks (top-level headers)
narrative = df.with_column(
    "sections",
    fc.markdown.extract_header_chunks("content", header_level=1)
)
```
Test different chunk sizes on sample data to find the optimal balance for your use case.
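One lightweight way to compare, sketched below using only operations shown elsewhere in this guide, is to chunk a small sample at two header levels and inspect the results side by side:

```python
# Assumes the df from above, with markdown in a "content" column.
sample = df.limit(10)

# Coarse chunks: split at top-level headers
coarse = sample.with_column(
    "sections",
    fc.markdown.extract_header_chunks("content", header_level=1)
).explode("sections")

# Fine chunks: split at third-level headers
fine = sample.with_column(
    "sections",
    fc.markdown.extract_header_chunks("content", header_level=3)
).explode("sections")

coarse.show()
fine.show()
```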
Leverage Type Casting for Clean Pipelines
Convert HTML to markdown early in pipelines to simplify downstream operations. Raw HTML adds parsing overhead due to markup noise.
```python
# Convert HTML to markdown immediately after loading
clean_pipeline = (
    session.read.docs("data/*.html", content_type="html", recursive=True)
    .with_column("markdown", fc.col("content").cast(fc.MarkdownType))
    .drop("content")
    # Work with markdown from here
    .with_column(
        "extracted",
        fc.semantic.extract("markdown", Schema)
    )
)
```
Implement Staged Processing
Break extractions into stages with intermediate persistence. This enables faster iteration and easier debugging.
```python
# Stage 1: Parse and chunk
parsed = (
    session.read.pdf_metadata("data/**/*.pdf", recursive=True)
    .filter(fc.col("page_count") < 100)
    .select(
        fc.col("file_path"),
        fc.semantic.parse_pdf(fc.col("file_path")).alias("markdown")
    )
    .with_column(
        "chunks",
        fc.markdown.extract_header_chunks("markdown", header_level=2)
    )
)

# Persist intermediate results
parsed.write.save_as_view("parsed_documents")

# Stage 2: Extract structure (can iterate separately)
extracted = (
    session.view("parsed_documents")
    .explode("chunks")
    .with_column(
        "structured",
        fc.semantic.extract(
            fc.col("chunks").content,
            DetailedSchema
        )
    )
)
```
Views enable rapid experimentation with extraction schemas without re-parsing source documents.
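For instance, a sketch of iterating with a second schema against the same view (AlternativeSchema is a hypothetical Pydantic model you would define):

```python
# Re-run extraction with a different schema without re-parsing the PDFs.
# AlternativeSchema is a hypothetical Pydantic model for illustration.
alternative = (
    session.view("parsed_documents")
    .explode("chunks")
    .with_column(
        "structured_v2",
        fc.semantic.extract(fc.col("chunks").content, AlternativeSchema)
    )
)
```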
Monitor Extraction Quality
Use classification confidence scores and validation rules to detect extraction issues early.
```python
import datetime

from pydantic import BaseModel, Field, validator

class ValidatedExtraction(BaseModel):
    title: str
    date: str

    @validator("date")
    def validate_date_format(cls, v):
        # Reject anything that is not an ISO-formatted date (YYYY-MM-DD)
        datetime.date.fromisoformat(v)
        return v

# Extract with validation
results = df.with_column(
    "validated",
    fc.semantic.extract("content", ValidatedExtraction)
)

# Track failures
failed = results.filter(fc.col("validated").isNull())
success_rate = (results.count() - failed.count()) / results.count()
print(f"Extraction success rate: {success_rate:.2%}")
```
Moving from Prototype to Production
Fenic enables local development with zero-change deployment to Typedef Cloud. This seamless transition was a core design goal for the framework.
Local Development Pattern
```python
# Local development session
local_config = fc.SessionConfig(
    app_name="document_pipeline",
    semantic=fc.SemanticConfig(
        language_models={
            "dev": fc.OpenAILanguageModel("gpt-4o-mini", rpm=50, tpm=10000)
        },
        default_language_model="dev"
    )
)

local_session = fc.Session.get_or_create(local_config)

# Build and test pipeline locally
pipeline = build_document_pipeline(local_session)
results = pipeline.limit(10).collect()  # Test on small sample
```
Production Deployment
```python
import os

# Production session with cloud config
prod_config = fc.SessionConfig(
    app_name="document_pipeline",
    semantic=fc.SemanticConfig(
        language_models={
            "prod": fc.OpenAILanguageModel("gpt-4o", rpm=500, tpm=200000)
        },
        default_language_model="prod"
    ),
    cloud=fc.CloudConfig(
        # Typedef Cloud configuration
        endpoint="https://api.typedef.ai",
        api_key=os.getenv("TYPEDEF_API_KEY")
    )
)

prod_session = fc.Session.get_or_create(prod_config)

# Same pipeline code, cloud execution
pipeline = build_document_pipeline(prod_session)
results = pipeline.collect()  # Runs on Typedef Cloud
```
The pipeline code remains identical. Only session configuration changes between environments.
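One way to keep that single code path is to select the configuration from the environment, sketched here with an illustrative FENIC_ENV variable (the variable name is not a Fenic convention):

```python
import os
import fenic as fc

def make_session() -> fc.Session:
    # Pick the cloud-backed config only when explicitly requested.
    if os.getenv("FENIC_ENV") == "production":
        return fc.Session.get_or_create(prod_config)
    return fc.Session.get_or_create(local_config)

# Same pipeline code regardless of where it runs
pipeline = build_document_pipeline(make_session())
```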
Next Steps
Processing HTML and PDFs as first-class data types eliminates the fragile preprocessing layer in traditional AI pipelines. Fenic's type system extends DataFrame operations with document-aware semantics, enabling you to read, transform, and extract structure using familiar patterns.
The framework handles rate limiting, batching, retries, and cost tracking while you focus on defining transformations. Parse PDFs to markdown, extract structured data with Pydantic schemas, and compose operations declaratively. Develop locally, deploy to production without code changes.
This approach scales from prototype to production without architectural rewrites. Start with Fenic's open-source framework for local development, then leverage Typedef Cloud for enterprise scale and collaboration.
Learn more about building reliable AI pipelines with semantic operators or explore how to eliminate fragile glue code in your data processing workflows.

