
How to Process HTML and PDFs as First-Class Citizens in Data Pipelines

Typedef Team

Most organizations store valuable data in PDFs and HTML pages. Product documentation, customer support tickets, meeting transcripts, legal contracts, and research papers exist in these formats. Traditional data infrastructure treats them as afterthoughts, forcing teams to build fragile preprocessing pipelines before extracting meaningful insights.

The problem extends beyond format conversion. When PDFs and HTML become second-class citizens in your data stack, you end up with brittle glue code, manual parsing steps, and separate inference infrastructure. Each document requires OCR models, custom extraction scripts, and constant data movement between systems.

This guide demonstrates how to process HTML and PDFs as native data types in production pipelines using Fenic, the open-source DataFrame framework from Typedef. You'll learn to read, transform, and extract structured information from documents using DataFrame operations enhanced with semantic intelligence.

First-Class Data Types in Fenic

Traditional data frameworks retrofit document processing as external operations. You call a UDF, parse the output, and hope the schema matches. Fenic takes a different approach: it extends the DataFrame abstraction with specialized types for text-heavy workloads.

The framework includes native support for:

  • MarkdownType: Structured text with headers, lists, and formatting
  • HtmlType: Raw HTML markup with DOM structure
  • JsonType: Nested data with JQ query support
  • TranscriptType: Time-stamped speaker content
  • EmbeddingType: Vector representations for semantic operations

These types come with dedicated column functions that recognize document structure. When you cast a column to MarkdownType, you gain access to operations like markdown.extract_header_chunks() that preserve hierarchical relationships. Cast to HtmlType and you can manipulate DOM elements directly in your pipeline.

The practical benefit is immediate. Instead of writing custom parsers and maintaining separate preprocessing code, you use declarative operations that compose naturally with other DataFrame transformations.
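
For example, a raw HTML column can be cast and chunked in two declarative steps. The df and raw_html names below are placeholders:

python
# Cast a raw HTML column to markdown and split it on h2 headers;
# `df` and its `raw_html` column are illustrative placeholders.
sections = (
    df
    .with_column("markdown", fc.col("raw_html").cast(fc.MarkdownType))
    .with_column(
        "sections",
        fc.markdown.extract_header_chunks("markdown", header_level=2)
    )
    .explode("sections")
)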

Processing PDFs with Parse and Metadata Extraction

Fenic 0.5.0 introduced production-ready PDF processing through two complementary functions: semantic.parse_pdf() for content extraction and read.pdf_metadata() for document properties.

Reading PDF Metadata for Document Discovery

Before parsing content, inspect document properties to filter, route, or batch process files efficiently. The pdf_metadata reader returns a DataFrame with size, page count, author, creation dates, encryption status, signatures, and image counts.

python
import fenic as fc

session = fc.Session.get_or_create(
    fc.SessionConfig(
        app_name="document_pipeline",
        semantic=fc.SemanticConfig(
            language_models={
                "gemini": fc.GoogleDeveloperLanguageModel(
                    model_name="gemini-2.0-flash",
                    rpm=100,
                    tpm=1000
                )
            },
            default_language_model="gemini"
        )
    )
)

# Read metadata from all PDFs recursively
pdf_meta = session.read.pdf_metadata(
    "data/documents/**/*.pdf",
    recursive=True
)

# Filter for processing based on document properties
processable = pdf_meta.filter(
    (fc.col("page_count") < 50) &
    (fc.col("encrypted") == False)
)

processable.show()

This metadata-first approach prevents unnecessary parsing of encrypted, oversized, or irrelevant documents. Route technical manuals differently than legal contracts based on page count thresholds, or skip processing password-protected files entirely.
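
A sketch of that routing looks like this; the page count thresholds are arbitrary:

python
# Illustrative routing by metadata; thresholds are arbitrary.
manuals = pdf_meta.filter(fc.col("page_count") >= 100)
contracts = pdf_meta.filter(fc.col("page_count") < 100)
skipped = pdf_meta.filter(fc.col("encrypted") == True)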

Parsing PDFs to Markdown

The semantic.parse_pdf() function extracts text and converts it to clean markdown with configurable page separators and image descriptions. This capability includes optimized token accounting for better batching and cost management.

python
# Parse PDFs into markdown with page markers
markdown_docs = pdf_meta.select(
    fc.col("file_path"),
    fc.semantic.parse_pdf(
        fc.col("file_path"),
        page_separator="--- PAGE {page} ---",
        describe_images=True
    ).alias("content")
)

# Extract structured data from parsed content
from pydantic import BaseModel, Field
from typing import List

class PolicySection(BaseModel):
    section_title: str = Field(description="Title of the policy section")
    requirements: List[str] = Field(description="List of requirements")
    effective_date: str = Field(description="When this policy takes effect")

policies = markdown_docs.with_column(
    "structured",
    fc.semantic.extract(
        fc.col("content"),
        PolicySection
    )
).unnest("structured")

policies.show()

Page separators improve throughput by fitting more content per LLM request while maintaining document context. Image descriptions are valuable when processing technical documentation or product catalogs where visual content carries critical information.

Combining Metadata and Content Processing

Production pipelines combine metadata filtering with content extraction to process only relevant documents. This pattern reduces costs and improves pipeline reliability.

python
# Complete workflow: filter by metadata, then parse content
relevant_docs = (
    session.read.pdf_metadata("data/contracts/**/*.pdf", recursive=True)
    .filter(
        (fc.col("page_count") >= 5) &
        (fc.col("page_count") <= 100) &
        (fc.col("creation_date") > "2023-01-01")
    )
    .select(
        fc.col("file_path"),
        fc.col("page_count"),
        fc.semantic.parse_pdf(
            fc.col("file_path"),
            page_separator="\\n\\n--- PAGE {page} ---\\n\\n"
        ).alias("markdown_content")
    )
)

# Extract contract terms using semantic operations
contract_terms = relevant_docs.with_column(
    "terms",
    fc.semantic.extract(
        fc.col("markdown_content"),
        ContractTerms  # Your Pydantic schema
    )
)

This approach scales to thousands of documents while maintaining type safety and lineage tracking throughout the pipeline.
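
The ContractTerms schema referenced above is yours to define. A minimal hypothetical version might look like:

python
from pydantic import BaseModel, Field
from typing import List

class ContractTerms(BaseModel):
    parties: List[str] = Field(description="Named parties to the agreement")
    term_length: str = Field(description="Duration of the contract term")
    renewal_terms: str = Field(description="Conditions for renewal or termination")
    payment_terms: str = Field(description="Payment schedule and amounts")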

Working with HTML as a Native Data Type

HTML processing follows the same pattern as PDFs. Cast content to HtmlType and apply specialized operations without leaving DataFrame semantics.

python
# Read HTML files from directory
html_docs = session.read.docs(
    "data/web_scrapes/",
    content_type="html",
    recursive=True
)

# HTML is now a first-class column type
html_docs.show()

Converting HTML to Markdown for Semantic Processing

HTML often contains navigation, styling, and scripts that obscure content structure. Converting to markdown strips presentation layer overhead while preserving semantic meaning.

python
# Cast HTML to markdown for cleaner semantic operations
clean_content = html_docs.with_column(
    "markdown",
    fc.col("content").cast(fc.MarkdownType)
)

# Extract header-based chunks from markdown
chunked = clean_content.with_column(
    "sections",
    fc.markdown.extract_header_chunks(
        "markdown",
        header_level=2
    )
).explode("sections")

# Each chunk maintains header hierarchy
chunked.select(
    fc.col("file_path"),
    fc.col("sections").content.alias("section_content"),
    fc.col("sections").heading.alias("section_heading"),
    fc.col("sections").full_path.alias("section_path")
).show()

The extract_header_chunks() function respects markdown structure. When you split a document at headers, each chunk includes its hierarchical context (h1 > h2 > h3), making downstream semantic operations more accurate.
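
Conceptually, each exploded chunk is a struct carrying its heading, the full path of ancestor headings, and the section body. The values below are invented for illustration:

python
# Illustrative shape of a single exploded chunk (values invented):
example_chunk = {
    "heading": "Configuring the Client",
    "full_path": "User Guide > Setup > Configuring the Client",
    "content": "To configure the client, set the following options...",
}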

Extracting Structured Data from HTML Content

After converting HTML to markdown, use semantic extraction to pull structured information based on Pydantic schemas.

python
from pydantic import BaseModel
from typing import List

class ProductInfo(BaseModel):
    name: str
    price: float
    features: List[str]
    availability: str

# Extract product information from HTML content
products = (
    html_docs
    .with_column("markdown", fc.col("content").cast(fc.MarkdownType))
    .with_column(
        "product_data",
        fc.semantic.extract("markdown", ProductInfo)
    )
    .unnest("product_data")
)

# Filter and analyze structured results
available_products = products.filter(
    fc.col("availability") == "in_stock"
)

This pattern works for any HTML source: scraped web pages, exported documentation, saved email newsletters, or archived blog posts. The semantic extraction handles variations in HTML structure as long as the information exists in the content.

Converting Between Data Types

Fenic's type system enables seamless conversion between formats. Cast operations preserve content while adapting it for different processing needs.

Common Type Conversions

python
# HTML to Markdown (most common)
df = df.with_column(
    "markdown",
    fc.col("html_content").cast(fc.MarkdownType)
)

# String to JSON for structured parsing
df = df.with_column(
    "json_data",
    fc.col("raw_json_string").cast(fc.JsonType)
)

# Markdown to String for simple text operations
df = df.with_column(
    "plain_text",
    fc.col("markdown").cast(fc.StringType)
)

Working with JSON Columns

JSON data requires different operations than markdown or HTML. Use JQ expressions to query nested structures directly in DataFrame operations.

python
# Extract nested fields from JSON using JQ
df = df.with_column(
    "title",
    fc.json.jq("json_column", ".metadata.title")
).with_column(
    "tags",
    fc.json.jq("json_column", ".tags[]")
)

The json.jq() function brings JQ query capabilities to DataFrame columns. Traverse nested structures, filter arrays, and transform data without writing custom Python code.
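
JQ expressions can also filter and reshape arrays inline. The .items structure below is a hypothetical example:

python
# Filter a nested array with JQ and keep only matching names;
# the .items structure is a hypothetical example.
df = df.with_column(
    "expensive_items",
    fc.json.jq("json_column", '.items[] | select(.price > 100) | .name')
)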

Building Production Document Pipelines

Production pipelines require more than extraction. They need batching, error handling, cost management, and monitoring. Fenic provides these capabilities through its inference-first architecture.

Multi-Stage Processing with Type Casting

Documents often require multiple transformation stages. Fenic's type system makes these pipelines explicit and testable.

python
pipeline = (
    session.read.pdf_metadata("data/reports/**/*.pdf", recursive=True)
    # Stage 1: Filter by metadata
    .filter(fc.col("page_count") > 5)
    # Stage 2: Parse to markdown
    .select(
        fc.col("file_path"),
        fc.semantic.parse_pdf(fc.col("file_path")).alias("markdown")
    )
    # Stage 3: Extract sections
    .with_column(
        "sections",
        fc.markdown.extract_header_chunks("markdown", header_level=2)
    )
    .explode("sections")
    # Stage 4: Semantic extraction
    .with_column(
        "insights",
        fc.semantic.extract(
            fc.col("sections").content,
            ReportInsights  # Your Pydantic schema
        )
    )
    .unnest("insights")
)

results = pipeline.collect()

Each stage transforms data while maintaining lineage. If extraction fails on specific sections, trace back through the pipeline to identify which PDF and page caused the issue.

Handling Rate Limits and Batch Inference

Fenic automatically batches LLM calls and manages rate limits across providers. Configure model profiles to control throughput and costs.

python
config = fc.SessionConfig(
    app_name="production_pipeline",
    semantic=fc.SemanticConfig(
        language_models={
            "fast": fc.OpenAILanguageModel(
                "gpt-4o-mini",
                rpm=500,
                tpm=200000
            ),
            "accurate": fc.AnthropicLanguageModel(
                "claude-sonnet-4.5",
                rpm=100,
                tpm=80000
            )
        },
        default_language_model="fast"
    )
)

session = fc.Session.get_or_create(config)

# Fast model for classification
classified = df.with_column(
    "category",
    fc.semantic.classify(
        fc.col("content"),
        classes=["technical", "business", "legal"],
        model_alias="fast"
    )
)

# Accurate model for extraction
extracted = classified.with_column(
    "structured_data",
    fc.semantic.extract(
        fc.col("content"),
        ComplexSchema,  # Your Pydantic schema
        model_alias="accurate"
    )
)

The framework handles retries, exponential backoff, and concurrent request batching. Define transformations while Fenic manages reliability.

Cost Tracking and Monitoring

Track inference costs and latency through the built-in metrics table. This feature was introduced in Fenic 0.4.0 for production observability.

python
# Query metrics after pipeline execution
metrics = session.table("fenic_system.query_metrics")

cost_analysis = (
    metrics
    .select("model", "latency_ms", "cost_usd", "tokens_used")
    .order_by("cost_usd", ascending=False)
)

cost_analysis.show()

This telemetry helps optimize model selection. If classification costs exceed extraction costs, switch to a cheaper model for that operation or adjust batching parameters.
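
For a per-model rollup, aggregate the same metrics table. This sketch assumes fenic's standard group_by/agg aggregation functions:

python
# Per-model cost rollup; assumes standard group_by/agg aggregates.
per_model = (
    metrics
    .group_by("model")
    .agg(
        fc.sum("cost_usd").alias("total_cost"),
        fc.sum("tokens_used").alias("total_tokens")
    )
    .order_by("total_cost", ascending=False)
)

per_model.show()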

Implementation Patterns for Production

The following patterns demonstrate how organizations process HTML and PDFs at scale using Fenic.

Content Classification Pipeline

Process documents from multiple sources, classify by topic, and route for specialized handling.

python
from pydantic import BaseModel
from typing import Literal, List

class DocumentCategory(BaseModel):
    primary_topic: Literal["technical", "sales", "support", "legal"]
    confidence: float
    key_entities: List[str]

# Unified pipeline for PDFs and HTML
documents = (
    # Load PDFs
    session.read.pdf_metadata("data/pdfs/**/*.pdf", recursive=True)
    .select(
        fc.col("file_path"),
        fc.lit("pdf").alias("source_type"),
        fc.semantic.parse_pdf(fc.col("file_path")).alias("content")
    )
    .union(
        # Load HTML documents
        session.read.docs("data/html/**/*.html", content_type="html", recursive=True)
        .select(
            fc.col("file_path"),
            fc.lit("html").alias("source_type"),
            fc.col("content").cast(fc.MarkdownType).alias("content")
        )
    )
)

# Classify all documents
classified = documents.with_column(
    "classification",
    fc.semantic.extract("content", DocumentCategory)
).unnest("classification")

# Route by category
technical_docs = classified.filter(
    fc.col("primary_topic") == "technical"
)

This pattern enables processing heterogeneous document collections through a single pipeline while maintaining type safety and semantic consistency.

Knowledge Base Construction

Extract structured information from documentation to build searchable knowledge bases.

python
class KBEntry(BaseModel):
    title: str
    summary: str
    key_concepts: List[str]
    related_topics: List[str]

kb_pipeline = (
    session.read.pdf_metadata("docs/**/*.pdf", recursive=True)
    .select(
        fc.col("file_path"),
        fc.semantic.parse_pdf(
            fc.col("file_path"),
            describe_images=True
        ).alias("markdown")
    )
    .with_column(
        "chunks",
        fc.markdown.extract_header_chunks("markdown", header_level=2)
    )
    .explode("chunks")
    .with_column(
        "kb_entry",
        fc.semantic.extract(
            fc.col("chunks").content,
            KBEntry
        )
    )
    .unnest("kb_entry")
    .with_column(
        "embedding",
        fc.semantic.embed(fc.col("summary"))
    )
)

# Persist for semantic search
kb_pipeline.write.save_as_table("knowledge_base")

The resulting table supports semantic search through embedding similarity while maintaining structured metadata for filtering and faceting.
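
Querying the table by embedding similarity is then a semantic similarity join. The sim_join call below is a sketch; its exact signature is an assumption, so verify it against the fenic documentation for your version:

python
# Hypothetical semantic lookup against the saved knowledge base.
# The sim_join signature is an assumption; check the fenic docs
# for the exact API in your version.
queries = session.create_dataframe(
    {"question": ["How do I rotate API keys?"]}
).with_column("query_embedding", fc.semantic.embed(fc.col("question")))

matches = queries.semantic.sim_join(
    session.table("knowledge_base"),
    left_on=fc.col("query_embedding"),
    right_on=fc.col("embedding"),
    k=5
)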

Compliance Document Processing

Extract clauses and requirements from legal documents with citation tracking. Typedef demonstrated similar capabilities in their RudderStack case study, where the platform processed unstructured inputs for product triage.

python
class ComplianceClause(BaseModel):
    clause_type: str
    requirement: str
    applies_to: List[str]
    effective_date: str
    page_reference: int

compliance_pipeline = (
    session.read.pdf_metadata("contracts/**/*.pdf", recursive=True)
    .select(
        fc.col("file_path"),
        fc.semantic.parse_pdf(
            fc.col("file_path"),
            page_separator="\\n--- PAGE {page} ---\\n"
        ).alias("content")
    )
    .with_column(
        "clauses",
        fc.semantic.extract("content", ComplianceClause)
    )
    .unnest("clauses")
)

# Filter high-priority requirements
critical_clauses = compliance_pipeline.filter(
    fc.col("clause_type").isin(["data_retention", "security", "audit"])
)

Page separators in the parse operation enable accurate page references in extracted clauses, meeting audit requirements.

Best Practices for Document Processing at Scale

Successful production deployments follow these patterns for reliability and performance.

Start with Metadata Filtering

Always read PDF metadata before parsing content. This operation can reduce processing costs by 80% or more by eliminating irrelevant documents early.

python
# Correct: Filter before parsing
relevant = (
    session.read.pdf_metadata("data/**/*.pdf", recursive=True)
    .filter(
        (fc.col("page_count") <= 50) &
        (fc.col("encrypted") == False) &
        (fc.col("creation_date") > "2024-01-01")
    )
    .select(
        fc.col("file_path"),
        fc.semantic.parse_pdf(fc.col("file_path")).alias("content")
    )
)

# Incorrect: Parse everything then filter
all_parsed = (
    session.read.pdf_metadata("data/**/*.pdf", recursive=True)
    .select(
        fc.col("file_path"),
        fc.semantic.parse_pdf(fc.col("file_path")).alias("content")
    )
    .filter(fc.col("file_path").contains("relevant"))  # Too late
)

Use Appropriate Chunk Sizes

Document chunking affects both cost and accuracy. Smaller chunks cost less but may lose context. Larger chunks preserve context but approach context window limits.

python
# Technical documentation: smaller chunks (more granular headers)
technical = df.with_column(
    "sections",
    fc.markdown.extract_header_chunks("content", header_level=3)
)

# Narrative content: larger chunks (top-level headers)
narrative = df.with_column(
    "sections",
    fc.markdown.extract_header_chunks("content", header_level=1)
)

Test different chunk sizes on sample data to find the optimal balance for your use case.
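
One quick way to compare granularity is counting the chunks each header level produces on a small sample:

python
# Count chunks produced at each header level on a small sample;
# `sample` is a placeholder for e.g. df.limit(20).
for level in [1, 2, 3]:
    n_chunks = (
        sample
        .with_column(
            "sections",
            fc.markdown.extract_header_chunks("content", header_level=level)
        )
        .explode("sections")
        .count()
    )
    print(f"header_level={level}: {n_chunks} chunks")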

Leverage Type Casting for Clean Pipelines

Convert HTML to markdown early in pipelines to simplify downstream operations. Raw HTML adds parsing overhead due to markup noise.

python
# Convert HTML to markdown immediately after loading
clean_pipeline = (
    session.read.docs("data/*.html", content_type="html", recursive=True)
    .with_column("markdown", fc.col("content").cast(fc.MarkdownType))
    .drop("content")  # Work with markdown from here
    .with_column(
        "extracted",
        fc.semantic.extract("markdown", Schema)
    )
)

Implement Staged Processing

Break extractions into stages with intermediate persistence. This enables faster iteration and easier debugging.

python
# Stage 1: Parse and chunk
parsed = (
    session.read.pdf_metadata("data/**/*.pdf", recursive=True)
    .filter(fc.col("page_count") < 100)
    .select(
        fc.col("file_path"),
        fc.semantic.parse_pdf(fc.col("file_path")).alias("markdown")
    )
    .with_column(
        "chunks",
        fc.markdown.extract_header_chunks("markdown", header_level=2)
    )
)

# Persist intermediate results
parsed.write.save_as_view("parsed_documents")

# Stage 2: Extract structure (can iterate separately)
extracted = (
    session.view("parsed_documents")
    .explode("chunks")
    .with_column(
        "structured",
        fc.semantic.extract(
            fc.col("chunks").content,
            DetailedSchema
        )
    )
)

Views enable rapid experimentation with extraction schemas without re-parsing source documents.

Monitor Extraction Quality

Use classification confidence scores and validation rules to detect extraction issues early.

python
from pydantic import BaseModel, field_validator
from datetime import datetime

class ValidatedExtraction(BaseModel):
    title: str
    date: str

    @field_validator('date')
    @classmethod
    def validate_date_format(cls, v):
        # Reject values that do not parse as ISO dates (YYYY-MM-DD)
        datetime.strptime(v, "%Y-%m-%d")
        return v

# Extract with validation
results = df.with_column(
    "validated",
    fc.semantic.extract("content", ValidatedExtraction)
)

# Track failures
failed = results.filter(fc.col("validated").isNull())
success_rate = (
    results.count() - failed.count()
) / results.count()

print(f"Extraction success rate: {success_rate:.2%}")

Moving from Prototype to Production

Fenic enables local development with zero-change deployment to Typedef Cloud. This seamless transition was a core design goal for the framework.

Local Development Pattern

python
# Local development session
local_config = fc.SessionConfig(
    app_name="document_pipeline",
    semantic=fc.SemanticConfig(
        language_models={
            "dev": fc.OpenAILanguageModel("gpt-4o-mini", rpm=50, tpm=10000)
        },
        default_language_model="dev"
    )
)

local_session = fc.Session.get_or_create(local_config)

# Build and test pipeline locally
pipeline = build_document_pipeline(local_session)
results = pipeline.limit(10).collect()  # Test on small sample
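
The build_document_pipeline helper stands in for your own pipeline code. A minimal hypothetical version:

python
# Hypothetical pipeline builder: only the session differs between
# environments, so the same function serves local and cloud runs.
def build_document_pipeline(session):
    return (
        session.read.pdf_metadata("data/**/*.pdf", recursive=True)
        .filter(fc.col("page_count") < 100)
        .select(
            fc.col("file_path"),
            fc.semantic.parse_pdf(fc.col("file_path")).alias("content")
        )
    )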

Production Deployment

python
import os

# Production session with cloud config
prod_config = fc.SessionConfig(
    app_name="document_pipeline",
    semantic=fc.SemanticConfig(
        language_models={
            "prod": fc.OpenAILanguageModel("gpt-4o", rpm=500, tpm=200000)
        },
        default_language_model="prod"
    ),
    cloud=fc.CloudConfig(
        # Typedef Cloud configuration
        endpoint="https://api.typedef.ai",
        api_key=os.getenv("TYPEDEF_API_KEY")
    )
)

prod_session = fc.Session.get_or_create(prod_config)

# Same pipeline code, cloud execution
pipeline = build_document_pipeline(prod_session)
results = pipeline.collect()  # Runs on Typedef Cloud

The pipeline code remains identical. Only session configuration changes between environments.
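
A common pattern is selecting the config from an environment variable so the entry point never changes; the APP_ENV name here is arbitrary:

python
import os

# Choose the config by environment; APP_ENV is an arbitrary name.
config = prod_config if os.getenv("APP_ENV") == "production" else local_config
session = fc.Session.get_or_create(config)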

Next Steps

Processing HTML and PDFs as first-class data types eliminates the fragile preprocessing layer in traditional AI pipelines. Fenic's type system extends DataFrame operations with document-aware semantics, enabling you to read, transform, and extract structure using familiar patterns.

The framework handles rate limiting, batching, retries, and cost tracking while you focus on defining transformations. Parse PDFs to markdown, extract structured data with Pydantic schemas, and compose operations declaratively. Develop locally, deploy to production without code changes.

This approach scales from prototype to production without architectural rewrites. Start with Fenic's open-source framework for local development, then leverage Typedef Cloud for enterprise scale and collaboration.

Learn more about building reliable AI pipelines with semantic operators or explore how to eliminate fragile glue code in your data processing workflows.
