How to Implement Fuzzy Matching and Jinja Templating in AI DataFrames

Typedef Team

Exact string matching fails in production. Typos, variations, and formatting inconsistencies break traditional DataFrame operations. Fenic solves this with native fuzzy matching and Jinja templating built directly into the DataFrame API.

This guide shows you how to implement both techniques in Fenic, Typedef's PySpark-inspired DataFrame framework for AI applications.

Fuzzy Matching with compute_fuzzy_ratio

Six Algorithms Available

Fenic provides six fuzzy similarity algorithms through compute_fuzzy_ratio. Each returns a score from 0 to 100, where 100 means the strings are identical.

Levenshtein - Counts insertions, deletions, and substitutions. Use for general-purpose fuzzy matching.

python
from fenic.api.functions.text import compute_fuzzy_ratio

compute_fuzzy_ratio(col("name1"), col("name2"), method="levenshtein")

Damerau-Levenshtein - Adds transposition detection to Levenshtein. Best for keyboard typos and user input.

python
compute_fuzzy_ratio(col("input"), col("reference"), method="damerau_levenshtein")

Indel - Counts only insertions and deletions (no substitutions), equivalent to comparing strings by their longest common subsequence.

python
compute_fuzzy_ratio(col("sequence1"), col("sequence2"), method="indel")

Jaro - Measures character matches and order. Effective for short strings like names.

python
compute_fuzzy_ratio(col("first_name"), col("match_name"), method="jaro")

Jaro-Winkler - Weights common prefixes higher than Jaro. Best for name matching and identifiers.

python
compute_fuzzy_ratio(col("customer"), col("canonical"), method="jaro_winkler")

Hamming - Counts differing positions in equal-length strings. Requires identical string lengths.

python
compute_fuzzy_ratio(col("code1"), col("code2"), method="hamming")

Basic Column-to-Column Matching

Compare two DataFrame columns row-by-row:

python
import fenic as fc
from fenic.api.functions.core import col

result = df.select(
    col("company_name"),
    col("entered_name"),
    compute_fuzzy_ratio(
        col("company_name"),
        col("entered_name"),
        method="levenshtein"
    ).alias("similarity_score")
)

Column-to-Literal Matching

Compare a column against a fixed string:

python
products = df.with_column(
    "match_score",
    compute_fuzzy_ratio(
        col("product_name"),
        "MacBook Pro",
        method="jaro_winkler"
    )
).filter(col("match_score") > 80)

Algorithm Selection Guide

Use Case              | Algorithm           | Reason
General text matching | Levenshtein         | Balanced for all error types
User-entered forms    | Damerau-Levenshtein | Handles transposition errors
Name matching         | Jaro-Winkler        | Prefix-weighted similarity
Fixed-length codes    | Hamming             | Position-based comparison
DNA sequences         | Indel               | Insertions/deletions only

Deduplication Pattern

python
from fenic.api.functions.builtin import first
from fenic.api.functions.core import col
from fenic.api.functions.text import compute_fuzzy_ratio

deduplicated = (
    df.with_column(
        "canonical_score",
        compute_fuzzy_ratio(
            col("customer_name"),
            col("canonical_name"),
            method="damerau_levenshtein"
        )
    )
    .filter(col("canonical_score") > 90)  # keep only near-identical names
    .group_by("canonical_name")           # collapse duplicates onto the canonical record
    .agg(first(col("customer_id")).alias("merged_id"))  # one representative id per group
)

Jinja Templating with text.jinja

Supported Template Syntax

Fenic supports a focused subset of Jinja2 features:

Variable Substitution

{{ variable_name }}

Struct Field Access

{{ user.profile.email }}

Array Indexing (literal indices only)

{{ items[0] }}
{{ data["key"] }}

Conditionals

{% if condition %}
content when true
{% else %}
content when false
{% endif %}

Loops

{% for item in items %}
Process: {{ item.name }}
{% endfor %}

Loop Variables

{{ loop.index }}
{{ loop.first }}
{{ loop.last }}
{{ loop.length }}

Basic Template Rendering

python
from fenic.api.functions.core import col
from fenic.api.functions.text import jinja

prompts = df.select(
    jinja(
        "Analyze this product: {{ name }} priced at {{ price }}",
        name=col("product_name"),
        price=col("list_price")
    ).alias("analysis_prompt")
)

Null Handling with strict Parameter

strict=True (default) - If any referenced variable is null, the rendered output for that row is null:

python
strict_prompts = df.select(
    jinja(
        "User: {{ username }} | Email: {{ email }}",
        strict=True,
        username=col("user"),
        email=col("contact_email")
    ).alias("user_info")
)

strict=False - Null values render as empty, letting conditionals skip missing fields:

python
permissive_prompts = df.select(
    jinja(
        "{% if email %}Email: {{ email }}{% endif %}",
        strict=False,
        email=col("contact_email")
    ).alias("user_info")
)

Conditional Prompt Generation

python
template = """
{% if context %}
Context: {{ context }}
{% endif %}

Question: {{ query }}

{% if examples %}
Examples:
{% for ex in examples %}
Q: {{ ex.question }}
A: {{ ex.answer }}
{% endfor %}
{% endif %}
"""

formatted = df.select(
    jinja(
        template,
        query=col("user_question"),
        context=col("retrieved_context"),
        examples=col("few_shot_examples")
    ).alias("complete_prompt")
)

Column Expressions in Templates

Compute values before template rendering:

python
from fenic.api.functions.builtin import when

result = df.select(
    jinja(
        "Summarize this {{ content_type }} content: {{ text }}. Provide a {{ style }} summary.",
        text=col("document_text"),
        content_type=when(col("format") == "md", "markdown")
                      .when(col("format") == "html", "HTML")
                      .otherwise("plain text"),
        style=when(col("length") > 1000, "detailed")
                .otherwise("concise")
    ).alias("llm_prompt")
)

What's NOT Supported

Use column expressions instead of these Jinja features:

Jinja Feature                    | Use Instead
Filters such as {{ name|upper }} | Pre-compute the value in a column expression
{{ len(items) }}                 | item_count=fc.array_size(col("items"))
{% if price > 100 %}             | is_expensive=(col("price") > 100)
{{ price * quantity }}           | total=col("price") * col("quantity")
{{ items[i] }} (dynamic index)   | item=fc.col("items").get_item(col("index"))
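
In other words, pre-compute anything the template engine cannot derive and pass it in as a plain variable. A minimal sketch combining the replacements from the table above (the items, price, and quantity columns are assumed for illustration):

python
import fenic as fc
from fenic.api.functions.core import col
from fenic.api.functions.text import jinja

# Pre-compute what the template cannot: counts, arithmetic, comparisons.
prompts = df.select(
    jinja(
        "{{ item_count }} items, total ${{ total }}{% if is_expensive %} (high value){% endif %}",
        item_count=fc.array_size(col("items")),    # replaces {{ len(items) }}
        total=col("price") * col("quantity"),      # replaces {{ price * quantity }}
        is_expensive=(col("price") > 100)          # replaces {% if price > 100 %}
    ).alias("order_prompt")
)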

Jinja in Semantic Operations

All semantic operators accept Jinja templates for prompt construction.

semantic.map

Apply generation prompts to rows:

python
import fenic.api.functions.semantic as semantic

descriptions = df.with_column(
    "description",
    semantic.map(
        "Write a compelling description for {{ name }}: {{ details }}",
        name=col("product_name"),
        details=col("product_specs")
    )
)

semantic.predicate

Evaluate boolean conditions:

python
urgent_tickets = df.filter(
    semantic.predicate(
        """
        Subject: {{ subject }}
        Body: {{ body }}
        This ticket indicates an urgent issue requiring immediate attention.
        """,
        subject=col("ticket_subject"),
        body=col("ticket_body")
    )
)

semantic.join

Match rows between DataFrames:

python
matches = df_candidates.semantic.join(
    other=df_jobs,
    predicate="""
    Candidate Background: {{ left_on }}
    Job Requirements: {{ right_on }}
    The candidate possesses the qualifications for this position.
    """,
    left_on=col("candidate_resume"),
    right_on=col("job_requirements")
)

semantic.reduce

Aggregate with context:

python
summaries = df.group_by("department", "quarter").agg(
    semantic.reduce(
        "Summarize these {{ department }} reports from {{ quarter }}",
        col("report_text"),
        group_context={
            "department": col("department"),
            "quarter": col("quarter")
        }
    ).alias("summary")
)

Combining Fuzzy Matching and Jinja Templates

Two-Stage Matching Pipeline

Use fuzzy matching for pre-filtering, then semantic validation:

python
# Stage 1: Fuzzy pre-filter
candidates = (
    df_products.with_column(
        "fuzzy_score",
        compute_fuzzy_ratio(
            col("product_name"),
            col("reference_name"),
            method="jaro_winkler"
        )
    )
    .filter(col("fuzzy_score") > 70)
)

# Stage 2: Semantic validation
final_matches = candidates.filter(
    semantic.predicate(
        """
        Product: {{ product_name }}
        Reference: {{ reference_name }}
        These products are functionally equivalent despite name differences.
        """,
        product_name=col("product_name"),
        reference_name=col("reference_name")
    )
)

Pre-filtering this way typically reduces LLM calls by 60-80% compared to semantic-only matching.

Entity Resolution with Fuzzy Scores

Incorporate fuzzy scores into semantic reasoning:

python
resolved = df.with_column(
    "name_similarity",
    compute_fuzzy_ratio(
        col("customer_name"),
        col("canonical_name"),
        method="damerau_levenshtein"
    )
).with_column(
    "resolution_decision",
    semantic.map(
        """
        Customer Name: {{ customer_name }}
        Canonical Name: {{ canonical_name }}
        Fuzzy Similarity: {{ similarity }}

        Determine if these refer to the same entity. Consider:
        - Name variations (abbreviations, nicknames)
        - Common misspellings
        - The similarity score as supporting evidence

        Return: MATCH or NO_MATCH
        """,
        customer_name=col("customer_name"),
        canonical_name=col("canonical_name"),
        similarity=col("name_similarity")
    )
)

Score-Based Template Logic

Adapt prompts based on fuzzy scores:

python
classified = (
    df.with_column(
        "similarity",
        compute_fuzzy_ratio(col("product_name"), col("catalog_name"), method="levenshtein")
    )
    .with_column(
        "match_type",
        semantic.map(
            """
            {% if similarity > 90 %}
            High confidence match detected.
            {% elif similarity > 70 %}
            Moderate similarity - verify match carefully.
            {% else %}
            Low similarity - likely different products.
            {% endif %}

            Product: {{ product_name }}
            Catalog: {{ catalog_name }}

            Classify as: EXACT_MATCH, VARIANT_MATCH, or NO_MATCH
            """,
            similarity=col("similarity"),
            product_name=col("product_name"),
            catalog_name=col("catalog_name")
        )
    )
)

Production Implementation Patterns

Customer Deduplication

python
# Self-join for duplicate detection
pairs = df_customers.crossJoin(
    df_customers.alias("other")
).filter(col("customer_id") != col("other.customer_id"))

# Fuzzy filter
candidates = pairs.with_column(
    "name_similarity",
    compute_fuzzy_ratio(
        col("customer_name"),
        col("other.customer_name"),
        method="jaro_winkler"
    )
).filter(col("name_similarity") > 85)

# Semantic validation
duplicates = candidates.filter(
    semantic.predicate(
        """
        Customer 1: {{ name1 }}, {{ email1 }}, {{ phone1 }}
        Customer 2: {{ name2 }}, {{ email2 }}, {{ phone2 }}

        These records represent the same customer.
        """,
        name1=col("customer_name"),
        email1=col("email"),
        phone1=col("phone"),
        name2=col("other.customer_name"),
        email2=col("other.email"),
        phone2=col("other.phone")
    )
)

Product Catalog Matching

python
catalog_matching = (
    df_vendor_products
    .with_column(
        "name_score",
        compute_fuzzy_ratio(col("vendor_product_name"), col("master_product_name"), method="levenshtein")
    )
    .with_column(
        "sku_score",
        compute_fuzzy_ratio(col("vendor_sku"), col("master_sku"), method="hamming")
    )
    .filter((col("name_score") > 75) | (col("sku_score") > 80))
    .with_column(
        "match_confidence",
        semantic.map(
            """
            Vendor Product: {{ vendor_name }}
            Master Product: {{ master_name }}
            Name Similarity: {{ name_score }}
            SKU Similarity: {{ sku_score }}

            Assess whether these products are the same.
            Return: HIGH_CONFIDENCE, MEDIUM_CONFIDENCE, or LOW_CONFIDENCE
            """,
            vendor_name=col("vendor_product_name"),
            master_name=col("master_product_name"),
            name_score=col("name_score"),
            sku_score=col("sku_score")
        )
    )
)

Address Standardization

python
validated = (
    df_addresses
    .with_column(
        "street_similarity",
        compute_fuzzy_ratio(col("street_address"), col("canonical_street"), method="damerau_levenshtein")
    )
    .with_column(
        "city_similarity",
        compute_fuzzy_ratio(col("city"), col("canonical_city"), method="jaro_winkler")
    )
    .filter((col("street_similarity") > 80) & (col("city_similarity") > 85))
    .filter(
        semantic.predicate(
            """
            Address: {{ street }}, {{ city }}, {{ state }} {{ zip }}
            Canonical: {{ canonical_street }}, {{ canonical_city }}, {{ canonical_state }} {{ canonical_zip }}
            Street Match: {{ street_score }} | City Match: {{ city_score }}

            These addresses refer to the same physical location.
            """,
            street=col("street_address"),
            city=col("city"),
            state=col("state"),
            zip=col("zip_code"),
            canonical_street=col("canonical_street"),
            canonical_city=col("canonical_city"),
            canonical_state=col("canonical_state"),
            canonical_zip=col("canonical_zip"),
            street_score=col("street_similarity"),
            city_score=col("city_similarity")
        )
    )
)

Performance Optimization

Filter Before Semantic Operations

Apply deterministic filters first to reduce LLM calls:

python
# Efficient: Filter early
filtered = (
    df.filter(col("text_length") > 100)
      .filter(compute_fuzzy_ratio(col("name"), "target", method="jaro") > 80)
      .filter(semantic.predicate("{{ text }} contains technical content", text=col("content")))
)

Cache Expensive Operations

Cache after fuzzy matching or semantic operations:

python
fuzzy_cached = (
    df.with_column(
        "similarity",
        compute_fuzzy_ratio(col("name1"), col("name2"), method="levenshtein")
    )
    .cache()
)

# Multiple operations use cached results
high_matches = fuzzy_cached.filter(col("similarity") > 90)
medium_matches = fuzzy_cached.filter((col("similarity") > 70) & (col("similarity") <= 90))

Multi-Method Matching

Combine multiple algorithms for robust matching:

python
from fenic.api.functions.builtin import greatest

multi_method = (
    df.with_column("lev_score", compute_fuzzy_ratio(col("n1"), col("n2"), method="levenshtein"))
      .with_column("jaro_score", compute_fuzzy_ratio(col("n1"), col("n2"), method="jaro_winkler"))
      .with_column("indel_score", compute_fuzzy_ratio(col("n1"), col("n2"), method="indel"))
      .with_column("max_score", greatest(col("lev_score"), col("jaro_score"), col("indel_score")))
      .filter(col("max_score") > 80)
)

Best Practices

Threshold Calibration

  • Start with threshold of 80 for Jaro-Winkler on names
  • Use 85-90 for Levenshtein on general text
  • Test with labeled data to optimize precision/recall (see the calibration sketch after this list)
  • Lower thresholds for fuzzy pre-filtering (70-75)
  • Higher thresholds for final matching (85-95)
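
A minimal calibration sketch for the labeled-data step: score a small hand-labeled sample with compute_fuzzy_ratio, then sweep candidate thresholds in plain Python. The sample data and the to_pydict() materialization call are assumptions; substitute whichever collection method your Fenic version exposes.

python
from fenic.api.functions.core import col
from fenic.api.functions.text import compute_fuzzy_ratio

# Hypothetical labeled sample: each row carries a ground-truth is_match label.
labeled = session.create_dataframe({
    "name_a": ["John Smith", "Jon Smyth", "Jane Doe", "Acme Corp"],
    "name_b": ["John Smith", "John Smith", "Janet Door", "ACME Corporation"],
    "is_match": [True, True, False, True],
})

scored = labeled.with_column(
    "score",
    compute_fuzzy_ratio(col("name_a"), col("name_b"), method="jaro_winkler")
)

# Materialize the small sample and sweep thresholds.
# to_pydict() is an assumption; adjust to your version's collection API.
rows = scored.to_pydict()
pairs = list(zip(rows["score"], rows["is_match"]))

for threshold in (70, 75, 80, 85, 90, 95):
    tp = sum(1 for score, label in pairs if score >= threshold and label)
    fp = sum(1 for score, label in pairs if score >= threshold and not label)
    fn = sum(1 for score, label in pairs if score < threshold and label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")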

Template Testing

Test templates on small datasets during development:

python
test_df = session.create_dataframe({
    "name": ["Product A", "Product B"],
    "price": [100, 200]
})

test_df.select(
    jinja(
        "Product: {{ name }} | Price: ${{ price }}",
        name=col("name"),
        price=col("price")
    ).alias("test_prompt")
).show()

Schema Validation

Validate columns exist before template application:

python
required_columns = {"product_name", "price", "description"}
available_columns = set(df.columns)

if not required_columns.issubset(available_columns):
    missing = required_columns - available_columns
    raise ValueError(f"Missing required columns: {missing}")

Monitor Score Distributions

Track fuzzy score distributions in production:

python
from fenic.api.functions.builtin import mean, stddev, min, max

score_stats = (
    df.with_column("similarity", compute_fuzzy_ratio(col("f1"), col("f2"), method="levenshtein"))
      .select(
          mean(col("similarity")).alias("avg_similarity"),
          stddev(col("similarity")).alias("stddev_similarity"),
          min(col("similarity")).alias("min_similarity"),
          max(col("similarity")).alias("max_similarity")
      )
)

Unexpected shifts indicate:

  • Data quality degradation
  • Format changes in source systems
  • New data sources with different patterns
  • Need for threshold adjustment

Setup and Configuration

Install Fenic:

bash
pip install fenic

Configure session with model providers:

python
import fenic as fc
from fenic.api.session.session import Session
from fenic.api.session.config import SessionConfig, SemanticConfig, OpenAILanguageModel

config = SessionConfig(
    semantic=SemanticConfig(
        language_models={
            "gpt": OpenAILanguageModel(
                model_name="gpt-4.1-nano",
                rpm=500,
                tpm=200_000
            )
        },
        default_language_model="gpt"
    )
)

session = Session.get_or_create(config)

Create DataFrame and apply operations:

python
from fenic.api.functions.text import compute_fuzzy_ratio

df = session.create_dataframe({
    "name": ["John Smith", "Jon Smyth", "Jane Doe"],
    "reference": ["John Smith", "John Smith", "Jane Doe"]
})

result = df.with_column(
    "similarity",
    compute_fuzzy_ratio(col("name"), col("reference"), method="jaro_winkler")
)

result.show()

Key Takeaways

  • Use fuzzy matching for efficient pre-filtering before semantic operations
  • Jinja templates provide maintainable prompt engineering in DataFrame pipelines
  • Combine both techniques for 60-80% cost reduction versus semantic-only approaches
  • Cache intermediate results after expensive operations
  • Calibrate thresholds using labeled test data
  • Monitor score distributions in production for data quality signals
