Exact string matching fails in production. Typos, variations, and formatting inconsistencies break traditional DataFrame operations. Fenic solves this with native fuzzy matching and Jinja templating built directly into the DataFrame API.
This guide shows you how to implement both techniques in Fenic, Typedef's PySpark-inspired DataFrame framework for AI applications.
## Fuzzy Matching with compute_fuzzy_ratio

### Six Algorithms Available

Fenic provides six fuzzy similarity algorithms through `compute_fuzzy_ratio`. Each returns a score from 0 to 100, where 100 means the strings are identical.
Levenshtein - Counts insertions, deletions, and substitutions. Use for general-purpose fuzzy matching.
```python
from fenic.api.functions.text import compute_fuzzy_ratio

compute_fuzzy_ratio(col("name1"), col("name2"), method="levenshtein")
```
Damerau-Levenshtein - Adds transposition detection to Levenshtein. Best for keyboard typos and user input.
```python
compute_fuzzy_ratio(col("input"), col("reference"), method="damerau_levenshtein")
```
Indel - Only counts insertions and deletions (no substitutions). Solves Longest Common Subsequence problems.
```python
compute_fuzzy_ratio(col("sequence1"), col("sequence2"), method="indel")
```
Jaro - Measures character matches and order. Effective for short strings like names.
```python
compute_fuzzy_ratio(col("first_name"), col("match_name"), method="jaro")
```
Jaro-Winkler - Weights common prefixes higher than Jaro. Best for name matching and identifiers.
```python
compute_fuzzy_ratio(col("customer"), col("canonical"), method="jaro_winkler")
```
Hamming - Counts differing positions in equal-length strings. Requires identical string lengths.
```python
compute_fuzzy_ratio(col("code1"), col("code2"), method="hamming")
```
### Basic Column-to-Column Matching

Compare two DataFrame columns row by row:

```python
from fenic.api.functions.core import col
from fenic.api.functions.text import compute_fuzzy_ratio

result = df.select(
    col("company_name"),
    col("entered_name"),
    compute_fuzzy_ratio(
        col("company_name"),
        col("entered_name"),
        method="levenshtein"
    ).alias("similarity_score")
)
```
### Column-to-Literal Matching

Compare a column against a fixed string:

```python
products = df.with_column(
    "match_score",
    compute_fuzzy_ratio(
        col("product_name"),
        "MacBook Pro",
        method="jaro_winkler"
    )
).filter(col("match_score") > 80)
```
### Algorithm Selection Guide
| Use Case | Algorithm | Reason |
|---|---|---|
| General text matching | Levenshtein | Balanced for all error types |
| User-entered forms | Damerau-Levenshtein | Handles transposition errors |
| Name matching | Jaro-Winkler | Prefix-weighted similarity |
| Fixed-length codes | Hamming | Position-based comparison |
| DNA sequences | Indel | Insertion/deletion only |
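The differences are easiest to see empirically. Below is a minimal sketch (sample pairs invented for illustration; assumes an active `session` as configured in the Setup section) that scores the same string pairs under several methods:

```python
from fenic.api.functions.core import col
from fenic.api.functions.text import compute_fuzzy_ratio

# Invented sample pairs for illustration
pairs = session.create_dataframe({
    "a": ["jonathan smith", "acme corp"],
    "b": ["johnathan smyth", "acme corporation"]
})

# Score each pair under several methods to compare their behavior
comparison = pairs.select(
    col("a"),
    col("b"),
    compute_fuzzy_ratio(col("a"), col("b"), method="levenshtein").alias("levenshtein"),
    compute_fuzzy_ratio(col("a"), col("b"), method="jaro_winkler").alias("jaro_winkler"),
    compute_fuzzy_ratio(col("a"), col("b"), method="indel").alias("indel")
)
comparison.show()
```

Jaro-Winkler tends to score the "acme corp" pair higher than Levenshtein because both strings share a long common prefix.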
### Deduplication Pattern

```python
from fenic.api.functions.builtin import first
from fenic.api.functions.text import compute_fuzzy_ratio

deduplicated = (
    df.with_column(
        "canonical_score",
        compute_fuzzy_ratio(
            col("customer_name"),
            col("canonical_name"),
            method="damerau_levenshtein"
        )
    )
    .filter(col("canonical_score") > 90)
    .group_by("canonical_name")
    .agg(first(col("customer_id")).alias("merged_id"))
)
```
## Jinja Templating with text.jinja

### Supported Template Syntax

Fenic supports a focused subset of Jinja2 features:
#### Variable Substitution

```
{{ variable_name }}
```

#### Struct Field Access

```
{{ user.profile.email }}
```

#### Array Indexing (literal indices only)

```
{{ items[0] }}
{{ data["key"] }}
```

#### Conditionals

```
{% if condition %}
content when true
{% else %}
content when false
{% endif %}
```

#### Loops

```
{% for item in items %}
Process: {{ item.name }}
{% endfor %}
```

#### Loop Variables

```
{{ loop.index }}
{{ loop.first }}
{{ loop.last }}
{{ loop.length }}
```
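Loops iterate over array columns at render time. A minimal sketch (the column name `feature_list` is assumed for illustration) that renders one numbered line per array element:

```python
from fenic.api.functions.core import col
from fenic.api.functions.text import jinja

# Render one numbered line per element of an array column
bullets = df.select(
    jinja(
        "{% for feature in features %}{{ loop.index }}. {{ feature }}\n{% endfor %}",
        features=col("feature_list")
    ).alias("feature_bullets")
)
```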
### Basic Template Rendering

```python
from fenic.api.functions.core import col
from fenic.api.functions.text import jinja

prompts = df.select(
    jinja(
        "Analyze this product: {{ name }} priced at {{ price }}",
        name=col("product_name"),
        price=col("list_price")
    ).alias("analysis_prompt")
)
```
### Null Handling with the strict Parameter

`strict=True` (default) - any null variable makes the entire output row null:

```python
strict_prompts = df.select(
    jinja(
        "User: {{ username }} | Email: {{ email }}",
        strict=True,
        username=col("user"),
        email=col("contact_email")
    ).alias("user_info")
)
```

`strict=False` - null values render as empty, so templates can guard them with conditionals:

```python
permissive_prompts = df.select(
    jinja(
        "{% if email %}Email: {{ email }}{% endif %}",
        strict=False,
        email=col("contact_email")
    ).alias("user_info")
)
```
### Conditional Prompt Generation

```python
template = """
{% if context %}
Context: {{ context }}
{% endif %}

Question: {{ query }}

{% if examples %}
Examples:
{% for ex in examples %}
Q: {{ ex.question }}
A: {{ ex.answer }}
{% endfor %}
{% endif %}
"""

formatted = df.select(
    jinja(
        template,
        query=col("user_question"),
        context=col("retrieved_context"),
        examples=col("few_shot_examples")
    ).alias("complete_prompt")
)
```
### Column Expressions in Templates

Compute values before template rendering:

```python
from fenic.api.functions.builtin import when

result = df.select(
    jinja(
        "Summarize this {{ content_type }} content: {{ text }}. Provide a {{ style }} summary.",
        text=col("document_text"),
        content_type=when(col("format") == "md", "markdown")
            .when(col("format") == "html", "HTML")
            .otherwise("plain text"),
        style=when(col("length") > 1000, "detailed").otherwise("concise")
    ).alias("llm_prompt")
)
```
### What's NOT Supported

Use column expressions instead of these Jinja features:

| Jinja Feature | Use Instead |
|---|---|
| `{{ name \| upper }}` | Pre-compute the transformed value as a column expression |
| `{{ len(items) }}` | `item_count=fc.array_size(col("items"))` |
| `{% if price > 100 %}` | `is_expensive=(col("price") > 100)` |
| `{{ price * quantity }}` | `total=col("price") * col("quantity")` |
| `{{ items[i] }}` | `item=fc.col("items").get_item(col("index"))` |
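In practice, compute the value with a column expression and pass the result in as a plain template variable. A minimal sketch (column names assumed for illustration):

```python
import fenic as fc
from fenic.api.functions.core import col
from fenic.api.functions.text import jinja

# Counts and arithmetic happen in column expressions, not in the template
order_summaries = df.select(
    jinja(
        "Order with {{ item_count }} items totaling {{ total }}",
        item_count=fc.array_size(col("items")),
        total=col("price") * col("quantity")
    ).alias("order_summary")
)
```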
## Jinja in Semantic Operations
All semantic operators accept Jinja templates for prompt construction.
### semantic.map

Apply generation prompts to rows:

```python
import fenic.api.functions.semantic as semantic

descriptions = df.with_column(
    "description",
    semantic.map(
        "Write a compelling description for {{ name }}: {{ details }}",
        name=col("product_name"),
        details=col("product_specs")
    )
)
```
### semantic.predicate

Evaluate boolean conditions:

```python
urgent_tickets = df.filter(
    semantic.predicate(
        """
        Subject: {{ subject }}
        Body: {{ body }}
        This ticket indicates an urgent issue requiring immediate attention.
        """,
        subject=col("ticket_subject"),
        body=col("ticket_body")
    )
)
```
### semantic.join

Match rows between DataFrames:

```python
matches = df_candidates.semantic.join(
    other=df_jobs,
    predicate="""
    Candidate Background: {{ left_on }}
    Job Requirements: {{ right_on }}
    The candidate possesses the qualifications for this position.
    """,
    left_on=col("candidate_resume"),
    right_on=col("job_requirements")
)
```
### semantic.reduce

Aggregate with context:

```python
summaries = df.group_by("department", "quarter").agg(
    semantic.reduce(
        "Summarize these {{ department }} reports from {{ quarter }}",
        col("report_text"),
        group_context={
            "department": col("department"),
            "quarter": col("quarter")
        }
    ).alias("summary")
)
```
## Combining Fuzzy Matching and Jinja Templates

### Two-Stage Matching Pipeline

Use fuzzy matching for pre-filtering, then semantic validation:

```python
# Stage 1: Fuzzy pre-filter
candidates = (
    df_products.with_column(
        "fuzzy_score",
        compute_fuzzy_ratio(
            col("product_name"),
            col("reference_name"),
            method="jaro_winkler"
        )
    )
    .filter(col("fuzzy_score") > 70)
)

# Stage 2: Semantic validation
final_matches = candidates.filter(
    semantic.predicate(
        """
        Product: {{ product_name }}
        Reference: {{ reference_name }}
        These products are functionally equivalent despite name differences.
        """,
        product_name=col("product_name"),
        reference_name=col("reference_name")
    )
)
```
This reduces LLM calls by 60-80% compared to semantic-only matching.
### Entity Resolution with Fuzzy Scores

Incorporate fuzzy scores into semantic reasoning:

```python
resolved = df.with_column(
    "name_similarity",
    compute_fuzzy_ratio(
        col("customer_name"),
        col("canonical_name"),
        method="damerau_levenshtein"
    )
).with_column(
    "resolution_decision",
    semantic.map(
        """
        Customer Name: {{ customer_name }}
        Canonical Name: {{ canonical_name }}
        Fuzzy Similarity: {{ similarity }}

        Determine if these refer to the same entity. Consider:
        - Name variations (abbreviations, nicknames)
        - Common misspellings
        - The similarity score as supporting evidence

        Return: MATCH or NO_MATCH
        """,
        customer_name=col("customer_name"),
        canonical_name=col("canonical_name"),
        similarity=col("name_similarity")
    )
)
```
### Score-Based Template Logic

Adapt prompts based on fuzzy scores. Because comparisons like `{% if similarity > 90 %}` are not supported inside templates (see above), pre-compute the confidence tier with `when` and substitute the result:

```python
from fenic.api.functions.builtin import when

classified = (
    df.with_column(
        "similarity",
        compute_fuzzy_ratio(col("product_name"), col("catalog_name"), method="levenshtein")
    )
    .with_column(
        "confidence_note",
        when(col("similarity") > 90, "High confidence match detected.")
        .when(col("similarity") > 70, "Moderate similarity - verify match carefully.")
        .otherwise("Low similarity - likely different products.")
    )
    .with_column(
        "match_type",
        semantic.map(
            """
            {{ confidence_note }}

            Product: {{ product_name }}
            Catalog: {{ catalog_name }}

            Classify as: EXACT_MATCH, VARIANT_MATCH, or NO_MATCH
            """,
            confidence_note=col("confidence_note"),
            product_name=col("product_name"),
            catalog_name=col("catalog_name")
        )
    )
)
```
## Production Implementation Patterns

### Customer Deduplication

```python
# Self-join for duplicate detection
pairs = df_customers.crossJoin(
    df_customers.alias("other")
).filter(col("customer_id") != col("other.customer_id"))

# Fuzzy filter
candidates = pairs.with_column(
    "name_similarity",
    compute_fuzzy_ratio(
        col("customer_name"),
        col("other.customer_name"),
        method="jaro_winkler"
    )
).filter(col("name_similarity") > 85)

# Semantic validation
duplicates = candidates.filter(
    semantic.predicate(
        """
        Customer 1: {{ name1 }}, {{ email1 }}, {{ phone1 }}
        Customer 2: {{ name2 }}, {{ email2 }}, {{ phone2 }}
        These records represent the same customer.
        """,
        name1=col("customer_name"),
        email1=col("email"),
        phone1=col("phone"),
        name2=col("other.customer_name"),
        email2=col("other.email"),
        phone2=col("other.phone")
    )
)
```
### Product Catalog Matching

```python
catalog_matching = (
    df_vendor_products
    .with_column(
        "name_score",
        compute_fuzzy_ratio(col("vendor_product_name"), col("master_product_name"), method="levenshtein")
    )
    .with_column(
        "sku_score",
        compute_fuzzy_ratio(col("vendor_sku"), col("master_sku"), method="hamming")
    )
    .filter((col("name_score") > 75) | (col("sku_score") > 80))
    .with_column(
        "match_confidence",
        semantic.map(
            """
            Vendor Product: {{ vendor_name }}
            Master Product: {{ master_name }}
            Name Similarity: {{ name_score }}
            SKU Similarity: {{ sku_score }}

            Assess whether these products are the same.
            Return: HIGH_CONFIDENCE, MEDIUM_CONFIDENCE, or LOW_CONFIDENCE
            """,
            vendor_name=col("vendor_product_name"),
            master_name=col("master_product_name"),
            name_score=col("name_score"),
            sku_score=col("sku_score")
        )
    )
)
```
### Address Standardization

```python
validated = (
    df_addresses
    .with_column(
        "street_similarity",
        compute_fuzzy_ratio(col("street_address"), col("canonical_street"), method="damerau_levenshtein")
    )
    .with_column(
        "city_similarity",
        compute_fuzzy_ratio(col("city"), col("canonical_city"), method="jaro_winkler")
    )
    .filter((col("street_similarity") > 80) & (col("city_similarity") > 85))
    .filter(
        semantic.predicate(
            """
            Address: {{ street }}, {{ city }}, {{ state }} {{ zip }}
            Canonical: {{ canonical_street }}, {{ canonical_city }}, {{ canonical_state }} {{ canonical_zip }}
            Street Match: {{ street_score }} | City Match: {{ city_score }}
            These addresses refer to the same physical location.
            """,
            street=col("street_address"),
            city=col("city"),
            state=col("state"),
            zip=col("zip_code"),
            canonical_street=col("canonical_street"),
            canonical_city=col("canonical_city"),
            canonical_state=col("canonical_state"),
            canonical_zip=col("canonical_zip"),
            street_score=col("street_similarity"),
            city_score=col("city_similarity")
        )
    )
)
```
## Performance Optimization

### Filter Before Semantic Operations

Apply deterministic filters first to reduce LLM calls:

```python
# Efficient: filter early so semantic.predicate only sees surviving rows
filtered = (
    df.filter(col("text_length") > 100)
    .filter(compute_fuzzy_ratio(col("name"), "target", method="jaro") > 80)
    .filter(semantic.predicate("{{ text }} contains technical content", text=col("content")))
)
```
### Cache Expensive Operations

Cache after fuzzy matching or semantic operations:

```python
fuzzy_cached = (
    df.with_column(
        "similarity",
        compute_fuzzy_ratio(col("name1"), col("name2"), method="levenshtein")
    )
    .cache()
)

# Multiple downstream operations reuse the cached results
high_matches = fuzzy_cached.filter(col("similarity") > 90)
medium_matches = fuzzy_cached.filter((col("similarity") > 70) & (col("similarity") <= 90))
```
### Multi-Method Matching

Combine multiple algorithms for robust matching:

```python
from fenic.api.functions.builtin import greatest

multi_method = (
    df.with_column("lev_score", compute_fuzzy_ratio(col("n1"), col("n2"), method="levenshtein"))
    .with_column("jaro_score", compute_fuzzy_ratio(col("n1"), col("n2"), method="jaro_winkler"))
    .with_column("indel_score", compute_fuzzy_ratio(col("n1"), col("n2"), method="indel"))
    .with_column("max_score", greatest(col("lev_score"), col("jaro_score"), col("indel_score")))
    .filter(col("max_score") > 80)
)
```
## Best Practices

### Threshold Calibration
- Start with threshold of 80 for Jaro-Winkler on names
- Use 85-90 for Levenshtein on general text
- Test with labeled data to optimize precision/recall (see the sweep sketch after this list)
- Lower thresholds for fuzzy pre-filtering (70-75)
- Higher thresholds for final matching (85-95)
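A simple way to calibrate is a threshold sweep over labeled pairs. The sketch below is hedged: it assumes a `labeled_pairs` DataFrame with `name_a`, `name_b`, and a boolean `is_match` column, and that the DataFrame exposes a `count()` action (Fenic is PySpark-inspired, but verify against your version):

```python
from fenic.api.functions.core import col
from fenic.api.functions.text import compute_fuzzy_ratio

# Score every labeled pair once, cache, then sweep candidate thresholds
scored = labeled_pairs.with_column(
    "score",
    compute_fuzzy_ratio(col("name_a"), col("name_b"), method="jaro_winkler")
).cache()

total_matches = scored.filter(col("is_match")).count()  # count() assumed to materialize a row count

for threshold in [70, 75, 80, 85, 90, 95]:
    predicted = scored.filter(col("score") > threshold)
    true_pos = predicted.filter(col("is_match")).count()
    precision = true_pos / max(predicted.count(), 1)
    recall = true_pos / max(total_matches, 1)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

Pick the lowest threshold that keeps precision acceptable; for pre-filtering stages, favor recall instead.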
### Template Testing

Test templates on small datasets during development:

```python
test_df = session.create_dataframe({
    "name": ["Product A", "Product B"],
    "price": [100, 200]
})

test_result = test_df.select(
    jinja(
        "Product: {{ name }} | Price: ${{ price }}",
        name=col("name"),
        price=col("price")
    ).alias("rendered")
)
test_result.show()
```
### Schema Validation

Validate columns exist before template application:

```python
required_columns = {"product_name", "price", "description"}
available_columns = set(df.columns)

if not required_columns.issubset(available_columns):
    missing = required_columns - available_columns
    raise ValueError(f"Missing required columns: {missing}")
```
### Monitor Score Distributions

Track fuzzy score distributions in production:

```python
from fenic.api.functions.builtin import mean, stddev, min, max

score_stats = (
    df.with_column("similarity", compute_fuzzy_ratio(col("f1"), col("f2"), method="levenshtein"))
    .select(
        mean(col("similarity")).alias("avg_similarity"),
        stddev(col("similarity")).alias("stddev_similarity"),
        min(col("similarity")).alias("min_similarity"),
        max(col("similarity")).alias("max_similarity")
    )
)
```
Unexpected shifts indicate:
- Data quality degradation
- Format changes in source systems
- New data sources with different patterns
- Need for threshold adjustment
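A lightweight guard is to compare the current average against a stored baseline. This sketch assumes the `score_stats` DataFrame above and a `to_pydict()` materialization method; swap in whatever collection method your Fenic version provides:

```python
# Hypothetical baseline captured from a previous healthy run
BASELINE_AVG = 87.0
MAX_DRIFT = 5.0

stats = score_stats.to_pydict()  # assumed materialization method
current_avg = stats["avg_similarity"][0]

if abs(current_avg - BASELINE_AVG) > MAX_DRIFT:
    # Possible causes: data quality issues, source format changes, new sources
    print(f"WARNING: average similarity drifted to {current_avg:.1f} (baseline {BASELINE_AVG})")
```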
## Setup and Configuration

Install Fenic:

```bash
pip install fenic
```

Configure a session with model providers:

```python
from fenic.api.session.session import Session
from fenic.api.session.config import SessionConfig, SemanticConfig, OpenAILanguageModel

config = SessionConfig(
    semantic=SemanticConfig(
        language_models={
            "gpt": OpenAILanguageModel(
                model_name="gpt-4.1-nano",
                rpm=500,
                tpm=200_000
            )
        },
        default_language_model="gpt"
    )
)

session = Session.get_or_create(config)
```

Create a DataFrame and apply operations:

```python
from fenic.api.functions.core import col
from fenic.api.functions.text import compute_fuzzy_ratio

df = session.create_dataframe({
    "name": ["John Smith", "Jon Smyth", "Jane Doe"],
    "reference": ["John Smith", "John Smith", "Jane Doe"]
})

result = df.with_column(
    "similarity",
    compute_fuzzy_ratio(col("name"), col("reference"), method="jaro_winkler")
)

result.show()
```
## Key Takeaways
- Use fuzzy matching for efficient pre-filtering before semantic operations
- Jinja templates provide maintainable prompt engineering in DataFrame pipelines
- Combine both techniques for 60-80% cost reduction versus semantic-only approaches
- Cache intermediate results after expensive operations
- Calibrate thresholds using labeled test data
- Monitor score distributions in production for data quality signals
## Additional Resources
- Fenic Open Source Announcement
- Build Reliable AI Pipelines with Semantic Operators
- Log Clustering and Triage with Fenic
- How Typedef Cut RudderStack's Triage Time by 95%
- Fenic GitHub Repository
- Fenic Documentation

