
AI Content Pipeline for Search, Recommenders & Agents with fenic

Typedef Team

Ship a Practical AI Content Pipeline for Recommender Systems, Agents, and Search Tools Using fenic

A practical, agent-ready workflow that turns raw articles into labeled clusters, intent tags, grounded summaries, and exportable features, built with fenic DataFrames and semantic operators for content intelligence.


Most teams hit the same wall with their content:

You’ve shipped hundreds of blog posts, tutorials, and case studies over the years. Some of them still convert. Some are outdated. Some overlap so heavily that they cannibalize each other. And when someone asks a simple question like:

“What are our best beginner tutorials on clustering with code?”

…you open a spreadsheet, search the blog, and start skimming one page at a time.

This guide walks through a different approach.

Instead of manually tagging and skimming, we:

  • Ingest a small corpus of articles (titles, URLs, text snippets).
  • Enrich each row with cheap features like length and “has code.”
  • Use semantic operators to add embeddings and cluster labels.
  • Classify each article’s narrative intent (tutorial, thinkpiece, case study, etc.).
  • Estimate a complexity bucket (beginner/intermediate/advanced).
  • Build a cluster report with exemplars and LLM-generated summaries.
  • Export a clean feature table for BI tools, agents, or downstream ranking.

All of it is built with fenic: an opinionated, PySpark-inspired DataFrame framework designed for AI and agentic applications, with first-class semantic operators like semantic.classify, semantic.extract, and clustering built in.
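
To make that concrete, here is roughly what the setup looks like. Treat this as a minimal sketch: the class and parameter names follow fenic’s documented patterns, but verify them against docs.fenic.ai for the release you’re on.

```python
# Minimal setup sketch -- names and signatures are assumptions to check against docs.fenic.ai.
import fenic as fc

config = fc.SessionConfig(
    app_name="content_intelligence",
    semantic=fc.SemanticConfig(
        language_models={
            # a small, cheap model alias used later by semantic.classify / semantic.extract
            "mini": fc.OpenAILanguageModel(model_name="gpt-4o-mini", rpm=300, tpm=200_000),
        },
    ),
)
session = fc.Session.get_or_create(config)

# Load the corpus like any other DataFrame source (expects url, title, body columns).
articles = session.read.csv("articles.csv")
```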

If you’re responsible for docs, developer education, or marketing content, this is the kind of “content intelligence” you can put to work right away.

TL;DR

  • Treat content analysis like data engineering: DataFrames + semantic operators, not ad-hoc scripts.

  • Cluster semantically, label narrative intent (tutorial / announcement / explainer / opinion / case study), and bucket complexity (intro / intermediate / advanced).

  • Export two artifacts that unlock search, recommendations, and planning:

    • features.(parquet|csv) – per-article features (intent, complexity, has_code, cluster label, etc.)

    • cluster_report.csv – ~10 themes with exemplars and grounded 1–3 bullet summaries

  • The notebook is runnable end-to-end. Bring your corpus, press play, and adapt.

Try it yourself:

For the code and copy-pasteable snippets, check the Colab and demo repo ⬇️

This post focuses on the why and what, so you can decide quickly and click through.


What "done" looks like

When the notebook finishes, you’ll have:

1) A unified feature table (one row per article)

  • url, title, body_clip (short snippet)

  • char_len, clip_len

  • has_code (boolean)

  • complexity_bucket (intro/intermediate/advanced)

  • intent (tutorial/how-to, news/announcement, opinion/thinkpiece, research/explainer, case-study/showcase)

  • Optional: cluster, cluster_label (human-friendly theme name)

2) A compact cluster report (one row per theme)

  • cluster id + cluster_label

  • count (how many articles live here)

  • exemplar_title, exemplar_url (closest to centroid)

  • cluster_summary: 1–3 grounded bullets on audience + tone

With just those two files you can ask:

  • “Show beginner tutorials that include code in the K-Means theme.”

  • “What are our top 10 clusters by count?”

  • “Find opinion pieces about AI in production that pair with this launch.”

From an agent’s perspective:

“Given this developer’s last 3 reads, suggest 2 more articles that match their level and intent.”

The nice part is that all of this emerges from one fenic pipeline (load → enrich → embed → cluster → classify → export).
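
For instance, the first question above becomes a short filter over the exported table. The sketch below uses pandas for familiarity; the intent and complexity values mirror the taxonomy described above, and the cluster label string is a stand-in for whatever label your run produces.

```python
import pandas as pd

features = pd.read_parquet("features.parquet")

# "Show beginner tutorials that include code in the K-Means theme."
hits = features[
    (features["intent"] == "tutorial/how-to")
    & (features["complexity_bucket"] == "intro")
    & (features["has_code"])
    & (features["cluster_label"].str.contains("K-Means", case=False, na=False))
]
print(hits[["title", "url"]])
```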

Why fenic is a good fit for this problem

Content intelligence is messy. You deal with:

  • Unstructured text (markdown, HTML, scraped pages)
  • A mix of cheap heuristics (string length, regex) and semantic signals (embeddings, classification)
  • The need to batch model calls and keep costs predictable

You could wire this together with Pandas, a few bespoke scripts, and some async model calls. But it gets hard to:

  • Explain what’s happening
  • Reproduce the run later
  • Move from notebook to scheduled job to agentic tools

fenic’s value here is that everything is a DataFrame transformation:

You describe the pipeline once, and it can run in your notebook today and as a scheduled job later. There is no need for custom batching, model SDK glue, or a separate ETL job.

Architecture at a glance

The notebook is deliberately small and linear. Here’s the flow you’ll run in Colab:

  1. Load + normalize
    Bring a CSV or table with at least url, title, body. Create a short snippet (body_clip) that keeps model calls cheap and deterministic.

  2. Cheap features first
    Compute char_len, clip_len, and a simple has_code flag (look for code fences/backticks/language tags). These already answer useful questions before any embeddings; both rule-based steps are sketched below.

  3. Semantic clusters
    Embed body_clip and group articles into K coherent clusters. For each cluster:

    • Pick the exemplar (closest to centroid)

    • Generate a concise cluster label (human-readable)

  4. Narrative intent
    Use few-shot semantic.classify to tag each article as tutorial / announcement / explainer / opinion / case study. Crucially, build examples from your corpus (not toy prompts) so labels reflect your house style.

  5. Complexity buckets
    Bucket into intro / intermediate / advanced with a transparent rule over char_len and has_code.

  6. Cluster summaries
    For the ~10 clusters, run grounded extraction to get 1–3 bullets on audience + tone (typed via a mini Pydantic schema). These bullets anchor editorial decisions and agent prompts.

  7. Export
    Write features.parquet and cluster_report.csv with fenic’s native writers. Done.

All of the above is implemented in the Colab.
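
For reference, here is one way to express the two rule-based steps (cheap features and complexity buckets). The regex and thresholds are illustrative assumptions, not the exact values from the notebook; tune them for your corpus.

```python
import re

# Heuristic for "this article contains code": fences, tags, or common keywords.
FENCE = "`" * 3  # three backticks, assembled here to keep the snippet renderer-friendly
CODE_HINTS = re.compile(r"~~~|<code>|<pre>|\bdef |\bimport |\bSELECT\b")

def has_code(text: str) -> bool:
    text = text or ""
    return FENCE in text or bool(CODE_HINTS.search(text))

def complexity_bucket(char_len: int, has_code_flag: bool) -> str:
    """Transparent rule over length and code presence; thresholds are placeholders."""
    if char_len < 3_000 and not has_code_flag:
        return "intro"
    if char_len < 8_000:
        return "intermediate"
    return "advanced"
```

In the pipeline you would express the same logic as DataFrame column operations rather than row-by-row Python, but the rules themselves stay this simple.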

Where this becomes useful

Once you have features.parquet and cluster_report.csv, there are several easy wins.

1. Smarter RAG and content search

Instead of feeding your LLM the “top 5 BM25 matches,” you can:

  • Prefer tutorial/how-to articles when the question sounds like “how do I…?”
  • Prefer case studies when the question is about business impact
  • Filter to beginner-level content for newer users

You already have the features; you just plug them into your retrieval and ranking logic.
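
A minimal sketch of that ranking nudge, assuming your retriever already returns candidates joined with the feature table and scored in a score column (both assumptions for illustration):

```python
import pandas as pd

def rerank(candidates: pd.DataFrame, query: str) -> pd.DataFrame:
    """Nudge retrieval ranking using the exported content features."""
    boosted = candidates.copy()
    q = query.lower()
    if q.startswith("how do i") or q.startswith("how to"):
        boosted.loc[boosted["intent"] == "tutorial/how-to", "score"] *= 1.25
    if "impact" in q or "roi" in q:
        boosted.loc[boosted["intent"] == "case-study/showcase", "score"] *= 1.25
    return boosted.sort_values("score", ascending=False)
```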

2. Content recommendations ("read next")

Given an article the user just read:

  • Use the embedding and cluster to find semantically similar pieces
  • Use intent and complexity to avoid repetitive or mismatched suggestions

Example:

If someone just finished an intro-level explainer in the “K-Means basics” cluster, recommend one intermediate tutorial and one related opinion piece from the same cluster.
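
In table form, that policy is just a couple of filters over the same features. A sketch, assuming the columns exported earlier and keeping the rule deliberately simple (same cluster, same level, different intent):

```python
import pandas as pd

def read_next(features: pd.DataFrame, current_url: str, limit: int = 2) -> pd.DataFrame:
    """Suggest articles from the same cluster, at the same level, with a different intent."""
    current = features.loc[features["url"] == current_url].iloc[0]
    pool = features[
        (features["cluster"] == current["cluster"])
        & (features["url"] != current_url)
        & (features["complexity_bucket"] == current["complexity_bucket"])
        & (features["intent"] != current["intent"])
    ]
    return pool[["title", "url", "intent"]].head(limit)
```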

3. Editorial analytics and gaps

Your content/product teams can now ask:

  • “How many advanced tutorials do we have on observability?”
  • “Are most of our opinion pieces clustered around the same theme?”
  • “Which clusters have no case studies yet?”

Because everything is just a table, you can answer these with a few fenic group_by/agg calls or plug the parquet into your BI tool of choice.
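
Those questions reduce to a group-by over the same table. A pandas version is shown below; the fenic group_by/agg equivalent is a direct translation.

```python
import pandas as pd

features = pd.read_parquet("features.parquet")

# Theme x intent coverage: how many articles of each intent live in each cluster?
coverage = (
    features.groupby(["cluster_label", "intent"]).size().unstack(fill_value=0)
)
print(coverage)

# Clusters with no case studies yet.
col = "case-study/showcase"
if col in coverage.columns:
    print(coverage.index[coverage[col] == 0].tolist())
```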

4. Agent surfaces

Finally, this is a perfect substrate for agents.

You can expose read-only MCP tools like:

  • list_articles(intent, complexity)
  • similar_articles(url)
  • cluster_overview(cluster_id)

Each tool is a deterministic fenic query over the feature table. Your agent becomes a thin layer that:

  1. Interprets the user’s request
  2. Calls the relevant tool(s)
  3. Renders the result in a friendly way

No need to let the model “hallucinate” your catalog; it just queries it.
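
Each of those tools can start life as a plain function over the exported table; the MCP wiring on top is a separate step and out of scope here. A sketch of list_articles:

```python
import pandas as pd

FEATURES = pd.read_parquet("features.parquet")

def list_articles(intent: str | None = None, complexity: str | None = None, limit: int = 20) -> list[dict]:
    """Read-only catalog query an agent can call as a tool."""
    df = FEATURES
    if intent:
        df = df[df["intent"] == intent]
    if complexity:
        df = df[df["complexity_bucket"] == complexity]
    return df[["title", "url", "intent", "complexity_bucket"]].head(limit).to_dict("records")
```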

Cost, latency, and control

Because fenic handles batching and keeps the API surface minimal, it’s straightforward to reason about cost:

  • Embeddings: main paid step; we embed body_clip once per article
  • Classification: one semantic.classify call per row, typically via a small “mini” LLM
  • Extraction: only for ~10 cluster exemplars, so negligible

You can tune:

  • The snippet length (body_clip)
  • The subset of rows you classify (e.g., skip archival content)
  • The model aliases used in your semantic_cfg

Everything else—feature engineering, clustering, exports—runs locally.

Adopting this pattern in your own stack

If you want to replicate this with your docs or blog:

  1. Load your corpus into fenic with at least url, title, and body.
  2. Add cheap features (char_len, has_code, maybe tags or product names).
  3. Use with_embeddings and with_cluster_labels to create semantic clusters (sketched after this list).
  4. Summarize clusters with semantic.extract so humans can reason about them.
  5. Define a small, opinionated intent taxonomy and build few-shot examples from your corpus.
  6. Add a lightweight complexity bucket signal that fits your audience.
  7. Export features.parquet and integrate it into RAG, search, agents, or analytics.
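
Continuing from the articles DataFrame in the earlier setup sketch, here is a condensed sketch of steps 3–5. The operator names and signatures are assumptions to verify against docs.fenic.ai, and the cluster count, class list, and schema are placeholders.

```python
# Assumed fenic API shapes -- check operator names and signatures against docs.fenic.ai.
import fenic as fc
from pydantic import BaseModel, Field

INTENTS = [
    "tutorial/how-to", "news/announcement", "opinion/thinkpiece",
    "research/explainer", "case-study/showcase",
]

class ClusterSummary(BaseModel):
    """Typed schema for grounded cluster summaries."""
    audience: str = Field(description="Who these articles are written for")
    tone: str = Field(description="Overall tone, in a few words")

enriched = (
    articles  # loaded earlier; assumes a body_clip column and an embedding model on the session
    .with_column("embeddings", fc.semantic.embed(fc.col("body_clip")))
    .semantic.with_cluster_labels(by=fc.col("embeddings"), num_clusters=10)
    .with_column("intent", fc.semantic.classify(fc.col("body_clip"), INTENTS))
    # the notebook also seeds classification with few-shot examples from the corpus (omitted here)
)

# Grounded summaries for the exemplar rows (one representative article per cluster,
# chosen by distance to the centroid -- that selection step is omitted from this sketch).
summaries = exemplars.with_column(
    "cluster_summary", fc.semantic.extract(fc.col("body_clip"), ClusterSummary)
)
```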

You don’t need to adopt agents on day one. Even just having a clean content feature table will pay off in search, analytics, and planning.

Conclusion and closing thoughts

If your team is sitting on a large, messy content archive and you’re not sure what to do next, the pipeline is a very practical way to see the shape of what you already have.

fenic’s sweet spot is exactly this kind of work: turning unstructured text into structured, semantic, agent-ready tables without giving up the simplicity of a DataFrame pipeline.

fenic gives you the DataFrame-style ergonomics you’re used to, plus semantic tools that were built for this kind of AI-adjacent work.

The rest is just deciding what questions you care about and letting the pipeline answer them.

Try the demo, then make it yours

  • Clone the demo: Use our Colab and point it at your own small corpus slice first. Keep body_clip short, then grow from there.
  • Docs: docs.fenic.ai — semantic operators, text utilities, batch inference
  • Examples: GitHub → typedef-ai/fenic (look for other applications and MCP tool demos).