Ship a Practical AI Content Pipeline for Recommender Systems, Agents, and Search Tools Using fenic
A practical, agent-ready workflow for turning raw content/articles into labeled clusters, intent, grounded summaries, and exportable features with fenic DataFrames and semantic operators for content intelligence.
Most teams hit the same wall with their content:
You’ve shipped hundreds of blog posts, tutorials, and case studies over the years. Some of them still convert. Some are outdated. Some overlap so hard they cannibalize each other. And when someone asks a simple question like:
“What are our best beginner tutorials on clustering with code?”
…you open a spreadsheet, search the blog, and start skimming one page at a time.
This guide walks through a different approach.
Instead of manually tagging and skimming, we:
- Ingest a small corpus of articles (titles, URLs, text snippets).
- Enrich each row with cheap features like length and “has code.”
- Use semantic operators to add embeddings and cluster labels.
- Classify each article’s narrative intent (tutorial, thinkpiece, case study, etc.).
- Estimate a complexity bucket (beginner/intermediate/advanced).
- Build a cluster report with exemplars and LLM-generated summaries.
- Export a clean feature table for BI tools, agents, or downstream ranking.
All of it is built with fenic: an opinionated, PySpark-inspired DataFrame framework designed for AI and agentic applications, with first-class semantic operators like semantic.classify, semantic.extract, and clustering built in.
If you’re responsible for docs, developer education, or marketing content, this is the kind of “content intelligence” you can put to work right away.
TL;DR
- Treat content analysis like data engineering: DataFrames + semantic operators, not ad-hoc scripts.
- Cluster semantically, label narrative intent (tutorial / announcement / explainer / opinion / case study), and bucket complexity (intro / intermediate / advanced).
- Export two artifacts that unlock search, recommendations, and planning:
  - features.(parquet|csv) – per-article features (intent, complexity, has_code, cluster label, etc.)
  - cluster_report.csv – ~10 themes with exemplars and grounded 1–3 bullet summaries
- The notebook is runnable end-to-end. Bring your corpus, press play, and adapt.
Try it yourself:
For the code and copy-paste-ables, check the Colab and demo repo ⬇️
This post focuses on the why and what, so you can decide quickly and click through.
What "done" looks like
When the notebook finishes, you’ll have:
1) A unified feature table (one row per article)
- url, title, body_clip (short snippet)
- char_len, clip_len
- has_code (boolean)
- complexity_bucket (intro/intermediate/advanced)
- intent (tutorial/how-to, news/announcement, opinion/thinkpiece, research/explainer, case-study/showcase)
- Optional: cluster, cluster_label (human-friendly theme name)
2) A compact cluster report (one row per theme)
- cluster id + cluster_label
- count (how many articles live here)
- exemplar_title, exemplar_url (closest to centroid)
- cluster_summary: 1–3 grounded bullets on audience + tone
With just those two files you can ask:
- “Show beginner tutorials that include code in the K-Means theme.”
- “What are our top 10 clusters by count?”
- “Find opinion pieces about AI in production that pair with this launch.”
From an agent’s perspective:
“Given this developer’s last 3 reads, suggest 2 more articles that match their level and intent.”
The nice part is: all of this emerges from one fenic pipeline (load → enrich → embed → cluster → classify → export).
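For a taste of how direct these questions become, here is a minimal sketch of the first one as a filter over features.parquet, shown with pandas. Column names follow the feature table above; the cluster label “K-Means” is a stand-in for whatever your own run produces.

```python
import pandas as pd

features = pd.read_parquet("features.parquet")

# “Show beginner tutorials that include code in the K-Means theme.”
hits = features[
    (features["complexity_bucket"] == "intro")
    & (features["intent"] == "tutorial/how-to")
    & (features["has_code"])
    & (features["cluster_label"].str.contains("K-Means", case=False, na=False))
]

print(hits[["title", "url"]].to_string(index=False))
```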
Why fenic is a good fit for this problem
Content intelligence is messy. You deal with:
- Unstructured text (markdown, HTML, scraped pages)
- A mix of cheap heuristics (string length, regex) and semantic signals (embeddings, classification)
- The need to batch model calls and keep costs predictable
You could wire this together with Pandas, a few bespoke scripts, and some async model calls. But it gets hard to:
- Explain what’s happening
- Reproduce the run later
- Move from notebook to scheduled job to agentic tools
fenic’s value here is that everything is a DataFrame transformation:
- semantic.embed to add an emb column
- semantic.classify for narrative intent
- semantic.extract to produce typed summaries
- write.parquet / write.csv for export
You describe the pipeline once, and it can run in your notebook today and as a scheduled job later. There is no need for custom batching, model SDK glue, or a separate ETL job.
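As a rough sketch, the whole thing reads like a chain of DataFrame transformations. Operator names come straight from the list above, but session/model configuration and the exact signatures are simplified here; check docs.fenic.ai for the current API before running.

```python
# Sketch only: signatures approximated; model aliases for semantic ops
# would be configured in SessionConfig and are omitted here.
import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig(app_name="content_intel"))
df = session.read.csv("articles.csv")  # expects url, title, body (+ body_clip) columns

INTENTS = ["tutorial/how-to", "news/announcement", "opinion/thinkpiece",
           "research/explainer", "case-study/showcase"]

df = (
    df.with_column("emb", fc.semantic.embed(fc.col("body_clip")))                 # embeddings
      .with_column("intent", fc.semantic.classify(fc.col("body_clip"), INTENTS))  # narrative intent
)

df.write.parquet("features.parquet")  # export for BI tools, agents, ranking
```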
Architecture at a glance
The notebook is deliberately small and linear. Here’s the flow you’ll run in Colab:
1. Load + normalize
   Bring a CSV or table with at least url, title, body. Create a short snippet (body_clip) that keeps model calls cheap and deterministic.
2. Cheap features first
   Compute char_len, clip_len, and a simple has_code flag (look for code fences/backticks/language tags). These already answer useful questions before any embeddings.
3. Semantic clusters
   Embed body_clip and group articles into K coherent clusters. For each cluster:
   - Pick the exemplar (closest to centroid)
   - Generate a concise cluster label (human-readable)
4. Narrative intent
   Use few-shot semantic.classify to tag each article as tutorial / announcement / explainer / opinion / case study. Crucially, build examples from your corpus (not toy prompts) so labels reflect your house style.
5. Complexity buckets
   Bucket into intro/intermediate/advanced with a transparent rule over char_len and has_code.
6. Cluster summaries
   For the ~10 clusters, run grounded extraction to get 1–3 bullets on audience + tone (typed via a mini Pydantic schema). These bullets anchor editorial decisions and agent prompts.
7. Export
   Write features.parquet and cluster_report.csv with fenic’s native writers. Done.
All of the above is implemented in the Colab.
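To make steps 2, 5, and 6 concrete, here is roughly what the cheap heuristics and the typed summary target can look like in plain Python. The regex, the length thresholds, and the schema fields are illustrative defaults rather than the notebook’s exact values.

```python
import re
from pydantic import BaseModel, Field

# Code fences, inline-code tags, or common language keywords (illustrative pattern)
CODE_PATTERN = re.compile(r"```|~~~|<code>|\b(def |class |import |SELECT )")

def has_code(text: str) -> bool:
    """Cheap heuristic for the has_code flag; no model calls involved."""
    return bool(CODE_PATTERN.search(text))

def complexity_bucket(char_len: int, code: bool) -> str:
    """Transparent rule over length + has_code; thresholds are illustrative."""
    if char_len < 3_000 and not code:
        return "intro"
    if char_len < 8_000:
        return "intermediate"
    return "advanced"

class ClusterSummary(BaseModel):
    """Typed target for the grounded cluster-summary extraction (step 6)."""
    audience: str = Field(description="Who the cluster's articles are written for")
    tone: str = Field(description="Dominant tone, e.g. hands-on, opinionated, formal")
    bullets: list[str] = Field(description="1-3 grounded bullets describing the theme")
```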
Where this becomes useful
Once you have features.parquet and cluster_report.csv, there are several easy wins.
1. Smarter RAG and content search
Instead of feeding your LLM the “top 5 BM25 matches,” you can:
- Prefer tutorial/how-to articles when the question sounds like “how do I…?”
- Prefer case studies when the question is about business impact
- Filter to beginner-level content for newer users
You already have the features; you just plug them into your retrieval and ranking logic.
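A minimal sketch of that intent-aware bias, assuming a retriever that already returns scored candidates joined with the feature table. The function name, query heuristics, and boost multipliers are all hypothetical.

```python
def rerank(candidates: list[dict], query: str) -> list[dict]:
    """candidates: dicts with at least 'score', 'intent', 'complexity_bucket'."""
    q = query.lower()
    how_to_query = q.startswith(("how do i", "how to", "how can i"))
    impact_query = any(w in q for w in ("roi", "impact", "results"))

    def boosted(c: dict) -> float:
        s = c["score"]
        if how_to_query and c["intent"] == "tutorial/how-to":
            s *= 1.25   # prefer tutorials for "how do I..." questions
        if impact_query and c["intent"] == "case-study/showcase":
            s *= 1.25   # prefer case studies for business-impact questions
        return s

    return sorted(candidates, key=boosted, reverse=True)
```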
2. Content recommendations ("read next")
Given an article the user just read:
- Use the embedding and cluster to find semantically similar pieces
- Use intent and complexity to avoid repetitive or mismatched suggestions
Example:
If someone just finished a “beginner” explainer in cluster “K-Means basics,” recommend one “practitioner” tutorial and one related opinion piece from the same cluster.
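Sketched with numpy and pandas over the exported table; this assumes the emb vector column was kept in the export, and the weighting choices are illustrative.

```python
import numpy as np
import pandas as pd

features = pd.read_parquet("features.parquet")  # assumes 'emb' was kept in the export

def read_next(url: str, k: int = 2) -> pd.DataFrame:
    seed = features.loc[features["url"] == url].iloc[0]
    pool = features[(features["cluster"] == seed["cluster"]) & (features["url"] != url)].copy()

    # Cosine similarity against the article the reader just finished
    seed_vec = np.asarray(seed["emb"], dtype=float)
    mat = np.stack([np.asarray(v, dtype=float) for v in pool["emb"]])
    pool["sim"] = (mat @ seed_vec) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(seed_vec))

    # Match the reader's level and avoid suggesting two pieces with the same intent
    pool = pool[pool["complexity_bucket"] == seed["complexity_bucket"]]
    return (pool.sort_values("sim", ascending=False)
                .drop_duplicates("intent")
                .head(k)[["title", "url", "intent", "sim"]])
```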
3. Editorial analytics and gaps
Your content/product teams can now ask:
- “How many advanced tutorials do we have on observability?”
- “Are most of our opinion pieces clustered around the same theme?”
- “Which clusters have no case studies yet?”
Because everything is just a table, you can answer these with a few fenic group_by/agg calls or plug the parquet into your BI tool of choice.
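Each of those questions is one group-by away. For instance, with pandas over the exported parquet (intent and complexity values as defined earlier; the specific questions here are illustrative):

```python
import pandas as pd

features = pd.read_parquet("features.parquet")

# "How many advanced tutorials do we have per theme?"
advanced_tutorials = (
    features[(features["intent"] == "tutorial/how-to")
             & (features["complexity_bucket"] == "advanced")]
    .groupby("cluster_label").size().sort_values(ascending=False)
)

# "Which clusters have no case studies yet?"
with_case_studies = set(
    features.loc[features["intent"] == "case-study/showcase", "cluster_label"])
gaps = sorted(set(features["cluster_label"]) - with_case_studies)

print(advanced_tutorials)
print(gaps)
```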
4. Agent surfaces
Finally, this is a perfect substrate for agents.
You can expose read-only MCP tools like:
- list_articles(intent, complexity)
- similar_articles(url)
- cluster_overview(cluster_id)
Each tool is a deterministic fenic query over the feature table. Your agent becomes a thin layer that:
- Interprets the user’s request
- Calls the relevant tool(s)
- Renders the result in a friendly way
No need to let the model “hallucinate” your catalog; it just queries it.
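Each tool can be a few lines over the exported tables. A sketch with plain functions and pandas (column names follow the report described earlier; wrapping these in an actual MCP server is left to your agent stack):

```python
import pandas as pd

features = pd.read_parquet("features.parquet")
clusters = pd.read_csv("cluster_report.csv")  # adjust column names to your export

def list_articles(intent: str, complexity: str) -> list[dict]:
    rows = features[(features["intent"] == intent)
                    & (features["complexity_bucket"] == complexity)]
    return rows[["title", "url", "cluster_label"]].to_dict("records")

def cluster_overview(cluster_id: int) -> dict:
    row = clusters.loc[clusters["cluster"] == cluster_id].iloc[0]
    return {
        "label": row["cluster_label"],
        "count": int(row["count"]),
        "exemplar": {"title": row["exemplar_title"], "url": row["exemplar_url"]},
        "summary": row["cluster_summary"],
    }
```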
Cost, latency, and control
Because fenic handles batching and keeps the API surface minimal, it’s straightforward to reason about cost:
- Embeddings: the main paid step; we embed body_clip once per article
- Classification: one semantic.classify call per row, typically via a small “mini” LLM
- Extraction: only for ~10 cluster exemplars, so negligible
You can tune:
- The snippet length (body_clip)
- The subset of rows you classify (e.g., skip archival content)
- The model aliases used in your semantic_cfg
Everything else—feature engineering, clustering, exports—runs locally.
Adopting this pattern in your own stack
If you want to replicate this with your docs or blog:
- Load your corpus into fenic with at least url, title, and body.
- Add cheap features (char_len, has_code, maybe tags or product names).
- Use with_embeddings and with_cluster_labels to create semantic clusters.
- Summarise clusters with semantic.extract so humans can reason about them.
- Define a small, opinionated intent taxonomy and build few-shot examples from your corpus (sketched below).
- Add a lightweight complexity bucket signal that fits your audience.
- Export features.parquet and integrate it into RAG, search, agents, or analytics.
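The taxonomy step usually amounts to a short list of labels plus a handful of real snippets from your own archive. A sketch of that data shape (the snippet texts are placeholders; how you pass examples to semantic.classify depends on fenic’s current API):

```python
# Illustrative taxonomy + few-shot examples; replace the texts with real
# excerpts from your corpus so labels reflect your house style.
INTENTS = [
    "tutorial/how-to",
    "news/announcement",
    "opinion/thinkpiece",
    "research/explainer",
    "case-study/showcase",
]

FEW_SHOT_EXAMPLES = [
    {"text": "Step 1: install the SDK. Step 2: configure your API key...",
     "label": "tutorial/how-to"},
    {"text": "Today we're releasing v2.0 with streaming support...",
     "label": "news/announcement"},
    {"text": "After moving scoring to nightly batches, the team cut costs...",
     "label": "case-study/showcase"},
]
```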
You don’t need to adopt agents on day one. Even just having a clean content feature table will pay off in search, analytics, and planning.
💡 Maybe useful: From Threads to Themes – How We Built a Fast HN Research Agent with fenic + Pydantic-AI
Conclusion and closing thoughts
If your team is sitting on a large, messy content archive and you’re not sure what to do next, this pipeline is a very practical way to see the shape of what you already have.
fenic’s sweet spot is exactly this kind of work: turning unstructured text into structured, semantic, agent-ready tables without giving up the simplicity of a DataFrame pipeline.
fenic gives you the DataFrame-style ergonomics you’re used to, plus semantic tools that were built for this kind of AI-adjacent work.
The rest is just deciding what questions you care about and letting the pipeline answer them.
Try the demo, then make it yours
- Clone the demo: Use our Colab and point it at your own small corpus slice first. Keep body_clip short, then grow from there.
- Docs: docs.fenic.ai — semantic operators, text utilities, batch inference
- Examples: GitHub → typedef-ai/fenic (look for other applications and MCP tool demos).
