There’s a point in every incident where the problem isn’t finding logs but understanding them. Formats drift, error messages mutate, and even well-tuned dashboards can feel like they’re describing yesterday’s system.
In the middle of the scramble, engineers ask the same questions again and again:
What broke? How widespread is it? Where should we look first?
This article lays out a pragmatic, agent-first way to answer those questions with Fenic. Fenic is an opinionated, PySpark-inspired DataFrame framework built for AI/agentic applications.
You’ll learn how we turn raw logs into severity-aware clusters, expose read-only tools via MCP, and let a LangGraph agent respond to natural-language questions like “only errors in payments” or “show me the raw lines for cluster 7.”
The full runnable demo is in a Colab/Repo. In this blog, you’ll see the why and the what so you can decide quickly whether this fits your stack.
Try it now → Open the Colab | Repo → typedef-ai/fenic-examples | Docs → https://docs.fenic.ai/latest/
Why a Different Approach to Log Triage
Most of us have tried two extremes:
- Regex everywhere. It starts clean and ends brittle. One library upgrade or format tweak and your “error” dashboard quietly loses half its signal. You might not notice until the night you needed that signal most.
- One-click LLM summarization. Tempting, but operationally awkward: cost is difficult to predict, behavior drifts with prompts, and there’s no clean hand-off back into the tooling your SREs actually use.
The middle path is boring on purpose, but it works: Keep the DataFrame ergonomics engineers already use, add semantic operators where they pay off, and make the “AI bits” visible, testable, and schedulable.
That’s the ethos behind Fenic. You manipulate logs like a table, but you also get first-class text extraction, embeddings, and LLM utilities, all with explicit configuration and batch-friendly behavior.
The Promised End-state (What “Done” Looks Like)
Imagine asking your system:
- “Top clusters above WARN in the last hour.”
- “Only ERRORs for payment-api.”
- “Assignments for cluster 7—I want the raw lines.”
- “Coverage check—are we parsing reliably today?”
The agent answers in seconds because it isn’t guessing. It’s calling deterministic MCP tools backed by DataFrames you just computed:
- Severity-aware clusters of similar failures (stable across parameter noise).
- Exemplars that read like summaries, not stack dumps.
- Evidence via fingerprints so a human can sanity-check the grouping.
- Coverage: a simple metric that tells you whether your parsing kept up with reality.
There is no need to sift through pages of stack traces. The agent uses the read-only, queryable views you produced during the pipeline run.
Behind the scenes, you also keep operator-ready artifacts (CSV/JSON) for dashboards and a compact Markdown report for humans. Those artifacts are the audit trail; the agent is the interface.
Interested in Markdown Processing? Run our extensive example in Colab.
Architecture at a Glance (Why it Stays Maintainable)
The pipeline that powers those tools is deliberately small and linear. Each step is a DataFrame transform you can diff, test, and run headless.
1) Parse without brittleness
Real systems emit a mix of ISO-ish timestamps, syslog-style messages, Python logging lines, and occasionally JSON.
With Fenic, you define a few templates (named fields, not brittle capture groups), unnest them into candidate columns, and coalesce to a canonical schema:
- timestamp (normalize later if needed)
- level (INFO/WARN/ERROR/...)
- service (e.g., payment-api)
- message (payload after the header)
- optional trace_id
A line that doesn’t match any template doesn’t vanish; it falls back to raw text. That single choice makes coverage transparent and drift visible.
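To make the step concrete, here is a minimal, framework-agnostic sketch using Python regexes with named groups; the templates, sample lines, and field names are illustrative stand-ins for Fenic's template extraction, not its actual API:

```python
# Illustrative templates with named fields; a real pipeline would express these
# as Fenic template extractions over a DataFrame column.
import re

TEMPLATES = [
    # 2024-05-01T12:00:03Z ERROR payment-api TimeoutError on /v1/users after 3001ms
    re.compile(r"(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>[\w.-]+)\s+(?P<message>.*)"),
    # payment-api[312]: WARN: retrying charge, latency 912ms
    re.compile(r"(?P<service>[\w.-]+)\[\d+\]:\s+(?P<level>[A-Z]+):\s+(?P<message>.*)"),
]

def parse_line(line: str) -> dict:
    """Try each template in order; unmatched lines fall back to raw text."""
    for pattern in TEMPLATES:
        match = pattern.match(line)
        if match:
            return {"parsed": True, "raw": line, **match.groupdict()}
    return {"parsed": False, "raw": line}

sample = [
    "2024-05-01T12:00:03Z ERROR payment-api TimeoutError on /v1/users after 3001ms",
    "payment-api[312]: WARN: retrying charge, latency 912ms",
    '{"level": "info", "service": "checkout", "msg": "cart updated"}',  # needs a JSON template
]
rows = [parse_line(line) for line in sample]
print(f"coverage: {sum(r['parsed'] for r in rows) / len(rows):.0%}")  # 67% until you add JSON
```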
2) Fingerprint for stability
Clustering succeeds or fails on the grouping key. We want a fingerprint that ignores volatile tokens (IDs, ports, timings) but preserves the cause. A practical pattern is:
`service | symbol | file#function | stem`
- symbol: a recognizable error label (TimeoutError, OperationalError, unique_constraint_failed).
- file#function: optional call site (useful for stack-tracey languages).
- stem: a normalized message that keeps the idea of the failure while stripping noise.
You can extract these with rules or a tiny LLM call. Either way, the fingerprint stays explainable because you keep the pieces next to it in the table.
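As a rough illustration of the rule-based route, here is a sketch; the symbol list and normalization patterns are assumptions you would tune to your own stack:

```python
# Rule-based fingerprinting; the symbol list and regexes are assumptions to tune.
import re

KNOWN_SYMBOLS = ("TimeoutError", "OperationalError", "unique_constraint_failed",
                 "ConnectionRefusedError")

def normalize_stem(message: str) -> str:
    """Keep the idea of the failure, strip volatile tokens (timings, ids, ports)."""
    stem = re.sub(r"\b\d+(\.\d+)?(ms|s)\b", "Xms", message)  # timings
    stem = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", stem)         # hex ids / hashes
    stem = re.sub(r":\d{2,5}\b", ":<port>", stem)            # ports
    return re.sub(r"\d+", "N", stem).strip().lower()         # remaining numbers

def fingerprint(service: str, message: str, call_site: str = "-") -> str:
    symbol = next((s for s in KNOWN_SYMBOLS if s in message), "unknown")
    return " | ".join([service, symbol, call_site, normalize_stem(message)])

print(fingerprint("payment-api", "TimeoutError on /v1/users after 3001ms"))
# payment-api | TimeoutError | - | timeouterror on /vn/users after xms
```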
3) Tag severity deterministically
Before we embed anything, we separate signal from noise with cheap rules:
- Hard failures (level=ERROR/FATAL/CRIT, timeout, connection refused, HTTP 5xx, nginx upstream errors) → error
- Soft indicators (retry, latency, degraded) → warn
- Everything else → info
This stops INFO chatter from drowning out real problems and lets you cluster inside severity buckets. You can layer an LLM adjudicator later for ambiguous messages, but the baseline is deterministic and fast.
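A deterministic tagger can be as small as this sketch; the keyword lists are assumptions to extend for your stack:

```python
# Deterministic severity rules; keyword lists are assumptions to extend per stack.
import re

HARD_HINTS = ("timeout", "connection refused", "upstream error")
SOFT_HINTS = ("retry", "latency", "degraded")

def tag_severity(level: str | None, message: str) -> str:
    text = (message or "").lower()
    if (level or "").upper() in {"ERROR", "FATAL", "CRIT"}:
        return "error"
    if any(h in text for h in HARD_HINTS) or re.search(r"\bhttp[ /]?5\d\d\b", text):
        return "error"
    if any(h in text for h in SOFT_HINTS):
        return "warn"
    return "info"

tag_severity("INFO", "retrying charge, latency 912ms")          # -> "warn"
tag_severity("INFO", "upstream error: HTTP 502 from gateway")   # -> "error"
```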
4) Cluster semantically (per severity)
Now the semantic bit pays off. You embed enriched text, like:
`[svc:payment-api] [sev:error] TimeoutError on /v1/users after Xms`
Clustering within severity buckets keeps WARN chatter from bleeding into ERRORs. Choose K-Means (with a small heuristic for K) or a density approach (HDBSCAN) for larger bins within each severity.
For each cluster, pick the exemplar closest to the centroid; that exemplar often reads like the sentence you’d write in a postmortem.
The result is a handful of clusters, each with a count, an exemplar, and a fingerprint, which is precisely what humans need to triage.
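Here is a rough sketch of the step, with sentence-transformers and scikit-learn standing in for Fenic's embedding and clustering operators; the model name and the K heuristic are assumptions, and in practice you would dedupe on fingerprint before embedding to control cost:

```python
# Cluster one severity bucket and pick a centroid-closest exemplar per cluster.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_bucket(records: list[dict]) -> list[dict]:
    """records: one severity bucket; each row has 'enriched' text and 'fingerprint'."""
    texts = [r["enriched"] for r in records]   # "[svc:payment-api] [sev:error] TimeoutError ..."
    vectors = encoder.encode(texts, normalize_embeddings=True)
    k = max(2, min(12, int(np.sqrt(len(texts)))))            # small heuristic for K
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    clusters = []
    for cid in range(k):
        members = np.flatnonzero(km.labels_ == cid)
        # exemplar: the member closest to the centroid usually reads like a summary
        dists = np.linalg.norm(vectors[members] - km.cluster_centers_[cid], axis=1)
        exemplar = records[members[dists.argmin()]]
        clusters.append({"cluster_id": cid, "count": int(len(members)),
                         "exemplar": exemplar["enriched"],
                         "fingerprint": exemplar["fingerprint"]})
    return clusters
```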
5) Publish read-only tools
Persist the three useful views (triage, clusters, and assignments) and register MCP tools with Fenic:
- `list_clusters(severity_floor)` → ranked by severity-weighted score
- `clusters_by_severity(severity)` → a single lane
- `assignments_for_cluster(cluster_id)` → raw lines and evidence
- `coverage_metrics()` → processed vs. total lines
These are just queries over tables; they return rows, not plain-English (narrative) summaries.
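To show how deliberately unexciting these tools are, here is a sketch of a few of them as plain queries over persisted artifacts using pandas; the paths and column names are hypothetical, and the demo registers the equivalents as MCP tools:

```python
# Read-only tool queries over persisted views (paths/columns are illustrative).
import pandas as pd

SEVERITY_RANK = {"info": 0, "warn": 1, "error": 2}

clusters = pd.read_csv("artifacts/clusters.csv")         # cluster_id, severity, count, exemplar, fingerprint
assignments = pd.read_csv("artifacts/assignments.csv")   # cluster_id, raw, fingerprint
triage = pd.read_csv("artifacts/triage.csv")             # one row per input line, with a 'parsed' flag

def list_clusters(severity_floor: str = "warn") -> list[dict]:
    """Clusters at or above the floor, ranked by a severity-weighted score."""
    ranks = clusters["severity"].map(SEVERITY_RANK)
    eligible = clusters[ranks >= SEVERITY_RANK[severity_floor]].copy()
    eligible["score"] = eligible["count"] * eligible["severity"].map(SEVERITY_RANK)
    return eligible.sort_values("score", ascending=False).to_dict("records")

def assignments_for_cluster(cluster_id: int) -> list[dict]:
    """Raw lines and fingerprints assigned to one cluster."""
    return assignments[assignments["cluster_id"] == cluster_id].to_dict("records")

def coverage_metrics() -> dict:
    """Processed vs. total lines for today's run."""
    return {"total": int(len(triage)), "parsed": int(triage["parsed"].sum()),
            "coverage": float(triage["parsed"].mean())}
```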
A LangGraph agent sits on top and chooses the right tool based on the question.
Agent-first: the interface humans actually use
The trick to a good agent is primarily in well-chosen tools. With deterministic MCP calls, the agent becomes a thin orchestration layer:
- Question → tool selection (e.g., “top” → `list_clusters` with `warn`)
- Fetch rows → render clearly (markdown tables are enough)
- Drill-down → if the human asks, call `assignments_for_cluster` with the id in the current row
There’s no hidden magic. If the agent faces an ambiguous question, it could enrich a summary for the top 3 clusters with a small, cheap LLM. But the default behavior is to show the facts and let the responder decide.
This is why the agent replaces the daily digest: it answers the current question, at the moment of curiosity, with just enough detail to act.
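Wired together, the agent layer stays thin. This sketch assumes a recent LangGraph release, an OpenAI key, and the query functions from the earlier sketch; the model id and prompt are placeholders:

```python
# A thin ReAct agent over the read-only query functions (assumed available above).
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    model="openai:gpt-4o-mini",   # any chat model LangGraph can initialize
    tools=[list_clusters, assignments_for_cluster, coverage_metrics],
    prompt=("Answer log-triage questions by calling tools. "
            "Render rows as markdown tables and do not speculate beyond them."),
)

reply = agent.invoke(
    {"messages": [{"role": "user", "content": "top clusters above warn"}]}
)
print(reply["messages"][-1].content)
```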
What “Good” Looks Like in Production
After a week of daily runs, you'll notice:
- Coverage is stable (e.g., 80–90%). If it dips, that’s a crisp signal you need a new template or a parsing tweak.
- Clusters are consistent day-to-day. The exemplar for “connection timeouts to db-prod” still points at the same root cause. If a cluster explodes in size, you know where to look fast.
- On-call asks the agent first. Instead of searching dashboards, responders query: “top clusters above warn,” “only errors,” “assignments for #7.” They get compact tables, not walls of text.
- Artifacts fit right in. CSV/JSON slots into dashboards and data lakes; Markdown renders well in tickets and wikis.
The effect isn’t flashy, but it helps you respond with more velocity: less guesswork, more progress.
Why this Design Holds Up in Day-2 Reality
- Adoption stays high. DataFrame ergonomics mean less training and faster pull requests. You can add a template or tweak a rule without changing the whole system.
- Determinism where you want it. Severity, fingerprints, and clusters don’t change because the operators are explicit. If a cluster changes, you can usually point to the reason (new message variant, new template).
- Costs don’t surprise you. Deduplicate before embedding, embed once per unique fingerprint, and summarize only exemplars if you need narrative text. Your spend tracks distinct issues and not raw volume.
- It’s easy to run headless. The same notebook steps translate into a small script; artifacts are files; tools are queries. Nothing exotic.
- The agent is an interface, not a black box. It calls small, auditable tools over your tables. You can inspect the tables, test the tools, and trust the response.
Adopting this Incrementally
A common failure mode is trying to “agentify” the entire logging stack at once. Don’t. Here’s a low-friction rollout that respects your time and budget:
- Run the demo against a small, representative slice of your logs.
- Check coverage. If it’s low, add a template (JSON is a big win if you have it) and normalize service names (merge aliases).
- Turn on the agent and use it during on-call. Let people ask natural-language questions and see cluster movement.
- Stabilize cluster IDs if needed: enrich the fingerprint prefix or seed K-Means with yesterday’s centroids so exemplars stay familiar (see the sketch after this list).
- Only then consider LLM summaries for the top few clusters if responders want them. Keep it bounded.
- Integrate light observability. Track coverage and cluster counts; alert on sudden dips or spikes.
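For the cluster-ID stabilization item above, seeding today's K-Means with yesterday's centroids looks roughly like this; the artifact paths are hypothetical:

```python
# Seed K-Means with yesterday's centroids so cluster IDs and exemplars stay familiar.
import numpy as np
from sklearn.cluster import KMeans

prior_centroids = np.load("artifacts/centroids_yesterday.npy")   # shape (k, dim)

def cluster_with_seed(vectors: np.ndarray) -> KMeans:
    km = KMeans(
        n_clusters=prior_centroids.shape[0],
        init=prior_centroids,   # start from yesterday's geometry
        n_init=1,               # a seeded init needs only one run
        random_state=0,
    ).fit(vectors)
    np.save("artifacts/centroids_today.npy", km.cluster_centers_)  # seed for tomorrow
    return km
```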
At no point do you need to replace your existing log tools. This pipeline sits next to them and explains what they’re already collecting.
Cost, Latency, and Control (Picking an Operating Point)
- Parsing is free aside from CPU. Templates are deterministic and cheap.
- Fingerprints are near-free if rule-based; if you choose to extract with an LLM, dedupe prompts by unique message text.
- Embeddings are the main paid step. Control spending by deduplicating on the fingerprint and batching aggressively. You embed distinct issues, not every log line.
- Clustering (K-Means or HDBSCAN) runs locally and scales well.
- Summaries are optional and should be limited to top-K clusters, where a human will actually read them.
If you want one sentence to carry to your next platform meeting: rules and embeddings, agent-first, and summaries only if people ask for them.
Governance and Privacy (Don’t Skip This)
If there’s any chance your logs include PII/PHI or secrets, redact before any model call:
- Mask emails, tokens, phone numbers, and user IDs in the `message` field (see the sketch after this list).
- Keep raw logs inside your private boundary; export only fingerprints, counts, exemplars, and cluster IDs externally.
- If required, route embeddings to a provider in your VPC, or use a private endpoint.
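A minimal masking sketch for the message field might look like this; the patterns are assumptions, and production setups usually add tenant-specific rules or a dedicated PII scanner:

```python
# Mask obvious PII/secrets before any model call (patterns are illustrative).
import re

MASKS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<email>"),
    (re.compile(r"\b(?:bearer|token|api[_-]?key)[=: ][^\s,]+", re.IGNORECASE), "<secret>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<phone>"),
    (re.compile(r"\buser[_-]?id[=: ][^\s,]+", re.IGNORECASE), "<user_id>"),
]

def redact(message: str) -> str:
    for pattern, replacement in MASKS:
        message = pattern.sub(replacement, message)
    return message

print(redact("charge failed for alice@example.com, token=abc123, user_id=42"))
# charge failed for <email>, <secret>, <user_id>
```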
Fenic doesn’t force you into any specific model provider, which keeps compliance conversations simpler.
What Success Looks Like
After a few days, you’ll notice the conversation change. People stop posting raw stack traces and start asking the agent:
- “only error clusters in payments since 9am?”
- “assignments for cluster 12?”
- “coverage today vs yesterday?”
You’ll also recognize a few patterns:
- Coverage hovers above the threshold you set (say, 70-80%). When it dips, a new template or a rename fixes it.
- Top clusters stay stable unless something genuinely changes in production. When an ERROR cluster doubles overnight, it’s a true signal.
- The agent becomes the default way people ask for status: SREs, support engineers, and even PMs who want a quick read before stand-up.
- Code changes that would have broken a regex pipeline are non-events. You coalesce a new template and move on.
And that’s the real win: fewer firefights over tooling and more time hunting actual root causes.
Try It, Then Make It Yours
- Open the Colab and run the demo with sample logs, or drop in a small batch of your own.
- Adopt the minimal pipeline (parse → fingerprint → severity → cluster) and expose the tools.
- Use the agent during on-call. Adjust templates and fingerprints as reality evolves.
- Iterate: If responders ask for more context, add domain columns (endpoint, table, region) into fingerprints and enriched text. If they want prose, summarize exemplars only.
Grab the Essentials
- Docs: Fenic operators, semantic config, and MCP tooling.
- GitHub: clone, star, and open issues/PRs.
- More examples and ready-to-run scripts.

