Featured Project

AgentOps Smart SRE

Debug multi-agent failures in minutes, not hours

Production MVP for automated Root Cause Analysis of multi-agent system failures with real-time progress streaming.

The Problem

When AI agents fail in production, finding the root cause across multiple services is like finding a needle in a haystack made of other needles. Multi-agent systems are notoriously hard to debug: agents call other agents, make decisions based on intermediate states, and fail in ways that are nearly impossible to reproduce. Traditional monitoring tools weren't designed for this. You need something that understands agent interactions, follows the reasoning chain, and pinpoints exactly where things went wrong.

What You Can Do

Automated Root Cause Analysis

Submit a failure trace and get a detailed breakdown of what went wrong and why.

Real-time Investigation

Watch the analysis unfold in real time via Server-Sent Events (SSE) streaming.

Evidence-based Debugging

Every conclusion is backed by specific log entries, traces, and state snapshots.
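One way to enforce the evidence-first rule is to make it impossible to construct a conclusion without supporting evidence. The sketch below is illustrative only: the class names, field names, and the three evidence kinds are assumptions drawn from the description above, not the project's actual schema, and it uses standard-library dataclasses rather than the Pydantic models the backend may define.

```python
from dataclasses import dataclass, field
from typing import List

# Evidence kinds named in the description above; the identifiers are hypothetical.
ALLOWED_KINDS = {"log", "trace", "state_snapshot"}


@dataclass(frozen=True)
class Evidence:
    kind: str       # one of ALLOWED_KINDS
    reference: str  # hypothetical pointer, e.g. a log line ID or span ID
    excerpt: str    # the relevant snippet shown to the user

    def __post_init__(self):
        if self.kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown evidence kind: {self.kind!r}")


@dataclass
class Conclusion:
    summary: str
    evidence: List[Evidence] = field(default_factory=list)

    def __post_init__(self):
        # Reject any conclusion that cites nothing.
        if not self.evidence:
            raise ValueError("a conclusion must cite at least one piece of evidence")
```

With this shape, a root cause statement that cites nothing fails at construction time instead of reaching the dashboard.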

Production Observability

OpenTelemetry integration gives you full visibility into agent behavior.

Tech Stack

Backend API

FastAPI, Python, Pydantic

High-performance async API with automatic validation and OpenAPI docs.

Data Layer

PostgreSQL, Redis, RQ Workers

Reliable storage with background job processing for heavy analysis.

Frontend

Next.js, TypeScript, Tailwind CSS

Modern React dashboard with real-time updates.

Observability

OpenTelemetry, SSE Streaming, Structured Logging

Full tracing and live progress streaming for transparency.
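To make the live progress streaming concrete, here is a minimal sketch of how progress updates can be serialized as Server-Sent Events frames. The wire format (`id:`, `event:`, and `data:` fields, with a blank line ending each frame) is standard SSE. The function names, event names, and payload fields are hypothetical and not taken from the project.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(data: dict, event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Serialize one update as a single SSE frame."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps produces one line by default, so a single data field suffices.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


def progress_stream(steps: Iterable[str]) -> Iterator[str]:
    """Yield one frame per analysis step, then a final completion frame."""
    steps = list(steps)
    total = len(steps)
    for index, step in enumerate(steps, start=1):
        yield format_sse(
            {"step": step, "done": index, "total": total},
            event="progress",
            event_id=str(index),
        )
    yield format_sse({"status": "complete"}, event="done")
```

In a FastAPI app, a generator like this could plausibly be returned through `StreamingResponse` with `media_type="text/event-stream"`. Whether the project wires it up this way is an assumption.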

Architecture

The system has four main components:

1. Ingestion Service: Receives failure reports and agent traces

2. Analysis Engine: AI-powered root cause analysis running as background jobs

3. Streaming API: Real-time progress updates via Server-Sent Events

4. Dashboard: Next.js interface for investigating failures

Analysis jobs are queued in Redis and processed by RQ workers, allowing the system to handle bursts of failures without blocking.
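The queue-and-worker pattern described above can be sketched with the standard library alone. This is a stand-in, not the project's code: it swaps Redis and RQ for an in-process `queue.Queue` and threads so the example runs without external services, and `analyze`, `worker`, and `run_burst` are hypothetical names.

```python
import queue
import threading
from typing import Dict, Iterable


def analyze(trace_id: str) -> str:
    """Placeholder for the heavy root cause analysis (hypothetical)."""
    return f"analysis complete for {trace_id}"


def worker(jobs: queue.Queue, results: Dict[str, str],
           lock: threading.Lock) -> None:
    """Process jobs until a None sentinel arrives."""
    while True:
        trace_id = jobs.get()
        try:
            if trace_id is None:
                return
            outcome = analyze(trace_id)
            with lock:
                results[trace_id] = outcome
        finally:
            jobs.task_done()


def run_burst(trace_ids: Iterable[str], num_workers: int = 2) -> Dict[str, str]:
    """Enqueue a burst of failures and let the workers drain it."""
    jobs: queue.Queue = queue.Queue()
    results: Dict[str, str] = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, lock))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for trace_id in trace_ids:
        jobs.put(trace_id)  # returns immediately; the producer never waits on analysis
    for _ in threads:
        jobs.put(None)      # one shutdown sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

With RQ itself, the enqueue step would look roughly like `Queue(connection=Redis()).enqueue(analyze, trace_id)`, with workers running as separate processes. The property that matters is the same: accepting a failure report only costs a queue write, so a burst of reports does not stall the API.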

Results

10x: Faster root cause identification
<30s: Average time to first insight
Real-time: Progress visibility via SSE

Key Features

  1. Automated Root Cause Analysis for agent failures
  2. Real-time progress streaming via SSE
  3. Evidence-first analysis methodology
  4. OpenTelemetry integration for observability
  5. Production-ready Next.js dashboard

Interested in this project?

Check out the source code, try the demo, or get in touch to discuss how similar solutions could help your team.