August 27, 2025
Arri Marsenaldi

Post-mortem of a Failed GenAI Project: An Architectural Analysis

A candid and technical breakdown of why a promising multi-step AI agent project failed in production. This is an architect's analysis of the flawed system flow, the catastrophic results, and the resilient architecture that should have been built.

In the rush to deploy Generative AI, many teams are discovering a hard truth: the leap from a single-prompt demo to a reliable, production-grade AI agent is not an incremental step, but a massive architectural chasm. I learned this firsthand on a project I'll call "Chimera."

Project Chimera was an ambitious initiative to build a multi-step AI agent to automate a complex B2B client onboarding workflow. It succeeded spectacularly in controlled demos but failed catastrophically under the unpredictable conditions of the real world. This is not a story of blame, but a technical post-mortem designed to dissect the architectural flaws that led to its failure, and to share the resilient patterns that should have been used from the start.

The Initial Vision: An Automated Onboarding Concierge

The business problem was clear: a high-touch, manual client onboarding process was slow, error-prone, and impossible to scale. The strategic goal was to create an AI agent that could receive a new client's company name and a goal, and then autonomously execute the following steps.

The Strategic Goal

1. Enrich the client's data using a tool like Clearbit.
2. Create a customized project in Jira.
3. Draft and send a personalized welcome email via SendGrid.
4. Post a notification in a specific Slack channel.

On paper, this would save dozens of hours per week and create a flawless client experience.

The Flawed Blueprint: A Monolithic Serverless Agent

My initial architecture was designed for simplicity and speed of development. It was centered around a single, powerful Cloud Function that acted as the agent's brain and hands, orchestrating the entire workflow in one process.

Fig 1: The initial, deceptively simple architecture. A single orchestrator function directly calls a chain of external APIs in a synchronous sequence.
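Condensed to its essence, the orchestrator looked something like the sketch below. The helper names (enrich_company, create_jira_project, and so on) are illustrative stand-ins, not the actual Chimera codebase:

```python
# A condensed sketch of the monolithic orchestrator: one Cloud Function
# that both "thinks" and "does", calling every external API in sequence.
# The tool helpers are hypothetical placeholders for the real integrations.
import functions_framework

@functions_framework.http
def onboard_client(request):
    payload = request.get_json()
    company, goal = payload["company"], payload["goal"]

    # Every call blocks this single invocation; any failure or slowdown
    # anywhere in the chain takes the whole run down with it.
    profile = enrich_company(company)              # Clearbit (placeholder)
    project = create_jira_project(profile, goal)   # Jira (placeholder)
    send_welcome_email(profile, project)           # SendGrid (placeholder)
    post_slack_notification(profile, project)      # Slack (placeholder)

    return {"status": "done"}, 200
```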

This architecture worked perfectly in demos where every API responded instantly and never failed. In production, it unraveled completely.

The Unraveling: Where the Architecture Failed

The failure wasn't a single event, but a cascade of systemic issues rooted in the monolithic design.

The Latency Catastrophe

LLM calls are slow. API calls to external services like Jira can be slow. Chaining four or more of these synchronous, long-running tasks together regularly pushed the Cloud Function past its 9-minute execution limit, causing frequent, unpredictable timeouts.
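A rough latency budget shows why. The per-call numbers below are illustrative assumptions, not Chimera's measured figures:

```python
# Illustrative worst-case latencies for the synchronous chain (assumed
# values for the sketch, not measured production numbers).
worst_case_seconds = {
    "llm_calls": 4 * 45,     # planning + drafting, several long completions
    "clearbit_enrich": 30,
    "jira_create": 120,      # slow under load
    "sendgrid_email": 30,
    "slack_notify": 15,
}
total = sum(worst_case_seconds.values())
# 375 of the 540-second (9-minute) budget is already spent in the worst
# case; one slow call or a single in-function retry tips the run into a
# timeout.
print(f"{total}s of 540s budget consumed")
```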

The Resilience Black Hole

When the Jira API failed intermittently on step #2, the entire process crashed. The function had no memory of successfully completing step #1. There was no mechanism to retry just the failed step; the entire, expensive process had to be restarted from the beginning, frustrating users and hammering APIs.

The Observability Nightmare

When a run failed, we had logs, but no state. We couldn't answer the most basic questions: What step did it fail on? What was the last successful action? What was the plan it was trying to execute? Debugging was effectively impossible.

The Cost Overruns

Long-running, high-memory Cloud Functions are expensive. Retrying these entire, multi-minute functions due to a single downstream failure caused costs to spiral out of control.

The architectural root cause was clear: we had designed a stateless, synchronous process for a stateful, asynchronous problem.

The Resilient Blueprint: A Decoupled, Event-Driven Architecture

After the post-mortem, I redesigned the system based on principles of resilience, observability, and scalability. The solution was to decouple the agent's "thinking" from its "doing."

Fig 2: The redesigned, resilient architecture. The Orchestrator only thinks. It reads and writes state to Firestore and dispatches tasks to a Cloud Tasks queue. Dedicated, single-purpose Cloud Functions execute each tool independently.
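A minimal sketch of the orchestrator's new control loop, assuming a Firestore collection called agent_runs and a Cloud Tasks queue called agent-tools; plan_next_step is simplified to a fixed plan here, where the real system would consult the LLM:

```python
# Sketch of the decoupled orchestrator: it only decides the next step,
# checkpoints that decision in Firestore, and hands execution to Cloud Tasks.
# Project, queue, and URL names are assumptions for the sketch.
import json
from google.cloud import firestore, tasks_v2

db = firestore.Client()
tasks = tasks_v2.CloudTasksClient()
QUEUE = tasks.queue_path("my-project", "us-central1", "agent-tools")
TOOL_URLS = {  # hypothetical endpoints of the single-purpose tool functions
    "clearbit": "https://us-central1-my-project.cloudfunctions.net/clearbit-tool",
    "jira": "https://us-central1-my-project.cloudfunctions.net/jira-tool",
    "sendgrid": "https://us-central1-my-project.cloudfunctions.net/sendgrid-tool",
    "slack": "https://us-central1-my-project.cloudfunctions.net/slack-tool",
}

def plan_next_step(state: dict) -> dict | None:
    # In the real system this is the LLM "thinking" over the run history;
    # a fixed plan keeps the sketch self-contained.
    done = {s["tool"] for s in state.get("steps", [])}
    for tool in ("clearbit", "jira", "sendgrid", "slack"):
        if tool not in done:
            return {"tool": tool, "args": {}}
    return None

def advance_run(run_id: str) -> None:
    run_ref = db.collection("agent_runs").document(run_id)
    state = run_ref.get().to_dict() or {}

    next_step = plan_next_step(state)
    if next_step is None:
        run_ref.update({"status": "complete"})
        return

    # Checkpoint the decision *before* dispatching, so a crash here loses nothing.
    run_ref.update({"pending_step": next_step, "status": "dispatching"})

    # Execution is delegated: the queue delivers the step to a small tool
    # function and retries it with exponential backoff on failure.
    tasks.create_task(request={
        "parent": QUEUE,
        "task": {"http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": TOOL_URLS[next_step["tool"]],
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"run_id": run_id, **next_step}).encode(),
        }},
    })
```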

This new architecture solves every one of the previous failures:

Resilience: The Cloud Tasks queue provides automatic, configurable retries with exponential backoff. If the Jira API fails, only the small jira-tool-function is retried, not the entire workflow.

State Management: Firestore acts as the agent's memory. The orchestrator can be interrupted at any time and resume exactly where it left off by reading the last known state.

Observability: The Firestore document for each agent run provides a perfect, step-by-step audit trail of every decision and tool output, making debugging trivial.

Scalability & Maintainability: Adding a new tool is as simple as deploying a new, isolated Cloud Function. The core orchestrator logic rarely needs to change.

Cost Efficiency: Each function is now small, single-purpose, and executes quickly, dramatically reducing compute duration and cost.
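To make the resilience point concrete, here is what one of the single-purpose tool functions might look like; jira_client and trigger_orchestrator are hypothetical placeholders for the real Jira integration and the callback into the orchestrator:

```python
# Sketch of a single-purpose tool function (the Jira step). Cloud Tasks
# POSTs the step here; any non-2xx response triggers the queue's automatic
# retry with exponential backoff, re-running only this one step.
import functions_framework
from google.cloud import firestore

db = firestore.Client()

@functions_framework.http
def jira_tool(request):
    step = request.get_json()
    run_ref = db.collection("agent_runs").document(step["run_id"])

    try:
        # Placeholder for the real Jira API call.
        project = jira_client.create_project(**step["args"])
    except Exception as exc:
        # A 500 tells Cloud Tasks to retry this step; the rest of the
        # workflow is untouched.
        return {"error": str(exc)}, 500

    # Record the result so the orchestrator can resume from this exact point.
    run_ref.update({
        "steps": firestore.ArrayUnion([{"tool": "jira", "result": project}]),
        "pending_step": firestore.DELETE_FIELD,
        "status": "thinking",
    })
    trigger_orchestrator(step["run_id"])  # hand control back to the "brain"
    return {"status": "ok"}, 200
```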

Key Architectural Lessons for Production GenAI Agents

Persist State Religiously: An agent without memory is a toy. State management is not a feature; it is the core of the architecture.

Decouple Thinking from Doing: The orchestrator's only job is to decide the next step. The execution of that step is someone else's problem.

Embrace Asynchronicity: Use task queues for any interaction with the outside world. The real world is unreliable; your architecture must assume failure as a normal operating condition.
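As a closing illustration of what these principles buy you, here is the hypothetical shape of a single run's state document; the field names are assumptions, but the idea is the full audit trail that made debugging trivial:

```python
# Hypothetical shape of one agent run's document in Firestore: every
# decision and tool output is recorded, so any failure can be diagnosed
# and resumed from the exact step that broke.
run_doc = {
    "status": "dispatching",
    "client": "Acme Corp",
    "goal": "Standard B2B onboarding",
    "steps": [
        {"tool": "clearbit", "result": {"domain": "acme.com"}},
        {"tool": "jira", "result": {"project_key": "ACME"}},
    ],
    "pending_step": {"tool": "sendgrid", "args": {"template": "welcome"}},
}
```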

Project Chimera failed, but the lessons learned were invaluable. Building reliable, multi-step AI agents is not an AI problem; it's a distributed systems problem. By applying these battle-tested architectural patterns, we can move beyond brittle demos and begin to deliver on the true promise of autonomous AI.