August 27, 2025
Arri Marsenaldi

Your RAG Pipeline is a Black Box. Here's the Architecture to Fix It.

A candid architectural analysis of why most simple RAG implementations are un-debuggable, untrustworthy black boxes in production. This is a technical blueprint for an observable, resilient, and continuously improving RAG system on Google Cloud.

Your new Retrieval-Augmented Generation (RAG) application works. The demo was a wild success. Then you ship it to production. A week later, a key stakeholder sends you a screenshot with a simple, terrifying message: "Why did it say this?"

The AI has produced a factually incorrect, nonsensical, or subtly biased answer. You look at your logs. You see an inbound request and an outbound response, but everything in between—the most critical part of the process—is an impenetrable black box. You can't answer the most basic questions:

  • What documents were retrieved from the vector database?
  • Why were those specific documents considered relevant, and what were their scores?
  • What was the exact, final prompt sent to the LLM after augmentation?
  • Was this a failure of retrieval, or a failure of generation?

If you cannot answer these questions, you don't have a production system. You have a liability. As a Digital Product Architect, I've seen this scenario play out too many times. The root cause is always the same: a naive architecture that treats the RAG process as a single, atomic operation. This article dissects that flawed pattern and provides a detailed blueprint for an observable architecture that fixes it.

The Flawed Blueprint: The Stateless Black Box

The most common RAG architecture is a single serverless function that performs the entire process in one go. It's simple, fast to develop, and completely opaque.

Fig 1: The black box architecture. A single function takes a query and returns an answer. All the critical intermediate steps—the retrieval and augmentation—are ephemeral and lost forever, making debugging impossible.
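
To make the anti-pattern concrete, here is a minimal sketch of that opaque handler. The helper functions are hypothetical stand-ins for the vector search, prompt template, and model client; the point is that every intermediate value lives only for the duration of the request.

```python
# Hypothetical stand-ins for the real vector search, prompt template, and LLM client.
def retrieve_chunks(query: str) -> list[str]:
    return ["chunk-1", "chunk-2"]  # a real system would query a vector DB

def build_prompt(query: str, chunks: list[str]) -> str:
    return "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"

def call_llm(prompt: str) -> str:
    return "model answer"  # a real system would call an LLM API

def handle_query(user_query: str) -> str:
    chunks = retrieve_chunks(user_query)       # retrieval scores are discarded
    prompt = build_prompt(user_query, chunks)  # the final prompt is never stored
    return call_llm(prompt)                    # only the answer survives the request
```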

This architecture is fundamentally flawed because it fails to capture the most valuable data the system produces: the metadata of its own decision-making process. It optimizes for a single successful run, but it is architecturally blind to failure.

The Resilient Blueprint: Architecting for Observability

The solution is to treat the RAG process not as a single transaction, but as a sequence of observable events. We must re-architect the system with a core principle: every step of the inference pipeline must be logged to a centralized, structured, and queryable location.

Fig 2: The observable architecture. The core RAG process remains the same, but every step now emits a structured log event to a Pub/Sub topic. An asynchronous Cloud Function then writes this rich data to BigQuery, creating a complete, auditable "paper trail" for every request without adding latency to the user-facing response.
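
Here is a minimal sketch of the "emit a structured log event" step, assuming the google-cloud-pubsub client library; the project and topic names are placeholders, and the event fields mirror the interaction log schema described below.

```python
import json
import time
import uuid

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Placeholder project and topic names.
topic_path = publisher.topic_path("my-project", "rag-interaction-events")

def log_interaction(user_query: str, chunks: list[str], sources: list[str],
                    scores: list[float], final_prompt: str, llm_response: str,
                    latency_ms: int) -> str:
    """Publish one structured event per request without blocking the response."""
    event = {
        "interaction_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_query": user_query,
        "retrieved_context_chunks": chunks,
        "retrieved_chunk_sources": sources,
        "retrieved_chunk_scores": scores,
        "final_prompt_sent_to_llm": final_prompt,
        "llm_response": llm_response,
        "latency_ms": latency_ms,
        "user_feedback": 0,  # updated later by the feedback loop
    }
    # publish() returns a future; we deliberately do not wait on it.
    publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
    return event["interaction_id"]
```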

A Component-by-Component Breakdown

1. Interaction Log

Central BigQuery table capturing every user request. Serves as the observable heart of the system.

  • interaction_id (UUID)
  • timestamp
  • user_query
  • retrieved_context_chunks (JSON/Array)
  • retrieved_chunk_sources (JSON/Array)
  • retrieved_chunk_scores (JSON/Array)
  • final_prompt_sent_to_llm
  • llm_response
  • latency_ms
  • user_feedback (+1, -1, or 0 if no feedback yet)
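
A minimal sketch of this table using the google-cloud-bigquery client, assuming placeholder project and dataset names; the array columns are modeled here as REPEATED fields, which is one common choice.

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("interaction_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("timestamp", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField("user_query", "STRING"),
    bigquery.SchemaField("retrieved_context_chunks", "STRING", mode="REPEATED"),
    bigquery.SchemaField("retrieved_chunk_sources", "STRING", mode="REPEATED"),
    bigquery.SchemaField("retrieved_chunk_scores", "FLOAT64", mode="REPEATED"),
    bigquery.SchemaField("final_prompt_sent_to_llm", "STRING"),
    bigquery.SchemaField("llm_response", "STRING"),
    bigquery.SchemaField("latency_ms", "INT64"),
    bigquery.SchemaField("user_feedback", "INT64"),  # +1, -1, or 0
]

table = bigquery.Table("my-project.rag_logs.interaction_log", schema=schema)
client.create_table(table, exists_ok=True)  # idempotent table creation
```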
2. Asynchronous Logging Pipeline

Decouples logging from the user-facing request: the Query Handler publishes events to Pub/Sub, and an asynchronous Cloud Function batch-processes and streams them into BigQuery.

  • Pub/Sub topic for incoming events
  • Batch Cloud Function to write into BigQuery
  • Non-blocking, low-latency design
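
Here is a sketch of that batch function, assuming a first-generation Cloud Function with a Pub/Sub trigger and the placeholder table from the schema sketch above.

```python
import base64
import json

from google.cloud import bigquery

TABLE_ID = "my-project.rag_logs.interaction_log"  # placeholder
client = bigquery.Client()

def log_event(event: dict, context) -> None:
    """Pub/Sub-triggered entry point: decode one event and stream it into BigQuery."""
    row = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    errors = client.insert_rows_json(TABLE_ID, [row])  # streaming insert
    if errors:
        # Raising lets Pub/Sub redeliver the message when retries are enabled.
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```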
3. User Feedback Loop

Captures thumbs up/down feedback and updates the corresponding BigQuery row asynchronously. Enables systematic improvement of the AI system.

  • Feedback event with interaction_id and score
  • Cloud Function updates BigQuery row
  • Closes the loop for observability and model evaluation
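
And a sketch of the feedback handler, assuming the client echoes back the interaction_id it received earlier. One caveat worth noting: rows still in BigQuery's streaming buffer cannot yet be touched by DML, so in practice feedback updates may need a short delay or a separate feedback table joined at query time.

```python
from google.cloud import bigquery

client = bigquery.Client()

def record_feedback(interaction_id: str, score: int) -> None:
    """Set user_feedback (+1 or -1) on the matching interaction row."""
    query = """
        UPDATE `my-project.rag_logs.interaction_log`
        SET user_feedback = @score
        WHERE interaction_id = @interaction_id
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("score", "INT64", score),
            bigquery.ScalarQueryParameter("interaction_id", "STRING", interaction_id),
        ]
    )
    client.query(query, job_config=job_config).result()  # wait for the DML job
```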

The Payoff: Turning the Black Box Inside Out

With this architecture in place, you are no longer blind. When a stakeholder asks, "Why did it say this?", you can now provide a definitive, data-backed answer.

Root Cause Analysis

Query the BigQuery table by interaction_id to see exactly what happened, and identify whether issues stem from:

  • Retrieval strategy or chunking (irrelevant documents)
  • Prompt augmentation or model behavior (documents correct but ignored)
Systematic Evaluation

Run aggregate queries to answer business-critical questions (see the query sketch below):

  • Average user feedback score
  • Percentage of queries retrieving no documents
  • P90 latency of LLM responses
Fine-Tuning Dataset

The interaction log becomes a high-quality dataset for future model fine-tuning, containing:

  • Prompts
  • Context chunks
  • LLM responses
  • Human feedback / quality ratings
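
To ground these benefits, here are two illustrative queries against the interaction log: a point lookup for root cause analysis and an aggregate for systematic evaluation. Table and column names follow the schema sketched earlier; the interaction ID is a placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Root cause analysis: replay a single interaction end to end.
lookup = client.query("""
    SELECT user_query, retrieved_chunk_sources, retrieved_chunk_scores,
           final_prompt_sent_to_llm, llm_response
    FROM `my-project.rag_logs.interaction_log`
    WHERE interaction_id = @id
""", job_config=bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("id", "STRING", "some-interaction-id")
    ]
))

# Systematic evaluation: feedback, empty retrievals, and P90 latency over a week.
evaluation = client.query("""
    SELECT
      AVG(user_feedback) AS avg_feedback,
      COUNTIF(ARRAY_LENGTH(retrieved_context_chunks) = 0) / COUNT(*) AS pct_no_docs,
      APPROX_QUANTILES(latency_ms, 100)[OFFSET(90)] AS p90_latency_ms
    FROM `my-project.rag_logs.interaction_log`
    WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
""")

for row in evaluation.result():
    print(dict(row))
```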

The Architect's Verdict

Observability in an AI system is not a feature or a "nice-to-have." It is a foundational, non-negotiable requirement for building a trustworthy and maintainable product. By moving away from the naive, monolithic black box and architecting a decoupled, event-driven system with a centralized state log, you transform your RAG pipeline from a brittle liability into a resilient, transparent, and continuously improving asset.