August 26, 2025
Arri Marsenaldi

From Ingestion to Inference: A Deep Dive into the System Architecture of a Real-World GenAI App for Retail

A comprehensive architectural blueprint for a production-grade Generative AI application that provides deep insights for the retail industry. This article breaks down the two core sub-systems—the data ingestion pipeline and the real-time inference engine—on Google Cloud Platform.

In the hyper-competitive retail landscape, companies are drowning in data but starving for wisdom. They have terabytes of sales figures, thousands of customer reviews, and real-time inventory levels, but they struggle to answer the most critical question: Why? Why is one product outperforming another? What are the hidden themes in customer feedback? What trends are emerging that we can act on before our competitors?

Generative AI promises to answer these questions, but a simple prompt to a generic LLM is not a solution. To deliver true value, we must architect a system that can reason over a company's unique, private data in real-time. This is not just an AI project; it is an end-to-end system architecture project.

This article is a deep dive into the architectural blueprint of a Retail Intelligence Engine, a production-grade Generative AI application I designed for a mid-sized retail client on Google Cloud. We will break down the two core sub-systems required to go from raw data ingestion to actionable, AI-driven inference.

The High-Level Blueprint: The Two-Pipeline Architecture

The foundational architectural principle for a system like this is separation of concerns. Collecting and preparing data (ingestion) has fundamentally different requirements from answering a user's question (inference): the former must be reliable and thorough; the latter must be fast and responsive.

Fig 1: The complete high-level architecture, showing the clear separation of the Asynchronous Data Pipeline for processing raw retail data, and the Real-Time Inference Pipeline for serving insights to business users.

The Asynchronous Data Pipeline (Ingestion)

The intelligence of our AI is entirely dependent on the quality and freshness of its knowledge. This pipeline is the industrial-grade factory that continuously processes raw retail data into a clean, structured, and queryable format for the AI.

Fig 2: A component-by-component breakdown of the event-driven data pipeline. This architecture ensures that data from any source is processed reliably and efficiently.

Component Breakdown

Raw Data Sources

The system is designed to ingest data from multiple origins. This includes structured data from the E-commerce Platform DB (e.g., sales, inventory) and unstructured data like Customer Reviews or social media comments.

Event Bus (Pub/Sub)

All incoming data is treated as an event and published to a Pub/Sub topic. This decouples the data sources from the processing logic, making the system highly scalable and easy to extend with new sources.
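To make "everything is an event" concrete, here is a minimal sketch of how a source might wrap a record in a uniform envelope and publish it. The envelope fields, the project ID, and the topic name are illustrative assumptions, not part of the client's system; the publish call uses the standard google-cloud-pubsub client and requires GCP credentials.

```python
import json
from datetime import datetime, timezone

def make_event(source: str, payload: dict) -> bytes:
    """Wrap a raw record in a uniform event envelope for the bus."""
    envelope = {
        "source": source,  # e.g. "ecommerce_db", "customer_reviews" (names assumed)
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    return json.dumps(envelope).encode("utf-8")

def publish_event(project_id: str, topic: str, event: bytes) -> None:
    """Publish one event; needs google-cloud-pubsub and GCP credentials."""
    from google.cloud import pubsub_v1  # lazy import: module loads without GCP installed
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic)
    publisher.publish(topic_path, data=event).result()  # block until the broker acks
```

Because every source emits the same envelope, downstream Cloud Functions can be added or swapped without touching the producers.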

ETL & Structuring (Cloud Functions)

A suite of dedicated, single-purpose Cloud Functions subscribes to the Pub/Sub topic. They are responsible for cleaning the data, structuring it (e.g., extracting sentiment from reviews), and preparing it for storage.
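A single-purpose function of this kind might look like the sketch below, which decodes a Pub/Sub-style message (base64 payload under a "data" key) and emits a warehouse-ready row. The keyword-list "sentiment model" is a deliberately naive stand-in for whatever classifier the real pipeline uses; the field names are assumptions.

```python
import base64
import json

# Toy keyword lists standing in for a real sentiment model.
POSITIVE = {"great", "love", "perfect", "excellent"}
NEGATIVE = {"broke", "poor", "refund", "disappointed"}

def structure_review(event: dict) -> dict:
    """Decode a Pub/Sub-style message and emit a BigQuery-ready row."""
    envelope = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    text = envelope["payload"]["text"]
    words = {w.strip(".,!?").lower() for w in text.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return {
        "sku": envelope["payload"]["sku"],
        "review_text": text,
        "sentiment": "positive" if score > 0 else "negative" if score < 0 else "neutral",
    }
```

Keeping each function this small is what makes the suite easy to test, deploy, and extend independently.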

Data Warehouse (BigQuery)

The cleaned, structured data is loaded into BigQuery. This becomes the "single source of truth" for all historical performance and the primary context source for the AI.

Vectorization (Vertex AI)

A separate Cloud Function takes qualitative data like product descriptions and customer reviews, generates vector embeddings using a Vertex AI model, and upserts them into a dedicated vector database.
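A sketch of that function's two halves: a pure chunking step (long reviews and descriptions are split so each piece fits an embedding call) and the embedding call itself. The model name "text-embedding-004" and the chunk size are assumptions; the embedding half requires the google-cloud-aiplatform SDK and a prior vertexai.init() call, and the upsert target is left abstract.

```python
def chunk_text(text: str, max_words: int = 60) -> list[str]:
    """Split long reviews/descriptions into embedding-sized chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Generate embeddings; needs google-cloud-aiplatform and vertexai.init()."""
    from vertexai.language_models import TextEmbeddingModel  # lazy import
    model = TextEmbeddingModel.from_pretrained("text-embedding-004")  # model name assumed
    return [e.values for e in model.get_embeddings(chunks)]
```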

The Real-Time Inference Pipeline

This is the user-facing part of the application where a business analyst or marketer interacts with the AI to ask complex questions. This pipeline is architected for low latency and a seamless user experience.

Fig 3: The low-latency inference pipeline. A business user's natural language question is augmented with rich, private data to generate a deeply insightful answer.

Component Breakdown

User & Web App

A business user interacts with a React-based web application (hosted on Cloud Run), asking a complex question like, "What were the main complaints about our top-selling jackets in Q4, and how did that correlate with return rates?"

API Gateway

The entry point that securely routes the request to the backend.

Query Handler (FastAPI on Cloud Run)

A containerized Python service that orchestrates the entire RAG process. We use Cloud Run for its flexibility with complex dependencies and longer timeout limits compared to standard Cloud Functions.
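Stripped of its FastAPI wiring, the orchestration the handler performs is a single retrieve-augment-generate round trip. In this sketch the retrievers and the model call are passed in as plain callables (all names are illustrative) so the flow is visible without any GCP dependencies.

```python
from typing import Callable

def answer_question(
    question: str,
    fetch_metrics: Callable[[str], list[dict]],   # BigQuery lookup
    fetch_reviews: Callable[[str], list[str]],    # vector / semantic search
    generate: Callable[[str], str],               # LLM call
) -> str:
    """Orchestrate one RAG round trip: retrieve, augment, generate."""
    metrics = fetch_metrics(question)
    reviews = fetch_reviews(question)
    prompt = (
        "You are a retail analyst. Answer using ONLY the data below.\n\n"
        f"Structured metrics: {metrics}\n\n"
        f"Relevant reviews: {reviews}\n\n"
        f"Question: {question}\n"
    )
    return generate(prompt)
```

Injecting the three callables also makes the handler trivially unit-testable with stubs, which matters more here than in the ETL functions because this path sits in front of users.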

Context Retrieval

The Query Handler queries BigQuery for structured data (e.g., sales and return rates for jackets in Q4) and Vertex AI Search for unstructured, semantic context (e.g., the most relevant customer reviews about those jackets).
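The structured half of that retrieval might be a parameterized BigQuery query along these lines. The table and column names are invented for illustration; the query uses standard @-named parameters via the google-cloud-bigquery client, which needs credentials to actually run.

```python
SALES_SQL = """
SELECT sku, SUM(units_sold) AS units, AVG(return_rate) AS avg_return_rate
FROM `retail.sales_daily`  -- table name assumed for illustration
WHERE category = @category
  AND sale_date BETWEEN @start AND @end
GROUP BY sku
ORDER BY units DESC
LIMIT 10
"""

def fetch_sales(category: str, start: str, end: str) -> list[dict]:
    """Run the parameterized query; needs google-cloud-bigquery and credentials."""
    from google.cloud import bigquery  # lazy import
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("category", "STRING", category),
            bigquery.ScalarQueryParameter("start", "DATE", start),
            bigquery.ScalarQueryParameter("end", "DATE", end),
        ]
    )
    return [dict(row) for row in client.query(SALES_SQL, job_config=job_config).result()]
```

Named parameters keep user-derived values out of the SQL string itself, which matters once questions are translated into filters.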

Prompt Augmentation & Generation (Vertex AI)

The retrieved data is synthesized and formatted into a rich, detailed prompt. This "augmented" prompt is then sent to a powerful generative model (like Gemini) to produce a final, narrative answer that directly addresses the user's question, backed by real data.
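One way to sketch that synthesis step: render the retrieved metrics and reviews into a compact context block, then hand it to the model. The section labels and prompt wording are illustrative; the generation half assumes the google-cloud-aiplatform SDK, a prior vertexai.init() call, and a Gemini model name that may differ from the one actually deployed.

```python
def render_context(metrics: list[dict], reviews: list[str]) -> str:
    """Format retrieved data into a compact, LLM-friendly context block."""
    lines = ["## Metrics"]
    for row in metrics:
        lines.append("; ".join(f"{k}={v}" for k, v in row.items()))
    lines.append("## Customer reviews")
    lines.extend(f"- {r}" for r in reviews)
    return "\n".join(lines)

def generate_answer(question: str, context: str) -> str:
    """Call Gemini on Vertex AI; needs google-cloud-aiplatform and vertexai.init()."""
    from vertexai.generative_models import GenerativeModel  # lazy import
    model = GenerativeModel("gemini-1.5-pro")  # model name assumed
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return model.generate_content(prompt).text
```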

Key Architectural Takeaways

Treat Data Ingestion as a First-Class System

The reliability of your AI is directly tied to the reliability of your data pipeline. Architecting it as a decoupled, event-driven system is non-negotiable for production applications.

Use the Right Tool for the Job

While Cloud Functions are excellent for lightweight ETL tasks, a containerized service like Cloud Run provides the necessary control and flexibility for a complex, multi-step inference orchestration.

Separate Structured and Unstructured Data

The true power of this architecture comes from combining structured insights from BigQuery with semantic understanding from a vector database. The inference pipeline is where these two worlds meet.

By moving beyond a simplistic view of AI and focusing on the end-to-end system architecture, we can build powerful intelligence engines that provide a true, sustainable competitive advantage for any data-rich business.