Why pure vector search fails for e-commerce

The client — a Polish wallpaper and décor e-commerce store — had a 57MB PDF catalog with 1,200+ products. The first instinct was: vectorize the whole thing, let the LLM answer questions. That's the wrong approach for e-commerce, and here's why.

When a customer asks "How much is SKU-4821?" the answer must be exact. A semantic vector search returns the most similar chunks — but similarity doesn't mean accuracy. The model can hallucinate prices, mix dimensions between products, and confuse stock status between variants.

Uploading the raw 57MB PDF directly to Gemini File Search also hit the hot-cache pricing tier (~$45/month) and caused 8+ second response latency. Unusable for a chat widget.

The root cause

E-commerce data has two fundamentally different types of information that require different retrieval strategies:

  • Hard facts — price, SKU, stock status, dimensions. Must be exact. Hallucination here = business loss.
  • Soft knowledge — style descriptions, use cases, recommendations, FAQ. Can be approximate. Semantic retrieval works well here.

A single retrieval strategy cannot serve both. That's the core insight behind Hybrid RAG.

Hybrid RAG: SQL + pgvector

The solution splits the catalog into two layers, each serving queries it's suited for:

User query
    ↓
LangGraph ReAct Agent
    ↓                        ↓
SQL tool (hard facts)    pgvector tool (semantics)
price, SKU, stock         style, FAQ, recommendations
    ↓                        ↓
          → Gemini 2.5 Flash ←
                 ↓
            Response

Layer 1 — SQL (Supabase PostgreSQL)

All deterministic facts live in structured tables: product ID, name, price, stock status, dimensions, category, supplier. When the agent needs a price or availability check, it runs a SQL query via the get_product_detail tool. The answer is always exact — no embedding distance, no approximation.

Layer 2 — pgvector (semantic search)

Product descriptions, style notes, use case text, and FAQ content are chunked, embedded with Sentence-Transformers, and stored in pgvector. When a user asks "what works well for a small bathroom?" or "something modern for a kids' room?", the agent runs a similarity search and retrieves the most relevant style context.
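The ranking logic can be shown without the real stack. This sketch swaps in a toy bag-of-words vector for the Sentence-Transformers embedding and an in-memory list for pgvector — the chunks and stopword list are invented — purely to make the scoring step visible:

```python
import math
import re
from collections import Counter

# Production uses Sentence-Transformers embeddings stored in pgvector;
# this toy bag-of-words "embedding" only illustrates the ranking logic.
STOPWORDS = {"a", "for", "the", "what", "of", "in", "is"}

def embed(text: str) -> Counter:
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "moisture resistant vinyl wallpaper suitable for bathroom walls",
    "playful dinosaur pattern wallpaper for a kids room",
    "classic baroque ornament for living rooms",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def semantic_recommend(query: str, top_k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

print(semantic_recommend("what works well for a small bathroom?"))
```

In production the same shape holds: embed the query, rank stored chunk vectors by distance, return the top-k as context.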

The ETL pipeline

Before any query can be answered correctly, the data must be clean. I built an ETL pipeline that:

  • extracts structured fields from the raw PDF
  • deduplicates product entries
  • runs contract tests on every category before mass embedding (a dry-run on a single item to verify prompt accuracy)
  • upserts to both Supabase tables

The 57MB PDF became a 5MB clean knowledge base.
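A compressed sketch of the dedupe, normalization, and contract-check steps; the raw records and field names are invented for illustration, not the client's actual data:

```python
# Dedupe + normalization step of the ETL, compressed; the raw records and
# field names are invented for illustration, not the client's actual data.
raw_products = [
    {"sku": "SKU-4821", "name": " Beige Floral ", "price": "29,90 EUR"},
    {"sku": "SKU-4821", "name": "Beige Floral",   "price": "29,90 EUR"},  # dup
    {"sku": "SKU-5110", "name": "Dino Kids",      "price": "24,50 EUR"},
]

def normalize(record: dict) -> dict:
    # "29,90 EUR" -> 29.9 (German decimal comma to float)
    price = record["price"].replace("EUR", "").strip().replace(",", ".")
    return {
        "sku": record["sku"].strip(),
        "name": record["name"].strip(),
        "price_eur": float(price),
    }

def contract_check(record: dict) -> None:
    # Dry-run on a single item before mass processing: fail fast on schema drift.
    assert set(record) == {"sku", "name", "price_eur"}, "unexpected fields"
    assert record["price_eur"] > 0, "price must be positive"

def dedupe(records: list[dict]) -> list[dict]:
    seen: dict[str, dict] = {}
    for rec in map(normalize, records):
        seen.setdefault(rec["sku"], rec)  # keep first occurrence per SKU
    return list(seen.values())

contract_check(normalize(raw_products[0]))  # contract test on one item first
clean = dedupe(raw_products)
print(len(clean), "products after dedupe")
```

The contract check is the cheap insurance: one item is validated end-to-end before a thousand items are embedded against a broken assumption.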

LangGraph ReAct — why not a simple chain

A single-shot LLM prompt can't handle real e-commerce queries. "Do you have any beige wallpapers under €30 that work in a humid bathroom and are in stock?" requires multiple retrieval steps: filter by color and price (SQL), check humidity suitability (semantic), verify stock (SQL again).

LangGraph ReAct agent solves this with a tool-calling loop:

  • search_products — keyword + category filter in SQL
  • filter_by_attributes — price range, stock, dimensions
  • get_product_detail — exact facts for a specific SKU
  • semantic_recommend — style and use-case similarity
  • get_faq — common questions about installation, care

The agent decides which tools to call, in what order, based on the user query. LangSmith provides full observability into every tool call and token usage.
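The loop itself can be illustrated with a rule-based stand-in for the LLM's tool choice. In production, LangGraph's ReAct agent lets the model decide; the `plan` heuristic and stubbed tool bodies below are purely didactic:

```python
# Minimal stand-in for the LangGraph ReAct loop. In production the LLM
# decides which tools to call; the `plan` heuristic below is purely
# didactic, and the tool bodies are stubs.
TOOLS = {
    "get_product_detail": lambda q: {"sku": "SKU-4821", "price_eur": 29.90},
    "semantic_recommend": lambda q: ["moisture-resistant vinyl wallpaper"],
}

def plan(query: str) -> list[str]:
    """Stand-in for the LLM's tool-selection step."""
    steps = []
    lowered = query.lower()
    if "how much" in lowered or "sku" in lowered:
        steps.append("get_product_detail")
    if "bathroom" in lowered or "room" in lowered:
        steps.append("semantic_recommend")
    return steps

def run_agent(query: str) -> dict:
    observations = {}
    for tool_name in plan(query):  # the tool-calling loop
        observations[tool_name] = TOOLS[tool_name](query)
    return observations  # the LLM would synthesize these into a reply

print(run_agent("How much is SKU-4821?"))
```

The point of the sketch: each tool returns a small, typed observation, and the loop composes them — a compound query simply triggers more iterations, not a bigger prompt.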

How the ~90% API cost reduction happens

The naive approach — send the full catalog to the LLM for every query — is expensive. 57MB = millions of tokens per request at hot-cache pricing.

The Hybrid RAG approach sends only what's needed:

  • SQL query returns exact rows — 200-500 tokens of structured data
  • pgvector returns top-3 relevant chunks — 400-800 tokens of context
  • Total context per query: ~1,000-1,500 tokens vs. millions

Gemini 2.5 Flash (not Pro) handles the final synthesis — fast, cheap, accurate enough for this task. The architectural decision to split retrieval is worth more than any model optimization.

Result: the cost per query dropped from ~$0.045 (hot-cache full PDF) to ~$0.004 (Hybrid RAG). At scale, this difference is the entire profit margin.
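The arithmetic behind that claim, using the per-query figures above; the monthly volume is an assumed number for illustration:

```python
# Back-of-envelope check of the figures above; the query volume is an
# assumption, the per-query costs are the measured values from the project.
naive_cost_per_query = 0.045   # USD, hot-cache full-PDF context
hybrid_cost_per_query = 0.004  # USD, Hybrid RAG

reduction = 1 - hybrid_cost_per_query / naive_cost_per_query
print(f"{reduction:.0%} cost reduction per query")  # → 91% cost reduction per query

# At an assumed 10,000 queries/month, the monthly difference:
monthly_savings = (naive_cost_per_query - hybrid_cost_per_query) * 10_000
print(f"${monthly_savings:.0f}/month saved")  # → $410/month saved
```

That is why the split-retrieval architecture beats model-level optimization: the saving comes from not sending the catalog at all, not from a cheaper price per token.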

From dev to live: FastAPI + Railway

The agent is exposed via a FastAPI /chat endpoint. The frontend is a lightweight web chat widget embedded on the store's product pages. Telegram was used only for internal debugging during development — not in the production customer-facing interface.

Deployment is on Railway with Docker — zero-config autoscaling, no server management. The Supabase instance handles both the PostgreSQL data and pgvector extension. No separate vector database infrastructure needed.

Observability from day one

LangSmith is wired in before the first production query. Every agent run is logged: which tools were called, in what order, how many tokens were used, where the agent hesitated or retried. This is not optional — without observability, debugging a misbehaving agent in production is guesswork.

Quality assurance

Buyer-scenario integration tests run against the live agent: 20+ test queries covering price questions, stock checks, style recommendations, and edge cases (discontinued products, out-of-stock variants). These tests run in CI before any deployment. The agent cannot go to production with a failing buyer scenario.
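The shape of such a buyer-scenario test, with a stubbed `chat` function standing in for a client of the live /chat endpoint; the canned replies are invented for illustration:

```python
# Shape of a buyer-scenario test. `chat` is a stub standing in for a client
# of the live /chat endpoint; the canned replies are invented for illustration.
def chat(message: str) -> str:
    canned = {
        "How much is SKU-4821?": "SKU-4821 costs €29.90 and is in stock.",
        "Is SKU-9999 available?": "SKU-9999 has been discontinued.",
    }
    return canned.get(message, "I could not find that product.")

def test_price_question_is_exact():
    reply = chat("How much is SKU-4821?")
    assert "29.90" in reply  # price must come from SQL, never a guess

def test_discontinued_product_edge_case():
    reply = chat("Is SKU-9999 available?")
    assert "discontinued" in reply.lower()

test_price_question_is_exact()
test_discontinued_product_edge_case()
print("buyer scenarios passed")
```

In the real project these run under pytest in CI against the live agent; here the functions are called directly so the sketch is self-contained.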

What I'd do differently — and what works

Split by data type, not by retrieval method

The biggest mistake in RAG implementations is vectorizing everything. The correct question is: "Does this piece of data need exact retrieval or approximate retrieval?" Prices and SKUs need exact. Descriptions and recommendations need approximate. Build your retrieval layer accordingly.

ETL quality determines agent quality

Garbage in, garbage out is the most common RAG failure. I spent more time on the ETL pipeline — deduplication, field normalization, contract tests — than on the agent itself. A well-structured knowledge base makes the agent dramatically more accurate without any prompt engineering.

ReAct over chains for multi-step queries

Chains work for single-step retrieval. Real user queries are multi-step. A ReAct agent with well-designed tools handles conversational context, follow-up questions, and compound queries without special-casing every scenario.

Observability is not optional in production

LangSmith from day one — not bolted on after the first production incident. You need to see what the agent is doing before users tell you it's wrong.

When does a RAG sales agent make sense for your business?

This architecture makes sense for any e-commerce or B2B business in Germany that has:

  • 100+ products with structured attributes (price, category, specifications)
  • Customers who ask questions before buying — not just browse and click
  • A catalog that changes regularly (seasonal, stock updates)
  • A support team spending time answering repetitive product questions

The integration is a chat widget on the existing site — no full rebuild required. The agent connects to your product database and existing infrastructure. Build time for a standard e-commerce setup: 3-6 weeks.

Industries where this works well in Germany: furniture and interior design retailers (Möbelhändler), building materials suppliers (Baustoffhändler), B2B parts and components distributors, wallpaper and flooring specialists, specialty food and beverage wholesale.

Tools used in production

Python, LangGraph, FastAPI, Supabase, pgvector, Gemini 2.5 Flash, Sentence-Transformers, LangSmith, Docker, Railway, pytest, VS Code, Claude Code

Do you need an AI sales assistant for your online shop?

I build production RAG agents for e-commerce and B2B businesses in Germany. SQL + vector search hybrid, zero hallucinations, integrated into your existing site. Let's talk about your catalog.
