At QCon London 2026, Lan Chu, AI Tech Lead at Rabobank, shared lessons from deploying a production AI search system used internally by more than 300 users across 10,000 documents. Her experience shows that most failures in RAG systems stem from indexing and retrieval, rather than the language model itself.

The system allows users to search thousands of internal documents, including PDFs and PowerPoint files, to quickly extract insights for tasks such as preparing for client meetings.
Its architecture follows a typical RAG pipeline:
1. Document ingestion: parsing, chunking, and embedding documents before indexing them in a vector database
2. Retrieval and generation: retrieving relevant chunks and sending them to an LLM to generate answers
3. Observability: monitoring traces, retrieval performance, and evaluation metrics
While the architecture appears simple, Chu explained that production systems quickly encounter challenges around document quality, retrieval relevance, and evaluation.
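The ingestion and retrieval stages above can be sketched as a minimal in-memory pipeline. This is an illustrative toy, not Rabobank's system: the hash-based `embed` function stands in for a real embedding model, and the list-backed `VectorIndex` stands in for a vector database.

```python
# Toy RAG ingestion + retrieval: embed chunks, index them, search by cosine
# similarity. The hash-bucket "embedding" is a stand-in for a real model.
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: hash each token into a bucket of a fixed-size vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorIndex:
    def __init__(self) -> None:
        self.entries: list[tuple[str, list[float]]] = []

    def add(self, chunk: str) -> None:
        # Ingestion: embed the chunk and store it alongside its vector.
        self.entries.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 2) -> list[str]:
        # Retrieval: rank stored chunks by dot product with the query vector.
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(q, e)), c) for c, e in self.entries]
        return [c for _, c in sorted(scored, reverse=True)[:k]]

index = VectorIndex()
for doc in ["Q3 revenue grew 12% year over year.",
            "The client meeting is scheduled for Tuesday.",
            "Rabobank headquarters are in Utrecht."]:
    index.add(doc)

top = index.search("How did Q3 revenue grow year over year?", k=1)
```

In a production system the retrieved chunks would then be passed to the LLM as context for answer generation, with traces logged for the observability stage.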

The presenter highlighted that parsing documents accurately is crucial for AI retrieval systems. Enterprise documents often have complex layouts with tables and infographics, and simply converting them to plain text can strip away important structure, leading to misread numbers or misinterpreted tables. To address this, she built a pipeline combining traditional text extraction with vision-language models that understand layout.
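One way to structure such a hybrid pipeline is to route each page by layout complexity: cheap text extraction for plain prose, a vision-language model for pages with tables or graphics. The sketch below is an assumption about how this could look; `extract_text` and `vlm_parse` are hypothetical stubs standing in for a PDF library call and a VLM API call.

```python
# Hedged sketch of a hybrid parsing pipeline: simple pages go through plain
# text extraction, visually complex ones through a vision-language model.
from dataclasses import dataclass

@dataclass
class Page:
    text: str
    table_count: int
    image_count: int

def extract_text(page: Page) -> str:
    # Stand-in for a traditional extraction call (e.g. a PDF text layer read).
    return page.text

def vlm_parse(page: Page) -> str:
    # Stand-in for a vision-language model that transcribes layout-heavy pages.
    return f"[VLM transcription preserving {page.table_count} table(s)]"

def parse_page(page: Page) -> str:
    """Use the cheap text path unless the layout looks complex."""
    if page.table_count > 0 or page.image_count > 0:
        return vlm_parse(page)
    return extract_text(page)

pages = [Page("Plain prose page.", 0, 0), Page("", 2, 1)]
parsed = [parse_page(p) for p in pages]
```

The routing heuristic here (count tables and images) is deliberately simple; a real system might score layout complexity from the parser's own metadata.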
Even with modern long-context language models, chunking content remains necessary to avoid overwhelming the model and driving up costs. Chu tested different methods and found that breaking documents into sections worked best for her dataset, reaching high accuracy, though she stressed that the right strategy depends on the specific data.
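Section-based chunking can be as simple as splitting a document at its headings so each chunk remains a coherent unit. The sketch below assumes markdown-style `#` headings; for the PDFs and PowerPoint files described above, the heading structure would first have to be recovered during parsing.

```python
# Section-based chunking: split on heading lines, keeping each heading
# together with its body text.
import re

def chunk_by_section(document: str) -> list[str]:
    # Zero-width split before every markdown heading line.
    parts = re.split(r"(?m)^(?=#{1,6} )", document)
    return [p.strip() for p in parts if p.strip()]

doc = """# Overview
Internal search across thousands of documents.

## Architecture
Ingestion, retrieval, observability.

## Evaluation
Built from real user queries."""

chunks = chunk_by_section(doc)
```

As Chu noted, whether sections beat fixed-size or sentence-level chunks is an empirical question; the point of a helper like this is that it is easy to swap out and benchmark against alternatives on your own data.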
Standard retrieval systems rely on vector similarity, but this can miss important context, such as a document's recency. Her system added temporal scoring to favor newer documents, plus a routing layer to decide whether to retrieve documents or call external APIs. Because models can struggle to fill in tool parameters correctly, the system sometimes asks users to confirm inputs before a call is made.
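Temporal scoring can be sketched as blending the similarity score with a recency term that decays as a document ages. The exponential decay, the 180-day half-life, and the 0.8/0.2 weighting below are illustrative assumptions, not the parameters of Chu's system.

```python
# Blend vector similarity with a recency score so fresher documents can
# outrank slightly better-matching but stale ones.
from datetime import date

def temporal_score(doc_date: date, today: date, half_life_days: float = 180.0) -> float:
    """Decays from 1.0 toward 0.0 as the document ages."""
    age_days = (today - doc_date).days
    return 0.5 ** (age_days / half_life_days)

def combined_score(similarity: float, doc_date: date, today: date,
                   w_sim: float = 0.8, w_time: float = 0.2) -> float:
    return w_sim * similarity + w_time * temporal_score(doc_date, today)

today = date(2026, 4, 1)
# An older document with a slightly better match vs. a fresh, close match:
stale = combined_score(0.82, date(2023, 4, 1), today)
fresh = combined_score(0.80, date(2026, 3, 1), today)
```

With these weights the fresher document wins despite the marginally lower similarity, which is exactly the behavior pure vector search cannot express.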
Evaluation is often neglected, but Chu recommends building datasets from real user queries, tracking failure modes like routing or temporal errors, and using statistical methods to verify improvements. Real queries often provide more value than synthetic datasets.
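The evaluation approach described, an eval set built from real queries plus statistical verification of improvements, can be sketched with a hit-rate metric and a bootstrap confidence interval. The dataset below is fabricated for illustration; bootstrapping is one reasonable choice for the "statistical methods" Chu mentions, not necessarily the one she used.

```python
# Evaluate retrieval hit rate over real user queries and bootstrap a 95%
# confidence interval to check that an improvement is not just noise.
import random

def hit_rate(results: list[bool]) -> float:
    """Fraction of queries where the relevant chunk appeared in the top-k."""
    return sum(results) / len(results)

def bootstrap_ci(results: list[bool], n_boot: int = 2000, seed: int = 0) -> tuple[float, float]:
    """95% bootstrap confidence interval for the hit rate."""
    rng = random.Random(seed)
    rates = sorted(
        hit_rate([rng.choice(results) for _ in results]) for _ in range(n_boot)
    )
    return rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]

# One boolean per real user query: was the relevant chunk retrieved?
baseline = [True] * 62 + [False] * 38   # 62% hit rate (illustrative numbers)
improved = [True] * 74 + [False] * 26   # 74% hit rate (illustrative numbers)

low, high = bootstrap_ci(improved)
```

If the baseline's hit rate falls below the improved run's confidence interval, the change is likely a real gain; tagging each failed query with its failure mode (routing error, temporal error, parsing error) then shows where to invest next.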
The key lesson is that building effective AI search systems requires careful attention to several areas: retrieval quality depends on accurate document parsing and indexing; chunking strategies need to be tested and validated on real datasets; and retrieval should consider signals beyond simple text similarity, such as temporal relevance. The presenter noted that agentic architectures can enhance capabilities but introduce additional complexity, and that robust, structured evaluation frameworks are essential for reliable performance in production AI systems.