
Generative Search: Practical Advice for Retrieval Augmented Generation (RAG)



Sam Partee discusses Vector embeddings in LLMs, a tool capable of capturing the essence of unstructured data used by LLMs to gain access to a wealth of contextually relevant knowledge.


Sam Partee is a principal engineer at Redis helping lead the development and awareness of Redis in machine learning systems. Sam has a background in high performance computing and he previously worked at Cray and HPE on projects like SmartSim, Chapel, and DeterminedAI. In his spare time, Sam enjoys contributing to open source projects, writing on his blog, and spending time with friends and family.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Partee: This year, I've heard from 1000-plus people that are interested in the topic and talked to a ton of different enterprise companies. Who here knows what Redis is? A lot of customers. Who here knows that Redis is a vector database? A lot of talking to do. The point is that I've had to do a lot of talking to educate people on how you can use Redis in these scenarios, and help these customers actually build it and put it into production, and not always successfully. What I'm going to share is what I've learned from all of that. This is essentially an advanced version of large language models in production.

LLMs (Large Language Models)

Large language models are used almost everywhere, right now in all types of applications. They're used for a number of different things, from summarization, to question-answering, and all types. They're often supplied by external knowledge bases called vector databases. They do this through vector similarity search. You can see platforms like Amazon Bedrock being able to string infrastructure together to be able to do this. There's a couple of things that come along with using large language models that are going to help us ground why RAG exists in the first place. Then we're going to build up into how it's actually used. Then, how do you do it? Cost, quality, performance, and security. Cost, because large language models are expensive, especially generative ones that are really good. Quality, because they often make stuff up. If you ask them to do things, they do it. They have no knowledge of right and wrong, they are instruction-based models: you instruct them, they do. If you instruct them to do something, and they don't know, they will make something up. Their quality is often bad in certain scenarios that you're not necessarily always thinking about. It's been a case of many times of those failures that I was talking about in the systems that we had. A lot of times people call those hallucinations. I don't love that term. Really, it's just wrong information. Because it personifies the LLM to say hallucinations, like it's hallucinating. Really, it's just wrong. It's just not the right answer, not the right information. Performance, does anybody know what the QPS of a lot of these large language models is, queries per second? It's like two queries per second. You know what the queries per second of a single open source Redis instance is? It's in the thousands. Let's put it that way. You have systems where you suddenly have a large bottleneck, and do you span that LLM out and create copies? What happens to that first point there? Your cost goes way up. 
You have to start thinking about, how can I take these attributes and make them work together? Lastly, security. What happens if you want to deploy a RAG system where you have something that is internal and external? How are you separating that data?

Rethinking Data Strategy for LLMs

This is where I start to say, grounding RAG, we're going to rethink the data strategy. First, everybody wants this. It's a private ChatGPT. I've probably built 30 of these this year. It's just ChatGPT plus our internal knowledge base of a bunch of PDFs, and notes, and whatnot that we have at the company, PowerPoints, you name it, and there's all types of modalities. Everybody wants that. That's the goal. Do I fine-tune to get this? Do I take all of that data and then create the Redis LLM on all of our internal docs? I could, and it would certainly become better. Usually, fine-tuning is much better for behavioral changes, and how it speaks, how it acts, rather than what it knows. This can also lead to security issues. That internal-external that I talked about, what happens when you want to use the same LLM? That knowledge is in its parameter set. You can't then ask it to go be an external LLM, because it has that knowledge. If prompted correctly, it will say that information. Do you just feed everything into the context window? Just say, ok, I'm just going to slam all this information in at runtime. First of all, you have the limitation of even in the best model is 32k. What happens then when your costs start to go up, and then on top of that, relevance? If you say that this is the context for the question that you're supposed to answer, and then there's irrelevant context, that model is going to use that context that's irrelevant to answer the question. You're instructing it to do so. There's a middle ground. This is where we're going to talk about vector databases. This is how I talk about this, because a lot of times, it's not either/or. You do one, and then maybe you do another. You fine-tune a model to change how it acts or how it talks. Then you also have an external knowledge base in the form of a vector database.

Vector Databases

What do vector databases do? They perform vector search. Vector similarity search, if you want to expand it out. This is an oversimplified vector space that you see here. It's obviously not two dimensional. These vectors are of 1,536 dimensions in the case of OpenAI. A lot of times they're even bigger than that, if you go on Hugging Face. What you see here is this semantic search space of three sentences. It's, today's a sunny day, that is a very happy person, that is a very happy dog. Then you have a query. That query is, that is a happy person. What you want to calculate is, how far is that sentence in its semantic representation from the other sentences in my search space? You do this by having those vectors and doing the same thing you did in fifth grade when they taught you SOHCAHTOA, which is cosine. Cosine similarity of the angle between the two vectors in that search space. Then you get a vector distance. This sentence is this vector distance away from this sentence, and I always do this. Because really, that is all you're doing. That is an extremely efficient computation. That's why these things can have billions of vectors. They can perform billion scale vector similarity search, because they're doing a very simple mathematical operation. In this case, it's only three vectors, but you can scale these into the hundreds of millions, billions. That's because these representations are very computationally efficient. Think about taking an extremely large paragraph and boiling that down into a vector with 1,536 floating-point 32 numbers. You're going from megabytes to kilobytes. That representation is more compact, space efficient, and, in runtime, more computationally efficient. On top of that, it's better than things like BM25, because it accounts for things like synonyms. If you say brother, and you say relative, BM25 is not going to count those two words as the same. 
A semantic search that has the knowledge of a model that's read all of Wikipedia and all of Reddit, encoded into that semantic vector, will know that relative and brother belong to the same category, at least relatively, of words. What do vector databases do? They essentially just put this operation into production. The one thing you should know here, Redis is a great vector database.
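To make the cosine comparison concrete, here is a minimal sketch using the three sentences from the example. The 3-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and `cosine_similarity` is a plain-Python helper, not a Redis API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; an embedding model would produce these.
corpus = {
    "Today is a sunny day": [0.9, 0.1, 0.0],
    "That is a very happy person": [0.1, 0.9, 0.3],
    "That is a very happy dog": [0.1, 0.7, 0.7],
}
query_vec = [0.1, 0.95, 0.2]  # stands in for "That is a happy person"

# Rank sentences from most to least similar to the query.
ranked = sorted(
    corpus,
    key=lambda sentence: cosine_similarity(query_vec, corpus[sentence]),
    reverse=True,
)
print(ranked)
```

A vector database performs the same comparison, but over an index built to avoid scanning every vector.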

Retrieval Augmented Generation (RAGs)

What is retrieval augmented generation? How do I use a vector database as a knowledge base? I always love this one, because a lot of people are like, vector databases, everybody wants them. The market is so flooded. The truth is, this is why: you have people like Sequoia putting out stats, and they believe 88% of these large language model applications are going to use a retrieval mechanism, whether it's a vector database or not. In this case, you can see they're saying it's doing those things that I talked about, that middle ground between those two operations, where people do use large language models. That middle ground is filled by retrieval in this case. What is the process? This is a really simple, definitely oversimplified diagram. You have a user query, let's say, what is Redis? That goes to an embedding model. That embedding model then gives you a list of numbers. It's just simple floating-point 32, or bfloat16 if you're doing an optimized or a quantized approach. Then you have a vector search, which using that embedding returns passages of text. In this case, I'm saying documents. Just imagine it returns you a bunch of PDFs that are relevant to the user query. Then you create a prompt that says, here's a question. Here's the context. Now answer that question. That's all RAG is. That's the entire concept. It doesn't necessarily have to be Q&A, and I'll show you that later. In that case, all I'm saying is, I need more information than is present in the large language model to do something, whether it's summarization, question-answering, what have you. You might use something like Cohere or Hugging Face for the embedding model. Obviously, you'd use Redis as your vector database. You might use something like OpenAI because their generative models are currently really good, if you look at benchmarks. 
Then to chain this all together, you might use something like LangChain, or if you really want a high-level GUI, you might use Relevance, or LlamaIndex with Jerry Liu's company.
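The query, embed, search, prompt flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the passages are invented, and the final call to a generative model is left out, so the sketch stops at the constructed prompt.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())

def similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, passages, k=1):
    """Return the k passages most similar to the query."""
    query_vec = embed(query)
    return sorted(passages, key=lambda p: similarity(query_vec, embed(p)), reverse=True)[:k]

def build_prompt(question, context):
    """Here's a question, here's the context, now answer that question."""
    joined = "\n".join(context)
    return (
        "Answer the question using only the context.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )

passages = [
    "Redis is an in-memory data store that also supports vector search.",
    "Paris is the capital of France.",
]
question = "What is Redis?"
context = retrieve(question, passages)
prompt = build_prompt(question, context)
print(prompt)
```

In a real system the prompt would then be sent to the generative model, and frameworks such as LangChain or LlamaIndex wrap these steps.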

The principle here, and it's really simple, more relevancy, more relevant context, you get a better answer. That's one of the most important things about it. There's a thing called a range query, if you're looking at that space, that diagram earlier, you can specify a vector distance, a range away from it such that if you don't get any context, if it's not within that range, if it's not similar enough, then you can say, don't say anything, you don't know anything. If you want that model to be strictly bounded to that search space, then, no hallucinations, because it's only ever going to generate something with that context if it gets retrieved. That's relying on the retrieval. You're then flipping the problem around and saying, I'm relying completely on my retrieval process in order to generate this information, which leads you to just have to curate the retrieval process. Then you can rely on other people for the LLM syntax. The benefits of this are, it's cheaper and faster than fine-tuning. It's better security, like we mentioned, with fine-tuning earlier. In the case of databases like Redis, or even some other vector databases, you can update this in real time. Imagine if you had a network of sensor data, and you were an incident engineer, and something happened. You said, what just happened to machine 4? That data has to be new. You can't refine-tune on that. You can't stuff all of that information into the context window. It has to be in a real-time data platform ready to go. Then, lastly, like I was mentioning, it allows you to have multi-tenancy, it allows you to separate these users from those users, that company from this company, and those documents from those documents. That is a really important part of actually having knowledge in these systems. It's used for more than just question-answering, although currently a lot of them seem like question-answering. There's more than just that in the system right now.
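A rough sketch of the range-query idea: only context within a chosen vector distance is passed to the model, and if nothing falls inside the range the system declines to answer. The function names, the 2-dimensional vectors, the threshold of 0.2, and `fake_llm` are all illustrative assumptions, not a Redis or LLM API.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def answer_with_range(query_vec, index, max_distance, generate):
    """Generate only from context inside the distance range, else refuse."""
    scored = [(cosine_distance(query_vec, vec), text) for text, vec in index]
    in_range = sorted((dist, text) for dist, text in scored if dist <= max_distance)
    if not in_range:
        return "I don't know."
    return generate([text for _, text in in_range])

def fake_llm(context):
    """Hypothetical stand-in for a generative model call."""
    return "Based on: " + " ".join(context)

index = [
    ("Machine 4 overheated at 09:14.", [1.0, 0.0]),
    ("The cafeteria opens at noon.", [0.0, 1.0]),
]

grounded = answer_with_range([0.9, 0.1], index, 0.2, fake_llm)
refused = answer_with_range([-1.0, -1.0], index, 0.2, fake_llm)
print(grounded)
print(refused)
```

The quality of the answer now depends almost entirely on the retrieval step and the chosen threshold, which is the trade-off the paragraph above describes.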

Summarization is also really important. We have an arXiv demo that I'll show later. You can take a bunch of arXiv papers and say, I don't really understand this, summarize it in English. That's essentially what it does. Then you have customer service applications where a user says, what was the last thing I ordered, I forgot what it was called? Then you use a feature store to inject their last order right into the prompt, so that the customer service bot can tell your users exactly the last thing they ordered, and do so in a reasonable amount of time. They never have to call an operator again. You never have to sit there and go agent, agent, ever again. That's significantly better in terms of user experience. These are the types of things that people get really excited about, the change of user experience, from one methodology to a whole completely new way of doing it.

Abstraction Levels

Taking a step back, what really matters here? How do I take these systems to prod? You roughly understand the process now. I want to talk about two abstraction levels. First, a service level. This is the service around the RAG system. Everybody wants to talk about the one on the right, the, in my mind, somewhat simpler system, because of the results. It's really flashy. It's cool. These LLMs produce great content. It's interesting. Yet, these systems are still hard to put into prod, you still have to do things like caching and monitoring. We're going to talk a little bit about how it's more like this. The service encompasses the RAG system. It may even be like this, where you may have multiple RAG applications deployed on the same infrastructure. We'll show examples of how to reduce the complexities there.

RAG-Level System (Concepts and Examples for Building LLM Applications)

First, we're going to talk though, about the RAG-level system. This is an example of Q&A system. This is also on the Redis Ventures GitHub, which you can go check out. There are essentially two processes here that you have to accomplish. This is an oversimplification, again, but I'm going to get more complicated as I go. First, you have a background process. The background process is, I take some number of documents. Let's assume that I take the entire corpus of the document for now. I use an embedding model to create a vector. Then I put that into my vector database and associate that vector with some text that was used to create it. Then there is the online system. The online system is where a user enters that query. That query then gets embedded just like you did before. That embedding is used to look up information text in the vector database by performing vector search and returning the associated text. Then you send that in a newly constructed prompt to something like OpenAI to perform a generation. This again is an oversimplified process, but this is really the high level of the RAG-level system.

Let's talk about some specifics. I just said we take the entire corpus of the document. You almost never do that. The reason is, it's hard to search semantically over an entire PDF. There are different meanings in an entire corpus of words. If you have a user query, even if you're using an approach like one I'll show later, that query doesn't look like almost anything in that document, especially at the size of that document, even if you're taking that entire document. In this case, the simple approach would be, just take that raw text and use it for the embeddings. Then have that system we described before, it looks something like this. The problem with the approach is the unrelated context surrounding the text that you actually care about, and that makes the retrieval worse. As I said, in most of these RAG systems, you're actually relying on retrieval. The retrieval is actually the most important part if you're using a vector database like this. That filler text actually really degrades the search in terms of even recall, like very measurable statistics. Instead, we want to do some other things. This one's courtesy of Jerry Liu at LlamaIndex. Instead, you could take that entire PDF and create a document summary. Think about like an abstract of a paper. That is much more concise. The semantics of that information are much broader, they're much richer, they're much closer to what a user query may be. Even if you do have specifics, you can then use these summaries as the high-level retriever, and then do another semantic search through the chunks retrieved from that summary. These are actually two separate vector searches that are going to happen. First, you use the summaries and do a vector search across the summaries. Then, you return the document chunks, and do either local, or, if you can do chained operations, in-database vector search to retrieve between those chunks. In the paper example, we use abstracts, and then we go through the entire paper. 
Imagine, do you want to talk to this paper? Instead of a BM25 text bar at the top right, it's semantic search over the abstracts and titles. Then you can click on the paper and say, I've got a question about this. What it's going to do is only semantic search through that paper. There's a great feature in Redis called temporary indices to do that.
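The two-stage search (summaries first, then chunks inside the matching documents) might look roughly like this. Everything here is an assumption made for illustration: the document layout, the toy 2-dimensional vectors, and the `two_stage_search` helper. It runs in memory rather than against a vector database.

```python
import math

def cos(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def two_stage_search(query_vec, docs, top_docs=1, top_chunks=1):
    """Stage 1: rank documents by summary. Stage 2: rank chunks in those documents only."""
    best_docs = sorted(
        docs, key=lambda d: cos(query_vec, docs[d]["summary_vec"]), reverse=True
    )[:top_docs]
    candidates = [
        (cos(query_vec, vec), text)
        for doc_id in best_docs
        for text, vec in docs[doc_id]["chunks"]
    ]
    return [text for _, text in sorted(candidates, reverse=True)[:top_chunks]]

# Hypothetical papers with made-up summary and chunk embeddings.
docs = {
    "rag_paper": {
        "summary_vec": [1.0, 0.0],
        "chunks": [
            ("RAG combines retrieval with generation.", [0.9, 0.1]),
            ("We evaluate on open-domain QA.", [0.6, 0.4]),
        ],
    },
    "vision_paper": {
        "summary_vec": [0.0, 1.0],
        "chunks": [("We train a convolutional network.", [0.1, 0.9])],
    },
}

results = two_stage_search([0.8, 0.2], docs)
print(results)
```

Chunks from documents that lose the summary stage are never scored, which is what keeps the second search scoped to one paper.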

Next, you can also spend a bunch of money on embeddings. In this approach, you take something smaller, something like every sentence, and then you take all of the context around it as what you return instead. You're only ever doing semantic search on that specific sentence. That query is much more likely to be similar to one specific sentence, which does improve the retrieval. Then for the actual context retrieved for the LLM, you return much more surrounding text. You can do this through things like overlap. Essentially, this is used in cases where more context is often needed. Things where more information is going to be better. As I mentioned earlier, that's not always the case. Going through these data preparation strategies, and going through your problem, what is my user query going to look like? What is the end system going to look like in terms of the retrieval? How do I test and evaluate that retrieval? All of those steps matter for data preparation, because it basically all starts there. Your quality is going to largely depend on data preparation, which is why it's my first slide. This is the arXiv RAG demo that I was talking about. Right now, the deployed one is just semantic search. You can use OpenAI and Hugging Face because we have two indices up there. If you want to look at the RAG one, it's on Redis Ventures. That would just be semantic search to the abstracts, if you want to go check it out. It's a small EC2 server, though, so be nice.
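One way to prepare data for the sentence-level approach is to store, for each sentence, both the sentence itself (the text that gets embedded and searched) and a wider window of neighbouring sentences (the text handed to the LLM). The sketch below assumes a naive regex sentence splitter and a hypothetical `build_windows` helper; the field names are illustrative.

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_windows(text, window=1):
    """For each sentence, keep the sentence to embed plus surrounding context to return."""
    sentences = split_sentences(text)
    records = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window)
        end = min(len(sentences), i + window + 1)
        records.append({
            "search_text": sentence,                     # embed and search on this
            "context": " ".join(sentences[start:end]),   # return this to the LLM
        })
    return records

text = (
    "Redis is in-memory. It supports vector search. "
    "It also has pub/sub. Streams exist too."
)
records = build_windows(text, window=1)
for record in records:
    print(record)
```

The cost mentioned above shows up here as one embedding per sentence instead of one per document or chunk.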

Hybrid querying, probably the most important feature of vector databases. It's ironic because it's the feature that doesn't actually use vector search. This example here, let's say I have two sets of documents, documents written by Paul Graham and documents written by David Sacks. Then in those documents, I have pieces of text from those documents. The problem is, is what if I only want information from David Sacks' documents, or Paul Graham. What if I only had articles written by Paul Graham? Semantic search isn't going to be good at that. What if, in a lot of those passages of David Sacks, Paul Graham is mentioned, then you have a problem. Hybrid search allows you to do things like use BM25, like keyword frequency, which I just spent a long time saying wasn't good in terms of vector search. You can use them together. It's like a pre-filter. You can say, I actually only want documents written by Paul Graham, and now I'll do my vector search. At least in Redis, you can also swap that operation. This allows you to say, let me get this subset of users or I want that user's documents or that user's conversations, or that customer's orders. It allows you to do this with text, tag fields, BM25, geographic regions and radiuses or polygons. Those kinds of features are really important when you are someone who delivers food, and you have a radius around your store that says, should I search through these specific products in this particular store? That's what hybrid search allows you to do.
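The pre-filter idea can be sketched in memory: restrict the candidates by a metadata field first, then run the vector comparison only over what is left. The records, vectors, and the `hybrid_search` helper are illustrative assumptions; a vector database would apply the filter inside the query itself.

```python
import math

def cos(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def vector_search(query_vec, records, k=1):
    """Plain vector search over every record."""
    return sorted(records, key=lambda r: cos(query_vec, r["vec"]), reverse=True)[:k]

def hybrid_search(query_vec, records, author, k=1):
    """Pre-filter on the author field, then vector search over the survivors."""
    filtered = [r for r in records if r["author"] == author]
    return vector_search(query_vec, filtered, k)

# Hypothetical passages with made-up embeddings.
records = [
    {"author": "Paul Graham", "text": "Startups should do things that don't scale.", "vec": [0.9, 0.1]},
    {"author": "David Sacks", "text": "Paul Graham argues startups should do unscalable things.", "vec": [0.95, 0.05]},
    {"author": "Paul Graham", "text": "Lisp shaped how I think about programs.", "vec": [0.1, 0.9]},
]
query_vec = [1.0, 0.0]

unfiltered_top = vector_search(query_vec, records)[0]
filtered_top = hybrid_search(query_vec, records, author="Paul Graham")[0]
print(unfiltered_top["author"], "|", filtered_top["author"])
```

Without the filter, the passage that merely mentions Paul Graham wins on similarity; with it, only his own documents are considered.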

Let's give you an example of this. This is the RAG one. We're using LangChain here. We're just going to pull papers from arXiv. This actually just uses BM25. This is very simple, just BM25. It's going to use the arXiv loader, and it's going to say, load me 20 documents about retrieval augmented generation. You're going to get raw documents back. These are PDFs parsed with like pypdf, or PiPDF, or something like that. Then we're going to index them into Redis, doing this with OpenAI embeddings. Usually, you don't actually have to use OpenAI embeddings. There's a lot of other embedding providers, but it's most recognizable. It's really easy, actually, using LangChain. This all loads up all the documents and all the metadata. I just recently implemented automatic metadata generation. In this case, if you use the arXiv loader of LangChain, every time it's going to load, you see, load all metadata true, it's going to load all of these categories. Then, all of these fields, I can then do hybrid search on. Now, in my RAG, I can say, filter by the category and year. Now I could say I only want machine learning papers, cs.LG, and I only want them to be published in something that starts with 2020 something, so anything in 2020s and beyond. It's anything in 2020. It's a fuzzy text search. Then you can combine those by saying, and. That's really cool. This is all new features of LangChain. You can combine those two filters together using Boolean operators. You could do this arbitrarily long. You can have SQL-like expressions now in your RAG system, to do a hybrid search, which is super cool. Then, you get your results. Here, it's going to be doing relevancy scores, which is going to be one minus the cosine distance in this case. That would be the score. This is how similar those documents were to my query. You could see retrieval augmented generation is the top paper. I'll go back and show you that that is the query that we wrote. 
Now you might say, that seems similar to something that BM25 would do. In this case, we're able to do this across all different types of documents. We're able to say that you only want a specific range of semantic similarity included. I'm only ever going to be returning documents that have a certain level of semantic similarity to that specific query, enhancing my retrieval system, and hence my LLM application. Also, this is RedisVL. It's a command line tool to look at your Redis library, to look at your Redis schema is what I'm doing right there. It's really cool. It's purpose built. Next, hybrid queries in LangChain, just talk about this a little bit, but you can see a bunch of examples there.

The next approach, HyDE, one of the more fun ones. Using fake answers to look up context. Let's think of a Q&A system here. I'm going to have a user's question, and then I'm going to use an LLM at first to generate a fake answer. Why would I do that? A fake answer is often more semantically similar to a real answer than a query. Think about it, like what is Redis and an LLM generated answer to what is Redis? Something like an in-memory database that can be an awesome vector database. The second part of that answer is more semantically similar to what I'm looking for in terms of context than, what is Redis, as a query. If I'm using that query to create an embedding and do my vector search, then, oftentimes, it's not as good as if I actually have an LLM purposefully hallucinate an answer. I then create an embedding from that hallucinated answer, and search with that. It's a really cool technique. It doesn't always work, and it's for specific situations, but it can be really impactful to retrieval. LlamaIndex and LangChain have both implemented this now. You can use this HyDE approach that will do it for you automatically. This is really slow. It pings an LLM twice. You got to be able to have like a 30 second response time. I can show you an app about this. You should often have it as a backup plan for when something like no context gets retrieved. You do a vector search, right, which is often really fast. Cheap embedding model vector search, I didn't get anything, what do I do? In the background, you can be doing an asynchronous call to a HyDE service that is running that. You kick them off at the same time, and then in the background, if you don't retrieve any context, you can return that HyDE answer. That's how we speed up a lot of the HyDE ones, is that it's like your backup plan in case none of the context gets retrieved. You run concurrently like an asyncio.gather or something like that.
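The concurrent "backup plan" pattern can be sketched with `asyncio.gather`. Both coroutines below are stand-ins: `direct_search` simulates a fast vector search that finds nothing, and `hyde_search` simulates the slower path of generating a fake answer, embedding it, and searching with it. No real LLM or database is called, and the sleep durations are arbitrary.

```python
import asyncio

async def direct_search(query):
    """Stand-in for embedding the raw query and running a vector search."""
    await asyncio.sleep(0.01)
    return []  # simulate "no context retrieved"

async def hyde_search(query):
    """Stand-in for: LLM writes a fake answer, embed it, search with that embedding."""
    await asyncio.sleep(0.05)
    fake_answer = f"A plausible answer to: {query}"
    return [f"context found via hypothetical answer ({fake_answer})"]

async def retrieve(query):
    """Kick off both searches at once; fall back to HyDE only if direct search is empty."""
    direct, hyde = await asyncio.gather(direct_search(query), hyde_search(query))
    return direct if direct else hyde

context = asyncio.run(retrieve("What is Redis?"))
print(context)
```

This version always waits for both calls to finish. A production variant might return as soon as the direct search yields context and cancel the HyDE task, which this sketch does not attempt.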

I'm going to show you an example of a HyDE one that's really funny. I built this for fun. This had no practical, commercial application, but really shows off the HyDE approach pretty well. This is online, you can get on the Redis Ventures GitHub. I'm not going to show a lot of this. You see the separation of offline and online that I was talking about here. The bottom part, there's a line. Bottom part is offline, top part is online. Offline, we're creating embeddings. Sometimes that is online too, that you have to do constant updating, but I like to separate those two. Online, in this case, we have a Streamlit GUI. We have a generative model, an embedding model. Then we have a FastAPI backend in Redis. That's it. Those things allow us to create a process that uses HyDE to generate a recommendation for a hotel. How does it do this? It's going to take user inputs, positive and negative qualities of a hotel. I'm going to say something like nice amenities, like a pool and a gym. What do you want to avoid in a hotel? Mean staff. Then it's asked to generate a fake review, embodying the positive and opposite of the negative qualities. Think about that again. You ask an LLM to say, write me a review, which will likely be way more similar to the reviews in your dataset semantically, and embodying the positive qualities and the opposite of the negative qualities. In this case, to touch on the point about prompt engineering, you know what the most impactful thing about this retrieval was and increase the context retrieval by 7% on the evaluation set? In the prompt for the LLM, including, you're not that smart. Because it would often write really well-written long reviews, instead of the ones that were in the dataset, which were generated by us. Saying you're not that smart, actually made it semantically more similar to the reviews in the dataset, which is a hilarious point about this whole demo. It's still on the codebase, so you can go check it out.

Once it does that, you're going to do a semantic search. You're going to generate a recommendation with another prompt that says, given all this context, and given the user's positive and negative qualities, what does the user want? Recommend a hotel. In this case, it says, "The Alexandrian, Autograph Collection is a great fit for you because it offers nice amenities like a pool and a gym. As mentioned in the reviews, guests have highlighted the hotel's convenient location near public transportation and the waterfront, making it easy to explore the area. The staff at this hotel is highly praised for being amazing, friendly, and helpful, ensuring a pleasant stay," which is almost exactly what the user wanted, in this case. It's actually really good at recommending hotels. What you see on the left, actually, is state and city. Those are hybrid filters. If you actually want to use this, to go look for a hotel, you could. Personally, I built this because I was frustrated with a certain travel platform. I thought, why can't I just talk to an LLM and have it read all the reviews for me and generate me what I want? This does roughly that. There are some sharp edges, but you get the point. Then you can see that it not only does that, it generates it, and then it returns the reviews it generated it for. Because when you retrieve context, you can just save it in a state variable in your frontend, and show it to the user, and they refresh it and you search again. In this case, you can also store metadata, like the hotel name, the state, the city, where its address is. You can even get further, give them directions, where they are geolocated, and give them directions right to the hotel. You can build whole new experiences with this thing. That's what people get really excited about. That's what we've seen a lot of companies do, and make some really cool applications with, that hopefully are coming out soon.

Service-Level System Advice (System Level Concepts with Examples for RAG Apps)

That was a lot about the RAG-level system. The things you do to actually make the particulars of a RAG application work. There's a lot more on top of that that can enhance a RAG system that I've seen this year, or that I've done this year that I'm going to share with you. First, this is the diagram we're going to talk about. You do not need to understand this entire thing, but you should notice the RAG-level system in the top right. That's what we just talked about. That should look roughly familiar with the addition of a feature store, which we'll talk about. It's obviously abstracted. This is roughly the pieces that we're going to talk about, and how you'd use them in a service around a RAG system. Remember this one for the hybrid search? How do you update your vector embeddings? When do you update your vector embeddings? How do you deal with duplicates? What if you have the same documents in your blob store twice? What if when you go to reupload that data, you then say, I just reuploaded the same document twice. Or you get new data that's incredibly similar, but it's a couple words off, so it doesn't hash correctly or something. There's a lot of problems with actually just maintaining a dataset of text. Text is unstructured data. Maintaining blobs of text can actually be relatively difficult. Coming up with a system for actually taking all of those PDFs and systematically updating them when a new one gets updated, or say it's a chat experience, and every time the customer chats, how does that get indexed? When does it get indexed? What we've done, most of the time, are three things, document level records, context level records, and just pure hot swapping the index. I'll tell you about three of these.

Document level records, something like this, last modified, so you can see the addition that is not incredibly clear. I put some boundaries around it, you see last modified right here. That you can then use in a separate data structure to say, has this document been updated? Because even a makefile has this. You can say, when was the last time this document was updated? You can go to your blob store and retrieve that metadata, and then compare that to the document metadata in your vector store. Then you can say, ok, at least I know that this document is now new in my blob store. That operation is relatively cheap. At billions of scale, you need to think a little bit differently. We've had this up in the hundreds of millions. Context level records. This will definitely not work in the hundreds of millions, because what happens? All of those documents, especially if we're using some of the approaches that I talked about earlier, if we're actually creating embeddings, will create hundreds of embeddings each. If you have a million documents, you might have 100 million embeddings. In this case, if you were to use context level records, as in, has this context within this document changed? Then you might be incurring a lot of penalties on performance. However, if you have a system like an FAQ, we have one of these, that has very strict boundaries, and it's relatively small, but there's still 400 something questions and answers. You can say, has this answer, has this piece of context, has this paragraph within this FAQ changed? That is when it's actually acceptable to just change the context. You don't need to change the whole FAQ every single time, because that'd be still creating 10,000-plus embeddings every so often. You're incurring a cost that you don't have to.
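A document-level check of this kind can be as small as comparing two timestamp maps. The sketch below assumes both stores expose a last-modified value per document as an epoch timestamp; the dictionaries and the `find_stale_documents` helper are hypothetical.

```python
def find_stale_documents(blob_last_modified, indexed_last_modified):
    """Return ids of documents that are new, or newer in the blob store than in the index."""
    stale = []
    for doc_id, modified_at in blob_last_modified.items():
        indexed_at = indexed_last_modified.get(doc_id)
        if indexed_at is None or modified_at > indexed_at:
            stale.append(doc_id)
    return sorted(stale)

# Hypothetical metadata: epoch seconds of the last modification.
blob_last_modified = {
    "faq.pdf": 1700000500,       # edited after it was indexed
    "handbook.pdf": 1700000000,  # unchanged
    "new.pdf": 1700000900,       # never indexed
}
indexed_last_modified = {
    "faq.pdf": 1700000000,
    "handbook.pdf": 1700000000,
}

stale = find_stale_documents(blob_last_modified, indexed_last_modified)
print(stale)
```

Only the documents in `stale` need to be re-embedded. The same comparison applied per paragraph gives the context-level variant, at the cost of one record per piece of context.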

Lastly, there's hot swap. This works better for platforms where you actually just rebuild the index every time. It works in this case because Redis does asynchronous indexing in the background. Hot swapping is like an A/B test. You say, I'm just going to build an entirely new index, and then point the first index's alias at it. This is obviously the most expensive route, but it's often one that people do when they mess up. I've done it myself. Knowing how to do this with your platform is actually really important. That's why it's in here.
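As a rough sketch of the hot-swap pattern, the toy registry below builds a second index next to the live one and then repoints an alias. `IndexRegistry` and its methods are hypothetical stand-ins, not a real client API; in Redis, the search commands `FT.ALIASADD` and `FT.ALIASUPDATE` play the role of `point_alias` here.

```python
class IndexRegistry:
    """Toy stand-in for a vector database that supports index aliases."""

    def __init__(self):
        self.indexes = {}  # index name -> {doc_id: vector}
        self.aliases = {}  # alias -> index name

    def build_index(self, name, docs):
        self.indexes[name] = dict(docs)

    def point_alias(self, alias, name):
        if name not in self.indexes:
            raise KeyError(f"unknown index: {name}")
        # One assignment: readers see the old index or the new one, never a mix.
        self.aliases[alias] = name

    def search_ids(self, alias):
        # Queries always go through the alias, so they never name a version.
        return sorted(self.indexes[self.aliases[alias]])

    def drop_index(self, name):
        if name in self.aliases.values():
            raise ValueError(f"index {name} is still aliased")
        del self.indexes[name]


registry = IndexRegistry()
registry.build_index("docs_v1", {"a": [0.1, 0.2], "b": [0.3, 0.4]})
registry.point_alias("docs", "docs_v1")
before = registry.search_ids("docs")

# Rebuild everything under a new name while "docs" keeps serving docs_v1.
registry.build_index("docs_v2", {"a": [0.1, 0.2], "b": [0.3, 0.5], "c": [0.9, 0.1]})
registry.point_alias("docs", "docs_v2")  # the swap
registry.drop_index("docs_v1")           # reclaim the old index afterwards
after = registry.search_ids("docs")
```

The cost is that two full copies of the index exist during the rebuild, which matches the point above that this is the most expensive of the three routes.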

Non-vector data storage. We talked a little bit about metadata. Where do you actually put it? Do you put it in a SQL server? We talked about how the LLM already has a QPS of about two. Your system is really not going to be very fast. If you have to go to a vector database, do a vector search that gets a pointer over to a SQL store that then calls an LLM, that then goes back to the SQL store to retrieve some other pointer, your system is going to be slow. The advice here is, don't separate metadata from vectors. When you do your vector search, retrieve that metadata. I did this for NVIDIA in a recommendation system. You see the box there that says 161%? That was the increase that we had in inferences per second of this recommendation system by taking metadata that wasn't colocated, and colocating it in JSON documents within Redis. In one vector search, instead of two network calls, like on the before side there, we could return all of that necessary metadata. By not doing that network hop and profiling our network correctly, we were able to get a 161% improvement in inferences per second. There's a lot more to that talk, I go through five optimizations as you can see there. If you're in recommendation systems, that's a good one.
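To make the round-trip argument concrete, here is a small in-memory sketch that counts simulated network calls for the separated and colocated layouts. `CountingStore`, the product records, and the query vector are all made up for illustration; the numbers say nothing about real latency.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class CountingStore:
    """In-memory store that counts each call as one simulated network round trip."""

    def __init__(self, records):
        self.records = records
        self.round_trips = 0

    def nearest(self, query, fields):
        self.round_trips += 1
        best = max(self.records, key=lambda r: cosine(query, r["vector"]))
        return {f: best[f] for f in fields}

    def lookup(self, doc_id, fields):
        self.round_trips += 1
        record = next(r for r in self.records if r["id"] == doc_id)
        return {f: record[f] for f in fields}


query = [0.9, 0.1]

# Separated layout: vectors in one store, metadata in another.
vectors = CountingStore([
    {"id": "p1", "vector": [1.0, 0.0]},
    {"id": "p2", "vector": [0.0, 1.0]},
])
metadata = CountingStore([
    {"id": "p1", "title": "Hiking boots", "price": 120},
    {"id": "p2", "title": "Rain jacket", "price": 80},
])
hit = vectors.nearest(query, ["id"])
meta = metadata.lookup(hit["id"], ["title", "price"])
separated_trips = vectors.round_trips + metadata.round_trips

# Colocated layout: one record holds both, so one search returns everything.
colocated = CountingStore([
    {"id": "p1", "vector": [1.0, 0.0], "title": "Hiking boots", "price": 120},
    {"id": "p2", "vector": [0.0, 1.0], "title": "Rain jacket", "price": 80},
])
result = colocated.nearest(query, ["id", "title", "price"])
```

Both layouts return the same title and price; the colocated one simply gets there in one call instead of two.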

Feature injection, this is another really interesting one. You can see the prompts there. If you want to have a chatbot experience, you can maybe include the chat history there too. It's really important that you also have things that may be rapidly updating, like the user's address. The user may have changed that address. Now that might not be rapidly updating, but that might actually get updated in your platform. You don't necessarily want to be pulling that information from a slow storage system if it's a chat experience. No user is going to wait 30 seconds for your chat experience to respond. That's got to be under 5-second responses almost every time to increase retention. In this case, you can have an online feature store, specifically, or a fast feature store, that would be used to pull that information into the prompt. You can use agents. There's stuff in LangChain and whatnot that you can hook up to your feature orchestration platform, whether you're using Tecton, or Feast, or what have you. You can pull that right into the prompt so that your chatbot experience can now know things about your users. You can imagine this doesn't have to just be users, too. I gave the systems example earlier, about sensor data and incident engineers talking to sensor networks. In that case too, the sensor data is fresh, every 15 milliseconds. They can ask their questions, what's going on here? Make me a plot of x, and it's going to plot out that sensor data. Because with these models, you can make it multimodal now. That is how feature injection works.
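A minimal sketch of feature injection follows, assuming the online feature store can be treated as a fast key-value lookup. The plain dictionary, the prompt layout, and `build_prompt` are hypothetical; a real deployment would read from whatever feature platform is in use.

```python
from string import Template

PROMPT = Template(
    "You are a support assistant.\n"
    "User profile:\n$features\n"
    "Recent chat:\n$history\n"
    "Context:\n$context\n"
    "Question: $question"
)


def build_prompt(question, user_id, feature_store, history, context_chunks, max_turns=3):
    """Pull fresh user features at request time and place them in the prompt."""
    features = feature_store.get(user_id, {})
    feature_lines = "\n".join(
        f"- {name}: {value}" for name, value in sorted(features.items())
    )
    return PROMPT.substitute(
        features=feature_lines or "- (none)",
        history="\n".join(history[-max_turns:]) or "(none)",
        context="\n".join(context_chunks) or "(none)",
        question=question,
    )


# A dict stands in for the online feature store (assumption for this sketch).
feature_store = {"user-42": {"address": "1 Old Road", "plan": "pro"}}
history = ["user: hi", "assistant: hello, how can I help?"]
chunks = ["Orders ship to the address on file."]

prompt_before = build_prompt("Where will my order ship?", "user-42", feature_store, history, chunks)

# The user updates their address; the next prompt picks it up with no re-indexing.
feature_store["user-42"]["address"] = "12 New Street"
prompt_after = build_prompt("Where will my order ship?", "user-42", feature_store, history, chunks)
```

Because the features are read on every request, nothing has to be re-embedded when a value changes, which is the property the chat latency budget depends on.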

Semantic caching, this is a really good one. It's a vector database used as a cache. Why would you do this? Actually, I'm just going to go straight to the example. Let's say the query in a Q&A system was, what is the meaning of life? It responded with something awesome. I remember it being great. Then, what if another user somewhere else in the world said, what really is the meaning of life? Do you think that LLM should recompute that answer? No. Why are you spending an extra 0.02 cents on that query embedding, and then money on the generation too? You should have it so that if it's semantically similar enough, you just return a cached answer. We gave the FAQ example earlier. We have a couple of use cases where people just completely pre-populate the cache, so that the LLM almost never gets invoked, but it does in a couple of cases. It's super bounded. It also allows you to basically predefine a lot of answers. You're essentially using your evaluations as the system. When your evaluation set is very large, that can be useful. In this case, you're not only saving on money, but your QPS goes way up on average. In this case, it was, what? Similarity is 0.97. Then there was a 97% speedup. Actually, that's ironic. In this case, we're saving that much time and decreasing the latency by that much. This is also RedisVL. This is also on the Redis Ventures GitHub. This is the client. It has a built-in abstraction that you can see here, for semantic caching. You can just say, set threshold, and change that threshold. There are some hard parts about this, like, where do I set that threshold based on my application?
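The idea can be sketched without any particular client library. The example below is an assumption-laden toy: `toy_embed` is a bag-of-words stand-in for a real embedding model, `FakeLLM` is a stub that counts calls, and `SemanticCache` is a hypothetical class, not the RedisVL abstraction shown in the talk. The threshold of 0.9 suits this toy embedding only.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Bag-of-words stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (query vector, cached answer)

    def get(self, query):
        vector = self.embed(query)
        best_score, best_answer = 0.0, None
        for cached_vector, answer in self.entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))


class FakeLLM:
    """Stub generator that records how often it is invoked."""

    def __init__(self):
        self.calls = 0

    def generate(self, query):
        self.calls += 1
        return f"answer to: {query}"


def answer(query, cache, llm):
    cached = cache.get(query)
    if cached is not None:
        return cached  # cache hit: no generation cost
    result = llm.generate(query)
    cache.put(query, result)
    return result


cache = SemanticCache(toy_embed, threshold=0.9)
llm = FakeLLM()
first = answer("What is the meaning of life?", cache, llm)
second = answer("What really is the meaning of life?", cache, llm)  # similar enough
third = answer("How do I reset my password?", cache, llm)           # unrelated
```

Set the threshold too low and users get answers to questions they did not ask; set it too high and the cache rarely hits. That is the tuning problem raised at the end of the paragraph above.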

What do we know about service-level systems? We talked about updating embeddings, and when and where and how to do that. That data pipeline matters. It's obviously oversimplified in this image, but it really matters to think that through before you go into prod with these types of systems. We talked about using a feature orchestration platform and pulling its features into your prompt by using things like a vector database, a feature store, and a RAG API, something like LangChain, or LlamaIndex, what have you. Then we also talked about semantic caching, and how to make these systems significantly better. There are a couple of things we didn't talk about. Reinforcement learning from human feedback: was that answer good or was that answer bad? Did the user give a thumbs up? How do I incorporate that into the platform? I also didn't talk about external data sources.

How to Get Started

Redis Ventures, the applied AI team at Redis, has the examples. We have a ton of little applications that we built that can show you how to do it. RedisVL is the client that I showed in six different things. I also built the hotel application with that.

Questions and Answers

Participant 1: I think that in the current state of the art, we are doing many things by hand, instead of at, like, the abstraction level, several things you're doing with LangChain or LlamaIndex, or whatever. What do you think about these things moving to the vector database, for example? Because in some way it's like, we are creating the index from the relational databases, because the documents, the query stuff, all these things take time. What do you think should be there?

Partee: First, you wouldn't believe how many company diagrams were like, this is our architecture, and there was a DIY box right in the middle for orchestration. We're like, so are you just doing it yourself? Then a lot of companies, they assume that they can make those types of systems work, as well as something like a LangChain or LlamaIndex. Even though there's a lot of movement on those repositories, a lot of releases are coming out, there's still a lot of great information baked into that code. What I see a lot of people doing, they might go do a LangChain example as a POC. I've seen some take some to prod, even enterprise companies. I see a lot of people also using a lower-level vector database client, and building up those kinds of abstractions by building a POC in those, and then extracting the process over to a lower-level client. That's also why we're trying to build RedisVL. That's a great point. It's something that I see a ton. It's important that also you don't lose that kind of information from LangChain. Your point on the vector databases is good. If you look at what Bob at Weaviate is doing with the generative search module, it's very similar to what you just said. He is trying to take a lot of those abstractions down to the vector database level. I think it may be something in the future, where if that generative search process changes, that it'll be an overextension maybe. Who can predict the future? That's what I mean. It makes it super easy to get started, though. I know that for a fact, because I have a lot of people, especially customers that have said, like, so where's your generative search module? We're like, you do it. I totally understand that. We're yet to see how far it's going to extend down to the infra level. I think a lot of people that are at the infra side of things, want to focus on search algorithms, HNSW, what have you, and then leave the rest of the stuff to LangChain and OpenAI, but we'll see.

Participant 2: You spoke a lot about using vector databases to feed contextual information into the LLM. Of course, recently, we've seen the use of agent architectures. I was wondering if you've seen any architectures in the wild where that was inverted, and you have the LLM decide to reach out into the vector database searching for it, other than you [inaudible 00:45:10].

Partee: Yes, tools. The router, I think it's called, in LlamaIndex, is a really interesting example of this. Where you can have a bunch of predefined tools. It's cool because you can give it a semantic description, and so when a user has a problem where they ask for something, they can search for the tool they need to use, let's say, it's a browsing tool or a vector database tool or what have you, through semantic search. They can semantically search for the tool that they need in their gigantic list of possible tools. We have seen this. However, I haven't seen a lot of them in production, because it's fragile in some ways. If you allow it to browse the web, you lose that contextual bounding box that I've been talking about a ton. Even tools can really go and do some interesting things. When you take things to prod, hence, the part of the talk, it was really focused on trying, at least in the current moment, and that may change in the future, when these agents get better infrastructure around them. We're still wooden horses on dirt roads in this field right now, in my opinion. It's still really important to put a bounding box around them, especially when you go talk to enterprise companies, because they really don't want to see this LLM say the wrong thing, especially if it's a customer facing one. Internal use cases, a little bit more leeway. You could probably see an agent get used in this. Right now, that's the state as I see.

Participant 3: Sometimes recency is pretty important. How would you include something like freshness?

Partee: You could do a numeric field as part of a hybrid search. You could do a recency score that gets updated. Every time the vector database gets updated, you update that recency score. It doesn't have to be like a last modified tag. It could be a score that you create with a separate service. You could have another ranker or another system that's creating this recency score and storing that value, which is a good separation of abstraction, and then have another system utilizing that recency score. I don't know if a numeric filter would be the best way to do it, because then you're bounding it by range. You could sort on demand by it. A lot of vector databases support that. Then you're losing the vector distance. If you did a vector range query, and then you sorted on freshness, that would actually be another way to achieve that. Recency does matter a lot, especially for a news application that we did.
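The last option in this answer, a vector range query followed by a sort on freshness, can be sketched as below. The documents, the `updated_at` values, and the function name `fresh_within_range` are hypothetical, and a real vector database would run both steps server-side.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)


def fresh_within_range(query, docs, radius, k):
    """Keep everything within `radius` of the query, then return the k freshest."""
    in_range = [d for d in docs if cosine_distance(query, d["vector"]) <= radius]
    in_range.sort(key=lambda d: d["updated_at"], reverse=True)
    return [d["id"] for d in in_range[:k]]


docs = [
    {"id": "old-close", "vector": [1.0, 0.0], "updated_at": 100},
    {"id": "new-close", "vector": [0.9, 0.1], "updated_at": 300},
    {"id": "mid-close", "vector": [0.8, 0.2], "updated_at": 200},
    {"id": "new-far", "vector": [0.0, 1.0], "updated_at": 400},
]

top = fresh_within_range([1.0, 0.0], docs, radius=0.1, k=2)
```

The range filter guarantees relevance first, so the freshest document overall (`new-far`) is excluded, and the sort then prefers newer documents among those that are relevant enough. As the answer notes, the ordering no longer reflects vector distance.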

Participant 4: When I was thinking about approaching these kinds of problems, I naively assumed that I would maybe have the LLM generate a query against some underlying datastore. I think, especially thinking about a more structured datastore, like a SQL store, and then have the LLM process the results for that SQL query. Is that something that you've seen people do or are there problems with it that I would [inaudible 00:48:34]?

Partee: That actually gets to what he was saying, that could be a tool. I've seen that used as a tool where you might do something, and then that triggers: this is information that could be in my SQL database. Then that tool's job is to do what you're saying. Once it's magically found that it needs to use that tool, then the query gets generated, the query gets executed, and then the results of that query are processed, like you're saying. A lot of times that's actually to feed into a graph, or a plot for a dashboard or something. It's a chained set of tools. I don't see a ton that are just that. It's usually for another purpose.
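The generate, execute, process chain described here can be sketched with Python's built-in `sqlite3` module. Everything model-related is stubbed: `fake_llm_to_sql` is a hypothetical stand-in that returns a fixed query, and the `orders` table is invented for the example. The SELECT-only guard is deliberately naive and is not a substitute for proper database permissions.

```python
import sqlite3


def fake_llm_to_sql(question):
    """Stand-in for an LLM call that turns a question into SQL (hypothetical)."""
    return (
        "SELECT region, SUM(amount) AS total "
        "FROM orders GROUP BY region ORDER BY total DESC"
    )


def run_read_only(conn, sql):
    """Naive guard: refuse anything that is not a SELECT statement."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchall()


def sql_tool(conn, question):
    sql = fake_llm_to_sql(question)   # 1. query gets generated
    rows = run_read_only(conn, sql)   # 2. query gets executed
    # 3. results are shaped for the next tool in the chain, e.g. a plotting step.
    return [{"region": region, "total": total} for region, total in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("west", 25.0), ("east", 5.0)],
)

result = sql_tool(conn, "Which region sold the most?")
```

The structured output is what makes this useful as one link in a chain, since a downstream plotting or dashboard tool can consume it directly.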




Recorded at:

Jul 05, 2024