
Retrieval-Augmented Generation (RAG) Patterns and Best Practices



Jay Alammar discusses the common schematics of RAG systems and tips on how to improve them.


Jay Alammar is Director and Engineering Fellow at Cohere, a leading provider of large language models for text generation, search, and retrieval-augmented generation for the enterprise. He is co-author of Hands-On Large Language Models.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Alammar: It is a delight to be sharing some ideas about what I find to be one of the most fascinating topics in maybe all of technology. What a time for what we call AI or language AI. It's been a roller-coaster ride if you've been paying attention the last few years. This is a pattern that you can see in computing in general: every 10, 15 years, you have this new paradigm that opens up new ways of doing things. The rise of the personal computer, the creation of the web browser, the smartphone. Every one of these creates new ways for people to communicate with computers and new ways to use them. That also changes definitions of jobs, and entire industries as well. Now we stand at this era where generative AI, or AI in general, or possibly generative chat, is this new mode that is probably going to be as big as these previous technological shifts. I even sometimes like to take a further step back and say, if you look at the human species in general, jumps in language technology can be milestones for the species as a whole. You can point at the invention of writing, which is just this jump in human language technology. That jump is technically what we call the beginning of history: once we started using language in that specific way, that's what we call history. Then, about 400, 500 years ago, the printing press, this device made it possible for so many people to have books. Before this invention, if you wanted a copy of a book, you needed to go and copy it yourself or pay somebody to copy it by hand. Once this way of automating books came around, you didn't need to be rich anymore to be able to own a book. Then that opens the door for the next technological revolution, the renaissance. Pretty soon, we're launching Teslas into space within the span of a couple of years. I have talks from two years ago where I say, this jump isn't going to be as big as these two. We don't really know, there's a lot of hype.
I would like to cut through the hype. Two years ago, I put it at maybe less than a 20% chance that this would be as big. Now I put it at less than a 40% chance. It's definitely going to change so many different things, but we'll see if it lives up to this too.


My name is Jay. I've blogged for the last 9 or 10 years about machine learning, language processing, and large language models. The most read article is The Illustrated Transformer, which explains the neural network architecture that led to this explosion in AI capability over the last few years. It has about 3 million page views or so. I'm the author of the upcoming O'Reilly book called "Hands-On Large Language Models." You can already find the first five or six chapters on the O'Reilly platform. I work at Cohere. Cohere is a large language model company, one of the earliest large language model companies. I've been training these models for about four years. It was started by one of the co-authors of the transformer paper, Aidan Gomez. I joined because I was absolutely fascinated by this technology. I wanted to see how this cool machine learning trick would make its way into industry. How will it become products? How will it become features? That's what we're going to be sharing here, some of the things that I learned from what we see at Cohere and how people adopt these products. You're definitely in the right place. This is the right time to start to think about how to build the intelligent systems of the future using technologies like these.

Useful Perspectives of Language AI

When I used to give talks two years ago, I had to introduce to people what a large language model is. That was not a common piece of software knowledge. Now, everywhere I go, developers or non-developers, everybody has interacted with or has some sense of what a chat model is, what ChatGPT is, what a large language model that you can talk to is. How we talk about them is a little bit different. Now what we try to do is raise the awareness that, ok, you've interacted with a generative chat model, a large language model; we suggest that you don't think of it as a black box, we suggest that you don't think of it as a digital brain, regardless of how coherent and how well written it might be. The useful perspective we advise everybody, and developers specifically, is to think about it in terms of what you can build with this. What kinds of capabilities do these give you, or add to your toolbox as a developer? How can you think about it as more than just this box that you send some text into, and you get some text out? That's what we're going to be talking about. When we talk about the language side of it, you can talk about two major families of capabilities that these models bring to you. These can be language understanding capabilities. That's the first group of capabilities that language models provide to you. It can also be language generation capabilities. That's the other group. Even if you're not dealing specifically with text generation, who here has used Midjourney or Stable Diffusion for image generation, or Sora for video generation? All of these models, even though they are generating things other than text, have a language model embedded in them at the beginning of that flow, because you're giving them a prompt, and that understanding of a prompt is the use of a language model for understanding, not necessarily for generation.

For your toolbox, I want to break this down further. Generative chat is only one of multiple things that you can do with generation. Text generation can be summarization; taking a large article and providing a two or three sentence summary is a generative task. Copywriting, if you tell the model to write me an email that says this, or write new variations of this type of text, is another generation task. These are in addition to generative chat. One of the main messages here is, don't be limited to chat only. Everything you can do with generation can go into a lot of different generation tasks. A lot of them should probably be offline. You shouldn't always lock yourself into, I'm going to be always thinking about online chat, right this very minute. Then, even when you think about these other non-chat generative capabilities, you're only thinking about one of the things that AI can do. In fact, some of the most dramatic and most reliable use cases that AI has right now, ones that have even progressed over the last few years, are things like search, or text classification, or categorization. Language models in general, small and large, are some of the best tools that we have for that. Think beyond just generating text. I think it's a little bit unfortunate that this wave is being called generative AI, because some of the most robust applications are really more on the understanding, the representation, the search systems. We will talk about these and where their capabilities lie. This idea of search by meaning, or using a language model for search, is one of the most reliable applications of large language models. If you've tried to build with large language models, reliability becomes something that you care about, because there are so many cherry-picked examples that you've come across in social media.
If you try to build that into a product, you realize that it works 6 out of 10 times, or 3 out of 10 times, depending on the use case. Reliability is important when you think about robustness.

Semantic Search

When you think about search, don't think only about building the next Google Search. Search is a major feature of every application that you interact with; every company needs to search their own internal documents. A lot of search systems are broken. They need to be improved. You can almost never find what you want. How language models can help here is with this idea of semantic search. Let's take a query, for example. You want to search for Brazil to USA travel. Before language models, most search was based on keyword relevance. It will break the query down and compare the keywords in the query to keywords in the document. If a system is not very well tuned, it might give you this article, which has all the same keywords, but the order of the two countries is flipped, so that document really is not a relevant document for that search. If you use language models for search, they have a better ability to catch the intention and the context that you're using these words in. That's the idea of using language models for search. There are two main concepts to think about for semantic search, for using language models for search. One is called dense retrieval. Who has built anything with embedding search, like a vector search database such as Pinecone? That is what's called dense retrieval. The other one is called reranking.

Let's take a quick look at how those work. You have a query, you send it to a search system, in this case a vector database, or it can be a database with vector search capabilities. There are extensions of Postgres, for example, that can do something like that. That vector database has access to a text archive, and then produces a number of search results. That is the search formula, basically. Behind the scenes, how dense retrieval works is that it takes the query and sends it to a language model that provides something called an embedding. Language models, you can have them generate text, but you can also have them generate another thing, which is a list of numbers that captures the meaning of that text. That's what's called an embedding vector. Here you have the query, you send it to the model, the model gives you maybe 300 numbers, a vector of that size, which is a numeric representation useful for downstream software applications. Then you simply send that vector to the vector database, and it finds the nearest neighbors to that point, and those will tend to be the best search results for that query. This model has to be tuned for search specifically. You can't just use any embedding model. An embedding model used for search has to be tuned in a specific way for search. There's a step before that of chunking the text and embedding the archive beforehand, so that is step 0. If you'd like to play around with embeddings, we've open-sourced the embeddings of all of Wikipedia. These are 250 million embedding vectors of Wikipedia in 300 languages. It's the English Wikipedia but also every other Wikipedia out there. It's on Hugging Face; you can download it and play around with it. You can build incredible things with it. That is dense retrieval.
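The embed-then-find-nearest-neighbors flow can be sketched in a few lines of plain Python. This is a toy, not an implementation: the three-number vectors stand in for the few hundred numbers a real search-tuned embedding model would return, and a real system would use a vector database rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    # Compare two embedding vectors by the angle between them.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def dense_retrieve(query_vec, archive, top_k=2):
    # archive: (text, embedding) pairs, chunked and embedded beforehand (step 0).
    scored = sorted(archive, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in scored[:top_k]]

# Toy 3-number "embeddings"; a real embedding model returns a few hundred numbers.
archive = [
    ("Flights from Brazil to the USA", [0.9, 0.1, 0.0]),
    ("Flights from the USA to Brazil", [0.1, 0.9, 0.0]),
    ("Best pizza in town",             [0.0, 0.0, 1.0]),
]
results = dense_retrieve([0.85, 0.2, 0.05], archive, top_k=1)
print(results)  # the nearest neighbor is the best search result
```

The shape is the important part: the query is embedded once, and search reduces to nearest-neighbor lookup in the vector space.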

The other idea is called reranking. Reranking is really the fastest way to inject the intelligence of a language model into an existing search system. It works, basically, in a two-step pipeline. You don't have to remove your existing search system. If you're using Elastic, or any other search system, that can be the first step of your search pipeline. That is the system that queries or searches your million or billion documents, and gets you the top 100 results. Then those 100 results we pass to a language model, and it just changes the order; it outputs the same 100. It says, this article that you told me is number 33 is actually the most relevant to this query, so I'm going to make it number one. This one call at the end of the search pipeline tends to dramatically improve the search system regardless of whether you're using Elastic, embedding search, or hybrid search of the two; reranking just gives you that uplift very quickly. This is a language model that works in what's called a cross-encoder fashion. To the language model, we present the query and the document, and it just outputs a score of relevance. Even if you're familiar with embeddings, this works better than embeddings, because the language model at the time of scoring has access to all of the text of the query and of the document. It's able to give a better result because of that information, while embeddings work on very compressed representations of the two texts that can't encode all the information in them. That's why reranking tends to work better than embeddings, but it has to be a second stage, because it can't operate on a million documents. You need a funnel to choose the top 10 or 100 results to pass to it. Here you get to see the uplift that you can get. These are three different datasets. The light one is keyword search, so that's Elastic or BM25. The one next to it is search by embeddings.
This is Elastic plus reranking. Whether you're using keyword search or embedding search, a reranker is one of the fastest ways to improve the search system that you're building with. The y axis here is accuracy. Who is familiar enough with what that means? What is search accuracy? That is very good, because I have a couple of slides on what that means. Because coming into this, I saw a lot of people wave hands about search accuracy, but I wanted a clear definition. We'll get to that. These are two kinds of language models. These are not generative language models, but these are ways of using language models for search. Embeddings and reranking are these tools.
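As a sketch of the two-stage pipeline, here is a toy version in Python. The first stage is a bag-of-words match standing in for Elastic/BM25, and the "reranker" is a crude order-aware scorer standing in for a real cross-encoder language model; the point is only the shape of the funnel, and how the second stage can catch what bag-of-words cannot.

```python
def keyword_overlap(query, doc):
    # First-stage score: bag-of-words overlap (stand-in for Elastic/BM25).
    return len(set(query.split()) & set(doc.split()))

def lcs_length(a, b):
    # Length of the longest common subsequence of two token lists.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b):
            cur.append(prev[j] + 1 if x == y else max(prev[j + 1], cur[j]))
        prev = cur
    return prev[-1]

def rerank(query, candidates):
    # Second stage: score each (query, document) pair together and reorder.
    # A real reranker is a cross-encoder language model; this toy score
    # rewards matching word order, which bag-of-words search cannot see.
    return sorted(candidates,
                  key=lambda d: lcs_length(query.split(), d.split()),
                  reverse=True)

docs = ["travel from usa to brazil visa rules",
        "travel from brazil to usa visa rules",
        "best beaches in brazil"]
query = "brazil to usa travel"

# Stage 1: keyword search cannot tell the two flipped documents apart.
shortlist = sorted(docs, key=lambda d: -keyword_overlap(query, d))[:2]
# Stage 2: the reranker puts the truly relevant document first.
top = rerank(query, shortlist)
print(top[0])
```

Note the funnel: the cheap first stage narrows a large archive down to a shortlist, and only the shortlist is scored pairwise.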

Search is Vital for the Future of Generation Models

Even if you're interested in generation specifically, search is one of the most important things for the future of generation models. This is an example that showcases why that is. Let's say you have two scenarios. In one, you send a question to a language model. In the other, you send it the same question, but you actually give it some context, maybe even the answer, before you give it the question. Models will tend to answer number two better than number one, regardless of the question. If you give the model the answer plus the question, it will probably have a better answer for you. This is one of the first things that people realized when these models came out. They called them search killers. People were asking these models questions and relying on how factual their information is. A lot of people got into trouble for trusting how coherent those models are, and ascribing authority to their information retrieval. That turned out to be a mistake. Everybody now knows about this problem called hallucinations. Models will tend to give you an answer, and they might be a little bit too confident about it. You should not rely on only a language model for factual retrieval; you need better tools. You need to augment it somehow. That is the method here. When you're relying on the model for information, you're relying on information that is stored within the parameters of the model, and so you're always going to be using larger models. The information is always going to be outdated, because how can you inject new information into the model? You have to train it for another nine months. While in this other paradigm, you actually augment the model with an external data source, where you can inject the information at question time. Whenever somebody asks a question, the model is able to retrieve the information that is needed, and then that informs it.
You get so many different benefits from this. You get much smaller models, because you're not storing ridiculous amounts of information in this inefficient storage mechanism and loading it into GPU memory all the time. You can have up-to-date information; you can change it by the hour. You can give different users the ability to access different data sources. It gives you more freedom in what systems you can build, but it also gives you explainable sources. Which documents did the model look at before it answered this question? That gives you a little bit more transparency into the behavior of the model. Then, that can also help you debug: did the problem come from the search step or from the generation step?

Advanced Retrieval-Augmented Generation (Query Rewriting, Tool Use, and Citations)

This is what retrieval-augmented generation is. It is this intersection of search and generation. It's by far the highest in-demand use case that we see in industry and enterprises that we talk with. One way to look at it is like this: Command is the LLM we use at Cohere. You're tying your LLM to a data source that it is able to access and retrieve information from. That is one way of thinking about it. The conversation is not just with the model itself, but with a model that is connected, that is grounded in a data source it is able to access. Now you have these two steps. Before the question is answered, it goes through a retrieval or search step, and then you go through a generation step. These are the two basic steps of the retrieval pipeline. Another way of seeing it more clearly is, you do a search step and then you get the top 3 documents, for example, or 10 documents. You put those in the prompt with the question. You present that to the language model. Then that's how it's answered. This is the most basic formula of RAG, or retrieval-augmented generation. Then you get even better results if you have a reranker in that step, because if you're only giving the model 10 documents and the 11th document is really the one that has the right information, but it's beyond that cutoff line, the model is not going to be able to generate the right answer. The area where so many failures in RAG happen tends to be the retrieval side, prior to the generation step. Now you have these two language models in the retrieval step, in addition to the generation model in the generation step. That's one of the first ways that you can build a RAG system.
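The basic formula can be sketched like this. The keyword retriever is an illustrative stand-in for whatever search system you use, and the prompt template is a generic example, not a specific vendor's format; the assembled prompt is what would be sent to the generation model.

```python
def retrieve(query, archive, top_k=2):
    # Retrieval step: any search system fits here; keyword overlap is a stand-in.
    terms = set(query.lower().replace("?", "").split())
    ranked = sorted(archive, key=lambda d: -len(terms & set(d.lower().split())))
    return ranked[:top_k]

def build_rag_prompt(query, documents):
    # Generation step input: retrieved documents go into the prompt with the question.
    context = "\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(documents))
    return (f"Use the following documents to answer the question.\n{context}\n\n"
            f"Question: {query}\nAnswer:")

archive = [
    "Dolphins live in oceans and some rivers around the world.",
    "Penguins live almost exclusively in the Southern Hemisphere.",
    "The printing press was invented around 1440.",
]
question = "Where do dolphins live?"
prompt = build_rag_prompt(question, retrieve(question, archive))
print(prompt)  # this prompt is what gets sent to the generation model
```

Everything after this point (reranking, query rewriting, multi-query) refines one of these two steps.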

One challenge is, after you build a system like this, because you have in mind that people will ask direct questions that can be answered with one piece of information, you roll this into production. Then you find people really ask questions like this. They say, "We have an essay due tomorrow. We have to write about some animal. I love penguins. I could write about them. I could also write about dolphins. Are they animals? Maybe. Let's do dolphins. Where do they live for example?" As a software engineer, is it fair for you to take this and throw it at Elastic and say, deal with this? It really is not. This is one of the first ways to improve initial RAG systems, which is to go through a step called query rewriting. This is an example of how that works. When a language model gets a question like this, you can write a query using a generative language model that says, to answer this question, this is the query that I need to search for. To answer this one, it says, where do dolphins live? It has just extracted that piece of information, and that is what is relevant for the search component of the system. Now you have query rewriting as a step with a generation model. You can do it with a language model with a prompt, but in the API that we built, we have a specific parameter for it that we optimize; you can use it in different ways. This is how it works. It outputs the search query, and then you can throw it at your search system.
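A sketch of the wiring: the prompt shown is illustrative, and the canned function stands in for the real generation model call (in practice this is an LLM API call, or a dedicated parameter where the provider offers one).

```python
def rewrite_query(message, llm):
    # Ask a generation model to turn a rambling message into a focused search query.
    prompt = ("Extract the search query needed to answer the user's message.\n"
              f"Message: {message}\nSearch query:")
    return llm(prompt).strip()

# Canned stand-in for the real LLM call, so the wiring can be shown end to end.
def fake_llm(prompt):
    return " where do dolphins live "

message = ("We have an essay due tomorrow. We have to write about some animal. "
           "I love penguins. Let's do dolphins. Where do they live for example?")
query = rewrite_query(message, fake_llm)
print(query)
```

The rewritten query, not the raw message, is what goes to Elastic or the vector database.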

Now we have a text generation LLM in your retrieval step, doing that initial query rewriting. We'll take it a few steps further. We're building up to the most advanced uses of LLMs, step by step. What about a question like this: somebody says, compare the financial results of NVIDIA in 2020 versus 2023. You might get this query, and you might be able to find a document that already has the information about NVIDIA's results in 2020 and 2023, but there's a very good chance that you will not find one single document that does this. One way to improve that is to say, actually, to answer this, I need two queries. I need to search for NVIDIA results 2020, and then 2023. These are two separate queries. You get two separate sets of results and documents back. Then you synthesize them and present them to the model. This is what's called multi-query RAG. Another way of looking at it is like this. You have the question, you send it to the language model, and then the language model says, I'm going to hit this data source. This data source can be whatever you want. It can be the web. It can be your Notion. It can be Slack. It can be whatever is relevant for you. Then the information is retrieved. Then you have this answering phase, what's called grounded generation. This is where you present the model with the documents in the context and it's able to answer.
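The multi-query flow can be sketched like this. The query planner and the tiny corpus are canned stand-ins for the LLM call and the search system, and the document contents are placeholders.

```python
def multi_query_rag(question, plan_queries, search):
    # 1) A generation model decides which searches are needed (may be several).
    queries = plan_queries(question)
    # 2) Run each search, then merge the result sets, deduplicating.
    seen, documents = set(), []
    for q in queries:
        for doc in search(q):
            if doc not in seen:
                seen.add(doc)
                documents.append(doc)
    return queries, documents

# Canned stand-ins for the LLM query planner and the search system.
def plan_queries(question):
    return ["NVIDIA financial results 2020", "NVIDIA financial results 2023"]

corpus = {
    "NVIDIA financial results 2020": ["Illustrative doc: NVIDIA FY2020 results."],
    "NVIDIA financial results 2023": ["Illustrative doc: NVIDIA FY2023 results."],
}
queries, docs = multi_query_rag(
    "Compare the financial results of NVIDIA in 2020 versus 2023",
    plan_queries, lambda q: corpus.get(q, []))
print(len(queries), len(docs))
```

The merged document list is then placed in the prompt for the grounded generation step.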

Another optimization for search is what if you build this search system and somebody says, hi, or hello? Do you need to really search your database for the model to answer hi, or hello, or something like that? You really don't. When you're doing this query rewriting, you can build that awareness in the language model to say, I need to search for this, or I need to search for these two queries, or I do not need to make a search to answer this question. Here you're starting to see the model become a switch statement or an if else where the model is starting to be controlling the flow a little bit of how to deal with this. It has a couple of options, so it can either search or it can generate directly without search. As you get more greedy and comfortable with these new primitives of using language models, you start to say, if I can search one source, why can I not search multiple sources? Why can't I give the model the ability to search my Notion, but also my ERP or my CRM, and later models start to build in these capabilities, where the model can route. It's not necessarily tied to one data source.
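That routing can be sketched as a small control-flow wrapper. All three callables here are canned stand-ins for the real query-rewriting model, search system, and generation model; the shape to notice is the model-driven if/else.

```python
def route(message, plan_queries, search, generate):
    # The query-rewriting model doubles as a router: it can emit one query,
    # several queries, or no query at all (e.g. for greetings like "hi").
    queries = plan_queries(message)
    if not queries:
        return generate(message, documents=[])      # answer directly, no search
    documents = [d for q in queries for d in search(q)]
    return generate(message, documents=documents)   # grounded generation

# Canned stand-ins for the three model/system calls.
def plan_queries(message):
    return [] if message.lower().strip(" !.") in {"hi", "hello"} else [message]

def search(query):
    return [f"doc about {query}"]

def generate(message, documents):
    if not documents:
        return "direct answer"
    return f"grounded answer using {len(documents)} document(s)"

a = route("hi", plan_queries, search, generate)
b = route("where do dolphins live", plan_queries, search, generate)
print(a)
print(b)
```

With multiple data sources, `plan_queries` would also return which source to hit for each query.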

This should give you a sense that this is a new way of doing software. It's really different when you think about these things. It can get a little bit more complex. Once you have a question like this, where you say, who are the largest car manufacturers in 2023? Do they each make EVs or not? There's probably not one document on the web, or anywhere, that can answer this question. One thing the LLM can do is to say, first, I will search the web for the largest car manufacturers in 2023. I need one piece of information first. Then after it gets that piece of information, it says, now I need to search, Toyota electric vehicles, Volkswagen electric vehicles, Hyundai electric vehicles. The model is now continuing to control the flow, asking follow-up questions and making follow-up searches by itself before outputting the final result. This is what's called multi-hop RAG, where the model can make multiple jumps, determining, is this enough information or should I go and do something else to produce this final answer? Let's take one last mental jump. When you search the web or when you search Notion, you are invoking a function. You're saying, I need to search Notion, so call this search-Notion function and give it these parameters. If you have models that are capable of doing something like this reliably and accurately enough, what is to prevent you from saying, not only do you search or retrieve information from Notion, how about you post this to Notion? The models are now able to call APIs to actually do things, to inject information, to interact with other systems and make these sorts of interactions.
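A minimal multi-hop loop might look like this. The planner is a canned stand-in for the LLM that decides, at each hop, whether more searches are needed or whether the gathered information is enough.

```python
def multi_hop_rag(question, plan_next, search, max_hops=5):
    # The model loops: do I have enough information, or should I search again?
    gathered = []
    for _ in range(max_hops):
        next_queries = plan_next(question, gathered)  # an LLM call in a real system
        if not next_queries:
            break                                     # enough information gathered
        for q in next_queries:
            gathered.extend(search(q))
    return gathered

# Canned planner: first find the manufacturers, then one follow-up per maker.
def plan_next(question, gathered):
    if not gathered:
        return ["largest car manufacturers 2023"]
    if len(gathered) == 1:
        return [f"{m} electric vehicles" for m in ("Toyota", "Volkswagen", "Hyundai")]
    return []  # done: hand everything to the generation step

def search(query):
    return [f"result for: {query}"]

docs = multi_hop_rag(
    "Who are the largest car manufacturers in 2023? Do they each make EVs?",
    plan_next, search)
print(len(docs))
```

Swapping `search` for any other function call is exactly the step from multi-hop RAG to general tool use.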

Now here you're starting to see this new thing emerge. Now we're beyond what LLMs are. This is a new thing. This is the LLM-backed agent. These are some of the most futuristic things that you can see or think about in LLMs. It's never too early; they will continue to get better. Now you have LLMs at the heart of a piece of software that is able to read and write from multiple sources, that is able to use tools. It can use the Python CLI. It can use calculators. It can use email. It can use other things. Now you have this piece of software that is a little bit more intelligent. It can be connected to different sources. You're starting to see more of what it's able to do. It's an important place to invest, and where things will go in the future. A lot of the agents that you see right now are maybe toys, but they show you the potential. This will take a few months or years to come through, but you can already see it showing its potential in solving problems. It's just this extension: if you made the jump from LLMs to RAG, and then from RAG to multi-step and multi-hop RAG, tool use is the next abstraction, because you're using the same thing. Instead of the search engine being your tool, it's another system or another API or another piece of software.

Then, we also always advocate for citations, if you can have a system that provides citations for which spans in the text refer to which documents. We need to give our users the ability to verify the model's output, to not completely trust it. Citations as a feature are highly recommended in building these systems. On the Cohere side, we've built this series of models, Command R and Command R+. We've released the weights. You can download them, you can run them on your laptop. They're super optimized for RAG and for all of these use cases that we talked about. They're on Hugging Face. You can run them if you have the GPUs; you can probably run them on CPUs as well, but you need a little bit of memory. You can download the quantized models. If you search Cohere on Hugging Face, you can download these models and start playing with all of these ideas of RAG, multi-step, multi-hop, and the final one, tool use. Developers love it. Now you have a shape like this where you have retrieval that has three language models in it: there's an embedding language model, there's a reranking language model, and there is a generation language model. You know why that generation model is in the search step. Then you have this generation step that does grounded generation, hopefully with citations.


Last note on evaluation. We talked about accuracy. This is from one of the chapters of the upcoming book. Let's talk about this accuracy metric of search. You have one query. You have two systems whose quality you want to compare. From each one, you will get three results. Let's say you have this query, and then search system 1 gives you these results, where number one is relevant, two is not relevant, and three is relevant, so two out of three. Search system 2 gives you two non-relevant results and one relevant one. Who says that search system 1 is better? Who says that system 2 is better? Search system 1 got two out of three correct, so that's a two-out-of-three accuracy for this query, while system 2 got one out of three. Search system 1 is actually better. That's one metric of evaluating search, which is accuracy. Another one is like this: what if both of them get you only one relevant result out of the three? The first system placed it right at the top; it said, this is number one. The second system said, this is number three. Who says system 1 is better? Who says system 2 is better? System 1 is better, because the relevant result is ranked higher. There's another span of metrics for that. What this assumes is that you have a test suite for your data. If you're thinking about building RAG systems, or evaluating search systems for your use case, it's good for you to develop something like this internally: you have your documents, you have a set of queries that are relevant for them, and then you have these relevance judgments of, is this document relevant for this query or not, for all of them. This is how you would tend to evaluate at least the retrieval step. For end-to-end RAG, there are other methods. This is one way. You can use language models to actually generate the queries for that. I hope by now you can think about them as just these multi-purpose problem-solving tools.
This builds up to a metric called mean average precision, which is one of the metrics you can use for search that takes both the scores and the ranking into account.
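These metrics can be computed like this, a sketch using 1/0 relevance judgments per ranked result for the two example systems above:

```python
def precision_at_k(relevances, k):
    # relevances: 1/0 judgments for the ranked results of one query.
    return sum(relevances[:k]) / k

def average_precision(relevances):
    # Average of precision@k over the positions where a relevant result appears.
    hits, total = 0, 0.0
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

def mean_average_precision(per_query_relevances):
    # Mean of the per-query average precisions across the whole test suite.
    return (sum(average_precision(r) for r in per_query_relevances)
            / len(per_query_relevances))

system_1 = [1, 0, 0]  # the one relevant document ranked first
system_2 = [0, 0, 1]  # the same relevant document ranked third
p1, p2 = precision_at_k(system_1, 3), precision_at_k(system_2, 3)
ap1, ap2 = average_precision(system_1), average_precision(system_2)
print(round(p1, 3), round(p2, 3))    # plain accuracy cannot tell them apart
print(round(ap1, 3), round(ap2, 3))  # average precision rewards the better ranking
```

Both systems score 1/3 on plain precision, but average precision gives 1.0 to the system that ranked the relevant document first and 1/3 to the one that ranked it third, which matches the intuition above.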


If I want you to take away two things out of this: one, large language models are not this black box of text in, text out. They are a collection of tools within your grasp. We talked about so many of these ways to make them useful: for embedding, for search, for classification, for retrieval, for generation, for query rewriting. Think of them as just this new class of software that you're able to build with. Then there's this new form that they're taking, this new class of software, which is the LLM-backed agent that is able to communicate with external data sources and tools, and use them a little bit more successfully and in successive steps. We have a guide on how to build something that plugs all of these tools together. You can just search RAG chatbot on the blog. We have a resource called LLM.University, where we have about seven or eight modules of lessons, highly accessible, very visual, with videos in a lot of them. It's completely free.

Best Practices and Pitfalls of Building Enterprise Grade RAG Systems

Luu: Given your experience in all these areas and working on RAG and such, it's not easy to build RAG-based systems at an enterprise-grade level. What have you seen as some of the best practices and pitfalls that companies or folks have run into?

Alammar: The earlier you invest, the better, because there is definitely a lot of learning that goes into building these systems. Having people who experiment with these internally and realize very quickly or very early where their failure points are is important, so you can build continually. A few concrete things: yes, embeddings are great. It's probably even better to have a hybrid search system where you're doing both keyword and embedding search at the same time. You shouldn't just rely on one of these methods. Then with all search systems, you can also inject other signals that would be relevant. I really love software testing, let's say unit tests. I think we should do a lot more of that in machine learning. It helps to have a suite of tests that capture the behaviors that are relevant for you, because even when you're using managed language models, their behavior might change with the next version, and the next one. You need to bring in these solid software methodologies: can you catch a regression in one behavior or another? Building those solid software engineering practices into machine learning is something that people in AI should be doing a lot.
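That testing idea can be sketched as a tiny behavioral regression suite. The model call here is a canned stand-in; the checks are the kind of assertions you would keep running as model versions change.

```python
def run_regression_suite(generate, cases):
    # Treat model behaviors like unit tests: fixed prompts with checks that
    # must keep passing when the model (or its managed version) changes.
    failures = []
    for prompt, check in cases:
        if not check(generate(prompt)):
            failures.append(prompt)
    return failures

# Canned model stand-in; in production this is the call to the managed LLM.
def generate(prompt):
    if "dolphins" in prompt.lower():
        return "Dolphins live in oceans."
    return "I don't know."

cases = [
    ("Where do dolphins live?", lambda out: "ocean" in out.lower()),
    ("Where do unicorns live?", lambda out: "don't know" in out.lower()),
]
failures = run_regression_suite(generate, cases)
print(failures)  # an empty list means no behavior regressed
```

Running a suite like this in CI against the live model endpoint is one way to catch a silent behavior change before users do.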

Questions and Answers

Participant 1: The multi-hop RAG, do you think that it's relevant to incorporate that into the first classification call that you talked about, the rewrite plus the skip? Because it sounds very beefy for the model to understand that actually a flow is needed. Would you separate that out? How would you tackle that?

Alammar: In machine learning, most answers are: it depends. If you're building a highly latency-sensitive system, the query rewriting needs to be as fast as it can be. I think you should measure how latency sensitive the use case is, and go from there. If it's highly latency sensitive, you might go with a smaller model for the first step. Then for the multi-step flow, you call it only when you need it. The entire ethos of this is to use the best tool for each thing, and not throw everything at a massive model that can do everything but does it very expensively and needs so many GPUs. Yes, I'm always advocating using the smallest model that is capable of solving your task.

Participant 2: What's your favorite framework? Because there's quite a few frameworks out there at the moment, and I'm going from one to the other to the other making forks everywhere.

Alammar: I'm always trying different frameworks. I cannot necessarily recommend one right now. We're working a little bit with LangChain, because LangChain started early, and it makes it convenient for people to get started. I also like to do a lot of my work in just Python strings, calling the APIs and building those things directly. The frameworks might make it easier for you to download a template, play around with it, and see where it breaks. Don't always assume that it will take you all the way to production, because it will not always do that. As long as it helped you get your hands dirty and play, it's very convenient to learn with. The space is still very volatile; so much development is happening very quickly. I wouldn't say there are major dominant ones. LangChain has made its name, but there continue to be newer frameworks that specialize in different use cases.

Participant 3: I have a question on definitions. One year ago, when I learned about language models, it was quite mentally easy for me to picture that they were trained on text and were able to answer questions, and so on. Nowadays, there are these agents and all that control logic that you were also talking about, about splitting questions and so on. What's your definition of a language model? Does it include all that control logic, and stuff like that? What do you think?

Alammar: That is a great question. Even before large language models came along, a language model was a very well-defined concept in statistics: a system that is able to predict the next word successfully. That is the language modeling objective by which these models are created in the first step. What we found out recently is that if you train a large enough language model on a large enough and clean enough dataset, it captures behaviors that really surprise you. If you've trained it on a dataset that has a lot of factual information, the model will not only pick up language, it will pick up facts. If you train it on a lot of code and give it the right tokenization, it's going to be able to generate code. It acquires these new capabilities based on how you train it, how you clean the data, and what you optimize for. It becomes a general problem-solving piece of software that is able to stretch the imagination and do more than just generate coherent language or text. The nature of language models is fascinating.
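The statistical definition mentioned above, a system that predicts the next word, can be made concrete with the smallest possible language model: a bigram model over a toy corpus. This is my illustration, not anything from the talk; large language models implement the same objective with a neural network, a huge vocabulary, and a long context instead of a single preceding word.

```python
# A minimal bigram language model: predict the most likely next word
# given only the previous word, estimated by counting a toy corpus.
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Most likely next word after `word`, or None if the word is unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

corpus = [
    "the model predicts the next word",
    "the model generates text",
]
model = train_bigram(corpus)
print(predict_next(model, "the"))
```

Here "model" follows "the" twice in the corpus while "next" follows it once, so the model predicts "model". The agents and control logic from the question are a layer built around this next-word predictor, not part of the language model itself.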

Participant 4: You mentioned pitfalls, especially when people are initially playing with these kinds of systems. What's one that you think you would call out that people could keep eyes open before they play? What's the one that you see as most common to people, pitfall or limitation that it would be nice to know about beforehand?

Alammar: Overly trusting the model. This is the price we pay for making models that are so good. Once models stopped being gibberish creators, people started to give them more authority than they should. Their nature is a little different. That trust means companies, or people who are new to building applications with them, can say: ok, I can interact with this system on this website, so let me take that same system and make it a chatbot on my public website to talk to the world. That is over-trusting these probabilistic systems, without guardrails, without the domain knowledge. Cybersecurity gives you a lot of paranoia, because you need to think about the different ways your system can be broken. I think we need a lot more of that in the deployment of large language models.

Participant 5: These models are probabilistic. They're not deterministic. If that's the case, would you ever put them closer to your systems of record with that diagram you showed of them interacting with the world and more APIs and plugins? Would you ever put them closer to systems of record? How long before you think that's possible, or feasible?

Alammar: Yes, it depends on the use case, really. There are a lot of use cases where they are ready now: if you're doing things that are, for example, recommendations to an expert who can spot the failure modes, or if you have the correct guardrails. If it's supposed to output JSON, can you check whether it's valid JSON or not? The more you build in these practices of the domain, yes, they can solve a lot of problems right now. Then, the higher risk the use case, the more care and attention needs to be there, and the more we need to think about humans in the loop and safe deployment options. It depends really on the use case. We know a lot of tools that increase the probability of success. There are things like majority voting, for example: have the model come up with the response three or five times and see whether the responses agree with each other or not. There are tools for improving that. You as the creator have the responsibility of setting that threshold and having enough guardrails to make sure you get the response, the result, and the behavior that your system needs.
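The two guardrails mentioned above, checking that the output is valid JSON and majority voting over several samples, are simple to sketch. The sample strings below are made-up stand-ins for three responses you would actually get by calling your model API several times for the same prompt.

```python
# Sketch of two output guardrails for a probabilistic model:
# 1) validate that the response parses as JSON before acting on it;
# 2) majority voting: sample several responses and only accept an
#    answer that a majority of the samples agree on.
import json
from collections import Counter

def is_valid_json(text: str) -> bool:
    """Guardrail: does the model output parse as JSON at all?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def majority_vote(samples, threshold=0.5):
    """Return the most common answer if it clears the threshold, else None."""
    if not samples:
        return None
    answer, count = Counter(samples).most_common(1)[0]
    return answer if count / len(samples) > threshold else None

# Pretend these are three samples from the model for the same prompt.
samples = ['{"total": 42}', '{"total": 42}', '{"total": 17}']
valid = [s for s in samples if is_valid_json(s)]
print(majority_vote(valid))
```

Returning `None` when no answer clears the threshold is the point: it forces the surrounding system (or a human in the loop) to handle disagreement explicitly instead of silently trusting one sample, and the `threshold` parameter is exactly the creator-set bar the answer describes.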




Recorded at:

May 30, 2024