Transcript
Penchikala: My name is Srini Penchikala. I serve as the lead editor for the AI/ML and Data Engineering team at InfoQ. Generative AI has been getting a lot of attention since GPT-3 was announced a couple of years ago, and especially since GPT-4 and ChatGPT came out earlier this year. With 170 million-plus people using ChatGPT in the first couple of months after its release, it has become the fastest adopted technology compared to any other technology before it. All the big players in the space have been announcing their plans to integrate generative AI into their products. We have OpenAI's ChatGPT. Google also announced Bard. Meta AI, the Facebook company, released LLaMA 1 and LLaMA 2. All of these solutions are based on large language models, which is the main focus of today's discussion. In this live panel discussion, we want to highlight the value and benefits large language model-based applications bring to the table, and what to consider when using LLMs in your own applications. Most importantly, we want to stay away from all the hype out there.
I will ask the panelists to introduce themselves. Please introduce yourself with your name, job title, where you currently work, and your experience with LLM technologies.
Eleti: My name is Atty. I'm an engineer at OpenAI. I'm based in San Francisco. I work on the API Development Team at OpenAI. We work on the API product, exposing large language model APIs to developers through endpoints, developer documentation, and dashboards. Before joining OpenAI about a year ago, I came from an application development background, building backend and frontend web services. I don't have any AI/ML background, historically. I've learned a lot in the last year, especially since ChatGPT's launch.
Teoh: My name is Mathew. I work as a machine learning engineer at LinkedIn. As far as work with large language models goes, before ChatGPT came out, there were stacked transformer models like BERT, which tend to power a lot of ranking models across the industry. I've worked on that, as well as, now, more of this text-in, text-out work across LinkedIn.
Low: I'm Montana. I'm the CEO and Co-founder of PostgresML, which is a machine learning database, and more than a vector database. Historically, I was at Instacart for quite a while, where I helped build the whole machine learning stack. Prior to that, I was an application engineer with a focus on NLP and machine learning applications.
HP: I'm Nischal. I'm the vice president of data science at Scoutbee. We're based out of Berlin, in Germany. I've been working with language models since the whole RNN thing came along, when we were building stuff with Keras back in the day, and through the BERT evolution, when those models were considered large, although they're nowhere near as big as ChatGPT and the like anymore. I've been in the space for over a decade now. We've been using large language models and generative AI to build and enhance our product offering for a while now.
Generative AI and LLMs vs. Traditional AI Techniques and ML Models
Penchikala: What is generative AI, and how is it different from traditional AI techniques? What is a large language model, and how are these models different from traditional machine learning models?
Eleti: I'll give my perspective from a product engineering background, and maybe the ML experts can give a more ML-oriented one as well. The way I view it as a product engineer is that generative AI is a class of machine learning models that are very good at generating high quality output and are often promptable using natural language. Folks who have played with ChatGPT are familiar with asking it open-ended questions: "Write a poem, solve this problem for me, explain this concept to me." ChatGPT is surprisingly great at answering the question. Under the hood, it's not computing the full answer by looking things up in some knowledge database. Instead, it's really generating the output one word at a time, or one token at a time. With especially large and sophisticated models, the quality of that output, even though you're just generating one token at a time, is satisfactory. Similarly, you can extend the analogy to other domains. Ask an image generation model for an avocado sitting on a chair, and it slowly generates a very realistic image of that. All generative models do similar things. That's the 10,000-foot view of what generative AI is: a class of machine learning models that generate output of high quality based on prompting.
As for LLMs, large language models: language models are essentially what run ChatGPT; they're machine learning models. As I was describing, given a prompt or a prefix string, they generate the suffix one word at a time. Large language models are language models that are just extremely large in their parameter count. They've been trained on large amounts of data, they take up a lot of space, and they run on large sets of computers working together. These models tend to be more powerful, and certainly more computationally intensive to run as well. Then finally, ChatGPT is, of course, really just a product that wraps a large language model, in OpenAI's case GPT-3.5 and GPT-4, to answer questions and try to become everyone's easy-to-access personal assistant.
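To make "one token at a time" concrete, here is a minimal sketch of greedy autoregressive decoding. It uses the small open GPT-2 model from the Hugging Face transformers library purely for illustration; production LLMs work the same way conceptually, just at far larger scale.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(20):                                    # generate 20 tokens, one at a time
    with torch.no_grad():
        logits = model(input_ids).logits               # scores for every possible next token
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedily pick the most likely one
    input_ids = torch.cat([input_ids, next_token], dim=-1)       # append it and repeat

print(tokenizer.decode(input_ids[0]))
```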
Low: I think it's interesting, for historical context, that deep learning became LLMs with just a few papers; it was only a couple of tricks, like "Attention Is All You Need," if you want to get into the history of it. There's a whole family here, not just large language models, which are natural language specific, but image generation models, like we saw with DALL·E early on, and Midjourney and Stable Diffusion now. Then there are also audio models that can generate sound. These are different constructions, but they use very similar underlying techniques to perform all of these various tasks. Even within more classical deep learning and LLMs, there are more task-specific models that do things like summarization or classification. With LLMs, as they get larger, they seem to be generalizing and becoming more generally shapeable. The really interesting phenomenon, I think, is that this lowers the barrier to entry: you don't need to know about all of these specialized, nuanced models. You can just go straight to the generalist and ask it for help.
Teoh: When we went from ML models that were not deep learning to ML models that were deep learning, a certain amount of having to pick features got abstracted away. The jump to LLMs seems almost analogous: now you don't really need to know as much about any of the deep learning, or even much about the training data. A lot of that gets abstracted away into something that becomes easier to use. To me, this seems similar to the jump from non-deep learning ML to the deep learning that we've benefited from over the last couple of years.
LLM Architectures
Penchikala: Let's get into the LLM architectures themselves. If someone wants to use LLMs in their apps, can you highlight some of the architectures LLMs bring to the table? They are definitely very powerful. How can developers leverage the power and value of these LLMs without getting bogged down with all the other infrastructure-related overhead?
Low: As Atty mentioned, these LLMs can be quite large. We only have speculation about the true size of GPT-3.5 and GPT-4. We do know that there are LLMs like Falcon 180B for which you need several hundred gigabytes of GPU RAM just to load the model. You can get a commodity server in GCP, or AWS, or Azure that will run you $25,000 a month to run a model that large. That's out of reach even for many companies. I think services like Anthropic's Claude or OpenAI's ChatGPT, where they host this thing and charge you per token so that you don't have to run the full hardware yourself, are really beneficial for some of these larger models. Although we do see a lot of the local LLaMA community on Reddit trying to run these things on their GPUs, or even on their MacBook Pros, and you can get pretty far with that. We see that when you're doing inference rather than training, when you're using these things rather than building them, you can quantize the model, which means you shrink it, sometimes by a factor of 8x or more. Then you might only need 64 gigabytes of RAM in a desktop computer, or a normal off-the-rack server, and it's much more affordable. Once you get into that realm, then you can talk about hosting these things on-premises.
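As a rough sketch of what quantized, self-hosted inference looks like in practice, the snippet below loads an open model in 8-bit precision. It assumes the Hugging Face transformers and bitsandbytes libraries, and uses tiiuae/falcon-7b purely as an illustrative model name; 8-bit weights cut memory roughly 4x versus full fp32 precision, and 4-bit formats push that further, at some cost in quality.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "tiiuae/falcon-7b"   # illustrative open model; swap in whatever you can host

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # shrink weights to 8-bit at load time
    device_map="auto",             # spread layers across whatever GPU/CPU memory is available
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```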
There are different advantages, pros and cons. Fully managed services are wonderful from an operational perspective, because you don't have to worry about managing the ever-changing Python dependencies and whatnot. At the same time, there's value in having full control over not just the LLM, but all the other models involved in an architecture, and getting to choose exactly how you configure those things. There are latency improvements, there are cost-benefit improvements, and there can be quality improvements if you run all of these things inside your own data center. I think it really depends on how much you want to invest in your internal team and tooling to manage these things, and how complicated your application is going to be. GPT makes a magnificent prototype or proof of concept, so just start there. Then the question is, do you have cost or latency constraints that will drive you out of that into a more on-prem solution?
Teoh: I can speak to architecture, maybe not so much the model architecture itself, because that's not quite my area of expertise. In terms of the architecture of LLM applications, the stuff I've seen put into practice across the industry, if you look at how people are using LangChain or Microsoft's Semantic Kernel, there are two or three stages. You have some user input, let's assume that it's textual, that comes in. Then there's a first stage that you can do with an LLM, which I think Semantic Kernel calls planning. It takes in your user input, it knows that it has access to some number of plugins, and each of these plugins has its own API. It tries to figure out which plugins are relevant and the parameters it should call them with. Then, typically, once that happens, these plugins get executed, and the output of those plugins goes into some kind of synthesized response. Something like this also happens, I believe, within some of the LangChain agents as well. Semantic Kernel and LangChain are two different open source libraries that developers can use to build more complex LLM applications. Generally, it's these two or three stages: you plan what you want to do based on the query (that's an LLM call), you make your calls to your plugins, and then you create your response based off of that. That's a pattern that I'm beginning to see.
I wanted to point out that side of architecture since it's closer to the applications that people are building; it's something that might be worth considering.
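A bare-bones sketch of that plan-execute-synthesize loop is below. It is not how Semantic Kernel or LangChain literally implement it; the plugin names and the call_llm placeholder are hypothetical, and a real planner prompt would be far more careful.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion API call to whichever provider you use."""
    raise NotImplementedError

# Hypothetical plugins; in practice these would wrap real APIs.
PLUGINS = {
    "weather": lambda city: f"72F and sunny in {city}",
    "calendar": lambda date: f"No meetings on {date}",
}

def answer(user_input: str) -> str:
    # 1. Planning: ask the LLM which plugin to call and with what argument.
    plan = json.loads(call_llm(
        'Reply with JSON {"plugin": <name>, "argument": <value>}, choosing a plugin from '
        f"{list(PLUGINS)} for this request: {user_input}"
    ))
    # 2. Execution: run the selected plugin.
    observation = PLUGINS[plan["plugin"]](plan["argument"])
    # 3. Synthesis: have the LLM write the final answer from the tool output.
    return call_llm(f"User asked: {user_input}\nTool result: {observation}\nWrite the reply.")
```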
Low: We're going deep on architecture for LLM apps. Something relevant to the comment about 38,000 documents is RAG and vector databases. Those become really important when you're dealing with LLMs, because LLMs are not great at facts. They can hallucinate, and they can make things up. A big part of LLM apps and architecture is actually generating a prompt that gives the LLM the right facts, since the prompt is essentially its short-term memory. You use a vector database to store all of these documents so that you can retrieve the ones most relevant to the user's prompt. Then you paste them all together in the prompt. Then the LLM, given this more constrained set of facts, can give you a much more reliable answer. I think that's an important topic. We can go really deep on that, because you may have your vector database recall a dozen different snippets or facts, and then you might use some other model to prune those 12, because LLMs have limited context. You might only want to give it the top two or three most relevant facts. Mathew was mentioning LangChain. You can actually build up very sophisticated pipelines, where model A depends on models B and C and D. You really break down your problem into multiple steps, where GPT may only come into play at the very last, final text generation step. I'm sure Mathew has a lot of examples of classical ML and natural language processing techniques that we use for things like search and recommendation systems. There are all these little intermediate steps that we can solve pretty well with classical ML. It's not until we want a very creative or thoughtful response generation that we go for the full-powered LLM.
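As a minimal sketch of that retrieve-then-prompt step, the snippet below scores stored snippets by cosine similarity against the question and pastes only the top few into the prompt. The embed function is a placeholder for whatever embedding model you use, and the document list is invented for illustration.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a call to an embedding model."""
    raise NotImplementedError

documents = ["Snippet about refunds...", "Snippet about shipping...", "Snippet about warranties..."]
doc_vectors = np.stack([embed(d) for d in documents])

def build_prompt(question: str, top_k: int = 3) -> str:
    q = embed(question)
    # Cosine similarity between the question and every stored snippet.
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    best = np.argsort(scores)[::-1][:top_k]            # keep only the most relevant facts
    context = "\n".join(documents[i] for i in best)
    return f"Answer using only these facts:\n{context}\n\nQuestion: {question}"
```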
Vector Databases and LLM-Based Innovations
Penchikala: I know a couple of times you mentioned vector databases. What are the panel's thoughts on the use of knowledge graphs in combination with LLMs and vector embeddings? I want to use that question as context to briefly discuss vector databases and what's happening there. How are they enabling LLM-based innovations?
HP: We've been using knowledge graphs for a while, from before the LLMs kicked in as well. The way we see it right now is that data is going to be distributed across different datastores. They all have their own powers that they bring in making LLM applications not just coherent, conversation-based applications, but also meaningful ones. As was rightly said earlier, generative AI can hallucinate a lot, because it might not entirely understand the conversation and might not understand the facts it needs to bring to the table. That essentially pushes developers to also build guardrails, in terms of what the prompts can do, what the agents can do, and when to pick up facts and where to go for them. At that point, you can use knowledge graphs as a way to ground your conversation. If you understand the semantics of it, and you've built that ontology into your business domain and graph, then you can use the knowledge graph as a store for picking data up, understanding the conversation, and having an interaction. The same thing goes for vector databases. You have a lot of documents, and you want to ask questions of these documents, or you want to understand more. That's where vector databases play a huge role: the documents that you store have been run through an embedding model of sorts. Essentially, you're limiting the blast radius for the generative AI or large language model so it can make sense of the conversation and give you meaningful results. I think in a full-fledged application that's using generative AI or large language models, we'll see all the databases coming into play, from relational databases, to knowledge graphs, to vector stores. They will all serve a purpose in the larger chains that will be built with different tools and technologies.
Low: Vector databases are new in the zeitgeist, but they're actually pretty old technology; at this point the technique is well established. Pretty much every major database has now added vector indexing and operations to its core feature set: Postgres, Redis, Elasticsearch, as well as all of these specialized vector databases that have come onto the market. In my mind, vector indexes are just one more index type, like the B-tree indexes we have for numeric data, or the keyword indexes we have for search data. I think of it as just one more tool in the toolkit.
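To illustrate the "one more index type" point, here is a hedged sketch using Postgres with the pgvector extension from Python. The connection string, table layout, and 384-dimension embedding size are assumptions for illustration; the point is that the vector column and index sit right next to the ordinary B-tree primary key.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")   # hypothetical connection details
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS docs (
        id bigserial PRIMARY KEY,      -- classic B-tree index
        body text,
        embedding vector(384)          -- one more indexable column type
    )
""")
cur.execute(
    "CREATE INDEX IF NOT EXISTS docs_embedding_idx "
    "ON docs USING ivfflat (embedding vector_cosine_ops)"
)

# Nearest-neighbour lookup by cosine distance; the query vector is passed in its text form.
query_vec = "[" + ",".join(["0.0"] * 384) + "]"
cur.execute("SELECT body FROM docs ORDER BY embedding <=> %s::vector LIMIT 3", (query_vec,))
print(cur.fetchall())
```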
Best Practices to Evaluate LLMs
Penchikala: Based on your experience, what have you seen as a better practice or process to evaluate LLMs and use them in your apps? How is evaluating LLMs for use in your application different from evaluating any other machine learning framework?
Teoh: Evaluation is definitely one of the more underrated parts of building your LLM app. I think it's very easy to be hyper-focused on what are the right things to put in my prompt, what are the right things to retrieve to put into the prompt before sending it to any of these LLM APIs. Definitely, the ability to evaluate, and to do so quickly and cheaply so that it points you to where you want to improve next, is key. Just like with most evaluation, you can split it into two categories: quantitative evaluation and qualitative evaluation. Quantitative evaluation tends to scale better with more instances that you'd like to evaluate on. Qualitative evaluation can sometimes point you in directions that numbers might not be able to. As a simple example, if you're trying to build an app that can answer questions over several documents that you have, you might have some notion of what it means for an answer to be good quality. I'm thinking of extremely rough signals that are not always guaranteed, but things that may be important to you as somebody who's evaluating it. For example, how often are your answers referring to entities in these documents?
This is all getting around to the point that some of these quantitative metrics are not perfect, but they can at least point out red flags in your app's ability to produce responses before they make it to the end user. One thing that I've seen people have some success with is collecting this rough set of signals, which are maybe imperfect but go a long way toward telling you what's going on in these responses, and counting the occurrences of these across some known test set. When it comes to qualitative evaluation, if you can send responses to a diverse range of evaluators who can give their feedback, that's the best way to have your bases covered. Generally, that's very case by case.
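A hedged sketch of that signal-counting idea is below. The test set, the expected entities, and the specific checks are all invented for illustration; the pattern is simply tallying cheap heuristics over known question/answer pairs.

```python
test_set = [
    {"question": "What is the refund window?", "answer_from_app": "Refunds are allowed within 30 days."},
    {"question": "Who signed the contract?", "answer_from_app": "I'm not sure."},
]

expected_entities = ["30 days", "Acme Corp"]   # entities a good answer is likely to mention

signals = {
    "mentions_known_entity": lambda a: any(e.lower() in a.lower() for e in expected_entities),
    "admits_uncertainty": lambda a: "not sure" in a.lower() or "cannot" in a.lower(),
    "too_short": lambda a: len(a.split()) < 4,
}

counts = {name: 0 for name in signals}
for row in test_set:
    for name, check in signals.items():
        counts[name] += check(row["answer_from_app"])

print(counts)   # {'mentions_known_entity': 1, 'admits_uncertainty': 1, 'too_short': 1}
```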
Low: As far as quantitative evaluation goes, there are good standard, textbook algorithms to evaluate a lot of these things, provided as libraries. Something that we've been doing in classical machine learning for a decade now is using the human interaction with your system as the actual evaluation criterion for how good the model is. Mathew can talk about this with search results. The ultimate quality of your search results is based on, do people click on the number one item, or do they have to go down to the tenth item? Every interaction with our machine learning models can be recorded, and then we can systematically tweak and try new things. We can put a new version of the model out there and see if it improves the core business metric, because at the end of the day, that's what we care about. That still works with things like chatbots, where you're optimizing for, maybe, session time. If the user asks one question and then stops, but comes back 30 minutes later, asks another question and stops, and every time they only have to ask one thing, that's probably a very good model, as long as they keep coming back. If they ask one thing, then have to ask 10 more questions, and then never come back, that's probably a bad model. We can use these more intuitive signals. There's also what's called reinforcement learning with human feedback gaining popularity, even for the largest LLMs, to help fine-tune them and make them better in a particular domain or setting. I think it's really important that evaluation actually improves the model over time, not just tells us where it is now.
Teoh: Yes, of course, this can apply to LLMs, to your search and retrieval stack, or even beyond. The quantitative methods for improving your product, everything that comes with A/B testing, everything that people have done before LLMs existed, all of that is still applicable when running an app that uses LLMs. Depending on the scale you're serving at, you may have to do your standard A/B testing and power calculations to see what it takes to reach significance. A lot of the traditional methods that already exist are still relevant today.
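As a small reminder of how standard that tooling is, here is a sketch of checking whether one variant's thumbs-up rate beats another's with a chi-squared test. The counts are invented; in practice the variants might differ by prompt, model version, or retrieval setup.

```python
from scipy.stats import chi2_contingency

#             thumbs-up  thumbs-down
variant_a = [420, 580]   # 1,000 sessions on version A
variant_b = [465, 535]   # 1,000 sessions on version B

chi2, p_value, dof, expected = chi2_contingency([variant_a, variant_b])
print(f"p-value: {p_value:.4f}")   # a small p-value means the difference is unlikely to be noise
```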
Prompt Engineering with LLMs
Penchikala: We can move on to the prompt engineering topic. It's definitely one of the important parts of making the best use of LLMs. There are a few different techniques, like zero-shot prompting, few-shot prompting, and chain-of-thought prompting. Can you talk about those?
Eleti: Prompt engineering is a fancy term for figuring out what input you can give to the model in order to get higher quality outputs. Prompts are the prefix that you provide to the language model, which conditions the suffix that the model produces. People have found all of this entirely emergently, playing with language models over time. Initially, people would ask simple questions like, tell me a story, or, what's the capital of France? The model sometimes makes up responses or makes up facts, or does not answer the question correctly. Frequently, you'll find the model is bad at math if you ask it something like, what are the first N Fibonacci numbers? What people have found experimentally is that showing the model some examples, or few-shot prompting, helps the model do a better job of answering the question, and also a better job of following the pattern that you want it to follow. Let's say you wanted to use the model to answer a multiple-choice question. Normally, you might say, what's the capital of France, here are a few options, and the model might respond in a sentence: the capital of France is Paris. If you want it to select between A, B, C, and D, showing the model two or three examples of, when given this question, answer this way, conditions the model to be more likely to follow that format. That's what few-shot prompting is. Zero-shot prompting is the reduced version of that, where you don't give it any examples. You just ask the question and get the answer back.
Then, finally, chain-of-thought prompting is this interesting technique where you tell the model to vocalize its thoughts first, to explain what action it's going to take, and then take the action. The clever bit, essentially, is that when the model vocalizes its thoughts and says, "To answer this question, I'm going to first think about what the capital of a country means," or something like that, it increases the probability that the next token, or the next word it actually predicts to answer the question, is correct. Chain-of-thought prompting is essentially telling the model in the prompt: please explain your thinking first, and then answer the question. That empirically seems to provide much better results.
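To make the three styles concrete, here are illustrative prompt strings only; the exact wording is an assumption, not a recipe from the panel.

```python
# Zero-shot: just the question, no examples.
zero_shot = (
    "What is the capital of France? Answer with A, B, C, or D.\n"
    "A) Lyon  B) Paris  C) Nice  D) Lille"
)

# Few-shot: two worked examples condition the model to answer with a single letter.
few_shot = """Q: What is the capital of Japan? A) Osaka B) Kyoto C) Tokyo D) Nagoya
A: C

Q: What is the capital of Canada? A) Ottawa B) Toronto C) Vancouver D) Montreal
A: A

Q: What is the capital of France? A) Lyon B) Paris C) Nice D) Lille
A:"""

# Chain-of-thought: ask the model to explain its reasoning before committing to an answer.
chain_of_thought = (
    "Q: What is the capital of France? A) Lyon B) Paris C) Nice D) Lille\n"
    "Explain your reasoning step by step first, then give the final answer as a single letter."
)
```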
HP: Prompt engineering is a bit tricky, especially if you're working with APIs provided by certain cloud providers: when the version switches, the prompts won't work. One thing that we've seen, irrespective of zero-shot, few-shot, and chain-of-thought, is the importance of understanding the domain in which you want to operate. As part of prompt engineering, something that's very important is to understand what problem you're trying to solve. If you cannot state in simple terms the problem you wish to solve, writing a good prompt for it can be very challenging. If you realize the problem is too complex, then the easiest thing to do is to apply fundamental programming principles: divide and conquer. Basically, you have to divide your problem into smaller prompts and provide them context, and eventually, with more usage, prompt engineering starts to become a little easier, and also explainable in terms of the results that you get. Based on that performance, which Mathew and Montana were talking about, when you're evaluating, you're not only evaluating how users use the application and what the large language models are suggesting, you also get a good understanding of whether your prompt engineering works. Now you have another machine to fine-tune, apart from your machine learning models and the application and user experience, and that becomes a continuous loop for prompt engineering itself. Recording your entire conversation and logging everything that happens in your application is very valuable, so that you can go back and see the prompts that generated the outputs and what value they generated for the customer. Even if it's hard, please log everything.
Eleti: One thing to add is actually combining the previous topic with prompt engineering. There is a set of LLMOps tools that provide you with a handy interface for testing different versions of your prompt. You might even run an A/B test across three or four prompts, use them in production, and run evaluations on the outputs that each of those prompts is generating, to give you either quantitative or qualitative scores: this prompt did better on factuality, or this prompt produced a better sounding response. It's almost like an experimental optimization game where you try different prompts, you version them, you check the evaluations. That's how you marginally improve your product over time as well.
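A rough sketch of that versioned-prompt loop is below. The prompt templates are invented, and call_llm and score_factuality are placeholders standing in for your API call and whatever evaluator you trust (a human rater, a rubric, or another model).

```python
PROMPT_VERSIONS = {
    "v1": "Answer the question using only the provided context.\n{context}\nQ: {question}",
    "v2": "You are a careful assistant. Cite the context snippet you used.\n{context}\nQ: {question}",
}

def call_llm(prompt: str) -> str:
    raise NotImplementedError     # placeholder for the model API call

def score_factuality(answer: str, reference: str) -> float:
    raise NotImplementedError     # placeholder for your evaluation of choice

def evaluate(eval_set):
    """eval_set rows look like {"context": ..., "question": ..., "reference": ...}."""
    results = {}
    for version, template in PROMPT_VERSIONS.items():
        scores = [
            score_factuality(call_llm(template.format(**row)), row["reference"])
            for row in eval_set
        ]
        results[version] = sum(scores) / len(scores)
    return results   # e.g. {"v1": 0.71, "v2": 0.78}
```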
Prompt Engineering, Machine Learning, and LLMOps
Penchikala: With this year's advancements in generative AI, the public's attention turned more to the prompt engineering side and somewhat stopped discussing the learning aspect, which is machine learning. Machine learning for generative AI cannot be easily done. How do you build learning and continuously improving LLMOps into a system that uses a pre-trained, ChatGPT-like model? I think he's basically asking, how do we get the best of both prompt engineering and the actual machine learning side of LLMs?
HP: I think it's going to be a bit challenging. People are still talking about machine learning with generative AI; it's just that prompt engineering has a lot of focus because a lot of people have the opportunity to experiment quickly and almost for free. As we speak right now, large language models, as was brought up earlier in the panel discussion, are still quite big. They're still very expensive. For running a lot of experiments, they make sense as they are. Once a lot of companies start taking this into production, the machine learning part will become very interesting, where people try to reduce the size of the model without losing its performance. There's a lot of optimization work going on there. There's also Google, which is now running a machine unlearning conference as well, the first of its kind, with all the privacy conversations happening: how do you keep your model performing well if you teach it to unlearn, if you remove a dataset? That's, I think, the first part of your question. There's still a lot of conversation around machine learning going on. Prompt engineering is what's on the surface, but there are a lot of hacks happening underneath.
The second part of your question: what can we do with LLMOps to improve large language models? I think it depends on the problem you're trying to solve. You could replay the conversations, which Mathew and Montana were suggesting in terms of RLHF. You could take an entire scenario that was played out by a user, then have another, expert user evaluate the performance of the system: a user asked this question, the system gave this answer, is this the right answer or not? Based on that, you could use it as a training dataset for fine-tuning your model, or for zero-shot or few-shot prompting. The space is still growing. I think there are very few companies that do LLMOps very well. There are a lot of other companies trying to apply the same machine learning operations lifecycle to LLMOps and take it forward. We'll see more tooling come up in this space, and more standards being discussed. I think it's just emerging as we speak. Typically, all the things that we did before large language models came out still apply.
Low: Nischal, you alluded to this a little bit in your answer, and I think Atty did too: problem decomposition. If you ask ChatGPT to just write you a story, it'll probably write you a pretty basic story, but if you say, give me the outline for a story instead, it can give you an outline. Then you ask it to develop the major characters. Then you ask it to develop the climax. Then you put all of that together and say, write chapter 1 of this story. You can keep breaking that problem down into smaller chunks, and for some of those much smaller chunks, you can use more classical machine learning, or just algorithmic approaches, to solve them. Another place where more traditional machine learning applies is when you're doing prompt engineering: Retrieval Augmented Generation is a pretty important technique. This is where you use your vector database to pull certain facts or similar information into the prompt to give the model context. Normal vector databases use very simple similarity metrics between vectors to find the nearest neighbors, but you can actually build a classical machine learning model to find the most relevant vectors given the topic. That can be a classical tree-based model, and it can actually give you much better relevance than the simple algebraic cosine similarity that your vector database has. The more you unpack these problems, and the more you break them down into little components, the more classical machine learning starts to rise back to the surface.
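Here is a minimal sketch of that decomposition idea, with each step as its own LLM call whose output feeds the next. The call_llm function is a placeholder, and the prompts are illustrative only.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError   # placeholder for whichever model or API you use

def write_chapter_one(premise: str) -> str:
    # Step 1: ask only for an outline, a much easier task than "write a story".
    outline = call_llm(f"Write a chapter-by-chapter outline for a story about: {premise}")
    # Step 2: develop the major characters from that outline.
    characters = call_llm(f"Given this outline, develop the major characters:\n{outline}")
    # Step 3: only now ask for prose, with the earlier outputs pasted in as context.
    return call_llm(
        "Write chapter 1 of the story.\n"
        f"Outline:\n{outline}\n\nCharacters:\n{characters}"
    )
```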
Penchikala: That is how we usually do it. Montana, going back to your example, if a human author is writing a story, they will start with an outline first, then build the characters, and then work on each chapter, one at a time. It's the same idea.
What is Retrieval Augmented Generation, RAG, and How is it Used?
I know one subset of prompt engineering is Retrieval Augmented Generation, or RAG. What is RAG, and how is it used?
Teoh: The idea is that you can make your responses more tailored to whatever your input query is if you retrieve relevant pieces of text to be included in your prompt. One thing I wanted to call out is that in a lot of the classic examples people use to explain this, if you look on the web, it's this vector DB lookup. You have a bunch of documents, each of them is embedded in some space, your query comes in, it's also embedded in the same space, and then you find the closest documents in that latent space according to those embeddings. What I wanted to call out is that retrieval need not stop there. One of the themes that I'm seeing pop up just from this discussion so far is that there's a lot of old tech that still applies here. Coming from a search background, this part of retrieval augmentation screams search to me. You can think of the vector DB lookup as one simple instance of search, but search goes back a very long way. If you think about some of the different components of a search system, you have your query understanding, query rewriting, and then the actual retrieval from your search index. If you have a system like that already, probably not running off your laptop, but if the business that you work for already has some search engine, then by all means, that's a good thing to leverage for your RAG app. That's the thing I wanted to call out: retrieval can go much beyond the vector DB lookup and can also involve old tech. If you have that available to you, it's to your advantage to use it.
Low: I just want to plus-one that, Mathew, because at PostgresML we actually see that regular keyword lookups are as useful as vector lookups. Or just finding things in a B-tree index that are numerically similar, for example, filtering by user ID. That's huge, but it's something that every database has been doing for a decade.
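A small sketch of that hybrid idea follows: filter by a plain attribute first, then blend an old-fashioned keyword-overlap score with a vector similarity score. The scoring functions, field names, and weights are all illustrative, and a real system would use a proper search engine or database for each step.

```python
import numpy as np

def keyword_score(query: str, doc_text: str) -> float:
    """Crude keyword overlap: fraction of document words that appear in the query."""
    terms = set(query.lower().split())
    words = doc_text.lower().split()
    return sum(w in terms for w in words) / (len(words) or 1)

def vector_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Cosine similarity between query and document embeddings."""
    return float(query_vec @ doc_vec / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)))

def retrieve(query, query_vec, docs, user_id, top_k=3, alpha=0.5):
    candidates = [d for d in docs if d["user_id"] == user_id]    # the boring, decades-old filter
    scored = [
        (alpha * keyword_score(query, d["body"]) +
         (1 - alpha) * vector_score(query_vec, d["embedding"]), d)
        for d in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]
```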
Maximizing Benefits of LLM Tech, but Minimizing Risk
Penchikala: Again, just like anything else, power comes with responsibility. Can you talk about responsible LLMs, just like the responsible AI we've been talking about? How can we maximize the benefits of using LLM technologies, while minimizing the risks that come with them?
HP: Given that I'm in Europe, and there are a lot of conversations happening around regulation here, I can probably kickstart this. We work in the enterprise space, where our customers are also big enterprises. They have a lot of concerns around the privacy of data, irrespective of the geographic zone they are in. We talk to customers in North America. We talk to customers here in Europe. In Europe, it's a little more prevalent for customers to be aware of who is processing the data. Where is this data stored? How is this AI model trained? There are a lot of conversations happening around the AI Act that's coming up here in Europe, in terms of what AI regulations need to be in place. It's still very challenging to answer this question. There is somebody asking about when you would run your own LLMs versus something that's there in the cloud. Partially, we're being forced, due to the AI regulation, to bank more on open source large language models, because you control the entire infrastructure and you're hosting it. That comes with a whole lot of responsibility around managing the infrastructure, keeping the APIs up and running, and paying a lot of money as well. If you want to manage the entire thing, you'll have to pay way more than you would if you were to make use of Anthropic's, Cohere's, or OpenAI's APIs. There are also a lot of questions arising around how we fine-tune these large language models. We try to be upfront, because we are running a multi-tenant system. In general, what we see is that customers are a lot happier making use of systems that are completely controlled by the software provider that provides them the service. I think there will still be a lot more to regulate, especially around biases, around whether the conversation a user is having is safe, and why the model says what it does. Can we explain why the model is choosing to say something? These are some of the topics that are being discussed, actually very passionately, here in Europe as part of the AI Regulation Act.
Eleti: I think Nischal covered a lot of the privacy and confidentiality concerns that enterprises have. There are also safety and responsibility as different dimensions to it. From an LLM provider's perspective, we think about safety and responsibility across all the use cases: what could people use our models for, and where could things go wrong? Ultimately, it's rooted in real-world impact. You could use large language models to efficiently generate political misinformation. You could, with clever prompting, bypass some of the training of the model and get it to generate problematic content. Our goal is basically to build these tools and democratize them in a way that people reap the benefits and the world gains productivity, without enabling misuse. A lot of effort from our safety and technical teams goes into training the model to refuse certain use cases, ultimately rooted in short-term real-world impact. Then there are things that behoove application developers as well when developing and deploying LLM apps into production. As a quick example, let's say you're Snapchat, building a product like My AI; your product is being used by young children and teenagers around the world. They might be using it for homework. They might be using it to answer questions. They may also be asking it for emotional advice. It's important that your application is supportive and responds in an appropriate manner for that use case. Use these techniques, fine-tuning and prompt engineering, to ensure that the tone and tenor of your product is in line with what your audience expects.
The other thing that application developers need to keep in mind is that there is a growing trend of giving LLM-based applications more agency to do actions, whether it's calling your APIs or taking actions on behalf of the user. I think retrieving information is a relatively low-side-effect operation: getting information and responding. If you're taking actions with side effects in the world, let's say your conversational bot can create objects in your database or click buttons in your UI, then some emerging UI patterns here are interesting. One is the drafting technique. In your Gmail, if it's writing out text, it's not just writing the email and sending it out for you, it's drafting the email and then letting you edit and submit before you send it. That's one technique. In general, asking users for confirmation before you take any constructive actions is another technique to keep in mind. Certainly, for folks building the models, there's a lot of general-purpose safety work. Then for folks deploying the models into production applications, there's usage-specific safety work that's worth building in as well.
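As a tiny sketch of the draft-then-confirm pattern, the snippet below keeps the side-effecting call behind an explicit user approval. The function names and the email scenario are illustrative, not from the panel.

```python
def send_email(body: str) -> None:
    raise NotImplementedError   # placeholder for your real mail integration

def execute_with_confirmation(llm_draft: str) -> None:
    print("The assistant proposes to send this email:\n")
    print(llm_draft)
    if input("\nSend it? [y/N] ").strip().lower() == "y":
        send_email(llm_draft)          # the only line with a real-world side effect
    else:
        print("Cancelled; nothing was sent.")
```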
How to Deploy LLM Apps Safely and Securely
Penchikala: Any comments on how to deploy LLM apps safely and securely, and how to make sure they are being used in the intended way?
Low: This is a super critical topic for all of us. At the end of the day, you have open source models being developed by nation states that anybody can run. While it is important that we make all these considerations for our applications, we have to contend with the reality that there will be increased misinformation out there, and this is going to be a societal problem. People are going to become more efficient in their work, and that will displace other people from jobs. We need to contend with this at a societal level. There have already been massive efficiency gains, and we're already dealing with the economic realities of that. There have already been massive misinformation campaigns where state actors employ tens of thousands of people to generate misinformation. It's already happening at scale. I don't think we need to be too alarmist because, again, it is already happening; it's just less efficient and more expensive. As we see technology making these processes more efficient, we have to look at what the good guys can do and how they can use this technology to combat some of this misinformation. It's a competitive game, in a way, where you never get to stop. You've got to keep running forward. I'm very hesitant when I hear people talk about regulation, because the bad guys are going to do it anyway. If the good guys have their hands tied, then it's a losing game.
Penchikala: Innovation and enforcement.
Testing Vector, Knowledge Graphs, and LLM Pipelines in the Cloud
What are examples of good cloud providers that give us space to best test a vector/knowledge graph/LLM pipeline? Any comments on that, or any considerations when looking for a cloud provider?
Teoh: I'm not sure of a cloud provider that provides all three. I know each of these things being offered as individual services. Who knows, maybe that's a startup idea.
Low: I'm going to give a quick plug for PostgresML here. PostgresML is really a Postgres database plus pgvector, which provides the vector database, and the PGML extension, which ties things like TensorFlow, PyTorch, and all of these other classical machine learning algorithms into the database. It's great if you love writing SQL. We've got a lot of work to do if you'd rather write JavaScript or Python, but we're building SDKs in that realm as well.
LLMs in the Future (12-Month Prediction)
Penchikala: I would like to request each of you to make an LLM-related prediction that may happen in the next 12 months.
HP: At least from where I stand right now and see the world, a lot of the AI innovation that happened over the last decade touched the B2C markets quite a lot. With LLMs, what we see is traditional markets that were considered laggards, behind the innovation. In the next 12 months, I see a lot of impactful work being done in the enterprise landscape, be it supply chain or manufacturing, all the way from battery optimization to power grids. There's a lot of knowledge, business process, and a ton of data that stays disconnected. With the large language model era that's been ushered in now, I think this side of the industry will transform the most.
Low: I actually expect all of this to get a lot more boring. I think that we've had geniuses walking around the earth for millennia, who've had access to very large sets of data and huge collective action at scale. The capabilities of LLMs, even if they are superhuman, we've seen superhuman feats before. I think that we'll continue to see iterative development, and it'll become more ubiquitous, and more companies will adopt. Just like everybody walks around with the internet in their pocket today, and we don't even think about how magic an iPhone is or an Android phone, I think people will experience the magic of LLMs but completely take it for granted, within the next 12 months.
Teoh: Along the lines of boringness, my prediction is that the exciting stuff will become boring. This is a little different from what Montana was saying. I'm not saying that stuff that is exciting today will drop in excitement. I'm saying that some of the exciting work that will come up will come from areas that were previously boring. A big example here is going through a bunch of customer feedback: nobody wants to go through thousands of these, but if you can feed them through a model, going through them is now certainly worth it. I feel like the cost-benefit analysis of even the most boring things is going to change, and some of those things are going to unlock a ton of value. I think the exciting things will be boring, in a good way.
Eleti: I think from a developer's perspective, what we've seen is that LLMs are great at answering questions and summarizing things based on RAG. What they're not really great at today is structured output. If you want LLMs to really be AI programs, where you give them an input, give them a prompt, and get them to do things in the world, they need to interact with traditional systems. They need to call JSON APIs, or take actions on your computer. I think we'll see a lot of movement there, where LLMs get much better at producing structured output and working with other tools. That's amazing for developers, because you can combine your traditional software engineering, calling functions, programs, and APIs, with this intelligence in a box. That is the new tool in your toolkit. That's my big developer prediction: I think structured output will be big in the next year.
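A hedged sketch of coaxing structured output today: ask for JSON matching a small schema, validate it, and only then let it touch a traditional system. The call_llm function and the calendar scenario are placeholders, not a specific provider's function-calling API.

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError   # placeholder for the model API call

def create_calendar_event(title: str, date: str) -> None:
    print(f"Creating '{title}' on {date}")   # stand-in for a real API call

SCHEMA_HINT = 'Reply with only JSON like {"title": "...", "date": "YYYY-MM-DD"}.'

def handle(user_request: str) -> None:
    raw = call_llm(f"{user_request}\n{SCHEMA_HINT}")
    try:
        payload = json.loads(raw)
        create_calendar_event(payload["title"], payload["date"])
    except (json.JSONDecodeError, KeyError):
        # Structured output still fails sometimes; retry, or fall back to asking the user.
        print("Model did not return valid JSON:", raw)
```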
Then my big consumer prediction is that I think voice is finally here. For the last nine months, we've been chatting with our LLMs. I think soon we'll be talking to them. Of course, voice technology has been around for decades, converting voice to text and text to voice. We're seeing multiple companies make really great progress in both voice synthesis and voice transcription. When you combine that with the missing piece, the brain in between, which is the LLM, you're sort of getting the movie "Her": a voice assistant that you can talk to about your emotions, about your problems, that you can ask questions and get to do things for you. I think a bold bet here, which many companies will make, is an entirely new way of interacting with computers. I'm excited to see how that turns out.