
LLMs in the Real World: Structuring Text with Declarative NLP



Adam Azzam discusses why building machine learning pipelines to extract structured data from unstructured text is a popular problem within an unpopular development lifecycle.


Adam Azzam is AI Product Lead @Prefect

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Azzam: My name is Adam Azzam. I'm the AI product lead at a company called Prefect. Before that, I was the CTO of a company called Openrole, which used small, medium, and large language models to conduct job searches on behalf of folks. I ran Insight Data Science here in New York for a number of years. Aside from that, I'm a recovering academic and a recovering mathematician. If you've ever sat through a math lecture, I apologize in advance: I'm going to bring a lot of those vibes. I'm going to talk a little bit about how folks are using LLMs in the real world, specifically about building NLP pipelines declaratively. I'm going to talk about a bunch of things like this.

Marvin - High Level Python Framework

This is all through the lens of an open source framework that we built called Marvin. Marvin is a high-level Python library for building with LLMs quickly and with less code. I stole that moniker from Django. The question is really, how do you make LLMs an ambient experience, so that you don't have to spend most of your time begging something to work? You can just drop it in an invisible way among your code that actually works, so it's a tool that you can reach for when it's the right tool to use. Of course, Marvin is open source and free to use; you can download it today. It's a product of love from Prefect, the backend for data engineering. If you want to get your workflows, pipelines, and agents to actually work, you can use Prefect today to orchestrate and observe your Python code. A lot of what motivated what we built today predates me. I'm going to use "we" very liberally when it was mostly these two folks: Jeremiah, the founder of Prefect, and Nate Nowack, a senior engineer at Prefect as well. Really, if Prefect is the backend for data engineering, then Slack is how those 25,000 engineers all come and ask us questions all the time. If you've ever built a product yourself, you know that one of the hardest parts is creating personalized documentation and onboarding for folks, especially if your onboarding experience is a bit of a step function. Many folks smack their face right into that first step, leave your product, and don't come back. Being able to smooth out that onboarding experience means bringing personalized documentation and personalized Q&A to somebody who's stuck on something that you did not anticipate anybody would get stuck on. We orchestrate a lot of LLMs to bring personalized Q&A and personalized software documentation to folks onboarding onto Prefect.

LLMs (Large Language Models)

One of the reasons why these talks are so hard is that not only is there a spread of experience, but there's a spread in age. This talk is mostly meme and code driven. I'm not going to go into particularly deep detail about LLMs. I think for the purposes of this talk, it's fine to have the most antagonistic, most basic, most aspersive description you can possibly have of LLMs, which is that they're omniscient autocomplete with a really bad attention span. This isn't really that much of an aspersion so much as a reminder of what's actually going on underneath the hood. Given a collection of tokens, an LLM is really just pretrained to understand what token is most likely to come next. It knows what token comes next because it's read the entire internet, and it does a good job because we gave it treats whenever it gave us a good answer. It turns out that not only are LLMs good at finishing our sentences, they're very good at solving a variety of other problems as well.

One of the first things that came out of early research, when these models started getting large enough to have some emergent properties, is that they can solve a variety of tasks that just aren't autocomplete. For example, you can use them to approximate what 10 or 20 years ago was a mind-numbing process to build. If you were getting your start in ML 10 or 20 years ago, like I did, the way that you would have built that first spam classifier is you would have downloaded hundreds of thousands of emails. You would have meticulously engineered features out of the text and out of the metadata about the email itself. You would have artisanally crafted a model and labeled a lot of data yourself. Then you would have hoped that that model was robust and didn't drift, and that the next generation of spammers didn't find a way of getting around it. Today, you can just essentially say, "Hello, omniscient autocomplete. I got this email. Here's what the email said. Do me a favor, can you tell me if this is spam or not." It'll say, "Yes, of course, the next word is: yes, it's spam." There's this way that we're able to leverage omniscient autocomplete and recast classical tasks so that we can solve ML problems of the previous generation, like classification. It's surprisingly adept at zero-shot classification, which means we don't show it any examples. It's good at zero-shot classification because, in a sense, it's read all of your training data already.
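That recast can be sketched in a few lines. The following is a hypothetical sketch, not any production system: `complete` is a stand-in for a real LLM completion call, stubbed here so the example runs offline with no API key and no training data.

```python
# Recasting spam classification as autocomplete: no features, no labels.
# `complete` is a hypothetical stand-in for an LLM completion call;
# the stub mimics a model answering "yes" or "no".

def complete(prompt: str) -> str:
    # A real implementation would call a model over the network.
    return "yes" if "wire transfer" in prompt.lower() else "no"

def is_spam(email: str) -> bool:
    """Zero-shot classification: the task is phrased as a prompt."""
    prompt = (
        "You are a spam classifier. Answer only 'yes' or 'no'.\n"
        f"Is the following email spam?\n\n{email}"
    )
    return complete(prompt).strip().lower() == "yes"

print(is_spam("URGENT: wire transfer $500 to claim your prize"))  # True
print(is_spam("Lunch at noon tomorrow?"))                         # False
```

The entire feature-engineering and labeling pipeline collapses into the wording of the prompt plus a thin parsing step on the way back out.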

It's also incredibly good at entity extraction. Of course, if you were getting your start trying to do entity extraction 6 or 7 years ago, you would have to hope that spaCy or Gensim could actually spit out whether or not your document even had proper nouns in it. Then you would have to figure out whether or not those proper nouns were actually relevant to the things that you were trying to analyze. If you were one of the folks that in a standup was like, I read a paper on conditional random fields, maybe we should use that, then you had people roll their eyes at you, and you'd put it back on the backlog. Now it's as straightforward as saying, "Omniscient autocomplete, I found this resume. I expect to see this list of schools and this list of experiences out of it. Here is how I expect to see these things come out." It's able to autocomplete an answer and give you what it actually looks like. Of course, I think the thing that sparked the attention of a lot of folks, since maybe last September, is that these two things combine for one more emergent property, which is their ability to choose tools. If you give it a collection of tools, you describe what those tools are, and you give it an objective, it's surprisingly adept at selecting one of those tools. If you expose an actual execution environment to it, and you give it some rules for what to do when it returns a tool call back to you, then you can actually now dynamically generate tasks and execute them.

LLMs, in some sense, have enabled us to plan, generate, and execute tasks with tools. We can generate, deduce, and extract answers. We're going to talk about that in depth. Something we're not going to talk about is how to solve some of that bad attention span problem: an LLM can search semantically over a knowledge store that you keep. Of course, this has increased the surface area of what software we can build. Of course, it's also increased the surface area of what can go wrong. It's not like the last generation of software was easy to build. Now, we have dynamically generated, maybe cyclic, graphs. It was hard enough getting DAGs to run. With LLMs, type safety is really a second-class citizen. If you want to actually integrate an LLM with a strictly typed system, getting it to give you outputs that conform to the type-safety guarantees needed to pass them on to something else is incredibly difficult to get right. It's only really declarative if you beg hard enough. What this adds up to is often the worst developer experience that you can imagine. About 15 years ago, I was solving ODEs with MATLAB, and so for me to still be able to genuinely say this is the worst developer experience, you know I mean it.

How They Work

Of course, how do all of these things actually work? Really, it's this recasting framework, where we take some classical task that we want to achieve, like classification, entity extraction, or tool selection. What we're able to do is recast these as autocomplete problems, hoping that, since autocomplete is more or less solved, we can now use it as a way of getting answers to unsolved problems. As a recovering mathematician, you should know that in math, we don't make that much progress. We only ever really get big results every five or six years. What we spend the five to six years in between doing is trying to reframe every problem to see if it can be solved by the last big thing that somebody did. Likewise, if you want to think about autocomplete as being solved, which it isn't, but if you want to think about it like that, then what you're seeing now is everybody revisiting a lot of intractable problems in NLP, trying to see if we can recast them as autocomplete problems and smacking them over the head with omniscient autocomplete. I'm going to show you how we smack a lot of problems around.

Of course, an LLM at the end of the day, given an initial string of tokens, is really just trained to predict the next token. We can think about a system like ChatGPT pretty literally as: given a system prompt, which might say what the objective of the system is, or might give some style criteria, this omniscient autocomplete completes things as if it has the prior that it's, say, a pirate. Given some conversational history, its objective is to return a message from the system that's most likely to complete this conversation. We can refactor this. What's going on here is that every single system like this really comes in three parts. There's some task that I want to accomplish, here building a conversational assistant. There is some way of shoving it into an autocomplete problem, where I take a single invocation, or what is often multiple invocations, of an LLM in order to get its response. Then it's on me to parse that response back into something that actually solves my original problem.

By refactoring this code, what we actually stumble upon is we stumble upon this recasting framework of, if I have a classical task that I'm trying to get an answer from, one of the most robust ways that we can now solve it is take advantage of the fact that LLMs have solved autocomplete problems and trying to actually do this translation problem in between. The idea now is, can we revisit intractable problems? Can we revisit things that were incredibly hard to write imperatively? Can we translate these problems into an autocomplete problem, and then, essentially, build a declarative experience out of this? A lot of folks have a very rigid definition of what a transpiler is, and I'm not one of those. A transpiler, to me, is anything that bundles my bad code into something more performant, which, of course, is not what a transpiler actually is. If you'll excuse the abuse of this analogy here, what we're really doing is we're transpiling our original task down to the runtime of an LLM. Then we're trying to pull that back into something that solves our original task. In the same way that React Native lets me write terrible JSX, and then actually it transpiles down to something like Swift. Again, not transpiling but I'm going to say that it does. In the same way here is that I get to write terrible code, declarative code, and then I pass it off to an LLM to execute it, and I hope that its autocomplete is actually a way to answer my question.

Text and Data Generation

We've seen four classical problems that LLMs are trying to actually solve and what they're being solved for. We've got text generation or generation of data. We've got classification problems. We've got entity extraction problems. Then we've got choice problems. I'm going to talk about all four of these things, about how people are actually using this in the real world. Let's talk about generation. If you revisit what ChatGPT was, ChatGPT was really just a means of taking some system message and a conversational history, and then outputting the message that a helpful assistant would respond back to me. What if I don't want to build a conversational assistant? Let's say that I want to build something incredibly contrived, like generating a list of fruits. This is a complete nonsense function. If I ran this, it wouldn't actually run or basically return anything. Of course, what I would like to do is have a very typed understanding of what a fruit actually is. If you're not a Python person, Pydantic is basically the modern way people do validation and parsing of classes. This is essentially a data class here. If I define a fruit as a name, a color, and calories, as strings and floats, this is something that a typical data engineer would write to do parsing and validation before anything actually gets committed to a database. Of course, down here is a function that I don't really know how to write. Maybe I would download a list of fruits and then randomly sample them, or some nonsense like this. Beyond that, I have no idea how I would actually write this.
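The Fruit model the talk describes is plain Pydantic; here it is written out concretely (no LLM involved yet) so the parsing-and-validation step is clear. The field names follow the talk; the example values are made up.

```python
from pydantic import BaseModel

# The Fruit data model described in the talk: ordinary Pydantic
# parsing and validation, the kind of guard a data engineer puts
# in front of a database write.

class Fruit(BaseModel):
    name: str
    color: str
    calories: float

# Pydantic coerces and validates before anything hits a database:
# the string "105" becomes the float 105.0, and a non-numeric value
# would raise a validation error instead of being silently committed.
banana = Fruit(name="banana", color="yellow", calories="105")
print(banana.calories)  # 105.0
```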

If I wanted to actually solve this problem, we have a good template from how we thought about ChatGPT: I know that I should probably cast this signature somehow to a prompt, which is transpiling it down to something that my virtual machine of an LLM can actually tackle. Then I'm going to try to use this LLM as a means of generating some answer to the task that I've given it. Then I hope that when I pull back that answer into the realm I care about, I actually get an answer out of this. The problem is that ChatGPT is, in some sense, easy. I feel ridiculous saying that, because it's incredibly not, but it has no rigorous type-safety guarantees: I slam in a bunch of text and I just get a bunch of text spit back out to me. Now I actually have real type guarantees, where it's not ok to just spit back a comma-separated list of fruits. This thing now actually needs to be a list of things that compile and validate against the schema. I now have to be thoughtful in how I cast this to an autocomplete problem. I now need to run this through an autocomplete problem. I need a way of actually pulling back these results.

We've got a couple of different ingredients that we can actually work with here. We've got the name of the function, which was list_fruits; check, that seems pretty relevant to understanding what this function is intending to do. We've got the signature of the function. That's great. I know that I'm passing in an integer n. I've got the docstring of the function: generate a list of n fruits. Then the last piece is I've got this response type. It's the literal type of what I expect to get out of here; it's my data model. What folks often do, and what we do especially in Marvin, is take all of this information and write a prompt behind the scenes. What we can do is say, essentially: hello, omniscient autocomplete LLM, we pay our tithings, what have you. Here's the name of the function. Here's the signature of the function. Here are the bound keyword arguments of the function that were passed to you. Here, the docstring is a description of what you're trying to achieve. Then what we can do, especially in Pydantic, is expose the OpenAPI schema that documents that model, and we can say: we expect you to conform your answer to the following. It turns out that this is surprisingly robust. Part of this is that OpenAI and other large language models have read the internet, which means that they have read enough OpenAPI schemas that this is a pretty consistent way of getting rigorously typed outputs out of your actual functions.
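All of those ingredients (name, signature, bound arguments, docstring, response type) can be pulled out of a Python function with the standard library. The sketch below uses a hypothetical helper, `describe_task`, to show the shape of that transpilation step; Marvin's real prompt templates differ.

```python
import inspect

# A sketch of transpiling a Python function into a prompt.
# `describe_task` is a hypothetical helper; the real framework's
# templates differ, but the ingredients are the same: name,
# signature, bound arguments, docstring, and response type.

def describe_task(fn, *args, **kwargs) -> str:
    sig = inspect.signature(fn)
    bound = sig.bind(*args, **kwargs)  # the bound keyword arguments
    bound.apply_defaults()
    return (
        f"Function name: {fn.__name__}\n"
        f"Signature: {sig}\n"
        f"Arguments: {dict(bound.arguments)}\n"
        f"Objective: {inspect.getdoc(fn)}\n"
        f"Conform your answer to: {sig.return_annotation}"
    )

def list_fruits(n: int) -> list:
    """Generate a list of n fruits."""

prompt = describe_task(list_fruits, 3)
print(prompt)
```

Everything the LLM needs to "autocomplete" an answer is in that string; for a Pydantic response type, the framework would render its JSON schema instead of the bare annotation.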

Marvin is a high-level framework. It's totally open source. I'm trying to build the most Pythonic interface to working with LLMs. What we introduce is the notion of an AI function. An AI function lets you decorate Python functions that you would otherwise never have been able to write yourself. When I execute this function, instead of getting nothing out of it, I now actually get a list of typed objects back. This turns it into an executable. Of course, the way that this works is exactly as I described before, and something we'll see a demo of in the second half of this. Literally, all it's doing is taking the name of the function, finding a representation of its signature, finding a representation of the response type of this function, and then the docstring. Here, of course, I omitted that I'm calling list_fruits with an actual argument of 3. This is the actual output that you get out of it. Of course, this is a codeless function. There is no code being written somewhere that gets brought back to your machine and executed. If you have ever heard the story of when Ford released the first car, probably the second question after "how much" was, "where do the horses go?" One of the great marketing techniques they used for the first car was to call it the horseless carriage, to remind everybody that there aren't horses living in this car making it go. Similarly, here, there's no code being executed. Instead, we're translating this into English. We're sending it to a program that's programmed in English. Then we're getting back its response, and parsing that back into a language like Python. There's no code being executed. This is purely running on the runtime of an LLM.
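A toy version of the AI-function pattern might look like the following. This is not Marvin's implementation: `llm` is a stub that returns canned JSON so the example runs offline, and the decorator simply transpiles the signature into a prompt and parses the completion back into Python, never executing the decorated body.

```python
import functools
import inspect
import json

# A hedged sketch of the "AI function" pattern: the decorator never
# runs the function body. It transpiles the signature to a prompt,
# sends it to the LLM, and parses the completion back into Python.
# `llm` is a stand-in stub so the example runs offline.

def llm(prompt: str) -> str:
    # A real call would go to a model; the stub returns valid JSON.
    return json.dumps(["apple", "banana", "cherry"])

def ai_fn(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = inspect.signature(fn).bind(*args, **kwargs)
        prompt = (
            f"You are the function {fn.__name__}{inspect.signature(fn)}.\n"
            f"Objective: {inspect.getdoc(fn)}\n"
            f"Inputs: {dict(bound.arguments)}\n"
            "Respond with only the JSON return value."
        )
        # Pull the LLM's completion back into typed Python.
        return json.loads(llm(prompt))
    return wrapper

@ai_fn
def list_fruits(n: int) -> list[str]:
    """Generate a list of n fruits."""

print(list_fruits(3))  # stubbed: ['apple', 'banana', 'cherry']
```

The "horseless carriage" point shows up directly: `list_fruits` has no body at all, and nothing generates one; the answer comes entirely from the completion.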

Of course, the way that we did this is we just unmasked this original task as an autocomplete task. Now what we can do is write functions declaratively. We can say: I have absolutely no idea how this function should work, I have some vibe about what this function should do, and now I can run functions purely on vibes, which, to an audience of engineers, myself included, I imagine sounds a little weird. If we accept that we can now write software on vibes, let's see what we can do with this. Since I was a kid, I really thought that basically all error outputs in every programming language I ever worked with were just trash. Coming up, I had definitely always wanted something that was actually able to explain my errors back to me. I have no idea how to write that function. Nobody who even wrote the original libraries knew how to write a function to give me good errors, and so, why should I? LLMs have a good sense of how to actually explain things in English. What I can do is literally write a function called explain_error. It takes an exception, which I'm going to feed to it as a string, and it's going to output a string. Of course, I don't know how to actually program this; I'm just going to tell it: you're going to get an error, and you're going to tell me how to fix it, if possible. Then I just build a context manager here that listens to the errors as I make them. Now if I commit what is probably one of the first errors you commit when you start writing Python, if I add an integer to a string, the bottom half of this is what I would normally get: this unsupported operand type between int and string. Maybe this isn't actually too hard to grok yourself.

On the top half is what you actually get when you're executing with an AI function. Now you're getting: here's the error that you got, here's why it occurs, and here's what you can do to actually try and fix it. Of course, all this is, is that I can now declaratively write a function. How does it work? It's this recasting framework. We take explain_error, the name. We take the docstring, which says, your job is to explain errors. We take the bound inputs, which is the actual string of the exception. Then we tell it, your output here has to be a string. What I actually get is a nice explanation of what I could be doing differently. That is how you can generate data using Marvin. I want to make sure that I'm not overfitting on Marvin. My job here is not to shill any particular framework. Hopefully, the takeaway from not only this, but what we're going to talk about, is more important, which is this general idea of how we can recast classical intractable problems into something that we can hand off to an LLM as a runtime, and extract the answer back. Whether or not you use Marvin or any other framework, it's this idea that's most important. That's the mathematician in me.
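The error-explaining setup described above can be sketched with a standard context manager. `explain_error` here is a stub; in the talk it is itself an AI function generated from a docstring.

```python
import contextlib
import traceback

# A sketch of the error-explaining pattern. In the talk,
# `explain_error` is an AI function; here it is a stub so the
# example is self-contained and runs offline.

def explain_error(tb: str) -> str:
    # Stub: a real AI function would return a plain-English
    # explanation and a suggested fix for the traceback.
    return "Explanation of: " + tb.strip().splitlines()[-1]

@contextlib.contextmanager
def explained_errors():
    """Listen for exceptions, explain them, then re-raise."""
    try:
        yield
    except Exception:
        print(explain_error(traceback.format_exc()))
        raise

# Committing the classic first Python error:
try:
    with explained_errors():
        _ = 1 + "a"
except TypeError:
    pass  # the explanation was printed before the re-raise
```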


That was generation; let's talk about extraction. If before we were generating a list of n fruits, now what I actually want to do is flip it. It's going to seem almost invisible. I had list_fruits before, and now I'm going to have extract_fruits. Before it was generate a list of n fruits, and now it's extract all fruits from a text. Great. Now, if I had a banana, a strawberry, and mango smoothie, I can call extract_fruits, and I'm going to get exactly this thing back out. The best way to think about an AI function is that, if an LLM is something like a new energy source, maybe combustion, something like this, then all of this framework is building the gears, cranks, and pistons to convert it into mechanical energy that can power the rest of the mechanical system. We are essentially wrapping it in a lot of type guarantees, because that's the language that software speaks, but we're passing it off to something that's more expressive. If software is prose, then the LLM is poetry here. Now we're actually able to do extractive tasks. This is such a common task. What do we all do as NLP practitioners but try and build random embeddings that maybe play well with our classifiers? Really, it's a lot about extracting data so we can pass it off to folks.

Now what we can do is, morally speaking, if you remember, Pydantic is really just a validation and parsing library. We can hook into the init, and we can now say: you're not meant to accept a positional argument, but now I'm going to tell you that you have to accept a positional argument. It's going to be some unstructured context string. When I pass it in, what I'm going to do is run this extract_fruits function that I defined in a previous slide, and then spread that into the init of my original function. Now, if you pass in a really terribly worded sentence, like "I have a banana today" or something like this, what you can actually get out of this is strongly typed Pydantic outputs. This is something that you can actually commit to a database if you're daring. Of course, this is super messy, and I don't want to do it all the time. Given how common of a design pattern this is, it deserves to be a first-class citizen of our system, and so it is. Marvin introduces what's called an AI model. It's just a class decorator for Pydantic models that enables you to pass in a positional argument, and you can actually mine out information. Notice here, I never said that a banana was yellow. I never said how many calories it has. What I'm doing is depending on, and almost using, the hallucination of an LLM as a feature, both literally and figuratively a feature, to not only extract but deduce and infer hidden information that's not readily available. You can turn off all these sorts of things as well, if that's not your vibe.

I'll give you some classic examples, then we'll actually go into a demo. Say you were a data engineer responsible for receiving typed location input from some fancy model that a data scientist built before you. I say this as one of those fancy data scientists who never thought he earned his salary: I'm sitting here waiting for some fancy model to pass me the city, country, latitude, and longitude of whatever user information is being handed down to me. This is exactly what I would write if I were a data engineer, to put in front of anything before I actually commit it to a database. If I decorate this with an AI model, now I can start taking stuff like, "no way, I'm also from the Windy City." I don't have to wait for a data scientist to unblock me on this. I don't have to wait for a new PRD that says, we've got three new features we want to start mining out of this, and then your data science team says, it'll take us about three months to train a new model to do all those things. Now I can depend on an LLM, and you can call it a prototype if you're skittish, or call it an NLP pipeline if you're brave. You can now start building inference into your actual parsing and validation libraries. This is what actually gets you strongly typed outputs from something that doesn't ever even say the words Chicago, Illinois. If you're wondering, that's exactly where it is.
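A toy version of the AI-model pattern: a class decorator that hooks into `__init__` so a typed model can accept one positional blob of unstructured text. `extract_fields` is a stub standing in for the LLM extraction-and-inference step, and the field names are illustrative.

```python
# A sketch of the "AI model" pattern: a class decorator that lets a
# typed model accept one positional blob of unstructured text.
# `extract_fields` is a stub standing in for LLM extraction/inference.

def extract_fields(text: str, annotations: dict) -> dict:
    # A real implementation would prompt an LLM with the text and the
    # model's schema; the stub returns the "Windy City" inference.
    stubbed = {"city": "Chicago", "state": "IL"}
    return {k: v for k, v in stubbed.items() if k in annotations}

def ai_model(cls):
    original_init = cls.__init__

    def __init__(self, text: str = None, **kwargs):
        if text is not None:
            # Spread the extracted fields into the original init.
            kwargs.update(extract_fields(text, cls.__annotations__))
        original_init(self, **kwargs)

    cls.__init__ = __init__
    return cls

@ai_model
class Location:
    city: str
    state: str

    def __init__(self, city: str, state: str):
        self.city, self.state = city, state

loc = Location("no way, I'm also from the Windy City")
print(loc.city, loc.state)  # Chicago IL
```

Calling `Location(city=..., state=...)` with keywords still works as before; only the positional-text path routes through the extraction step.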

Example (WeightWatchers)

The first example that I'll show you here, before I hop into a Jupyter Notebook, is a real example from WeightWatchers, or WW now. WW essentially solicits food that people want to upload into their app. As you can imagine, I'm trying to log my calories, I'm trying to log my meals, or something like this. WeightWatchers has assigned points to all these sorts of things. When I have a banana nut muffin, and it's not in their giant database, they ask me to manually upload it, which means that WeightWatchers has a repository of 100 million randomly uploaded descriptions of meals that people have eaten. Of course, what do they want to do? They're on the hook for telling people how this contributed to their budget, as part of the behavioral therapy program we're all trying to run here. They've tried everything for literally a decade. They've had some of the greatest data scientists working on their team, and literally every quarter it's, we'll try again next quarter, trying to standardize and normalize and clean and extract this data. Literally, they want to extract a little bit more, like this. One of their big things is that not only do they want to be able to take in and clean and normalize a lot of this stuff, but they also care about people uploading recipes, which you can do so that you can save and persist them for later. If you imagine a weight loss app, you want to be able to upload ingredients for a dinner you make often, and then just say, I had this again on Thursday. They want to be able to take those recipes, clean them, bring them back into their recommendation system, and then show them to other people. The problem is that they've got 100 million instances of this, and they have absolutely no idea what they are. They're just giant blobs of text that live in an S3 bucket that nobody has the keys to anymore. That's a joke, they do.
Of course, what they're trying to mine out of this is: can I get the ingredients out of this thing? Is this thing vegetarian? Is this thing vegan? How long does it take to cook? What cookware is necessary, if it's actually mentioned? You can just give types to all of these things, decorate them with an AI model, and pass in a recipe and actually get it all out.
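A model along those lines might look like the following. The field names are illustrative, not WW's actual schema, and in Marvin this class would additionally be decorated so it could accept raw recipe text; here plain Pydantic validates a dict as if an LLM had already extracted it.

```python
from pydantic import BaseModel

# Illustrative recipe model with the fields mentioned above.
# Field names are assumptions, not WW's actual schema. In practice
# this class would be decorated (e.g. as an AI model) to accept
# raw recipe text; here we validate an already-extracted dict.

class Origin(BaseModel):
    continent: str
    country: str

class Recipe(BaseModel):
    name: str
    ingredients: list[str]
    is_dessert: bool
    is_vegan: bool
    minutes_to_cook: float
    origin: Origin  # a nested model, not just flat JSON

# Pretend this dict is what the LLM extracted from a raw upload:
payload = {
    "name": "everyday dal",
    "ingredients": ["black lentils", "tomato", "cumin"],
    "is_dessert": False,
    "is_vegan": True,
    "minutes_to_cook": 45,
    "origin": {"continent": "Asia", "country": "India"},
}
recipe = Recipe(**payload)
print(recipe.origin.country)  # India
```

The nested `Origin` model is the part the talk flags as tricky: getting an LLM to emit JSON that validates against a nested schema, not just a flat one, is where the schema-in-the-prompt trick earns its keep.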

Let me show you what this actually looks like. I'm just going to keep sampling this. This is literally what users are saying. They said I had [inaudible 00:30:19]. I don't know what this is, but I guess it's chicken with walnut sauce. I've got Ugo with citrus garlic Sacher. I've got Schlotzsky's number 14. Schlotzsky's is a horrible Subway. They had the misfortune of not only going to Schlotzsky's, but getting the number 14. I'm going to show you how you can do this declaratively. What I can do is start building this a bit more declaratively. Maybe I'll start simple. Given a randomly sampled name from what I've shown you, what I want to do is just extract the name and the description. From spiced corn soup, and that's literally all that was passed to me, I'm going to get a description of: it's a delicious soup made with fresh corn kernels, onion, garlic, and spices like cumin, coriander, and paprika; it's usually served hot and can be garnished with cilantro and lime juice. Of course, that wasn't in the input at all. We're depending on the fact that ChatGPT is incredibly well read and that it's seen this somewhere before. That's not all that I want to extract from it. I want to build this declaratively, of course. Now I'm interested in knowing: is my randomly sampled thing a dessert, and is it vegan? My salmon mango roll. It's a sushi roll filled with fresh salmon and mango slices. It's definitely not a dessert, and it's definitely not vegan. What I'm getting out here is nice and tight. Now I want to start doing something insane. Now I actually want to start looking at the origin of this. I'm going to start building nested models. LLMs are pretty good at giving you flat JSON. If you ever see a demo where somebody says, I finally got JSON output here, it tends to be a flat one.
Getting actually nested data models to come out of this stuff is, again, not hard, you can read through how we do it, but it takes some attention. Let's see, maybe I'll be the unlucky one with an error. Now I'm going to get my [inaudible 00:32:28] black lentils. It's got tomato and cumin; it's everyday dal. It's not a dessert. It is vegan. It's from the continent of Asia, and specifically India. Now we can start passing all sorts of stuff to this. I can say, it's a salad with kalamata olives and feta cheese. What is this? It's a Greek salad. Great. It's definitely not a dessert. It's not vegan. It's from Greece, which is in Europe. Let's see if we can do a sandwich made with chopped steak and Cheez Whiz. Again, without decorating this with an AI model, this would all crash. It still might crash, but at least it doesn't crash as often. It's a joke, it's robust. Of course, it's going to break here. Of course, what we get out of this is a Philly cheesesteak. A sandwich made with chopped steak and Cheez Whiz. It's not a dessert. It's definitely not vegan. It's from the U.S., go Philly.

This is a sense of how we can start building these data models declaratively. You'll notice that for every new feature that I wanted to add, I didn't have to go to a PM, to an EM, to a head of data science to see if we could earmark enough of whatever to get the training data. That entire circle is collapsed, and in some sense this returns a lot of territory that has been lost over the years to folks working in a much more deterministic land. In some sense, it just returns a lot of low-level inferential territory. If you still want to use this as training data, you can. If your data science team says this could hallucinate random stuff, that's fine. Data scientists have been creatively figuring out how to stack models together for the last decade, and figuring out how to work with bad third-party data from vendors. You can think about it as a third-party data source that you can work with. That's WeightWatchers.

Example (Clearbit)

Clearbit is another one of these companies. Clearbit, I think after its last Series C, is a $450 million company. Their whole thing is essentially schema normalization across hundreds of thousands of different websites. They're not pulling information about companies from some nice structured API and then normalizing that. They are literally looking at the raw DOM of a couple hundred thousand company websites and trying to unify that into a single API that they sell access to. If you want to query a company, it costs you 60 cents. If you do it this way instead, you can literally scrape the page, pass it to an LLM, and tell it exactly what typed information you want out of it. Instead of paying 60 cents per query, you'll pay 0.01 cents per query. I discovered this at my last startup, because I was a customer of Clearbit.
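The economics are easy to sanity-check. A hedged sketch: the `Company` fields below are illustrative, not Clearbit's actual schema, and the costs are the round numbers from the talk:

```python
# Illustrative only: not Clearbit's schema, just the shape of what you'd
# ask an LLM to extract from a scraped company page.
from typing import Optional

from pydantic import BaseModel


class Company(BaseModel):
    name: str
    domain: str
    industry: Optional[str] = None
    employee_count: Optional[int] = None


# The cost comparison from the talk: 60 cents per Clearbit query versus
# 0.01 cents per LLM extraction call.
clearbit_cents_per_query = 60.0
llm_cents_per_query = 0.01
savings_factor = clearbit_cents_per_query / llm_cents_per_query
# roughly a 6000x cost reduction per query
```

In practice you would pass the scraped DOM plus this model to the LLM and get a validated `Company` back.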

Example (ZipRecruiter)

Then you have folks like ZipRecruiter. ZipRecruiter is literally trying right now to redesign their entire resume parsing pipeline. This is one of these classic problems where you not only want to extract that somebody's first name is this, last name is this, email and phone number are this, or that this person went to school at a university that maps onto some lookup table. There are also more inferential and deductive questions that you want to ask at the same time. One of the beautiful things about LLMs is that they've been exposed to so much text, and, people hypothesize, so much code, that they're able to do deductive inference as well. Not only can you ask questions like, where did this person go to school? You can say, given the experience this person has, do they have at least three years of experience in Python? What it'll do is, one, actually get the right answer. Two, it will look at all the experiences where you can infer somebody used Python, add them up, and extract that. You can start asking whether or not somebody has management experience.
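The deductive fields can live right next to the literal ones in the same model. A hedged sketch with illustrative field names; the helper shows the arithmetic the LLM is implicitly doing when it answers the "three years of Python" question:

```python
# Hedged sketch of a ZipRecruiter-style deductive extraction problem.
# Field names are illustrative. The key idea: "has 3+ years of Python" is
# not written anywhere in the resume; the model must infer it by summing
# the relevant experiences.
from typing import List

from pydantic import BaseModel


class Experience(BaseModel):
    title: str
    years: float
    technologies: List[str]


class ParsedResume(BaseModel):
    first_name: str
    last_name: str
    experiences: List[Experience]
    has_3_plus_years_python: bool    # deductive: model sums experiences
    has_management_experience: bool  # deductive: inferred from titles


def python_years(resume: ParsedResume) -> float:
    """The arithmetic behind the deductive field, made explicit."""
    return sum(e.years for e in resume.experiences
               if "Python" in e.technologies)


resume = ParsedResume(
    first_name="Ada", last_name="Lovelace",
    experiences=[
        Experience(title="Data Engineer", years=2.0,
                   technologies=["Python", "SQL"]),
        Experience(title="ML Engineer", years=1.5,
                   technologies=["Python"]),
    ],
    has_3_plus_years_python=True,
    has_management_experience=False,
)
```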

Example (Flatiron Health)

Last example, which I'll move through quickly, is a real-life example from Flatiron Health. Flatiron Health is basically: how do we do data science to cure cancer? They sell an EHR, collect all that data with people's permission, and then apply data science to understand which practices actually lead to better oncology outcomes. A lot of what they do is deal with things like: this doctor was way too busy to write any of this down in a structured way, so instead they spoke it into a tape recorder and had their assistant type it out into a transcript. Now it's the job of a giant company to turn that into structured output. They put out a blog post about how they have a team of, I think, 18 or 19 deep learning engineers who trained an RNN to extract whether or not somebody smoked and which biomarkers they had. The right way to do this is just to build a data model for it. You literally say: I expect to see the blood tests, I expect to see the treatments this person got, I expect to see a list of their diagnoses, and I'd like to know their name, their age, and whether or not they're a smoker. You decorate any of these models, and you can now pass in full EHR data. Of course, they don't pass this to OpenAI; they have an incredibly secure, HIPAA-compliant LLM that they host internally, because there are serious privacy and security concerns with that type of information. This is now a way that you can not only extract information, but if you flip it to the generative side, you can start generating synthetic test data for your more robust NLP pipeline, should you want.
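The nested structure described there can be sketched as follows. All field names are my guesses; Flatiron's real schema is not public, and in practice this would run against their privately hosted, HIPAA-compliant model, never a public API:

```python
# Hedged sketch of a nested EHR data model like the one described in the
# talk. Every field name here is illustrative, and the sample record is
# synthetic data, not real patient information.
from typing import List, Optional

from pydantic import BaseModel


class BloodTest(BaseModel):
    name: str
    value: float
    unit: str


class PatientRecord(BaseModel):
    name: str
    age: int
    is_smoker: Optional[bool] = None  # often only implied in a dictated note
    blood_tests: List[BloodTest] = []
    treatments: List[str] = []
    diagnoses: List[str] = []


# Synthetic example of what extraction from a transcript could return:
record = PatientRecord(
    name="Jane Doe",
    age=61,
    is_smoker=False,
    diagnoses=["NSCLC"],
)
```

Flipped to the generative side, the same model is a template for producing synthetic records to test a downstream pipeline.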

Classification and Choice

I want to talk about classification and choice. OpenAI gave a talk where they said: we'll choose functions for you, and sometimes we'll generate the inputs to those functions. It's great. We wrote a blog post about it. We've hooked it back into Marvin. It works. I really appreciate what they've done to fine-tune on this problem to make it more friendly to developers. Making LLMs more friendly for developers is the only reason I wake up every day. I want to talk about how we can get forced choices in a much faster, cooler, lower-latency way, so that you can actually build it into an application. LLMs, as I said, will solve other tasks if you ask nicely, but they will also solve other tasks if you torture them. Here's what I mean by torture: they'll solve other tasks if you micromanage them. What I'm going to talk about is constrained sampling. As I said, LLMs are nothing more than next-token prediction machines, which means that if we stand there and micromanage everything they say, we can get them to say pretty much whatever we want, when we need them to, and we can let them be creative when we need them to. Here's what this looks like. Essentially, we remove options from their vocabulary as they're speaking. We can do this to strongly enforce bulletproof types, but we can also use it to force them to make choices. This technique has become known as decoding under a context-free grammar.

With a context-free grammar, instead of saying, please give me back something that conforms to this schema, I sit next to the model as it speaks and say: no, actually, it has to start with an open brace. There's a lot of work being done on this. Jsonformer is one of the first projects to do this for open source models. This is exclusively available for open source models, because you need access to the decoder layer to micromanage this autocomplete as it's speaking. Typically, you're going to say, please, I'm begging you for JSON, and instead it'll give you something you don't want. For open source models, what you can do is, for every single token, micromanage exactly what it's saying. That means you can micromanage it into giving you a structured output, and then, in the gaps the grammar leaves inside your structure, you can let it be creative. You can say: I know it's got to start with an open brace for this to be JSON. I know the second character has to be a quote, because this is JSON. You're allowed to put whatever you want next, but then I'm going to enforce that you put a quote, and then a colon after that. This is only available for open source models. Or is it?

Let me tell you a little bit about how you can do this with closed-source models. The catch is, you can only do it for exactly one token. You can do it in general, as long as you're willing to make an arbitrary number of requests back and forth to OpenAI, but nobody has time for that. We only have time for one request, so I'm going to tell you how to stretch your budget. This is a budgeting conversation. As I said, LLMs will solve other tasks if you torture them; this is the OpenAI edition. Let's say I'm trying to choose between Wikipedia, the New York Times, and Google to call as a tool for a user query. What I do is strip those out and assign them all numbers. New York Times: I describe what it is, and it gets the number one. Google: I describe what it is, and it gets the number two. Wikipedia: I describe what it is, and it gets the number three. I've recast a tool-choosing problem as choosing the single index of the tool I want to use. OpenAI lets you bias the tokens that come out, but only for the entire response. That means you can say, you're only allowed to say the word apple, and it will spit out the word apple 100 times: you can constrain exactly which tokens it's allowed to use in its response. If I enumerate the n options I want chosen, instruct it that it's only allowed to answer with the tokens that represent null, 1, 2, all the way up to n, and then constrain it to give me exactly one token, I've turned this into a single forced choice that still uses the deductive ability of an LLM to make that choice. Now I can make it choose the index of the best tool for the job.
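Assembled as a raw chat-completions payload, the trick looks roughly like this. The token IDs in `logit_bias` below are placeholders, not real values; in practice you would look up the IDs for the strings "1", "2", "3" with tiktoken for your target model:

```python
# Sketch of the single-token forced-choice request. The logit_bias keys
# are PLACEHOLDER token IDs; look up the real IDs with tiktoken, e.g.
# tiktoken.encoding_for_model("gpt-3.5-turbo").encode("1").
tools = {
    1: "New York Times: recent news and reporting",
    2: "Google: general web search",
    3: "Wikipedia: encyclopedic and factual lookups",
}

system_prompt = (
    "You are a router. Reply with ONLY the number of the best tool:\n"
    + "\n".join(f"{i}. {desc}" for i, desc in tools.items())
)

request = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Who won the 1996 Tour de France?"},
    ],
    "max_tokens": 1,   # exactly one token: the forced choice
    "temperature": 0,
    # Strongly favor the digit tokens so nothing else can be emitted.
    "logit_bias": {16: 100, 17: 100, 18: 100},  # placeholder IDs for "1","2","3"
}
```

The response's single token is then mapped back through `tools` to the chosen resource.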

This might be the first ML talk you've seen that has a confusion matrix in it. One of the first questions you might ask is, "That is a way of choosing a tool. Good job, Adam. Is it any good at actually choosing a tool?" I want to tell you how you can test that, and that it actually is. We can rely on Marvin not only for extracting information, but for generating it. I say: here is a list of information resources, Wikipedia, Google, LinkedIn, and so on. Then for each one of those sources, I generate n human-sounding user queries for which that source is the appropriate resource to use, without mentioning the name of the source. If I plug those two things together in a big ol' map job, I can get about 100,000 examples for maybe 5 bucks. Then I can treat my tool choice as a multi-class classification problem: did it return the tool I generated the query for? If you do this as a constrained problem, you get about 84% accuracy of tool choice. I did this over 10 tools, generated something like 1,500 examples, and it consistently chooses the right tool. The generated data doesn't leak the answer, like "Please check Reddit for this thing." The thing it confuses most is conversational questions, like: what's the best way to find an attorney? Sometimes Quora is the right answer there, and it gets confused with Google. Beyond that, it's incredibly robust at picking out factual information, whether you should check Reddit, whether you should check YouTube. If you do this in an unconstrained way, it's about 50% slower and has about 14% accuracy. Why only 14%? Because it's still a conversational agent: instead of answering "Google Scholar", it'll say, you can use Google to search for this, which is not an actual choice of a tool. If Reddit is the right answer, it'll respond, you can try Reddit if you want. Or if Reddit is the right answer, it'll say, you should go to r/learnprogramming.
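The evaluation loop reduces to ordinary multi-class classification. A toy version, with a stub chooser standing in for the constrained LLM call and hand-written pairs standing in for the generated dataset:

```python
# Toy version of the evaluation: score tool choice as multi-class
# classification over (synthetic query, correct tool) pairs. The stub
# chooser below stands in for the single-token forced-choice LLM call.
labeled = [
    ("what does the research say about intermittent fasting", "google_scholar"),
    ("who was the 12th president of the united states", "wikipedia"),
    ("best mechanical keyboard under $100", "reddit"),
    ("video tutorial for changing a bike tire", "youtube"),
]


def stub_choose(query: str) -> str:
    # Stand-in for the constrained LLM; a real run swaps this out.
    if "research" in query:
        return "google_scholar"
    if "president" in query:
        return "wikipedia"
    if "video" in query:
        return "youtube"
    return "reddit"


accuracy = sum(stub_choose(q) == tool for q, tool in labeled) / len(labeled)
```

Swapping `stub_choose` for the real constrained call over ~1,500 generated examples is exactly the experiment behind the 84% figure in the talk.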

What you can do now is actually build this into middleware in a regular old application. As middleware, I can build an API, and in that middleware I can say: if a request comes in along a certain route with a natural language query, different routes should take different actions. I can declare a route that accepts a natural language query. Sometimes I want different personalities behind those routes, and sometimes I want those personalities to literally just redirect to another page. Now if I say, how many customers do I have? It's customer time. If I go back and ask, how do I make a return? I'm making returns over here. All this is, is literally a FastAPI router. You define all the routes that you want, orders, customers, returns; you can document them and tag them. Then you can intercept the ASGI request, hook into the scope, and ask: is this going to a route that doesn't exist? I can describe each one of those API routes to an LLM and run this forced-choice constraint which, for problems this small, happens on the order of about 150 to 180 milliseconds, which is low latency enough to build into a real application. This is coming to a Marvin near you.
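The routing core can be sketched independently of any web framework. A hedged sketch, assuming illustrative route names and descriptions (this is not Marvin's actual API); the chooser is injected as a plain callable so the logic is testable offline, with a naive keyword matcher standing in for the ~150-180ms forced-choice LLM call:

```python
# Hedged sketch of natural-language routing. In the real middleware you
# would intercept the ASGI scope, and classify would be the single-token
# forced-choice LLM call; here it's injected so the logic runs offline.
ROUTES = {
    "/orders":    "creating, tracking, and listing orders",
    "/customers": "customer accounts, counts, and details",
    "/returns":   "returning items and refund status",
}


def route_for_query(query: str, classify) -> str:
    """classify(query, routes) -> path of the best-matching route."""
    return classify(query, ROUTES)


def keyword_classify(query, routes):
    # Naive stand-in for the LLM: match the singular route name.
    q = query.lower()
    for path in routes:
        if path.strip("/").rstrip("s") in q:
            return path
    return "/"  # fall through to the homepage
```

In the middleware, a request to a nonexistent path with a query like "how many customers do I have?" would be redirected to the route this function picks.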


We built Marvin. Really, it's just the most Pythonic LLM library possible. It lets you work in an ambient way so that you never have to think about the LLMs, which is my favorite part. It's launched, everybody loves it, the Pydantic folks loved it, all these other people loved it. If you want to learn more, you can follow the project at @askmarvinai, or me at @aaazzam.




Recorded at:

Nov 04, 2023