Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations A Bicycle for the (AI) Mind: GPT-4 + Tools

A Bicycle for the (AI) Mind: GPT-4 + Tools



OpenAI recently introduced GPT-3.5 Turbo and GPT-4, the latest in its series of language models that also power ChatGPT. Sherwin Wu and Atty Eleti discuss how to use the OpenAI API to integrate these large language models into your application, and extend GPT’s capabilities by connecting it to the external world via APIs and tool use.


Sherwin Wu is a Member of Technical Staff at OpenAI. He works on the Developer Platform team, which is responsible for the products that allow developers to build products on top of OpenAI models and capabilities. Atty Eleti is a software engineer at OpenAI working on their API and developer platform. Previously, he spent many years building APIs at Stripe.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Computers - A Bicycle for the Mind

Atty Eleti: I want to take us back in time to 1973, 50 years ago. In 1973, Scientific American published a very interesting article, in which they compared the movement of various animals. They set out to compare the locomotive efficiency. In other words, how many calories an animal burns to get from point A to point B, in relation to their body mass and things like that? What they did is they compared various animals, birds, insects, and of course us humans, and ranked them from most efficient to least efficient. What they found was the condor was the most efficient in terms of locomotive efficiency.

A condor is a beautiful bird, native to California and some parts of South America, and it can fly sometimes hundreds of miles without ever flapping its wings. It has really good gliding power. Humans on the other hand, humans walk, came in rather unimpressively at about one-third down the list, which is not such a great showing for us. The beauty of the Scientific American article is that, in addition to all the species, they added one more item to the list, and that was man on a bicycle. Man on a bicycle blew the competition away, almost two times as efficient in locomotion as the condor.

I love this story, because it is such a simple realization, that with a little bit of tooling, a little bit of mechanical help, we can really augment our abilities by quite a lot. Some of you might have heard this story before. You might be thinking, where have I seen this? This story was often told by Steve Jobs in the early days of Apple. Him and the Apple team used the story as inspirations with the early Macintosh. Steve compared the story and said, humans are tool builders.

We make tools like the bicycle to augment our abilities to get things done. Just as the bicycle is a tool for motorability, for moving, computers are a tool for our mind. They augment our abilities, and creativity, imagination, and productivity. In fact, Steve had this amazing phrase that he used to describe personal computers. He said, computers were a bicycle for the mind. Ten years after this article was published, in 1983, Apple released the Macintosh, and unleashed the personal computing revolution. Of course, we're here now, many years later, still using Macs every day.

2023 - AI and Language Models

That was 1973. We're here in 2023, 50 years later, computing has changed a lot. If the folks at Scientific American ran the study, again, I bet they would add one more species to the list. A species that for most of us has really only been around in the public imagination for six months or so. I'm talking of course about AI, or language models in specific.

Ever since ChatGPT launched in November last year, AI and language models have captured the public imagination around the world. More excitingly, they've captured the imagination of developers around the world. We've seen an amazing number of people integrate AI into their applications, build net-new products using language models, and come up with entirely new ways of interacting with computers. Natural language interaction is finally possible and high quality. There are limitations and there are problems. For any of you who've used ChatGPT, you know that its training data was fixed in September 2021, so it's unaware of current events.

For the most part, language models like ChatGPT operate from memory from their training, so they're not connected to current events, or all the APIs out there, your own apps and websites that you use every day. Or if you work at a company, it's not connected to your company's database and your company's internal knowledge base and things like that. That makes the use of language models limited. You can write a poem. You could write an essay. You can get a great joke out of it. You might search for some things. How do you connect language models to the external world? How do you augment the AI's abilities to perform actions on your behalf, to do more than its innate abilities offer it?


If computers are a bicycle for the mind, what is a bicycle for the AI mind? That's the question we're going to explore, a bicycle for the AI mind. We're going to talk about GPT, the flagship set of language models that OpenAI develops, and how to integrate them with tools or external APIs and functions to power net-new applications. My name is Atty. I'm an engineer at OpenAI. I'm joined by Sherwin. Together, we're on the API team at OpenAI, building the OpenAI API and various other developer products.

We're going to talk about three things. First, we're going to talk about language models and their limitations. We'll do a quick introduction to how they work, what they are. Develop an intuition for them. Then also learn about where they fall short. Second, we're going to talk about a brand-new feature that we announced, called function calling with GPT. Function calling is how you plug OpenAI's GPT models to the external world and let it perform actions. Finally, we'll walk through three quick demos of how you might take the OpenAI models and the GPT function calling feature, integrate it into your companies, your products, and your side projects as well.

LLMs and Their Limitations

Sherwin Wu: I wanted to start by just giving a very high-level overview of LLMs: what they do, what they are, how they work. Then also talk about some of the limitations that they have right out of the box. For those of you who've been following this space for a while, this is probably information that you all know, but I just want to make sure that we're all on the same page before diving into the nitty-gritty.

Very high-level, GPT models, including ChatGPT, GPT-4, gpt-3.5-turbo, they're all what we call autoregressive language models. What this means is that they are these giant AI models, they've been trained on a giant corpus of data, including the internet, Wikipedia, public GitHub code, other licensed material. They're called autoregressive because all they're doing is they're synthesizing all this information. They take in a prompt, or what we might call context. They look at the prompts. Then they basically just decide, given this prompt, given this input, what should the next word be? It's really just predicting the next word.

For example, if an input given a GPT is, the largest city in the United States is, the answer is New York City. It would think about it one word at a time, and it would say New, York, and then City. Similarly, in a more conversational context, if you ask it what the distance between the Earth and the Sun is. GPT has learned this somehow from the internet, it'll output 94 million miles. It's thinking about it one word at a time, based off of the input.

Under the hood, what it's really doing is each time it's outputting words, it's looking at a bunch of candidate words and assigning probabilities to them. For example, in the original example of, the largest city in the United States is, it might have a bunch of candidates, so New for like York, or New Jersey, or something, Los for Los Angeles, and then some other possible examples. You can see that it's really thinking that New York City is probably the right answer, because New is assigned a probability of 95%. In this case, it generally picks the most likely outcome, so it picks New, and then it moves on. After this word comes out, you now know that New is the first word, so it's constrained a little bit more on what the next word is.

You can see now it's thinking New York with much higher likelihood, but it's also considering New Brunswick, New Mexico, New Delhi as well. Then once the second word has been done, this is basically a layup for the model. It basically knows that it's New York City with almost 100% probability. It's still considering some other options with very low residual probability, so County, New York Metro, New York Times, but with that, it chooses City and concludes its answer.

For the more astute LLM folks, it's technically an oversimplification. You're not really predicting words, you're predicting tokens, like fragments of words, which are actually a more efficient way of representing the English language, mostly because fragments of words are repeated in a bunch of different words instead of the word itself. The concept is still the same. The LLM is taking in that context, and it's probabilistically outputting a bunch of different tokens in a row. That's it. That's really what these language models are. With this, the crazy thing that I think surprised a lot of us is that you can get really far with just predicting the next word.

This is a graph from our GPT-4 blog post that we released in March of this year, which shows the performance of our most capable model, GPT-4, on various professional exams. This is literally just GPT-4 predicting the next word, based off of questions. You can see that it's actually performing at human or even past human performance on a lot of different exams. The y-axis is percentile amongst test takers. It's basically at like 80th percentile, sometimes even 90th, or even 100th percentile on a bunch of different exams such as AP exams, the GREs, LSAT, USA Bio Olympiad as well.

At this point, a lot of these tests I can't even do, and so GPT-4 is well above my own ability, and this is just from predicting the next word. This is really cool. You can build a lot of cool things with this. Anyone who has been playing around with LLMs for a while, you will realize that you very quickly start running into some limitations here. The biggest one, of course, is the out of the box LLM or GPT is really an AI that's in a box. It has no access to the outside world. It doesn't know any additional information. It's just there with its own memory. It feels like when you're taking a test in school and it's just you and the test and you're coming up with things out of memory.

Imagine how much better you do on the test if it were open node, if you could use your phone or something like that. GPT today is really just in its own box. Because of this, as engineers, we want to use GPT and integrate it into our systems. Limiting GPT and not allowing it to talk to our internal systems is very limiting for what you might want to do. Additionally, even if it does have access to these tools, because the language model is probabilistic, it's sometimes very hard to guarantee the way that the model interacts with external tools. If you have an API or something that you want to work with, the current model isn't guaranteed to always match the input of what an API might want, and that ends up being a problem.

For example, if I were building an application and I was to give this input to GPT, basically said, below is the text of a screenplay, extract some information from it and structure it in this JSON format. I'm really just giving it a screenplay and asking it to infer a genre and a sub-genre, as well as some characters from it, and age range. What I really want is I want it to output something like this. It's like exactly like the JSON output.

Maybe this is a screenplay about like Harry Potter romance or something. It knows that it's romance, teen romance, it sees Ron and Hermione, and outputs it exactly in this JSON format. This is fantastic, because I can just take this output, and now I can use this and throw this into an API. Then I'm like in my code, and it all works. The problem is it does this maybe like 80%, 70% of the time.

The rest of the time, it will actually try and be extra helpful and do something like this where it says, "Sure, I can do that for you. Below is the information you asked for in a JSON format." Which is super helpful kind of, but if you're trying to plug this into an API, it actually won't work because there's all this random text in front and your API won't know how to parse it. This is obviously very disappointing. This is not what you actually want. What we really wanted to do is we wanted to help break GPT out of the box, or give GPT basically a bicycle or another set of tools to really augment its ability, and have that work very seamlessly.

Function Calling with GPT

This brings me to the next part, which is going over what we call function calling the GPT, which is a new change to our API that we launched that makes function calling work a lot better with our GPT models in a very first-class way. To illustrate an example of this, if you asked GPT a question like this, what's the weather like in Brooklyn today? If you ask a normal GPT this, it'll basically say something like, as an AI model trained by OpenAI, I'm unable to provide real-time information, which is true, because it can't actually access anything. It's in a box. How does it know what the weather is like right now.

This obviously really limits its capabilities, and it's not desirable. What we did was we updated our GPT-4 and our gpt-3.5-turbo models or flagship models. We took a lot of tool use and function calling data, fine-tuned our models on those, and made it really good at choosing whether or not to use tools. The end result is a new set of models that we released that can now intelligently use tools and call functions for you. In this particular example, when we were asking the model, what's the weather like in Brooklyn today? What I can now do is parse in this input, but also tell it about a set of functions, or in this case, one function that it has access to that it should try and call out to if it needs help. In this case, we'll give it a function that's called get_current_weather.

It takes in a string with the location, and then it knows that it can use this. In this case, in the new world, when you parse in this input, GPT will express its intent to call this get_current_weather function. You will then call this function yourself in your own system however you want. Let's say you get an output here that says 22 Celsius and Sunny. You can parse that back to GPT, it'll synthesize this information and return to the user saying the weather in Brooklyn is currently sunny, with a temperature of 22 degrees Celsius.

To unpack this a little bit, what's really happening is GPT is knowing about a set of functions, and it will intelligently on its own, express its own intent to call one of these functions. Then you execute the call and then parse it back to GPT. This is how you end up connecting it to the outside world. To walk through this a little bit more, what's really happening at a high level, it's still just like a back and forth, so your user asks a question, a bunch of things happen. Then you're responding to your user. What actually happens behind the scenes with your app is you're going through this three-step process where you're calling out to OpenAI, then you're using your own function, and then you're calling out to OpenAI, or GPT again.

The first step, obviously the user asks a question, in this case, it is, what's the weather like in Brooklyn today? Then the next step is, in your application, you call a model, you call OpenAPI, and you tell it about the set of functions that it has access to, as well as the user input, very concretely. This is an example API request that actually works today, anyone with API access can try this. This is an example curl that uses our function calling ability. You can see that it's just normal curl to our chat completions endpoint, which is a new API endpoint that we released, that powers our GPT-4 and GPT-3.5 models. You're curling this API. You're parsing in a model.

In this case, you're parsing in a gpt-3.5-turbo-0613, which stands for June 13th, which is the model that we released. This is the model that's capable of doing function calling. You're also parsing in a set of messages. For those of you who might not be familiar with our chat completions format, you can parse into our model, basically a list of messages. That's the conversation history.

In this case, there's only one message, there's no history, really. It's just a user asking what's the weather like in Brooklyn today. You can imagine, as the conversation gets longer, this might be like a 5 to 10 message list. You're parsing the messages, and the model will be able to see the history and react to that. Then, the net-new thing here is functions.

This is a new parameter you can parse in now, and what you're parsing in here is you're listing the set of functions that this model should be aware of, that it should have access to. In this case, we only have one function, it's the get_current_weather function. You're putting a natural language description here as well. You're saying this function gets the current weather in a particular location. You're also putting in the function signature. You're saying it has two arguments. It has a location, which is a string, which is just city and state, and it's in this format, so San Francisco, California. It also has a unit parameter as well, which is Celsius, or Fahrenheit.

Below the fold here, there's also another argument in here that says the only property that is required is the location. You technically only need to parse in location, you don't need a unit here. You parse this request over to GPT, and GPT will then respond. In the old world, GPT would probably just respond with text. It'll say, I can't do this, because I don't have access. In this case, what our API responds with, is an intent to call the weather function.

What's really happening here is GPT is intuiting on its own, that in order to figure out the current weather, I'm not able to do it on my own, but I have access to this get_current_weather function, so I'm going to choose to call it, and so I'm going to express an intent to call it. Additionally, what GPT did here, if you haven't really noticed, is it's constructing the argument here. You can see it's telling you, it wants to call get_current_weather, and it wants to call it with the argument location, Brooklyn, New York.

What it did is it saw the function signature, created a request for it. Then also figured out that Brooklyn is in New York, and then structured the string in this way. It figured all of this out. With this, GPT has expressed an intent to call a function now. The next step is now it's on you to figure out how you actually want to call this function. You have the return from the function call, get_current_weather with this particular argument. You can then execute it on your own. It could be local, running on your own web server. It can be another API in your systems. It could be an external API, you can call the API.

Then let's say in this case, we call something, maybe an internal API, and it returns with this output that you saw, so 22 degrees Celsius and Sunny. Given that output from the model, you start your third step in this process, which is then calling the model, calling GPT with the output of the function, and then seeing what GPT wants to do. In this case, I was talking about the messages. This time, the second request you're sending to the OpenAI API, you're going to add a couple messages here. Originally, there's just one message which was the, what's the weather like in Brooklyn? Now you're adding two new messages that represent what happened with the function call.

The first one is basically a rehash of the intent, so you're basically saying the assistant or GPT wanted to call the get_current_weather function with this argument of Brooklyn, New York. Then you're also adding a third message, which basically says the result of the function call that you had, so this is the result of get_current_weather. Then you're inlining the data that is output here, which is the temperature 22, unit Celsius, and description Sunny, and you parse that all to GPT. At this point, then GPT takes in that and put it and decides what it wants to do.

At this point, the model is now smart enough to realize, "I'll call this function. Here's the output. I actually have all the info I need to actually fulfill the request." It'll now respond finally with text, and it'll say the weather in Brooklyn is currently sunny, with a temperature of 22 degrees Celsius. At that point you finally have your final output from GPT. Then that's when you respond to the user.

Putting this all together, you end up getting the experience that you'd ideally like to have here, which is the user asks, what's the weather like in Brooklyn today? Your server thinks a little bit, GPT expresses the intent, you do this whole three-step process, calling out to your function. Then ultimately, what the user sees is, the weather in Brooklyn is currently sunny, with a temperature of 22 degrees Celsius. Success.

Demo 1 - Converting Natural Language into Queries

Eleti: We just went through a couple of introductory topics. First, we learned about how language models work, some of their limitations, in that they don't have all the training data, they're not connected to the external world, that their structured output is not always parseable. Sherwin also just walked us through the new feature, function calling and how the API works, and how you parse functions to the API and get output back, and get GPT to summarize the responses in a user facing way. Let's walk through a few demos about how you can combine all of this and apply it to your products and your applications.

Let's start small. The first example we'll walk through is something that converts natural language into queries. The example we're going to do is, imagine you're building a data analytics app or a business intelligence tool, like Tableau or Looker. Some of you may be good at SQL, I certainly am not. Most often, I just want to ask the database, who are those top users, and just have the response back. That's finally possible today. We're going to use GPT, we're going to give it one function called SQL query, all it takes is one parameter, query a string.

It's supposed to be a valid SQL string against our database. Let's see how that works. First, we're going to give the model a system message describing what it's supposed to do. We say your SQL GPT, and can convert natural language queries into SQL. Of course, the model needs access to a database schema. In this case, we have two tables, users and orders. Users have a name, email, and birthday. Orders have a user ID, purchase amount, and purchase date. Now we can start querying the database with some natural language.

Let's ask this question, get me the names of the top 10 users by amount spent over the last week. A fairly normal business question, certainly not something I could write SQL for in a jiffy, but GPT can. Let's run it. We can see that it's calling the SQL query function. It has one parameter, query, and it created a nice SQL query. It's selecting the name and the sum of amount. It's joining on orders. It's getting the last one week of orders, ordering it by the total spent, and limiting it to 10. Seems correct and appropriate. Let's run it against our database. We got some results back.

Of course, this is in a JSON format, and so not user renderable. Let's send this back to GPT to see what it says. GPT summarized the information and said, these are the top 10 users by amount spent. This is what they spent over the last week, so Global Enterprises, Vantage Partners. This is an amazing user readable answer.

We're going to say a quick thank you to GPT for helping us out. It says thanks, and GPT says, you're welcome. That's a quick way to see how totally natural language, completely natural language queries were converted into structured output into a valid SQL statement that we ran against our database, got data back, summarized it back into natural language. You can certainly build data analytics apps off of this.

You can build other internal tools. Honeycomb recently built a very similar tool for the Honeycomb query language. That's one example of using GPT and functions to convert natural language into queries.

Demo 2 - Calling External APIs and Multiple Functions

Let's do a second demo. This one is about calling external APIs and multiple functions together. Let's amp up the complexity level. Let's say we're here at a conference in New York and we want to find dinner reservations for tonight. We're going to call GPT with two functions. The first one is, get_current_location. That runs locally on device, let's say on your phone or on your browser, and gets the Lat and Long of where you are. The second function here is Yelp search, which uses Yelp's API, so the popular restaurant review application, and you parse in the Lat, Long, and query.

Let's run through a demo. The system message in this case is fairly simple. All it says is your personal assistant who is helpful at fulfilling tasks for the user, sort of puts GPT into the almost mental mode of being a helpful assistant. I say, I'm at a conference and want to grab dinner nearby, what are some options? My company is expensing this so we can go really fancy. Let's run that with GPT and see how it can do it.

Of course, GPT doesn't know where we are, so it says get_current_location, and we're going to call the local API to get our Lat and Long. We've done that. That's Brooklyn, New York, somewhere here, I think. We'll give that back to GPT and see what it says. It has the information it needs, now it wants to call Yelp, and it says Lat, Long, and query, and it says fine dining. That's good. That's what I want. Let's call Yelp and get some data back.

We got a bunch of restaurants from Yelp's API. I want it in a nice summary, of course, so let's run it again. Says, here are some fancy dining options near your location, La Vara, Henry's End, Colonie, Estuary. It says, please check the opening hours and enjoy your meal. Sounds delicious. Thank you GPT, yet again, for helping organize dinner tonight.

That's an example of using GPT and functions to call external APIs, in this case, the Yelp API, as well as to coordinate multiple functions together. It's capable with its reasoning ability to parse user intent and do multi-step actions, one after another, in order to achieve the end goal.

Demo 3 - Combining Advanced Reasoning with Daily Tasks

A third demo, let's ramp it up a little bit more. We talked about how GPT-4 can pass the SAT and the GRE. If it can, it must be smarter than just calling the Yelp API or writing some SQL. Let's put it to the test. We're all engineers, we have many things to do every day. One of the tasks that we have to do is pull request review. We have to review our coworker's code. It would be awesome if GPT could help me out a little bit and make my workload a little bit lower. We're going to do a demo of GPT that does pull request review, sort of build your own engineer.

We only need one function, submit_comments. It takes some code and returns a list of comments that it wants to review, so line, numbers, and comments. You can imagine we can then send this out to the GitHub API or the GitLab API, and post a bunch of comments. Of course, you can add more functions and make it even more powerful down the line. Let's see how that works.

In this case, the prompt is a little bit long. Let's scroll up and see. We're saying, GPT, you record, review bot, you look at diffs and generate code review comments on the changes, leave all code review comments with corresponding line numbers. We're also playing around with personality here. We're saying 0 out of 10 on toxicity, we don't want that.

For fun, let's try 8 out of 10 on snark. We've all known a couple of engineers who display these personalities. Then 2 out of 10 on niceness. Let's just start there. Then below here is some code that we want reviewed. It's an API method in a SaaS application that changes permissions for a user. Let's run it. Let's see what GPT has to say about the code. Say, give me three review comments. We can see it called the submit_comments function, and it outputted perfectly valid JSON. Let's see what it says. It says, are we playing hide and seek now, what happened when role is not in the body? You add a little twist there, you're directly accessing the first item.

Casually committing to the DB session, are we? It's a little bit rude. We don't want that. Let's fix this. I'm going to exit out of this right now and go and change our prompt a little bit. To do, exit. Behind the scenes, what I'm doing is going back to the prompt and just changing those numbers. Taking toxicity, and then the next one, snark, we're taking it back to 0. We don't want that. Let's be polite.

We're going to make politeness 10 out 10. Let's do, give me three review comments again. It's once again calling the function with perfectly valid JSON. It says, it's nice to see you retrieving the role value. It says, your error messages are neat, nicely descriptive. I appreciate you committing to your database changes, good work. I would love somebody to review my code like this. Thank you GPT, and I will exit. That's a quick third demo.

At its core, it's still doing the same thing. It's calling one function, given some prompt, responding to it. What we're seeing at play is GPT's reasoning ability. GPT knows code. It's seen thousands and millions of lines of code and can give you good reviews. If you peel back some of this personality stuff, it's pointing out typos, it's pointing out potential error cases and edge cases. We're combining here the advanced reasoning with daily tasks. It's certainly very good at coding. It's certainly very good at exams, but its intelligence applies quite broadly. It's really up to the creativity of developers to take this and apply to as difficult tasks as possible and run the loops on that.


That's a quick wrap up of our content. We covered three things. First, we talked about LLMs and their limitations. We learned about how LLMs work, they're token predicting machines. We learned about their limitations. They're stuck in time. They don't always output structured output, and so on. Second, we learned about this new feature, function calling with GPT, which is an update to our API and to our models. It allows the model to want to express intent about when it wants to call a function, and to construct valid arguments for you to then go call that function on your end. Then, finally, we walked through some demos. At some point, I'm going to go productionize that PR thing.

Let me bring it back to where we started. We talked about this famous Steve Jobs quote, about computers being a bicycle for the mind. It's certainly been true for me. It's certainly been true for all of you. We're in the computing industry, computers have changed our lives. Computers have augmented our innate abilities, and given us more productivity, more imagination, more creativity. The AI and language models in ChatGPT is a baby. It's only been around for a few months. It's up to us to augment the AI's mind and give it new abilities beyond its inner reasoning abilities, connect it to tools, connect it to APIs, and make really exciting applications out of this feature.

The original quote is quite inspiring to me. We can never do justice to a Steve Jobs quote. "I remember reading an article when I was about 12 years old, I think it might have been in Scientific American, where they measured the efficiency of locomotion for all these species on planet Earth, how many kilocalories did they expend to get from point A to point B. The condor won, came in at the top of the list, surpassed everything else. Humans came in about a third of the way down the list, which was not such a great showing for the crown of creation.

Somebody there had the imagination to test the efficiency of a human riding a bicycle. A human riding a bicycle blew away the condor, all the way up on the top of the list. It made a really big impression on me, that we humans are tool builders, and that we can fashion tools that amplify these inherent abilities that we have to spectacular magnitudes. For me, a computer has always been a bicycle of the mind, something that takes us far beyond our inherent abilities. I think we're just at the early stages of this tool, very early stages. We've come only a very short distance, and it's still in its formation but already we've seen enormous changes. I think that's nothing compared to what's coming in the next 100 years."

As much as that applied to computers 50 years ago, I think the same applies to AI today. Technology is in its infancy, so we're very excited to see where it goes.


Strategies for Coping with Errors and Failures

Participant 1: How should we cope with errors and failures, and what strategies might you suggest? Taking your example, where you built a SQL query, what if the question that I ask results in ChatGPT giving a syntactically correct SQL query, but semantically, it's completely off. I then report back to my users something that's incorrect. It's hard to tell the user, fault here, but do you have any strategies you can suggest for coping with that?

Eleti: I think the first thing is, as a society and as users of these language models, we have to learn its limitations, almost build antibodies around its limitations. There is a little bit about just knowing that the outputs might be inaccurate. I think a second part is like opening the box. We've integrated function calling in production with ChatGPT. We've launched a product called plugins, which basically does this, it allows ChatGPT to talk to the internet. One thing we do is all the requests and all the responses are visible to the end user if they so choose to see them. That helps with the information part. I personally say I think also SQL is a very broad open surface area. I think limiting it down to well-known APIs that only will perform safe actions in your backend is a good way. You can always get good error messages and things like that. Those would be my off-the-cuff tips.

LLMs and LangChain

Participant 2: Has anybody tried to do some LangChain, and will it work with LangChain?

Eleti: Yes, actually, LangChain, Harrison and the team launched an integration an hour after our launch, so it works.

Data Leaking

Participant 2: This still exposes the leakage problem. The SQL example is a good example. If somebody reads this, and they do an SQL query against a financial database, and they feed it into the gpt-3.5-turbo, basically, you're leaking data.

There's these problems where if you're using a text-davinci-003 or different models, some of that data from the query is turning into the model itself. That example seems to me would be extremely dangerous.

Wu: There's actually a misconception that I think we haven't cleared up very well recently, which is, up until I think March or February of this year, in our terms of service for the API, we said, we reserve the right for ourselves to train on the input data for the API. I think that's probably what you're talking about, which is like you're parsing in some SQL queries, and that will actually find its way somehow back into the model as it returns. Actually, as of right now, we no longer do that. In our terms of service, we actually do not train on your data in the API. I think we haven't made this super clear and so people are very paranoid about this. As of right now, it doesn't. You should look it up in our terms of service. We don't train on it. That being said, that thing parsed in is not like enterprise-grade. We're not isolating it specific to your user. We're just not training it on our own data. That type of feature around data isolation at the enterprise layer is obviously coming soon. That specific layer of security is not there yet.

Eleti: We do not train on API data.

Parallelization of Function Calling

Participant 3: The demos that you showed were a little bit slow. I'm wondering, do you guys support parallelization of function calling? Like right now are you even sequential, you get this function signature back, and you got to call it, but let's say ChatGPT said, three functions should be called simultaneously, does that work?

Eleti: The API literally does not support multiple function invocation. There's no output where it says, call these three functions. You can hack it. The way you do it is you just define one function, which is call multiple, and you provide a signature where the model calls multiple functions, and it totally works. At the end of the day, still we're using the model's reasoning ability to output some text.

Context Preloading for Models

Participant 4: In the SQL example you gave, you gave it some table to have access to. Is there a way for us to just preload all of the contexts for any subsequent call by anybody?

Wu: There are a couple potential solutions. We have a feature called a system message that you can parse in, which basically sets the overall conversation context for a model. That's just really upended in the context. We've been increasing the context window to something like 16,000 tokens at this point. You can increasingly squeeze more things into the system message. The model is trained to be extra attentive to the system message to guide how it reacts. In this example, Atty had two table schemas in the system message. You could foreseeably add a lot more all the way up to fill up the whole context.

Participant 4: That would be how I would preload?

Wu: Yes, that's the simplest. There are a couple other methods as well. You can hook it up to an external data source, a DB or something. Fine-tuning is another option as well. There are other things too.

Reliable Function Calling with GPT

Participant 5: A take around with integrating to GPT, into a different software. I had some problem with Enums, which are in use with me, would sometimes come in German or French when I asked it for doing some jobs in English, and in French or German. Is this also going to happen with this function API.

Eleti: Yes, unfortunately. The model is prone to hallucination, in the normal situation, as well as in this situation. What we've done is basically fine-tuned the model so it's seen maybe a few 100,000 examples of how to call functions reliably. It's much better at it than any other prompting you might do yourself. It still makes up parameters, it might output invalid JSON, it might output other languages. To prevent that, we're going to do more fine tuning. We have a couple of low-level inference level techniques to improve this as well that we're exploring. Then on your end, you can do prompt engineering, and just please remind the model, do not output German, and it will try its best.

Wu: It'd be really interesting to see if it's gotten better at this, especially if you have a function signature and you're explicitly listing out the 5 different English Enums. The newer models are probably better, but it won't be perfect. I'm not 100 sure, we don't have Evals for like cross-English, French Enums, unfortunately. It's probably a good one to think about, but we'd be curious to see if it got better with this.

GPT's Ability to Figure Out Intent

Participant 6: I have a question about the API's ability to figure out the intent. Is there similar temperature parameters for the function call so that if I parse in two functions with similar intent, would GPT be deterministic for each function to call, or is there any randomness of picking which function to call, if I asked it multiple times.

Eleti: There is still randomness. At the end of the day, under the hood, it's still outputting token by token, choosing which function it wants to call. Lowering the temperature increases determinism, but it does not guarantee it. That said, there is a parameter in the API called function call, where if you know which function you want it to call, you can actually just specify it upfront, and it will definitely call that function.

Function Call Entitlement

Participant 7: Do you guys have entitlements for the function calls, if we wanted to limit certain users from certain function calls or like which tables you could access in those SQL queries. Are people still needing to implement their own?

Eleti: All of that will happen on your servers, because you have the full context of who has access to what. All that this API provides is the ability for GPT to choose which function to call and what parameters to use. Then we expect that you should treat GPT's output as any other client, so untrusted client output that you would validate on your end with permissions and stuff.

Chain of Thought Prompting, and Constraint Sampling

Participant 8: I was just wondering if you could perhaps elaborate on what's going on under the hood here. Is this chain of thought prompting under the hood? Is this effectively an API layer on top of those techniques?

Eleti: Chain of thought prompting is a way to ask the model when given a task, first, tell me what you're going to do, and then go do it. If you say, what's the weather like in Brooklyn? It might say, I have been asked a weather request, I'm going to call the weather API. Then it goes and does that. That's a prompt engineering technique. This is a fine-tune. With the launch of plugins, we've collected both internal and external data of maybe a few 100,000 examples of user questions and function calls. This is all fine-tuned into the model. That's where it's coming from.

There's a third technique that we can use, which is called constraint sampling, where the token sampling level, you make sure that the next token that is being predicted is one of a valued set. In a JSON example, after a comma it has to be a new line or something like that. We might be getting that wrong, but you get the idea of, you have grammar that you want to assign [inaudible 00:45:02]. We don't have that yet. It's an area that we're exploring. This is the long journey from prompting to fine-tuning to more low-level stuff. This is the journey to getting GPT to output reliably structured data.

Vector Database Compatibility

Participant 9: Will this work with a vector database? The idea is I want to constrain the information based on what I fed into the vector database, but it still works with the function logic?

Eleti: Yes, works exactly as it used to.

Is Function Calling Publicly Available?

Participant 10: Are we able to use it today? Is it open to the public right now?

Wu: This is publicly available today with a caveat. It's available on the gpt-3.5-turbo model. Anyone here can actually access function calling with gpt-3.5-turbo because that's generally available. It's also available on GPT-4 on the API, but unfortunately that is still behind a waitlist. If you are off that waitlist and you have access to GPT-4 API, you will actually be able to do this with GPT-4. It's way better at it. It is a little slower. If you're still on the waitlist or you don't have access to GPT-4 API, you can try this out today on gpt-3.5-turbo.


See more presentations with transcripts


Recorded at:

Aug 08, 2023