Maximizing the Utility of Large Language Models (LLMs) through Prompting

Key Takeaways

  • Prompt engineering is about experimenting with changes in prompts to understand their impact on what large language models (LLMs) generate as output. Even a few basic techniques yield better outcomes for LLM use.
  • Zero-shot prompting is when an LLM is given a task, via prompt, without any examples demonstrating how to complete it.
  • Across the language tasks in the literature, performance improves when a few examples are included in the prompt; this is known as few-shot prompting.
  • Chain-of-Thought (CoT) prompting breaks multi-step problems down into intermediate steps, allowing LLMs to tackle complex reasoning that can't be solved with zero-shot or few-shot prompting.
  • Built upon CoT, self-consistency prompting is an advanced technique that samples multiple, diverse reasoning paths from the LLM and then selects the most consistent answer among the generated responses.

A new job title, "prompt engineer," has made waves in tech media recently, bursting onto the careers pages of top AI companies with promises of eye-wateringly high salaries. But what even is prompt engineering? The term itself was coined only within the past few years and refers to the art and science of prompting large language models (LLMs) to achieve better results.

These LLMs, which include the popular ChatGPT and GPT-4 from OpenAI, Claude from Anthropic, and Bard from Google, can be used for all types of language tasks, from drafting emails and documents to summarizing or classifying text.

Prompting, essentially, refers to the practice of giving inputs to the LLMs, typically a set of instructions describing how to perform the desired task. For example, to use ChatGPT to draft a cover letter based on a job description, one might use the prompt,

"You are an applicant applying for the following job. Write a cover letter explaining why you are well-suited for the role."

Easy enough, right? Here's where it gets tricky: researchers have found that the results provided by LLMs depend heavily on the specific prompt given, and so even though it might seem trivial to explain a well-defined task like writing a cover letter, simple adjustments like wording and formatting changes can substantially affect the output received.
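As a concrete illustration, here is a minimal sketch of sending that cover-letter prompt to an LLM programmatically. It assumes a recent version of the OpenAI Python SDK with an API key configured in the environment; the model name and the job_description placeholder are illustrative only.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

job_description = "..."  # paste the job posting here

prompt = (
    "You are an applicant applying for the following job. "
    "Write a cover letter explaining why you are well-suited for the role.\n\n"
    + job_description
)

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)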

Prompt engineering is fundamentally experimenting with changes in prompts to understand their impacts on what the model generates, and therefore doesn't require an advanced technical background - merely a little curiosity and creativity.

Furthermore, prompt engineering is not just the purview of highly paid programmers. We argue that everyone who uses LLMs can, and should, be a prompt engineer. The most basic reason is that prompt engineering will yield better outcomes for LLM use, and with even a few basic techniques, it's possible to measurably improve performance on many common tasks.

Few-Shot Prompting

Let's consider a task for which someone might want to employ an LLM: classifying customer reviews by sentiment. When an LLM is given a task, via prompt, without any examples demonstrating how to complete it, this is referred to as zero-shot prompting. In this case, the task would be a zero-shot classification task. The zero-shot prompt might look like the figure shown below, where the user gives Google's Bard instructions for classifying movie reviews.

This might work reasonably well on its own, but across a variety of language tasks in the literature, performance nearly always improves with a few examples, or a few-shot prompt. In this case, the few-shot prompt might start the same way, but append several examples of the expected output. Again, we see that Bard generates a response of "POSITIVE" for the final unlabeled review.

More precisely, this prompt would be called two-shot, referring to the fact that two examples are provided to the model. The ability to measurably improve at a task simply from examples given in the prompt is unique to LLMs; as a result, unlike other machine learning models, LLMs are able to perform a wide variety of tasks that they were not explicitly trained to do.
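To make the difference concrete, here is a minimal sketch of how a two-shot sentiment-classification prompt might be assembled in code. The example reviews are purely illustrative, and the resulting string can be sent as the user message of a chat completion, just like the cover-letter prompt earlier.

few_shot_prompt = """Classify the sentiment of the following movie reviews as POSITIVE or NEGATIVE.

Review: "A dull, plodding film that never finds its footing."
Sentiment: NEGATIVE

Review: "A heartfelt story with terrific performances from the whole cast."
Sentiment: POSITIVE

Review: "I was smiling from the opening scene to the closing credits."
Sentiment:"""

# Send few_shot_prompt as the user message of a chat completion; the model
# is expected to continue the pattern and reply with POSITIVE.
print(few_shot_prompt)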

However, every approach comes with its own limitations. While few-shot prompting is an effective technique for numerous tasks, it tends to fall short when addressing more complex reasoning tasks. Let’s look at an example. In the figure below, we provided ChatGPT with a multi-step reasoning task: do the odd numbers in the given group add up to an even number?
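The few-shot prompt for this task might look roughly like the sketch below; the specific numbers are illustrative rather than the exact ones shown in the figure.

# A few-shot prompt that states only the final answers, with no reasoning.
few_shot_prompt = """The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: The answer is False.

The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.
A: The answer is True.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:"""

print(few_shot_prompt)  # the odd numbers sum to 41, so the correct answer is False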

Even though we used few-shot prompting to provide the LLM with multiple examples, it did not give us the correct answer. To solve this task, we need to first determine the odd numbers. Then, we need to add the odd numbers together, and finally, we need to determine if the resulting number is odd or even. That is a multi-step reasoning task, so it might help if we break the problem down into multiple steps and demonstrate it to the model.

Chain-of-Thought Prompting

This is where Chain-of-Thought (CoT) prompting comes in. Introduced by Google researchers, CoT prompting breaks down multi-step problems into intermediate steps, allowing LLMs to tackle complex reasoning that can’t be solved with zero-shot or few-shot prompting. Here, instead of directly asking the LLM to solve the problem, we use CoT prompting to guide the model through the intermediate reasoning steps. Let’s revisit the previous example.

By providing the intermediate reasoning step, ChatGPT is able to get the correct answer with only one example. This technique can be extremely beneficial for complex multi-step tasks - if the target task involves multiple steps or sequential reasoning, CoT can help break down the problem into smaller and more manageable tasks. A more recent idea is zero-shot CoT, where you ask the LLM to think through the intermediate reasoning tasks itself by adding "let’s think step by step" to the original prompt. Let’s demonstrate this with our example again.

Without any examples, the LLM is able to correctly perform the multi-step reasoning task with a simple prompt, something it was previously unable to do with few-shot prompting. CoT prompting is a useful technique for multi-step reasoning problems, for tasks that benefit from intermediate explanations, or simply for tasks where standard prompting techniques fall short.
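Here is a minimal sketch of what the two CoT variants might look like for this task. The worked example in the one-shot prompt spells out the intermediate reasoning that the plain few-shot prompt omitted, while the zero-shot CoT variant simply appends the trigger phrase.

# One-shot CoT: the single example demonstrates the intermediate reasoning.
cot_prompt = """The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: The odd numbers are 9, 15, and 1. Their sum is 25, which is odd. The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.
A:"""

# Zero-shot CoT: no examples, just the reasoning trigger appended to the question.
zero_shot_cot_prompt = (
    "The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.\n"
    "Let's think step by step."
)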

Self-Consistency Prompting

Built upon CoT, self-consistency prompting is an advanced prompting technique, also introduced by Google researchers. This approach samples multiple, diverse reasoning paths from the LLM and then selects the most consistent answer among the generated responses. Self-consistency is most powerful when used in conjunction with CoT, boosting its performance on complex reasoning tasks. Let’s look at an example from the research paper demonstrating how the self-consistency approach works. First, you prompt the LLM using CoT prompting. Then, you sample a diverse set of responses, and finally, you select the most consistent response as your answer. By using a majority-voting scheme over the sampled answers, the model is more likely to arrive at an accurate and reliable answer.

The three steps in the self-consistency method: 1) prompt the language model using CoT prompting, 2) sample a diverse set of reasoning paths from the language model, and 3) marginalize out reasoning paths to aggregate final answers and choose the most consistent answer. (source)
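Those three steps can also be sketched in code. The snippet below assumes the OpenAI Python SDK, an illustrative model name, and a deliberately naive answer extractor; the essential ideas are sampling several reasoning paths with a nonzero temperature and taking a majority vote over the final answers.

from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

cot_prompt = (
    "The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.\n"
    "Let's think step by step, then state the final answer as True or False."
)

# Steps 1 and 2: prompt with CoT and sample several diverse reasoning paths.
response = client.chat.completions.create(
    model="gpt-4",    # illustrative model name
    messages=[{"role": "user", "content": cot_prompt}],
    temperature=0.8,  # nonzero temperature encourages diverse reasoning paths
    n=5,              # number of sampled responses
)

# Step 3: extract each final answer and keep the most consistent one.
answers = []
for choice in response.choices:
    tail = choice.message.content.lower()[-100:]  # naive: look near the end of the response
    answers.append("True" if "true" in tail else "False")

final_answer = Counter(answers).most_common(1)[0][0]
print(final_answer)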

Let’s look at another example of analyzing emails. Suppose that a software consulting company receives hundreds of emails every day, many of which are promotional content or spam. They want to classify each email as IMPORTANT or NOT IMPORTANT.

Here is an example email that the company might receive:

 

Hello!

I trust this email finds you well. We are currently facing a critical challenge within our organization and believe your expertise could provide valuable insights. Given your successful track record in software engineering, we believe your team's insights and recommendations could be instrumental in resolving this matter.

Could we please schedule a meeting at your earliest convenience to discuss the details and explore how we might collaborate to address this challenge?

Your prompt attention to this matter is greatly appreciated.

Best regards,
John Doe

 

Now, we can put this email into a prompt and ask the LLM to "Classify the above email as IMPORTANT or NOT IMPORTANT as it relates to a software consulting company. Let’s think step by step." We can then use this prompt to generate multiple responses and select the most consistent one.

 

RESPONSE: The email explicitly mentions facing a critical challenge within the organization, which signifies the importance of the matter. I would classify this email as IMPORTANT.

 

RESPONSE: The email mentions facing a critical challenge within the organization, but they have not described the issue or challenge. I would classify this email as NOT IMPORTANT.

 

RESPONSE: While not explicitly stated, the tone of the email and the request for prompt attention suggest a sense of urgency, emphasizing the importance of a timely response. I would classify this email as IMPORTANT.

By exploring multiple paths, the LLM is more likely to arrive at a correct or well-informed response. Additionally, it can help counteract biases that may be inherent in the model or the data it was trained on. Although self-consistency prompting has shown an improvement in results where CoT alone has been ineffective, it falls short on tasks that require exploration or strategic lookahead.

Tree-of-Thoughts Prompting

Proposed by Yao et al. (2023) and Long (2023), Tree-of-Thoughts (ToT) is a prompting framework that extends beyond CoT by encouraging exploration of thoughts that serve as intermediate steps for solving problems more generally with LLMs. This technique asks the LLM to consider multiple candidate solutions at each step, and the model is encouraged to continually evaluate its intermediate results, making its decision-making process more deliberate and boosting its confidence in the chosen solution. In other words, it leverages dynamic decision-making by generating intermediate steps and potential solutions, which are then evaluated to determine whether or not they are on the right path.

For example, if the task is to create a business strategy, the LLM would first generate multiple potential initial steps for the strategy. Once these initial thoughts are generated, you can task the model with self-critiquing each of them with respect to the input prompt. Here, the LLM would evaluate how well each thought or step aligns with the problem-solving objective. This assessment phase can involve ranking each thought or, if appropriate, assigning scores. The thoughts that are evaluated as less useful or suitable can then be discarded, and you can expand upon the remaining ones. In this framework, you continue a similar process of self-critique and ranking until you reach a final decision. This technique allows the LLM to evaluate and pursue multiple paths simultaneously.

Here is a step-by-step process to leverage a simplified version of the ToT framework (a minimal prompting sketch follows the list):

  • Phase 1: Brainstorming - Ask the LLM to generate three or more options while considering various factors.
  • Phase 2: Evaluation - Ask the LLM to objectively assess each option’s potential success by evaluating their pros and cons.
  • Phase 3: Expansion - Ask the LLM to go deeper into suitable ideas, refine them, and imagine their implications in real-world contexts.
  • Phase 4: Decision - Ask the LLM to rank or score each solution based on the evaluations and scenarios generated.
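A minimal sketch of driving these four phases with sequential prompts is shown below. The ask helper, the model name, and the phase wording are all illustrative; the important part is that each phase feeds the previous phase's output into the next prompt.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    # Illustrative helper: one user message in, one model reply out.
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

task = "Create a business strategy for entering a new market."

# Phase 1: Brainstorming
options = ask(f"{task}\nBrainstorm three distinct options, considering cost, risk, and time to market.")

# Phase 2: Evaluation
evaluation = ask(f"Objectively assess the pros and cons of each option below.\n\n{options}")

# Phase 3: Expansion
expansion = ask(f"Refine the most promising options below and describe how each would play out in a real-world context.\n\n{evaluation}")

# Phase 4: Decision
decision = ask(f"Rank the options analyzed below and recommend one, justifying the choice.\n\n{expansion}")

print(decision)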

The ToT framework has greatly improved performance over CoT on problems that require search, such as mathematical puzzle games, crossword puzzles, and even creative writing. However, it requires multiple prompts and generations before reaching a final answer. More recently, Hulbert introduced zero-shot Tree-of-Thought prompting, which applies the fundamental concept of the ToT framework in a single prompt. His prompt is:

"Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realises they're wrong at any point then they leave. The question is ..."

Let’s look at Hulbert’s original example below with a response from ChatGPT.

 

INPUT: Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realizes they're wrong at any point then they leave. The question is...

Bob is in the living room. He walks to the kitchen, carrying a cup. He puts a ball in the cup and carries the cup to the bedroom. He turns the cup upside down, then walks to the garden. He puts the cup down in the garden, then walks to the garage. Where is the ball?

RESPONSE:
Expert 1: The ball is in the living room.

Expert 2: The ball is in the kitchen.

Expert 3: The ball is in the bedroom.

Expert 1: Bob carries the cup to the bedroom, so the ball must be in the cup.

Expert 2: Oh, I see my mistake. Yes, the ball is in the cup.

Expert 3: Agreed, the ball is in the cup in the bedroom.

Expert 1: Next, Bob turns the cup upside down in the bedroom.

Expert 2: After that, Bob walks to the garden and puts the cup down.

Expert 3: Therefore, the ball must have fallen out of the cup when Bob turned it upside down in the bedroom. So, the ball is in the bedroom, not in the cup anymore.

Expert 1: Oh, you're right. I made a mistake. The ball is in the bedroom, not in the cup.

Expert 2: Agreed, the ball is in the bedroom.

Expert 3: Bob then walks to the garage, so the ball remains in the bedroom. It is not in the garden or the garage.

Expert 1: Absolutely, the ball is still in the bedroom.

Expert 2: Yes, the ball hasn't moved from the bedroom.

Expert 3: Therefore, the ball is in the bedroom, not in the garden or the garage.

All three experts agree that the ball is in the bedroom.

Inspired by Hulbert’s original prompt, here are additional variants of zero-shot Tree-of-Thought prompts to try:

  • "Simulate three brilliant, logical experts collaboratively answering a question. Each one verbosely explains their thought process in real-time, considering the prior explanations of others and openly acknowledging mistakes. At each step, whenever possible, each expert refines and builds upon the thoughts of others, acknowledging their contributions. They continue until there is a definitive answer to the question. For clarity, your entire response should be in a markdown table. The question is ..."
  • "Imagine three highly intelligent experts working together to answer a question. They will follow a tree of thoughts approach, where each expert shares their thought process step by step. They will consider the input from others, refine their thoughts, and build upon the group's collective knowledge. If an expert realizes their thought is incorrect, they will acknowledge it and withdraw from the discussion. Continue this process until a definitive answer is reached. Present the entire response in a markdown table. The question is ..."
  • "Three experts with exceptional logical thinking skills are collaboratively answering a question using a tree of thoughts method. Each expert will share their thought process in detail, taking into account the previous thoughts of others and admitting any errors. They will iteratively refine and expand upon each other's ideas, giving credit where it's due. The process continues until a conclusive answer is found. Organize the entire response in a markdown table format. The question is ..."
  • "Envision a group of three experts working in unison to tackle a question by employing a tree of thoughts strategy. Each expert will thoroughly explain their line of thinking at every step, while also considering the insights provided by their peers. They will openly recognize any mistakes and build upon the group's shared understanding. This iterative process will continue until a definitive solution is reached. Structure the entire response as a markdown table. The question is ..."

We can apply this general strategy to any number of decisions that might be encountered in real-world scenarios as a means of quickly generating several perspectives on an issue. Continuing the hypothetical conversation through prompts might or might not result in a consensus among the experts, but is likely to yield insights about the tradeoffs involved in the decision.

 

INPUT: Imagine three different experts are arguing about how to design a file-sharing system.  The system should scale to millions of users and handle petabytes of data. Each expert should have a different idea for the system architecture, and they should discuss the tradeoffs of each approach.

RESPONSE:

Expert 1: Propose a Decentralized Peer-to-Peer Architecture
In my opinion, a decentralized peer-to-peer (P2P) architecture is the way to go. Users share files directly with each other, eliminating the need for a central server to handle all requests. This distributes the load across the network, making it more scalable. However, the challenge lies in managing security, ensuring data integrity, and handling dynamic network conditions. Also, it may be challenging to implement complex features like access control and versioning.

Expert 2: Argue for a Distributed Cloud-Based Architecture
I advocate for a distributed cloud-based architecture. Leverage cloud storage services and distribute data across multiple servers in different geographical regions. This approach ensures high availability, scalability, and easier management of petabytes of data. The cloud's infrastructure can handle the heavy lifting, but concerns may arise regarding data privacy, potential costs, and reliance on a third-party service. Additionally, careful consideration must be given to data sharding and load balancing to ensure optimal performance.

Expert 3: Argue for a Hybrid Architecture
I propose a hybrid architecture that combines the strengths of both decentralized P2P and distributed cloud-based systems. Use P2P for user-to-user file sharing to offload some of the traffic from centralized servers. Simultaneously, leverage cloud storage for redundancy, backup, and to ensure data availability across the network. This hybrid model aims to balance the advantages of decentralization with the reliability and scalability offered by cloud services. However, managing the synchronization between these two components introduces complexity and potential challenges in maintaining consistency.

Conclusion

In this article, we have covered a few of the latest prompt engineering techniques from academic papers. While these techniques have been validated empirically, there is still a lot that we don't know about exactly why they work. What this means is that the best way to become a better prompt engineer is by prompt engineering - trying out various prompts, observing their results, tweaking and iterating on them, rinse and repeat.

This exercise is worthwhile not only for professional prompt engineers, but anyone who wants to integrate LLMs into their workflows. When done properly, prompting can be used to alter the style and format, improve clarity or accuracy, and ensure consistency of the AI-generated responses. Prompt engineering is an overlooked but incredibly powerful tool for taking advantage of the transformative technology of LLMs.
