AWS recently announced the availability of two new foundation models in Amazon SageMaker JumpStart: Code Llama and Mistral 7B. These models can be deployed with one click to provide AWS users with private inference endpoints for code generation tasks.
Code Llama is a fine-tuned version of Meta's Llama 2 foundation model and carries the same license. It is available in three variants (base, Python, and Instruct), each in three model sizes (7B, 13B, and 34B parameters), for a total of nine options. Besides code generation, it can also perform code infilling, and the Instruct models can follow natural language instructions in a chat format. Mistral 7B is a seven-billion-parameter large language model (LLM) available under the Apache 2.0 license. It comes in two variants: base and Instruct. In addition to code generation, where its performance "approaches" that of Code Llama 7B, Mistral 7B is a general-purpose text generation model that outperforms the larger Llama 2 13B foundation model on all NLP benchmarks. According to AWS:
Today, we are excited to announce Code Llama foundation models, developed by Meta, are available for customers through Amazon SageMaker JumpStart to deploy with one click for running inference. Code Llama is a state-of-the-art large language model (LLM) capable of generating code and natural language about code from both code and natural language prompts. Code Llama is free for research and commercial use. You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML.
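While the announcement highlights one-click deployment from the SageMaker console, the models can also be deployed programmatically. The following is a minimal sketch using the SageMaker Python SDK's JumpStartModel class; the model ID and generation parameters are assumptions and should be checked against the JumpStart model hub:

# Minimal sketch: deploy Code Llama from SageMaker JumpStart and run one inference.
from sagemaker.jumpstart.model import JumpStartModel

# Assumed JumpStart model ID; verify the exact ID in the JumpStart model hub.
model = JumpStartModel(model_id="meta-textgeneration-llama-codellama-7b")
predictor = model.deploy()  # provisions a private real-time inference endpoint

payload = {
    "inputs": "import socket\n\ndef ping_exponential_backoff(host: str):",
    "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9},
}
# Llama-family JumpStart models require explicitly accepting the license (EULA).
response = predictor.predict(payload, custom_attributes="accept_eula=true")
print(response)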
The Mistral 7B models support a context length of up to 8k tokens. This long context can be used for "few-shot" in-context learning in tasks such as question answering, or for maintaining a chat history. The Instruct variant supports a special format for multi-turn prompting:
<s>[INST] {user_prompt_0} [/INST] {assistant_response_0} </s><s>[INST] {user_prompt_1} [/INST]
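As a sketch, a conversation history can be flattened into this format with a small helper; the function below is illustrative and not part of the Mistral release:

# Illustrative helper: assemble a Mistral 7B Instruct multi-turn prompt.
def build_mistral_prompt(turns):
    # turns: list of (user_prompt, assistant_response) pairs; use None for
    # the assistant_response of the final turn the model should answer.
    prompt = ""
    for user, assistant in turns:
        prompt += f"<s>[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant} </s>"
    return prompt

print(build_mistral_prompt([
    ("What is the capital of France?", "Paris."),
    ("And its population?", None),
]))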
The three sizes of Code Llama models support different context lengths: the 7B, 13B, and 34B models support 10k, 32k, and 48k tokens, respectively; however, the 7B models only support 10k tokens when deployed on ml.g5.2xlarge instance types. All models can perform code generation, but only the 7B and 13B models can perform code infilling. This task prompts the model with a code prefix and a code suffix, and the model generates code to place between them. Special input tokens, <PRE>, <SUF>, and <MID>, mark their locations in the prompt. The model can accept these pieces in one of two orderings: suffix-prefix-middle (SPM) and prefix-suffix-middle (PSM). Meta's paper on Code Llama recommends using PSM when "the prefix does not end in whitespace or a token boundary." The PSM format is:
<PRE> {prefix_code} <SUF>{suffix_code} <MID>
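As an illustration, an infilling request might be constructed as follows, reusing the predictor from the deployment sketch above; the generation parameters are assumptions:

# Sketch: PSM-format infilling prompt for the 7B or 13B Code Llama models.
prefix = 'def remove_non_ascii(s: str) -> str:\n    """ '
suffix = '\n    return result'

# Note: no space between <SUF> and the suffix code, matching the format above.
payload = {
    "inputs": f"<PRE> {prefix} <SUF>{suffix} <MID>",
    "parameters": {"max_new_tokens": 128, "temperature": 0.05},
}
response = predictor.predict(payload, custom_attributes="accept_eula=true")
# The generated text is the "middle" code to insert between prefix and suffix.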
The Instruct version of Code Llama is designed for chat-like interaction and, according to Meta, "significantly improves performance" on several NLP benchmarks, at a "moderate cost" to its code generation performance. An example application of this model is to generate and explain code-based solutions to problems posed in natural language; for example, how to use Bash commands for certain tasks. Code Llama Instruct uses a special prompt format similar to that of Mistral 7B, with the option of a "system" prompt:
<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>
{user_prompt_0} [/INST] {assistant_response_0} </s><s>[INST] {user_prompt_1} [/INST]
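A single-turn request in this format, assuming the endpoint accepts the raw prompt string shown above, might look like the following sketch:

# Sketch: Code Llama Instruct request with a system prompt (single turn).
system = "Answer with a single Bash command and a one-line explanation."
user = "How do I recursively find all files containing the word TODO?"
payload = {
    "inputs": f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n{user} [/INST]",
    "parameters": {"max_new_tokens": 256, "temperature": 0.2},
}
response = predictor.predict(payload, custom_attributes="accept_eula=true")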
The Code Llama announcement says that the models are available in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) regions. AWS has not announced the regions where Mistral 7B is available.