Meta AI’s Large Language Model with 10x Fewer Parameters

Meta AI recently released a new large language model called Large Language Model Meta AI (LLaMA). According to the paper, the 13-billion-parameter version outperforms GPT-3 on most benchmarks despite having more than 10 times fewer parameters, and the largest version is competitive with PaLM. LLaMA is released as a collection of language models ranging from 7 billion to 65 billion parameters in size.

Source: LLaMA: Open and Efficient Foundation Language Models

The graph above shows that language models with far fewer parameters than GPT-3 or PaLM, such as LLaMA 7B (the version with 7 billion parameters), can outperform larger models if they are trained on more data.

The datasets used to train LLaMA contain roughly 1.4 trillion tokens from public sources such as GitHub, Wikipedia, arXiv and Stack Exchange. The text was tokenized with byte-pair encoding (BPE), using the SentencePiece implementation.
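To illustrate the idea behind byte-pair encoding, here is a minimal sketch in Python. It is a toy and not the SentencePiece implementation that LLaMA uses; the tiny corpus and the function names (merge_pair, learn_bpe, encode) are assumptions made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    corpus = Counter(tuple(w) for w in words)  # symbol tuple -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(merge_pair(list(symbols), best))] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize a word by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical toy corpus; a real tokenizer is trained on the full dataset.
merges = learn_bpe(["low", "low", "lower", "lowest", "newer", "wider"], num_merges=3)
print(merges)
print(encode("lowest", merges))
```

Frequent character sequences such as "low" end up as single tokens, while rarer words are split into smaller pieces, which keeps the vocabulary compact without producing unknown tokens.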

The model architecture is based on the transformer. Meta AI researchers use pre-normalization, that is, they normalize the input of each transformer sub-layer instead of its output. LLaMA also uses the SwiGLU activation function and replaces absolute position embeddings with rotary embeddings, which encode position by rotating the query and key vectors so that attention scores depend on the relative position between tokens. The models were trained with the AdamW optimizer, a variant of Adam that decouples weight decay from the gradient update and tends to generalize better, together with gradient clipping at 1.0.
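The building blocks named above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not Meta AI's code: it works on small Python lists instead of tensors, the function names and the epsilon values are made up for this example, and RMSNorm is used as the normalizing function as named in the paper. The AdamW update itself is not shown.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then scale by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def swish(v):
    """Swish (SiLU) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward: W_down @ (swish(W_gate @ x) * (W_up @ x))."""
    gate = [swish(v) for v in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_down, [g * u for g, u in zip(gate, up)])


def pre_norm_sublayer(x, gain, sublayer):
    """Pre-normalization: normalize the sub-layer input, then add the residual."""
    return [a + b for a, b in zip(x, sublayer(rms_norm(x, gain)))]


def rotary(x, position, base=10000.0):
    """Rotate consecutive pairs of x by angles that grow with the token position."""
    out = []
    for i in range(0, len(x), 2):
        theta = position * base ** (-i / len(x))
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so that their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q, k = [1.0, 0.5, -0.3, 2.0], [0.2, -1.0, 0.7, 0.4]
# The same relative offset (2) at different absolute positions gives the same score.
print(dot(rotary(q, 5), rotary(k, 3)), dot(rotary(q, 12), rotary(k, 10)))
print(clip_by_global_norm([3.0, 4.0]))
```

The two printed dot products agree up to floating-point error, which is the property that makes rotary embeddings a relative position encoding.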

An efficient implementation of causal multi-head attention from the xformers library reduces memory usage and running time. To further improve training efficiency, Meta AI researchers used checkpointing to reduce the number of activations that are recomputed during the backward pass.
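The trade-off behind checkpointing, storing fewer activations at the cost of recomputing some of them during the backward pass, can be shown with a toy example. The sketch below is generic activation checkpointing on a chain of scalar layers, not Meta AI's implementation; the layer definition, the segment length and the function names are assumptions chosen for illustration.

```python
import math


def layer(w, x):
    """One toy layer: a scalar weight followed by tanh."""
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of layer(w, x) with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_full_storage(weights, x):
    """Baseline: keep every layer input from the forward pass."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, inp in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, inp)
    return grad, len(inputs)


def grad_checkpointed(weights, x, segment=4):
    """Keep one input per segment; recompute the rest during the backward pass."""
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x  # input to the first layer of this segment
        x = layer(w, x)
    grad = 1.0
    peak = len(checkpoints)  # rough count of values held at once
    for start in sorted(checkpoints, reverse=True):
        seg_weights = weights[start:start + segment]
        # Recompute this segment's layer inputs from its checkpoint.
        inputs, h = [], checkpoints[start]
        for w in seg_weights:
            inputs.append(h)
            h = layer(w, h)
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, inp in zip(reversed(seg_weights), reversed(inputs)):
            grad *= layer_grad(w, inp)
    return grad, peak


weights = [0.9, -1.1, 0.7, 1.3] * 4  # 16 hypothetical layers
print(grad_full_storage(weights, 0.5))
print(grad_checkpointed(weights, 0.5))
```

Both functions return the same gradient, but the checkpointed version holds about half as many values at once in this setting. How much is saved and how much is recomputed depends on which activations are kept.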

LLaMA performs better than PaLM and GPT-3 on language tasks such as question answering (Natural Questions), common-sense reasoning and mathematical reasoning, despite having far fewer parameters. This is mainly due to being trained for longer, together with the architectural and optimization techniques mentioned earlier. For instance, on exact match (EM), which measures the percentage of questions for which the predicted answer is identical to a correct answer, LLaMA 33B scores 24.9, ahead of GPT-3 (14.6), PaLM-540B (21.2) and Chinchilla-70B (16.6).
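As a rough illustration of how an exact-match score is computed, here is a short Python sketch. The normalization step (lowercasing, stripping punctuation and extra whitespace) and the sample questions are assumptions for this example and may differ from the evaluation code used in the paper.

```python
import string


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(predictions, references):
    """Percentage of questions whose prediction equals any reference answer."""
    hits = sum(
        normalize(pred) in {normalize(ref) for ref in refs}
        for pred, refs in zip(predictions, references)
    )
    return 100.0 * hits / len(predictions)


# Hypothetical model outputs and reference answer lists.
predictions = ["Paris", "the Nile", "1969", "Mount Everest"]
references = [["Paris"], ["Amazon", "Amazon River"], ["1969"], ["K2"]]
print(exact_match(predictions, references))  # 50.0
```

Because the metric gives no credit for partially correct answers, scores in the range of 15 to 25 on open-ended questions are less surprising than they might first appear.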

Source: LLaMA: Open and Efficient Foundation Language Models

LLaMA can be used for generating text, holding conversations, summarizing written material, and more complicated tasks such as proving mathematical theorems or predicting protein structures. So far, however, the community appears to be using it mainly for text generation and conversation.

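Large language models have been shown to reproduce and amplify biases in the training data, and to generate toxic or offensive content. According to the paper, LLaMA 65B shows slightly less bias on average than GPT-3 on the CrowS-Pairs benchmark, although the results vary by category and the authors note that the model is particularly biased in the religion category.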

To run the inference code with the downloaded weights (licensed for research use only), use the following command, where MP and model_size are placeholders for the model-parallel value and the model size:

torchrun --nproc_per_node MP example.py --ckpt_dir $TARGET_FOLDER/model_size --tokenizer_path $TARGET_FOLDER/tokenizer.model

You can also check out the model implementation in LLaMA’s GitHub repository. In addition, there is a prompt UI hosted on Hugging Face that uses LLaMA 7B. Although the weights are not licensed for use beyond academia, they were leaked and can be downloaded via a torrent file.

On social media, people seem to like the model’s lower computational burden compared with GPT-3 or PaLM at comparable performance, but they question why the weights are shared only with academics. The AI community has also discussed its inference speed compared with GPT-3 or PaLM.