
RWKV Project Open-Sources LLM Eagle 7B


The RWKV Project recently open-sourced Eagle 7B, a 7.52B-parameter large language model (LLM). Eagle 7B is trained on 1.1 trillion tokens of text spanning over 100 languages and outperforms other similarly sized models on multilingual benchmarks.

Eagle 7B is based on the Receptance Weighted Key Value (RWKV) architecture, described as an attention-free Transformer that combines the benefits of both Transformers and recurrent neural networks (RNNs) while reducing their drawbacks; in particular, the model has no maximum input context length. The architecture has also been benchmarked as the most energy-efficient in its class, measured in joules per token. Eagle 7B outperforms other 7B-parameter LLMs, including Mistral, Falcon, and Llama 2, on several multilingual benchmarks. The RWKV Project is supported by the Linux Foundation, and Eagle 7B carries the Apache 2.0 license, making it available for both personal and commercial use. According to the Project team:

RWKV opens a new route for scalable and efficient architectures to model complex relationships in sequential data. While many alternatives to Transformers have been proposed with similar claims, ours is the first to back up those claims with pretrained models with tens of billions of parameters.

Before Google published its work on Transformers, RNN-based models were the state-of-the-art solution for many AI applications, particularly in multilingual NLP domains such as translation. The Transformer was an attractive alternative, since RNNs are difficult to train and their inherently serial computation makes them slower than Transformers. However, Transformers have their own drawbacks. In particular, their self-attention mechanism has quadratic complexity in both compute and storage, which limits their practical input context length.
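To illustrate that quadratic cost, the sketch below (not from the article) computes single-head scaled dot-product attention in PyTorch: the score matrix holds one entry per pair of positions, so memory and compute grow with the square of the context length.

```python
import torch

# Illustration of self-attention's quadratic scaling: for a sequence of
# length n, the score matrix is n x n, so doubling the context roughly
# quadruples the attention compute and memory.
n, d = 4096, 64                       # context length, head dimension
q, k, v = (torch.randn(n, d) for _ in range(3))

scores = (q @ k.T) / d ** 0.5         # shape (n, n): the quadratic term
weights = torch.softmax(scores, dim=-1)
out = weights @ v                     # shape (n, d)

print(scores.shape)                   # torch.Size([4096, 4096])
```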

To solve these problems, RWKV uses a variant of the Attention-Free Transformer (AFT), with a modification that allows the model to be formulated as an RNN. This formulation makes the model efficient during inference, when it is used for autoregressive generation. However, during training, many of the model's matrix operations can be parallelized, as with a standard Transformer.
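The sketch below is an illustrative simplification of that idea, not the RWKV Project's implementation: it expresses an AFT-style weighted average of past values as a per-channel recurrence, so each new token updates a fixed-size state and inference cost grows linearly with sequence length. Tensor names are assumptions, and the real RWKV kernels add numerical-stability tricks and other details omitted here.

```python
import torch

def wkv_recurrent(k, v, w, u):
    """Simplified AFT-style recurrence in the spirit of RWKV time-mixing.

    k, v: (T, C) keys and values; w: (C,) positive per-channel decay;
    u: (C,) bonus applied to the current token.
    """
    T, C = k.shape
    num = torch.zeros(C)              # running weighted sum of values
    den = torch.zeros(C)              # running sum of weights
    out = torch.empty(T, C)
    decay = torch.exp(-w)
    for t in range(T):                # O(T): one fixed-size state update per token
        cur = torch.exp(u + k[t])
        out[t] = (num + cur * v[t]) / (den + cur)
        num = decay * num + torch.exp(k[t]) * v[t]
        den = decay * den + torch.exp(k[t])
    return out

T, C = 16, 8
y = wkv_recurrent(torch.randn(T, C), torch.randn(T, C),
                  torch.full((C,), 0.5), torch.zeros(C))
print(y.shape)                        # torch.Size([16, 8])
```

During training, the same sums can be computed for all timesteps at once, which is what allows the architecture to be trained with Transformer-like parallelism.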

The RWKV architecture does have known limitations. While it does not have a maximum input context length, it may not perform as well as attention-based models on tasks that require "looking back" in a very long context. For the same reason, it also requires "carefully designed prompts," as prompt information may be lost during inference.

In a discussion about Eagle 7B on Hacker News, one user touted its advantages:

These models don't have a fixed context size and are progressively fine-tuned for longer and longer contexts. The context length also doesn't impact inference cost. Another aspect of performance is not just how well does the trained model perform, but is it data efficient (performance per token trained)?

Lead RWKV developer Peng Bo posted about the model on X, showing its performance on what he called an "uncheatable" benchmark: calculating the model's perplexity on new papers posted to arXiv:

Arxiv is the beginning. We can use latest news, github repos, arxiv paper, blog posts, new wiki entries, and more. The point is to benchmark LLMs on new data - although they can be polluted by ChatGPT too, it is still better than using very old (and actually noisy) evals.
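A minimal sketch of that kind of check, assuming a causal language model loaded through the Hugging Face transformers API (the model ID and input file below are placeholders, not names from the article), could look like this:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: substitute the model you want to evaluate.
model_id = "your-org/your-causal-lm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Text published after the model's training cutoff, e.g. a new arXiv abstract.
text = open("new_arxiv_abstract.txt").read()
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # next-token cross-entropy loss over the sequence.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {math.exp(loss.item()):.2f}")
```

Lower perplexity on data the model could not have seen during training gives a rough, contamination-resistant signal of how well it generalizes.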

The Eagle 7B code is available on GitHub, and the model weights are available on Hugging Face.
