Hugging Face introduced the Retrieval Embedding Benchmark (RTEB), a new evaluation framework designed to measure more accurately how well embedding models generalize to real-world retrieval tasks. Currently in beta, the benchmark aims to establish a community standard for evaluating retrieval accuracy on both open and private datasets.
Retrieval quality is crucial for AI systems such as retrieval-augmented generation (RAG), intelligent agents, enterprise search, and recommendation engines. However, existing benchmarks often fail to reflect real-world performance: models may score well on public benchmarks yet fall short in production because they have been indirectly trained on the evaluation data, producing a "generalization gap." This makes it difficult for developers to predict how their models will handle unseen data.
RTEB tackles this problem with a hybrid evaluation strategy. It combines open datasets, which are public and reproducible, with private datasets that remain accessible only to the MTEB maintainers, ensuring that results reflect genuine generalization rather than memorization. For each private dataset, only descriptive statistics and sample examples are released, maintaining transparency while preventing data leakage.
In addition to its methodological improvements, RTEB focuses on real-world applicability. It includes datasets across critical domains such as law, healthcare, finance, and code, covering 20 languages from English and Japanese to Bengali and Finnish. The benchmark’s simplicity is also deliberate: datasets are large enough to be meaningful but small enough to enable efficient evaluation.
The launch of RTEB has already sparked discussion among AI researchers and practitioners. On LinkedIn, Shai Nisan, Ph.D., head of AI at Copyleaks, commented:
Beautiful work! Thank you for this. Anyways, it's highly important to have your own private benchmark on your specific task. That's the best way to predict success.
Tom Aarsen, one of the benchmark’s co-authors and a maintainer of Sentence Transformers at Hugging Face, replied:
That’s the be-all-end-all, but not everyone has that data ready. If you can, though: use your own tests. E.g. Sentence Transformers allows for easily swapping out models.
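Aarsen's suggestion refers to the Sentence Transformers library, where a retrieval check on your own data can be re-run against a different model simply by changing the model identifier. The sketch below illustrates that idea on a toy corpus; the model names, documents, and queries are illustrative choices for this example, not part of RTEB or the article.

```python
# Minimal sketch: run the same toy retrieval test with two embedding models,
# swapping models by changing only the identifier passed to SentenceTransformer.
from sentence_transformers import SentenceTransformer, util

# Illustrative documents and queries (stand-ins for your own private test set).
corpus = [
    "The court ruled that the contract was void.",
    "The patient was prescribed a daily dose of 10 mg.",
    "Quarterly revenue grew by 12 percent year over year.",
]
queries = ["contract dispute ruling", "medication dosage"]

for model_name in ["all-MiniLM-L6-v2", "multi-qa-mpnet-base-dot-v1"]:
    model = SentenceTransformer(model_name)

    # Embed corpus and queries, then retrieve the best-matching document per query.
    corpus_emb = model.encode(corpus, convert_to_tensor=True)
    query_emb = model.encode(queries, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=1)

    for query, result in zip(queries, hits):
        top = result[0]
        best = corpus[top["corpus_id"]]
        print(f"[{model_name}] {query!r} -> {best!r} (score {top['score']:.3f})")
```

Comparing the retrieved documents and scores across models on data that resembles your production queries gives a more direct signal than a public leaderboard score alone, which is the point both commenters make.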
The team also notes several limitations and future directions for RTEB. The benchmark currently focuses on text-only retrieval and may later expand to include multimodal tasks such as text-to-image search. The maintainers are also working to extend language coverage, particularly for Chinese, Arabic, and low-resource languages, and are encouraging community contributions of new datasets.
With RTEB now live on Hugging Face’s MTEB leaderboard under the new Retrieval section, developers and researchers can already submit their models for evaluation. The project’s maintainers emphasize that this is only the beginning: RTEB will evolve through open collaboration, with the long-term goal of becoming the community’s trusted standard for measuring retrieval performance in AI.