Jina AI's Open-Source Embedding Model Outperforms OpenAI's Ada

Multimodal AI company Jina AI recently released jina-embeddings-v2, a sentence embedding model. The model supports context lengths up to 8192 tokens and outperforms OpenAI's text-embedding-ada-002 on several embedding benchmarks.

The jina-embeddings-v2 model, which is freely available under the Apache 2.0 license, is the second iteration of Jina's embeddings model. The model, which supports only the English language, is based on a BERT architecture and is available in two sizes: small, which has 33 million parameters, and base, with 137 million. A large version with 435 million parameters is "releasing soon." The model was trained on the C4 dataset, along with a new dataset of negated statements created by Jina AI. On the Hugging Face leaderboard for the Massive Text Embedding Benchmark (MTEB), jina-embeddings-v2 outperforms OpenAI's text-embedding-ada-002 on several tasks of the benchmark, including text classification, reranking, and summarization. According to Dr. Han Xiao, CEO of Jina AI:

In the ever-evolving world of AI, staying ahead and ensuring open access to breakthroughs is paramount. With jina-embeddings-v2, we've achieved a significant milestone. Not only have we developed the world's first open-source 8K context length model, but we have also brought it to a performance level on par with industry giants like OpenAI. Our mission at Jina AI is clear: we aim to democratize AI and empower the community with tools that were once confined to proprietary ecosystems. Today, I am proud to say, we have taken a giant leap towards that vision.

A sentence embedding maps a piece of text to a vector. A spatial relationship between two such vectors, such as their cosine distance, measures how closely related the meanings of the two source texts are. The embedding of a text can be used for several downstream AI tasks, such as text classification or summarization. Embeddings are also used to index documents in a vector database for tasks such as retrieval-augmented generation (RAG). In 2022, InfoQ covered the release of OpenAI's text-embedding-ada-002 model, which replaced five previous task-specific models.
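
To make the cosine comparison concrete, here is a minimal Python sketch. The three-dimensional vectors are made-up toy values chosen for illustration, not output from jina-embeddings-v2 or any other model, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance: 0 for vectors pointing the same way, 1 for orthogonal ones."""
    return 1.0 - cosine_similarity(a, b)


# Toy "embeddings" (assumed values, for illustration only).
walking = [0.9, 0.1, 0.3]
hand_in_hand = [0.8, 0.2, 0.4]
unrelated = [0.1, 0.9, 0.0]

# Texts with related meanings should end up closer than unrelated ones.
print(cosine_distance(walking, hand_in_hand))
print(cosine_distance(walking, unrelated))
```

In a retrieval setting, the same distance would be computed between a query embedding and each stored document embedding, with the nearest documents returned first.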

Jina AI pointed out that one common shortcoming of embedding models is that negated statements are often mapped very close to their original positive statement. For example, the statements "a couple is walking together" and "a couple is not walking together" are often embedded more closely together than the statements "a couple walks hand in hand down a street" and "a couple is walking together." To address this, the team used GPT-3.5 to create a negation dataset of "query, positive, negative" triples. During training, the model learned to separate the embeddings of the positive and negative components of the triples.
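
The article does not say which loss function Jina AI used. One common way to train on "query, positive, negative" triples is a triplet margin loss, sketched below under that assumption. The function names, the margin value, and the toy vectors are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_margin_loss(query, positive, negative, margin=0.2):
    """Zero once the positive is closer to the query than the negative by at least `margin`.

    The margin of 0.2 is an assumed value, not one reported by Jina AI.
    """
    d_pos = cosine_distance(query, positive)
    d_neg = cosine_distance(query, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy vectors (assumed values): the negated statement starts out too close to the query.
query = [1.0, 0.0]
paraphrase = [0.8, 0.6]   # stands in for "a couple walks hand in hand down a street"
negation = [0.95, 0.05]   # stands in for "a couple is not walking together"

# A positive loss here means training would push the negation away from the query.
print(triplet_margin_loss(query, paraphrase, negation))
```

Minimizing a loss of this shape over many triples pulls each positive toward its query and pushes each negation away, which is the separation the paragraph above describes.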

Several users discussed the model in a thread on Hacker News. One user pointed out that the dimension of Jina's embedding vector was about half that of OpenAI's, which would make it more performant for database queries. Another user claimed:

In my experience, OpenAI's embeddings are overspecified and do very poorly with cosine similarity out of the box as they match syntax more than semantic meaning (which is important as that's the metric for RAG). Ideally you'd want cosine similarity in the range of [-1, 1] on a variety of data but in my experience the results are [0.6, 0.8].

Both varieties of jina-embeddings-v2, as well as the v1 models, are available on Hugging Face. Jina AI says that it is developing German and Spanish language models and will publish an academic paper with the technical details of its work.

About the Author

Anthony Alford
