Six new Ettin rerankers displace the ms-marco-MiniLM baseline

A family of CrossEncoder models from 17M to 1B parameters beats the long-dominant MiniLM rerankers on MTEB and NanoBEIR at a fraction of their size.

Alessandro Benigni

PUBLISHED MAY 20, 2026

Six new production-grade rerankers are available on Hugging Face as of May 19, covering a parameter range from 17M to 1B, and the benchmark results challenge a baseline that many retrieval teams have been running untouched for years.

The models are the Ettin Reranker family, released by Tom Aarsen and built on Ettin ModernBERT encoders from Johns Hopkins University. They ship as Sentence Transformers CrossEncoder models, so swapping one into an existing pipeline takes three lines of Python. All six are released under Apache 2.0.

What rerankers do in a production pipeline

A reranker, also called a cross-encoder, takes a (query, document) pair and scores their relevance jointly. Unlike an embedding model, which encodes each text independently and compares vectors, a cross-encoder lets the query and document attend to each other through every transformer layer. That joint attention produces significantly more accurate relevance signals. The cost is that the model must run once per candidate pair, making it too slow to score an entire corpus.

The solution used by nearly every production retrieval system today is retrieve-then-rerank: a fast embedding model pulls the top-K candidates from the full index cheaply, then the reranker re-orders just those K with high accuracy. This is the standard shape of modern RAG (retrieval-augmented generation) architectures. The quality of the final answer depends heavily on whether the reranker returns the right document at rank one.
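The two-stage shape described above can be sketched in a few lines. This is a minimal illustration, not code from the Ettin release: retrieve_top_k, rerank, toy_scorer and cross_encoder_scorer are hypothetical names, and word overlap stands in for a real embedding model so the example runs offline. Only CrossEncoder and its predict method come from Sentence Transformers.

```python
from typing import Callable, List, Sequence, Tuple

# A pair scorer maps (query, document) pairs to one relevance score per pair.
PairScorer = Callable[[Sequence[Tuple[str, str]]], List[float]]


def retrieve_top_k(query: str, corpus: Sequence[str], k: int) -> List[int]:
    """Cheap first stage. Word overlap is a stand-in for an embedding model."""
    query_terms = set(query.lower().split())
    overlap = [len(query_terms & set(doc.lower().split())) for doc in corpus]
    ranked = sorted(range(len(corpus)), key=lambda i: overlap[i], reverse=True)
    return ranked[:k]


def rerank(query: str, corpus: Sequence[str], candidate_ids: Sequence[int],
           score_pairs: PairScorer) -> List[int]:
    """Second stage: score only the K candidates and sort them by score."""
    pairs = [(query, corpus[i]) for i in candidate_ids]
    scores = score_pairs(pairs)
    order = sorted(range(len(candidate_ids)), key=lambda j: scores[j], reverse=True)
    return [candidate_ids[j] for j in order]


def toy_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Offline stand-in for a cross-encoder: share of query words found in the document."""
    scores = []
    for query, doc in pairs:
        query_words = query.lower().split()
        doc_words = set(doc.lower().split())
        scores.append(sum(w in doc_words for w in query_words) / len(query_words))
    return scores


def cross_encoder_scorer(model_id: str) -> PairScorer:
    """Wrap a real CrossEncoder as a PairScorer (needs sentence-transformers and a model download)."""
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_id)
    return lambda pairs: [float(s) for s in model.predict(list(pairs))]


corpus = [
    "rerankers score query and document pairs jointly",
    "embedding models encode each text independently",
    "flash attention speeds up long sequences",
    "a cross-encoder reranker scores each query document pair",
]
query = "how does a reranker score a query document pair"

candidates = retrieve_top_k(query, corpus, k=3)
print(rerank(query, corpus, candidates, toy_scorer))
```

To use a real reranker, pass cross_encoder_scorer(model_id) in place of toy_scorer, where model_id is the Hugging Face id of the model you want to test. Because the scorer is just a callable, swapping rerankers does not touch the rest of the pipeline.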

The legacy baseline being displaced

For several years, cross-encoder/ms-marco-MiniLM-L12-v2, a 33M-parameter model, has served as the default starting point for teams building retrieve-then-rerank systems. It is widely cited in tutorials and benchmarks, and many production stacks still run on it or its smaller L6 and L4 siblings.

The Ettin Reranker benchmarks, published on the Hugging Face blog, show a clear break from that baseline. On MTEB (Massive Text Embedding Benchmark, a standardized 10-task retrieval evaluation suite) and NanoBEIR (a 13-dataset fast subset of the BEIR information retrieval benchmark), every Ettin model outperforms the legacy MiniLM family.

The most striking result is at the small end. The 17M Ettin model beats ms-marco-MiniLM-L12-v2 by 0.051 NDCG@10 on MTEB (0.5576 vs. 0.5066) and by 0.038 on NanoBEIR, at roughly half the parameter count. The 32M model beats the 568M BAAI/bge-reranker-v2-m3 by 0.025 NDCG@10 on MTEB despite being roughly 18 times smaller.
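For readers unfamiliar with the metric, NDCG@10 rewards placing relevant documents near the top of the first ten results and normalizes against the best possible ordering. The sketch below uses the linear-gain variant and assumes the ranked list contains every judged document for the query. The exact variant used by the MTEB and NanoBEIR harnesses is not stated in the article, so treat this as an illustration of the idea and not a reimplementation of the benchmark.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain: each result's relevance, discounted by log2 of its rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in ranked order (1 = relevant, 0 = not relevant).
perfect = ndcg_at_k([1, 0, 0])   # the relevant document is at rank one
demoted = ndcg_at_k([0, 1, 0])   # the same document pushed to rank two
print(perfect, demoted)
```

Moving a single relevant document from rank one to rank two already costs more than a third of the score, which is why gaps of a few hundredths averaged over many queries are treated as meaningful.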

How they were trained

All six models share the same architecture: an Ettin ModernBERT encoder backbone with a four-module classification head. ModernBERT provides up to 8,192 tokens of context, RoPE positional encodings, and native Flash Attention 2 support. The training method is pointwise MSE distillation: the 1.54B-parameter mixedbread-ai/mxbai-rerank-large-v2 was used as a teacher, with the student models trained to match its output scores on a curated dataset combining embedding pre-training and fine-tuning subsets.
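Pointwise MSE distillation means each (query, document) pair is treated on its own: the student's score for the pair is pulled toward the teacher's score, with no comparison between documents. The sketch below shows the loss and its gradient in plain Python. The scores are made-up numbers and the update is applied directly to the scores for clarity; in real training the gradient would flow back into the student's weights through an autograd framework.

```python
from typing import List, Sequence


def mse_distillation_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between student and teacher scores, one term per pair."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher must score the same pairs")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def mse_gradient(student: Sequence[float], teacher: Sequence[float]) -> List[float]:
    """Gradient of the loss with respect to each student score."""
    n = len(student)
    return [2 * (s - t) / n for s, t in zip(student, teacher)]


# Hypothetical scores for two (query, document) pairs.
teacher_scores = [1.0, 4.0]
student_scores = [1.0, 2.0]

loss_before = mse_distillation_loss(student_scores, teacher_scores)
grad = mse_gradient(student_scores, teacher_scores)

# One gradient-descent step with an illustrative learning rate of 0.5.
stepped = [s - 0.5 * g for s, g in zip(student_scores, grad)]
loss_after = mse_distillation_loss(stepped, teacher_scores)
print(loss_before, loss_after)
```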

The distillation loses almost nothing. The 1B Ettin model lands within 0.0001 NDCG@10 of its 1.54B teacher on MTEB (0.6114 vs. 0.6115), effectively matching a model 54% larger. The only model that clearly beats the teacher in these benchmarks is Qwen/Qwen3-Reranker-4B, which scores 0.6367, roughly 0.025 above the 1B Ettin. For most workloads, the 1B Ettin model, at a quarter of the Qwen model's parameters, is the more practical choice.

Flash Attention 2 and speed

Speed matters as much as accuracy for rerankers, since they sit in the latency-critical path between retrieval and response. With Flash Attention 2 enabled and bfloat16 precision, the Ettin models deliver a 1.7x to 8.3x throughput improvement over fp32 defaults depending on model size and sequence length, according to benchmarks in the Hugging Face blog post. Sequence unpadding, enabled by using the Sentence Transformers modular head instead of Hugging Face’s standard AutoModelForSequenceClassification, is a key contributor to that speedup.
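To see why unpadding helps, compare how many token slots a padded batch occupies with how many real tokens it holds. The sketch below is a conceptual illustration in plain Python, not the Sentence Transformers implementation: unpad and padded_token_count are hypothetical helpers, and the name cu_seqlens follows a convention common in variable-length attention kernels.

```python
from typing import List, Sequence, Tuple


def padded_token_count(lengths: Sequence[int]) -> int:
    """Token slots processed when every sequence is padded to the longest one."""
    return len(lengths) * max(lengths) if lengths else 0


def unpad(batch: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Pack sequences into one flat list plus cumulative sequence lengths.

    Sequence i occupies flat[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    flat: List[int] = []
    cu_seqlens = [0]
    for seq in batch:
        flat.extend(seq)
        cu_seqlens.append(len(flat))
    return flat, cu_seqlens


# Three token-id sequences of different lengths (illustrative ids).
batch = [[101, 7, 8], [101], [101, 9]]
flat, cu_seqlens = unpad(batch)

padded = padded_token_count([len(seq) for seq in batch])
print(f"padded slots: {padded}, real tokens: {len(flat)}, boundaries: {cu_seqlens}")
```

The wider the spread between short and long (query, document) pairs in a batch, the larger the share of padded slots that unpadding avoids, which is consistent with the reported speedup varying by sequence length.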

What teams should do now

Any team running ms-marco-MiniLM-L12-v2 or its siblings in a production RAG stack has a low-risk upgrade path. The 17M and 32M Ettin models are drop-in replacements that scored higher NDCG@10 on both MTEB and NanoBEIR, at equal or lower serving cost. Teams with higher accuracy requirements and a budget for a larger model should benchmark the 150M and 1B variants before committing to a multi-billion-parameter alternative. The benchmarks are published; running them against your own retrieval corpus before deployment is the right next step.

Source: Hugging Face blog, “Introducing the Ettin Reranker Family” by Tom Aarsen, published May 19, 2026.
