LLMs as translation judges: Inside GEMBA-MQM v2
The current state-of-the-art translation quality metric: it lists the specific mistakes in a translation, not just how bad it is overall, and it needs no gold-standard reference. Adapted here for live speech-to-speech translation.
On this page
- The landscape of translation quality metrics
- Surface-overlap metrics
- Neural metrics
- Speech-native metrics
- What’s missing
- GEMBA-MQM v2
- Why GEMBA wins
- Error taxonomy and scoring
- Multi-pass scoring: why 10 passes matter
- Aggregation: outlier removal + RRWA
- Using GEMBA for live speech translation
- Walkthrough: example output
- Reference implementation
- References
Your German audience just heard the opposite of what the speaker said. A standard neural quality metric returned a low score. Nothing in that number told you what went wrong. That’s how most translation metrics work: they measure failure, but they don’t identify it.
GEMBA-MQM v2 (Junczys-Dowmunt, 2025) is different. It uses a large language model as a translation judge that reads both sides, identifies specific errors by type and severity, and returns a structured annotation you can act on. It ranks first by average correlation with human judgments on the WMT24 translation-quality benchmark (Kocmi et al., 2024), the annual industry-standard evaluation. And it doesn’t need a reference translation.
This article explains what GEMBA is, why it outperforms the alternatives, how it works in detail, and how we apply it to evaluate live speech-to-speech translation systems.
The landscape of translation quality metrics
Translation evaluation has a long lineage of automated metrics. Each generation improved on the last, but all share limitations that GEMBA overcomes.
Surface-overlap metrics
BLEU (Papineni et al., 2002) counts n-gram overlap between a translation and a reference. It launched the field of automated MT evaluation but has well-documented blind spots: it rewards surface similarity, not meaning. A translation that drops a negation but retains all other words scores high. The brevity penalty partially catches omissions but cannot tell you what was dropped, and hallucinations that overlap with the reference go undetected. And it requires a reference translation, which is unavailable in most real-world evaluation scenarios.
chrF/chrF++ (Popović, 2015) replaces word n-grams with character n-grams, making it more robust to morphologically rich languages. But it shares BLEU’s fundamental limitation: textual overlap is not semantic equivalence.
Neural metrics
COMET (Rei et al., 2020) trains a neural model on human Direct Assessment ratings, producing a single score that correlates better with human judgment than BLEU. Reference-free variants exist (Rei et al., 2022). The limitation: COMET assigns generous scores to fluent hallucinations; its training data rarely includes translations that sound excellent but are factually wrong. And it produces an opaque number with no error breakdown.
xCOMET (Guerreiro et al., 2024) bridges scoring and error detection, highlighting error spans while providing segment-level scores. It is strong on WMT benchmarks, but error categories are coarser than the standard MQM taxonomy (no accuracy subtypes like omission vs. mistranslation).
MetricX-24 (Juraska et al., 2024) builds on mT5-XXL (13B parameters) and achieves strong correlation with human ratings. But like COMET, it produces a black-box score. You know how much went wrong but not what.
Speech-native metrics
BLASER 2.0 (Dale and Costa-jussà, 2024) operates directly on speech embeddings without requiring ASR, which makes it appealing for low-resource languages that lack reliable ASR. But embedding-based similarity cannot tell you whether an omission happened, a term was mistranslated, or a hallucination was introduced. The score is a distance, not a diagnosis.
What’s missing
No existing metric closes both gaps at once:
- No error detail. Existing metrics produce a number, not a diagnosis. You know the translation is bad but not why.
- Reference dependence. Surface-overlap metrics need a gold-standard translation that usually doesn’t exist in real-world evaluation scenarios.
GEMBA-MQM v2 closes both gaps.
GEMBA-MQM v2
GEMBA (GPT Estimation Metric Based Assessment) was introduced by Kocmi and Federmann (2023a) at EAMT 2023 as a zero-shot LLM-as-judge approach to translation quality evaluation (code). The original version asks a large language model to rate translation quality on a Likert scale and achieved state-of-the-art system-level accuracy on WMT22 when measured against MQM-based human labels.
GEMBA-MQM (Kocmi and Federmann, 2023b) extended this to produce error span annotations in the Multidimensional Quality Metrics (MQM) framework, an open industry standard for classifying translation errors by type and severity: it identifies what went wrong, not just how bad it is. It achieved 96.5% system-level pairwise accuracy on the WMT 2023 blind test set using language-agnostic prompts, with no per-language engineering.
GEMBA-MQM v2 (Junczys-Dowmunt, 2025) is the current version. The key insight: a single LLM judgment is noisy, but ten independent judgments aggregated properly are much more reliable. The paper reports first place by average correlation on WMT24 MQM test sets.
Why GEMBA wins
| Property | GEMBA-MQM v2 | Neural metrics (COMET, MetricX) | Surface metrics (BLEU, chrF) |
|---|---|---|---|
| Reference-free | Yes | Some variants | No |
| Error types | Full MQM taxonomy | None or coarse | None |
| Model-agnostic | Any LLM | Fixed checkpoint | N/A |
| Interpretable | Error-level detail | Single score | Single score |
| Correlation with human MQM | First place (WMT24) | Strong | Moderate |
The core advantage is interpretability. A GEMBA score is not just a number; it is a list of errors with severity levels, types, and descriptions. You can read the errors and understand exactly what went wrong. This makes the metric actionable: teams can fix specific failure modes rather than trying to improve an opaque score.
Error taxonomy and scoring
The LLM is prompted to annotate translation errors using MQM severity levels, each with a fixed weight:
| Severity | Weight | Meaning |
|---|---|---|
| CRITICAL | 25 | Comprehension-blocking: the reader (or listener) cannot understand what was meant |
| MAJOR | 5 | Disrupts flow: the meaning is recoverable but with effort |
| MINOR | 1 | Awkward but understandable |
| PUNCTUATION | 0.1 | Minor formatting (subset of minor) |
Each error also carries a type. The five main categories are accuracy (addition, mistranslation, omission, untranslated text), fluency (grammar, spelling, register, punctuation), style, terminology, and non-translation.
The raw MQM score for a segment is the negated sum of the weights of its errors:

score = −(25 × n_critical + 5 × n_major + 1 × n_minor + 0.1 × n_punctuation)

A perfect translation scores 0. More negative means worse. To compare texts of different lengths, we normalize to a Normed Penalty Total (NPT) per 1,000 source words, following the MQM scoring model:

NPT = score / (source word count) × 1,000
One way to read a normalized score: −57 means roughly “2 critical + 1 major + 2 minor errors per 1,000 source words” (2×25 + 1×5 + 2×1 = 57).
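As a minimal sketch of this arithmetic (the list-of-severity-labels input is a simplification for illustration, not the actual annotation schema):

```python
# Severity weights from the MQM table above.
SEVERITY_WEIGHTS = {"critical": 25.0, "major": 5.0, "minor": 1.0, "punctuation": 0.1}

def raw_mqm_score(error_severities):
    """Negated sum of error weights; a perfect segment scores 0."""
    return -sum(SEVERITY_WEIGHTS[sev] for sev in error_severities)

def normed_penalty_total(raw_score, source_word_count):
    """Normalize a raw score to a penalty per 1,000 source words (NPT)."""
    return raw_score / source_word_count * 1000.0

# The example reading from the text: 2 critical + 1 major + 2 minor errors
# in a 1,000-word source give an NPT of -57.
errors = ["critical", "critical", "major", "minor", "minor"]
print(normed_penalty_total(raw_mqm_score(errors), 1000))   # -57.0
```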
Multi-pass scoring: why 10 passes matter
An LLM judging translation quality is doing something genuinely hard. It must read two texts in different languages, identify where meaning diverges, decide whether each divergence is a real error or acceptable variation, and assign a severity. Even trained translators routinely disagree about which segments contain errors and how serious they are (Lommel et al., 2014). An LLM faces the same ambiguity, and at temperature > 0, it resolves that ambiguity differently each time.
The GEMBA v2 paper (Junczys-Dowmunt, 2025) formalizes this: score each segment ten times independently, remove statistical outliers, and aggregate with a weighted average that favors the more lenient judgments. The result is a score that reflects the consensus view rather than the luck of a single draw.
In practice, variance is substantial: on one English→German clip, 10 passes at temperature 0.4 produced scores ranging from −29 to −109, a 3.7× spread on the same translation. The 10-pass aggregate settles at −41.9.
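A sketch of the multi-pass loop, reusing raw_mqm_score from the sketch above. judge_segment is a hypothetical stand-in for whichever LLM-as-judge call you use; the essentials are the ten independent calls and a non-zero sampling temperature:

```python
def judge_segment(source_text, target_text, temperature):
    """Hypothetical LLM-as-judge call: returns one independent annotation as a
    list of severity labels ("critical", "major", "minor", "punctuation")."""
    raise NotImplementedError("wrap your LLM judge prompt here")

def score_segment(source_text, target_text, passes=10, temperature=0.4):
    """One raw MQM score per independent pass; aggregation happens afterwards."""
    return [
        raw_mqm_score(judge_segment(source_text, target_text, temperature))
        for _ in range(passes)
    ]
```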
Aggregation: outlier removal + RRWA
The 10-pass scores are aggregated in two steps, following Junczys-Dowmunt (2025, Sec. 3.2):
1. Outlier removal. Discard any score beyond 2 standard deviations from the mean. This filters harsh or lenient outlier judgments.
2. Rank-Reciprocal Weighted Average (RRWA). Sort the remaining n scores from best (closest to 0) to worst as s1, s2, …, sn. Weight each by the reciprocal of its rank: RRWA = (s1/1 + s2/2 + … + sn/n) / (1/1 + 1/2 + … + 1/n).
RRWA weights better-ranked scores more heavily: the best score contributes at weight 1/1, the next at 1/2, and so on, so the harshest single pass still contributes but with the smallest weight. Informally, this de-emphasizes the harshest outlier while still using it, since a single harsh pass can over-flag borderline cases.
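A minimal sketch of both aggregation steps. The use of the population standard deviation and the illustrative pass scores are assumptions made here for the example, not values taken from the paper or from the clip above:

```python
import statistics

def aggregate_rrwa(scores):
    """Drop passes beyond 2 standard deviations from the mean, then take a
    rank-reciprocal weighted average of the rest (best score gets weight 1/1)."""
    mean = statistics.mean(scores)
    stdev = statistics.pstdev(scores)
    kept = [s for s in scores if stdev == 0 or abs(s - mean) <= 2 * stdev]

    # MQM scores are <= 0, so the largest value is the best one.
    ranked = sorted(kept, reverse=True)
    weights = [1.0 / rank for rank in range(1, len(ranked) + 1)]
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)

# Illustrative pass scores: the harshest pass (-109) falls outside 2 sigma and
# is dropped; the aggregate lands near the lenient end (about -37 here).
print(aggregate_rrwa([-29, -33, -35, -38, -41, -44, -52, -60, -78, -109]))
```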
See our reference implementation below for the full code.
Using GEMBA for live speech translation
GEMBA operates on text. To evaluate live speech-to-speech translation, we first transcribe both sides (the original source audio and the translated audio delivered to the audience) and run GEMBA on the resulting transcript pair.

One design choice worth noting: we score the transcription of the delivered audio, not the platform’s own captions. Captions can mask TTS failures. Transcribing the delivered audio captures what the audience actually received. The trade-off is that ASR errors on the delivered audio show up in the score, so a poor target-language ASR will inflate the error count. We accept this because the metric reflects what the listener actually heard.
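To make the wiring concrete, here is a sketch assuming openai-whisper for both transcriptions and reusing the helpers from the sketches above; the model choice and function names are illustrative, not the exact pipeline in our repository:

```python
import whisper  # openai-whisper, an illustrative ASR choice

asr_model = whisper.load_model("large-v3")

def transcribe(audio_path, language):
    """Transcribe one side of the session to text."""
    return asr_model.transcribe(audio_path, language=language)["text"]

def evaluate_clip(source_audio, delivered_audio, src_lang, tgt_lang):
    source_text = transcribe(source_audio, src_lang)
    target_text = transcribe(delivered_audio, tgt_lang)   # what the audience actually heard
    passes = score_segment(source_text, target_text)      # ten judge passes (sketch above)
    raw = aggregate_rrwa(passes)                           # outlier removal + RRWA
    return normed_penalty_total(raw, len(source_text.split()))
```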
Walkthrough: example output
We ran GEMBA-MQM v2 on a 6-minute English→Spanish clip (958 source words), scored with Gemini 3.1 Pro. The GEMBA v2 paper prescribes how to aggregate scores across passes but does not specify how to select which pass’s errors to present. Our approach: show the errors from the pass whose score is closest to the final RRWA aggregate, i.e., the most representative single judgment.
For this clip, the representative pass returned 6 major and 3 minor errors, with no critical errors:

The 10-pass RRWA aggregate settles at −35.19, normalized to −36.7 per 1,000 source words. (The representative pass above sums to −33 from 6 major (×5) + 3 minor (×1); the RRWA weighs all ten passes.) The audience follows the speaker’s argument; the core message comes through intact. The errors are real but non-blocking: a hallucinated repetition, a few shifted quantifiers, and some tone loss. For live speech translation into Spanish, this is a solid result.
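Selecting the representative pass is a small helper on top of the aggregation; pass_results here is an assumed list of (score, error_list) pairs, one per pass:

```python
def representative_pass(pass_results, rrwa_aggregate):
    """Return the (score, errors) pair whose score is closest to the RRWA aggregate."""
    return min(pass_results, key=lambda pair: abs(pair[0] - rrwa_aggregate))
```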
See our reference implementation below for the full prompts.
Reference implementation
The GEMBA-MQM v2 pipeline described above (10 stochastic passes, 2σ outlier removal, rank-reciprocal weighted aggregation) is available at VoiceFrom/live-s2st-eval.
References
- D. Dale and M. R. Costa-jussà. 2024. BLASER 2.0: A metric for evaluation and quality estimation of massively multilingual speech and text translation. In Findings of EMNLP. https://aclanthology.org/2024.findings-emnlp.943/
- N. M. Guerreiro et al. 2024. xCOMET: Transparent machine translation evaluation through fine-grained error detection. TACL. https://aclanthology.org/2024.tacl-1.54/
- M. Junczys-Dowmunt. 2025. GEMBA V2: Ten judgments are better than one. In Proc. WMT, 926–933. https://aclanthology.org/2025.wmt-1.67/
- J. Juraska et al. 2024. MetricX-24: The Google submission to the WMT 2024 metrics shared task. In Proc. WMT. https://arxiv.org/abs/2410.03983
- T. Kocmi and C. Federmann. 2023a. Large language models are state-of-the-art evaluators of translation quality. In Proc. EAMT, 193–203. https://aclanthology.org/2023.eamt-1.19/
- T. Kocmi and C. Federmann. 2023b. GEMBA-MQM: Detecting translation quality error spans with GPT-4. In Proc. WMT, 768–775. https://aclanthology.org/2023.wmt-1.64/
- T. Kocmi et al. 2024. Findings of the WMT24 general machine translation shared task: The LLM era is here but MT is not solved yet. In Proc. WMT. https://aclanthology.org/2024.wmt-1.1/
- A. Lommel, M. Popović, and A. Burchardt. 2014. Assessing inter-annotator agreement for translation error annotation. In Proc. MTE Workshop at LREC.
- K. Papineni et al. 2002. BLEU: A method for automatic evaluation of machine translation. In Proc. ACL.
- M. Popović. 2015. chrF: Character n-gram F-score for automatic MT evaluation. In Proc. WMT. https://aclanthology.org/W15-3049/
- R. Rei et al. 2020. COMET: A neural framework for MT evaluation. In Proc. EMNLP, 2685–2702. https://aclanthology.org/2020.emnlp-main.213/
- R. Rei et al. 2022. CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task. In Proc. WMT, 634–645. https://aclanthology.org/2022.wmt-1.60/