This paper presents the results of the WMT18 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT18 News Translation Task with automatic metrics.
We collected scores of 10 metrics from 8 research groups. In addition, we computed scores of 8 standard metrics (BLEU, SentBLEU, chrF, NIST, WER, PER, TER and CDER) as baselines.
The collected scores were evaluated in terms of system-level correlation (how well each metric's scores correlate with the WMT18 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in judging the quality of a particular sentence relative to alternative outputs). This year, we employed a single kind of manual evaluation: direct assessment (DA).
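For reference, system-level agreement of this kind is conventionally measured with the Pearson correlation coefficient, and segment-level agreement with a Kendall's tau-like statistic over pairwise judgments; a minimal sketch of both follows, with illustrative symbols ($H_i$ for the human DA score of system $i$, $M_i$ for the metric score, $\mathit{Con}$ and $\mathit{Dis}$ for the sets of pairs ranked concordantly and discordantly). The exact variants used in the task may differ in detail.

% Illustrative formulations; symbol names are assumptions, not the paper's notation.
\[
r \;=\; \frac{\sum_{i=1}^{n} \bigl(H_i - \bar{H}\bigr)\bigl(M_i - \bar{M}\bigr)}
             {\sqrt{\sum_{i=1}^{n} \bigl(H_i - \bar{H}\bigr)^2}\,
              \sqrt{\sum_{i=1}^{n} \bigl(M_i - \bar{M}\bigr)^2}},
\qquad
\tau \;=\; \frac{|\mathit{Con}| - |\mathit{Dis}|}{|\mathit{Con}| + |\mathit{Dis}|}.
\]
% r: system-level Pearson correlation between human and metric scores over n systems.
% tau: segment-level agreement; a pair is concordant when metric and human rank the
% two candidate translations in the same order, discordant otherwise.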