Data Issues in English-to-Hindi Machine Translation

Publication at Faculty of Mathematics and Physics

2010

Abstract

Statistical machine translation into morphologically richer languages is a challenging task, all the more so when the source and target languages differ in word order. Current state-of-the-art MT systems therefore deliver mediocre results.

Adding more parallel data often improves the results; when it does not, the cause may be a domain mismatch, poor alignment, or noise in the new data. In this paper we evaluate the English-to-Hindi MT task from this data perspective.

We discuss several available parallel data sources and provide cross-evaluation results on their combinations using two freely available statistical MT systems. Together with the error analysis, we also present a new tool for viewing aligned corpora, which makes it easier to detect difficult parts of the data, even for a developer who does not speak the target language.

Keywords

data issues, English, Hindi, machine translation