Domain adaptation of statistical machine translation with domain-focused web crawling

Publication at Faculty of Mathematics and Physics | 2015

Abstract

In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework.

We present a strategy for using such resources depending on their availability and quantity, supported by the results of a large-scale evaluation carried out for two domains (environment and labour legislation) and two language pairs (English-French and English-Greek), in both directions: into and from English. In general, MT systems trained and tuned on a general domain perform poorly on specific domains. We show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and para