

Publication at the Faculty of Mathematics and Physics | 2012

Abstract

(Not yet available. English version repeated) This paper presents a method for improving the word alignment model of a phrase-based Statistical Machine Translation system for a low-resource language, using a string similarity approach.

Our method captures similar words that can be seen as semi-monolingual across languages, such as numbers, named entities, and adapted or loan words. To measure how monolingual a word pair is, we use several string similarity metrics: Longest Common Subsequence Ratio (LCSR), Minimum Edit Distance Ratio (MEDR), and a modified BLEU score (modBLEU).
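To make the first two metrics concrete, here is a minimal sketch in Python. It assumes the common definitions: LCSR as the longest common subsequence length divided by the length of the longer word, and MEDR as one minus the Levenshtein distance divided by the length of the longer word. The abstract does not spell out the normalization, so these formulas and the function names are assumptions, and modBLEU is omitted because its modification is not described here.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Assumed definition: LCS length over the longer word's length."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def medr(a: str, b: str) -> float:
    """Assumed definition: 1 - edit distance over the longer word's length."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # Illustrative Indonesian/English pairs, not taken from the paper's data.
    for src, tgt in [("presiden", "president"), ("komputer", "computer"), ("2012", "2012")]:
        print(src, tgt, round(lcsr(src, tgt), 3), round(medr(src, tgt), 3))
```

Both scores lie in [0, 1], with 1 meaning identical strings, so a single threshold can be applied to either metric.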

Our approach adds intersecting alignment points for word pairs that are orthographically similar before a word alignment heuristic is applied, so that the heuristic produces a better word alignment. We demonstrate this approach on an Indonesian-to-English translation task, where the languages share many similar words that are poorly aligned given limited training data.
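One possible reading of this step is sketched below in Python. It assumes that the two directional alignments are available as sets of (source index, target index) pairs and that a similar word pair is added to both directions so that it survives the intersection. The function names, the 0.8 threshold, and the example sentence are hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in similarity measure, not as one of the paper's metrics.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in orthographic similarity in [0, 1] (not LCSR, MEDR, or modBLEU)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def add_similar_points(src, tgt, src2tgt, tgt2src, threshold=0.8):
    """Add (i, j) to both directional alignments when src[i] and tgt[j] look alike.

    The threshold value is a hypothetical choice for illustration.
    """
    s2t, t2s = set(src2tgt), set(tgt2src)
    for i, s in enumerate(src):
        for j, t in enumerate(tgt):
            if similarity(s, t) >= threshold:
                s2t.add((i, j))
                t2s.add((i, j))
    return s2t, t2s


if __name__ == "__main__":
    # Hypothetical sentence pair and directional alignments.
    src = ["presiden", "obama", "tiba", "2012"]
    tgt = ["president", "obama", "arrived", "2012"]
    src2tgt = {(0, 0), (2, 2)}
    tgt2src = {(2, 2), (1, 0)}

    s2t, t2s = add_similar_points(src, tgt, src2tgt, tgt2src)
    # The intersection would then be handed to a symmetrization heuristic.
    print(sorted(s2t & t2s))
```

In this toy case the similar pairs for "presiden", "obama", and "2012" all end up in the intersection, while the spurious one-directional point (1, 0) does not.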

This approach gives a statistically significant improvement by u

Keywords