
UMC005: English-Urdu Parallel Corpus

Publication

Abstract

The English-Urdu Parallel Corpus is intended for training statistical machine translation systems between these two languages. It consists of four parts:

1. English-Urdu part of the EMILLE corpus;

2. texts from the Wall Street Journal (Penn Treebank);

3. translations of the Quran;

4. translations of the Bible.

The parallel data that existed before (EMILLE) have been completely cleaned anew by hand; the alignment and many sentences on the Urdu side have been corrected.