The utility of linguistic annotation in neural machine translation seemed to have been established in previous work. The experiments were, however, limited to recurrent sequence-to-sequence architectures and relatively small data settings.
We focus on the state-of-the-art Transformer model and use larger corpora. Specifically, we try to promote the knowledge of source-side syntax using multi-task learning, either through simple data manipulation techniques or through a dedicated model component.
In particular, we train one of the Transformer attention heads to produce the source-side dependency tree. Overall, our results cast some doubt on the utility of multi-task setups with linguistic information.
The data manipulation techniques recommended in previous work prove ineffective in large data settings. The treatment of self-attention as dependencies seems much more promising: it helps in translation and reveals that the Transformer model can very easily grasp the syntactic structure.
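As a rough illustration of the dedicated-component variant, the sketch below (our own, not the paper's implementation) adds a cross-entropy term that pushes one encoder self-attention head towards the gold dependency parents; the names `dependency_head_loss` and `lambda_dep` are illustrative assumptions, not identifiers from the paper.

```python
import torch
import torch.nn.functional as F

def dependency_head_loss(attn_weights, head_parents, head_index=0, pad_index=-100):
    # attn_weights: (batch, n_heads, src_len, src_len), rows already softmax-normalised
    # head_parents: (batch, src_len) long tensor, gold parent position of each source
    #               token, with pad_index where no supervision is given (padding, root)
    head = attn_weights[:, head_index]            # (batch, src_len, src_len)
    log_probs = torch.log(head.clamp_min(1e-9))   # treat each row as a distribution over parents
    # nll_loss expects (batch, classes, positions); classes = candidate parent positions
    return F.nll_loss(log_probs.transpose(1, 2), head_parents, ignore_index=pad_index)

if __name__ == "__main__":
    # tiny random example; in a multi-task setup this term would be added to the
    # translation loss with a weighting hyper-parameter, e.g.
    # loss = translation_loss + lambda_dep * dependency_head_loss(encoder_attn, parents)
    batch, heads, src_len = 2, 8, 5
    attn = torch.softmax(torch.randn(batch, heads, src_len, src_len), dim=-1)
    parents = torch.randint(0, src_len, (batch, src_len))
    print(dependency_head_loss(attn, parents).item())
```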
An important but curi