
Universal Dependency Treebanks for Low-Resource Indian Languages: The Case of Bhojpuri

Publication at Faculty of Mathematics and Physics


This paper presents the first dependency treebank for Bhojpuri, an Indo-Aryan language and one of the resource-poor languages of India.

The objective of the Bhojpuri Treebank (BHTB) project is to provide a substantial, syntactically annotated treebank for Bhojpuri that supports the development of language technology tools. The project will also support cross-lingual learning and typological research.

Currently, the treebank consists of 4,881 tokens annotated according to the Universal Dependencies (UD) scheme. We develop a Bhojpuri tagger and parser using a machine learning approach.
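UD treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, comment lines starting with "#", and a blank line after each sentence. As a rough illustration of how a token count like the one above can be obtained, the following minimal sketch counts sentences and syntactic words in CoNLL-U text. The function name count_words and the English sample sentence are hypothetical and are not taken from the Bhojpuri treebank.

```python
# Minimal sketch: count sentences and syntactic words in CoNLL-U text.
# SAMPLE is an illustrative English sentence, not data from the BHTB treebank.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def count_words(conllu_text):
    """Return (sentence_count, word_count) for a CoNLL-U string."""
    sentences = 0
    words = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# text = ...".
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not in_sentence:
            sentences += 1
            in_sentence = True
        # Plain integer IDs are syntactic words; ranges such as "1-2"
        # (multiword tokens) and decimals such as "1.1" (empty nodes)
        # are skipped.
        if cols[0].isdigit():
            words += 1
    return sentences, words


if __name__ == "__main__":
    print(count_words(SAMPLE))
```

Counting only integer IDs is one possible convention; a count that includes multiword token ranges would give a different figure.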

The model achieves 57.49% UAS, 45.50% LAS, 79.69% UPOS accuracy, and 77.64% XPOS accuracy. Finally, we discuss the linguistic analysis and the annotation process of the Bhojpuri UD treebank.
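For readers unfamiliar with the parsing metrics reported above: the unlabeled attachment score (UAS) is the percentage of words assigned the correct head, and the labeled attachment score (LAS) is the percentage assigned both the correct head and the correct dependency relation. The sketch below computes both under the assumption that gold and predicted analyses share the same tokenization. The function name attachment_scores and the toy data are hypothetical, and this is not necessarily the evaluation script used for the reported figures.

```python
# Minimal sketch of UAS/LAS, assuming gold and predicted parses are aligned
# word by word. Each word is represented as a (head_index, relation) pair.
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    total = len(gold)
    # UAS: only the head index has to match.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: head index and relation label both have to match.
    correct_both = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * correct_heads / total, 100.0 * correct_both / total


if __name__ == "__main__":
    # Toy four-word example (illustrative only).
    gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
    pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
    uas, las = attachment_scores(gold, pred)
    print(f"UAS = {uas:.2f}%, LAS = {las:.2f}%")  # UAS = 75.00%, LAS = 50.00%
```

In the toy example, three of four heads are correct (UAS 75%), but one of those carries the wrong label, so only two words count toward LAS (50%). LAS can never exceed UAS, which is consistent with the gap between the two scores reported above.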