In the big data era, text processing becomes increasingly difficult as data volumes grow. At the same time, deep learning models for natural language processing have proliferated, removing the need for hand-crafted rules.
In this research, we present two contributions: one in text preprocessing and one in distributed training, both applicable to any neural-based model. On the preprocessing side, we address the two most common tasks, word and sentence tokenization.
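The abstract does not describe the tokenizer's internals; purely as an illustration of the two tasks, a naive regex-based sketch might look like the following (the function names and patterns are assumptions, and a real tokenizer must also handle abbreviations, URLs, and language-specific punctuation):

```python
import re

def sentence_tokenize(text):
    # Naive split on sentence-final punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(sentence):
    # Split into word tokens, keeping punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "The model was trained on raw text. Tokenization is the first step!"
for sent in sentence_tokenize(text):
    print(word_tokenize(sent))
```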
We compare our proposed combined tokenizer against a single-language model and a multilanguage model. We also provide a simple communication layer based on the MQTT protocol to support distributed training.
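The abstract does not specify the MQTT setup; as a minimal sketch of publish/subscribe communication between training workers, a worker might exchange parameter updates as follows (broker address, topic name, and message schema are assumptions; the example uses the paho-mqtt 1.x client API and requires a broker listening on port 1883):

```python
import json
import paho.mqtt.client as mqtt

# Hypothetical broker address and topic; the paper does not name them.
BROKER_HOST = "localhost"
TOPIC = "training/updates"

def on_connect(client, userdata, flags, rc):
    # Subscribe to the shared update topic, then announce a dummy update.
    client.subscribe(TOPIC)
    client.publish(TOPIC, json.dumps({"worker_id": 0, "step": 1, "grads": [0.1, -0.2]}))

def on_message(client, userdata, msg):
    # Every subscribed worker receives each published update and can merge it locally.
    update = json.loads(msg.payload)
    print(f"received step {update['step']} from worker {update['worker_id']}")

client = mqtt.Client()  # paho-mqtt 1.x callback API assumed
client.on_connect = on_connect
client.on_message = on_message
client.connect(BROKER_HOST, 1883)
client.loop_forever()
```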