Unlike English, Vietnamese is written one syllable at a time: spaces separate syllables rather than words, and many words span two or more syllables. So when processing Vietnamese text, we cannot just split it into words on spaces and punctuation.
For example, âm and tính each mean something when standing alone, but together âm tính is a single word with a totally different meaning ("negative", as in a test result).
We need to segment the text into words correctly before other analysis tasks can work properly.
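
To make the problem concrete, here is a minimal Python sketch that compares a naive whitespace split with a toy greedy longest-match segmenter. The three-entry dictionary and the function names are made up for illustration, and this is not how VnCoreNLP works internally; it only borrows the convention of joining the syllables of one word with underscores.

```python
import unicodedata

# Toy dictionary of multi-syllable words (hypothetical, for illustration only).
COMPOUNDS = {("kết", "quả"), ("xét", "nghiệm"), ("âm", "tính")}
MAX_LEN = max(len(c) for c in COMPOUNDS)


def naive_split(text):
    """Split on whitespace: yields syllables, not words."""
    return text.split()


def segment(text):
    """Greedy longest-match against COMPOUNDS; unknown syllables stay alone."""
    syllables = unicodedata.normalize("NFC", text).split()
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink down to one syllable.
        for n in range(min(MAX_LEN, len(syllables) - i), 0, -1):
            chunk = syllables[i:i + n]
            key = tuple(s.lower() for s in chunk)
            if n == 1 or key in COMPOUNDS:
                words.append("_".join(chunk))
                i += n
                break
    return words


sentence = "Kết quả xét nghiệm âm tính"  # "The test result is negative"
print(naive_split(sentence))  # six separate syllables
print(segment(sentence))      # three words: Kết_quả, xét_nghiệm, âm_tính
```

A dictionary lookup like this breaks down quickly on real text (ambiguous boundaries, names, unseen words), which is why a trained segmenter is worth the model loading cost.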
VnCoreNLP is one of the excellent libraries developed for that purpose, and I find it the most accurate. But it’s slow as hell at loading its models, so I’ve created this service as a wrapper that loads the models only once.
It’s a simple Spring Boot application. Check it out here: https://github.com/ndthuan/vi-word-segmenter.
A forked version of the library with some improvements: https://github.com/ndthuan/VnCoreNLP.
Pre-built Docker images: https://hub.docker.com/r/ndthuan/vi-word-segmenter.
Go client: https://github.com/ndthuan/go-vi-wordseg-client.
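
The load-once idea behind the wrapper can be sketched in a few lines of Python. This is only an illustration of the pattern, not the service's actual code (which is a Spring Boot application): `get_model`, `handle_request`, and the sleep standing in for model loading are all hypothetical.

```python
import time
from functools import lru_cache

load_count = 0  # counts how many times the expensive load actually runs


@lru_cache(maxsize=1)
def get_model():
    """Load the (pretend) model on first use, then reuse the cached instance."""
    global load_count
    load_count += 1
    time.sleep(0.1)  # stand-in for slow model loading
    return {"name": "toy-model"}


def handle_request(text):
    """Hypothetical request handler: every call shares the same loaded model."""
    model = get_model()
    # A real handler would run the segmenter here; this one just splits on spaces.
    return {"model": model["name"], "tokens": text.split()}


handle_request("xin chào")
handle_request("cảm ơn")
print(load_count)  # the slow load ran only once across both requests
```

A long-running process pays the loading cost once at startup or on first request, instead of once per invocation as a command-line run would.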