Word Segmentation Service for Vietnamese

Unlike English, Vietnamese is written with a space between every syllable, and many words are made of more than one syllable. When processing Vietnamese text, we cannot just split words on spaces and punctuation.

For example, âm and tính each mean something when standing alone, but together as âm tính they form a single word with a totally different meaning.

We need to split the text into words correctly before other analysis tasks can work properly.
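To make the problem concrete, here is a minimal sketch that compares a naive whitespace split with a toy greedy longest-match segmenter. Everything in it is an assumption for illustration: the `ToySegmenter` class, its three-word lexicon, and the matching strategy are made up, and this is not how VnCoreNLP segments text (it uses a trained model). Only the underscore-joined output style mirrors what VnCoreNLP produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class ToySegmenter {
    // Hypothetical toy lexicon of multi-syllable words (illustration only).
    private static final Set<String> LEXICON = Set.of("kết quả", "xét nghiệm", "âm tính");
    // Longest word length, in syllables, that the matcher will try.
    private static final int MAX_SYLLABLES = 3;

    // Naive approach: every space-separated syllable becomes its own token.
    static List<String> naiveSplit(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Greedy longest match: at each position, try the longest run of
    // syllables found in the lexicon, falling back to a single syllable.
    static List<String> segment(String text) {
        String[] syllables = text.trim().split("\\s+");
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < syllables.length) {
            int matched = 1;
            for (int len = Math.min(MAX_SYLLABLES, syllables.length - i); len > 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(syllables, i, i + len));
                if (LEXICON.contains(candidate.toLowerCase())) {
                    matched = len;
                    break;
                }
            }
            // Join the syllables of one word with underscores.
            words.add(String.join("_", Arrays.copyOfRange(syllables, i, i + matched)));
            i += matched;
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "kết quả xét nghiệm âm tính";
        System.out.println(naiveSplit(text)); // [kết, quả, xét, nghiệm, âm, tính]
        System.out.println(segment(text));    // [kết_quả, xét_nghiệm, âm_tính]
    }
}
```

The naive split yields six tokens, while the segmenter yields three words. A dictionary lookup like this breaks down quickly on ambiguous or unseen text, which is why a trained segmenter is worth the loading cost.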

VnCoreNLP is one of the excellent libraries developed for that purpose, and I find it the most accurate. But it’s slow as hell when loading models, so I’ve created this service as a wrapper that loads the models only once and keeps them in memory.
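The load-once idea can be sketched in plain Java roughly like this. It is not the service's actual code: `SegmenterHolder` and `Model` are hypothetical stand-ins, `Model` does not use VnCoreNLP's API, and the load counter exists only to show that the expensive constructor runs a single time.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SegmenterHolder {
    // Counts how many times the (pretend) expensive model load happened.
    static final AtomicInteger LOADS = new AtomicInteger();

    // Hypothetical stand-in for a slow-to-load model; not VnCoreNLP's API.
    static final class Model {
        Model() {
            // Imagine several seconds of model loading here.
            LOADS.incrementAndGet();
        }

        // Placeholder: a real model would return segmented text.
        String segment(String text) {
            return text.trim();
        }
    }

    // Initialization-on-demand holder: the JVM initializes Lazy exactly once,
    // in a thread-safe way, the first time get() is called.
    private static final class Lazy {
        static final Model INSTANCE = new Model();
    }

    static Model get() {
        return Lazy.INSTANCE;
    }

    public static void main(String[] args) {
        // Simulate three incoming requests sharing one loaded model.
        for (int i = 0; i < 3; i++) {
            get().segment("âm tính");
        }
        System.out.println("model loads: " + LOADS.get()); // model loads: 1
    }
}
```

In a Spring Boot application the same effect usually comes for free, since beans are singleton-scoped by default, so a bean wrapping the loaded model is created once and shared by every request.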

It’s a simple Spring Boot application. Check it out here: https://github.com/ndthuan/vi-word-segmenter.

A forked version of the library with some improvements: https://github.com/ndthuan/VnCoreNLP.

Pre-built Docker images: https://hub.docker.com/r/ndthuan/vi-word-segmenter.

Go client: https://github.com/ndthuan/go-vi-wordseg-client.
