Version: 1.0.0

Word Tokenization

Word tokenization is the process of identifying the boundaries between words in a sentence. Tokenization in general breaks raw text into smaller units called “tokens”, which can be words, subwords, or characters; in this model, a token is a word. Word tokenization is essential to many Natural Language Processing (NLP) pipelines, such as text search and keyword extraction. It is especially important for Thai, which is written without spaces between words: for example, the sentence “ฉันกินข้าว” (“I eat rice”) consists of the three words ฉัน, กิน, and ข้าว, with no explicit boundaries between them.

Base Model - Dictionary based

Provider: PyThaiNLP

We employ PyThaiNLP's dictionary-based word tokenization module¹ for this version of the Thai word tokenization Base model. The dictionary the Base model uses is fixed in advance², so the model may fail to segment sentences containing out-of-vocabulary words (e.g., product names and person names transliterated from foreign languages). We evaluate word segmentation performance on the test set of the VISTEC-TP-TH-2021 corpus³, a collection of 49,997 text samples from Twitter annotated by Thai linguists. A brief local usage sketch follows the notes below.

  1. Repository: PyThaiNLP/nlpo3
  2. The dictionary file that we used is available at PyThaiNLP/pythainlp
  3. VISTEC-TP-TH-2021 corpus is available at OSKut/VISTEC-TP-TH-2021
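As a rough local approximation of the Base model's behavior, the sketch below uses PyThaiNLP's word_tokenize with its dictionary-based newmm engine and default word list. The hosted model is built on nlpo3¹ with the dictionary noted above², so its output may differ slightly.

```python
# A minimal sketch of dictionary-based Thai word tokenization, assuming the
# pythainlp package is installed (pip install pythainlp). This is an
# approximation of the Base model, not the service's exact implementation.
from pythainlp.tokenize import word_tokenize

text = "ฉันกินข้าว"  # "I eat rice", written with no spaces between words

# "newmm" is PyThaiNLP's dictionary-based maximal-matching engine; it uses
# the default PyThaiNLP word list unless a custom dictionary is supplied.
tokens = word_tokenize(text, engine="newmm")
print(tokens)  # ['ฉัน', 'กิน', 'ข้าว']
```

As with the hosted model, words missing from the dictionary (e.g., transliterated product names) may be split incorrectly.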

Authentication

Word Tokenization requires an API key for every API request. Go to VISAI Console - API Key to create and retrieve your API key, then pass it in the following request header:

  • X-API-Key
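A hypothetical request sketch using Python's requests library is shown below. Only the X-API-Key header is documented above; the endpoint URL and JSON payload shape are illustrative assumptions, so consult the VISAI API reference for the actual path and schema.

```python
# A hypothetical request sketch. Only the X-API-Key header is documented;
# the endpoint URL and JSON body below are illustrative assumptions.
import requests

API_KEY = "your-api-key"  # created in VISAI Console - API Key
URL = "https://api.example.com/word-tokenization"  # hypothetical endpoint

response = requests.post(
    URL,
    headers={"X-API-Key": API_KEY},  # authenticate the request
    json={"text": "ฉันกินข้าว"},      # hypothetical payload shape
)
response.raise_for_status()
print(response.json())
```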