Version: 1.0.0

Speech Segmentation

Speech Segmentation, also known as Voice Activity Detection (VAD), is the task of detecting which parts of an audio signal contain human speech and which do not. It is widely used as a preprocessing step in speech processing tasks such as Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER).

Base Model - VISAI Speech Segmentation

We fine-tuned a model from NeMo¹ to support the Thai language. The model was trained on speech data from Thai SER² and background data from MUSAN³ and ChMusic⁴. The evaluation set was split from the same sources as the training data and rebalanced so that each class has the same number of segments. When background noise is very loud, the model can fail to detect some words.
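The rebalancing step mentioned above can be illustrated with a minimal sketch. It assumes rebalancing is done by randomly downsampling every class to the size of the smallest one; the source does not state which method was actually used. The function name `rebalance` and the labels "speech" and "background" are hypothetical.

```python
import random
from collections import defaultdict

def rebalance(segments, seed=0):
    """Downsample (item, label) pairs so every class has the same count.

    Assumption: each class is reduced to the size of the smallest class.
    This is one common approach, not necessarily the one used for this model.
    """
    by_label = defaultdict(list)
    for item, label in segments:
        by_label[label].append(item)
    if not by_label:
        return []
    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    balanced = []
    for label, items in by_label.items():
        for item in rng.sample(items, target):
            balanced.append((item, label))
    return balanced

# Hypothetical evaluation segments: 5 speech, 2 background.
segments = [(f"s{i}.wav", "speech") for i in range(5)]
segments += [(f"b{i}.wav", "background") for i in range(2)]
balanced = rebalance(segments)
```

With the sample data, `balanced` holds two segments per class, so per-class accuracy is not dominated by the larger class.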

Authentication

Speech Segmentation requires an API key for API requests. Go to VISAI Console - API Key to create and retrieve your API key.

  • X-API-Key
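
As a rough sketch of how the key might be attached to a request, the example below builds an HTTP request that carries the key in the X-API-Key header. The endpoint URL, the POST method, and the raw-bytes request body are all placeholders assumed for illustration; they are not given in this section, so check the API reference for the actual values. The helper names `build_headers` and `build_request` are hypothetical.

```python
import urllib.request

# Placeholder endpoint: the real URL is not specified in this section.
EXAMPLE_URL = "https://example.invalid/speech-segmentation"

def build_headers(api_key: str) -> dict:
    """Return request headers carrying the API key in X-API-Key."""
    if not api_key:
        raise ValueError("API key must not be empty")
    return {"X-API-Key": api_key}

def build_request(api_key: str, audio_bytes: bytes,
                  url: str = EXAMPLE_URL) -> urllib.request.Request:
    """Build (but do not send) a request with the authentication header.

    Assumption: the audio is sent as the raw POST body. The real API may
    expect a different method or encoding, such as multipart form data.
    """
    return urllib.request.Request(
        url,
        data=audio_bytes,
        headers=build_headers(api_key),
        method="POST",
    )

request = build_request("YOUR_API_KEY", b"\x00\x01")
```

Passing `request` to `urllib.request.urlopen` would send it. Reading the key from an environment variable rather than hard-coding it keeps it out of source control.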