Improving Vietnamese Short Text Classification with Pretrained Transformers

Iftikhar Ahmad1
1University of Agriculture Faisalabad (UAF), Pakistan
DOI: https://doi.org/10.71448/bcds2453-4
Published: 30/09/2024
Cite this article as: Iftikhar Ahmad. Improving Vietnamese Short Text Classification with Pretrained Transformers. Bulletin of Computer and Data Sciences, Volume 5 Issue 3. Page: 49-59.

Abstract

Short Vietnamese social media posts are challenging to classify due to their brevity, informal spelling, and domain-specific vocabulary. Prior work has shown that a distributed pipeline combining TF–IDF features and Naïve Bayes achieves competitive performance on this task while scaling to tens of thousands of Facebook messages [1]. However, that framework predates recent advances in large pre-trained language models for Vietnamese such as PhoBERT and multilingual Transformers. In this paper, we present a systematic empirical study of contextual Transformer-based models for Vietnamese short text classification, using the same five topical categories (Sports, News, Traveling, Sales, Technology) and extending the original dataset with additional pages and posts. We compare fine-tuned PhoBERT and XLM-RoBERTa against strong bag-of-words baselines, including TF–IDF with Multinomial Naïve Bayes and linear SVM. Our results show that Transformer models improve macro-F1 by approximately 7.2 points over the best TF–IDF baseline, with the largest gains on the hardest class (Sales). We further analyze the impact of tokenization choices (word-segmentation vs. subword units) and model size on accuracy and inference latency. The study confirms that pre-trained contextual representations substantially narrow the performance gap between Vietnamese and higher-resourced languages on short-text classification, while remaining feasible for deployment in latency-sensitive applications.

Keywords: Vietnamese NLP, short text classification, PhoBERT, XLM-R, Naïve Bayes, social media analytics, transformer models
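The TF–IDF with Multinomial Naïve Bayes baseline referenced in the abstract can be illustrated in miniature. The sketch below uses toy, made-up tokens standing in for the five-category Facebook data and a dependency-free pure-Python implementation; the actual baseline in [1] ran as a distributed pipeline over the full dataset:

```python
import math
from collections import Counter, defaultdict

# Toy corpus standing in for the Facebook posts; labels are two of the
# paper's five categories (illustrative data only, not the real dataset).
train = [
    ("doi bong da thang lon hom nay", "Sports"),
    ("tran dau vong loai world cup", "Sports"),
    ("giam gia 50 phan tram mua ngay", "Sales"),
    ("khuyen mai lon don hang online", "Sales"),
]

def tf_idf_vectors(docs):
    """Smoothed TF-IDF weights for a list of token lists."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: f * idf[t] for t, f in Counter(d).items()} for d in docs], idf

docs = [text.split() for text, _ in train]
vecs, idf = tf_idf_vectors(docs)

# Multinomial Naive Bayes fit on the TF-IDF weights, Laplace smoothing.
class_totals = defaultdict(float)                 # total weight per class
class_word = defaultdict(lambda: defaultdict(float))
for vec, (_, label) in zip(vecs, train):
    for t, w in vec.items():
        class_word[label][t] += w
        class_totals[label] += w

vocab = set(idf)

def predict(text):
    tokens = text.split()
    best, best_lp = None, float("-inf")
    for c in class_word:
        # log prior + sum of smoothed log likelihoods
        lp = math.log(sum(1 for _, l in train if l == c) / len(train))
        denom = class_totals[c] + len(vocab)
        for t in tokens:
            lp += math.log((class_word[c].get(t, 0.0) + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(predict("khuyen mai giam gia"))  # classified as Sales
```

In practice the same pipeline is typically built with library vectorizers and classifiers rather than by hand; the point here is only the shape of the baseline the Transformer models are compared against.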




Copyright © 2024 Iftikhar Ahmad. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
