Short Vietnamese social media posts are challenging to classify due to their brevity, informal spelling, and domain-specific vocabulary. Prior work has shown that a distributed pipeline combining TF–IDF features and Naïve Bayes achieves competitive performance on this task while scaling to tens of thousands of Facebook messages [1]. However, that framework predates recent advances in large pre-trained language models for Vietnamese such as PhoBERT and multilingual Transformers. In this paper, we present a systematic empirical study of contextual Transformer-based models for Vietnamese short text classification, using the same five topical categories (Sports, News, Traveling, Sales, Technology) and extending the original dataset with additional pages and posts. We compare fine-tuned PhoBERT and XLM-RoBERTa against strong bag-of-words baselines, including TF–IDF with Multinomial Naïve Bayes and linear SVM. Our results show that Transformer models improve macro-F1 by approximately 7.2 points over the best TF–IDF baseline, with the largest gains on the hardest class (Sales). We further analyze the impact of tokenization choices (word-segmentation vs. subword units) and model size on accuracy and inference latency. The study confirms that pre-trained contextual representations substantially narrow the performance gap between Vietnamese and higher-resourced languages on short-text classification, while remaining feasible for deployment in latency-sensitive applications.
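The TF–IDF with Multinomial Naïve Bayes baseline mentioned above can be sketched in a few lines with scikit-learn. This is a minimal illustration only, not the paper's actual pipeline: the example posts, labels, and n-gram setting below are hypothetical placeholders, and the real study uses a much larger Facebook corpus across five categories.

```python
# Minimal sketch of a TF-IDF + Multinomial Naive Bayes text classifier,
# the kind of bag-of-words baseline the abstract compares against.
# The toy posts and labels are illustrative, not data from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny training set covering two of the five categories (Sports, Sales).
train_texts = [
    "doi tuyen bong da thang tran toi qua",   # football win -> Sports
    "tran dau hap dan giua hai doi bong",     # exciting match -> Sports
    "giam gia 50% cho don hang hom nay",      # 50% discount -> Sales
    "mua 1 tang 1 freeship toan quoc",        # buy 1 get 1 -> Sales
]
train_labels = ["Sports", "Sports", "Sales", "Sales"]

# Unigram + bigram TF-IDF features feed directly into Naive Bayes.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_texts, train_labels)

# A new post mentioning a weekend discount should land in Sales.
pred = model.predict(["khuyen mai giam gia cuoi tuan"])[0]
```

In a realistic setting one would add Vietnamese-specific preprocessing (e.g. word segmentation, the tokenization choice the study analyzes) before vectorization, and evaluate with macro-F1 rather than raw accuracy given the class imbalance implied by the hard Sales category.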