Self-Supervised Code-Mixed Representation Learning for Multi-Label Emoji, Sentiment, and Emotion Prediction

Asif Shehzad¹, Rahul Sharma¹, Pushpak Bhattacharyya¹
¹Department of Computer Science and Engineering, Indian Institute of Technology Bombay
DOI: https://doi.org/10.71448/bcds2561-3
Published: 30/03/2025
Cite this article as: Asif Shehzad, Rahul Sharma, Pushpak Bhattacharyya. Self-Supervised Code-Mixed Representation Learning for Multi-Label Emoji, Sentiment, and Emotion Prediction. Bulletin of Computer and Data Sciences, Volume 6 Issue 1. Page: 39-60.

Abstract

Emoji usage and code-mixed language have become central to informal online communication, especially in multilingual communities. Predicting suitable emojis and inferring sentiment and emotion from noisy, code-mixed social media text is challenging due to non-standard spelling, cross-script mixing, and strong pragmatic effects. Prior work has proposed specialized encoders and multi-task frameworks for joint prediction of multi-label emojis, sentiment, and emotion on English–Hindi code-mixed tweets. However, these approaches are built on relatively shallow architectures and do not fully exploit recent advances in self-supervised and large language model (LLM)-based representation learning. In this paper, we introduce CodeMixLM, a self-supervised transformer model specialized for Hinglish code-mixed text. Starting from a multilingual transformer backbone, we continue pretraining on large-scale unlabeled code-mixed social media data with three auxiliary objectives: masked span denoising, script-aware language identification, and emoji-aware contrastive learning. We then fine-tune CodeMixLM in a multi-task setting for (i) multi-label emoji prediction, (ii) three-way sentiment classification, and (iii) seven-way emotion classification on the SENTIMOJI dataset of English–Hindi code-mixed tweets. Across all tasks, CodeMixLM substantially outperforms prior task-specific architectures and strong multilingual transformer baselines, improving macro-F1 for emoji prediction by up to several points while also yielding better calibration and label efficiency under reduced supervision. Detailed analyses show that self-supervised code-mixed pretraining (1) improves robustness to spelling variants and code-switching patterns, and (2) better captures the interaction between emojis, sentiment, and emotion. Our results highlight the importance of domain-specialized self-supervised learning for code-mixed NLP and offer a stronger baseline for future work on emoji-aware affective computing.

Keywords: code-mixed NLP, emoji prediction, sentiment and emotion classification, self-supervised transformer learning
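
To make the multi-task setup in the abstract concrete, the following is a minimal sketch of a shared-encoder model with three task heads: binary cross-entropy over emoji logits for the multi-label emoji task, and standard cross-entropy for the three-way sentiment and seven-way emotion tasks. The hidden size (768), the number of emoji labels (64), and the unweighted sum of the three losses are illustrative assumptions, not details taken from the paper.

    # Minimal sketch of the multi-task fine-tuning heads described in the
    # abstract. Hidden size, emoji label count, and equal loss weighting
    # are assumptions for illustration, not the authors' implementation.
    import torch
    import torch.nn as nn

    class MultiTaskHead(nn.Module):
        def __init__(self, hidden_size=768, num_emojis=64,
                     num_sentiments=3, num_emotions=7):
            super().__init__()
            # Multi-label emoji prediction: one independent logit per emoji.
            self.emoji_head = nn.Linear(hidden_size, num_emojis)
            # Single-label classification heads for sentiment and emotion.
            self.sentiment_head = nn.Linear(hidden_size, num_sentiments)
            self.emotion_head = nn.Linear(hidden_size, num_emotions)
            self.bce = nn.BCEWithLogitsLoss()  # multi-label emoji loss
            self.ce = nn.CrossEntropyLoss()    # sentiment / emotion losses

        def forward(self, pooled, emoji_y, sentiment_y, emotion_y):
            # pooled: (batch, hidden_size) sentence representation from
            # the shared CodeMixLM-style encoder (not shown here).
            return (self.bce(self.emoji_head(pooled), emoji_y)
                    + self.ce(self.sentiment_head(pooled), sentiment_y)
                    + self.ce(self.emotion_head(pooled), emotion_y))

    # Toy usage with random features standing in for encoder outputs.
    head = MultiTaskHead()
    pooled = torch.randn(4, 768)
    emoji_y = torch.randint(0, 2, (4, 64)).float()  # multi-hot emoji labels
    sentiment_y = torch.randint(0, 3, (4,))
    emotion_y = torch.randint(0, 7, (4,))
    print(head(pooled, emoji_y, sentiment_y, emotion_y))

Sharing a single pooled representation across the three heads is one simple way to realize the joint prediction the abstract describes; per-task loss weights or task-specific pooling would be natural variations.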


Copyright

Copyright © 2025 Asif Shehzad, Rahul Sharma, Pushpak Bhattacharyya. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
