Self-Supervised Code-Mixed Representation Learning for Multi-Label Emoji, Sentiment, and Emotion Prediction

Asif Shehzad¹, Rahul Sharma¹, Pushpak Bhattacharyya¹
¹Department of Computer Science and Engineering, Indian Institute of Technology Bombay
DOI: https://doi.org/10.71448/bcds2561-3
Published: 30/03/2025
Cite this article as: Asif Shehzad, Rahul Sharma, Pushpak Bhattacharyya. Self-Supervised Code-Mixed Representation Learning for Multi-Label Emoji, Sentiment, and Emotion Prediction. Bulletin of Computer and Data Sciences, Volume 6 Issue 1. Page: 39-60.

Abstract

Emoji usage and code-mixed language have become central to informal online communication, especially in multilingual communities. Predicting suitable emojis and inferring sentiment and emotion from noisy, code-mixed social media text is challenging due to non-standard spelling, cross-script mixing, and strong pragmatic effects. Prior work has proposed specialized encoders and multi-task frameworks for joint prediction of multi-label emojis, sentiment, and emotion on English–Hindi code-mixed tweets. However, these approaches are built on relatively shallow architectures and do not fully exploit recent advances in self-supervised and large language model (LLM)-based representation learning. In this paper, we introduce CodeMixLM, a self-supervised transformer model specialized for Hinglish code-mixed text. Starting from a multilingual transformer backbone, we continue pretraining on large-scale unlabeled code-mixed social media data with three auxiliary objectives: masked span denoising, script-aware language identification, and emoji-aware contrastive learning. We then fine-tune CodeMixLM in a multi-task setting for (i) multi-label emoji prediction, (ii) three-way sentiment classification, and (iii) seven-way emotion classification on the SENTIMOJI dataset of English–Hindi code-mixed tweets. Across all tasks, CodeMixLM substantially outperforms prior task-specific architectures and strong multilingual transformer baselines, improving macro-F1 for emoji prediction by up to several points while also yielding better calibration and label efficiency under reduced supervision. Detailed analyses show that self-supervised code-mixed pretraining (1) improves robustness to spelling variants and code-switching patterns, and (2) better captures the interaction between emojis, sentiment, and emotion. Our results highlight the importance of domain-specialized self-supervised learning for code-mixed NLP and offer a stronger baseline for future work on emoji-aware affective computing.

Keywords: code-mixed NLP, emoji prediction, sentiment and emotion classification, self-supervised transformer learning
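
To make the multi-task setup in the abstract concrete, the following is a minimal sketch of a shared-encoder model with three task heads: binary cross-entropy over emoji logits for the multi-label emoji task, and standard cross-entropy for the three-way sentiment and seven-way emotion tasks. The hidden size (768), the number of emoji labels (64), and the unweighted sum of the three losses are illustrative assumptions, not details taken from the paper.

    # Minimal sketch of the multi-task fine-tuning heads described in the
    # abstract. Hidden size, emoji label count, and equal loss weighting
    # are assumptions for illustration, not the authors' implementation.
    import torch
    import torch.nn as nn

    class MultiTaskHead(nn.Module):
        def __init__(self, hidden_size=768, num_emojis=64,
                     num_sentiments=3, num_emotions=7):
            super().__init__()
            # Multi-label emoji prediction: one independent logit per emoji.
            self.emoji_head = nn.Linear(hidden_size, num_emojis)
            # Single-label classification heads for sentiment and emotion.
            self.sentiment_head = nn.Linear(hidden_size, num_sentiments)
            self.emotion_head = nn.Linear(hidden_size, num_emotions)
            self.bce = nn.BCEWithLogitsLoss()  # multi-label emoji loss
            self.ce = nn.CrossEntropyLoss()    # sentiment / emotion losses

        def forward(self, pooled, emoji_y, sentiment_y, emotion_y):
            # pooled: (batch, hidden_size) sentence representation from
            # the shared CodeMixLM-style encoder (not shown here).
            return (self.bce(self.emoji_head(pooled), emoji_y)
                    + self.ce(self.sentiment_head(pooled), sentiment_y)
                    + self.ce(self.emotion_head(pooled), emotion_y))

    # Toy usage with random features standing in for encoder outputs.
    head = MultiTaskHead()
    pooled = torch.randn(4, 768)
    emoji_y = torch.randint(0, 2, (4, 64)).float()  # multi-hot emoji labels
    sentiment_y = torch.randint(0, 3, (4,))
    emotion_y = torch.randint(0, 7, (4,))
    print(head(pooled, emoji_y, sentiment_y, emotion_y))

Sharing a single pooled representation across the three heads is one simple way to realize the joint prediction the abstract describes; per-task loss weights or task-specific pooling would be natural variations.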


Copyright

Copyright © 2025 Asif Shehzad, Rahul Sharma, Pushpak Bhattacharyya. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
