Emoji usage and code-mixed language have become central to informal online communication, especially in multilingual communities. Predicting suitable emojis and inferring sentiment and emotion from noisy, code-mixed social media text is challenging due to non-standard spelling, cross-script mixing, and strong pragmatic effects. Prior work has proposed specialized encoders and multi-task frameworks for joint prediction of multi-label emojis, sentiment, and emotion on English–Hindi code-mixed tweets. However, these approaches are built on relatively shallow architectures and do not fully exploit recent advances in self-supervised and large language model (LLM)-based representation learning. In this paper, we introduce CodeMixLM, a self-supervised transformer model specialized for Hinglish code-mixed text. Starting from a multilingual transformer backbone, we continue pretraining on large-scale unlabeled code-mixed social media data with three auxiliary objectives: masked span denoising, script-aware language identification, and emoji-aware contrastive learning. We then fine-tune CodeMixLM in a multi-task setting for (i) multi-label emoji prediction, (ii) three-way sentiment classification, and (iii) seven-way emotion classification on the SENTIMOJI dataset of English–Hindi code-mixed tweets. Across all tasks, CodeMixLM substantially outperforms prior task-specific architectures and strong multilingual transformer baselines, improving macro-F1 for emoji prediction by up to several points while also yielding better calibration and greater label efficiency under reduced supervision. Detailed analyses show that self-supervised code-mixed pretraining (1) improves robustness to spelling variants and code-switching patterns, and (2) better captures the interaction between emojis, sentiment, and emotion. Our results highlight the importance of domain-specialized self-supervised learning for code-mixed NLP and offer a stronger baseline for future work on emoji-aware affective computing.
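To make the emoji-aware contrastive objective mentioned above concrete, the following is a minimal sketch under our own assumptions; the abstract does not specify the loss, so we illustrate it as a symmetric InfoNCE objective between pooled tweet embeddings and embeddings of emojis that co-occur with each tweet. All names here (emoji_contrastive_loss, the temperature value, the batch layout) are hypothetical, not the paper's published implementation.

```python
# Hypothetical sketch of an emoji-aware contrastive objective (InfoNCE),
# assuming each tweet in the batch is paired with one co-occurring emoji.
import torch
import torch.nn.functional as F


def emoji_contrastive_loss(text_emb: torch.Tensor,
                           emoji_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (tweet, co-occurring emoji) pairs.

    text_emb:  (B, d) pooled encoder outputs for B tweets.
    emoji_emb: (B, d) embeddings of one emoji observed in each tweet.
    In-batch negatives: tweet i is contrasted against the emojis of all
    other tweets j != i, and vice versa.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    emoji_emb = F.normalize(emoji_emb, dim=-1)
    # (B, B) cosine-similarity logits, scaled by temperature.
    logits = text_emb @ emoji_emb.t() / temperature
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Average the tweet-to-emoji and emoji-to-tweet directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    B, d = 8, 256
    loss = emoji_contrastive_loss(torch.randn(B, d), torch.randn(B, d))
    print(loss.item())
```

In a continued-pretraining setup of the kind the abstract describes, a loss like this would presumably be combined with the masked span denoising and script-aware language identification objectives as a weighted sum; the weighting scheme is not stated in the abstract and is left unspecified here.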