The data is in the following format:

Comment label

Intha padam vantha piragu yellarum Thala ya kondaduvanga positive
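The row layout above can be read with a small parser. This is only a sketch: it assumes the label is the final whitespace-separated token on each row and contains no spaces itself, which the description does not confirm. The function name `parse_line` is hypothetical.

```python
def parse_line(line: str) -> tuple[str, str]:
    """Split one row into (comment, label).

    Assumption: the label is the last whitespace-separated token,
    and everything before it is the comment text.
    """
    # rsplit with maxsplit=1 splits once, starting from the right-hand side.
    comment, label = line.rstrip().rsplit(maxsplit=1)
    return comment, label


row = "Intha padam vantha piragu yellarum Thala ya kondaduvanga positive"
comment, label = parse_line(row)
print(label)
```

A row with no whitespace at all would raise a `ValueError` on unpacking, so real files may need a guard for malformed or header lines.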

Tamil-English: Train: 35,657 Validation: 3,963 and Test: 4,403

Malayalam-English: Train: 15,889 Validation: 1,767 and Test: 1,963

Kannada-English: Train: 6,213 Validation: 692 and Test: 768
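As a quick check on the split sizes listed above, the totals and the percentage share of each split can be computed directly from the counts. The dictionary layout and the helper name `split_shares` are illustrative choices, not part of the dataset release.

```python
# Split sizes copied from the counts listed above.
splits = {
    "Tamil-English": {"train": 35657, "validation": 3963, "test": 4403},
    "Malayalam-English": {"train": 15889, "validation": 1767, "test": 1963},
    "Kannada-English": {"train": 6213, "validation": 692, "test": 768},
}


def split_shares(counts: dict[str, int]) -> tuple[int, dict[str, float]]:
    """Return the total number of comments and each split's share in percent."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
    return total, shares


for pair, counts in splits.items():
    total, shares = split_shares(counts)
    print(pair, total, shares)
```

For all three language pairs the shares come out at roughly 81% train, 9% validation and 10% test.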

We present a dataset of YouTube video comments in Tamil-English, Kannada-English and Malayalam-English. The dataset contains all three types of code-mixed sentences: inter-sentential switching, intra-sentential switching and tag switching. Most comments were written in native script and Roman script, with either Tamil / Malayalam / Kannada grammar and an English lexicon, or English grammar and a Tamil / Malayalam / Kannada lexicon. Some comments were written in Tamil / Malayalam / Kannada script with English expressions in between.

To get the full data, register at the CodaLab link.

More details about the dataset are in the papers "A Sentiment Analysis Dataset for Code-Mixed Malayalam-English" and "Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text".


IIIT Tiruchirappalli
Madurai Kamaraj University, India
Eastern University, Sri Lanka
Insight SFI Research Centre for Data Analytics
Data Science Institute
NUI Galway
IIITM-K Trivandrum
University of Moratuwa, Sri Lanka
Sri Sivasubramaniya Nadar (SSN) Institutions, India
Thomson Reuters