Overview

Offensive language detection is a critical task in natural language processing, particularly in the context of online discourse, where harmful content can spread rapidly. Identifying offensive language is challenging because offense is conveyed in varied ways, including subtle linguistic cues, code-mixing, and cultural context. Code-mixing is prevalent in multilingual communities, and code-mixed text is sometimes written in non-native scripts. Systems trained on monolingual data often fail on code-mixed data because code-switching occurs at different linguistic levels of the text.
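To make the script-mixing issue concrete, the following is a minimal sketch that reports which scripts appear in a comment. It is not part of the task's tooling: the function names `script_of` and `script_profile` are hypothetical, and the table covers only the Tamil, Kannada, and Malayalam Unicode blocks plus ASCII Latin letters. Extend the table as the data requires.

```python
from collections import Counter

# Unicode block ranges for three scripts (assumption: only these blocks matter here).
SCRIPT_RANGES = {
    "Tamil": (0x0B80, 0x0BFF),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def script_of(ch):
    """Return a script name for one character, or None if it is not covered."""
    if ch.isascii() and ch.isalpha():
        return "Latin"
    code_point = ord(ch)
    for name, (low, high) in SCRIPT_RANGES.items():
        if low <= code_point <= high:
            return name
    return None  # digits, punctuation, emoji, whitespace, other scripts


def script_profile(text):
    """Return the share of covered characters that belong to each script."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {script: count / total for script, count in counts.items()}
```

A comment that mixes Tamil script with English words yields a profile with both "Tamil" and "Latin" entries. Note that this profiles scripts, not languages: a Tamil sentence written in Latin letters is counted entirely as "Latin", which is exactly the non-native-script case that makes the task hard.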

This shared task presents a corpus for offensive language identification in code-mixed text in Dravidian languages (Tamil-English, Malayalam-English, Kannada-English, and Tulu-English). The task is further complicated for low-resource languages, for which few annotated datasets for offensive language detection exist.

The dataset is a gold-standard resource for offensive language detection in Tamil, Malayalam, Kannada, and Tulu that enables researchers to develop robust classification models. It consists of social media comments and posts, each categorized into one of four classes:

Not Offensive (NO): Content without any offensive elements.
Offensive Untargeted (OU): Offensive content that is not directed at a specific individual or entity.
Offensive Targeted (OT): Direct attacks on an individual or group, including hate speech targeting a community, ethnicity, caste, or gender.
Not Tamil/Not Malayalam/Not Kannada/Not Tulu (NT): Content that is not in the target language of the respective dataset (Tamil, Malayalam, Kannada, or Tulu).
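
The four classes above can be represented as a small label scheme. This is a sketch only: the short codes NO, OU, OT, and NT come from the list above, but the exact label strings in the released files may differ, and the names `OffenseLabel`, `LABEL_TO_ID`, and `encode_labels` are hypothetical.

```python
from enum import Enum


class OffenseLabel(Enum):
    """The four classes of the task, keyed by the short codes used above."""
    NOT_OFFENSIVE = "NO"
    OFFENSIVE_UNTARGETED = "OU"
    OFFENSIVE_TARGETED = "OT"
    NOT_IN_LANGUAGE = "NT"  # Not Tamil / Not Malayalam / Not Kannada / Not Tulu


# Integer ids follow the declaration order of the enum members.
LABEL_TO_ID = {label.value: idx for idx, label in enumerate(OffenseLabel)}
ID_TO_LABEL = {idx: code for code, idx in LABEL_TO_ID.items()}


def encode_labels(codes):
    """Map short codes to integer ids, rejecting anything outside the scheme."""
    ids = []
    for code in codes:
        key = code.strip().upper()
        if key not in LABEL_TO_ID:
            raise ValueError(f"unknown label code: {code!r}")
        ids.append(LABEL_TO_ID[key])
    return ids
```

Failing loudly on an unknown code, rather than mapping it to a default class, helps catch label-format mismatches before training.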

The primary goal of this shared task is to build and evaluate systems that can automatically classify social media text into these four categories. Participants will be provided with training, development, and test datasets to develop their models. Given the real-world class imbalance in offensive content, models must be designed to handle the skewed distribution of data effectively.
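
One common way to handle a skewed label distribution is to weight each class inversely to its frequency in the training data. The sketch below is an illustration, not a method prescribed by the organizers; the function name `inverse_frequency_weights` is hypothetical, and the toy label counts in the usage lines are invented and do not reflect the real dataset.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive weights above 1 and frequent classes below 1,
    so a weighted loss pays more attention to the minority classes.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Toy distribution, invented purely for illustration.
toy_labels = ["NO"] * 6 + ["OU"] * 2 + ["OT"] + ["NT"]
toy_weights = inverse_frequency_weights(toy_labels)
```

The resulting dictionary can be passed to any training routine that accepts per-class weights. Resampling or threshold tuning are alternative ways to address the same imbalance.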

As far as we know, this is the first shared task on offensive language detection in Tulu. By organizing this task, we aim to foster research in under-resourced languages, improve computational approaches for offense detection in multilingual and code-mixed settings, and contribute to the responsible use of AI in moderating harmful content online.

Broad Categories

Natural Language Processing (NLP)
Machine Learning (ML)

Use Cases

Offensive language detection is a crucial task in natural language processing, particularly in the era of digital communication, where social media platforms serve as primary spaces for public discourse. The rise of online abuse, cyberbullying, hate speech, and toxic interactions necessitates automated systems that can detect and mitigate offensive content. The challenge is compounded by code-mixing of the Dravidian languages with English and other languages, and by the use of non-native scripts, both of which make traditional offensive language detection models ineffective.

By releasing a gold-standard dataset of offensive comments collected from YouTube discussions on news, entertainment, and social issues, this shared task aims to:

Enable researchers to develop robust offensive language detection models for the Dravidian languages Malayalam, Tamil, Kannada, and Tulu.
Assist social media platforms in moderating harmful content and ensuring safe online interactions.
Support law enforcement agencies and policymakers in identifying and mitigating the spread of hate speech, cyberbullying, and targeted abuse.
Help businesses and brands monitor online discourse related to their products, ensuring that discussions remain civil and constructive.

This shared task will bring together academia, industry, and social media platforms to develop AI-driven solutions for identifying offensive content in the Dravidian languages, contributing to safer and more inclusive digital spaces.