Dataset

Each record pairs a comment with its label, in the following format:

Comment	Label
actor na pudar surendra bantwal.thulunaddha maryadi depwer	non-offensive
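
As a minimal sketch of reading records in this format, the Python snippet below parses the sample row shown above. It assumes the distributed files are tab-separated and start with a Comment/Label header line, which may not hold for every language split. The names SAMPLE and read_examples are hypothetical helpers, not part of the task's tooling.

```python
import csv
import io

# Sample in the format shown above: a header line, then one record per line.
# Assumption: columns are separated by a single tab character.
SAMPLE = (
    "Comment\tLabel\n"
    "actor na pudar surendra bantwal.thulunaddha maryadi depwer\tnon-offensive\n"
)


def read_examples(handle):
    """Return a list of {"Comment": ..., "Label": ...} dicts from a TSV stream."""
    # QUOTE_NONE keeps any quotation marks inside comments as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]


examples = read_examples(io.StringIO(SAMPLE))
print(examples[0]["Label"])
```

To read a real file, pass an open file handle (for example, one opened with encoding="utf-8" and newline="") in place of the io.StringIO object.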

Languages

Language	Train	Development	Test	Total
Tamil	35,139	4,388	4,392	43,919
Malayalam	16,010	1,999	2,001	20,010
Kannada	6,217	777	778	7,772
Tulu	2,692	577	576	3,845

Evaluation Plan

Classification systems will be evaluated on macro-averaged precision, macro-averaged recall, and macro-averaged F-score across all classes. Participants are encouraged to check their systems with scikit-learn's classification_report:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
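
To illustrate what macro averaging computes, the dependency-free Python sketch below scores each class separately and then takes the unweighted mean over classes, so rare classes count as much as frequent ones. It is not the organizers' scoring script. The function name macro_scores, the label "offensive", and the toy predictions are assumptions made for the example. On ordinary inputs the results should correspond to the "macro avg" row of scikit-learn's classification_report.

```python
def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1) over all observed classes."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios (zero denominator) are scored as 0.0 in this sketch.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data; the label "offensive" is assumed for illustration only.
y_true = ["offensive", "non-offensive", "non-offensive", "non-offensive"]
y_pred = ["offensive", "offensive", "non-offensive", "non-offensive"]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
```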

Participants must submit their predictions as a single tab-separated file named predictions.csv. The file should contain a column named ID (if the dataset provides one) and a class label column holding the predictions.
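
A submission file of this shape could be written as in the Python sketch below. The header name "Label" for the prediction column, the IDs, and the helper write_predictions are assumptions for illustration, so check the task page for the exact column names expected. The demo writes to a temporary directory so it does not overwrite a real predictions.csv.

```python
import csv
import os
import tempfile


def write_predictions(path, labels, ids=None):
    """Write predictions as tab-separated rows, with an ID column if ids is given."""
    if ids is not None and len(ids) != len(labels):
        raise ValueError("ids and labels must have the same length")
    # Assumption: the prediction column is headed "Label".
    header = ["Label"] if ids is None else ["ID", "Label"]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        if ids is None:
            for label in labels:
                writer.writerow([label])
        else:
            for row in zip(ids, labels):
                writer.writerow(row)


# Demo with made-up IDs and labels, written to a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "predictions.csv")
write_predictions(out_path, ["non-offensive", "offensive"], ids=["id_1", "id_2"])
with open(out_path, encoding="utf-8") as handle:
    written = handle.read().splitlines()
print(written)
```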

To get the full data, register at Codabench: https://www.codabench.org/competitions/8494/