Alan turing portal (#815)
* [alan-turing-portal][m] - initial commit
* [alan-turing][m] - first page with search
* [alan-turing][m] - cleanup
@@ -0,0 +1,14 @@
---
title: "Abusive Language Detection on Arabic Social Media (Al Jazeera)"
link-to-publication: https://www.aclweb.org/anthology/W17-3008
link-to-data: http://alt.qcri.org/~hmubarak/offensive/AJCommentsClassification-CF.xlsx
task-description: Ternary (Obscene, Offensive but not obscene, Clean)
details-of-task: Incivility
size-of-dataset: 32000
percentage-abusive: 0.81
language: Arabic
level-of-annotation: ["Posts"]
platform: ["AlJazeera"]
medium: ["Text"]
reference: "Mubarak, H., Darwish, K. and Magdy, W., 2017. Abusive Language Detection on Arabic Social Media. In: Proceedings of the First Workshop on Abusive Language Online. Vancouver, Canada: Association for Computational Linguistics, pp.52-56."
---
@@ -0,0 +1,14 @@
---
title: Detecting Abusive Albanian
link-to-publication: https://arxiv.org/abs/2107.13592
link-to-data: https://doi.org/10.6084/m9.figshare.19333298.v1
task-description: Hierarchical (offensive/not; untargeted/targeted; person/group/other)
details-of-task: Detect and categorise abusive language in social media data
size-of-dataset: 11874
percentage-abusive: 13.2
language: Albanian
level-of-annotation: ["Posts"]
platform: ["Instagram", "Youtube"]
medium: ["Text"]
reference: "Nurce, E., Keci, J., Derczynski, L., 2021. Detecting Abusive Albanian. arXiv:2107.13592"
---
@@ -0,0 +1,14 @@
---
title: "Hate Speech Detection in the Bengali language: A Dataset and its Baseline Evaluation"
link-to-publication: https://arxiv.org/pdf/2012.09686.pdf
link-to-data: https://www.kaggle.com/naurosromim/bengali-hate-speech-dataset
task-description: Binary (hateful, not)
details-of-task: "Several categories: sports, entertainment, crime, religion, politics, celebrity and meme"
size-of-dataset: 30000
percentage-abusive: 0.33
language: Bengali
level-of-annotation: ["Posts"]
platform: ["Youtube", "Facebook"]
medium: ["Text"]
reference: "Romim, N., Ahmed, M., Talukder, H., & Islam, M. S. (2021). Hate speech detection in the bengali language: A dataset and its baseline evaluation. In Proceedings of International Joint Conference on Advances in Computational Intelligence (pp. 457-468). Springer, Singapore."
---
9
examples/alan-turing-portal/content/index.md
Normal file
@@ -0,0 +1,9 @@
This page catalogues datasets annotated for hate speech, online abuse, and offensive language. They may be useful for e.g. training a natural language processing system to detect this language.

The list is maintained by Leon Derczynski, Bertie Vidgen, Hannah Rose Kirk, Pica Johansson, Yi-Ling Chung, Mads Guldborg Kjeldgaard Kongsbak, Laila Sprejer, and Philine Zeinert.

We provide a list of datasets and keywords. If you would like to contribute to our catalogue or add your dataset, please see the instructions for contributing.

If you use these resources, please cite (and read!) our paper: Directions in Abusive Language Training Data: Garbage In, Garbage Out. And if you would like to find other resources for researching online hate, visit The Alan Turing Institute’s Online Hate Research Hub or read The Alan Turing Institute’s Reading List on Online Hate and Abuse Research.

If you’re looking for a good paper on online hate training datasets (beyond our paper, of course!) then have a look at ‘Resources and benchmark corpora for hate speech detection: a systematic review’ by Poletto et al. in Language Resources and Evaluation.
@@ -0,0 +1,14 @@
---
title: "Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language"
link-to-publication: https://arxiv.org/abs/2103.10195
link-to-data: https://drive.google.com/file/d/1mM2vnjsy7QfUmdVUpKqHRJjZyQobhTrW/view
task-description: Binary (misogyny/none) and Multi-class (none, discredit, derailing, dominance, stereotyping & objectification, threat of violence, sexual harassment, damning)
details-of-task: Introducing an Arabic Levantine Twitter dataset for Misogynistic language
size-of-dataset: 6603
percentage-abusive: 48.76
language: Arabic
level-of-annotation: ["Posts"]
platform: ["Twitter"]
medium: ["Text", "Images"]
reference: "Hala Mulki and Bilal Ghanem. 2021. Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 154–163, Kyiv, Ukraine (Virtual). Association for Computational Linguistics"
---
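
The dataset files added in this commit all use the same flat front-matter layout: one `key: value` pair per line between two `---` delimiters, with list fields written as JSON-style arrays. As a rough illustration of how such a record could be read, here is a minimal sketch using only the Python standard library. The helper name `parse_front_matter` and the shortened `SAMPLE` record are assumptions made for this example and are not part of the portal's code; a real project would more likely use a full YAML parser, since this sketch does not handle nested or multi-line YAML.

```python
import json

# Shortened copy of the Let-Mi record from this commit (illustrative only).
SAMPLE = """---
title: "Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language"
size-of-dataset: 6603
percentage-abusive: 48.76
language: Arabic
platform: ["Twitter"]
medium: ["Text", "Images"]
---
"""


def parse_front_matter(text):
    """Parse a flat `key: value` front-matter block into a dict.

    Hypothetical helper: handles only the single-line fields seen in
    this commit, not general YAML.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("front matter must start with '---'")

    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter
            break
        # Split on the first colon only, so colons inside values survive.
        key, sep, raw = line.partition(":")
        if not sep:
            continue  # skip lines that are not key/value pairs
        raw = raw.strip()
        try:
            # Quoted strings, numbers and ["a", "b"] lists are valid JSON.
            value = json.loads(raw)
        except json.JSONDecodeError:
            # Bare strings (e.g. Arabic, URLs) are kept as-is.
            value = raw
        fields[key.strip()] = value
    return fields


record = parse_front_matter(SAMPLE)
print(record["title"])
print(record["platform"], record["size-of-dataset"])
```

Trying JSON first and falling back to the raw string keeps the sketch short while still turning `size-of-dataset` into a number and `platform` into a list, which is what a search or filter page would typically need.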