[examples/turing] - rename it to turing
5
examples/turing/content/about.md
Normal file
@@ -0,0 +1,5 @@
---
title: About
---

This is an about page, left here as an example
14
examples/turing/content/datasets/abusive-eval-v1-0.md
Normal file
@@ -0,0 +1,14 @@
---
title: AbuseEval v1.0
link-to-publication: http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.760.pdf
link-to-data: https://github.com/tommasoc80/AbuseEval
task-description: Explicitness annotation of offensive and abusive content
details-of-task: "Enriched versions of the OffensEval/OLID dataset with the distinction of explicit/implicit offensive messages and the new dimension for abusive messages. Labels for offensive language: EXPLICIT, IMPLICIT, NOT; Labels for abusive language: EXPLICIT, IMPLICIT, NOTABU"
size-of-dataset: 14100
percentage-abusive: 20.75
language: English
level-of-annotation: ["Tweets"]
platform: ["Twitter"]
medium: ["Text"]
reference: "Caselli, T., Basile, V., Jelena, M., Inga, K., and Michael, G. 2020. \"I feel offended, don’t be abusive! implicit/explicit messages in offensive and abusive language\". The 12th Language Resources and Evaluation Conference (pp. 6193-6202). European Language Resources Association."
---
@@ -0,0 +1,16 @@
---
title: "Abusive Language Detection on Arabic Social Media (Al Jazeera)"
link-to-publication: https://www.aclweb.org/anthology/W17-3008
link-to-data: http://alt.qcri.org/~hmubarak/offensive/AJCommentsClassification-CF.xlsx
task-description: Ternary (Obscene, Offensive but not obscene, Clean)
details-of-task: Incivility
size-of-dataset: 32000
percentage-abusive: 0.81
language: Arabic
level-of-annotation: ["Posts"]
platform: ["AlJazeera"]
medium: ["Text"]
reference: "Mubarak, H., Darwish, K. and Magdy, W., 2017. Abusive Language Detection on Arabic Social Media. In: Proceedings of the First Workshop on Abusive Language Online. Vancouver, Canada: Association for Computational Linguistics, pp.52-56."
---

SOMETHING TEST
@@ -0,0 +1,14 @@
---
title: "CoRAL: a Context-aware Croatian Abusive Language Dataset"
link-to-publication: https://aclanthology.org/2022.findings-aacl.21/
link-to-data: https://github.com/shekharRavi/CoRAL-dataset-Findings-of-the-ACL-AACL-IJCNLP-2022
task-description: Multi-class based on context dependency categories (CDC)
details-of-task: Detecting CDC from abusive comments
size-of-dataset: 2240
percentage-abusive: 100
language: "Croatian"
level-of-annotation: ["Posts"]
platform: ["Posts"]
medium: ["Newspaper Comments"]
reference: "Ravi Shekhar, Mladen Karan and Matthew Purver (2022). CoRAL: a Context-aware Croatian Abusive Language Dataset. Findings of the ACL: AACL-IJCNLP."
---
@@ -0,0 +1,14 @@
---
title: Detecting Abusive Albanian
link-to-publication: https://arxiv.org/abs/2107.13592
link-to-data: https://doi.org/10.6084/m9.figshare.19333298.v1
task-description: Hierarchical (offensive/not; untargeted/targeted; person/group/other)
details-of-task: Detect and categorise abusive language in social media data
size-of-dataset: 11874
percentage-abusive: 13.2
language: Albanian
level-of-annotation: ["Posts"]
platform: ["Instagram", "Youtube"]
medium: ["Text"]
reference: "Nurce, E., Keci, J., Derczynski, L., 2021. Detecting Abusive Albanian. arXiv:2107.13592"
---
@@ -0,0 +1,15 @@
---
title: "Hate Speech Detection in the Bengali language: A Dataset and its Baseline Evaluation"
link-to-publication: https://arxiv.org/pdf/2012.09686.pdf
link-to-data: https://www.kaggle.com/naurosromim/bengali-hate-speech-dataset
task-description: Binary (hateful, not)
details-of-task: "Several categories: sports, entertainment, crime, religion, politics, celebrity and meme"
size-of-dataset: 30000
percentage-abusive: 0.33
language: Bengali
level-of-annotation: ["Posts"]
platform: ["Youtube", "Facebook"]
medium: ["Text"]
reference: "Romim, N., Ahmed, M., Talukder, H., & Islam, M. S. (2021). Hate speech detection in the bengali language: A dataset and its baseline evaluation. In Proceedings of International Joint Conference on Advances in Computational Intelligence (pp. 457-468). Springer, Singapore."
---
@@ -0,0 +1,14 @@
---
title: Large-Scale Hate Speech Detection with Cross-Domain Transfer
link-to-publication: https://aclanthology.org/2022.lrec-1.238/
link-to-data: https://github.com/avaapm/hatespeech
task-description: Three-class (Hate speech, Offensive language, None)
details-of-task: Hate speech detection on social media (Twitter) including 5 target groups (gender, race, religion, politics, sports)
size-of-dataset: "100k English (27593 hate, 30747 offensive, 41660 none)"
percentage-abusive: 58.3
language: English
level-of-annotation: ["Posts"]
platform: ["Twitter"]
medium: ["Text", "Image"]
reference: "Cagri Toraman, Furkan Şahinuç, Eyup Yilmaz. 2022. Large-Scale Hate Speech Detection with Cross-Domain Transfer. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2215–2225, Marseille, France. European Language Resources Association."
---
@@ -0,0 +1,14 @@
---
title: "Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language"
link-to-publication: https://arxiv.org/abs/2103.10195
link-to-data: https://drive.google.com/file/d/1mM2vnjsy7QfUmdVUpKqHRJjZyQobhTrW/view
task-description: Binary (misogyny/none) and Multi-class (none, discredit, derailing, dominance, stereotyping & objectification, threat of violence, sexual harassment, damning)
details-of-task: Introducing an Arabic Levantine Twitter dataset for Misogynistic language
size-of-dataset: 6603
percentage-abusive: 48.76
language: Arabic
level-of-annotation: ["Posts"]
platform: ["Twitter"]
medium: ["Text", "Images"]
reference: "Hala Mulki and Bilal Ghanem. 2021. Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 154–163, Kyiv, Ukraine (Virtual). Association for Computational Linguistics"
---
14
examples/turing/content/datasets/measuring-hate-speech.md
Normal file
@@ -0,0 +1,14 @@
---
title: Measuring Hate Speech
link-to-publication: https://arxiv.org/abs/2009.10277
link-to-data: https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech
task-description: 10 ordinal labels (sentiment, (dis)respect, insult, humiliation, inferior status, violence, dehumanization, genocide, attack/defense, hate speech), which are debiased and aggregated into a continuous hate speech severity score (hate_speech_score) that includes a region for counterspeech & supportive speech. Includes 8 target identity groups (race/ethnicity, religion, national origin/citizenship, gender, sexual orientation, age, disability, political ideology) and 42 identity subgroups.
details-of-task: Hate speech measurement on social media in English
size-of-dataset: "39,565 comments annotated by 7,912 annotators on 10 ordinal labels, for 1,355,560 total labels."
percentage-abusive: 25
language: English
level-of-annotation: ["Social media comment"]
platform: ["Twitter", "Reddit", "Youtube"]
medium: ["Text"]
reference: "Kennedy, C. J., Bacon, G., Sahn, A., & von Vacano, C. (2020). Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application. arXiv preprint arXiv:2009.10277."
---
@@ -0,0 +1,14 @@
---
title: Offensive Language and Hate Speech Detection for Danish
link-to-publication: http://www.derczynski.com/papers/danish_hsd.pdf
link-to-data: https://figshare.com/articles/Danish_Hate_Speech_Abusive_Language_data/12220805
task-description: "Branching structure of tasks: Binary (Offensive, Not), Within Offensive (Target, Not), Within Target (Individual, Group, Other)"
details-of-task: Group-directed + Person-directed
size-of-dataset: 3600
percentage-abusive: 0.12
language: Danish
level-of-annotation: ["Posts"]
platform: ["Twitter", "Reddit", "Newspaper comments"]
medium: ["Text"]
reference: "Sigurbergsson, G. and Derczynski, L., 2019. Offensive Language and Hate Speech Detection for Danish. ArXiv."
---
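Review note: the dataset entries added in this commit all carry the same twelve front-matter keys. Below is a minimal sketch of a check a contributor might run before opening a pull request. It rests on assumptions: `REQUIRED_KEYS` is inferred from the entries above, not from `templates/dataset.md` (which this diff does not show), and `read_front_matter` and `missing_keys` are hypothetical helper names. The parser only handles flat `key: value` lines between `---` fences; it is not a YAML parser.

```python
# Keys inferred from the dataset entries in this commit (an assumption,
# not the project's official schema).
REQUIRED_KEYS = [
    "title", "link-to-publication", "link-to-data", "task-description",
    "details-of-task", "size-of-dataset", "percentage-abusive", "language",
    "level-of-annotation", "platform", "medium", "reference",
]


def read_front_matter(text):
    """Return the flat key/value pairs found between the two '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with a '---' fence")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    raise ValueError("front matter is not closed with '---'")


def missing_keys(text):
    """List the required keys that are absent or empty in an entry."""
    fields = read_front_matter(text)
    return [key for key in REQUIRED_KEYS if not fields.get(key)]


if __name__ == "__main__":
    entry = "---\ntitle: Detecting Abusive Albanian\nlanguage: Albanian\n---\n"
    print(missing_keys(entry))
```

Run against a complete entry, `missing_keys` returns an empty list; for the abbreviated entry in the usage example it lists every key other than `title` and `language`.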
52
examples/turing/content/index.mdx
Normal file
@@ -0,0 +1,52 @@
---
title: Hate Speech Dataset Catalogue
---

This page catalogues datasets annotated for hate speech, online abuse, and offensive language. They may be useful, for example, for training a natural language processing system to detect such language.

The list is maintained by [Leon Derczynski](https://www.derczynski.com/), [Bertie Vidgen](https://www.turing.ac.uk/people/researchers/bertie-vidgen), [Hannah Rose Kirk](https://www.hannahrosekirk.com/), Pica Johansson, [Yi-Ling Chung](https://yilingchung.github.io/), Mads Guldborg Kjeldgaard Kongsbak, [Laila Sprejer](https://www.turing.ac.uk/people/researchers/laila-sprejer), and Philine Zeinert.

We provide a list of [datasets](#Datasets-header) and [keywords](#Keywords-header). If you would like to contribute to our catalogue or add your dataset, please see the [instructions for contributing](#Contributing-header).

If you use these resources, please cite (and read!) our paper: [Directions in Abusive Language Training Data: Garbage In, Garbage Out](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0243300). And if you would like to find other resources for researching online hate, visit The Alan Turing Institute's [Online Hate Research Hub](https://www.turing.ac.uk/research/research-programmes/public-policy/online-hate-research-hub) or read The Alan Turing Institute's [Reading List on Online Hate and Abuse Research](https://docs.google.com/document/d/1WVkVGp29Jt6d-4fBnZ5OWVYuFn_03rzz-KBqPsu6gTM/edit?usp=sharing).

If you're looking for a good paper on online hate training datasets (beyond our paper, of course!) then have a look at ['Resources and benchmark corpora for hate speech detection: a systematic review'](https://link.springer.com/article/10.1007/s10579-020-09502-8) by Poletto et al. in *Language Resources and Evaluation*.

Accompanying [data statements](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00041) are preferred for all corpora.

<a href="#Datasets-header" className="w-fit mx-auto no-underline rounded-md py-3 px-6 outline-offset-2 transition !active:transition-none bg-zinc-800 !font-semibold !text-zinc-100 hover:bg-zinc-700 active:bg-zinc-800 active:text-zinc-100/70 dark:bg-zinc-700 dark:hover:bg-zinc-600 !dark:active:bg-zinc-700 dark:active:text-zinc-100/70">See datasets</a>

<h2 id="Contributing-header">How to contribute</h2>

We accept entries to our catalogue based on pull requests to the content folder. The dataset must be available for download to be included in the list. If you want to add an entry, follow these steps!

Please send just one dataset addition/edit at a time - edit it in, then save. This will make everyone’s life easier (including yours!).

### Create file

Go to the repository page, click the "Add file" dropdown, and then click "Create new file".



### Choose location

On the following page, type `content/datasets/<name-of-the-file>.md` to add an entry to the datasets catalogue, or `content/keywords/<name-of-the-file>.md` to add an entry to the lists of abusive keywords. To add a static page, place the file in the root of `content`; it is automatically assigned a URL, e.g. `/content/about.md` becomes the `/about` page.



### Fill in content

Copy the contents of `templates/dataset.md` or `templates/keywords.md`, respectively, into the field below, and fill out the fields using the correct data format.



### Commit changes

Click "Commit changes". In the popup, give a brief description of the proposed change, and then click "Propose changes".

<img src='https://i.imgur.com/BxuxKEJ.png' style={{ maxWidth: '50%', margin: '0 auto' }}/>

### Submit PR

Submit the pull request on the next page when prompted.
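Review note: the "Choose location" step in `index.mdx` says a file placed under `content` is automatically assigned a URL. Here is a rough sketch of that mapping, assuming it simply drops the leading `content` directory and the file extension. Only the `/content/about.md` to `/about` case is stated in the text; the behaviour for nested folders is an assumption, and `content_path_to_url` is a hypothetical name, not the site's actual routing code.

```python
from pathlib import PurePosixPath


def content_path_to_url(path):
    """Map a file under content/ to the page URL it is assumed to produce."""
    relative = PurePosixPath(path.lstrip("/"))
    if not relative.parts or relative.parts[0] != "content":
        raise ValueError(f"{path!r} is not under content/")
    # Drop the leading "content" directory and the file extension.
    page = PurePosixPath(*relative.parts[1:]).with_suffix("")
    return "/" + page.as_posix()


print(content_path_to_url("/content/about.md"))  # /about
```

Under the same assumption, `content/keywords/hurtlex.md` would map to `/keywords/hurtlex`.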
10
examples/turing/content/keywords/hurtlex.md
Normal file
@@ -0,0 +1,10 @@
---
title: Hurtlex
description: HurtLex is a lexicon of offensive, aggressive, and hateful words in over 50 languages. The words are divided into 17 categories, plus a macro-category indicating whether there is stereotype involved.
data-link: https://github.com/valeriobasile/hurtlex
reference: http://ceur-ws.org/Vol-2253/paper49.pdf, Proc. CLiC-it 2018
---

## Markdown TEST

Some text
5
examples/turing/content/keywords/jiang-et-al.md
Normal file
@@ -0,0 +1,5 @@
---
title: SexHateLex is a Chinese lexicon of hateful and sexist words.
data-link: https://doi.org/10.5281/zenodo.4773875
reference: http://ceur-ws.org/Vol-2253/paper49.pdf, Journal of OSNEM, Vol.27, 2022, 100182, ISSN 2468-6964.
---