This is the companion repository for my bachelor's thesis, "Anwendung von Natural Language Processing (NLP) zur Erkennung von Hassrede in Social Media Daten" (engl. "Applying NLP Techniques to Detect Hate Speech in Social Media Data").
This project uses Anaconda for dependency management and for creating a virtual environment. Please make sure conda version 23.9.* or later is installed (you can check with `conda --version`).
To create and activate the virtual environment, execute the following commands:

```bash
conda env create -f ./env.yml
conda activate hatespeech-detection-NLP
```
Note: If the dependencies specified in `env.yml` change, you will need to update the environment manually by running:

```bash
conda env update --file ./env.yml
```
The repository has been restructured. In `1.Pre-Processing.ipynb`, select the dataset by uncommenting it (and commenting out the others) in cell `In[3]`.
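A minimal sketch of what such a selection cell might look like; the variable name `DATASET_PATH` is hypothetical, and the actual cell in the notebook may differ:

```python
# In[3] of 1.Pre-Processing.ipynb: keep exactly one dataset uncommented.
# DATASET_PATH is a hypothetical name used here for illustration only.
DATASET_PATH = "data/combined/train_toxic_tweets_dataset.csv"  # experiment-1
# DATASET_PATH = "data/combined/train_dynamically_generated_hate_dataset.csv"  # experiment-2
```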
If you want to keep the results permanently, create a directory (e.g. `experiment-x`) and copy the already executed notebooks into it.
It is highly recommended to use a GPU with CUDA support for training the LSTM and BERT models. CPU-based training is strongly discouraged, as training and prediction can take multiple hours to complete.
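Before starting a long training run, you can verify that a CUDA device is visible. The snippet below assumes PyTorch is among the dependencies in `env.yml`; if the notebooks use TensorFlow instead, the alternative noted in the comment applies:

```python
import torch

# Assumption: the environment provides PyTorch. If it provides TensorFlow
# instead, use tf.config.list_physical_devices("GPU") for the same check.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```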
To execute the respective notebooks, download the corresponding datasets, rename them as listed below, and move them to the respective directory. (The files were too large to commit to Git.)
Name | Renamed File | Result Directory | Further Info |
---|---|---|---|
Toxic Tweets Dataset | `data/combined/train_toxic_tweets_dataset.csv` | `experiment-1` | |
Dynamically Generated Hate Speech Dataset | `data/combined/train_dynamically_generated_hate_dataset.csv` | `experiment-2` | Official Paper |
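As a quick sanity check after renaming and moving a dataset, you can load it with pandas; since the column layout is dataset-specific, this sketch only prints the shape and the first rows:

```python
import pandas as pd

# Path taken from the table above; adjust for the dataset you downloaded.
df = pd.read_csv("data/combined/train_toxic_tweets_dataset.csv")
print(df.shape)   # (rows, columns)
print(df.head())  # verify the file was renamed and placed correctly
```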
The following notebooks are included in this repository. Please run `1.Pre-Processing.ipynb` first to pre-process the required data. The necessary download links are provided in the notebook.
Note: The data will be stored in a pickle file, which can then be used across the different notebooks. The data is also balanced in each notebook; therefore `random_state` is set in `lib/constants.py` to ensure the same training/validation data is used for the different models.
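A minimal sketch of how a notebook might consume the pickle and reuse the shared seed. The constant name `RANDOM_STATE`, its value, and the `test_size` are assumptions for illustration; the actual name and value live in `lib/constants.py`, and the pickle may hold more than one dataframe:

```python
import pickle
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42  # assumption: in the repo this is imported from lib/constants.py

# Pickle produced by 1.Pre-Processing.ipynb (path from the repository tree).
with open("data/processed/combined_dataframes.pickle", "rb") as f:
    df = pickle.load(f)  # assumed here to be a single dataframe

# Using the same random_state in every notebook yields the identical
# training/validation split, so model results remain comparable.
train_df, val_df = train_test_split(df, test_size=0.2, random_state=RANDOM_STATE)
```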
Name | Description | Info |
---|---|---|
`1.Pre-Processing.ipynb` | Pre-processes the datasets and saves them to a pickle file. | Execute first |
`2.EDA.ipynb` | Exploratory data analysis of the datasets. | Optional |
`3.ML.ipynb` | Detects hate speech using various classical ML models. | Done |
`4.LSTM.ipynb` | Detects hate speech using an LSTM neural network. | Done |
`5.BERT.ipynb` | Detects hate speech using a BERT model. | Done |
The repository is structured as follows:

```
.
├── 1.Pre-Processing.ipynb
├── 2.EDA.ipynb
├── 3.ML.ipynb
├── 4.LSTM.ipynb
├── 5.BERT.ipynb
├── env.yml
├── LICENSE
├── README.md
├── TODO.md
├── data
│   ├── combined
│   │   ├── train_dynamically_generated_hate_dataset.csv
│   │   └── train_toxic_tweets_dataset.csv
│   └── processed
│       └── combined_dataframes.pickle
├── experiment-1
│   ├── 1.Pre-Processing.ipynb
│   ├── 2.EDA.ipynb
│   ├── 3.ML.ipynb
│   ├── 4.LSTM.ipynb
│   └── 5.BERT.ipynb
├── experiment-2
│   ├── 1.Pre-Processing.ipynb
│   ├── 2.EDA.ipynb
│   ├── 3.ML.ipynb
│   ├── 4.LSTM.ipynb
│   └── 5.BERT.ipynb
└── lib
    ├── __init__.py
    ├── constants.py
    ├── data_balancing.py
    ├── evaluation.py
    ├── preprocessing_pipeline.py
    ├── preprocessing_utils.py
    └── resampled_df_sanity.py
```