ifTerzic/hatespeech-detection-NLP

Hatespeech Detection NLP

This is the repository accompanying my bachelor thesis "Anwendung von Natural Language Processing (NLP) zur Erkennung von Hassrede in Social Media Daten" (engl. "Applying NLP techniques to detect hate speech in social media data").

Setup

This project uses Anaconda for dependency management and for creating a virtual environment. Please make sure conda version 23.9.* or later is installed.

Linux & macOS

To create the virtual environment, please execute the following set of instructions.

```shell
conda env create -f ./env.yml
conda activate hatespeech-detection-NLP
```

Note: If the dependencies specified in env.yml change, you will need to update them manually by running the following command.

```shell
conda env update --file ./env.yml
```

Important Notice

The repository has been restructured. In 1.Pre-Processing.ipynb, select the dataset by commenting/uncommenting the respective lines in cell In[3]. If you want to keep the results permanently, create a directory (e.g. experiment-x) and copy the already-executed notebooks into it.
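The selection cell could look like the following sketch. The variable name `DATASET_PATH` is a placeholder for illustration, not the notebook's actual code:

```python
# Hypothetical sketch of the dataset-selection cell In[3];
# the variable name is a placeholder, not the notebook's actual code.

# Comment in exactly one dataset:
DATASET_PATH = "data/combined/train_toxic_tweets_dataset.csv"  # experiment-1
# DATASET_PATH = "data/combined/train_dynamically_generated_hate_dataset.csv"  # experiment-2

print(f"Using dataset: {DATASET_PATH}")
```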

Further Information

It is highly recommended to use a GPU with CUDA support for training the LSTM and BERT models. I strongly discourage CPU-based training, as training and prediction then take multiple hours to complete.
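A quick way to verify that CUDA is usable before starting a long training run, assuming the notebooks use PyTorch (the README does not state the backend explicitly; adapt accordingly if they use TensorFlow):

```python
# Hedged sketch: assumes PyTorch is the deep-learning backend, which this
# README does not state explicitly.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"  # PyTorch not installed; only CPU training would be possible
print(f"Training device: {device}")
```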

Datasets

To execute the respective notebooks, download the corresponding datasets, rename them, and move them to the respective directories. (The files were too large to commit to Git.)

| Name | Renamed File | Result Directory | Further Info |
| --- | --- | --- | --- |
| Toxic Tweets Dataset | data/combined/train_toxic_tweets_dataset.csv | experiment-1 | |
| Dynamically Generated Hate Speech Dataset | data/combined/train_dynamically_generated_hate_dataset.csv | experiment-2 | Official Paper |

Notebooks

The following notebooks are included in this repository. Please run 1.Pre-Processing.ipynb first to pre-process the required data. The necessary download links are provided in the notebook.

Note: The data is stored in a pickle file, which can then be reused across the different notebooks. Balancing of the data is performed in each notebook; therefore, a fixed random_state is set in lib/constants.py to ensure the same training/validation data is used for the different models.
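The effect of a fixed random_state can be sketched with plain Python. The seed value 42 and the file handling below are illustrative only; the project's actual constant lives in lib/constants.py:

```python
import os
import pickle
import random
import tempfile

RANDOM_STATE = 42  # illustrative value; the real seed lives in lib/constants.py

# Re-seeding before each balancing step yields identical samples, so every
# notebook sees the same training/validation data.
rows = list(range(100))
random.seed(RANDOM_STATE)
sample_a = random.sample(rows, 10)
random.seed(RANDOM_STATE)
sample_b = random.sample(rows, 10)
assert sample_a == sample_b  # identical across notebooks

# The pre-processing notebook pickles the processed data once ...
path = os.path.join(tempfile.mkdtemp(), "combined_dataframes.pickle")
with open(path, "wb") as f:
    pickle.dump(sample_a, f)

# ... and downstream notebooks simply reload it.
with open(path, "rb") as f:
    reloaded = pickle.load(f)
assert reloaded == sample_a
```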

| Name | Description | Info |
| --- | --- | --- |
| 1.Pre-Processing.ipynb | Pre-processes the datasets and saves them to a pickle file. | Execute first |
| 2.EDA.ipynb | Exploratory data analysis of the datasets. | Optional |
| 3.ML.ipynb | Notebook for detecting hate speech using various ML models. | Done |
| 4.LSTM.ipynb | Notebook for detecting hate speech using an LSTM neural network. | Done |
| 5.BERT.ipynb | Notebook for detecting hate speech using a BERT model. | Done |

Project structure

```
.
├── 1.Pre-Processing.ipynb
├── 2.EDA.ipynb
├── 3.ML.ipynb
├── 4.LSTM.ipynb
├── 5.BERT.ipynb
├── env.yml
├── LICENSE
├── README.md
├── TODO.md
├── data
│   ├── combined
│   │   ├── train_dynamically_generated_hate_dataset.csv
│   │   └── train_toxic_tweets_dataset.csv
│   └── processed
│       └── combined_dataframes.pickle
├── experiment-1
│   ├── 1.Pre-Processing.ipynb
│   ├── 2.EDA.ipynb
│   ├── 3.ML.ipynb
│   ├── 4.LSTM.ipynb
│   └── 5.BERT.ipynb
├── experiment-2
│   ├── 1.Pre-Processing.ipynb
│   ├── 2.EDA.ipynb
│   ├── 3.ML.ipynb
│   ├── 4.LSTM.ipynb
│   └── 5.BERT.ipynb
└── lib
    ├── __init__.py
    ├── constants.py
    ├── data_balancing.py
    ├── evaluation.py
    ├── preprocessing_pipeline.py
    ├── preprocessing_utils.py
    └── resampled_df_sanity.py
```
