This is the companion repository for my bachelor's thesis, "Anwendung von Natural Language Processing (NLP) zur Erkennung von Hassrede in Social Media Daten" (engl. "Applying NLP Techniques to Detect Hate Speech in Social Media Data").
This project uses Anaconda for dependency management and for creating a virtual environment. Please make sure conda version 23.9.* or later is installed (you can check with `conda --version`).
To create and activate the virtual environment, execute the following commands:

```bash
conda env create -f ./env.yml
conda activate hatespeech-detection-NLP
```
Note: If the dependencies specified in `env.yml` change, you will need to update the environment manually by running:

```bash
conda env update --file ./env.yml
```
The repository has been restructured. In `1.Pre-Processing.ipynb`, select the dataset by uncommenting it (and commenting out the others) in cell `In[3]`.
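A minimal sketch of what such a selection cell might look like; the variable name `DATASET_PATH` is hypothetical, and the actual cell in the notebook may differ:

```python
# In[3] of 1.Pre-Processing.ipynb: keep exactly one dataset uncommented.
# DATASET_PATH is a hypothetical name used here for illustration only.
DATASET_PATH = "data/combined/train_toxic_tweets_dataset.csv"  # experiment-1
# DATASET_PATH = "data/combined/train_dynamically_generated_hate_dataset.csv"  # experiment-2
```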
If you want to keep the results permanently, create a directory (e.g. `experiment-x`) and copy the already executed notebooks into it.
It is highly recommended to use a GPU with CUDA support for training the LSTM and BERT models. CPU-based training is strongly discouraged, as training and prediction can take multiple hours to complete.
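Before starting a long training run, you can verify that a CUDA device is visible. The snippet below assumes PyTorch is among the dependencies in `env.yml`; if the notebooks use TensorFlow instead, the alternative noted in the comment applies:

```python
import torch

# Assumption: the environment provides PyTorch. If it provides TensorFlow
# instead, use tf.config.list_physical_devices("GPU") for the same check.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```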
To execute the respective notebooks, download the corresponding datasets, rename them as listed below, and move them to the respective directory. (The files were too large to commit to Git.)
Name | Renamed File | Result Directory | Further Info |
---|---|---|---|
Toxic Tweets Dataset | `data/combined/train_toxic_tweets_dataset.csv` | `experiment-1` | |
Dynamically Generated Hate Speech Dataset | `data/combined/train_dynamically_generated_hate_dataset.csv` | `experiment-2` | Official Paper |
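As a quick sanity check after renaming and moving a dataset, you can load it with pandas; since the column layout is dataset-specific, this sketch only prints the shape and the first rows:

```python
import pandas as pd

# Path taken from the table above; adjust for the dataset you downloaded.
df = pd.read_csv("data/combined/train_toxic_tweets_dataset.csv")
print(df.shape)   # (rows, columns)
print(df.head())  # verify the file was renamed and placed correctly
```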
The following notebooks are included in this repository. Please run `1.Pre-Processing.ipynb` first to pre-process the required data. The necessary download links are provided in the notebook.
Note: The data will be stored in a pickle file, which can then be used across the different notebooks. The data is also balanced in each notebook; therefore `random_state` is set in `lib/constants.py` to ensure the same training/validation data is used for the different models.
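A minimal sketch of how a notebook might consume the pickle and reuse the shared seed. The constant name `RANDOM_STATE`, its value, and the `test_size` are assumptions for illustration; the actual name and value live in `lib/constants.py`, and the pickle may hold more than one dataframe:

```python
import pickle
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42  # assumption: in the repo this is imported from lib/constants.py

# Pickle produced by 1.Pre-Processing.ipynb (path from the repository tree).
with open("data/processed/combined_dataframes.pickle", "rb") as f:
    df = pickle.load(f)  # assumed here to be a single dataframe

# Using the same random_state in every notebook yields the identical
# training/validation split, so model results remain comparable.
train_df, val_df = train_test_split(df, test_size=0.2, random_state=RANDOM_STATE)
```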
Name | Description | Info |
---|---|---|
`1.Pre-Processing.ipynb` | Pre-processes the datasets and saves them to a pickle file. | Execute first |
`2.EDA.ipynb` | Exploratory data analysis of the datasets. | Optional |
`3.ML.ipynb` | Detects hate speech using various classical ML models. | Done |
`4.LSTM.ipynb` | Detects hate speech using an LSTM neural network. | Done |
`5.BERT.ipynb` | Detects hate speech using a BERT model. | Done |
The repository is structured as follows:

```
.
├── 1.Pre-Processing.ipynb
├── 2.EDA.ipynb
├── 3.ML.ipynb
├── 4.LSTM.ipynb
├── 5.BERT.ipynb
├── env.yml
├── LICENSE
├── README.md
├── TODO.md
├── data
│   ├── combined
│   │   ├── train_dynamically_generated_hate_dataset.csv
│   │   └── train_toxic_tweets_dataset.csv
│   └── processed
│       └── combined_dataframes.pickle
├── experiment-1
│   ├── 1.Pre-Processing.ipynb
│   ├── 2.EDA.ipynb
│   ├── 3.ML.ipynb
│   ├── 4.LSTM.ipynb
│   └── 5.BERT.ipynb
├── experiment-2
│   ├── 1.Pre-Processing.ipynb
│   ├── 2.EDA.ipynb
│   ├── 3.ML.ipynb
│   ├── 4.LSTM.ipynb
│   └── 5.BERT.ipynb
└── lib
    ├── __init__.py
    ├── constants.py
    ├── data_balancing.py
    ├── evaluation.py
    ├── preprocessing_pipeline.py
    ├── preprocessing_utils.py
    └── resampled_df_sanity.py
```