- Introduction
- Motivation
- Dataset
- Methodology
- Project Structure
- Notebooks Overview and Links
- Dual BERT Encoder Architecture Diagram
This project, developed as part of our NLP module, focuses on the automatic evaluation of essays written for IELTS (International English Language Testing System) Writing Tasks. The primary goals are to predict essay scores across various criteria and to categorize essays based on their respective prompts.
- Core Objective: We aim to develop a system that can automatically score IELTS essays and assign them to categories based on the essay prompt.
- Driving Force: Our motivation is to provide students preparing for the IELTS test with a tool to get preliminary feedback on their essays, helping them identify areas for improvement.
- Key Questions: We seek to answer: "How good is my essay?" and "Does my essay align with others that address the same prompt?"
We use the pre-existing IELTS Writing Task 2 Evaluation dataset from Hugging Face. For each prompt, the dataset provides the corresponding essay response along with evaluation scores broken down into:
- Task Achievement
- Coherence and Cohesion
- Lexical Resource
- Grammatical Range and Accuracy
These individual scores contribute to an Overall Band Score.
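As a quick illustration of how the sub-scores relate to the overall band (the rounding rule below is the publicly documented IELTS convention — average the four criteria and round to the nearest half band, with averages ending in .25 or .75 rounding up — not something read from the dataset itself):

```python
import math

def overall_band(ta, cc, lr, gra):
    """Average the four criterion scores and round to the nearest
    half band; averages ending in .25/.75 round up (IELTS convention)."""
    avg = (ta + cc + lr + gra) / 4
    return math.floor(avg * 2 + 0.5) / 2
```

For example, sub-scores of 6.5, 6.0, 6.0, 6.5 average to 6.25 and yield an overall band of 6.5.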
Our project approach is structured as follows:
- Data Cleaning: Preprocessing the raw essay data.
- Statistical Analysis: Performing exploratory data analysis to gain insights into the dataset.
- Baseline Model Training: Training conventional machine learning models to establish benchmark performance.
- BERT Model Training: Fine-tuning BERT-based transformer models for essay scoring.
- Clustering Implementation: Developing clustering mechanisms to group similar essays.
Everything is runnable with Python 3.10.
To install the requirements, run `pip install -r requirements.txt`.
We plan to train the following types of models:
- For Automated Essay Scoring (AES):
- Conventional Models (Baselines): Linear Regression, Logistic Regression, Support Vector Machines (SVM), K-Nearest Neighbors (KNN).
- Transformer-based Models: BERT, EuroBERT (for regression or classification tasks on scores).
- For Essay Clustering:
- K-Means Clustering
- Hierarchical Clustering
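The two model families can be sketched on toy data as follows (the essays and scores are invented, and TF-IDF is an illustrative feature choice, not necessarily the project's actual pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

essays = [
    "Technology improves access to education for students everywhere",
    "Classroom technology helps students engage with their education",
    "Governments should expand public transport to reduce congestion",
    "Better public transport means less congestion in large cities",
]
bands = [6.0, 6.5, 7.0, 5.5]  # invented Overall Band Scores

# Shared TF-IDF features for both the regressor and the clusterer
X = TfidfVectorizer().fit_transform(essays)

# Baseline AES: treat the band score as a regression target
reg = LinearRegression().fit(X, bands)
preds = reg.predict(X)

# Essay clustering: group essays, ideally recovering the prompt topics
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```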
To uncover initial insights and correlations within the data, we intend to perform the following statistical analyses:
- Investigate the correlation between text length and band score.
- Analyze the influence of word diversity on the evaluation criteria.
- Determine the vocabulary distribution for each essay prompt.
- Examine whether the demands on the writer differ across the sub-score categories (Task Achievement, Coherence, Lexical Resource, Grammar).
- Compare the perceived difficulty of different essay prompts.
- Explore the correlation between specific words/phrases and the Overall Band Score.
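The first two analyses can be sketched in a few lines (toy essays and invented scores; the type-token ratio stands in for word diversity, though the notebooks may use a different measure):

```python
import numpy as np

essays = [
    "the cat sat on the mat",
    "students often struggle with academic writing tasks",
    "governments should fund education because education benefits society",
    "modern technology has transformed how people communicate and learn worldwide",
]
bands = [5.0, 6.0, 6.5, 7.0]  # invented Overall Band Scores

lengths = [len(e.split()) for e in essays]
# Type-token ratio: unique words / total words, a simple diversity measure
ttrs = [len(set(e.split())) / len(e.split()) for e in essays]

# Pearson correlation of each feature with the Overall Band Score
r_length = np.corrcoef(lengths, bands)[0, 1]
r_diversity = np.corrcoef(ttrs, bands)[0, 1]
```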
This section provides direct links to the Jupyter notebooks used in this project.
- Data Cleaning - Initial data loading, cleaning, and preprocessing.
- Data Exploration - General exploration of the dataset features.
- Prompt Binning/Categorization - Analysis and grouping of essay prompts.
These notebooks correspond to the statistical analyses outlined in the methodology:
- Analysis 1 (Length vs. Score)
- Analysis 2 (Word Diversity vs. Criteria)
- Analysis 3 (Vocabulary per Prompt)
- Analysis 4 (Sub-score Demands)
- Analysis 5 (Prompt Difficulty)
- Analysis 6 (Specific Words vs. Score)
- Linear Regression - Baseline regression model
- Logistic Regression - Baseline classification model
- Support Vector Machines (SVM) - Baseline classification model
- K-Nearest Neighbors (KNN) - Baseline classification model
- BERT - Multiple BERT model architectures
- EuroBERT - A BERT-style model with a larger context window
- K-Means Clustering - Clustering of the essays and comparison with the clustered prompts
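The twin-encoder configuration in the results table can be sketched roughly as follows. Tiny randomly initialised `BertModel`s stand in for the fine-tuned checkpoints, and the single cross-attention layer plus linear regression head are assumptions about the architecture, not its confirmed details:

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

class TwinBertRegressor(nn.Module):
    """Prompt encoder + essay encoder, fused by cross-attention."""
    def __init__(self, hidden=64):
        super().__init__()
        cfg = BertConfig(hidden_size=hidden, num_hidden_layers=2,
                         num_attention_heads=4, intermediate_size=128,
                         vocab_size=1000)
        self.prompt_enc = BertModel(cfg)
        self.essay_enc = BertModel(cfg)
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4,
                                                batch_first=True)
        self.head = nn.Linear(hidden, 1)  # predicts the band score

    def forward(self, prompt_ids, essay_ids):
        p = self.prompt_enc(prompt_ids).last_hidden_state
        e = self.essay_enc(essay_ids).last_hidden_state
        # Essay tokens attend to the prompt representation
        fused, _ = self.cross_attn(query=e, key=p, value=p)
        return self.head(fused[:, 0]).squeeze(-1)  # CLS position

model = TwinBertRegressor()
score = model(torch.randint(0, 1000, (2, 8)),    # batch of 2 prompts
              torch.randint(0, 1000, (2, 16)))   # batch of 2 essays
```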
| Model | Training Type | Method/Configuration | Accuracy |
|---|---|---|---|
| Linear Regression | Regression | - | 53.30% |
| Logistic Regression | Classification | - | 59.91% |
| SVM | Classification | - | 59.47% |
| KNN | Classification | - | 64.0% |
| Basic BERT | Classification | Pooling hidden states | 27.75% |
| Basic BERT | Classification | CLS token | 32.60% |
| Basic BERT | Regression | CLS token | 57.50% |
| Twin BERT Encoder | Regression | CLS token appended | 61.23% |
| Twin BERT Encoder | Regression | Cross-attention | 80.83% |
| EuroBERT | Regression | - | 53.0% |
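For the regression rows, "Accuracy" presumably means the continuous prediction matches the true band after rounding to the nearest half band (this evaluation protocol is an assumption; the notebooks define the exact metric):

```python
def half_band_accuracy(preds, targets):
    """Fraction of predictions that hit the exact band after rounding
    to the nearest 0.5 (IELTS bands come in half-band steps)."""
    hits = [round(p * 2) / 2 == t for p, t in zip(preds, targets)]
    return sum(hits) / len(hits)
```

For instance, a predicted 6.31 rounds to 6.5 and counts as a hit against a true band of 6.5, while a predicted 6.1 rounds to 6.0 and counts as a miss.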
