This project builds a machine learning pipeline to classify academic and research-related documents sourced from the Internet Archive. It includes stages for downloading raw data, processing and engineering features, training classifiers, and analyzing model results using advanced statistical tools.
The core of the project is organized into four Jupyter notebooks:
- `Download_and_Organize_files.ipynb`
  - Downloads HTML and PDF documents from sources like FatCat and GWB
  - Extracts and stores structured metadata
  - Handles encoding issues, language detection, and multiprocessing
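As an illustration of the encoding cleanup in the download notebook, the sketch below repairs common Latin-1/UTF-8 mojibake and maps the fix over a worker pool (a thread pool is used here for brevity; the notebook itself uses multiprocessing). `fix_mojibake` and the sample strings are hypothetical, not code from the notebook:

```python
from concurrent.futures import ThreadPoolExecutor

def fix_mojibake(text: str) -> str:
    """Repair text that was UTF-8 bytes mistakenly decoded as Latin-1,
    a common encoding issue in scraped HTML."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already clean text (or not a simple mojibake case) passes through.
        return text

docs = ["cafÃ©", "naÃ¯ve", "plain ascii"]
with ThreadPoolExecutor(max_workers=4) as pool:
    cleaned = list(pool.map(fix_mojibake, docs))
# cleaned == ["café", "naïve", "plain ascii"]
```

In practice the project leans on `ftfy` for this, which handles far more mojibake patterns than the round-trip trick above.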
- `Pipeline_and_Feature_Engineering.ipynb`
  - Parses text from documents
  - Extracts lexical, structural, and contextual features
  - Supports tokenization, translation, and domain extraction
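The domain-extraction step can be sketched with only the standard library (the project itself lists `tldextract`, which does proper public-suffix handling); `url_features` is a hypothetical helper for illustration:

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Derive simple domain-based features from a document URL.
    Naive stand-in for tldextract-based extraction."""
    host = urlparse(url).netloc.lower()
    parts = host.split(".")
    return {
        "domain": host,
        "suffix": parts[-1] if parts else "",
        # Crude academic-domain heuristic for illustration only
        "is_academic": parts[-1] == "edu" or ".ac." in host,
    }

url_features("https://arxiv.org/abs/1234.5678")
# → {"domain": "arxiv.org", "suffix": "org", "is_academic": False}
```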
- `Modelling_and_Evaluation.ipynb`
  - Trains classifiers (e.g., SVM, Random Forest, XGBoost, LDA/QDA)
  - Performs hyperparameter tuning with GridSearchCV
  - Evaluates models using confusion matrices, accuracy, and cross-validation
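The tuning step looks roughly like the sketch below, shown on synthetic data; the SVM grid is illustrative and not the notebook's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the engineered document features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid; the notebook's parameter ranges may differ
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

`GridSearchCV` cross-validates every parameter combination and refits the best model on the full training set, so `grid` can be used directly for prediction afterwards.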
- `BMA_Analysis.ipynb`
  - Applies Bayesian Model Averaging (using PyMC3) for deeper evaluation
  - Visualizes uncertainty and feature contributions
  - Summarizes statistical insights with ArviZ
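The averaging idea can be illustrated without PyMC3: weight each fitted model by the exponential of its held-out log-likelihood (a crude stand-in for the posterior model probabilities a full PyMC3/ArviZ analysis would estimate) and blend the predicted class probabilities. Everything below is synthetic and for illustration only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

models = [LogisticRegression(max_iter=1000),
          LinearDiscriminantAnalysis(),
          QuadraticDiscriminantAnalysis()]

# Held-out log-likelihood of each model
loglik = []
for m in models:
    m.fit(X_tr, y_tr)
    p_true = m.predict_proba(X_val)[np.arange(len(y_val)), y_val]
    loglik.append(np.log(np.clip(p_true, 1e-12, 1.0)).sum())

# Normalize into model weights (subtracting the max for stability)
w = np.exp(np.array(loglik) - max(loglik))
w /= w.sum()

# Model-averaged predictive probabilities
avg_proba = sum(wi * m.predict_proba(X_val) for wi, m in zip(w, models))
```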
Install the dependencies, launch Jupyter, and run the notebooks in order:

    pip install -r requirements.txt
    jupyter notebook

- `Download_and_Organize_files.ipynb`
- `Pipeline_and_Feature_Engineering.ipynb`
- `Modelling_and_Evaluation.ipynb`
- `BMA_Analysis.ipynb`
Key dependencies:
- pandas, numpy, matplotlib, seaborn
- scikit-learn, xgboost, statsmodels
- nltk, langdetect, tldextract
- pymc3, arviz, theano
- bs4, pdfminer.six, requests, ftfy, selectolax
- google_trans_new, mtranslate
See requirements.txt for details.
Authors:
- John McNulty
- Sarai Alvarez
- Michael Langmayr
The classifiers draw on features and techniques such as:
- Text structure and formatting patterns (e.g., word count, punctuation)
- Language detection and translation
- Metadata such as URL domains
- Statistical modeling with PyMC3 and Bayesian averaging
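The text-structure features above can be computed with a few lines of standard-library Python; `lexical_features` is a hypothetical helper, not code from the notebooks:

```python
import re
import string

def lexical_features(text: str) -> dict:
    """Compute simple word-count and punctuation features of a document."""
    words = re.findall(r"\w+", text)
    puncts = sum(ch in string.punctuation for ch in text)
    return {
        "word_count": len(words),
        "punct_count": puncts,
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "punct_ratio": puncts / max(len(text), 1),
    }

feats = lexical_features("A study of HTML, PDF, and metadata.")
# feats["word_count"] == 7, feats["punct_count"] == 3
```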
This project is licensed under the MIT License.