This project builds a machine learning pipeline to classify academic and research-related documents sourced from the Internet Archive. It includes stages for downloading raw data, processing and engineering features, training classifiers, and analyzing model results using advanced statistical tools.
The core of the project is organized into four Jupyter notebooks:
- `Download_and_Organize_files.ipynb`
  - Downloads HTML and PDF documents from sources like FatCat and GWB
  - Extracts and stores structured metadata
  - Handles encoding issues, language detection, and multiprocessing
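As an illustration of the encoding cleanup in the download notebook, the sketch below repairs common Latin-1/UTF-8 mojibake and maps the fix over a worker pool (a thread pool is used here for brevity; the notebook itself uses multiprocessing). `fix_mojibake` and the sample strings are hypothetical, not code from the notebook:

```python
from concurrent.futures import ThreadPoolExecutor

def fix_mojibake(text: str) -> str:
    """Repair text that was UTF-8 bytes mistakenly decoded as Latin-1,
    a common encoding issue in scraped HTML."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already clean text (or not a simple mojibake case) passes through.
        return text

docs = ["cafÃ©", "naÃ¯ve", "plain ascii"]
with ThreadPoolExecutor(max_workers=4) as pool:
    cleaned = list(pool.map(fix_mojibake, docs))
# cleaned == ["café", "naïve", "plain ascii"]
```

In practice the project leans on `ftfy` for this, which handles far more mojibake patterns than the round-trip trick above.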
- `Pipeline_and_Feature_Engineering.ipynb`
  - Parses text from documents
  - Extracts lexical, structural, and contextual features
  - Supports tokenization, translation, and domain extraction
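The domain-extraction step can be sketched with only the standard library (the project itself lists `tldextract`, which does proper public-suffix handling); `url_features` is a hypothetical helper for illustration:

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Derive simple domain-based features from a document URL.
    Naive stand-in for tldextract-based extraction."""
    host = urlparse(url).netloc.lower()
    parts = host.split(".")
    return {
        "domain": host,
        "suffix": parts[-1] if parts else "",
        # Crude academic-domain heuristic for illustration only
        "is_academic": parts[-1] == "edu" or ".ac." in host,
    }

url_features("https://arxiv.org/abs/1234.5678")
# → {"domain": "arxiv.org", "suffix": "org", "is_academic": False}
```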
- `Modelling_and_Evaluation.ipynb`
  - Trains classifiers (e.g., SVM, Random Forest, XGBoost, LDA/QDA)
  - Performs hyperparameter tuning with GridSearchCV
  - Evaluates models using confusion matrices, accuracy, and cross-validation
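The tuning step looks roughly like the sketch below, shown on synthetic data; the SVM grid is illustrative and not the notebook's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the engineered document features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid; the notebook's parameter ranges may differ
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

`GridSearchCV` cross-validates every parameter combination and refits the best model on the full training set, so `grid` can be used directly for prediction afterwards.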
- `BMA_Analysis.ipynb`
  - Applies Bayesian Model Averaging (using PyMC3) for deeper evaluation
  - Visualizes uncertainty and feature contributions
  - Summarizes statistical insights with ArviZ
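The averaging idea can be illustrated without PyMC3: weight each fitted model by the exponential of its held-out log-likelihood (a crude stand-in for the posterior model probabilities a full PyMC3/ArviZ analysis would estimate) and blend the predicted class probabilities. Everything below is synthetic and for illustration only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

models = [LogisticRegression(max_iter=1000),
          LinearDiscriminantAnalysis(),
          QuadraticDiscriminantAnalysis()]

# Held-out log-likelihood of each model
loglik = []
for m in models:
    m.fit(X_tr, y_tr)
    p_true = m.predict_proba(X_val)[np.arange(len(y_val)), y_val]
    loglik.append(np.log(np.clip(p_true, 1e-12, 1.0)).sum())

# Normalize into model weights (subtracting the max for stability)
w = np.exp(np.array(loglik) - max(loglik))
w /= w.sum()

# Model-averaged predictive probabilities
avg_proba = sum(wi * m.predict_proba(X_val) for wi, m in zip(w, models))
```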
Install the dependencies, launch Jupyter, and run the notebooks in order:

    pip install -r requirements.txt
    jupyter notebook

- `Download_and_Organize_files.ipynb`
- `Pipeline_and_Feature_Engineering.ipynb`
- `Modelling_and_Evaluation.ipynb`
- `BMA_Analysis.ipynb`
Key dependencies:
- pandas, numpy, matplotlib, seaborn
- scikit-learn, xgboost, statsmodels
- nltk, langdetect, tldextract
- pymc3, arviz, theano
- bs4, pdfminer.six, requests, ftfy, selectolax
- google_trans_new, mtranslate
See requirements.txt for details.
Authors:
- John McNulty
- Sarai Alvarez
- Michael Langmayr
The classifiers draw on features and techniques such as:
- Text structure and formatting patterns (e.g., word count, punctuation)
- Language detection and translation
- Metadata such as URL domains
- Statistical modeling with PyMC3 and Bayesian averaging
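The text-structure features above can be computed with a few lines of standard-library Python; `lexical_features` is a hypothetical helper, not code from the notebooks:

```python
import re
import string

def lexical_features(text: str) -> dict:
    """Compute simple word-count and punctuation features of a document."""
    words = re.findall(r"\w+", text)
    puncts = sum(ch in string.punctuation for ch in text)
    return {
        "word_count": len(words),
        "punct_count": puncts,
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "punct_ratio": puncts / max(len(text), 1),
    }

feats = lexical_features("A study of HTML, PDF, and metadata.")
# feats["word_count"] == 7, feats["punct_count"] == 3
```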
This project is licensed under the MIT License.