aimat-lab/materials_concepts

Predicting the future of materials science with AI.
Installation

Install micromamba for dependency management.

Tip

Adding the alias mm="micromamba" to your .bashrc or .zshrc shortens the commands below.

Create a new environment from environment.yml: $ mm create -n materials-concepts -f environment.yml

Note

If changes were made to environment.yml, update the environment: $ mm install -f environment.yml

Activate the environment: $ mm activate materials-concepts

Install the local package in editable mode: $ pip install --no-build-isolation --no-deps --disable-pip-version-check -e .

Dataset Creation

Create data/ folder

Create a data/ top-level folder and a table/ subfolder inside it to store the dataset.

Data Fetching

$ python materials_concepts/dataset/downloader/download_sources.py --query 'materials science' --out data/materials-science.sources.csv

This will create a data/materials-science.sources.csv file listing all the sources.

Fetch works from single source:

$ python materials_concepts/dataset/downloader/download_works.py fetchsingle --source S82336448 --out data/table/S82336448.works.csv

This will create a data/table/S82336448.works.csv file with all the works belonging to that source.

Fetch works from all sources:

$ python materials_concepts/dataset/downloader/download_works.py fetchall --sources data/materials-science.sources.csv --out data/table/materials-science.works.csv

During fetching, a {source}.csv file is created in a cache directory for each source, listing all the works belonging to that source. After downloading, these files are merged automatically into the single file given by --out. If the download gets interrupted, the per-source files serve as a cache: re-run the script and it will automatically skip any source whose data was already fetched.
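
The resume-from-cache behavior boils down to something like the following sketch (the cache directory, the id column name, and the fetch_works helper are hypothetical stand-ins, not the script's actual internals):

from pathlib import Path
import pandas as pd

CACHE_DIR = Path("cache")  # assumed cache location
OUT = Path("data/table/materials-science.works.csv")

sources = pd.read_csv("data/materials-science.sources.csv")
for source_id in sources["id"]:  # "id" column name is an assumption
    cached = CACHE_DIR / f"{source_id}.csv"
    if cached.exists():  # already fetched, so skipped on re-run
        continue
    fetch_works(source_id).to_csv(cached, index=False)  # hypothetical helper

# After all sources are fetched, merge the per-source files into --out.
pd.concat(pd.read_csv(p) for p in CACHE_DIR.glob("*.csv")).to_csv(OUT, index=False)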

Data Filtering

Filter the data to improve its quality:

$ python materials_concepts/dataset/filtering/filter_works.py --source data/table/materials-science.works.csv --out data/table/materials-science.filtered.works.csv --njobs 8 --min-abstract-length 250 --max-abstract-length 3000 --topic "Materials science"

This will output a file materials-science.filtered.works.csv in the data/table/ folder containing all works that satisfy the conditions.

Data Preparation

Cleaning abstracts

Clean the abstracts:

$ python preparation/clean_abstracts.py materials-science.filtered.works.csv --folder data/

This will output a file materials-science.cleaned.works.csv in the specified folder containing all works with cleaned abstracts.

Data Enrichment

Note: As these operations are very time-consuming, the scripts make use of parallelization.

Materials Extraction

Extract 'chemical elements' from abstracts:

$ python preparation/extract_elements.py materials-science.cleaned.works.csv --folder data/

This will output a file materials-science.elements.works.csv in the specified folder containing all works with the extracted chemical elements in a separate column elements.
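
For intuition, a much-simplified sketch of what element extraction can look like (the whitelist is a small excerpt and the matching is far cruder than the real extractor):

import re

# Simplified illustration: match element-symbol tokens against a whitelist.
ELEMENTS = {"H", "He", "Li", "C", "N", "O", "Si", "Fe", "Cu", "Zn"}  # excerpt only

def extract_elements(abstract):
    tokens = re.findall(r"[A-Z][a-z]?", abstract)
    return sorted({t for t in tokens if t in ELEMENTS})

print(extract_elements("Thin films of Cu and Zn oxide were grown on Si."))
# -> ['Cu', 'Si', 'Zn']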

Concept Extraction (DEPRECATED)

Extract 'concepts' from abstracts using one of several methods (RAKE, KeyBERT, the OpenAlex concept list, or searching 'keywords' in abstracts):

$ python preparation/extract_concepts.py materials-science.elements.works.csv {method} {colname} --folder data/

e.g.:

$ python preparation/extract_concepts.py materials-science.elements.works.csv rake rake_concepts --folder data/

This will output a file materials-science.rake.works.csv in the specified folder containing all works with concepts extracted by rake ({method}) in a separate column rake_concepts ({colname}).

Concept Extraction (LLM)

The concepts used are generated by an LLM (LLaMA-2-13B) fine-tuned for this downstream task.

To see how the concepts are generated, check out this repository.

To replicate the process, copy the materials-science.elements.works.csv file to the concept extraction repository. After running the extraction there, copy the resulting materials-science.llamaX.works.csv file back to this repository.

Classification

Build Graph

Build concepts graph by executing the following command:

python graph/build.py \
  --input_path data/table/materials-science.llama2.works.csv \
  --output_path data/graph/edges.S.pkl \
  --output_lookup_path data/table/lookup/lookup.S.csv \
  --colname llama_concepts \
  --min_occurence 3 \
  --min_words 3 \
  --max_words 20 \
  --min_occurence_elements 3 \
  --min_amount_elements 2

This produces a pickled file (data/graph/edges.S.pkl above, as set via --output_path) containing the graph:

{
  "num_of_vertices": 123456,
  "edges": [(v1, v2, timestamp), (v1, v2, timestamp), ...],
}

Because of the sparse nature of the graph, it is stored as an edge list. The timestamp is the number of days elapsed since 1970-01-01.
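
A minimal sketch for loading the graph and decoding a timestamp, assuming the structure shown above:

import pickle
from datetime import date, timedelta

with open("data/graph/edges.S.pkl", "rb") as f:
    graph = pickle.load(f)

print(graph["num_of_vertices"])
v1, v2, ts = graph["edges"][0]
print(date(1970, 1, 1) + timedelta(days=int(ts)))  # timestamp -> calendar date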

Note: If you want to use RAKE concepts, first extract them and then replace llama_concepts with rake_concepts in the command above.

Note: The concepts are run through a filter mechanism that removes concepts which are not relevant to the domain. The filters are stored in the same file and can be extended or modified as needed.
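
A minimal sketch of such a filter mechanism (the rules below are invented for illustration; the actual list lives in the build script):

import re

# Illustrative rules only; the real filters are defined in graph/build.py.
STOP_CONCEPTS = {"study", "results", "method"}

def keep_concept(concept):
    c = concept.lower().strip()
    return (
        c not in STOP_CONCEPTS
        and len(c) >= 3
        and bool(re.search(r"[a-z]", c))  # drop purely numeric/symbolic entries
    )

concepts = ["grain boundary diffusion", "42", "results"]
print([c for c in concepts if keep_concept(c)])  # ['grain boundary diffusion']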

Generate Raw Classification Task Data

Generate training and test data for the classification task: given {n} vertex pairs, decide whether or not they will be connected in {delta} years.

python model/create_data.py \
 --graph_path data/graph/edges.pkl \
 --data_path data/model/data.pkl \
 --year_start_train 2016 \
 --year_start_test 2019 \
 --year_delta 3 \
 --edges_used_train 5_000_000 \
 --edges_used_test 2_000_000 \
 --train_val_split 0.8 \
 --min_links 1 \
 --max_v_degree=None \
 --verbose=True

Output:

{
  "year_train": 2016,
  "year_test": 2019,
  "year_delta": 3,
  "min_links": 1,
  "max_v_degree": None,
  "X_train": [(v1, v2), ...] unnconnected vertex pairs until 2016, (80%)
  "y_train": [0, 1, 1, 0, ...] indicating whether the vertex pairs will be connected in 2019 (2016 + 3) (80%)
  "X_val": (20%) unnconnected vertex pairs until 2016,
  "y_val": (20%) whether the vertex pairs will be connected in 2019,
  "X_test": [(v1, v2), ...] unnconnected vertex pairs until 2019,
  "y_test": whether the vertex pairs will be connected in 2022,
}
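
A minimal sketch for inspecting this file, assuming the keys shown above:

import pickle

with open("data/model/data.pkl", "rb") as f:
    data = pickle.load(f)

# Each sample is an unconnected vertex pair; its label says whether the
# pair becomes connected year_delta years later.
print(data["year_train"], data["year_test"], data["year_delta"])
print(len(data["X_train"]), len(data["X_val"]), len(data["X_test"]))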

Classification Process

The classification process can typically be divided into two steps:

  1. Generate embeddings for nodes
  2. Train a (binary) classifier on the (concatenated) embeddings
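
In sketch form, with toy stand-ins for the real embeddings and labels (scikit-learn here is purely illustrative, not the project's model):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
embeddings = {v: rng.random(8) for v in range(4)}   # toy node embeddings
pairs, labels = [(0, 1), (2, 3), (0, 2), (1, 3)], [0, 1, 0, 1]

# Step 2: one feature vector per pair = concatenation of both node embeddings.
X = np.array([np.concatenate([embeddings[a], embeddings[b]]) for a, b in pairs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(X)[:, 1])  # predicted link probabilities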

Baseline Model

  1. Generate the embeddings

Embeddings for training:

python -u model/combi/pre_compute.py \
  --graph_path data/graph/edges.M.pkl \
  --output_path data/model/baseline/features.2016.binary.M.pkl.gz \
  --binary True \
  --years "[2012, 2013, 2014, 2015, 2016]"

Embeddings for validation:

python -u model/combi/pre_compute.py \
  --graph_path data/graph/edges.M.pkl \
  --output_path data/model/baseline/features.2019.binary.M.pkl.gz \
  --binary True \
  --years "[2015, 2016, 2017, 2018, 2019]"

  2. Train the model

python model/baseline/train.py \
  --data_path data/model/data.pkl \
  --embeddings_path data/model/baseline/embeddings.pkl \
  --lr 0.001 \
  --batch_size 100 \
  --num_epochs 1 \
  --train_model True \
  --save_model data/model/baseline/model.pt \
  --metrics_path data/model/baseline/metrics.pkl \
  --eval_mode False

MLP

  1. Use the concatenation of baseline features and word embeddings as input. See the section Word Embeddings below for how to generate the word embeddings.

  2. Train the model

python -u model/combi/train.py \
  --data_path data/model/data.pkl \
  --emb_f_train_path data/model/combi/features_2016.M.pkl.gz \
  --emb_f_test_path data/model/combi/features_2019.M.pkl.gz \
  --emb_c_train_path data/model/concept_embs/av_embs_2016.M.pkl.gz \
  --emb_c_test_path data/model/concept_embs/av_embs_2019.M.pkl.gz \
  --lr 0.001 \
  --gamma 0.8 \
  --batch_size 100 \
  --num_epochs 1000 \
  --pos_ratio 0.3 \
  --dropout 0.1 \
  --layers "[1556, 1024, 512, 256, 64, 32, 16, 8, 4, 1]" \
  --step_size 40 \
  --log_interval 10 \
  --log_file "logs.log" \
  --save_model False \
  --sliding_window 5 \
  --use_loader False
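For illustration, a minimal PyTorch sketch of an MLP built from such a --layers specification (not the project's exact model):

import torch
import torch.nn as nn

def build_mlp(layers, dropout=0.1):
    # layers as passed via --layers, e.g. [1556, 1024, ..., 4, 1].
    blocks = []
    for i in range(len(layers) - 2):
        blocks += [nn.Linear(layers[i], layers[i + 1]), nn.ReLU(), nn.Dropout(dropout)]
    blocks.append(nn.Linear(layers[-2], layers[-1]))  # single output logit
    return nn.Sequential(*blocks)

model = build_mlp([1556, 1024, 512, 256, 64, 32, 16, 8, 4, 1])
loss_fn = nn.BCEWithLogitsLoss()                     # binary link prediction
logits = model(torch.randn(100, 1556)).squeeze(1)    # batch of pair features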

Word Embeddings

Generate Word Embeddings

Word embeddings are generated using BERT or a fine-tuned variant of BERT, e.g. MatSciBERT. To extract embeddings for all concepts (all embedded tokens comprising a concept are averaged), run:

python -u word_embeddings/generate.py \
  --concepts_path data/table/materials-science.llama.works.csv \
  --lookup_path data/table/lookup/lookup.Ls.csv \
  --output_path data/embeddings/large/ \
  --log_to_stdout False \
  --step_size 500 \
  --start 0 \
  --end 750000

Currently, if a concept is not contained verbatim in the abstract (this can happen because LLMs can apply some "normalization" during extraction), the embedding vector is set to the average of all tokens in the abstract. On GPU4_A100, generating embeddings for 80k abstracts takes about 6h.
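
A condensed sketch of this lookup-and-average logic using Hugging Face transformers (the checkpoint id and the exact span matching are assumptions; the project's script may differ):

import torch
from transformers import AutoModel, AutoTokenizer

NAME = "m3rg-iitd/matscibert"  # assumed checkpoint id; plain BERT works too
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModel.from_pretrained(NAME)

def concept_embedding(abstract, concept):
    enc = tokenizer(abstract, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (seq_len, dim)
    ids = enc["input_ids"][0].tolist()
    concept_ids = tokenizer(concept, add_special_tokens=False)["input_ids"]
    # Look for the concept's token span in the abstract and average it.
    for i in range(len(ids) - len(concept_ids) + 1):
        if ids[i:i + len(concept_ids)] == concept_ids:
            return hidden[i:i + len(concept_ids)].mean(dim=0)
    return hidden.mean(dim=0)  # fallback: average over all abstract tokens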

Average Word Embeddings

Average the word (concept) embeddings so that they can be used as feature vectors for classification:

python word_embeddings/average_embs.py \
  --concepts_path data/table/materials-science.llama.works.csv \
  --lookup_path data/table/lookup/lookup_large.csv \
  --filter_path data/table/lookup/lookup_small.csv \
  --embeddings_dir data/embeddings/large/ \
  --output_path data/model/concept_embs/av_embs_small_2016.pkl.gz \
  --store_concepts_plain False \
  --until_year 2016
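
Conceptually, each concept's feature vector becomes the mean of its per-work embeddings over all works published up to --until_year; a sketch under an assumed record layout:

import numpy as np
from collections import defaultdict

def average_embeddings(records, until_year):
    # records: (concept, publication_year, embedding) triples -- assumed layout.
    sums, counts = defaultdict(float), defaultdict(int)
    for concept, year, emb in records:
        if year <= until_year:
            sums[concept] = sums[concept] + np.asarray(emb)
            counts[concept] += 1
    return {c: sums[c] / counts[c] for c in sums}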

Interview

LLM Report

Generate distilled version of reports:

python materials_concepts/report/pdf/generation/hack_llm_ready_report.py

Generate the LLM report (prompt engineering plus selected report sections, sent to LLM APIs) from the "distilled" version of the reports:

export RESEARCHER="...";

python materials_concepts/report/generate_llm_selection.py \
  --txt_path materials_concepts/report/prompt_sec3.txt \
  --tex_path materials_concepts/report/pdf/generation/${RESEARCHER}/distilled/plain_suggestions.tex \
  --output_path materials_concepts/report/pdf/generation/${RESEARCHER}/llm_report_sec3.txt

python materials_concepts/report/generate_llm_selection.py \
  --txt_path materials_concepts/report/prompt_sec5.txt \
  --tex_path materials_concepts/report/pdf/generation/${RESEARCHER}/distilled/plain_suggestions.tex \
  --output_path materials_concepts/report/pdf/generation/${RESEARCHER}/llm_report_sec5.txt
