Demiurge is a modular and fully automated Python-based platform designed to generate machine learning input data from both simulated spectral and structural molecular representations. Specifically developed to support QSPR studies, it processes chemical structures provided as SMILES strings alongside target properties (e.g., CHI logD), and produces ready-to-use feature matrices for classical ML models and deep neural networks.
The tool supports four representation modes: predicted ¹H NMR spectra, ¹³C NMR spectra, concatenated ¹H | ¹³C spectral vectors and ECFP4 molecular fingerprints. Input files in .csv format are validated for SMILES integrity and formatting. Molecular structures are reconstructed using RDKit or obabel, optimized in 3D, then flattened to 2D to comply with NMR prediction tools.
NMR spectral predictions are performed locally using a standalone Java-based engine built on the NMRshiftDB2 database, utilizing HOSE-code pattern matching. The resulting chemical shift lists are then transformed into fixed-length spectral vectors through a custom bucketing strategy (200 bins per nucleus by default; the value can be changed in the bucketing module script), enabling compatibility with ML pipelines. In the fused representation, the ¹H and ¹³C vectors are joined head to tail.
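The bucketing idea can be illustrated with a minimal sketch. It assumes equal-width bins over a fixed ppm window and simply counts how many predicted shifts fall into each bin. The function name `bucket_shifts`, the 0 to 12 ppm window, and the count-based encoding are assumptions made for illustration; they are not taken from Demiurge's bucketing module.

```python
def bucket_shifts(shifts, ppm_min, ppm_max, n_bins=200):
    """Turn a list of chemical shifts (ppm) into a fixed-length count vector."""
    if ppm_max <= ppm_min:
        raise ValueError("ppm_max must be greater than ppm_min")
    width = (ppm_max - ppm_min) / n_bins
    vector = [0] * n_bins
    for shift in shifts:
        if not ppm_min <= shift <= ppm_max:
            continue  # shifts outside the window are ignored in this sketch
        # A shift exactly at the upper edge is assigned to the last bin.
        index = min(int((shift - ppm_min) / width), n_bins - 1)
        vector[index] += 1
    return vector


# Hypothetical 1H shift list; the 0-12 ppm window is an assumed range.
h_vector = bucket_shifts([0.93, 1.57, 3.58, 3.58], ppm_min=0.0, ppm_max=12.0)
print(len(h_vector), sum(h_vector))
```

Because every molecule maps to a vector of the same length regardless of how many shifts it has, the vectors can be stacked directly into a feature matrix.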
For ECFP4 generation, RDKit's Morgan fingerprinting (radius = 2) is used to construct 2048-bit binary descriptors. All generated feature matrices are merged with property labels (e.g., CHI logD), headers are appended, and the final datasets are saved as .csv files.
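As a rough illustration of the fingerprint step, the sketch below computes a 2048-bit Morgan fingerprint with radius 2 using RDKit's fingerprint generator API. It starts from a SMILES string for brevity, whereas the text describes Demiurge working from generated .mol files; the function name `ecfp4_bits` is hypothetical, and the way Demiurge's own fp_generator.py calls RDKit may differ.

```python
N_BITS = 2048  # fingerprint length stated in the text
RADIUS = 2     # Morgan radius 2 corresponds to ECFP4


def ecfp4_bits(smiles):
    """Return the ECFP4 fingerprint of a SMILES string as a list of 0/1 ints."""
    # Imported lazily so this module can be loaded even without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=RADIUS, fpSize=N_BITS)
    fingerprint = generator.GetFingerprint(mol)
    # ToBitString() yields a string of '0' and '1' characters of length fpSize.
    return [int(bit) for bit in fingerprint.ToBitString()]
```

Calling `ecfp4_bits("CCO")` with RDKit installed returns a list of 2048 zeros and ones that can be written as one row of the feature matrix.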
Demiurge is optimized for parallel execution on 8-core CPUs, achieving processing times of ~6 minutes for ¹H NMR spectra, ~15 minutes for ¹³C NMR, and under 2 minutes for ECFP4 on datasets of ~1000 molecules.
The architecture is fully extensible and easily adaptable to other endpoints such as logP, TPSA, or logS, and to other types of spectral or molecular representations.
The tool uses the NMRshiftDB2 predictor, which can be accessed here.
For more detailed information, check the original Open Access research paper:
Leniak, A.; Pietruś, W.; Kurczab, R. From NMR to AI: Fusing 1H and 13C Representations for Enhanced QSPR Modeling. J Chem Inf Model 2025. https://doi.org/10.1021/acs.jcim.5c01791.
If you use this software in your research, please cite our publication.
In the example run, the script predicted 13C NMR spectra from an input file in which one of the rows was malformed. In addition, the file used a comma as the decimal separator and a semicolon as the column separator.
- DEMIURGE
- NMR or ECFP4-based Machine Learning Input Generator
- 🖥 Examples of Working Program
- 💡 Key Features
- ✅ Requirements
- ⚙️ Installation
- 🗂 Directory Structure
- 🚀 Usage
- 📄 Command Line Arguments
- 📄 Example Usage
- 📄 Input CSV Format
- ⚙️ Script Workflow for NMR-based Output Data (1H / 13C)
- ⚙️ Script Workflow for ECFP4-based Output Data (FP)
- 🛠 Troubleshooting
- 📜 License
- Molecule Generation: Converts SMILES codes into 3D molecular structures and saves them as flattened 2D .mol files using RDKit.
- NMR Spectrum Prediction: Predicts NMR spectra for each molecule using a custom Java-based NMRshiftDB2 predictor.
- ECFP4 Fingerprint Generation: Generates the feature space using ECFP4 fingerprints with radius 2 (activated via --predictor FP).
- Bucketization: Converts predicted NMR spectra into a uniform matrix using a bucketing technique.
- Data Merging: Merges the bucketized spectra/fingerprints with property labels to form a consolidated dataset.
- Label Insertion: Adds a target property column to the merged dataset based on a specified label column.
- Custom Headers: Adds headers to the final dataset for easy identification and readability.
- Data Concatenation: Fuses the 1H and 13C representations into a new hybrid representation.
- Optional Cleanup: Deletes all intermediate files and folders to save space and reduce clutter.
Important: The script was tested under Windows 10 using PowerShell and works reliably in this environment on Python 3.11.4. It has not been tested on Linux or other operating systems.
Ensure the following software and libraries are installed:
- Python Libraries:
  - rdkit
  - pandas
  - numpy
  - art
  - tqdm

  Install the required Python packages using:

  pip install rdkit pandas numpy tqdm art

  Alternatively, run the provided Python script, which checks whether the required libraries are installed and installs any that are missing:

  python install_modules.py

- Open Babel: Open Babel is needed to process "exotic" structures that the standard RDKit library cannot process. Make sure the obabel command is available in your system's PATH.
- Java SDK:
  - Java Development Kit (JDK) is required to compile and run the Java batch processor for NMR spectrum prediction. Make sure the javac and java commands are available in your system's PATH.
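A quick way to confirm that the external tools are reachable is to look them up on PATH from Python. The sketch below uses only the standard library; the helper name `find_missing_tools` is hypothetical and is not part of Demiurge.

```python
import shutil


def find_missing_tools(tools=("java", "javac", "obabel")):
    """Return the names of command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools were found.")
```

If any name is reported as missing, add the corresponding installation directory (for Java, the JDK bin directory) to PATH and open a new terminal session before running the script.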
If you prefer to use Conda - utilize the provided environment file to create your Conda environment:
conda env create -f conda_environment.yml
conda activate predictor_logD
Then download and install the Java SDK (tested on version 23). Ensure java and javac are accessible in your PATH.
Clone the repository from GitHub and navigate to the project directory:
git clone https://github.com/Prospero1988/Demiurge.git
cd Demiurge
The project is organized into the following directories and files:
demiurge/
│
├── demiurge.py                    # Main script for executing the pipeline
├── input_example.csv              # Example of the input file
├── install_modules.py             # Installs required Python packages
├── predictor/
│   ├── predictorh.jar             # Java-based predictor for 1H spectra [NMRshiftDB2]
│   ├── predictor13C.jar           # Java-based predictor for 13C spectra [NMRshiftDB2]
│   ├── cdk-2.9.jar                # CDK library required for spectrum prediction.
│   ├── BatchProcessor1H.java      # Java batch processor for 1H spectra [NMRshiftDB2]
│   └── BatchProcessor13C.java     # Java batch processor for 13C spectra [NMRshiftDB2]
├── logD_predictor_bin/            # Directory containing helper modules
│   ├── csv_checker.py             # Verifies and preprocesses CSV files
│   ├── concatenator.py            # Concatenate 1H and 13C matrices into new fused matrix
│   ├── gen_mols.py                # Generates .mol files from SMILES strings
│   ├── bucket.py                  # Buckets NMR spectra
│   ├── merger.py                  # Merges bucketed spectra CSVs
│   ├── labeler.py                 # Adds labels to the merged spectra file
│   ├── custom_header.py           # Adds custom headers to the final dataset
│   ├── fp_generator.py            # ECFP4 Fingerprints generator
│   └── model_query.py             # Queries machine learning models
└── README.md                      # Project documentation (this file)
To run the script, use the following command:
python demiurge.py --csv_path <input_csv_file> --predictor <NMR_type> --label_column <column_number> [--clean]
If you are using Conda, remember to activate the environment first:
conda activate predictor_logD

| Argument | Required | Accepted Values | Description | 
|---|---|---|---|
| --csv_path | ✅ Yes | (any valid CSV path) | Path to the input CSV file containing compound names and SMILES codes. | 
| --predictor | ✅ Yes | 1H, 13C, hybrid, FP | Type of predictor: 1H or 13C for NMR, hybrid for fused 1H/13C, or FP for ECFP4 fingerprints. | 
| --label_column | ✅ Yes | (integer ≥ 1) | Column index (1-based) in the input CSV file that contains the target property values. | 
| --clean | ❌ No | (flag, no value) | If set, the script will delete all intermediate temporary files and folders after execution. | 
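The argument table above maps naturally onto Python's argparse module. The following is a minimal sketch of such a parser, written only from the table; how demiurge.py actually parses its arguments is not shown in this README, so the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line arguments."""
    parser = argparse.ArgumentParser(description="Sketch of the Demiurge CLI")
    parser.add_argument("--csv_path", required=True,
                        help="Path to the input CSV file")
    parser.add_argument("--predictor", required=True,
                        choices=["1H", "13C", "hybrid", "FP"],
                        help="Representation to generate")
    parser.add_argument("--label_column", required=True, type=int,
                        help="1-based index of the target property column")
    parser.add_argument("--clean", action="store_true",
                        help="Delete intermediate files after execution")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--csv_path", "test.csv", "--predictor", "hybrid", "--label_column", "3", "--clean"]
)
print(args.csv_path, args.predictor, args.label_column, args.clean)
```

Restricting --predictor with `choices` makes argparse reject any value outside the four accepted modes before the pipeline starts.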
python demiurge.py --csv_path 'test.csv' --predictor '1H' --label_column 3 --clean
In this example:
- The script will read the input CSV file test.csv.
- It will generate .mol files for each molecule based on its SMILES code.
- It will predict the 1H NMR spectra for each molecule.
- The spectra will be bucketized and merged into a single file.
- The target property values from column 3 in test.csv will be added as labels.
- All intermediate files and directories will be deleted after execution due to the --clean option.
python demiurge.py --csv_path 'test.csv' --predictor 'hybrid' --label_column 3 --clean
In this example:
- The script will read the input CSV file test.csv.
- It will generate .mol files for each molecule based on its SMILES code.
- It will predict the 1H NMR spectra for each molecule.
- The spectra will be bucketized and merged into a single file.
- The target property values from column 3 in test.csv will be added as labels.
- The script will then predict the 13C spectra and bucket, merge, and label them in the same way as the 1H data.
- The 1H and 13C matrices will be fused into the concatenated representation 1H|13C.
- All intermediate files and directories will be deleted after execution due to the --clean option.
python demiurge.py --csv_path 'test.csv' --predictor 'FP' --label_column 3 --clean
In this example:
- The script will read the input CSV file test.csv.
- It will generate .mol files for each molecule based on its SMILES code.
- It will calculate ECFP4 fingerprints for each molecule using RDKit.
- The fingerprint matrix will be merged with the target property values from column 3 in test.csv.
- All intermediate files and directories will be deleted after execution due to the --clean option.
The input CSV file should have at least the following columns:
- MOLECULE_NAME: The name or identifier of the molecule.
- SMILES: The SMILES code of the molecule.
- <Label>: The property values to be modeled (the column must be specified with the --label_column parameter).
Example test.csv:
| MOLECULE_NAME | SMILES | Label | 
|---|---|---|
| Compound1 | CCCO | 5.3 | 
| Compound2 | CCC(=O)O | 2.1 | 
| Compound3 | CCN(CC)CC | 7.8 | 
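To make the 1-based --label_column convention concrete, the sketch below reads name, SMILES, and label from CSV text with the standard library. It also tolerates a semicolon column separator and a decimal comma, as in the example run described earlier. The function name `read_labels`, the assumption that the name and SMILES sit in the first two columns, and the choice to silently skip malformed rows are all illustrative; the behavior of Demiurge's csv_checker.py is not documented here and may differ.

```python
import csv
import io


def read_labels(csv_text, label_column, delimiter=","):
    """Return (name, smiles, label) tuples; label_column is 1-based."""
    rows = []
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    next(reader, None)  # skip the header row
    for record in reader:
        if len(record) < label_column:
            continue  # row is too short to contain the label column
        # Convert a decimal comma to a decimal point before parsing.
        raw_label = record[label_column - 1].strip().replace(",", ".")
        try:
            label = float(raw_label)
        except ValueError:
            continue  # label is not numeric; row skipped in this sketch
        rows.append((record[0].strip(), record[1].strip(), label))
    return rows


example = (
    "MOLECULE_NAME;SMILES;Label\n"
    "Compound1;CCCO;5,3\n"
    "Compound2;CCC(=O)O;2,1\n"
    "broken row\n"
)
print(read_labels(example, label_column=3, delimiter=";"))
```

Subtracting one from `label_column` converts the 1-based index used on the command line into the 0-based index Python lists expect.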
- Step 1: Generate .mol Files:
  - Reads SMILES codes from the input CSV file and generates corresponding .mol files using RDKit.
- Step 2: Predict NMR Spectra:
  - Uses the Java-based BatchProcessor1H or BatchProcessor13C to predict NMR spectra for each molecule. The predictor is part of the NMRshiftDB2 database.
- Step 3: Bucketize Spectra:
  - Converts the predicted spectra into a uniform bucketized matrix for easy analysis and machine learning input generation.
- Step 4: Merge Spectra and Labels:
  - Merges the bucketized spectra with the specified label column from the input CSV file.
- Step 5: Add Custom Headers:
  - Adds descriptive headers to the final merged CSV file, making it easier to interpret and use for machine learning tasks.
- Step 6: Cleanup (Optional):
  - Deletes all intermediate files and directories if the --clean option is specified.
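The merge and header steps above can be sketched with the standard library as follows. The column order (name, label, then features) and the header names `MOLECULE_NAME`, `LABEL`, and `FEATURE_n` are assumptions chosen for illustration; the actual layout produced by merger.py, labeler.py, and custom_header.py may be different.

```python
import csv
import io


def build_dataset(names, labels, vectors):
    """Return CSV text with one row per molecule: name, label, feature vector."""
    n_features = len(vectors[0])
    if any(len(vector) != n_features for vector in vectors):
        raise ValueError("all feature vectors must have the same length")
    header = ["MOLECULE_NAME", "LABEL"] + [
        f"FEATURE_{i}" for i in range(1, n_features + 1)
    ]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for name, label, vector in zip(names, labels, vectors):
        writer.writerow([name, label, *vector])
    return buffer.getvalue()


# Tiny three-bin vectors stand in for the real bucketized spectra.
dataset = build_dataset(["Compound1", "Compound2"], [5.3, 2.1], [[0, 1, 0], [2, 0, 0]])
print(dataset)
```

Checking that all vectors share one length before writing catches bucketing mismatches early, since a ragged matrix cannot be loaded as ML input.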
- Step 1: Generate .mol Files:
  - Reads SMILES codes from the input CSV file and generates corresponding .mol files using RDKit.
- Step 2: Predict NMR Spectra:
  - Uses the Java-based BatchProcessor1H or BatchProcessor13C to predict NMR spectra for each molecule. The predictor is part of the NMRshiftDB2 database.
- Step 3: Bucketize Spectra:
  - Converts the predicted spectra into a uniform bucketized matrix for easy analysis and machine learning input generation.
- Step 4: Merge Spectra and Labels:
  - Merges the bucketized spectra with the specified label column from the input CSV file.
- Step 5: Add Custom Headers:
  - Adds descriptive headers to the final merged CSV file, making it easier to interpret and use for machine learning tasks.
- Step 6: Concatenation of 1H and 13C:
  - The script joins the 1H and 13C vectors head to tail. It combines representations that share the same MOLECULE_NAME and the same label value.
- Step 7: Cleanup (Optional):
  - Deletes all intermediate files and directories if the --clean option is specified.
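The concatenation step can be pictured with the short sketch below, which joins 1H and 13C vectors head to tail for molecules that appear in both inputs with matching labels. The dictionary-based data layout and the function name `fuse_representations` are assumptions for illustration, as is the decision to skip molecules without a 13C counterpart and to raise on a label mismatch; concatenator.py may handle these cases differently.

```python
def fuse_representations(h_rows, c_rows):
    """Join 1H and 13C vectors head to tail.

    Each argument maps MOLECULE_NAME -> (label, vector).
    """
    fused = {}
    for name, (h_label, h_vector) in h_rows.items():
        if name not in c_rows:
            continue  # no 13C counterpart; skipped in this sketch
        c_label, c_vector = c_rows[name]
        if h_label != c_label:
            raise ValueError(f"label mismatch for {name}")
        # 1H bins come first, followed by the 13C bins.
        fused[name] = (h_label, list(h_vector) + list(c_vector))
    return fused


h_data = {"Compound1": (5.3, [1, 0]), "Compound2": (2.1, [0, 1])}
c_data = {"Compound1": (5.3, [3, 3, 3])}
print(fuse_representations(h_data, c_data))
```

With 200 bins per nucleus, each fused vector would have 400 features, the first 200 from 1H and the last 200 from 13C.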
- Step 1: Generate .mol Files
  - Reads SMILES strings from the input CSV file and converts them into 2D .mol files using RDKit.
- Step 2: Calculate ECFP4 Fingerprints
  - For each molecule, extended-connectivity fingerprints (ECFP4) are generated from the .mol structures using the RDKit implementation.
- Step 3: Merge Fingerprints and Labels
  - Merges the computed ECFP4 vectors with the label column (e.g., logD values) from the input CSV file into a unified DataFrame.
- Step 4: Add Custom Headers
  - Assigns descriptive headers to the final CSV file, improving interpretability and downstream machine learning usability.
- Step 5: Cleanup (Optional)
  - Deletes all intermediate files and directories if the --clean flag is used.
- Java Compilation Issues:
  - Ensure that the javac and java commands are available and the Java SDK is installed.
  - If javac is not recognized, check the system's PATH variable and make sure it includes the path to the JDK bin directory.
- Missing Dependencies:
  - Ensure that all required Python libraries (rdkit, pandas, and numpy) are installed.
- File Not Found Errors:
  - Verify the paths to input files and directories. Ensure that the input CSV file and other necessary files are correctly specified.
- Memory or Performance Issues:
  - If handling a large dataset, consider increasing the memory allocation for the Java runtime by adjusting the -Xmx parameter in the script.
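For the memory tip above, the sketch below shows how a Java command line with an explicit -Xmx value might be assembled in Python. The helper name `java_command`, the 4g heap value, and the classpath entries are assumptions for illustration; the file names come from the directory listing in this README, but the way demiurge.py actually invokes the batch processor, and any arguments the processor expects, are not documented here.

```python
import os


def java_command(main_class, classpath_entries, max_heap="4g", extra_args=()):
    """Build a java command line with an explicit maximum heap size."""
    # os.pathsep is ';' on Windows and ':' on Unix-like systems.
    classpath = os.pathsep.join(classpath_entries)
    return ["java", f"-Xmx{max_heap}", "-cp", classpath, main_class, *extra_args]


# Hypothetical invocation: processor arguments are deliberately left empty.
command = java_command(
    "BatchProcessor1H",
    ["predictor/predictorh.jar", "predictor/cdk-2.9.jar", "predictor"],
    max_heap="4g",
)
print(" ".join(command))
```

A list built this way can be passed to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.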
This project is licensed under the MIT License - see the LICENSE file for details.


