
Conversation

enjyashraf18
Collaborator

To address Issues #150 and #151, the storage of atomic species data in AtomDB has been refactored to replace the previous MessagePack-based system with a structured HDF5 format.

Changes Made For Each Dataset

  1. Refactored the run module.

  2. Added h5file_creator.py as the core module for generating the HDF5 structure. It creates an organized group for each atomic species, holding its defined properties, in the datasets_data.h5 file (see the sketch after this list).

  3. Migrated the existing data into the new HDF5 file.
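
A minimal sketch of what this group-per-species layout could look like with PyTables; the dataset name, species label, and property names below are illustrative assumptions, not the actual AtomDB schema.

    # Hypothetical sketch: one HDF5 group per species, one array per property.
    import numpy as np
    import tables as tb

    def write_species(h5path, dataset, species, properties):
        """Store each property array under /<dataset>/<species>/<property>."""
        with tb.open_file(h5path, mode="a") as f:
            if f"/{dataset}" not in f:
                f.create_group("/", dataset)
            grp = f.create_group(f"/{dataset}", species)
            for name, values in properties.items():
                f.create_array(grp, name, np.asarray(values))

    # Example: store two made-up properties for neutral carbon.
    write_species("datasets_data.h5", "gaussian", "C_0",
                  {"energy": [-37.845], "rs": np.linspace(0.0, 10.0, 5)})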

Storage and Compression

  1. Applied PyTables compression to minimize storage.

  2. Benchmarked several compressors: Blosc2 with the LZ4 codec ("blosc2:lz4") achieved the best results in both speed and compression ratio, outperforming plain Blosc2, Zlib, and LZO (a sketch of the filter setup follows below).

Average dataset size is now reduced to 400 MB – 1 GB.
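
A sketch of how the chosen compressor is applied through PyTables filters; the compression level and node name here are assumptions for illustration.

    # "blosc2:lz4" selects Blosc2 with the LZ4 codec.
    import numpy as np
    import tables as tb

    filters = tb.Filters(complib="blosc2:lz4", complevel=9, shuffle=True)

    with tb.open_file("datasets_data.h5", mode="a") as f:
        data = np.random.rand(100_000)
        # Compressed storage requires a chunked array, hence create_carray.
        f.create_carray("/", "dens_example", obj=data, filters=filters)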

For the numeric dataset:

1. Updated the run module for the numeric dataset
2. Created a customized HDF5 file creator for numeric
3. Migrated all old files from msgpack to HDF5
4. Handled the wildcard case while loading the element (see the sketch after these lists)

For the nist dataset:

1. Updated the run module for the nist dataset
2. Created a customized HDF5 file creator for nist
3. Migrated all old files from msgpack to HDF5
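
A sketch of one way the wildcard case could be handled, assuming the group-per-species layout sketched earlier; matching_species is a hypothetical helper, not the actual load implementation.

    # When the requested element is "*" (or a pattern such as "C*"),
    # return every matching species group instead of a single one.
    from fnmatch import fnmatch
    import tables as tb

    def matching_species(h5path, dataset, pattern="*"):
        with tb.open_file(h5path, mode="r") as f:
            return [g._v_name
                    for g in f.iter_nodes(f"/{dataset}", classname="Group")
                    if fnmatch(g._v_name, pattern)]
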
@msricher
Collaborator

@gabrielasd @marco-2023 I think one of you set up the Python dependencies in the GitHub Action. Are you able to see why the tests here are failing?

@gabrielasd
Collaborator

> @gabrielasd @marco-2023 I think one of you set up the Python dependencies in the GitHub Action. Are you able to see why the tests here are failing?

Hi @msricher, could it be that the load method in the species module is invoking the datasets run.py file? I was scrolling through the CI run outputs and noticed lines like:

atomdb/species.py:791: in load
dataset_submodule = import_module(f"atomdb.datasets.{dataset}.h5file_creator")
...
atomdb/datasets/gaussian/h5file_creator.py:6: in <module>
from atomdb.datasets.gaussian.run import NPOINTS
...
from gbasis.evals.density import evaluate_density as eval_dens
E ModuleNotFoundError: No module named 'gbasis'

My understanding of how our pytest CI workflow works is partial, but I think it only installs the direct dependencies of atomdb and leaves out IOData, Grid, etc. (i.e., the dependencies we need during development to compile the datasets). But if the run.py files are being imported, those modules end up in the import chain, so my guess is that this is what's making the tests fail.

@marco-2023 what do you think?
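
A minimal sketch of one common workaround, assuming gbasis and friends are only needed when a dataset is actually compiled: defer those imports into the function body, so that importing the submodule (as species.py's load does) no longer requires the development dependencies.

    # Hypothetical sketch; the NPOINTS value and run signature are illustrative.
    NPOINTS = 10_000  # module-level constants stay importable without gbasis

    def run(*args, **kwargs):
        # Executed only when a dataset is actually compiled.
        from gbasis.evals.density import evaluate_density as eval_dens
        ...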

@enjyashraf18
Collaborator Author

Hi @gabrielasd, I tried removing the run imports from h5file_creator.py and hardcoding NPOINTS, as a trial to see if that was the root cause; the only place run is called is now inside compile_species (same as before), but the CI is still failing on the missing modules.
