
Conversation

@enjyashraf18 (Collaborator) commented Jul 23, 2025

To address Issue #148, the handling of atomic species data in AtomDB has been refactored to replace the previous MessagePack-based storage system with a structured HDF5 format.

Modifications

  1. Atomic datasets are now stored in the datasets_data.h5 file under the /Datasets group, with hierarchical organization. For example:
/Datasets
├── /slater
│   ├── /H
│   │   └── /H_000_002_000
│   ├── /C
│   │   └── /C_000_003_000
├── /gaussian
├── /hci
├── /nist
├── /numeric
├── /uhf_augccpvdz
  2. Enhanced Query Capabilities
  • Supports efficient filtering by atomic properties.
  • Replaces the old search techniques with direct HDF5 path queries.
  3. HDF5 Backend Infrastructure
    Created h5file_creator.py as the core module for generating the HDF5 structure. It creates organized groups for any atomic species with defined properties in the datasets_data.h5 file, replacing the old MessagePack approach.

  4. Class Structure
    The SpeciesData dataclass has been removed and replaced with a dynamic DefinitionClass imported per dataset.

The current modifications focus on the Slater dataset, and all MessagePack files for this dataset have been successfully migrated to datasets_data.h5.
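The hierarchy above can be sketched with PyTables, which the code snippets in this PR already use as `pt`. The helper name `ensure_species_group` and the demo file name are hypothetical; the group and species names come from the tree in this description.

```python
import tables as pt

# Hypothetical helper: create the nested /Datasets/<dataset>/<elem>/<species>
# groups shown above, skipping any that already exist.
def ensure_species_group(h5path, dataset, elem, species_id):
    with pt.open_file(h5path, mode="a") as h5file:
        if "/Datasets" not in h5file:
            h5file.create_group("/", "Datasets")
        if f"/Datasets/{dataset}" not in h5file:
            h5file.create_group("/Datasets", dataset)
        if f"/Datasets/{dataset}/{elem}" not in h5file:
            h5file.create_group(f"/Datasets/{dataset}", elem)
        if f"/Datasets/{dataset}/{elem}/{species_id}" not in h5file:
            h5file.create_group(f"/Datasets/{dataset}/{elem}", species_id)

ensure_species_group("datasets_demo.h5", "slater", "H", "H_000_002_000")
```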

max_norba = 30 # needs to be calculated
hdf5_file = files("atomdb.data").joinpath("datasets_data.h5")

SLATER_PROPERTY_CONFIGS = [
Collaborator commented:

Include the number of basis functions (nbasis) in this list

value = pt.Float64Col(shape=(max_norba,))


def create_properties_tables(hdf5_file, parent_folder, config, value):
Collaborator commented:

Document the code: for this and the functions below, add docstrings explaining what the function does and describing its input parameters
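A possible shape for the requested docstring, in numpydoc style; the parameter descriptions here are inferred from the argument names and are assumptions to be corrected:

```python
def create_properties_tables(hdf5_file, parent_folder, config, value):
    """Create the per-property tables under a species group.

    Parameters
    ----------
    hdf5_file : tables.File
        Open handle to the datasets HDF5 file.
    parent_folder : tables.Group
        Species group under which the property tables are created.
    config : list
        Property configuration entries (e.g. SLATER_PROPERTY_CONFIGS).
    value : tables.Col
        Column descriptor used for the property values.
    """
```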



def create_tot_array(h5file, parent_folder, key, array_data):
    tot_gradient_array = h5file.create_carray(parent_folder, key, pt.Float64Atom(), shape=(10000,))
Collaborator commented:

The value of the shape parameter in this line is hardcoded.
Since it corresponds to the number of radial points where density properties get evaluated, import the variable NPOINTS from the run script.

https://github.com/enjyashraf18/AtomDB/blob/9be8ea95418584503b818b9cc0360e41a2c0f427/atomdb/datasets/slater/run.py#L37
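Following that suggestion, the hardcoded shape could be replaced like this. The import is commented out and a stand-in constant with the current value is used so the sketch runs on its own; the real value should come from the linked run script.

```python
import tables as pt

# from atomdb.datasets.slater.run import NPOINTS   # as the review suggests
NPOINTS = 10000  # stand-in with the current value so this sketch is runnable

def create_tot_array(h5file, parent_folder, key, array_data):
    # the number of radial grid points is now defined in one place
    return h5file.create_carray(parent_folder, key, pt.Float64Atom(), shape=(NPOINTS,))
```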


def create_hdf5_file(fields, dataset, elem, charge, mult, nexc):
    dataset = dataset.lower()
    shape = 10000 * max_norba
Collaborator commented:

Here also replace the hardcoded value for the number of radial points (NPOINTS=10000)

datapath : str, optional
Path to the local AtomDB cache, by default the value of the DEFAULT_DATAPATH variable.
def dump(fields, dataset, elem, charge, mult, nexc):
Collaborator commented:

For consistency with the signature of the other functions in this module, I suggest modifying the order of the input arguments to:
dump(elem, charge, mult, nexc, dataset, fields)

Or, if it is possible to get the parameters elem, charge, mult and nexc from the parameter fields, modify the function to:
dump(fields, dataset)

enjyashraf18 (Author) replied:

Yes, it's definitely possible. Instead of passing all the parameters again like in the old structure, we can extract them directly from the fields, which makes it cleaner.
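A sketch of the `dump(fields, dataset)` variant discussed here. It assumes `fields` exposes elem/charge/mult/nexc as attributes; the attribute names are taken from the old argument list and the zero-padded species ID format from the tree in the PR description, both of which are assumptions.

```python
from types import SimpleNamespace

def dump(fields, dataset):
    # pull the identifying parameters straight from fields
    elem, charge, mult, nexc = fields.elem, fields.charge, fields.mult, fields.nexc
    species_id = f"{elem}_{charge:03d}_{mult:03d}_{nexc:03d}"
    target = f"/Datasets/{dataset.lower()}/{elem}/{species_id}"
    # ... write the tables under `target` as before ...
    return target

demo = SimpleNamespace(elem="H", charge=0, mult=2, nexc=0)
print(dump(demo, "slater"))  # → /Datasets/slater/H/H_000_002_000
```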

"at_radius": "at_radius",
"polarizability": "polarizability",
"dispersion_c6": "dispersion_c6",
"dispersion": "dispersion_c6",
Collaborator commented:

Why create duplicated versions of the dispersion_c6 property?

@enjyashraf18 (Author) replied Jul 25, 2025:

That was just a temporary fix: when extracting dispersion from the fields passed by the run modules, it wasn't being recognized because it exists as dispersion_c6 in elements_data.h5. So I mapped it as a quick fix to test other things. It will be handled properly after the modifications.
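One way to drop the duplicated entry once this is handled properly: keep a single canonical name in the mapping and resolve run-module spellings through a small alias table at lookup time. This is a suggestion, not the PR's stated plan.

```python
# Canonical property names appear in the mapping once; alternate
# spellings coming from the run modules go through an alias table.
PROPERTY_ALIASES = {"dispersion": "dispersion_c6"}

def canonical_property(name):
    # fall through unchanged for names that are already canonical
    return PROPERTY_ALIASES.get(name, name)

print(canonical_property("dispersion"))   # → dispersion_c6
print(canonical_property("at_radius"))    # → at_radius
```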

"""
# Ensure directories exist
makedirs(path.join(datapath, dataset.lower(), "db"), exist_ok=True)
Collaborator commented:

I think this line checking for a db directory is no longer necessary.
This folder is from the old database structure, where the compiled .msg files were stored under db. If the structure of the database has changed to a single HDF5 file containing all compiled data, why check for and create a db folder?

What should be checked is whether the folder defined by datapath contains an HDF5 file with the specified dataset, and if not, add it
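That check might look like this with PyTables; the file and group names follow this PR, while the helper name and the demo cache folder are hypothetical.

```python
import tables as pt
from os import makedirs, path

def ensure_dataset_in_file(datapath, dataset):
    # make sure the cache folder and the single HDF5 file exist ...
    makedirs(datapath, exist_ok=True)
    h5path = path.join(datapath, "datasets_data.h5")
    with pt.open_file(h5path, mode="a") as h5file:
        # ... and that the file contains a group for the requested dataset
        if "/Datasets" not in h5file:
            h5file.create_group("/", "Datasets")
        if f"/Datasets/{dataset.lower()}" not in h5file:
            h5file.create_group("/Datasets", dataset.lower())

ensure_dataset_in_file("atomdb_cache_demo", "slater")
```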

enjyashraf18 (Author) replied:

Yes, since it was part of the old structure, I left it as is for now while waiting for our discussion about the data paths (and also the remote URLs later), just to make sure we wouldn’t end up needing any part of it.


# print all fields

# fields = asdict(fields)
Collaborator commented:

Remove the comment lines L835-L844

from numbers import Integral

elements_hdf5_file = files("atomdb.data").joinpath("elements_data.h5")
datasets_hdf5_file = files("atomdb.data").joinpath("datasets_data.h5")
Collaborator commented:

The file datasets_data.h5 should be placed under atomdb/datasets instead of atomdb/data

As a note for the final stage of atomdb refactor, when setting the path where the dataset HDF5 is going to be stored, we should support custom paths; for example if the user wants to compile the dataset outside the atomdb package.
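For that custom-path note, resolution could fall back from an explicit argument to the packaged file. The function name and the ATOMDB_DATAPATH environment variable are illustrative, not part of this PR; the default branch follows the review's suggested atomdb/datasets location.

```python
import os
from importlib.resources import files

def resolve_datasets_file(datapath=None):
    # an explicit path wins, e.g. when compiling outside the atomdb package
    if datapath is not None:
        return os.path.join(datapath, "datasets_data.h5")
    # hypothetical environment-variable override
    env = os.environ.get("ATOMDB_DATAPATH")
    if env:
        return os.path.join(env, "datasets_data.h5")
    # default: the file shipped with the package, per the review note
    return str(files("atomdb.datasets").joinpath("datasets_data.h5"))
```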


# dump the data to the HDF5 file
dump(fields, dataset, elem, charge, mult, nexc)
# dump(fields, dataset, elem, charge, mult, nexc)
Collaborator commented:

How does the datasets HDF5 file get created if the dump step in L742 is commented out?
This compile_species function should compile and dump the data for a specified dataset, not return an instance of Species.
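The flow that comment asks for, sketched with the compile and dump steps injected as callables so the snippet is self-contained; the real function would call the dataset's run module and the module-level dump directly.

```python
def compile_species(elem, charge, mult, nexc, dataset, run, dump):
    """Compile one species and dump it; the HDF5 file is the product."""
    fields = run(elem, charge, mult, nexc)            # compile the raw data
    dump(fields, dataset, elem, charge, mult, nexc)   # write, don't return Species

# stand-in callables to show the call order
calls = []
compile_species("H", 0, 2, 0, "slater",
                run=lambda *a: {"elem": "H"},
                dump=lambda *a: calls.append("dumped"))
print(calls)  # → ['dumped']
```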

enjyashraf18 (Author) replied:

Returning Species and printing the fields was just for quick demos during our meetings (mainly for the test file). This part will be removed now, since we've moved past testing compile_species and the focus is back on loading and dumping the data properly.

- Included nbasis in fields
- Replaced hardcoded radial points with imported NPOINTS
- Placed datasets_data.h5 under datasets folder
- Added docstrings
@msricher (Collaborator) left a comment:

I think the main functionality here should be good. If you can walk me through it tomorrow, we can fix it up. Thanks for keeping it up while I was away!

# DATAPATH = os.path.abspath(DATAPATH._paths[0])


@dataclass
@msricher commented Aug 5, 2025:

You shouldn't have to use dataclass here, you should be using this instead: https://www.pytables.org/usersguide/libref/declarative_classes.html#isdescriptionclassdescr

(I think, at least, we can cover this tomorrow.)
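A minimal sketch of that suggestion, replacing the dataclass with a tables.IsDescription subclass; the column names are illustrative, with the value column borrowing the shape from the snippet earlier in this PR.

```python
import tables as pt

max_norba = 30  # placeholder value, as in the snippet above

class PropertyRow(pt.IsDescription):
    # schema declared the PyTables way instead of via @dataclass
    nbasis = pt.Int64Col()
    value = pt.Float64Col(shape=(max_norba,))

with pt.open_file("isdescription_demo.h5", mode="w") as h5file:
    table = h5file.create_table("/", "properties", PropertyRow)
    print(sorted(table.colnames))  # → ['nbasis', 'value']
```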


3 participants