
Conversation

@enjyashraf18 (Collaborator) commented Jul 23, 2025

To address Issue #148, the handling of atomic species data in AtomDB has been refactored to replace the previous MessagePack-based storage system with a structured HDF5 format.

Modifications

  1. Atomic datasets are now stored in the datasets_data.h5 file under the /Datasets group, with hierarchical organization. For example:
/Datasets
├── /slater
│   ├── /H
│   │   └── /H_000_002_000
│   ├── /C
│   │   └── /C_000_003_000
├── /gaussian
├── /hci
├── /nist
├── /numeric
├── /uhf_augccpvdz
  2. Enhanced Query Capabilities
  • Supports efficient filtering by atomic properties.
  • Replaces the old search techniques with direct HDF5 path queries.
  3. HDF5 Backend Infrastructure
    Created h5file_creator.py as the core module for generating the HDF5 structure. It creates organized groups for any atomic species with defined properties in the datasets_data.h5 file, replacing the old MessagePack approach.

  4. Class Structure
    The SpeciesData dataclass has been removed and replaced with a dynamic DefinitionClass imported per dataset.

The current modifications focus on the Slater dataset, and all MessagePack files for this dataset have been successfully migrated to datasets_data.h5.
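The hierarchy above can be sketched with PyTables, which the code snippets in this PR already use as `pt`. The helper name `ensure_species_group` and the demo file name are hypothetical; the group and species names come from the tree in this description.

```python
import tables as pt

# Hypothetical helper: create the nested /Datasets/<dataset>/<elem>/<species>
# groups shown above, skipping any that already exist.
def ensure_species_group(h5path, dataset, elem, species_id):
    with pt.open_file(h5path, mode="a") as h5file:
        if "/Datasets" not in h5file:
            h5file.create_group("/", "Datasets")
        if f"/Datasets/{dataset}" not in h5file:
            h5file.create_group("/Datasets", dataset)
        if f"/Datasets/{dataset}/{elem}" not in h5file:
            h5file.create_group(f"/Datasets/{dataset}", elem)
        if f"/Datasets/{dataset}/{elem}/{species_id}" not in h5file:
            h5file.create_group(f"/Datasets/{dataset}/{elem}", species_id)

ensure_species_group("datasets_demo.h5", "slater", "H", "H_000_002_000")
```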

max_norba = 30 # needs to be calculated
hdf5_file = files("atomdb.data").joinpath("datasets_data.h5")

SLATER_PROPERTY_CONFIGS = [
Collaborator commented:

Include the number of basis functions (nbasis) in this list

value = pt.Float64Col(shape=(max_norba,))


def create_properties_tables(hdf5_file, parent_folder, config, value):
Collaborator commented:

Document the code: for this and the functions below, add docstrings explaining what the function does and describing its input parameters
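A possible shape for the requested docstring, in numpydoc style; the parameter descriptions here are inferred from the argument names and are assumptions to be corrected:

```python
def create_properties_tables(hdf5_file, parent_folder, config, value):
    """Create the per-property tables under a species group.

    Parameters
    ----------
    hdf5_file : tables.File
        Open handle to the datasets HDF5 file.
    parent_folder : tables.Group
        Species group under which the property tables are created.
    config : list
        Property configuration entries (e.g. SLATER_PROPERTY_CONFIGS).
    value : tables.Col
        Column descriptor used for the property values.
    """
```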



def create_tot_array(h5file, parent_folder, key, array_data):
    tot_gradient_array = h5file.create_carray(parent_folder, key, pt.Float64Atom(), shape=(10000,))
Collaborator commented:

The value of the shape parameter in this line is hardcoded.
Since it corresponds to the number of radial points where density properties get evaluated, import the variable NPOINTS from the run script.

https://github.com/enjyashraf18/AtomDB/blob/9be8ea95418584503b818b9cc0360e41a2c0f427/atomdb/datasets/slater/run.py#L37
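Following that suggestion, the hardcoded shape could be replaced like this. The import is commented out and a stand-in constant with the current value is used so the sketch runs on its own; the real value should come from the linked run script.

```python
import tables as pt

# from atomdb.datasets.slater.run import NPOINTS   # as the review suggests
NPOINTS = 10000  # stand-in with the current value so this sketch is runnable

def create_tot_array(h5file, parent_folder, key, array_data):
    # the number of radial grid points is now defined in one place
    return h5file.create_carray(parent_folder, key, pt.Float64Atom(), shape=(NPOINTS,))
```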


def create_hdf5_file(fields, dataset, elem, charge, mult, nexc):
    dataset = dataset.lower()
    shape = 10000 * max_norba
Collaborator commented:

Here also replace the hardcoded value for the number of radial points (NPOINTS=10000)

datapath : str, optional
Path to the local AtomDB cache, by default the value of the DEFAULT_DATAPATH variable.
def dump(fields, dataset, elem, charge, mult, nexc):
Collaborator commented:

For consistency with the signature of the other functions in this module, I suggest modifying the order of the input arguments to:
dump(elem, charge, mult, nexc, dataset, fields)

Or, if it is possible to get the parameters elem, charge, mult and nexc from the parameter fields, modify the function to:
dump(fields, dataset)

enjyashraf18 (Author) replied:

Yes, it's definitely possible. Instead of passing all the parameters again like in the old structure, we can extract them directly from the fields, which makes it cleaner.
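A sketch of the `dump(fields, dataset)` variant discussed here. It assumes `fields` exposes elem/charge/mult/nexc as attributes; the attribute names are taken from the old argument list and the zero-padded species ID format from the tree in the PR description, both of which are assumptions.

```python
from types import SimpleNamespace

def dump(fields, dataset):
    # pull the identifying parameters straight from fields
    elem, charge, mult, nexc = fields.elem, fields.charge, fields.mult, fields.nexc
    species_id = f"{elem}_{charge:03d}_{mult:03d}_{nexc:03d}"
    target = f"/Datasets/{dataset.lower()}/{elem}/{species_id}"
    # ... write the tables under `target` as before ...
    return target

demo = SimpleNamespace(elem="H", charge=0, mult=2, nexc=0)
print(dump(demo, "slater"))  # → /Datasets/slater/H/H_000_002_000
```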

"at_radius": "at_radius",
"polarizability": "polarizability",
"dispersion_c6": "dispersion_c6",
"dispersion": "dispersion_c6",
Collaborator commented:

Why create duplicated versions of the dispersion_c6 property?

@enjyashraf18 (Author) replied Jul 25, 2025:

That was just a temporary fix: when extracting dispersion from the fields passed by the run modules, it wasn't being recognized because it exists as dispersion_c6 in elements_data.h5. So I mapped it as a quick fix to test other things. It will be handled properly after the modifications.
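One way to drop the duplicated entry once this is handled properly: keep a single canonical name in the mapping and resolve run-module spellings through a small alias table at lookup time. This is a suggestion, not the PR's stated plan.

```python
# Canonical property names appear in the mapping once; alternate
# spellings coming from the run modules go through an alias table.
PROPERTY_ALIASES = {"dispersion": "dispersion_c6"}

def canonical_property(name):
    # fall through unchanged for names that are already canonical
    return PROPERTY_ALIASES.get(name, name)

print(canonical_property("dispersion"))   # → dispersion_c6
print(canonical_property("at_radius"))    # → at_radius
```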

"""
# Ensure directories exist
makedirs(path.join(datapath, dataset.lower(), "db"), exist_ok=True)
Collaborator commented:

I think this line checking for a db directory is no longer necessary.
This folder is from the old database structure, where the compiled .msg files were stored under db. If the structure of the database has changed to a single HDF5 file containing all compiled data, why check for and create a db folder?

What should be checked is whether the folder defined by datapath contains an HDF5 file with the specified dataset, and if not, add it
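That check might look like this with PyTables; the file and group names follow this PR, while the helper name and the demo cache folder are hypothetical.

```python
import tables as pt
from os import makedirs, path

def ensure_dataset_in_file(datapath, dataset):
    # make sure the cache folder and the single HDF5 file exist ...
    makedirs(datapath, exist_ok=True)
    h5path = path.join(datapath, "datasets_data.h5")
    with pt.open_file(h5path, mode="a") as h5file:
        # ... and that the file contains a group for the requested dataset
        if "/Datasets" not in h5file:
            h5file.create_group("/", "Datasets")
        if f"/Datasets/{dataset.lower()}" not in h5file:
            h5file.create_group("/Datasets", dataset.lower())

ensure_dataset_in_file("atomdb_cache_demo", "slater")
```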

enjyashraf18 (Author) replied:

Yes, since it was part of the old structure, I left it as is for now while waiting for our discussion about the data paths (and also the remote URLs later), just to make sure we wouldn’t end up needing any part of it.


# print all fields

# fields = asdict(fields)
Collaborator commented:

Remove the comment lines L835-L844

from numbers import Integral

elements_hdf5_file = files("atomdb.data").joinpath("elements_data.h5")
datasets_hdf5_file = files("atomdb.data").joinpath("datasets_data.h5")
Collaborator commented:

The file datasets_data.h5 should be placed under atomdb/datasets instead of atomdb/data

As a note for the final stage of atomdb refactor, when setting the path where the dataset HDF5 is going to be stored, we should support custom paths; for example if the user wants to compile the dataset outside the atomdb package.
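For that custom-path note, resolution could fall back from an explicit argument to the packaged file. The function name and the ATOMDB_DATAPATH environment variable are illustrative, not part of this PR; the default branch follows the review's suggested atomdb/datasets location.

```python
import os
from importlib.resources import files

def resolve_datasets_file(datapath=None):
    # an explicit path wins, e.g. when compiling outside the atomdb package
    if datapath is not None:
        return os.path.join(datapath, "datasets_data.h5")
    # hypothetical environment-variable override
    env = os.environ.get("ATOMDB_DATAPATH")
    if env:
        return os.path.join(env, "datasets_data.h5")
    # default: the file shipped with the package, per the review note
    return str(files("atomdb.datasets").joinpath("datasets_data.h5"))
```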


# dump the data to the HDF5 file
dump(fields, dataset, elem, charge, mult, nexc)
# dump(fields, dataset, elem, charge, mult, nexc)
Collaborator commented:

How does the datasets HDF5 file get created if the dump step in L742 is commented out?
This compile_species function should compile and dump the data for a specified dataset, not return an instance of Species.
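The flow that comment asks for, sketched with the compile and dump steps injected as callables so the snippet is self-contained; the real function would call the dataset's run module and the module-level dump directly.

```python
def compile_species(elem, charge, mult, nexc, dataset, run, dump):
    """Compile one species and dump it; the HDF5 file is the product."""
    fields = run(elem, charge, mult, nexc)            # compile the raw data
    dump(fields, dataset, elem, charge, mult, nexc)   # write, don't return Species

# stand-in callables to show the call order
calls = []
compile_species("H", 0, 2, 0, "slater",
                run=lambda *a: {"elem": "H"},
                dump=lambda *a: calls.append("dumped"))
print(calls)  # → ['dumped']
```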

enjyashraf18 (Author) replied:

Returning Species and printing the fields was just for quick demos during our meetings (mainly for the test file). This part will be removed now, since we've moved past testing compile_species and the focus is back on loading and dumping the data properly.

- Included nbasis in fields
- Replaced hardcoded radial points with imported NPOINTS
- Placed datasets_data.h5 under datasets folder
- Added docstrings
@msricher (Collaborator) left a comment:

I think the main functionality here should be good. If you can walk me through it tomorrow, we can fix it up. Thanks for keeping it up while I was away!

# DATAPATH = os.path.abspath(DATAPATH._paths[0])


@dataclass
@msricher commented Aug 5, 2025:

You shouldn't have to use dataclass here, you should be using this instead: https://www.pytables.org/usersguide/libref/declarative_classes.html#isdescriptionclassdescr

(I think, at least, we can cover this tomorrow.)
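A minimal sketch of that suggestion, replacing the dataclass with a tables.IsDescription subclass; the column names are illustrative, with the value column borrowing the shape from the snippet earlier in this PR.

```python
import tables as pt

max_norba = 30  # placeholder value, as in the snippet above

class PropertyRow(pt.IsDescription):
    # schema declared the PyTables way instead of via @dataclass
    nbasis = pt.Int64Col()
    value = pt.Float64Col(shape=(max_norba,))

with pt.open_file("isdescription_demo.h5", mode="w") as h5file:
    table = h5file.create_table("/", "properties", PropertyRow)
    print(sorted(table.colnames))  # → ['nbasis', 'value']
```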


3 participants