Refactor Species Class #159
base: dev-gsoc
Conversation
max_norba = 30  # needs to be calculated
hdf5_file = files("atomdb.data").joinpath("datasets_data.h5")

SLATER_PROPERTY_CONFIGS = [
Include the number of basis functions (nbasis) in this list.
value = pt.Float64Col(shape=(max_norba,))

def create_properties_tables(hdf5_file, parent_folder, config, value):
Document the code. For this and the functions below, add docstrings explaining what the function does and describing its input parameters.
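For example, a minimal NumPy-style docstring for create_properties_tables could look like the sketch below; the parameter descriptions are assumptions based on the argument names, not confirmed behavior:

def create_properties_tables(hdf5_file, parent_folder, config, value):
    """Create the property tables for one species group in the HDF5 file.

    Parameters
    ----------
    hdf5_file : tables.File
        Open handle to the datasets HDF5 file.
    parent_folder : tables.Group
        Group under which the property tables are created.
    config : dict
        One property configuration entry (e.g. from SLATER_PROPERTY_CONFIGS).
    value : tables.Col
        Column descriptor used for the property values.
    """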
def create_tot_array(h5file, parent_folder, key, array_data):
    tot_gradient_array = h5file.create_carray(parent_folder, key, pt.Float64Atom(), shape=(10000,))
The value of the shape parameter in this line is hardcoded. Since it corresponds to the number of radial points where density properties get evaluated, import the variable NPOINTS from the run script.
def create_hdf5_file(fields, dataset, elem, charge, mult, nexc):
    dataset = dataset.lower()
    shape = 10000 * max_norba
Here also replace the hardcoded value for the number of radial points (NPOINTS=10000)
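A minimal sketch of both replacements, assuming NPOINTS lives in the Slater dataset's run script (the import path below is an assumption):

import tables as pt

from atomdb.datasets.slater.run import NPOINTS  # assumed location of the run script

max_norba = 30  # needs to be calculated (copied from the diff above)

def create_tot_array(h5file, parent_folder, key, array_data):
    # Radial grid length comes from the run script instead of a hardcoded 10000
    return h5file.create_carray(parent_folder, key, pt.Float64Atom(), shape=(NPOINTS,))

def create_hdf5_file(fields, dataset, elem, charge, mult, nexc):
    dataset = dataset.lower()
    shape = NPOINTS * max_norba  # replaces the hardcoded 10000 * max_norba
    ...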
atomdb/species.py
datapath : str, optional
    Path to the local AtomDB cache, by default the DEFAULT_DATAPATH variable value.

def dump(fields, dataset, elem, charge, mult, nexc):
For consistency with the signature of the other functions in this module, I suggest modifying the order of the input arguments to: dump(elem, charge, mult, nexc, dataset, fields).
Or, if it is possible to get the parameters elem, charge, mult and nexc from the parameter fields, modify the function to: dump(fields, dataset).
Yes, it's definitely possible. Instead of passing all the parameters again as in the old structure, we can extract them directly from fields to make it cleaner.
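A minimal sketch of that version, assuming fields exposes elem, charge, mult and nexc as attributes and that create_hdf5_file from the diff above does the actual writing:

def dump(fields, dataset):
    # Pull the species identifiers out of fields instead of passing them separately
    # (attribute names here are assumptions based on the old signature).
    elem = fields.elem
    charge = fields.charge
    mult = fields.mult
    nexc = fields.nexc
    create_hdf5_file(fields, dataset, elem, charge, mult, nexc)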
atomdb/species.py
"at_radius": "at_radius", | ||
"polarizability": "polarizability", | ||
"dispersion_c6": "dispersion_c6", | ||
"dispersion": "dispersion_c6", |
Why create duplicated versions of the dispersion_c6 property?
That was just a temporary fix. When extracting dispersion from the fields passed by the run modules, it wasn't being recognized because it exists as dispersion_c6 in elements_data.h5. So I mapped it as a quick fix to test other things. It will be handled properly after the modifications.
atomdb/species.py
""" | ||
# Ensure directories exist | ||
makedirs(path.join(datapath, dataset.lower(), "db"), exist_ok=True) |
I think this line checking for a db directory is no longer necessary. This folder is for the old database structure, where the compiled .msg files got stored under db. If the structure of the database has changed to a single HDF5 file containing all compiled data, why check and create a db folder?
What should be checked instead is whether the folder defined by datapath contains an HDF5 file with the specified dataset, and if not, add it.
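A minimal sketch of that check, assuming PyTables and one group per dataset under /Datasets (the group layout and helper name are assumptions):

import os

import tables as pt

def ensure_dataset_hdf5(datapath, dataset, filename="datasets_data.h5"):
    """Make sure the HDF5 file under datapath exists and contains a group
    for the given dataset under /Datasets, creating both if needed."""
    os.makedirs(datapath, exist_ok=True)
    hdf5_path = os.path.join(datapath, filename)
    # mode="a" creates the file if it is missing and opens it otherwise
    with pt.open_file(hdf5_path, mode="a") as h5file:
        if "/Datasets" not in h5file:
            h5file.create_group("/", "Datasets")
        if f"/Datasets/{dataset.lower()}" not in h5file:
            h5file.create_group("/Datasets", dataset.lower())
    return hdf5_path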
Yes, since it was part of the old structure, I left it as is for now while waiting for our discussion about the data paths (and also the remote URLs later), just to make sure we wouldn’t end up needing any part of it.
atomdb/species.py
# print all fields

# fields = asdict(fields)
Remove the comment lines L835-L844
atomdb/species.py
from numbers import Integral

elements_hdf5_file = files("atomdb.data").joinpath("elements_data.h5")
datasets_hdf5_file = files("atomdb.data").joinpath("datasets_data.h5")
The file datasets_data.h5 should be placed under atomdb/datasets instead of atomdb/data.
As a note for the final stage of the atomdb refactor: when setting the path where the dataset HDF5 file is going to be stored, we should support custom paths, for example if the user wants to compile the dataset outside the atomdb package.
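A minimal sketch of such path resolution; the helper name and the ATOMDB_DATAPATH environment variable are hypothetical, not existing AtomDB API:

import os
from importlib.resources import files

def resolve_datasets_file(custom_path=None):
    """Return the location of the datasets HDF5 file, preferring a
    user-supplied path over the packaged default."""
    if custom_path is not None:
        return custom_path
    # Hypothetical environment variable for compiling outside the package
    env_path = os.environ.get("ATOMDB_DATAPATH")
    if env_path:
        return os.path.join(env_path, "datasets_data.h5")
    # Default: the file shipped under atomdb/datasets, as suggested above
    return files("atomdb.datasets").joinpath("datasets_data.h5")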
atomdb/species.py
# dump the data to the HDF5 file
dump(fields, dataset, elem, charge, mult, nexc)
# dump(fields, dataset, elem, charge, mult, nexc)
How does the datasets HDF5 file get created if the dump step in L742 is commented out?
This compile_species function should compile and dump the data for a specified dataset, not return an instance of Species.
Returning species and printing the fields was just for quick demos during our meetings (mainly for the test file). But this part will be removed now, since we've moved past the testing of compile_species and the focus is back on loading and dumping the data properly.
- Included nbasis in fields
- Replaced hardcoded radial points with imported NPOINTS
- Placed datasets_data.h5 under datasets folder
- Added docstrings
I think the main functionality here should be good. If you can walk me through it tomorrow, we can fix it up. Thanks for keeping it up while I was away!
# DATAPATH = os.path.abspath(DATAPATH._paths[0])

@dataclass
You shouldn't have to use a dataclass here; you should be using this instead: https://www.pytables.org/usersguide/libref/declarative_classes.html#isdescriptionclassdescr
(I think, at least; we can cover this tomorrow.)
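For reference, a minimal sketch of an IsDescription-based row class that could replace the dataclass; the column names and shapes are illustrative assumptions, not the actual schema:

import tables as pt

# Sketch of a PyTables row description for per-species scalar data.
class SpeciesRow(pt.IsDescription):
    elem = pt.StringCol(3)      # element symbol
    charge = pt.Int32Col()      # net charge
    mult = pt.Int32Col()        # spin multiplicity
    nexc = pt.Int32Col()        # excitation level
    energy = pt.Float64Col()    # total energy

# Used when creating a table, e.g.:
# h5file.create_table(species_group, "scalars", SpeciesRow)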
To address Issue #148, the handling of atomic species data in AtomDB has been refactored to replace the previous MessagePack-based storage system with a structured HDF5 format.
Modifications

The compiled data is stored in the datasets_data.h5 file under the /Datasets group, with hierarchical organization (an illustrative layout is sketched at the end of this description).

HDF5 Backend Infrastructure

Created h5file_creator.py as the core module for generating the HDF5 structure. It creates organized folders for any atomic species with defined properties in the datasets_data.h5 file, replacing the old MessagePack approach.

Class Structure

The SpeciesData dataclass has been removed and replaced with a dynamic DefinitionClass imported per dataset.

The current modifications focus on the Slater dataset, and all MessagePack files for this dataset have been successfully migrated to datasets_data.h5.
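As an illustration of the hierarchical organization mentioned above, a short PyTables snippet that walks the compiled file; the dataset and species group names shown in the comment are hypothetical, not the confirmed schema:

import tables as pt

# Print the group hierarchy of the compiled datasets file, e.g.
#   /Datasets
#   /Datasets/slater
#   /Datasets/slater/001_H_q+0_mult2   <- species key here is hypothetical
with pt.open_file("datasets_data.h5", mode="r") as h5file:
    for group in h5file.walk_groups("/Datasets"):
        print(group._v_pathname)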