
Concurrent hsload #257


Open · wants to merge 9 commits into master

Conversation

@jonatantreijs commented Apr 22, 2025

I'm not completely sure everything is thread-safe.
I have run a few tests on HDF5 files with a lot of attributes and a couple of datasets and groups, followed up with hsdiff, and got the same results as without these changes.

  • Copies attributes, objects, and links in parallel (see the sketch below).
  • Adds a --no-checks flag to hsload that skips checking for the existence of a resource before creating it.
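
A minimal sketch of the parallel-copy idea, using Python's concurrent.futures (copy_attribute, src_obj, and des_obj are illustrative placeholders, not the actual apps/utillib.py API):

# Illustrative sketch only: copy the attributes of one object in parallel.
from concurrent.futures import ThreadPoolExecutor, as_completed

def copy_attribute(des_obj, name, src_obj):
    # copy a single attribute from the source h5py object to the
    # destination h5pyd object
    des_obj.attrs[name] = src_obj.attrs[name]

def copy_attributes_parallel(src_obj, des_obj, max_workers=8):
    # one task per attribute; wait for all of them before returning
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(copy_attribute, des_obj, name, src_obj)
                   for name in src_obj.attrs]
        for fut in as_completed(futures):
            fut.result()  # surfaces any exception raised in a worker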

For an example HDF5 file uploaded to a fairly slow HSDS instance, this cut the upload time to about one third.
The speed gain of course depends on many factors, such as the file layout and the HSDS communication latency.

@jonatantreijs marked this pull request as ready for review April 22, 2025 07:56
@jreadey (Member) commented Apr 24, 2025

Thanks for the PR, this is very interesting. Have you seen any problems with HSDS getting overloaded (e.g. 503 responses) due to more requests being sent? It might be useful to have an optional command-line flag to set the number of workers, so users have a way to throttle this.
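
One possible shape for such a throttle flag (purely illustrative; the --max-workers name and the argparse wiring below are assumptions, not hsload's actual option handling):

# Hypothetical sketch of a worker-throttle option; hsload's real argument
# parsing may look quite different.
import argparse
from concurrent.futures import ThreadPoolExecutor

parser = argparse.ArgumentParser(prog="hsload")
parser.add_argument("--max-workers", type=int, default=4,
                    help="number of threads used for parallel copies; "
                         "lower it to reduce the request load on HSDS")
args, _ = parser.parse_known_args()

# the parsed value is then handed to the thread pool used for the copies
pool = ThreadPoolExecutor(max_workers=args.max_workers)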

I'm actually working on a somewhat different approach that should also speed up hsload times. See this post: https://forum.hdfgroup.org/t/the-next-hsds-release-john-readey-on-call-the-doctor-4-8-25/13163/5. This work involves changes to hsds and h5pyd to reduce the number of requests that need to be sent (sort of a GraphQL style). As part of this work, I think a lot of the logic that is currently in apps/utillib.py will migrate to the h5json package. In any case, if it works it should lessen or eliminate the need for Python futures multi-threading.

As I mentioned in the post, it will be a while before this is ready. How would you feel about leaving your change in the branch for now? As the h5json work develops, we can determine how best to proceed.

@jonatantreijs (Author)

Yes, that sounds like an OK path forward, let's leave it in the branch.
I wrote this just before I saw your post about the upcoming changes, and I imagine most of this code will become obsolete with the new release.

I have not seen any increase in 503s.
I have tested this against an HSDS service with a POSIX backend with no problems.
I have also tested it against two different HSDS services with an S3 storage backend, one with 4 service nodes and one with 6. The only difference I observed was that the one with 6 nodes was slightly quicker.
I even ran two hsload commands simultaneously against the HSDS with S3 storage, which resulted in a slight increase in overall ingestion speed.
I have also run some post-checks using hsdiff to make sure the modified hsload does not miss any data compared to a non-modified one, and it seems to behave as expected. Interestingly, hsdiff reported some differences when comparing a newly uploaded HDF5 file with its domain, both with the modified and the non-modified hsload, but the same differences were reported in both cases.

We had quite a dire need to increase our HSDS write speed, so we investigated a couple of different paths, this being one of them. We have since fiddled with our hardware and load balancers, so our write speed is now tolerable, though not great.

If we can establish that this way of parallelizing hsload is thread-safe, then we will use it until your new release is out.

I found a statement about h5py being thread-safe, but I have found no such information about h5pyd.
What are your thoughts about using the h5pyd lib in this way: is it thread-safe, or are we risking race conditions and losing data?
Similarly, what are your thoughts about skipping the checks using the --no-checks flag?

@jreadey (Member) commented Apr 29, 2025

Ok, great - glad to hear you are not seeing any HSDS load issues.

The only time I've seen threading issues is when using sockets rather than TCP connections (i.e. hsds/runall.sh --no-docker vs. hsds/runall.sh --no-docker-tcp). Otherwise there shouldn't be any problems. There's no thread lock like with h5py, so unless there's something in the requests package gumming up the works, more threads should help performance.
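
If shared state ever did turn out to be a concern, one defensive pattern (an illustrative assumption, not something the discussion above requires) is to give each worker thread its own h5pyd.File handle via thread-local storage:

# Defensive sketch: one h5pyd.File handle per worker thread, so no handle
# (and no underlying HTTP session) is shared between threads.
import threading
import h5pyd

_local = threading.local()

def get_domain_handle(domain_path, endpoint=None):
    # open the target domain the first time a thread asks for it, then reuse it
    if not hasattr(_local, "fout"):
        _local.fout = h5pyd.File(domain_path, mode="a", endpoint=endpoint)
    return _local.fout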

Would you be willing to open an issue regarding hsload performance? That would be useful for tracking which approach(es) are most effective.

@jonatantreijs (Author)

Great to hear there should be no issues with thread safety.
I created an issue regarding the hsload performance: #258.

Do you think there is a danger in excluding the "existence checks" with the --no-checks flag when uploading to a newly created domain?

@jreadey (Member) commented Apr 30, 2025

Thanks for creating the issue.

Sorry, but what's the --no-checks flag you are referring to?

@jonatantreijs (Author)

In this PR I also added a --no-checks flag to hsload that has the effect of skipping the check for the existence of a dataset or group before creating it.

Like this:

if not ctx["no_checks"] and gobj.name in fout:

I realize now that this probably will not work when appending to an existing domain, but for creating a completely new one, perhaps this is OK?
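
For context, a sketch of how that condition might sit in the group-creation path (illustrative only; the surrounding code in apps/utillib.py and the exact ctx keys may differ):

# Illustrative sketch of the --no-checks behaviour around group creation.
def create_group(gobj, fout, ctx):
    # with --no-checks the existence lookup (one extra request per object)
    # is skipped, which only makes sense when loading into a fresh domain
    if not ctx["no_checks"] and gobj.name in fout:
        grp = fout[gobj.name]            # already present, reuse it
    else:
        grp = fout.create_group(gobj.name)
    return grp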
