
Concurrent hsload #257


Open · wants to merge 9 commits into master

Conversation

@jonatantreijs commented Apr 22, 2025

I'm not completely sure everything is thread-safe.
I have run a few tests on HDF5 files with a lot of attributes and a couple of datasets and groups, followed up with hsdiff, and got the same results as without these changes.

  • Copies attributes, objects, and links in parallel (see the sketch below).
  • Adds a --no-checks flag to hsload that skips checking for the existence of a resource before creating it.
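
A minimal sketch of the parallel-copy idea, using Python's concurrent.futures (copy_attribute, src_obj, and des_obj are illustrative placeholders, not the actual apps/utillib.py API):

# Illustrative sketch only: copy the attributes of one object in parallel.
from concurrent.futures import ThreadPoolExecutor, as_completed

def copy_attribute(des_obj, name, src_obj):
    # copy a single attribute from the source h5py object to the
    # destination h5pyd object
    des_obj.attrs[name] = src_obj.attrs[name]

def copy_attributes_parallel(src_obj, des_obj, max_workers=8):
    # one task per attribute; wait for all of them before returning
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(copy_attribute, des_obj, name, src_obj)
                   for name in src_obj.attrs]
        for fut in as_completed(futures):
            fut.result()  # surfaces any exception raised in a worker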

For an example HDF5 file uploaded to a fairly slow HSDS instance, this cut the upload time to about one third.
The speed gain of course depends on many factors, such as the file layout and the HSDS communication latency.

@jonatantreijs marked this pull request as ready for review April 22, 2025 07:56
@jreadey (Member) commented Apr 24, 2025

Thanks for the PR, this is very interesting. Have you seen any problems with HSDS getting overloaded (e.g. 503 responses) due to more requests being sent? It might be useful to have an optional command-line flag to set the number of workers, so users have a way to throttle this.
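
One possible shape for such a throttle flag (purely illustrative; the --max-workers name and the argparse wiring below are assumptions, not hsload's actual option handling):

# Hypothetical sketch of a worker-throttle option; hsload's real argument
# parsing may look quite different.
import argparse
from concurrent.futures import ThreadPoolExecutor

parser = argparse.ArgumentParser(prog="hsload")
parser.add_argument("--max-workers", type=int, default=4,
                    help="number of threads used for parallel copies; "
                         "lower it to reduce the request load on HSDS")
args, _ = parser.parse_known_args()

# the parsed value is then handed to the thread pool used for the copies
pool = ThreadPoolExecutor(max_workers=args.max_workers)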

I'm actually working on a somewhat different approach that should also speed up hsload times. See this post: https://forum.hdfgroup.org/t/the-next-hsds-release-john-readey-on-call-the-doctor-4-8-25/13163/5. This work involves changes to hsds and h5pyd to reduce the number of requests that need to be sent (sort of a GraphQL style). As part of this work, I think a lot of the logic that is currently in apps/utillib.py will migrate to the h5json package. In any case, if it works it should lessen or eliminate the need for Python futures multi-threading.

As I mentioned in the post, it will be a while before this is ready. How would you feel about leaving your change in the branch for now? As the h5json work develops, we can determine how best to proceed.

@jonatantreijs (Author)

Yes, that sounds like an OK path forward, let's leave it in the branch.
I wrote this just before I saw your post about the upcoming changes, and I imagine most of this code will become obsolete with the new release.

I have not seen any increase in 503s.
I have tested this against an HSDS service with a POSIX backend with no problems.
I have also tested it against two different HSDS services with an S3 storage backend, one with 4 service nodes and one with 6. The only difference I observed was that the one with 6 nodes was slightly quicker.
I even ran two hsload commands simultaneously against the HSDS with S3 storage, which resulted in a slight increase in overall ingestion speed.
I have also run some post-checks using hsdiff to make sure the modified hsload does not miss any data compared to a non-modified one, and it seems to behave as expected. Interestingly, hsdiff reported some differences when comparing a newly uploaded HDF5 file with its domain, both with the modified and the non-modified hsload, but the same differences were reported in both cases.

We had quite a dire need to increase our HSDS write speed, so we investigated a couple of different paths, this being one of them. We have since fiddled with our hardware and load balancers, so our write speed is now tolerable, though not great.

If we can establish that this way of parallelizing hsload is thread-safe, then we will use it until your new release is out.

I found a statement about h5py being thread-safe, but I have found no such information about h5pyd.
What are your thoughts about using the h5pyd lib in this way: is it thread-safe, or are we risking race conditions and losing data?
Similarly, what are your thoughts about skipping the checks using the --no-checks flag?

@jreadey (Member) commented Apr 29, 2025

Ok, great - glad to hear you are not seeing any HSDS load issues.

The only time I've seen threading issues is when using sockets rather than TCP connections (i.e. hsds/runall.sh --no-docker vs. hsds/runall.sh --no-docker-tcp). Otherwise there shouldn't be any problems. There's no thread lock like with h5py, so unless there's something in the requests package gumming up the works, more threads should help performance.
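
If shared state ever did turn out to be a concern, one defensive pattern (an illustrative assumption, not something the discussion above requires) is to give each worker thread its own h5pyd.File handle via thread-local storage:

# Defensive sketch: one h5pyd.File handle per worker thread, so no handle
# (and no underlying HTTP session) is shared between threads.
import threading
import h5pyd

_local = threading.local()

def get_domain_handle(domain_path, endpoint=None):
    # open the target domain the first time a thread asks for it, then reuse it
    if not hasattr(_local, "fout"):
        _local.fout = h5pyd.File(domain_path, mode="a", endpoint=endpoint)
    return _local.fout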

Would you be willing to open an issue regarding hsload performance? That would be useful for tracking which approach(es) are most effective.

@jonatantreijs (Author)

Great to hear there should be no issues with thread safety.
I created an issue regarding the hsload performance: #258.

Do you think there is a danger in excluding the "existence checks" with the --no-checks flag when uploading to a newly created domain?

@jreadey (Member) commented Apr 30, 2025

Thanks for creating the issue.

Sorry, but what's the --no-checks flag you are referring to?

@jonatantreijs (Author)

In this PR I also added a --no-checks flag to hsload that has the effect of skipping the check for the existence of a dataset or group before creating it.

Like this:

if not ctx["no_checks"] and gobj.name in fout:

I realize now that this probably will not work when appending to an existing domain, but for creating a completely new one, perhaps this is OK?
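
For context, a sketch of how that condition might sit in the group-creation path (illustrative only; the surrounding code in apps/utillib.py and the exact ctx keys may differ):

# Illustrative sketch of the --no-checks behaviour around group creation.
def create_group(gobj, fout, ctx):
    # with --no-checks the existence lookup (one extra request per object)
    # is skipped, which only makes sense when loading into a fresh domain
    if not ctx["no_checks"] and gobj.name in fout:
        grp = fout[gobj.name]            # already present, reuse it
    else:
        grp = fout.create_group(gobj.name)
    return grp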
