Concurrent hsload #257
base: master
Conversation
Thanks for the PR, this is very interesting. Have you seen any problems with HSDS getting overloaded (e.g. 503 responses) due to more requests being sent? It might be useful to have an optional command line flag to set the number of workers so users have a way to throttle this.

I'm actually working on a somewhat different approach that should also speed up hsload times. See this post: https://forum.hdfgroup.org/t/the-next-hsds-release-john-readey-on-call-the-doctor-4-8-25/13163/5. This work involves changes to hsds and h5pyd to reduce the number of requests that need to be sent (sort of a GraphQL style). As part of this work, I think a lot of the logic that is currently in apps/utillib.py will migrate to the h5json package. In any case, if it works it should lessen or eliminate the need for Python futures multi-threading.

As I mentioned in the post, it will be a while before this is ready. How would you feel about leaving your change in the branch for now? As the h5json work develops, we can determine how best to proceed.
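For illustration, a minimal sketch of what such a worker-count throttle might look like; the `--max-workers` flag name, the default value, and the `copy_item` stub are assumptions, not part of the actual hsload code:

```python
import argparse
import time
from concurrent.futures import ThreadPoolExecutor


def copy_item(name):
    # Stand-in for the per-object upload work hsload would perform.
    time.sleep(0.1)
    return name


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--max-workers", type=int, default=4,
                        help="number of concurrent upload threads (hypothetical flag)")
    args = parser.parse_args()

    items = [f"dset_{i}" for i in range(10)]
    # The executor caps concurrency at the requested worker count, which is
    # how a user could throttle the request rate sent to HSDS.
    with ThreadPoolExecutor(max_workers=args.max_workers) as executor:
        for name in executor.map(copy_item, items):
            print("copied", name)


if __name__ == "__main__":
    main()
```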
Yes, that sounds like an OK path forward, let's leave it in the branch. I have not seen any increase in 503s. We had quite a dire need to increase our HSDS write speed, so we investigated a couple of different paths, this being one of them. We have since tuned our hardware and load balancers, so our write speed is now tolerable, though not great. If we can establish that this way of parallelizing hsload is thread-safe, then we will use it until your new release is out. I found a statement about h5py being thread-safe, but I have found no such information about h5pyd.
OK, great - glad to hear you are not seeing any HSDS load issues. The only time I've seen threading issues is when using sockets rather than TCP connections (i.e. hsds/runall.sh --no-docker vs. hsds/runall.sh --no-docker-tcp). Otherwise there shouldn't be any problems. There's no thread lock like with h5py, so unless there's something in the requests package gumming up the works, more threads should help performance. Would you care to open an issue regarding hsload performance? That would be useful to track which approach(es) are most effective.
Great to hear there should be no issues with thread safety. Do you think there is a danger in excluding the "existence checks" with the --no-checks flag when uploading to a newly created domain?
Thanks for creating the issue. Sorry, but what's the --no-checks flag you are referring to?
In this PR I also added a --no-checks flag to hsload that skips the check for the existence of a dataset or group before creating it. Like this: Line 1691 in f07edab
I realize now that this probably will not work when appending to a domain, but for creating a completely new one, perhaps this is ok?
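For clarity, a rough sketch of the behaviour being described, assuming hypothetical names (`no_checks`, `fout`); this is not the code at the referenced line:

```python
def get_or_create_group(fout, name, no_checks=False):
    """Create a group, optionally skipping the existence check.

    fout is assumed to be an open h5pyd.File; the names are illustrative.
    """
    if not no_checks and name in fout:
        # Default path: one extra request to see whether the group exists.
        return fout[name]
    # With --no-checks (e.g. a freshly created, empty domain) we create
    # unconditionally, saving the lookup request per object.
    return fout.create_group(name)
```

With an existing domain the unconditional create would fail for objects that are already present, which matches the concern about appending above.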
I'm not completely sure everything is thread-safe.
I have run a few tests of HDF5 files with a lot of attributes and a couple of datasets and groups, with a follow-up hsdiff giving the same results as without these changes.
For an example HDF5 file stored in a fairly slow HSDS instance, this cut the upload time to about one third.
The speed gain is of course dependent on many things like file layout and hsds communication latency.
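As a minimal sketch of the futures-based write pattern being described, assuming each worker opens its own h5pyd.File handle; the domain name, dataset contents, and worker count are illustrative assumptions, not the PR's actual code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import numpy as np
import h5pyd


def write_dataset(domain, name, data):
    # Each worker opens its own handle so no h5pyd objects are shared
    # between threads.
    with h5pyd.File(domain, "a") as fout:
        fout.create_dataset(name, data=data)
    return name


def main():
    domain = "/home/user/concurrent_demo.h5"  # assumed HSDS domain path
    datasets = {f"dset_{i}": np.arange(1000) * i for i in range(8)}

    # Submit one write task per dataset; the worker count bounds how many
    # requests are in flight against HSDS at once.
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(write_dataset, domain, name, data)
                   for name, data in datasets.items()]
        for fut in as_completed(futures):
            print("wrote", fut.result())


if __name__ == "__main__":
    main()
```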