-
-
Notifications
You must be signed in to change notification settings - Fork 4.7k
Add OpenMP Parallelization to FarthestPointSampling #6287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
for (std::size_t i = 0; i < size; ++i) | ||
{ | ||
if (distances_to_selected_points[i] == -1.0) | ||
continue; | ||
distances_to_selected_points[i] = std::min(distances_to_selected_points[i], geometry::distance((*input_)[toCloudIndex(i)], max_index_point)); | ||
if (distances_to_selected_points[i] > distances_to_selected_points[next_max_index]) | ||
next_max_index = i; | ||
if (distances_to_selected_points[i] > distances_to_selected_points[max_index]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not seem thread-safe: max_index
could be changed simultaneously by another thread, which could cause this condition to evaluate to false even though it should be true. You mentioned you have another version where each thread computes its own maximum first. That seems like a good approach, could you show that code? Alternatively, we could use such an approach: https://stackoverflow.com/a/39993717/6540043 (and another example of the same idea: https://stackoverflow.com/a/23957676/6540043 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks. I implemented the method from those links, they ended up performing better than either of the two versions had before, probably due to it now being almost completely lock-free. Had some weird timing issues at high thread counts but eventually discovered it was caused by hyper-threading on my laptop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Loaded pointcloud with 638087 points
1 thread:
10: 1 ms
100: 16 ms
1000: 167 ms
10000: 1656 ms
2 threads:
10: 0 ms
100: 8 ms
1000: 85 ms
10000: 881 ms
5 threads:
10: 0 ms
100: 4 ms
1000: 49 ms
10000: 507 ms
10 threads:
10: 0 ms
100: 3 ms
1000: 33 ms
10000: 327 ms
14 threads:
10: 0 ms
100: 2 ms
1000: 25 ms
10000: 254 ms
Alright, I think I've addressed everything |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks!
Hi, bumping on this in case there's anything else that needs to be done before the merge |
I am happy with the changes, not sure if @larshg has any further requests. However, I should mention that we cannot immediately merge this pull request because by adding the |
Alright sounds good. Thanks for the update |
setNumberOfThreads (unsigned int nr_threads) | ||
{ | ||
#ifdef _OPENMP | ||
nr_threads_ = nr_threads == 0 ? omp_get_num_procs() : nr_threads; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nr_threads_ = nr_threads == 0 ? omp_get_num_procs() : nr_threads; | |
nr_threads_ = nr_threads != 0 ? nr_threads : omp_get_num_procs(); |
To keep it more consistent with existing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed this one 😄
Maybe also change nr_threads_ to num_threads_ (used in most other classes) |
FarthestPointSampling is a relatively recent addition to the filters module and has not yet been upgraded with a parallel implementation. The inner loop which recalculates the maximum minimum distance involves a large number of (mostly) independent operations, making it a good candidate for parallelization. The outer loop is serial due to the nature of the filter and so cannot be parallelized.
This PR adds OpenMP-based parallelization to that inner loop, and introduces an API to allow the user to select the number of threads, with the default being the maximum CPU thread count. I also had another lock-free version where each thread would independently calculate it's own maximum and then all the results would be reduced at the end, but that performed about the same but had more complicated (i.e. bug-prone) control flow.
The following code was used to benchmark performance:
which gave this output, where the number on the left is the number of samples drawn at each trial, and the number on the right is the average execution time over 10 trials: