
Conversation

@ognian- (Collaborator) commented Jul 21, 2025

This set of changes updates the TBB dependency to the latest stable version (oneTBB 2022.2.0). It also introduces the Move_Found_Callback from Larch.

Do not merge until all executables are properly tested!

@AngieHinrichs (Contributor)

Hi @ognian-, I have a test case for you: I'm getting a SEGV when trying to add one sample to the public SARS-CoV-2 tree, which has 8,379,881 samples/leaves. Here's how to reproduce the error (path to usher-sampled may need to be tweaked):

curl -O https://hgwdev.gi.ucsc.edu/~angie/PP765594.1.mpl
curl -O https://hgwdev.gi.ucsc.edu/~angie/NC_045512.2.fa
curl -O https://hgdownload.gi.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/2025/07/22/public-2025-07-22.all.masked.pb.gz

gdb --args ~/github/matsengrp/usher/build/usher-sampled \
    -T 16 \
    -i public-2025-07-22.all.masked.pb.gz \
    --diff PP765594.1.mpl \
    --ref NC_045512.2.fa \
    -o test.pb.gz

The -T 16 is just to be polite to my coworkers on a shared server; I still get the SEGV with other reasonable values like 8 or 50. (With unreasonably low values like -T 2 or 5, there is a different error in tbb/arena_slot.cpp.) Anyway, when run, usher-sampled seems to initialize as usual, and its stderr output gets this far:

Sample start idx 10168852, end index 10168853
main dfs size 5858841
===================Node ID so far 10168853 ===================output size 29904, nuc size 29904, pos_mutated size 29904
Found 1 missing samples.

Adding missing samples to the tree.
sorting done

Then it gets a SEGV with a stack that's over 100,000 calls deep. The stack starts like this:

#0  0x00005555555c5fc3 in std::__new_allocator<std::unordered_map<int, unsigned char, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, unsigned char> > > >::allocate (
    this=0x7fffe510f020, __n=1) at /hive/users/angie/miniconda3/envs/usher/lib/gcc/x86_64-conda-linux-gnu/13.3.0/include/c++/bits/new_allocator.h:126
#1  std::allocator_traits<std::allocator<std::unordered_map<int, unsigned char, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, unsigned char> > > > >::allocate (
    __n=1, __a=...) at /hive/users/angie/miniconda3/envs/usher/lib/gcc/x86_64-conda-linux-gnu/13.3.0/include/c++/bits/alloc_traits.h:482
#2  std::_Vector_base<std::unordered_map<int, unsigned char, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, unsigned char> > >, std::allocator<std::unordered_map<int, unsigned char, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, unsigned char> > > > >::_M_allocate (__n=1, this=0x7fffe510f020)
    at /hive/users/angie/miniconda3/envs/usher/lib/gcc/x86_64-conda-linux-gnu/13.3.0/include/c++/bits/stl_vector.h:381
#3  std::_Vector_base<std::unordered_map<int, unsigned char, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, unsigned char> > >, std::allocator<std::unordered_map<int, unsigned char, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, unsigned char> > > > >::_M_create_storage (__n=1, this=0x7fffe510f020)
    at /hive/users/angie/miniconda3/envs/usher/lib/gcc/x86_64-conda-linux-gnu/13.3.0/include/c++/bits/stl_vector.h:398
#4  std::_Vector_base<std::unordered_map<int, unsigned char, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, unsigned char> > >, std::allocator<std::unordered_map<int, unsigned char, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, unsigned char> > > > >::_Vector_base (__a=..., __n=1, this=0x7fffe510f020)
    at /hive/users/angie/miniconda3/envs/usher/lib/gcc/x86_64-conda-linux-gnu/13.3.0/include/c++/bits/stl_vector.h:335
#5  std::vector<std::unordered_map<int, unsigned char, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, unsigned char> > >, std::allocator<std::unordered_map<int, unsigned char, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, unsigned char> > > > >::vector (__a=..., __n=1, this=0x7fffe510f020)
    at /hive/users/angie/miniconda3/envs/usher/lib/gcc/x86_64-conda-linux-gnu/13.3.0/include/c++/bits/stl_vector.h:557
#6  assign_descendant_possible_muts_recursive (root=0x5556dc450980, output=std::unordered_map with 0 elements, tg=...)
    at /cluster/home/angie/github/matsengrp/usher/src/usher-sampled/sampler.cpp:49
#7  0x00005555555c65b4 in operator() (__closure=0x7fff9b839860) at /cluster/home/angie/github/matsengrp/usher/src/usher-sampled/sampler.cpp:61
#8  tbb::detail::d2::(anonymous namespace)::task_ptr_or_nullptr<const assign_descendant_possible_muts_recursive(Mutation_Annotated_Tree::Node*, std::unordered_map<int, unsigned char>&, tbb::detail::d2::task_group&)::<lambda()>&> (f=...) at /cluster/home/angie/github/matsengrp/usher/build/_deps/onetbb-src/src/tbb/../../include/tbb/../oneapi/tbb/task_group.h:131
#9  tbb::detail::d2::function_task<assign_descendant_possible_muts_recursive(Mutation_Annotated_Tree::Node*, std::unordered_map<int, unsigned char>&, tbb::detail::d2::task_group&)::<lambda()> >::execute(tbb::detail::d1::execution_data &) (this=0x7fff9b839800, ed=...)
    at /cluster/home/angie/github/matsengrp/usher/build/_deps/onetbb-src/src/tbb/../../include/tbb/../oneapi/tbb/task_group.h:89
#10 0x00007ffff7d03b0b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7fff9b839800, this=<optimized out>)
    at build/_deps/onetbb-src/src/tbb/task_dispatcher.h:334
#11 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=0x0, this=<optimized out>) at build/_deps/onetbb-src/src/tbb/task_dispatcher.h:470
#12 tbb::detail::r1::task_dispatcher::execute_and_wait (t=t@entry=0x0, wait_ctx=..., w_ctx=...) at build/_deps/onetbb-src/src/tbb/task_dispatcher.cpp:168
#13 0x00007ffff7d04151 in tbb::detail::r1::wait (wait_ctx=..., w_ctx=...) at build/_deps/onetbb-src/src/tbb/task_dispatcher.cpp:126
#14 0x00005555555c6371 in tbb::detail::d1::wait (ctx=..., wait_ctx=...)
    at /cluster/home/angie/github/matsengrp/usher/build/_deps/onetbb-src/src/tbb/../../include/tbb/../oneapi/tbb/detail/_task.h:267
#15 tbb::detail::d2::task_group_base::wait()::{lambda()#1}::operator()() const (__closure=<optimized out>)
    at /cluster/home/angie/github/matsengrp/usher/build/_deps/onetbb-src/src/tbb/../../include/tbb/../oneapi/tbb/task_group.h:554
#16 tbb::detail::d0::try_call_proxy<tbb::detail::d2::task_group_base::wait()::{lambda()#1}>::on_completion<tbb::detail::d2::task_group_base::wait()::{lambda()#2}>(tbb::detail::d2::task_group_base::wait()::{lambda()#2}) (on_completion_body=..., this=<optimized out>)
    at /cluster/home/angie/github/matsengrp/usher/build/_deps/onetbb-src/src/tbb/../../include/tbb/../oneapi/tbb/detail/_template_helpers.h:234
#17 tbb::detail::d2::task_group_base::wait (this=0x7fffffffbaf0) at /cluster/home/angie/github/matsengrp/usher/build/_deps/onetbb-src/src/tbb/../../include/tbb/../oneapi/tbb/task_group.h:555

Then the pattern of #6 - #17 repeats until almost the bottom of the very deep stack, which ends like this:

#112218 assign_descendant_possible_muts_recursive (root=0x55565e3f3720, output=std::unordered_map with 0 elements, tg=...)
    at /cluster/home/angie/github/matsengrp/usher/src/usher-sampled/sampler.cpp:66
#112219 0x00005555555c65b4 in operator() (__closure=0x7fffa0490760) at /cluster/home/angie/github/matsengrp/usher/src/usher-sampled/sampler.cpp:61
#112220 tbb::detail::d2::(anonymous namespace)::task_ptr_or_nullptr<const assign_descendant_possible_muts_recursive(Mutation_Annotated_Tree::Node*, std::unordered_map<int, unsigned char>&, tbb::detail::d2::task_group&)::<lambda()>&> (f=...) at /cluster/home/angie/github/matsengrp/usher/build/_deps/onetbb-src/src/tbb/../../include/tbb/../oneapi/tbb/task_group.h:131
#112221 tbb::detail::d2::function_task<assign_descendant_possible_muts_recursive(Mutation_Annotated_Tree::Node*, std::unordered_map<int, unsigned char>&, tbb::detail::d2::task_group&)::<lambda()> >::execute(tbb::detail::d1::execution_data &) (this=0x7fffa0490700, ed=...)
    at /cluster/home/angie/github/matsengrp/usher/build/_deps/onetbb-src/src/tbb/../../include/tbb/../oneapi/tbb/task_group.h:89
#112222 0x00007ffff7d069aa in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=<optimized out>, waiter=..., this=0x7fffec6b8980)
    at build/_deps/onetbb-src/src/tbb/task_dispatcher.h:334
#112223 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7fffec6b8980)
    at build/_deps/onetbb-src/src/tbb/task_dispatcher.h:470
#112224 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at build/_deps/onetbb-src/src/tbb/arena.cpp:215
#112225 tbb::detail::r1::thread_dispatcher_client::process (td=..., this=<optimized out>) at build/_deps/onetbb-src/src/tbb/thread_dispatcher_client.h:41
#112226 tbb::detail::r1::thread_dispatcher::process (this=<optimized out>, j=...) at build/_deps/onetbb-src/src/tbb/thread_dispatcher.cpp:195
#112227 0x00007ffff7cfe5dc in tbb::detail::r1::rml::private_worker::run (this=0x7fffec6a3f80) at build/_deps/onetbb-src/src/tbb/private_server.cpp:271
#112228 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fffec6a3f80) at build/_deps/onetbb-src/src/tbb/private_server.cpp:221
#112229 0x00007ffff7089d22 in start_thread (arg=<optimized out>) at pthread_create.c:443
#112230 0x00007ffff710ed40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

That's (112218 - 6) / 12 = 9351 recursions(?) deep, which seems like more than a simple recursive traversal of the tree should require -- I would expect a few hundred, maybe. But then I don't know how TBB works. Hopefully it will make more sense to you!

@ognian- (Collaborator, Author) commented Jul 25, 2025

@AngieHinrichs Thanks for providing the test case! I've pushed a fix for the infinite recursion.

@AngieHinrichs (Contributor)

Thanks @ognian-, it's working now! I ran your branch's usher-sampled and matOptimize on the SARS-CoV-2 full tree files (17 million sequences). They both produced results comparable to the current version's. matOptimize has about the same runtime, but for some reason usher-sampled is about 5x slower (about 70 minutes with the current version, but 383-386 minutes with the changes). It's still good enough to maintain a daily update of the SARS-CoV-2 tree, but it would be nice to see if that can be improved without too much effort. I will look into compiling in profiler support (although I don't know cmake very well) to see if I can get more details. (There's always "poor man's profiler" too 🙂)

@ognian- (Collaborator, Author) commented Jul 28, 2025

@AngieHinrichs Such a slowdown is not acceptable, thanks for letting me know! I'll get back to you with an update when I gather some insight.

@AngieHinrichs (Contributor)

At least part of the problem may be that I use --batch_size_per_process 10. I just tried running with 100 instead of 10, and it took only 95 minutes.

IIRC I'm using such a small batch size because I used to occasionally hit apparent deadlock in usher-sampled, and Cheng suggested that I use a smaller batch size. That, and reducing the number of threads, seemed to help. I hardly ever see usher-sampled lock up for the SARS-CoV-2 build, but it happens (maybe one in 20 or 50 runs?) even with -T 16 for my nightly builds for viruses with a lot fewer available sequences. So the smaller batch size didn't seem to cause too much overhead with the current version of usher-sampled, but with the newer TBB, I guess the overhead is a lot more.

If the changes make it so that usher-sampled can't deadlock any more, that would be great!

Re: profiling -- I tried adding these 3 lines at various stages of CMakeLists.txt:

SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pg")
SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -pg")
SET(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -pg")

(as seen on Stack Overflow) -- but I kept getting errors from various stages of the compile about needing -fPIE instead of -fPIC, and it looked like the .c compiler commands were not using CMAKE_CXX_FLAGS anyway (a completely different set of options was used). I lost patience with cmake and tried the "poor man's profiler", i.e. running in gdb, hitting ctrl-C and looking at the stack. Every time (25 out of 25 tries) the stack was this:

#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007ffff7cef1bb in tbb::detail::r1::futex_wait (comparand=2, futex=0x7fffffffaff8) at build/_deps/onetbb-src/src/tbb/semaphore.h:100
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffffffaff8) at build/_deps/onetbb-src/src/tbb/semaphore.h:261
#3  tbb::detail::r1::sleep_node<unsigned long>::wait (this=0x7fffffffafd0) at build/_deps/onetbb-src/src/tbb/concurrent_monitor.h:170
#4  tbb::detail::r1::concurrent_monitor_base<unsigned long>::commit_wait (this=<optimized out>, node=...) at build/_deps/onetbb-src/src/tbb/concurrent_monitor.h:232
#5  tbb::detail::r1::concurrent_monitor_base<unsigned long>::commit_wait (node=..., this=0x7fffd5dc3c28) at build/_deps/onetbb-src/src/tbb/concurrent_monitor.h:228
#6  tbb::detail::r1::concurrent_monitor_base<unsigned long>::wait<tbb::detail::r1::sleep_node<long unsigned int>, tbb::detail::r1::wait_bounded_queue_monitor(concurrent_monitor*, std::size_t, std::ptrdiff_t, tbb::detail::d1::delegate_base&)::<lambda()> > (node=..., pred=..., 
    this=0x7fffd5dc3c28) at build/_deps/onetbb-src/src/tbb/concurrent_monitor.h:262
#7  tbb::detail::r1::wait_bounded_queue_monitor (monitors=<optimized out>, monitor_tag=monitor_tag@entry=1, target=target@entry=22, predicate=warning: RTTI symbol not found for class 'tbb::detail::d1::delegated_function<tbb::detail::d2::concurrent_bounded_queue<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*, tbb::detail::d1::cache_aligned_allocator<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*> >::internal_pop(void*)::{lambda()#1}>'
...) at build/_deps/onetbb-src/src/tbb/concurrent_bounded_queue.cpp:57
#8  0x00005555555b815a in tbb::detail::d2::concurrent_bounded_queue<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*, tbb::detail::d1::cache_aligned_allocator<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*> >::internal_wait<tbb::detail::d2::concurrent_bounded_queue<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*, tbb::detail::d1::cache_aligned_allocator<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*> >::internal_pop(void*)::{lambda()#1}>(tbb::detail::r1::concurrent_monitor*, unsigned long, long, tbb::detail::d2::concurrent_bounded_queue<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*, tbb::detail::d1::cache_aligned_allocator<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*> >::internal_pop(void*)::{lambda()#1}) (pred=..., target=22, monitor_tag=1, monitors=<optimized out>, this=0x7fffffffb2a0)
    at /cluster/home/angie/github/matsengrp/usher/build/_deps/onetbb-src/src/tbb/../../include/tbb/../oneapi/tbb/concurrent_queue.h:329
#9  tbb::detail::d2::concurrent_bounded_queue<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*, tbb::detail::d1::cache_aligned_allocator<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*> >::internal_pop(void*)::{lambda()#2}::operator()() const (__closure=<optimized out>) at /cluster/home/angie/github/matsengrp/usher/build/_deps/onetbb-src/src/tbb/../../include/tbb/../oneapi/tbb/concurrent_queue.h:621
#10 tbb::detail::d0::try_call_proxy<tbb::detail::d2::concurrent_bounded_queue<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*, tbb::detail::d1::cache_aligned_allocator<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*> >::internal_pop(void*)::{lambda()#2}>::on_exception<tbb::detail::d2::concurrent_bounded_queue<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*, tbb::detail::d1::cache_aligned_allocator<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*> >::internal_pop(void*)::{lambda()#3}>(tbb::detail::d2::concurrent_bounded_queue<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*, tbb::detail::d1::cache_aligned_allocator<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*> >::internal_pop(void*)::{lambda()#3}) (on_exception_body=..., this=<optimized out>)
    at /cluster/home/angie/github/matsengrp/usher/build/_deps/onetbb-src/src/tbb/../../include/tbb/../oneapi/tbb/detail/_template_helpers.h:227
#11 tbb::detail::d2::concurrent_bounded_queue<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*, tbb::detail::d1::cache_aligned_allocator<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*> >::internal_pop (dst=<synthetic pointer>, this=0x7fffffffb2a0) at /cluster/home/angie/github/matsengrp/usher/build/_deps/onetbb-src/src/tbb/../../include/tbb/../oneapi/tbb/concurrent_queue.h:622
#12 tbb::detail::d2::concurrent_bounded_queue<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*, tbb::detail::d1::cache_aligned_allocator<std::tuple<std::vector<Main_Tree_Target, std::allocator<Main_Tree_Target> >, Sample_Muts*, bool>*> >::pop (result=<synthetic pointer>: <optimized out>, this=0x7fffffffb2a0) at /cluster/home/angie/github/matsengrp/usher/build/_deps/onetbb-src/src/tbb/../../include/tbb/../oneapi/tbb/concurrent_queue.h:499
#13 place_sample_thread (sample_start_idx=<optimized out>, do_print=<optimized out>, printer_node=..., multi_processing=<optimized out>, max_uncertainty=<optimized out>, max_parsimony=<optimized out>, dry_run=false, parsimony_increase_threshold=<optimized out>, curr_idx=..., 
    stop=..., all_size=<optimized out>, retry_queue=..., found_place_queue=..., send_queue=..., deleted_nodes=std::vector of length 5, capacity 47665 = {...}, main_tree=..., start_idx=<optimized out>)
    at /cluster/home/angie/github/matsengrp/usher/src/usher-sampled/place_sample.cpp:460
#14 place_sample_leader (sample_to_place=std::vector of length 47665, capacity 65536 = {...}, main_tree=..., batch_size=batch_size@entry=10, curr_idx=..., parsimony_increase_threshold=<optimized out>, dry_run=dry_run@entry=false, placement_stats_file=<optimized out>, 
    max_parsimony=<optimized out>, max_uncertainty=<optimized out>, low_confidence_samples=..., samples_clade=..., sample_start_idx=<optimized out>, idx_map=<optimized out>, do_print=<optimized out>)
    at /cluster/home/angie/github/matsengrp/usher/src/usher-sampled/place_sample.cpp:793
#15 0x0000555555582fc1 in leader_thread (options=..., batch_size_per_process=<optimized out>) at /cluster/home/angie/github/matsengrp/usher/src/usher-sampled/driver/main.cpp:350
#16 main (argc=<optimized out>, argv=<optimized out>) at /cluster/home/angie/github/matsengrp/usher/src/usher-sampled/driver/main.cpp:512

-- although maybe running something highly parallel in gdb and hitting ctrl-C would always hit the main thread anyway?? Dunno. Just thought I'd include that in case it rings a bell.
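One way around the -fPIE/-fPIC trouble may be to pass the flags at configure time in a fresh build directory, so they reach the C compiler, the C++ compiler, and the linker alike (a hypothetical invocation, untested against usher's build; -no-pie is included because gprof's -pg instrumentation typically wants a non-PIE executable):

```shell
# Pass -pg to both compilers and the linker at configure time instead of
# editing CMakeLists.txt; a fresh build directory avoids stale cache values.
mkdir build-prof && cd build-prof
cmake .. \
  -DCMAKE_C_FLAGS="-pg" \
  -DCMAKE_CXX_FLAGS="-pg" \
  -DCMAKE_EXE_LINKER_FLAGS="-pg -no-pie" \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo
make -j8
```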

@ognian- (Collaborator, Author) commented Jul 29, 2025

@AngieHinrichs Thanks for the details! When you compare to the pre-TBB update build that runs in 70 minutes, is that from the latest main branch or some previous release tag? I've based the TBB branch off current main.

@AngieHinrichs (Contributor)

When you compare to the pre-TBB update build that runs in 70 minutes, is that from the latest main branch or some previous release tag?

The latest main branch. But the speed of usher-sampled has been stable (on my server anyway) for a long time.

@ognian- (Collaborator, Author) commented Jul 31, 2025

I found a separate bug that triggers an assertion failure only in Debug builds:

Assertion `shared_mutations.back().get_descendant_mut() & shared_mutations.back().get_mut_one_hot()' failed.

This is the relevant stack trace:

Down_Sibling_Hook::both at src/usher-sampled/main_mapper.cpp:206
Combine_Hook<Down_Decendant_Hook, Down_Sibling_Hook>::both at src/usher-sampled/main_mapper.cpp:37
generic_merge<Combine_Hook<Down_Decendant_Hook, Down_Sibling_Hook>>  at src/usher-sampled/main_mapper.cpp:282
search_serial at src/usher-sampled/main_mapper.cpp:346
search_serial at src/usher-sampled/main_mapper.cpp:362
main_tree_search_recursive at src/usher-sampled/main_mapper.cpp:389
main_tree_search_recursive at src/usher-sampled/main_mapper.cpp:422
place_main_tree at src/usher-sampled/main_mapper.cpp:468
find_place at src/usher-sampled/place_sample_shared.cpp:178
Finder::operator() at src/usher-sampled/place_sample.cpp:685

@AngieHinrichs (Contributor)

Hi @ognian-, thanks for your ongoing work on this! Here is a smallish performance test:

curl -O https://raw.githubusercontent.com/cov-lineages/pangolin-data/refs/heads/main/pangolin_data/data/lineageTree.pb
curl -O https://hgwdev.gi.ucsc.edu/~angie/test.1000.mpl
# Same NC_045512.2.fa as before

usher-sampled -T 8 \
    -i lineageTree.pb \
    --diff test.1000.mpl \
    --ref NC_045512.2.fa \
    --optimization_radius 0 \
    --batch_size_per_process 10 \
    -o test.pb.gz \
    >& test.log

On my server, that takes around 15s for the main branch usher-sampled and 17-19s with the changes.

@AngieHinrichs (Contributor)

Reducing the --batch_size_per_process to 5 increases the difference between the times: ~17s vs. 26-29s.

@AngieHinrichs (Contributor)

And just to clarify, regarding the Debug-only assertion failure that you found: do we need to fix that or did you fix it?

@ognian- (Collaborator, Author) commented Jul 31, 2025

Regarding the bug in debug mode: with the small tree I was able to make progress on it. I found a total of 4 asserts in main_mapper.cpp that fail, and if they are commented out, the run finishes normally. I'll keep investigating.

@AngieHinrichs (Contributor)

I ran this branch's matOptimize on the pre-optimization full SARS-CoV-2 tree from 2025-08-25. It ran without errors in 107.5 minutes and produced a tree with almost identical matUtils summary stats to the daily build (main branch, ~102 minutes) matOptimize run. Awesome work @ognian- ! 🎉

I will try usher-sampled next.

@AngieHinrichs (Contributor) commented Sep 2, 2025

Sorry about the delay, I am too easily distracted. I tried the matsengrp branch usher-sampled on the full tree of 17 million sequences and it got good results (similar output about the same samples, similar matUtils summary -i output), but it took 396 minutes while the main branch usher-sampled took 67 minutes. It is slower, but the daily build will still work.

@AngieHinrichs (Contributor)

This issue of usher-sampled's relative slowness came up on July 28th. Let me try running again with --batch_size_per_process 100 instead of 10; stay tuned.

@AngieHinrichs (Contributor)

With --batch_size_per_process 100 instead of 10, usher-sampled took 95 minutes, much better.

@AngieHinrichs (Contributor) commented Sep 25, 2025

With the latest changes, the performance with --batch_size_per_process 100 has stayed the same at 95 minutes (92 minutes in a second run), while the performance with --batch_size_per_process 10 has improved a lot, from 396 minutes to 164 minutes.

@yatisht marked this pull request as ready for review October 7, 2025 19:54
@AngieHinrichs (Contributor)

Great improvement with Taskflow: now it takes only 91 minutes with --batch_size_per_process 10! (And the stdout log of sample placements and the matUtils summary results on the resulting tree look good.)
