Releases: shenwei356/LexicMap
Releases · shenwei356/LexicMap
LexicMap v0.8.0
v0.8.0 - 2025-09-10
No changes to the index format (see Index format changelog).
- New commands:
lexicmap utils merge-search-results: Merge a query's search results from multiple indexes.lexicmap utils edit-genome-ids: Edit genome IDs in the index via a regular expression.
It's helpful when users forgot to use the flag-N/--ref-name-regexp
to extract the genome ID from the sequence file during indexing.
This command help to fix it without rebuilding the index.
lexicmap index:- Significantly reduce the memory usage (by up to 25%) in the merge step.
Also reduce some for huge data, such as long-reads or contigs in the Logan project.
- Significantly reduce the memory usage (by up to 25%) in the merge step.
lexicmap search:- Reduce memory usage, particularly for batch searching (by up to 50%).
- Improve search speed, mainly for batch searching.
- Support limiting search by TaxId(s) via
-t/--taxidsor--taxid-file.
Only genomes with descendant TaxIds of the specific ones or themselves are searched,
in a similar way with BLAST+ 2.15.0 or later versions.
Negative values are allowed as a black list.
For example, searching non-Escherichia (561) genera of Enterobacteriaceae (543) family with-t 543,-561.
Users only need to provide NCBI-format taxdump files (-T/--taxdump, can also create from
any taxonomy data with TaxonKit)
and a genome-ID-to-TaxId mapping file (-G/--genome2taxid).
There's no need to rebuild the index. - Check if the output file and the log file are the same.
- Reduce the time of seed matching when using
-w. - Change the default value of
--max-query-concfrom 12 to 8. - New flag
--gc-interval(default 64, 0 for disable) for forcing garbage collection every N queries. This decreases memory usage a lot.
lexicmap utils subseq:- Accept the output file of
lexicmap searchas the input.
So one can extract matched sequences (including flanking regions) from the index, after alignment withlexicmap searchwith or without using the flag-a/--all. - Support extending aligned regions with
-U/--upstreamand/or-D/--downstream.
- Accept the output file of
LexicMap v0.7.0
v0.7.0 - 2025-04-11
Please rebuild the index, as some seeds in the genome end regions were missed during computation.
lexicmap index:- Fix a little bug in seed desert filling -- forgot to fill the region (a few hundred bases) behind the last seed.
lexicmap search:- Improve seed chaining -- more accurate for complex anchors.
- Improve pseudoalignment in repetitive regions.
- Change the default value of
--seed-max-gapfrom 200 to 50.
LexicMap v0.6.1
v0.6.1 - 2025-03-31
lexicmap search:- Fix the program hang in the debug mode when no chaining result is returned.
lexicmap version:- Do not show commit hash by default.
LexicMap v0.6.0
v0.6.0 - 2025-03-25
This version is compatible with indexes created by previous versions (requires a one-time, automatic preprocessing),
but rebuilding the index is recommended for more accurate results on short queries (<500bp).
However, indexes created by this version are not compatible with previous versions when the number of batches is <= 512.
lexicmap index:- Change default option values to bring a higher sensitivity for short (<=500, especially <=250) queries,
faster indexing speed, and faster seed-matching speed, at a cost of slightly larger index.-m/--masks: 40,000 -> 20,000.
40k is unnecessary especially for small genomes, where seeds would be very crowded,
with a big proportion of seed distance being between 0-50 bp.-D/--seed-max-desert: 200 -> 100. This provides a smaller seed window guarantee.
- Reduce index size by using 3 bytes rather than 4 for saving seed data when the number of batches is <= 512,
which requires only 9 (17 minus 8) bits to store the batch index.
We also recommend controlling the number of batches for better performance. - Fix seed desert filling near gap regions.
- Change default option values to bring a higher sensitivity for short (<=500, especially <=250) queries,
lexicmap search:- Improve pseudoalignment to produce longer alignment regions.
- Add 3 extra columns:
cls,evalueandbitscore, and a new option-e/--max-evalue. - Reduce memory usage.
- Remove flag
--pseudo-align. - Add a progress bar for
--debug.
lexicmap utils seed-pos:- Change default option values of sliding window.
LexicMap v0.5.0
v0.5.0 - 2024-12-18
This version is compatible with indexes created by LexicMap v0.4.0, but rebuilding the index is recommended for more accurate results.
- New commands:
lexicmap utils remerge: Rerun the merging step for an unfinished index.
lexicmap index:- Big genomes with thousands of contigs (big yet fragmented assemblies) are automatically split into multiple chunks, and alignments from these chunks will be merged.
- Change the default value of
--partitionsfrom 1024 to 4096, which increases the seed-matching speed at the cost of 2 GiB more memory occupation.
For existing lexicmap indexes, just runlexicmap utils reindex-seeds --partitions 4096to re-create seed indexes. - Do not save seeds of low-complexity.
- Fix high memory usage in writing seed data.
- Change the default value of
-c/--chunksfrom all available CPUs to the value of-j/--threads. - Change the default value of
--max-open-filesfrom 512 to 1024. - Add a new flag
--debug.
lexicmap search:- Improving chaining, pseudoalignment, and alignment for highly repetitive sequences.
- More accurate chaining score with better chaining of overlapped anchors, this produces more accurate results with
-n/--top-n-genomes:- Merging two overlapped non-gapped anchors into a longer one.
- For these with gaps, only the non-overlapped part of the second anchor is used to compute the weight.
- Using the score of the best chain (rather than the sum) for sorting genomes when using
-n.
- Fix positions and alignment texts for queries with highly repetitive sequences in end regions. #9
- Skip seeds of low-complexity.
- Change the default value of
--max-open-filesfrom 512 to 1024. - Change the default value of
--align-bandfrom 50 to 100. - Improve the speed of anchor deduplication, genome information extraction, and result ordering.
- Improve the speed of chaining for long queries.
- Improve the speed of seed matching when using
-w/--load-whole-seeds. - Improve the speed of alignment, and reduce the memory usage.
- Remain compatible after the change of
lexicmap index. - Add a new flag
--debug.
lexicmap utils genomes:- Do not sort genome ids.
- Add a header line and add another column to show if the reference genome is chunked.
lexicmap utils subseq:- Remain compatible after the change of
lexicmap index.
- Remain compatible after the change of
lexicmap utils seed-pos:- Remain compatible after the change of
lexicmap index, while histograms are plotted separately for multiple genome chunks.
- Remain compatible after the change of
lexicmap utils reindex-seeds:- Change the default value of
--partitionsfrom 1024 to 4096.
- Change the default value of
LexicMap 0.4.0
v0.4.0 - 2024-08-15
- New commands:
lexicmap utils 2blast: Convert the default search output to blast-style format.
lexicmap index:- Support suffix matching of seeds, now seeds are immune to any single SNP!!!, at the cost of doubled seed data.
- Better sketching desert filling for highly-repetitive regions.
- Change the default value of
--seed-max-desertfrom 900 to 200 to increase alignment sensitivity. - Mask gap regions (N's).
- Fix skipping interval regions by further including the last k-1 bases of contigs.
- Fix a bug in indexing small genomes.
- Change the default value of
-b, --batch-sizefrom 10,000 to 5,000. - Improve lexichash data structure.
- Write and merge seed data in parallel, new flag
-J/--seed-data-threads. - Improve the log.
lexicmap search:- Fix chaining for highly-repetitive regions.
- Perform more accurate alignment with WFA.
- Use buffered reader for seeds file reading.
- Fix object recycling and reduce memory usage.
- Fix alignment against genomes with many short contigs.
- Fix early quit when meeting a sequence shorter than k.
- Add a new option
-J/--max-query-concto limit the miximum number of concurrent queries,
with a default valule of 12 instead of the number of CPUs, which reduces the memory usage
in batch searching. - Result format:
- Cluster alignments of each target sequence.
- Remove the column
seeds. - Add columns
gaps,cigar,align, which can be reformated withlexicmap utils 2blast.
lexicmap utils kmers:- Fix the progress bar.
- Fix a bug where some masks do not have any k-mer.
- Add a new column
prefixto show the length of common prefix between the seed and the probe. - Add a new column
reversedto indicate if the k-mer is reversed for suffix matching.
lexicmap utils masks:- Add the support of only outputting a specific mask.
lexicmap utils seed-pos:- New columns:
sseqidandpos_seq. - More accurate seed distance.
- Add histograms of numbers of seed in sliding windows.
- New columns:
lexicmap utils subseq:- Fix a bug when the given end position is larger than the sequence length.
- Add the strand ("+" or "-") in the sequence header.
LexicMap v0.3.0
v0.3.0 - 2024-05-14
lexicmap index:- Better seed coverage by filling sketching deserts.
- Use longer (1000bp N's, previous: k-1) intervals between contigs.
- Fix a concurrency bug between genome data writing and k-mer-value data collecting.
- Change the format of k-mer-value index file, and fix the computation of index partitions.
- Optionally save seed positions which can be outputted by
lexicmap utils seed-pos.
lexicmap search:- Improved seed-chaining algorithm.
- Better support of long queries.
- Add a new flag
-w/--load-whole-seedsfor loading the whole seed data into memory for faster search. - Parallelize alignment in each query, so it's faster for a single query.
- Optional outputing matched query and subject sequences.
- 2-5X searching speed with a faster masking method.
- Change output format.
- Add output of query start and end positions.
- Fix a target sequence extracting bug.
- Keep indexes of genome data in memory.
lexicmap utils kmers:- Fix a little bug, wrong number of k-mers for the second k-mer in each k-mer pair.
- New commands:
lexicmap utils gen-masksfor generating masks from the top N largest genomes.lexicmap utils seed-posfor extracting seed positions via reference names.lexicmap utils reindex-seedsfor recreating indexes of k-mer-value (seeds) data.lexicmap utils genomesfor list genomes IDs in the index.
LexicMap v0.2.0
v0.2.0 - 2014-02-02
- Software architecture and index formats are redesigned to reduce searching memory occupation.
- Indexing: genomes are processed in batches to reduce RAM usage, then indexes of all batches are merged.
- Searching: seeds matching is performed on disk yet it's ultra-fast.
LexicMap v0.1.0
v0.1.0 - 2024-01-15
- The first release.
- Seed indexing and querying are performed in RAM.
- GTDB r214 with 10k masks: index size 75GB, RAM: 130GB.