Conversation

@JonahPlusPlus commented Feb 25, 2025

Fixes #39 (and supersedes #36 and #37)

  • Introduce a new Label<Token> trait for iterating over tokens, and implement it for common types plus blanket implementations.
  • Merge push and insert builder methods together.
  • Replace all arguments that use impl AsRef<[Label]> with the Label<Token> trait.
  • Add IntoLabel<Token> extension trait and LabelIter<Token> wrapper for converting impl Iterator<Item = Token> to a Label<Token> object.
  • Introduce new children methods to Trie and IncSearch that iterate over the children at a prefix.
  • Update Rust edition to 2024
  • Rename instances of Label to Token (so that Label refers to a string of tokens and Token refers to a single unit in the trie)
  • Rename "postfix" as "suffix" (maybe?)
  • Rename methods for consistency (e.g. exact_match to is_exact, to mirror is_prefix).
  • Restructure crate to be consistent with standard practices (e.g. rename module files to <mod>/mod.rs; rename internal_data_structure to just internal)
  • Optimize some methods by moving temp Vecs outside of loops and reusing memory
  • Make sure node storage is shrunk to fit (convert Vec<Node> to Box<[Node]>)
  • Create unit tests
  • Write documentation
  • Create migration guide
  • Increment minor version
  • Update changelog

Possible changes (might be left for future work):

  • Optimize search methods that use TryFromIterator<Label, M> by making them return slices of the input or exact indices
  • Move Answer outside of inc_search and rename to LabelKind, and add NodeRef::kind(_)
  • Change API to use get methods that return NodeRef objects (to mirror APIs that exist in the ecosystem)

Improves API ergonomics by depending on an in-crate trait instead of the previous AsRef trait, and introduces new accessors for child nodes.
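
The trait definition isn't shown in this thread; a plausible minimal shape (method name and signature assumed), just to ground the discussion:

// A sketch, not the PR's actual definition: a label is anything that can be
// broken into a sequence of tokens.
pub trait Label<Token> {
    fn tokens(&self) -> impl Iterator<Item = Token>;
}

// Being an in-crate trait, the crate itself can provide impls like these:
impl Label<u8> for &str {
    fn tokens(&self) -> impl Iterator<Item = u8> {
        self.bytes()
    }
}

impl Label<char> for &str {
    fn tokens(&self) -> impl Iterator<Item = char> {
        self.chars()
    }
}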

Why Label<Token> and not IntoIterator<Item = Token>?

With IntoIterator, this crate would be locked out of specific implementations like impl Label<u8> for char (which allows incremental searches of Trie<u8, _> with a char) due to the orphan rules.
Instead, if a custom iterator needs to be used, users can call into_label() to put it into a generic wrapper (which should just get optimized away).
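
The cited char case, sketched under the same assumptions, plus the escape hatch:

// Why the orphan rules matter: with a foreign trait like IntoIterator, this
// impl could not live in this crate, but an in-crate trait permits it.
impl Label<u8> for char {
    fn tokens(&self) -> impl Iterator<Item = u8> {
        let mut buf = [0u8; 4];
        let len = self.encode_utf8(&mut buf).len();
        buf.into_iter().take(len) // the char's UTF-8 bytes, by value
    }
}

// And the wrapper route for arbitrary iterators, per the checklist above:
let label = "Apple".bytes().map(|b| b.to_ascii_lowercase()).into_label();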

Drawbacks

This does mean type annotations are sometimes needed when building tries (e.g. &str implements both Label<u8> and Label<char>).
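
For example (builder names assumed, with push/insert merged as described above):

// The token type can't be inferred from "apple" alone, so it is named here.
let mut builder = TrieBuilder::<u8, u32>::new(); // &str used as Label<u8>
builder.insert("apple", 1);
let byte_trie = builder.build();

let mut builder = TrieBuilder::<char, u32>::new(); // &str used as Label<char>
builder.insert("apple", 1);
let char_trie = builder.build();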

Future Work

Implement children_mut methods on NodeMut (needs an unsafe implementation to get around the borrow checker).

Explore alternative vector implementations like smallvec (maybe feature flagged?) to avoid heap allocations.

Investigate inconsistency of checks:

// Trie::value(_)
if node_num.0 >= 2 {
  self.trie_tokens[(node_num.0 - 2) as usize].value.as_ref()
} else {
  None
}

// vs

// Trie::value_mut(_)
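// Note: unlike value(_), there is no node_num.0 >= 2 guard here,
// so node_num.0 - 2 can underflow when node_num.0 < 2.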
self.trie_tokens[(node_num.0 - 2) as usize].value.as_mut()

@JonahPlusPlus (Author)

Some preliminary benchmarks (just to get a sense of where we are in performance before I make more changes):

[89c8a67] Trie::build() 10000 items
                        time:   [4.4129 ms 4.6616 ms 4.8854 ms]
[9514c6c] Trie::build() 10000 items
                        time:   [4.0583 ms 4.1358 ms 4.2735 ms]

[89c8a67] Trie::exact_match() 100 times
                        time:   [3.7502 ms 3.8412 ms 4.0162 ms]
[9514c6c] Trie::exact_match() 100 times
                        time:   [3.5008 ms 3.5263 ms 3.5589 ms]

[89c8a67] Trie::predictive_search() 100 times
                        time:   [6.6395 ms 6.6848 ms 6.7540 ms]
[9514c6c] Trie::predictive_search() 100 times
                        time:   [6.4623 ms 6.4806 ms 6.5048 ms]

[89c8a67] Trie::predictive_search_big_output()
                        time:   [49.609 ms 50.004 ms 51.079 ms]
[9514c6c] Trie::predictive_search_big_output()
                        time:   [48.979 ms 49.154 ms 49.544 ms]

[89c8a67] Trie::predictive_search_limited_big_output()
                        time:   [809.92 us 811.48 us 813.42 us]
[9514c6c] Trie::predictive_search_limited_big_output()
                        time:   [806.87 us 809.73 us 812.83 us]

[89c8a67] Trie::common_prefix_search() 100 times
                        time:   [3.8087 ms 3.8161 ms 3.8296 ms]
[9514c6c] Trie::common_prefix_search() 100 times
                        time:   [3.6169 ms 3.6331 ms 3.6495 ms]

[89c8a67] Trie::common_prefix_match() 100 times
                        time:   [912.16 us 915.12 us 917.49 us]
[9514c6c] Trie::common_prefix_match() 100 times
                        time:   [902.21 us 906.88 us 915.14 us]

Overall, performance seems to have improved (hard to tell when criterion can't compare the benchmarks due to the weird naming convention).

I'll have to update the benchmarks to use criterion 0.5 and remove the git hash in the names (instead opting for named baselines).

@shanecelis (Collaborator) commented Feb 26, 2025

This is looking good. I really like that this doesn't interfere with using the API with strings. I think the type annotation on the builder is a fine concession.

One consideration I've had in the back of my mind (and this can be reserved for a separate PR) is that the map::Trie potentially doubles the size of the data structure because it stores a key-value pair per entry. I believe this could be handled instead as a key/value union where the last entry (the entry with no children) is treated as a value, and the rest are treated as keys.
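
Purely as an illustrative sketch of that union idea (not proposed code):

use std::mem::ManuallyDrop;

// Each slot is either a key token or, at a childless (terminal) node, the
// value. Which variant is live would have to be tracked externally, e.g. by
// the LOUDS structure knowing which nodes have no children.
union KeyOrValue<K, V> {
    key: ManuallyDrop<K>,
    value: ManuallyDrop<V>,
}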

@shanecelis (Collaborator)

I think changing "postfix" to "suffix" is good.

Perhaps we can reconsider some of our search names as well. Might as well bundle up the breaking changes. I'm not requesting you do this work, just soliciting your opinion on these name changes:

  • "predictive_search" -> "search"
  • "common_prefix_search" -> "common_prefixes"
  • "postfix_search" -> "suffixes"

@JonahPlusPlus (Author) commented Feb 28, 2025

Making a note (in case feedback is needed):
Right now I'm working on improving the usability of the search methods. Previously they returned labels and their values, but with the introduction of NodeRef, this can be made more flexible and can even improve performance.

The change is nested iterators with a new NodeIter object, which looks like:

pub struct NodeIter<'t, Token, Value> {
    pub(crate) trie: &'t Trie<Token, Value>,
    pub(crate) start: LoudsNodeNum,
    pub(crate) end: LoudsNodeNum,
}

This is a double-ended iterator (optimized for reverse iteration). The idea is simple: pop a node (go to the parent) from the end until you reach the start. For forward iteration this means a bit of work (you have to walk through many nodes to reach the start, and repeat this while incrementing start), but it's trivial for a reverse iterator (meaning that some collectors will need to be optimized to use reverse iteration and array reversal under the hood).
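
A minimal sketch of that reverse step (parent_of is a hypothetical stand-in for the real LOUDS parent query):

impl<'t, Token, Value> NodeIter<'t, Token, Value> {
    // Pop one node off the back: yield `end`, then move `end` to its parent.
    fn pop_back(&mut self) -> Option<LoudsNodeNum> {
        if self.end == self.start {
            return None;
        }
        let popped = self.end;
        self.end = parent_of(self.trie, popped); // one cheap parent hop
        Some(popped)
    }
}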

So instead of returning (C, &Value) pairs, search iterators will just return NodeIter<Token, Value> iterators.

Then an extension trait will be used to allow people to call .into_pairs() if they want the old behavior.

Edit: this will also allow expanding Trie for querying ranges with a label (instead of just getting the last node ref).

@JonahPlusPlus (Author) commented Mar 1, 2025

@shanecelis I'm considering dropping "postfix_search" in favor of changing:

  • "predictive_search" -> "after" (as in all exact matches that appear 'after' some node)
  • "common_prefix_search" -> "before" (as in all exact matches that appear 'before' some node)

And adding a new .filter_prefix(label) method to strip the prefix from the labels.

My rationale: "postfix_search" is almost the same as "predictive_search", it just drops the common prefix from the results. Rather than maintain two separate APIs that do pretty much the same thing, it might be better to merge them and create a modifier that opts into "postfix_search" behavior.

Plus, this would mean more flexibility, since .filter_prefix(label) could work on "common_prefix_search" (why, IDK, but having general behavior is always nice).

How does this sound?

Edit:
Instead of an extension trait that adds filter_prefix, it might just be a flag method called drop_prefix. I'll play around with it until I get something that ties in conceptually without losing any API ergonomics.
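
Hypothetically, the two call sites being weighed (method names as proposed here, not final):

// Keep the full labels:
let full: Vec<String> = trie.predictive_search("app").collect();
// Opt into the old "postfix_search" behavior by stripping the query prefix:
let tails: Vec<String> = trie.predictive_search("app").drop_prefix().collect();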

@shanecelis (Collaborator) commented Mar 2, 2025

I like the idea of refining the API so that it has options like you suggest. I'm also ok with maybe dropping postfix_search. I'm not sure it had a real use case. It was more like an implementation detail that got exposed. We could deprecate it and see if anyone complains, or you can do like you're suggesting.

I see the thinking behind before() and after(), but I'm not taken with them; they don't gel with string matches to me, which is our inspiring and probably most prevalent use case. Not that we shouldn't remain generic, but strings ought to be natural.

How about, instead of predictive_search(), we ape the string API and have a simple starts_with()? Then for common_prefix_search(), hmm..., I don't know, maybe starts_with_any_prefix(), starts_with_any_sub(), starts_with_any(), starts_of(), or starts_within(). [Grimaces.] Happy to hear your ideas for what you could call "common_prefix_search".

@JonahPlusPlus (Author)

I did consider "predictive_search" -> "starts_with" and "common_prefix_search" -> "prefixes_of", but that's when I was also considering "postfix_search" -> "suffixes_of".

However, now I'm thinking I should just include all three methods and just drop SearchIter in favor of a modified PostfixIter:
Right now, I am looking at the logic shared by PostfixIter and SearchIter, and it would be easy to store a start: LoudsNodeNum field to set on NodeIter. When constructing PostfixIter, this can be set to the subtree root (the last node of a label), while for SearchIter it can be set to 1 (the trie root). This means it is far easier to convert a PostfixIter to a SearchIter: just set start: 1. But IMO, converting postfix to predictive search won't feel as natural as going the opposite direction (e.g. calling include_prefixes vs strip_prefixes).

(I've probably been too keen on merging APIs when I should be merging their logic)

tl;dr I think I'll go with "predictive_search" -> "starts_with", "common_prefix_search" -> "prefixes_of", and "postfix_search" -> "suffixes_of".
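
For reference, a sketch of that shared construction, reusing the NodeIter shape from earlier in this thread:

// Both searches produce the same iterator; only `start` differs.
fn suffix_iter<'t, T, V>(trie: &'t Trie<T, V>, label_last: LoudsNodeNum, end: LoudsNodeNum) -> NodeIter<'t, T, V> {
    NodeIter { trie, start: label_last, end } // PostfixIter: drop the common prefix
}

fn search_iter<'t, T, V>(trie: &'t Trie<T, V>, end: LoudsNodeNum) -> NodeIter<'t, T, V> {
    NodeIter { trie, start: LoudsNodeNum(1), end } // SearchIter: full label from the root
}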

@JonahPlusPlus (Author) commented Mar 2, 2025

A note for future work:
Once a feature like specialization, general auto-traits + negative impls, or negative bounds is stabilized, it should be possible to merge map::Trie + set::Trie and NodeRef + KeyRef by disabling value accessors when Value is () (e.g. where Value: !Unit).

It would simplify implementations so much, but alas.

Edit: Proof of concept
https://play.rust-lang.org/?version=nightly&mode=debug&edition=2024&gist=a3252b6528a701f0e3f4ad152db4bf45

Edit 2: An even more advanced proof of concept that is DRY over mutability
https://play.rust-lang.org/?version=nightly&mode=debug&edition=2024&gist=651469965458b6dfb9040ccb63ad58bb

At this point, the maintainability/extensibility is questionable; still, it's hella cool that this might be possible one day.

@JonahPlusPlus (Author) commented Mar 3, 2025

Major regressions with the changes to search iterators when getting labels (before, a unified iterator accumulated many labels; now the label collection runs separately per label, i.e. .predictive_search(label) -> .starts_with(label).labels()).

Gonna have to make a specialized label allocator that shares a buffer between labels and checks differences between node ranges.
I.e. .labels() should return a special iterator instead of calling .label() per node iterator.
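
A sketch of the shared-buffer idea (helper shape hypothetical): consecutive labels share a prefix, so the buffer is truncated and extended rather than rebuilt per label:

fn next_label(buf: &mut Vec<u8>, common_depth: usize, new_tail: &[u8]) -> String {
    buf.truncate(common_depth);      // keep the tokens shared with the previous label
    buf.extend_from_slice(new_tail); // append only what changed
    String::from_utf8_lossy(&buf[..]).into_owned()
}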

If these regressions can't be removed, I'll add back the previous implementations as special methods on Trie (in fact, I might not bother trying to improve .labels() and just add back the previous methods anyways).

@shanecelis (Collaborator)

What does .labels() do? And what is returned without .labels()?

Is it a performance regression?

@JonahPlusPlus (Author) commented Mar 4, 2025

What does .labels() do? And what is returned without .labels()?

Is it a performance regression?

So I changed the search iterators to return iterators over nodes rather than labels or (label, value) pairs. I then introduced .labels() and .pairs(), which converted node iterators to labels and (label, value) pairs, respectively, by running the label collector per iterator. The problem is that this is a lot of duplicate work that was avoided before, leading to a massive performance regression.

So, I'm just going to add the old methods back in as shortcuts, and see if I can improve performance while I'm at it (i.e. *_labels and *_pairs variants of prefixes_of, suffixes_of, and starts_with).
Since I found a way of merging PostfixIter and SearchIter, I can do the same here, so it's still a win.

@JonahPlusPlus (Author) commented Mar 4, 2025

Treating the label collector as a special case has fixed the regression for starts_with.
What's left for this part of the API:

  • Do the same for prefixes.
  • Change longest_prefix to use the TryFromTokens<Token> trait
  • Make it so that label iterators don't unwrap internally (it's better for end users to handle failures)
  • Write a ton of tests/examples/docs/benchmarks
  • Change TryFromTokens to return infallible values, with a Result associated type, a Zip<Other> GAT, and a zip function (to avoid unwrapping Result<T, Infallible>)
  • Create IncSearch wrapper for set::Trie

Almost done... Should be done by this weekend (schoolwork permitting).

@JonahPlusPlus (Author) commented Mar 4, 2025

longest_prefix is pretty weird. It does nothing that the name implies. Instead of returning the longest exact-match prefix, it returns the first exact match after the label, as long as there isn't any alternative; otherwise it returns the given prefix.

Example:
Given the following set:

  • "a"
  • "app"
  • "application"
  • "apple"

We get the following matches:

  • "a" -> "a"
  • "ap" -> "app"
  • "app" -> "app"
  • "appl" -> "appl" (we have a choice between "apple" and "application", so just return the prefix)
  • "appli" -> "application"
  • "apple" -> "apple"

A better name for this might be "next_path", since what it is doing is looking for a path subgraph following some prefix.
Also, I don't think "appl" should return Some("appl"), since it doesn't exist in the trie.
Instead, I feel that this method should follow the same patterns as other search methods, and have a variant that returns a NodeRef and another that returns a (label, value) pair.

Edit: Actually, the (label, value) pair variant isn't necessary since there isn't any mass label collection happening. Just getting a NodeRef and calling .pair() on it should be enough.

Edit 2: "path_of" sounds like a good name. "next_path" sounds like it might return the path subgraph after a label, but this really returns the shortest path that the label lies on, which means an exact match is returned instead of looking for the next entry.

@shanecelis (Collaborator)

You've got to think of it in the context of tab completion, where what you want is to solicit the user for which branch to follow. Perhaps longest_prefix_or_match() would be a better name since it doesn't skip over exact matches. Or maybe longest_prefix(skip_matches: bool) would be clearer. I think "prefix" implies it's not necessarily a match.

Let's take a step back and look at how we could build this based on other parts of our API to see if there's a clearer name for what we're doing.

trie.starts_with_labels("ap") // ["app", "apple", "application"]
    .X() // "app"

What would you name the X operation? Maybe common_prefix? I probably specifically avoided that name because of common_prefix_search() being in the API, but we can reconsider since that would go away.

@JonahPlusPlus (Author)

Probably just ".shortest()", but that doesn't provide enough information for a top-level API method. Maybe "shortest_suffix"? ("shortest_suffix_or_match" would be more precise, but that's too long IMO)
I don't think using the term "prefix" here is appropriate, since it is searching after the label for the first exact match, not before.
I think "shortest_suffix", "next_match", or "next_path" could all work. (Maybe "first_starts_with", but that could confuse people with "starts_with", which doesn't avoid branching)

@shanecelis (Collaborator) commented Mar 6, 2025

Probably just ".shortest()", but that doesn't provide enough information for a top-level API method. Maybe "shortest_suffix"? ("shortest_suffix_or_match" would be more precise, but that's too long IMO)

I see. You're saying "suffix" because we're traveling down the tree. Interesting.

I don't think using the term "prefix" here is appropriate, since it is searching after the label for the first exact match, not before.

I see what you're thinking. Good point. I have conflated the term "prefix" because it is a prefix trie. Any query that has matches is a "prefix" (WRONG; noted with "Gah" below). Any query that has matches is a "prefix" or a "terminal", but with an API like prefixes_of we're talking about a prefix of the query, and we're only providing "match"es. Whew, naming is hard.

I think "shortest_suffix", "next_match", or "next_path" could all work.

Let's figure out what to call our pieces and discuss how we currently handle them.

  • A Token is a primitive piece of the Entry/Match, the edges on our trie.
  • A Match/Entry/ExactMatch is when a query was also used to build the trie, sometimes called an exact match.
  • A Prefix is represented by a parent node in the trie, sometimes an exact match, sometimes not. All nodes are prefixes. But also we've been using prefix to indicate a kind of direction for which way our search is going, using "prefix" to indicate it's going up the tree.

Gah, I just had to remind myself that not every node in a trie is a prefix node.

/// Note: A prefix may be an exact match or not, and an exact match may be a prefix or not.
  • A Query is a sequence of tokens that may refer to a node in the trie, in which case it's a "prefix" or a "match"; if that node is exact then it is a "match" or "exact match" (but it would be nice to do away with the "exact" business, so there are just "prefix"es and "match"es)
  • A Mismatch is a query that is not represented in a trie or is only represented up to a certain token query[0..n]. Typically it's represented as None in the code.
  • A Path is new terminology to me. Sounds like it could represent a prefix or exact match. How do you define it?
  • A Suffix is where we return a portion of a match excluding its query. And sometimes we've been using "suffix" to mean we're descending the trie. This is confusing partly because it makes "suffix" and "prefix" seem like they're equally weighted concepts, but "suffix" was only an output request originally.

We've had a need to deal with these concepts more coherently, and I think your PR is providing the thrust for this much needed clarity, and I THANK YOU FOR THAT. And I'm beginning to see the value too of returning a NodeIter as the most basic type, and then choosing one's output need rather than forcing string reification or whatever the representation is. Especially in instances where maybe one wants to count the matches rather than reify them.

I'd like to pare down the concepts we use that will cover our domain. I'm hoping that we can mostly write our user-facing API in terms of these things:

  • Token
  • Query or Label
  • Prefix
  • Match

I am happy to hear and consider what terms you'd choose. Terms?

Maybe change prefixes_of to matches_within or exact_matches_within (feels too long). What does the other terminology feel like here? entries_within hmm, I feel like matches_within is more evocative of what it does.

Maybe we can do like you suggested and drop suffixes_of or rather treat the issue of what you want from your matches as an adverb like you suggested too, e.g., starts_with(query).labels(), starts_with(query).suffixes(), starts_with(query).values(), starts_with(query).pairs(). I'm now fully on board with this idea.

(Maybe "first_starts_with", but that could confuse people with "starts_with", which doesn't avoid branching)

Yeah, starts_with does a depth-first search returning all matches, and this longest_prefix does a goto to the first branch or match. I'm speaking aloud here just to help myself navigate: why do we do that? To solicit the user to choose among the branches or to be content with the next match. So maybe we can use that and name it in terms of its utility; some options: complete (too generic), auto_complete, next_decision, next_completion, next_match_or_branch (boo). Thoughts?

An aside

One other thing I've been considering is whether we want to change LabelKind from:

pub enum LabelKind {
    /// There is a prefix here.
    Prefix,
    /// There is an exact match here.
    Match,
    /// There is a prefix and an exact match here.
    PrefixAndExact,
}

to something like this:

bitflags! {
    pub struct NodeKind: u8 {
        const PREFIX   = 0b00000001;
        const MATCH    = 0b00000010;
    }
}
impl NodeKind {
    pub fn is_terminal(&self) -> bool { *self == NodeKind::MATCH }
    pub fn is_match(&self) -> bool { self.contains(NodeKind::MATCH) }
    pub fn is_prefix(&self) -> bool { self.contains(NodeKind::PREFIX) }
    pub fn is_missing(&self) -> bool { self.is_empty() } // better name?
    pub fn is_present(&self) -> bool { !self.is_empty() }
}
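
For instance, hypothetical usage if this bitflags shape landed:

// A node like "app" (both a prefix of "apple" and an exact match):
let kind = NodeKind::PREFIX | NodeKind::MATCH;
assert!(kind.is_prefix() && kind.is_match());
assert!(!kind.is_terminal()); // terminal would be MATCH alone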

I've just been feeling like there have been other parts of the API where saying PREFIX | MATCH might be expressive as arguments.

@shanecelis (Collaborator)

By the way, I should have said this from the start, but I really like your PR's set of checkboxes that shows the intent and progress. I think I'm going to adopt that in the future. It's incredibly helpful.

@JonahPlusPlus (Author) commented Mar 6, 2025

Let's figure out what to call our pieces and discuss how we currently handle them.

I think your definitions are good. As far as changes/clarifications go:

  • I think "Query" should be dropped for just "Label" (since the TrieBuilder uses them during construction). We could also just call it "Tokens", which might be more accessible (I didn't put much thought into it and just took the term "Label" because it was being used in place of "Token" internally).
  • By "Path", I'm thinking along terms in graph theory. A trie is just a particular tree graph after all. A path is a sequence of vertices (aka nodes) in a graph. What longest_prefix was doing was searching after a label for the first exact match that didn't have any branching. In other words, the shortest path subgraph after that label (since a path doesn't have any branching). Now, I was using the term sparingly, since you can find a path as a subgraph in any graph (all NodeIter instances are paths).
  • On that note, next_completion sounds more intuitive and indicates intent better. So I think I'll go with that.
  • As far as what to do with the dual definitions of "prefix" and "suffix", I think the definitions are close enough to think of them primarily as labels relative to a particular label (so, like a direction and like a portion of some label).

Maybe we can do like you suggested and drop suffixes_of or rather treat the issue of what you want from your matches as an adverb like #40 (comment), e.g., starts_with(query).labels(), starts_with(query).suffixes(), starts_with(query).values(), starts_with(query).pairs(). I'm now fully on board with this idea.

Oh, I've backtracked from that (partly) 😅. The reason being it complicates implementation while not bringing much in terms of ergonomics.

Warning: Information dump inbound

To illustrate the problem, let's look at starts_with(label).suffixes().labels().
With this chain, we could eliminate 3 top-level APIs (starts_with_labels, suffixes_of, suffixes_of_labels) in exchange for 2 modifiers (suffixes and labels). That's a pretty sweet deal, right? Maybe we can get more flexibility, even if you've got to type a bit more?

But let's think about how these modifiers get applied:
Are they methods on top of the specific iterators (i.e. PostfixIter here)? Or are they blanket implementations for iterators that return NodeIter?

The second sounds far more flexible (you can then use Iterator methods like filter before converting to labels). But consider performance: if you want to create an optimized label collector, you can't assume the NodeIters haven't been modified, or you need to communicate that assumptions will be made that result in data loss. Suddenly, it's not as flexible as it seemed.

Then let's look at making .labels() a method on PostfixIter. Well, if it's going to consume PostfixIter anyways, you might as well make it a method on Trie and save typing out some characters.

Okay, so .labels() is out, but what about keeping the modifier .suffixes()?
Consider what must happen for suffixes to drop the original label:
If starts_with is creating node iterators with a range of { start: 1, end: <some_node> }, .suffixes() will need to turn start into the last node of the label.
But where does this information come from? We would have to store that information in the NodeIter, so .suffixes() could move it into start.

That sounds gross: we are now storing an extra field that not all users of NodeIter will use.
But maybe we can approach this from the other direction?
Since starts_with will always set start to 1, we could just replace starts_with with suffixes_of and replace suffixes with include_prefix (or some name).
(Personally, I think this is unintuitive, but let's focus on the technical aspects for now)

Let's think about how this modifier ties into suffixes_of and suffixes_of_labels (since it was established above that we need both options for performance).
The include_prefix option works fine with suffixes_of as a trait implementation, but since suffixes_of_labels returns a custom optimized iterator, include_prefix would have to be a method on it that modifies it.
Plus, suffixes_of_labels already does some work for initialization, so modifying it afterwards would waste that work.
So it would be better to just create a top-level method on Trie, and if you are going to do that, you might as well be consistent, make starts_with() instead of suffixes_of().include_prefix(), and type less.

tl;dr it's better to just have more options on Trie and not worry about redundancy here

I'm now planning on removing .labels() and .pairs() from extension traits, in order to encourage users to use *_labels() and *_pairs() variants. If they really need to do some sort of filtering or modification, they can just use .map(|node| node.label()) instead.

By the way, I should have said this from the start, but I really like your PR's set of checkboxes that shows the intent and progress. I think I'm going to adopt that in the future. It's incredibly helpful.

Thanks, I use them to help keep track of what I need to get done, but I'm glad it's useful for you too.

Anyways, I've got a ton of schoolwork I need to get to, so I'm going to have to put this on pause till Saturday.

@shanecelis (Collaborator) commented Mar 7, 2025

  • By "Path", I'm thinking along terms in graph theory.

I'm fine with path being used consistently in the library. I just hope not to expose the user to it unless it pays for itself.

Warning: Information dump inbound

I welcome it. :)

To illustrate the problem, let's look at starts_with(label).suffixes().labels(). With this chain, we could eliminate 3 top-level APIs (starts_with_labels, suffixes_of, suffixes_of_labels) in exchange for 2 modifiers (suffixes and labels). That's a pretty sweet deal, right? Maybe we can get more flexibility, even if you've got to type a bit more?

What I want is perhaps more limited adverbs. These are the uses I imagine:

starts_with(label) // ???
starts_with(label).labels() // Iterator<Item = L>
starts_with(label).pairs() // Iterator<Item = (L, &Value)>
starts_with(label).values() // Iterator<Item = &Value>

starts_with(label).suffixes() // Iterator<Item = L>

They would not permit .labels().suffixes() because, you're right, there's a data-plumbing problem there.

Okay, so .labels() is out

I didn't want to contend with this point without getting my hands a little dirty, so based off of your work I made a little branch "label-adverb" to exercise some ideas. And I figured out what I want to suggest as the principal output type that might unify what we're doing. Here it is: impl Iterator<Item = NodeRef>. Running with your NodeRef, we can point to the things we're interested in in the trie without having to reify a label, which is great and allows us to do some things a lot more performantly, like counting the number of matches without generating labels.

starts_with(label) // Iterator<Item = NodeRef>

In doing that I have an answer for the following issue you cite:

what about keeping the modifier .suffixes()? Consider what must happen for suffixes to drop the original label: If starts_with is creating node iterators with a range of { start: 1, end: <some_node> }, .suffixes() will need to turn start into the last node of the label. But where does this information come from? We would have to store that information in the NodeIter, so .suffixes() could move it into start.

In my branch that information is provided by the StartsWith struct which is an iterator of NodeRefs. It keeps its start position handy purely in the event a user requests .suffixes().

That sounds gross: we are now storing an extra field that not all users of NodeIter will use.

Agreed.

Let's think about how this modifier ties into suffixes_of and suffixes_of_labels (since it was established above that we need both options for performance).

suffixes_of kind of broke my brain, because I had intended prefixes_of(label) to show me the matches within the given label, which does seem pretty confusing. I get what it means after looking at how it behaves: it's starts_with2(label).suffixes(). So this is another case where I got my hands dirty and added matches_within() which permits a similar set of adverbs:

matches_within(label) // Iterator<Item = NodeRef>
matches_within(label).labels() // Iterator<Item = L>
matches_within(label).pairs() // Iterator<Item = (L, &Value)>
matches_within(label).values() // Iterator<Item = &Value>
// No suffixes. They don't even make sense here.
//matches_within(label).suffixes() // Iterator<Item = L>

tl;dr it's better to just have more options on Trie and not worry about redundancy here

I appreciate you expressing your arguments with clarity. However, I am not convinced it's worth adding each output variant (or adverb) to Trie. I hope my sketch of the code in my branch, which does not delete any of your code and merely uses unhappy names like starts_with2(), can convince you that the adverb path can be practical.

The other thing this code affords is that it makes the "happy path" less cumbersome (subsuming the .filter_map(Result::ok)) while not preventing the user from dealing with those errors when they must: trie.starts_with(label).map(|node_ref| node_ref.label()).

The code in my branch is not production ready. It currently returns Value and not &Value for implementation convenience but that's not actually satisfactory. And I believe we can create an iterator extension to give us labels(), pairs(), and values() without duplicating the impl like I've done.

I'm now planning on removing .labels() and .pairs() from extension traits, in order to encourage users to use *_labels() and *_pairs() variants. If they really need to do some sort of filtering or modification, they can just use .map(|node| node.label()) instead.

Expand on this code, .map(|node| node.label()): what are you mapping from?

Anyways, I've got a ton of schoolwork I need to get to, so I'm going to have to put this on pause till Saturday.

Understood. Good luck to you on your schoolwork. As with many things open source, no one's paying you to hurry. ;)

Oh by the way, I saw from your repos that you're a Bevy user. Me too. It's a fun game engine.

@JonahPlusPlus (Author) commented Mar 8, 2025

I didn't want to contend with this point without getting my hands a little dirty, so based off of your work I made a little branch "label-adverb" to exercise some ideas.

Nice, that's really helpful.

Your StartsWith struct runs into the same problem my experiments with .labels() did: it suffers in performance compared to the original implementation because it doesn't de-duplicate work done across each label.

From my tests, it was going from something like 6ms to 80ms for predictive search, so more than 10x slower. (In hindsight, I should have led with discussing how benchmarks affected my decision to go away from this)

Now, if I was implementing this library from scratch, I probably wouldn't care about optimization here, but given that this would be a regression for users, I really don't want to make Trie's API smaller just to embrace the iterator aesthetic.

Another (minor) issue: since the methods are implemented on the iterator itself instead of a trait, there's not much benefit in ergonomics. Now if you want suffixes, you have to write: starts_with(label).suffixes() instead of suffixes_of_labels(label), but you can't do anything before .suffixes() that justifies typing more out.

I think that the reason for keeping them separate becomes more obvious when one tries to optimize the label collectors.

The other thing this code affords is that it makes the "happy path" less cumbersome (subsuming the .filter_map(Result::ok))

I'm against this. I don't think libraries should try to hide errors from users just for simplicity's sake. However, avoiding error handling of infallible labels is something I'm working on improving. I redefined parts of TryFromTokens to look like:

type Result;
type Zip<Other>;
fn zip<Other>(this: Self::Result, other: Other) -> Self::Zip<Other>; 

Where Self::Result is the type returned from try_from_tokens and try_from_reverse_tokens.
(The zip items are for *_pairs() methods)
What this means is that for most label types, the implementation can be:

type Result = Self;
type Zip<Other> = (Self, Other);
fn zip<Other>(this: Self::Result, other: Other) -> Self::Zip<Other> {
  (this, other)
}

And for fallible labels (like TryFromTokens<u8> for String) it can be:

type Result = Result<Self, FromUtf8Error>;
type Zip<Other> = Result<(Self, Other), FromUtf8Error>;
fn zip<Other>(this: Self::Result, other: Other) -> Self::Zip<Other> {
  this.map(|l| (l, other))
}

So for infallible labels like Vec<_> and Box<[_]>, there is no need to handle errors, but you do for strings built from bytes. (The only downside is having to specify the label type at the call site instead of at the variable declaration, e.g. .suffixes_of_labels::<String>(label).)
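
At the call site, that looks something like this (signatures assumed):

use std::string::FromUtf8Error;

// Infallible label type: no Result to unwrap.
let labels: Vec<Vec<u8>> = trie.suffixes_of_labels::<Vec<u8>>("app").collect();

// Fallible label type (UTF-8 decoding): the error is surfaced, not hidden.
let strings: Vec<Result<String, FromUtf8Error>> =
    trie.suffixes_of_labels::<String>("app").collect();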

Expand on this code .map(|node| node.label()) what are you mapping from?

A node returned from NodeIter::next_back(), though I guess it should just be .map(|nodes| nodes.label()) where we call a method NodeIter::label().

Understood. Good luck to you on your schoolwork. As with many things open source, no one's paying you to hurry. ;)

Ah thanks, it's more a reminder to me to focus on classes. I really want to get this PR merged soon so I can get back to implementing my profanity filter.

Oh by the way, I saw from your repos that you're a Bevy user. Me too. It's a fun game engine.

I love it 😊. I plan on working on it again after I complete my web app.

@JonahPlusPlus (Author)

I'm tempted to merge set::Trie and map::Trie, since it would simplify implementation so much.

For instance, I currently have set::Trie::inc_search() return a regular IncSearch which returns NodeRefs, so to be consistent, I would have to make a set::IncSearch that returns KeyRefs.

All this could be avoided by being fine with Trie<Token, ()>, but I guess it would make insertion weird...

Oh well, for now I'll just make variants and one day merge them after specialization or something arrives.

@shanecelis (Collaborator) commented Mar 8, 2025

I'm tempted to merge set::Trie and map::Trie, since it would simplify implementation so much.

I'm actually really happy you mentioned this. This library was originally built off of set::Trie, and I added map::Trie, which used set::Trie as its backend. Now, you can see that has switched, so map::Trie is the backing implementation. But I'd like to switch it back so set::Trie is the real deal.

Why? Because we're not really as space-efficient as we ought to be. I'd encourage you to look at the v0.2.0 that Sho originally wrote. It's instructive, but I'll point out what its entry struct looked like:

struct TrieLabel<Label> {
    label: Label,
    is_terminal: bool,
}
/// A trie for sequences of the type `Label`.
pub struct Trie<Label> {
    louds: Louds,

    /// (LoudsNodeNum - 2) -> TrieLabel
    trie_labels: Vec<TrieLabel<Label>>,
}

What I take issue with is is_terminal. I figured replacing that with Option<T> wasn't a bad deal, since Option<()> takes up the same space as bool, but it is a whole byte for every token. Instead, what I'd like to do is restore set::Trie as the real backing trie with something like this:

//struct TrieLabel<Label> {
//    label: Label,
//    is_terminal: bool,
//}
/// A trie for sequences of the type `Label`.
pub struct Trie<Label> {
    louds: Louds,

    /// (LoudsNodeNum - 2) -> TrieLabel
    trie_labels: Vec<Label>,
    terminal_bits: BitVec<u8, Lsb0>,
}

And then we could compactly store the values, which can take up way more than a byte per entry.

/// A trie for sequences of the type `Label`. (This one lives in the `map` module.)
pub struct Trie<Label, Value> {
    trie: set::Trie<Label>,
    /// (LoudsNodeNum - 2) -> value index
    node_to_index: Vec<usize>,
    values: Vec<Value>,
}
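
For concreteness, a hypothetical value lookup through that layout (is_terminal standing in for an assumed query over terminal_bits):

impl<Label, Value> Trie<Label, Value> {
    fn value(&self, node: LoudsNodeNum) -> Option<&Value> {
        let i = (node.0 - 2) as usize;
        self.trie
            .is_terminal(i) // assumed accessor over set::Trie's terminal_bits
            .then(|| &self.values[self.node_to_index[i]])
    }
}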

Now, you obviously do not need to do any of this work, but now you know where I want to go. So if it helps, feel free to drop implementation consideration of either set::Trie or map::Trie for the time being. I'm happy to pick up and follow my dream on this after the merge.

@shanecelis (Collaborator)

From my tests, it was going from something like 6ms to 80ms for predictive search, so more than 10x slower. (In hindsight, I should have led with discussing how benchmarks affected my decision to go away from this)

Dang. All right. You've put me off having Iterator<Item = NodeRef> as our principal output type. However, in that case I'd just push the machinery into StartsWith so that we can have our performance and API affordances. So maybe StartsWith would simply encode our query:

struct StartsWith<Token> {
    query: Vec<Token>,
    // ...
}

Maybe it impls Iterator or IntoIterator of Item = NodeRef but then delegates to our performant implementations for each of the output APIs. One other thing to consider for map::Trie is we may want a starts_with_mut(), so the proliferation of APIs could get even bigger.

Another (minor) issue: since the methods are implemented on the iterator itself instead of a trait, there's not much benefit in ergonomics.

I am happy for them to be implemented wherever it makes sense, as long as it preserves the fluent API where the user chooses the search operation and then the output format.

Now if you want suffixes, you have to write: starts_with(label).suffixes() instead of suffixes_of_labels(label), but

To me these look like different operations. It'd be nice to present to the user that there are essentially two searches: starts_with and matches_within. I think that'd be a big improvement in naming.

you can't do anything before .suffixes() that justifies typing more out.

I don't follow. Do you mean there's nothing you can do with a trie.starts_with(label)? Because I can see a lot of things you can do:

let count = trie.starts_with(label).count();

I'm against this. I don't think libraries should try to hide errors from users just for simplicity's sake.

In general, I agree with you. The .filter_map(Result::ok) everywhere just sticks in my craw. Perhaps we can have an affordance of .labels_ok() or something.

@JonahPlusPlus (Author) commented Mar 8, 2025

I don't follow. Do you mean there's nothing you can do with a trie.starts_with(label)?

What I mean is that there is no iterator chaining possible when .suffixes() is implemented for the iterator itself (e.g. trie.starts_with(label).filter(shorter_than_50_chars).suffixes()). This means that .suffixes() is akin to a builder pattern, where it just consumes the StartsWith iterator.

This relates to the following:

However, in that case I'd just push the machinery into StartsWith so that we can have our performance and API affordances.

I would like to avoid this, since it introduces overhead for something that may be called frequently (a heap allocation happens when calling .starts_with() and then it has to reallocate for .suffixes() when the initial allocation can be avoided with a direct method that creates a suffix iterator). I think builder patterns should be relegated to one-off operations, like building the trie.

It's a small amount of overhead (in the case of prefixes_of, it can be avoided entirely when converting to an optimized label collector), but for something like starts_with(label).suffixes().labels(), it would be allocating 3 times, which I suppose is negligible when compared to all the labels being allocated, but seems unnecessary when it's just to avoid having suffixes_of_labels(label).

If you think this low overhead is worth it, then I have no issue nesting variants under each iterator.

One other thing to consider for map::Trie is we may want a starts_with_mut(), so the proliferation of APIs could get even bigger.

I'm not sure starts_with_mut is possible, at least with starts_with() iterating over NodeIters. We wouldn't be able to construct a NodeIterMut since they would overlap (even if we do a partial borrow of the trie). And iterating over NodeMut wouldn't work, since NodeMut has methods like children, which may return nodes that are also being returned by starts_with_mut, possibly leading to memory corruption. You would need to make some other mutable reference type that can't get access to any other node (NodeMutPtr?) or some alternative to iterators that ensures each item is dropped before the next.

It's not like a possible children_mut, which could work since the parent has exclusive access, and none of the children have overlapping paths.

But you are right: if more APIs get added, it will fill up the top level with many variants. I'm just not sure if this is a bad thing. On one hand, putting them all on the top level makes them easy to find; on the other hand, the difficulty of finding them across each search iterator can be mitigated with documentation. So to me, it's really just a matter of overhead.

In general, I agree with you. The .filter_map(Result::ok) everywhere just sticks in my craw.

Agreed; I'm interested in how you'll see my solution to this when I upload it.

@JonahPlusPlus (Author)

Perhaps a better naming convention would be sufficient? Stuff like starts_with_pairs doesn't really roll off the tongue TBH (starts_with was named before I tried creating the pair and label variants):

  • starts_with -> descendants
  • prefixes_of -> prefixes or ancestors
  • suffixes_of -> suffixes
  • starts_with_pairs -> descendant_pairs
  • prefixes_of_pairs -> prefix_pairs or ancestor_pairs
  • suffixes_of_pairs -> suffix_pairs

And then similar *_labels variants.

@JonahPlusPlus (Author) commented Mar 9, 2025

Actually, looking back now, I suppose my concerns about avoiding overhead were misplaced. The search iterators are likely going to do re-allocations anyways, so doing a couple in the beginning hardly matters. I'll start work on moving variants to the search iterators, instead of on Trie. Sorry about any confusion!

Edit: Actually, I'm still not sure if it's a good idea, even if the overhead is negligible, due to implementation complexity. It is pretty straightforward to nest variants for map::Trie, but for set::Trie, I was wrapping methods in Labels or Keys converters; if we put the variants on the iterator objects themselves, then we would have to create a wrapper for each iterator, instead of the blanket implementation.

e.g. set::Trie::starts_with would return SetPostfixIter that wraps PostfixIter so it can implement a .labels() that wraps .pairs() and a .suffixes() that returns a wrapped PostfixIter set to suffixes.

Plus, writing out suffixes_of_pairs() as starts_with().suffixes().pairs() irks me more now that I see it in the tests.

I guess I'll finish the changes and it can always be reverted...

Edit 2: I might be able to implement specific versions of the Labels and Keys wrappers to get around that issue. (I probably just need some sleep 😅)

@shanecelis (Collaborator) commented Mar 10, 2025

  • starts_with -> descendants
  • prefixes_of -> prefixes or ancestors
  • suffixes_of -> suffixes
  • starts_with_pairs -> descendant_pairs
  • prefixes_of_pairs -> prefix_pairs or ancestor_pairs
  • suffixes_of_pairs -> suffix_pairs

I feel like these APIs take an implementation-specific view. We know it's a tree. We understand how it gets its runtime performance by using the prefixes, but for a lot of people "a trie" is a container of strings that happens to be fast. The fact that it's a tree is an implementation detail they may not know or may not care to know. So what I'd like to do is present an API that is natural for a big container of strings, but that also, since this library is generic over its tokens and labels, works with generic types. That's why I'm kind of jazzed about starts_with and matches_within. This is your PR, so do what you think is best, but I'll probably exercise some editorial control after the merge.
