-
Notifications
You must be signed in to change notification settings - Fork 21
Label trait + children methods + QOL changes
#40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Some preliminary benchmarks (just to get a sense of where we are in performance before I make more changes): Overall, performance seemed to improve (hard to tell when the benchmarks can't be compared by criterion due to the weird naming convention). I'll have to update the benchmarks to use criterion 0.5 and remove the git hash in the names (instead opting for named baselines). |
|
This is looking good. I really like that this doesn't interfere with the using the API with strings. I think the type annotation on builder is a fine concession. One consideration I've had in the back of my mind, and this can be reserved as a separate PR, is that the |
|
I think changing "postfix" to "suffix" is good. Perhaps we can reconsider some of our search names as well. Might as well bundle up the breaking changes. I'm not requesting you do this work, just soliciting your opinion on these name changes:
|
|
Making a note (in case feedback is needed): The change is nested iterators with a new pub struct NodeIter<'t, Token, Value> {
pub(crate) trie: &'t Trie<Token, Value>,
pub(crate) start: LoudsNodeNum,
pub(crate) end: LoudsNodeNum,
}This is a double-ended iterator (optimized for reverse iteration). The idea is simple: pop a node (go to the parent) from the end until you reach the start. In the case of forward iteration, this means a bit of work (you have to go through many nodes until you reach the start and repeat this while incrementing start), but it's trivial for a reverse iterator (meaning that some collectors will need to be optimized to use reverse iteration and array reversal under the hood). So instead of search iterators returning Then an extension trait will be used to allow people to call Edit: this will also allow expanding |
|
@shanecelis I'm considering dropping "postfix_search" in favor of changing:
And adding a new My rationale: "postfix_search" is almost the same as "predictive_search", it just drops the common prefix from the results. Rather than maintain two separate APIs that do pretty much the same thing, it might be better to merge them and create a modifier that opts into "postfix_search" behavior. Plus, this would mean more flexibility, since How does this sound? Edit: |
|
I like the idea of refining the API so that it has options like you suggest. I'm also ok with maybe dropping postfix_search. I'm not sure it had a real use case. It was more like an implementation detail that got exposed. We could deprecate it and see if anyone complains, or you can do like you're suggesting. I see the thinking behind How about instead |
|
I did consider "predictive_search" -> "start_with" and "common_prefix_search" -> "prefixes_of", but that's when I was also considering "postfix_search" -> "suffixes_of". However, now I'm thinking I should just include all three methods and just drop (I've probably been too keen on merging APIs when I should be merging their logic) tl;dr I think I'll go with "predictive_search" -> "start_with", "common_prefix_search" -> "prefixes_of", and "postfix_search" -> "suffixes_of". |
|
A note for future work: It would simplify implementations so much, but alas. Edit: Proof of concept Edit 2: An even more advanced proof of concept that is DRY over mutability At this point, the maintainability/extensibility is questionable, still, it's hella cool that this might be possible one day. |
|
Major regressions with the changes to search iterators when getting labels (before it used a unified iterator to accumulate many labels; now it runs it separately per label i.e. Gonna have to make a specialized label allocator that shares a buffer between labels and checks differences between node ranges. If these regressions can't be removed, I'll add back the previous implementations as special methods on |
|
What does Is it a performance regression? |
So I changed the search iterators to return iterators over nodes rather than the labels or So, I'm just going to add the old methods back in as shortcuts, and see if I can improve performance while I'm at it (i.e. |
|
Treating the label collector as a special case has fixed the regression for
Almost done... Should be done by this weekend (schoolwork permitting). |
|
Example:
We get the following matches:
A better name for this might be "next_path", since what it is doing is looking for a path subgraph following some prefix. Edit: Actually, the Edit 2: "path_of" sounds like a good name. "next_path" sounds like it might return the path subgraph after a label, but this really returns the shortest path that the label lies on, which means an exact match returns instead of looking for the next entry. |
|
You've got to think of it in the context of tab completion, where what you want is to solicit the user for which branch to follow. Perhaps Let's take a step back and look at how we could build this based on other parts of our API to see if there's a clearer name for what we're doing. What would you name the |
|
Probably just ".shortest()", but that doesn't provide enough information for a top-level API method. Maybe "shortest_suffix"? ("shortest_suffix_or_match" would be more precise, but that's too long IMO) |
I see. You're saying "suffix" because we're traveling down the tree. Interesting.
I see what you're thinking. Good point. I have conflated the term "prefix" because it is a prefix trie.
Let's figure out what to call our pieces and discuss how we currently handle them.
Gah, I just had to remind myself that not every node in a trie is a prefix node.
We've had a need to deal with these concepts more coherently, and I think your PR is providing the thrust for this much needed clarity, and I THANK YOU FOR THAT. And I'm beginning to see the value too of returning a I'd like to pare down the concepts we use that will cover our domain. I'm hoping that we can mostly write our user-facing API in terms of these things:
I am happy to hear and consider what terms you'd choose. Terms? Maybe change Maybe we can do like you suggested and drop
Yeah, An asideOne other thing I've been considering is do we want to change to something like this: I've just been feeling like there have been other parts of the API where saying |
|
By the way, I should of said this from the start, but I really like your PR's set of checkboxes that shows the intent and progress. I think I'm going to adopt that in the future. It's incredibly helpful. |
I think your definitions are good. As far as changes/clarifications go:
Oh, I've backtracked from that (partly) 😅. The reason being it complicates implementation while not bringing much in terms of ergonomics. Warning: Information dump inbound To illustrate the problem, let's look at But let's think about how these modifiers get applied: The second sounds far more flexible (you can then use Well, then let's look at making Okay, so That sounds gross: we are now storing an extra field that not all users of Let's think about how this modifier ties into tl;dr it's better to just have more options on I'm now planning on removing
Thanks, I use them to help keep track of what I need to get done, but I'm glad it's useful for you too. Anyways, I got a ton of schoolwork I've need to get to, so I'm going to have to put this on pause till Saturday. |
I'm fine with path being used consistently in the library. I just hope not to expose the user to it unless it pays for itself.
I welcome it. :)
What I want is perhaps more limited adverbs. These are the uses I imagine: starts_with(label) // ???
starts_with(label).labels() // Iterator<Item = L>
starts_with(label).pairs() // Iterator<Item = (L, &Value)>
starts_with(label).values() // Iterator<Item = &Value>
starts_with(label).suffixes() // Iterator<Item = L>They would not permit
I didn't want to contend with this point without getting my hands a little dirty, so based off of your work I made a little branch "label-adverb" to exercise some ideas. And I figured out what I want to suggest as the principle output type that might unify what we're doing. Here it is: starts_with(label) // Iterator<Item = NodeRef>In doing that I have an answer for the following issue you cite:
In my branch that information is provided by the
Agreed.
matches_within(label) // Iterator<Item = NodeRef>
matches_within(label).labels() // Iterator<Item = L>
matches_within(label).pairs() // Iterator<Item = (L, &Value)>
matches_within(label).values() // Iterator<Item = &Value>
// No suffixes. They don't even make sense here.
//matches_within(label).suffixes() // Iterator<Item = L>
I appreciate you expressing your arguments with clarity. However, I am not convinced it's worth adding each output variant (or adverb) to The other thing that this code affords is it does make the "happy path" less cumbersome—subsuming the The code in my branch is not production ready. It currently returns
Expand on this code
Understood. Good luck to you on your schoolwork. As with many things open source, no one's paying you to hurry. ;) Oh by the way, I saw from your repos that you're a Bevy user. Me too. It's a fun game engine. |
Nice, that's really helpful. Your From my tests, it was going from something like 6ms to 80ms for predictive search, so more than 10x slower. (In hindsight, I should have led with discussing how benchmarks affected my decision to go away from this) Now, if I was implementing this library from scratch, I probably wouldn't care about optimization here, but given that this would be a regression for users, I really don't want to make Another (minor) issue: since the methods are implemented on the iterator itself instead of a trait, there's not much benefit in ergonomics. Now if you want suffixes, you have to write: I think that the reason for keeping them separate becomes more obvious when one tries to optimize the label collectors.
I'm against this. I don't think libraries should try to hide errors from users just for simplicity's sake. However, avoiding error handling of infallible labels is something I'm working on improving. I redefined parts of type Result;
type Zip<Other>;
fn zip<Other>(this: Self::Result, other: Other) -> Self::Zip<Other>; Where type Result = Self;
type Zip<Other> = (Self, Other);
fn zip<Other>(this: Self::Result, other: Other) -> Self::Zip<Other> {
(this, other)
}And for fallible labels (like type Result = Result<Self, FromUtf8Error>;
type Zip<Other> = Result<(Self, Other), FromUtf8Error>;
fn zip<Other>(this: Self::Result, other: Other) -> Self::Zip<Other> {
this.map(|l| (l, other))
}So for infallible labels like
A node returned from
Ah thanks, it's more a reminder to me to focus on classes. I really want to get this PR merged soon so I can get back to implementing my profanity filter.
I love it 😊. I plan on working on it again after I complete my web app. |
|
I'm tempted to merge For instance, I currently have All this could be avoided by being fine with Oh well, for now I'll just make variants and one day merge them after specialization or something arrives. |
I'm actually really happy you mentioned this. This library was originally built off of Why? Because we're not really as space efficient as we ought to be. I'd encourage you to look at the v0.2.0 that Sho originally wrote. Its instructive but I'll point out what its entry struct looked like: struct TrieLabel<Label> {
label: Label,
is_terminal: bool,
}
/// A trie for sequences of the type `Label`.
pub struct Trie<Label> {
louds: Louds,
/// (LoudsNodeNum - 2) -> TrieLabel
trie_labels: Vec<TrieLabel<Label>>,
}What I take issue with is //struct TrieLabel<Label> {
// label: Label,
// is_terminal: bool,
//}
/// A trie for sequences of the type `Label`.
pub struct Trie<Label> {
louds: Louds,
/// (LoudsNodeNum - 2) -> TrieLabel
trie_labels: Vec<Label>,
terminal_bits: BitVec<u8, Lsb0>,
}And then we could compactly store the values, which can take up way more than a byte per entry. /// A trie for sequences of the type `Label`.
pub struct map::Trie<Label, Value> {
trie: set::Trie<Label>,
/// (LoudsNodeNum - 2) -> value index
node_to_index: Vec<usize>,
values: Vec<Value>,
}Now, you obviously do not need to do any of this work, but now you know where I want to go. So if it would help you to drop implementation consideration of either |
Dang. All right. You've put me off having struct StartsWith<Token> {
query: Vec<Token>,
// ...
}Maybe it impls
I am happy for them to be implemented wherever it makes sense that preserves the fluent API where the user chooses the search operation and then the output format.
To me these look like different operations. It'd be nice to present to the user that there are essentially two searches:
I don't follow. Do you mean there's nothing you can do with a let count = trie.starts_with(label).count();
In general, I agree with you. The |
What I mean is that there is no iterator chaining possible when This relates to the following:
I would like to avoid this, since it introduces overhead for something that may be called frequently (a heap allocation happens when calling It's a small amount of overhead (in the case of If you think this low overhead is worth it, then I have no issue nesting variants under each iterator.
I'm not sure It's not like a possible But you are right, if more APIs get added, it will fill up the top level with many variants. I'm just not sure if this is a bad thing. On one hand, putting them all on the top level makes them easy to find, but on the other hand, finding them across each search iterator can be mitigated with documentation. So to me, it's really just a matter of overhead.
Agreed; I'm interested in how you'll see my solution to this when I upload it. |
|
Perhaps a better naming convention would be sufficient? Stuff like
And then similar |
|
Actually, looking back now, I suppose my concerns about avoiding overhead were misplaced. The search iterators are likely going to do re-allocations anyways, so doing a couple in the beginning hardly matters. I'll start work on moving variants to the search iterators, instead of on Edit:
I guess I'll finish the changes and it can always be reverted... Edit 2: I might be able to implement specific versions of the |
I feel like these APIs take an implementation specific view. We know it's a tree. We understand how it gets its runtime performance by using the prefixes, but for a lot of people "a trie" is a container of strings that happens to be fast. The fact that it's a tree is an implementation detail they may not know or may not care to know, so what I'd like to do is present an API that is natural for having a big container of strings but also since this library is generic over its token and labels works with a generic type. That's why I'm kind of jazzed about |
Fixes #39 (and supersedes #36 and #37)
Label<Token>trait for iterating over tokens and implement it over common types and blanket types.pushandinsertbuilder methods together.impl AsRef<[Label]>to theLabel<Token>trait.IntoLabel<Token>extension trait andLabelIter<Token>wrapper for convertingimpl Iterator<Item = Token>to aLabel<Token>object.childrenmethods toTrieandIncSearchthat iterates over the children at a prefix.LabeltoToken(so thatLabelrefers to a string of tokens andTokenrefers to a single unit in the trie)exact_matchtois_exact, to mirroris_prefix).<mod>/mod.rs; renameinternal_data_structureto justinternal)Vecs outside of loops and reuse memoryVec<Node>toBox<[Node]>)Possible changes (might be left for future work):
TryFromIterator<Label, M>by making them return slices of the input or exact indicesAnsweroutside ofinc_searchand rename toLabelKind, and addNodeRef::kind(_)getmethods that returnNodeRefobjects (to mirror APIs that exist in the ecosystem)Improves API ergonomics by depending on an in-crate trait, instead of the previous
AsReftrait and introduces new accessors for children nodes.Why
Label<Token>and notIntoIterator<Item = Token>?Then this crate would be locked out of specific implementations like
impl Label<u8> for char, which allows for incremental searches ofTrie<u8, _>with achar, due to orphan rules.Instead, if a custom iterator needs to be used, users can call
into_label()to put it into a generic wrapper (which should just get optimized away).Drawbacks
This does mean type annotations are sometimes needed when building tries (e.g.
&strimplementsLabel<u8>andLabel<char>).Future Work
Implement
children_mutmethods onNodeMut(needs an unsafe implementation to get around borrow checker).Explore alternative vector implementations like
smallvec(maybe feature flagged?) to avoid heap allocations.Investigate inconsistency of checks: