Imp/datasets #16
base: main
Conversation
Ok, so some thoughts on testing:
- We should have unit tests in the arrow segment. These should be fairly detailed (e.g. test that the focus works, including with struct arrays; see the sketch after this comment).
- We should have a unit test for the cache-only case.
- We should add a simple high-level integration test (see server/tests for examples). All we need to verify here IMO is that the data focus, when passed to the server, properly filters the data. This is tricky because the filter is currently additionally applied at the HTTP handler layer. We can't remove that (yet) because of the column size filtering. But it may be worth temporarily disabling that code somehow to verify that the filter makes it all the way from HTTP to the segment layer? There may be a better way to get at what we really want to test, like `cfg` directives.
- As far as manual testing goes, the best way is probably to use the CV benchmark as a dataset generator. See the `cv` binary in server/bench. This runs for a configurable period of time and generates a CV-like load. You could re-use the resulting `data` directory with the main plateau server, and then run a query on that (e.g. focus just the `time` column).
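As a rough illustration of the first point, a focus unit test for the arrow segment might look like the sketch below. The helper `test_segment_with_columns`, the `get_records_focused` method, and `column_names()` are hypothetical names used for illustration only; just `get_records`, `RowLimit`, and `Ordering` appear in the actual code under review.

```rust
// Hypothetical sketch of a focus unit test for the arrow segment. The helper
// `test_segment_with_columns`, the `get_records_focused` method, and
// `column_names()` are assumed names, not the real API.
#[cfg(test)]
mod focus_tests {
    use super::*;

    #[test]
    fn focus_restricts_returned_columns() {
        // Build a small segment with three flat columns (hypothetical helper).
        let segment = test_segment_with_columns(&["time", "label", "score"]);

        // Request two records, focused on the `time` column only.
        let records = segment
            .get_records_focused(
                0,
                RowLimit::records(2),
                Ordering::Forward,
                Some(vec!["time".to_string()]),
            )
            .expect("query should succeed");

        // Only the focused column should survive; the row count is unchanged.
        assert_eq!(records.column_names(), vec!["time"]);
        assert_eq!(records.len(), 2);
    }
}
```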
server/src/segment/arrow.rs
Outdated
.fields
.iter()
.enumerate()
.filter(|(_, field)| !exclude.contains(&field.name))
I don't think this works with nested datasets (e.g. StructArray)? Do we know how those are represented in these files?
Almost sure you're right; this is something I need to figure out. I plan to start building tests of various complexity and see where it fails.
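For context on the question above: in an Arrow schema, a struct column is a single top-level field whose children are nested inside its data type, so a name-based filter over the top-level fields never reaches the nested names. A minimal sketch, assuming the arrow2 crate (suggested by the `field.name` field access above, though the exact crate and version are an assumption):

```rust
// Minimal sketch, assuming the arrow2 crate; the crate choice is an assumption.
use arrow2::datatypes::{DataType, Field};

fn main() {
    // A nested "inference" column is one top-level struct field whose
    // children live inside its DataType, not alongside the other fields.
    let fields = vec![
        Field::new("time", DataType::Int64, false),
        Field::new(
            "inference",
            DataType::Struct(vec![
                Field::new("label", DataType::Utf8, false),
                Field::new("score", DataType::Float64, false),
            ]),
            false,
        ),
    ];

    // A name filter over the top level only ever sees "time" and "inference";
    // "label" and "score" are invisible to it.
    for field in &fields {
        println!("{}", field.name);
    }
}
```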
.get_records(next_index, RowLimit::records(2), Ordering::Forward)
.get_records(
    next_index,
    RowLimit::records(2),
Starting to think we need to package up these three (limits, ordering, focus) in some kind of higher-level struct (maybe `Query`? Don't love that name...). That way we could default / use the builder pattern to avoid the repetition throughout all these test modifications.
💯 How about `DataSelector`?
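For what that could look like, here is a minimal builder-style sketch. The field types, defaults, and import path are assumptions for illustration; only `RowLimit` and `Ordering` come from the existing code.

```rust
// Minimal sketch of the proposed selector; field types, defaults, and the
// import path are assumptions. `focus: None` stands in for "all columns".
use crate::segment::{Ordering, RowLimit}; // assumed module path

pub struct DataSelector {
    pub limit: RowLimit,
    pub ordering: Ordering,
    pub focus: Option<Vec<String>>,
}

impl Default for DataSelector {
    fn default() -> Self {
        Self {
            limit: RowLimit::records(usize::MAX), // assumed "no limit" default
            ordering: Ordering::Forward,
            focus: None,
        }
    }
}

impl DataSelector {
    pub fn limit(mut self, limit: RowLimit) -> Self {
        self.limit = limit;
        self
    }

    pub fn ordering(mut self, ordering: Ordering) -> Self {
        self.ordering = ordering;
        self
    }

    pub fn focus(mut self, columns: Vec<String>) -> Self {
        self.focus = Some(columns);
        self
    }
}
```

Tests could then write something like `segment.get_records(next_index, DataSelector::default().limit(RowLimit::records(2)))` instead of repeating all three arguments every time.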
97f9e87 to b64cf57
@AnIrishDuck Please review
Especially with the recent regression involving focus and nested datasets (see #23), I'd strongly suggest having some tests in src/segment/test.rs to verify that this works properly. You could probably even re-use the nested schema (`inferences_large_extension`) from that work.
6cba685 to f82f7d5
Aside from my comment above, I've looked at this a couple of times at this point and I think it's good to go once that is addressed.
No description provided.