Randomize order of inputs from OutputSweeper
#4033
For a marginal increase in privacy, we can randomize the inputs from the `OutputSweeper`. Since we don't depend on `rand` to randomize the order I just put the elements into a hashset, then back into a Vec. This should give us enough randomness without having to introduce a new dep or make the `OutputSweeper` depend on the `EntropySource`.
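The round trip described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `reorder_via_hashset` is hypothetical, and the element type only needs `Hash + Eq`.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical helper: moves the items into a `HashSet` and back into a `Vec`.
/// The resulting order follows the set's internal iteration order, which
/// depends on the per-instance keys of the default hasher.
/// Side effect: duplicate items are collapsed into one.
fn reorder_via_hashset<T: Hash + Eq>(items: Vec<T>) -> Vec<T> {
    let set: HashSet<T> = items.into_iter().collect();
    set.into_iter().collect()
}
```

Note that the output order is not specified by the standard library, so callers should rely only on the set of elements being preserved (minus duplicates), not on any particular permutation.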
LGTM
I was digging into the difference between using `EntropySource`-based shuffling versus the current `Vec` → `HashSet` → `Vec` approach, and it seems like the latter might not be the best fit for what we’re aiming to do.
On Randomness:
- It’s not truly random. There’s no guarantee that all permutations are equally likely, which matters if the goal is to introduce meaningful entropy (for example, for privacy).
- Since the order comes from internal hashing behavior, there’s no way to seed or reproduce the order for testing.
- It replaces a well-understood and tested shuffling algorithm with a non-deterministic, opaque alternative. This might behave unpredictably across platforms or versions, which could potentially introduce subtle bugs over time.
If randomness does matter here, maybe it’s worth considering introducing `EntropySource` into `OutputSweeper` as a longer-term solution.
On Deduplication:
Also wondering: is it worth introducing an additional allocation (via `HashSet`) just to remove duplicates? If duplicates are actually a problem, maybe we can address that more explicitly, especially considering the relatively small size of the input in most cases.
Happy to discuss further if I’m missing context. Just wanted to flag these points as they came up while reviewing.
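For comparison, the entropy-driven alternative raised in this comment could look roughly like the Fisher-Yates sketch below. Everything here is an assumption for illustration: `RandomBytes` is a hypothetical stand-in trait (not LDK's actual `EntropySource`), `SeededSource` is a deterministic toy source used only to show reproducibility in tests, and the modulo reduction carries a small bias that a careful implementation would remove with rejection sampling.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for an entropy source that yields 32 bytes per call.
trait RandomBytes {
    fn random_bytes(&self) -> [u8; 32];
}

/// Deterministic toy source for tests only. NOT suitable for privacy.
struct SeededSource(Cell<u64>);

impl RandomBytes for SeededSource {
    fn random_bytes(&self) -> [u8; 32] {
        // Simple linear congruential step; only the high bits are exposed.
        let next = self.0.get()
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0.set(next);
        let mut out = [0u8; 32];
        out[..8].copy_from_slice(&(next >> 32).to_le_bytes());
        out
    }
}

/// Fisher-Yates shuffle: walk from the last index down, swapping each
/// position with an index chosen from the not-yet-fixed prefix.
fn shuffle<T, R: RandomBytes>(items: &mut [T], rng: &R) {
    for i in (1..items.len()).rev() {
        let bytes = rng.random_bytes();
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&bytes[..8]);
        // Modulo reduction is slightly biased; acceptable for a sketch.
        let j = (u64::from_le_bytes(buf) % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}
```

With a seedable source, two runs with the same seed produce the same order, which is the testability property the comment asks about. Unlike the `HashSet` round trip, this version shuffles in place and keeps duplicates.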
I agree with previous reviewers that we should probably just take on an `EntropySource` argument to `OutputSweeper`, especially since we're gonna need it when doing 'proper' batch randomization going forward.
I disagree here, I think using the default
To correct the original misconceptions:
It may not be guaranteed, no, but it is in practice. In practice the values are SipHash-1-3 of the values, seeded with an actual random input from
Wait, what is "proper" batch randomization?
Currently,
I don't see why that needs a "real" RNG either? Just a counter and hash would suffice for that (given the behavior is, first, "include all", then do random inclusion sets).
Yeah, we don't necessarily need a 'real' RNG for that, that's true.
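One way to read the "counter and hash" suggestion is sketched below. This is only an interpretation under stated assumptions: the names `CounterHashRng` and `select_subset` are hypothetical, the per-item coin flip is a guess at what "random inclusion sets" might mean, and the randomness comes entirely from the randomly keyed default hasher (`RandomState`) rather than from an `EntropySource`.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Hypothetical generator: hashes an incrementing counter with a randomly
/// keyed hasher to obtain pseudo-random 64-bit values.
struct CounterHashRng {
    state: RandomState,
    counter: u64,
}

impl CounterHashRng {
    fn new() -> Self {
        // `RandomState::new()` picks fresh random keys for this instance.
        Self { state: RandomState::new(), counter: 0 }
    }

    fn next_u64(&mut self) -> u64 {
        let mut hasher = self.state.build_hasher();
        hasher.write_u64(self.counter);
        self.counter = self.counter.wrapping_add(1);
        hasher.finish()
    }
}

/// Hypothetical inclusion step: keep each item when the lowest bit of the
/// next generated value is set.
fn select_subset<T: Clone>(items: &[T], rng: &mut CounterHashRng) -> Vec<T> {
    items
        .iter()
        .filter(|_| rng.next_u64() & 1 == 1)
        .cloned()
        .collect()
}
```

A caller following the described behavior would first try the full set and fall back to `select_subset` only on later attempts. The result can be empty, so a real implementation would need to handle that case.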
Gonna move forward with this, we can still improve things further when revisiting batch randomization.
Closes #3526