flux-fsck: add --lost-and-found option #6953
I forgot that the job manager already was using
Seems like repair should be the default, but maybe at the end before checkpointing we should get confirmation, either a Y from the user if stdin is a tty, or require
Perhaps "repair" instead of "recovery".
Reminder: we should require that the KVS is not running like we do in
I thought about that, but made it an option b/c normally an
Edit: I suddenly have a recollection that we had an offline discussion about possibly trying to work on the KVS without having the KVS loaded. Very doable, but a little tricky. Is this what you were thinking for this lost+found?
Edit2: Yeah, the offline KVS "fixes" are what we should be doing.
Originally my thought was that since we weren't changing anything among the "real KVS data", checkpoint confirmation wasn't necessary, i.e. the only change is lost+found. Is your thinking that the Y/n confirmation is just to make sure the user is aware corruption was found and repaired?
Edit: If we go with the offline KVS update, then the need for the Y/n is more obvious, as we're circumventing the normal KVS machinery.
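The confirmation step discussed above could look roughly like the following. This is a minimal sketch in Python for brevity, not the flux-fsck implementation (which is C). The function name `confirm_checkpoint` and the `assume_yes` parameter are hypothetical; `assume_yes` merely stands in for whatever non-interactive override ends up being required when stdin is not a tty.

```python
import sys


def confirm_checkpoint(stdin=sys.stdin, stdout=sys.stdout, assume_yes=False):
    """Return True if the repaired data may be checkpointed.

    assume_yes is a hypothetical stand-in for a non-interactive override.
    """
    if assume_yes:
        return True
    if not stdin.isatty():
        # Nobody to ask: refuse rather than checkpoint silently.
        return False
    stdout.write("Corruption was found and repaired. Checkpoint? [y/N] ")
    stdout.flush()
    answer = stdin.readline().strip().lower()
    # Anything other than an explicit yes (including EOF) means no.
    return answer in ("y", "yes")
```

Defaulting to "no" on an empty answer or EOF keeps the operation conservative, which seems in the spirit of the concern raised about bypassing the normal KVS path.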
6118a34 to 1323ede
just pushed a WIP update
Problem: In the near future flux-fsck will repair data and move it to a lost+found directory. This conflicts with the lost+found directory used in the job-manager. Rename the lost+found directory in the job-manager to job-lost+found. Update one test in t/t2219-job-manager-restart.t for directory name change.
Problem: A number of globals are used in flux-fsck. This is ok for the time being, but many more global "tracking" variables will be needed in the future. Refactor flux-fsck to have a single "context" that is passed around between functions. Remove all globals.
Problem: In the near future we will want to repair corrupted valref treeobjs. In order to do so efficiently, we would like to "save" the location of invalid indexes while we are fsck-ing the valref treeobj. Save the indexes of the missing references in a special missing_indexes list. This list is presently unused.
Problem: In the near future updates to the root treeobj may have to be done. However this root is not saved anywhere. Save the root treeobj to the primary context. This does not change any current flux-fsck behavior.
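The three refactoring commits above (a single context, a saved root treeobj, and a list of missing valref indexes) fit together roughly as sketched below. This is an illustrative Python sketch, not the C code in the PR: `FsckContext`, `check_valref`, the `errors` counter, and the use of a plain dict as the content store are all assumptions made for the example. Only the idea of a `missing_indexes` list and a saved root comes from the commit messages.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FsckContext:
    """Hypothetical stand-in for the single context passed between functions."""
    root: Optional[dict] = None      # root treeobj, saved for possible later updates
    errors: int = 0                  # assumed counter, replaces a former global
    missing_indexes: list = field(default_factory=list)


def check_valref(ctx, blobrefs, store):
    """Check one valref; remember which indexes have no blob in the store.

    Returns True if every blobref was found.
    """
    ctx.missing_indexes.clear()
    for i, ref in enumerate(blobrefs):
        if ref not in store:
            ctx.missing_indexes.append(i)
            ctx.errors += 1
    return not ctx.missing_indexes
```

Recording the indexes during the check means a later repair pass does not need to look up every blobref a second time.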
1323ede to 0508a5d
Problem: flux-fsck can identify corrupted KVS entries, but does not presently do anything to help users recover data. Support a --lost-and-found option. Corrupted KVS entries that can be repaired will be placed into the "lost+found" directory for users to look at.
Problem: The new --lost-and-found option in flux-fsck is not documented. Add documentation to flux-fsck(1).
Problem: There is no coverage for the new flux-fsck --lost-and-found option. Add coverage in t2816-fsck-cmd.t.
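One way to picture what placing a repairable entry into lost+found might involve is sketched below, in Python for brevity. It is an assumption-laden illustration, not the PR's code: the function name `recover_valref`, the dict-based store, the choice to concatenate only the blobs that still exist, and the key layout (the original key appended under "lost+found") are all hypothetical. The source states only that repairable entries are placed in a "lost+found" directory.

```python
def recover_valref(key, blobrefs, store, missing_indexes):
    """Build a best-effort value for a valref that has missing blobs.

    Returns (new_key, data). The new_key layout is an assumption.
    """
    skip = set(missing_indexes)
    # Keep the surviving blobs in their original order; drop the missing ones.
    data = b"".join(
        store[ref] for i, ref in enumerate(blobrefs) if i not in skip
    )
    return "lost+found." + key, data
```

The result is partial data, which is why words like "recover" can overstate what the user actually gets back.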
I decided that splitting out some code to a convenience library wasn't worth it at this point. In time with some of our other ideas from #6589, perhaps it'll be more worthwhile. So I removed WIP for now and we can debate some of the points I list above.
0508a5d to 8499da0
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6953      +/-   ##
==========================================
- Coverage   84.03%   84.03%   -0.01%
==========================================
  Files         545      545
  Lines       91833    92012     +179
==========================================
+ Hits        77174    77319     +145
- Misses      14659    14693      +34
Problem: flux-fsck can identify corrupted KVS entries, but does not presently do anything to help users recover data.
Support a --lost-and-found option. Corrupted KVS entries will be recovered as best as possible and placed into a "lost+found" directory for users to consider using.
notes:
- I struggled with some documentation / verbiage here. "recover" doesn't seem to be the right word sometimes, as that suggests full recovery. Perhaps there's a better word I'm not thinking of.
- There is no handling of corrupted treeobjs, like hypothetically a treeobj that is outright bad (fails treeobj_validate()). We could stick "empty values" in lost+found for those?
- I chose the directory "lost+found", which is actually what is used by the job-manager for its lost and found. Should it change? Should the job-manager path change?