Skip to content

flux-fsck: add --lost-and-found option #6953

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Aug 6, 2025

Problem: flux-fsck can identify corrupted KVS entries, but does not presently do anything to help users recover data.

Support a --lost-and-found option. Corrupted KVS entries will be recovered as best as possible and placed into a "lost+found"
directory for users to consider using.


notes:

  • I struggled with some documentation / verbiage here. "recover" doesn't seem to be the right word sometimes, as that suggests full recovery. Perhaps there's a better word I'm not thinking of.

  • there is no handling of corrupted treeobjs, like hypothetically a treeobj that is outright bad (fails treeobj_validate()). We could stick "empty values" in lost+found for those?

  • I chose the directory "lost+found", which is actually what is used by the job-manager for its lost and found. Should change? Should job-manager path change?

@garlick
Copy link
Member

garlick commented Aug 6, 2025

I forgot that the job manager already was using lost+found. Maybe for now we could just rename that directory to job-lost+found, with the eventual plan of auditing those error paths and seeing if we should handle them differently.

Seems like repair should be the default, but maybe at the end before checkpointing we should get confirmation, either a Y from the user if stdin is a tty, or require -y on the command line? Until we do that it's a no-op.

Perhaps "repair" instead of "recovery".

@garlick
Copy link
Member

garlick commented Aug 7, 2025

Reminder: we should require that the KVS is not running like we do in flux-restore(1).

@chu11
Copy link
Member Author

chu11 commented Aug 7, 2025

Seems like repair should be the default

I thought about that, but made it an option b/c normally an fsck doesn't require the KVS to be loaded. With lost+found it does require it to be loaded (and the content module too I think?). So making it the default would requiere the KVS to be loaded before doing an fsck?

Edit: I suddenly have a recollection we had a offline discussion about possibly trying to work on the KVS without having the KVS loaded. Very doable, but a little tricky. Is this what you were thinking for this lost+found?

Edit2: Yeah, the offline KVS "fixes" is what we should be doing.

but maybe at the end before checkpointing we should get confirmation, either a Y from the user if stdin is a tty, or require -y on the command line? Until we do that it's a no-op.

Originally my thought was that since we weren't changing anything amongst "real KVS data", figured checkpoint confirmation wasn't necessary. i.e. the only change is lost+found. Is your thinking that the Y/n confirmation is just to make sure the user is aware corruption was found and repaired?

Edit: If we go with the offline KVS update, then the Y/n is more obvious, as we're circumventing normal KVS stuffs.

@chu11 chu11 force-pushed the issue6589_fsck_lost_found branch from 6118a34 to 1323ede Compare August 18, 2025 17:24
@chu11 chu11 changed the title flux-fsck: add --lost-and-found option WIP: flux-fsck: add --lost-and-found option Aug 18, 2025
@chu11
Copy link
Member Author

chu11 commented Aug 18, 2025

just pushed a WIP update

  • the lost+found is now done via work directly on the content store, not through the KVS

    • i should split this out into a mini-library and test it, gonna work on that in my next round
    • Edit: ya know, I think the code addition is under the "convenience library" line ... i'm thinking i may not do that now.
  • kept lost+found as an option, it's sorta nice when testing fscks over and over and w/ --rootref. We can debate this

  • checkpoint is still "automatic" for now, can work on that later for user to say Y/n. Maybe a follow on PR to debate exactly how to do it? and leave this PR specifically for the lost+found work.

chu11 added 4 commits August 18, 2025 10:26
Problem: In the near future flux-fsck will repair data and move
it to a lost+found directory.  This conflicts with the lost+found
directory used in the job-manager.

Rename the lost+found directory in the job-manager to job-lost+found.

Update one test in t/t2219-job-manager-restart.t for directory name
change.
Problem: A number of globals are used in flux-fsck.  This is ok
for the time being, but many more global "trackings" variables
will be needed in the future.

Refactor flux-fsck to have a single "context" that is passed around
between functions.  Remove all globals.
Problem: In the near future we will want to repair corrupted
valref treeobjs.  In order to do so efficiently, we would like to
"save" the location of invalid indexes while we are fsck-ing
the valref treeobj.

Save the indexes of the missing references in a special
missing_indexes list.  This list is presently unused.
Problem: In the near future updates to the root treeobj may have to
be done.  However this root is not saved anywhere.

Save the root treeobj to the primary context.  This does not change
any current flux-fsck behavior.
@chu11 chu11 force-pushed the issue6589_fsck_lost_found branch from 1323ede to 0508a5d Compare August 18, 2025 17:26
chu11 added 3 commits August 18, 2025 13:27
Problem: flux-fsck can identify corrupted KVS entries, but does
not presently do anything to help users recover data.

Support a --lost-and-found option.  Corrupted KVS entries that
can be repaired will be placed into the "lost+found" directory
for users to look at.
Problem: The new --lost-and-found option in flux-fsck is not
documented.

Add documentation to flux-fsck(1).
Problem: There is no coverage for the new flux-fsck --lost-and-found
option.

Add coverage in t2816-fsck-cmd.t.
@chu11 chu11 changed the title WIP: flux-fsck: add --lost-and-found option lux-fsck: add --lost-and-found option Aug 18, 2025
@chu11 chu11 changed the title lux-fsck: add --lost-and-found option flux-fsck: add --lost-and-found option Aug 18, 2025
@chu11
Copy link
Member Author

chu11 commented Aug 18, 2025

I decided that splitting out some code to a convenience library wasn't worth it at this point. In time with some of our other ideas from #6589, perhaps it'll be more worthwhile. So I removed WIP for now and we can debate some of the points I list above.

@chu11 chu11 force-pushed the issue6589_fsck_lost_found branch from 0508a5d to 8499da0 Compare August 18, 2025 23:06
Copy link

codecov bot commented Aug 18, 2025

Codecov Report

❌ Patch coverage is 81.85484% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.03%. Comparing base (9a6203e) to head (8499da0).
⚠️ Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
src/cmd/builtin/fsck.c 81.78% 45 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6953      +/-   ##
==========================================
- Coverage   84.03%   84.03%   -0.01%     
==========================================
  Files         545      545              
  Lines       91833    92012     +179     
==========================================
+ Hits        77174    77319     +145     
- Misses      14659    14693      +34     
Files with missing lines Coverage Δ
src/modules/job-manager/restart.c 82.12% <100.00%> (ø)
src/cmd/builtin/fsck.c 82.14% <81.78%> (+2.33%) ⬆️

... and 13 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants