reporting corrupted files
CRAB parses the stdout of failed jobs to identify possible input file corruption and reports it, so that Rucio can be used to check replicas and, if needed, invalidate and re-transfer them. See https://github.com/dmwm/CRABServer/issues/8773 and https://its.cern.ch/jira/browse/CMSTRANSF-1024
-
NOTE: we rely on cmsRun exceptions. cmsRun does not differentiate a bad file from an error on the storage server (local or remote); when reading remote files this is particularly annoying, since when there is an error reading a remote file there is no indication of where the file was being read from, only that the global redirector was used. cmsRun cannot parse ROOT errors and "do different things"; the best it can do is to throw a generic "error while opening" or "error while reading" and print out the exception text from ROOT.
-
NOTE: there is a tension among "report a bad file" (from a single HTC job which failed), "report files which failed in multiple resubmissions of the same job", and the fact that CRAB is built around collections of jobs which read the same dataset (a task), so that entire tasks are exposed to non-file-related read failures in case of xrootd infrastructure problems or code issues on the user's side.
- corrupted: a Fatal Root Error was raised, usually for truncated files; highly reliable. Sometimes we also get "not a root file", which is also highly reliable
- suspicious: a file with the correct size but a bad checksum will end up here; on a single-job basis there is no way to tell this apart from a random network error
The machinery has two parts:
- Something happens in the HTCondor APs (aka schedulers), which execute code sent along with each task submission
  - keep it simple, since it takes time to change and it is hard to debug
- Something happens in a script which runs daily as a crontab in the `belforte` user account, where more sophisticated decisions are possible
Whenever a (HTC) job completes, DAGMan running on the AP (aka scheduler) runs a POST script, PostJob.py.
- the first action in PostJob is to examine the job exit code via RetryJob.py
- if RetryJob says that the job can be retried (e.g. a file open/read error), the job is retried, up to 3 times in total. If it still fails, it is up to the user to try again; some jobs are retried tens of times and eventually succeed (a minimal sketch of this gating follows this list)
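This sketch is illustrative only, not the actual RetryJob.py code: the exit-code list is taken from the pseudo-code further below, while the function name, return values and the exact retry-count convention are assumptions.

```python
# Illustrative sketch only; not the actual RetryJob.py code.
RECOVERABLE_EXIT_CODES = {8020, 8021, 8022, 8028}  # file open/read errors, see the pseudo-code below
MAX_RETRIES = 3  # a failing job is retried up to 3 times in total

def decide_retry(exit_code: int, retry_num: int) -> str:
    """Return what the POST script should do for this job completion; retry_num starts from 0."""
    if exit_code == 0:
        return "done"
    if exit_code in RECOVERABLE_EXIT_CODES and retry_num < MAX_RETRIES:
        return "retry"
    return "give up"  # from here on it is up to the user to resubmit
```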
pseudo-code of the corrupted-file check done in PostJob:

    if exitCode in [8020, 8021, 8022, 8028]:
        check_corrupted_file(exitCode)

    def check_corrupted_file(exitCode):
        # read all lines from the job stdout
        # identify the last "successfully opened" file
        # find the lines between " ----- Begin Fatal Exception" and " ----- End Fatal Exception"
        # if one of those lines has "Fatal Root Error:", flag the file as corrupted and get the
        #   file name from that line
        # if that line does not have a file name, downgrade to suspicious file and use the last
        #   successfully opened file as the file name to report. This decision is based on a heuristic.
        # prepare a few-line "message" with file name, error, pointers to job and post-job logs
        #   (for debugging)
        reportBadInputFile(message)

    def reportBadInputFile(message):
        # write one JSON file in /eos/cms/store/temp/user/BadInputFiles/corrupted/new or
        #   /eos/cms/store/temp/user/BadInputFiles/suspicious/new with the message,
        #   creating a sub-directory for each task. But:
        #   - skip HammerCloud jobs (yeah.. they have plenty of 8021/28, no comment)
        #   - keep a count per CRAB task and never report more than 30 files
        #     (sometimes the errors are due to code, not files)
        # tech. detail: the JSON file is written with gfal-copy and the user proxy
        # files will be automatically removed after 30 days by an EOS policy applied to /store/temp/user
        # everybody can read from that EOS directory
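Below is a minimal, self-contained sketch of that parsing heuristic, for illustration only: it is not the actual PostJob.py code. The marker strings are the ones quoted above; the file-name pattern, the function signature and the return values are assumptions.

```python
import re

BEGIN_MARKER = "Begin Fatal Exception"  # full line is " ----- Begin Fatal Exception"
END_MARKER = "End Fatal Exception"      # full line is " ----- End Fatal Exception"
LFN_PATTERN = re.compile(r"(/store/\S+\.root)")  # how a file name is picked out (assumption)

def check_corrupted_file(stdout_lines):
    """Illustrative sketch: return ('corrupted'|'suspicious', file name or None)."""
    # remember the last file that cmsRun reported as successfully opened (heuristic fallback)
    last_opened = None
    for line in stdout_lines:
        if "successfully opened" in line.lower():
            match = LFN_PATTERN.search(line)
            if match:
                last_opened = match.group(1)
    # collect the lines of the (last) Fatal Exception block
    in_block, exception_lines = False, []
    for line in stdout_lines:
        if BEGIN_MARKER in line:
            in_block, exception_lines = True, []
        elif END_MARKER in line:
            in_block = False
        elif in_block:
            exception_lines.append(line)
    # classify: "Fatal Root Error" with a file name -> corrupted, otherwise downgrade to suspicious
    for line in exception_lines:
        if "Fatal Root Error" in line:
            match = LFN_PATTERN.search(line)
            if match:
                return "corrupted", match.group(1)
            return "suspicious", last_opened
    return "suspicious", last_opened
```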
The daily crontab (in the `belforte` account, see above) does the following:
- fetch ProcessBadFilesList.py from GH master
- run it
- copy the lists of TruncatedFiles, SuspiciousFiles, NotRootFiles (the latter has been empty for months) to /eos/project/c/cmsweb/www/CRAB, from where they can be pulled even via HTTP like in this
ProcessBadFilesList.py does the following (see the sketch after this list):
- go through the new task directories in EOS
- if a task has more than 30 entries, call it "fake" and skip it (we can't/won't fix whole datasets)
- aggregate the failure reports (there is one per job) by file name
- require at least 3 reports for a file (to make sure it was not fixed by the automatic retries) and then add it to the corresponding list
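The following is a minimal sketch of that aggregation, assuming one JSON report per job grouped in one sub-directory per task as described above; the directory layout details and the "fileName" field are assumptions, not the actual ProcessBadFilesList.py code.

```python
import json
from collections import Counter
from pathlib import Path

def aggregate_reports(new_dir, max_reports_per_task=30, min_reports_per_file=3):
    """Illustrative sketch only; not the actual ProcessBadFilesList.py code."""
    bad_files = []
    for task_dir in Path(new_dir).iterdir():
        if not task_dir.is_dir():
            continue
        reports = list(task_dir.glob("*.json"))
        # a task with too many reports is likely a code problem, not bad files: call it "fake" and skip
        if len(reports) > max_reports_per_task:
            continue
        counts = Counter()
        for report in reports:
            with open(report) as fh:
                counts[json.load(fh)["fileName"]] += 1  # the "fileName" key is an assumption
        # require repeated failures, so that files fixed by the automatic retries are not reported
        bad_files += [name for name, n in counts.items() if n >= min_reports_per_file]
    return bad_files
```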
The aim is to get rid of the crontab and replace it with direct flagging of suspicious replicas in Rucio from the RetryJobs. For this:
- only report for retry number >= 2, which avoids the need to require >= 3 reports per file (the retry number starts from 0)
- remove the check on "no more than 30 per task", since hitting that limit all in all appears to be very rare
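A minimal sketch of what such direct flagging could look like, assuming the Rucio Python client and its declare_suspicious_file_replicas(pfns, reason) call are available on the AP; the gating logic and all names below are assumptions based on the points above, not existing CRAB code.

```python
# Illustrative sketch only; not existing CRAB code.
# Assumes the PFN of the failing replica is known (which, as noted above, is not
# trivial when the file was read via the global redirector).
from rucio.client import Client

RECOVERABLE_EXIT_CODES = {8020, 8021, 8022, 8028}

def maybe_flag_suspicious(exit_code, retry_num, pfn, task_name, job_id):
    # retry number starts from 0: at retry >= 2 the job has already failed 3 times,
    # which replaces the crontab's "at least 3 reports per file" requirement
    if exit_code not in RECOVERABLE_EXIT_CODES or retry_num < 2 or not pfn:
        return
    reason = f"CRAB task {task_name}, job {job_id}: repeated read failures on this replica"
    Client().declare_suspicious_file_replicas([pfn], reason)
```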