Description
Background
A Filename registry is maintained to persist even after files have been removed, to prevent re-upload, re-use of that exact filename.
When a release's files are removed, the ability to surface their storage location is effectively lost, as the path generator tool uses hashers to determine the placement from the file's hashes, which are also not preserved.
warehouse/warehouse/forklift/legacy.py
Lines 1525 to 1536 in aeb9ccd
The path
data exists in the BigQuery Project Metadata Table so it's semi-available, but harder to get to during routine operations and investigations.
This could also feasibly be used via Inspector or something similar, if made available via some API.
Proposal
A few steps to tackle the problem, can definitely change based on further findings or ideas.
- Add
Filename.path
column (easy) - Populate
Filename.path
during file upload, around here (easy) - Backfill the column from existing data in
File.path
for filenames that match (easyish) - Backfill the column for remaining empty entries from BigQuery data (medium to hard)