Add copy of release_files.path to file_registry for long-term storage #17750

Open
@miketheman

Background

A Filename registry is maintained so that filenames persist even after their files have been removed, which prevents re-upload or re-use of that exact filename.

When a release's files are removed, the ability to surface their storage location is effectively lost: the path generator derives the location from the file's hashes, and those hashes are not preserved either.

# Figure out what our filepath is going to be, we're going to use a
# directory structure based on the hash of the file contents. This
# will ensure that the contents of the file cannot change without
# it also changing the path that the file is saved to.
path="/".join(
    [
        file_hashes[PATH_HASHER][:2],
        file_hashes[PATH_HASHER][2:4],
        file_hashes[PATH_HASHER][4:],
        filename,
    ]
),
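
To make the dependency on the hash concrete, here is a minimal, self-contained sketch of the same path layout. It assumes the path hasher is BLAKE2b with a 256-bit digest (the snippet above only shows the name PATH_HASHER, not its value), and `storage_path` is a hypothetical helper name, not a function from the codebase.

```python
import hashlib

# Assumption: PATH_HASHER names a BLAKE2b digest of 256 bits. The issue only
# shows the constant's name, so treat this value as illustrative.
PATH_HASHER = "blake2_256"


def storage_path(content: bytes, filename: str) -> str:
    """Hypothetical helper: build a content-addressed storage path.

    Layout: <first 2 hex chars>/<next 2 hex chars>/<remaining hex chars>/<filename>
    """
    file_hashes = {
        PATH_HASHER: hashlib.blake2b(content, digest_size=32).hexdigest(),
    }
    digest = file_hashes[PATH_HASHER]
    return "/".join([digest[:2], digest[2:4], digest[4:], filename])


if __name__ == "__main__":
    # The filename alone is not enough: different contents give different paths.
    print(storage_path(b"example contents", "example-1.0.tar.gz"))
```

Because the first three path segments come entirely from the digest, a registry row that keeps only the filename cannot be used to recompute the path once the file row and its hashes are gone.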

The path data exists in the BigQuery Project Metadata Table, so it's semi-available, but it is harder to get to during routine operations and investigations.

The stored path could also feasibly be used by Inspector or something similar, if it were made available via some API.

Proposal

A few steps to tackle the problem; these can definitely change based on further findings or ideas.

  • Add Filename.path column (easy)
  • Populate Filename.path during file upload, around here (easy)
  • Backfill the column from existing data in File.path for filenames that match (easyish)
  • Backfill the column for remaining empty entries from BigQuery data (medium to hard)
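
The first three steps could be sketched roughly as follows. This is an illustration only: it uses an in-memory SQLite database in place of the real database and its migration tooling, the table names are taken from the issue title, the `filename` and `path` column layout is assumed, and `make_demo_db` and `backfill_paths` are hypothetical helper names. The sample path is made up.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a toy schema: one file still present, one already removed."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE release_files (filename TEXT PRIMARY KEY, path TEXT NOT NULL);
        CREATE TABLE file_registry (filename TEXT PRIMARY KEY);

        -- Step 1: add the new nullable column to the registry.
        ALTER TABLE file_registry ADD COLUMN path TEXT;

        INSERT INTO release_files VALUES
            ('kept-1.0.tar.gz', 'aa/bb/cccc/kept-1.0.tar.gz');
        INSERT INTO file_registry (filename) VALUES
            ('kept-1.0.tar.gz'), ('removed-1.0.tar.gz');
        """
    )
    return conn


def backfill_paths(conn: sqlite3.Connection) -> int:
    """Step 3: copy path from release_files where the filename still matches.

    Returns the number of registry rows that were filled in.
    """
    cur = conn.execute(
        """
        UPDATE file_registry
        SET path = (
            SELECT rf.path FROM release_files AS rf
            WHERE rf.filename = file_registry.filename
        )
        WHERE path IS NULL
          AND EXISTS (
            SELECT 1 FROM release_files AS rf
            WHERE rf.filename = file_registry.filename
          )
        """
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = make_demo_db()
    print(backfill_paths(conn), "row(s) backfilled")
    for row in conn.execute("SELECT filename, path FROM file_registry ORDER BY filename"):
        print(row)
```

Rows whose files were already removed stay NULL after this pass; those are the entries that would need the BigQuery data in the final step. Step 2 (populating the column at upload time) would amount to writing the same path value into the registry row when it is created.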
