Proposal for managing test baseline images using data version control (dvc) #5724

Closed
@maxrjones

This issue proposes a solution to #3470 and a partial solution to #2681 by using data version control to manage the baseline images for testing. @weiji14 led an effort to move PyGMT's tests from git version control to data version control with remotes stored on DAGsHub in GenericMappingTools/pygmt#1036; most of the information here is from Wei Ji's posts for PyGMT (thanks! 🙏 🎉 ).

Motivation for migrating baseline images to dvc

Here's the current breakdown for the GMT repository:

  • .git: ~1.1 GB (up from ~720 MB on Feb. 06 2020)
  • test: ~115 MB (101 MB from PS files) (up from ~113 MB on Feb. 06 2020)
  • doc: ~68 MB (51 MB from PS files; 33 MB from PS in doc/examples ; 18 MB from PS in doc/scripts) (down from ~70 MB on Feb. 06 2020)
  • share: ~13.5 MB
  • src: ~16 MB

The overall repository size has grown by roughly 50% (from ~720 MB to ~1.1 GB) over the past 1.5 years, while the individual directories have stayed about the same size. This supports past developer comments that the repository growth caused by rewriting PS files is unsustainable.

What is data version control

Data version control (dvc) is an open-source tool for managing and versioning datasets and models. It works alongside Git and uses very similar syntax. Rather than storing bulky images in the repository, small .dvc files are stored that contain metadata, including the md5 hash of the data file. This allows versioning of data files that are stored in a remote location. Options for remote storage include S3, Google Cloud Storage, Azure, an SSH server, and DAGsHub (PyGMT uses DAGsHub).
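
To make the pointer-file idea concrete, here is a minimal sketch that mimics what a .dvc file records, using only standard shell tools. The file name `baseline.ps` and the helper `write_pointer` are hypothetical, `md5sum` is assumed to be available (GNU coreutils), and the output is a simplified approximation: real .dvc files are written by `dvc add` and may carry additional fields.

```shell
#!/bin/sh
# Simplified illustration only: real .dvc files are produced by `dvc add`.
# write_pointer is a hypothetical helper, not part of dvc.
write_pointer() {
    data_file=$1
    # md5sum prints "<hash>  <file>"; keep only the hash
    hash=$(md5sum "$data_file" | cut -d ' ' -f 1)
    # Record the hash and the file name in a small YAML-like pointer file
    printf 'outs:\n- md5: %s\n  path: %s\n' "$hash" "$(basename "$data_file")" > "$data_file.dvc"
}

workdir=$(mktemp -d)
printf '%%!PS-Adobe-3.0\n' > "$workdir/baseline.ps"   # stand-in for a baseline image
write_pointer "$workdir/baseline.ps"
cat "$workdir/baseline.ps.dvc"
```

Only the small pointer file would be committed to Git; the image itself would live in the dvc cache and on the remote, looked up by its hash.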

Steps required

(Based on PyGMT, may need some updating)

Initial setup (only needs to be done once for the repository)

Installing DVC for developing GMT

Initialize dvc

dvc init # creates the .dvcignore file and the .dvc/ folder
# Remove the .dvc/plots folder, since it won't be used
# Optionally configure the repository to not send anonymous usage data
dvc config core.analytics false
# git add only the .dvcignore, .dvc/.gitignore and .dvc/config files
git add .dvcignore .dvc/.gitignore .dvc/config
git commit -m "Initialize data version control"

Setup DVC remote

dvc remote add origin https://dagshub.com/GenericMappingTools/gmt.dvc # updates .dvc/config file with remote URL
dvc remote default origin  # set the default dvc remote to 'origin'
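
After these two commands, .dvc/config should name `origin` as the default remote. The sketch below shows one way to sanity-check that before committing the file. The helper `has_default_remote` is hypothetical, and the sample config written to a temporary directory reflects the layout I would expect dvc to produce rather than output captured from the GMT repository.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given dvc config file sets NAME as the default remote.
has_default_remote() {
    name=$1
    config=${2:-.dvc/config}
    grep -Eq "^[[:space:]]*remote[[:space:]]*=[[:space:]]*${name}[[:space:]]*\$" "$config"
}

# Sample config with the assumed layout, written to a temp dir for demonstration
demo=$(mktemp -d)
cat > "$demo/config" <<'EOF'
[core]
    remote = origin
['remote "origin"']
    url = https://dagshub.com/GenericMappingTools/gmt.dvc
EOF

if has_default_remote origin "$demo/config"; then
    echo "default remote is origin"
else
    echo "default remote is not origin" >&2
fi
```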

Migrating tests

(based on PyGMT steps, may need updating)

# Sync with git and dvc remotes
git pull
dvc pull
# Generate hash for baseline image and stage the *.dvc file in git
git rm --cached 'test/<test-folder>/<test-image>.ps'
mkdir -p 'test/baseline/<test-folder>'  # make sure the destination folder exists
mv 'test/<test-folder>/<test-image>.ps' 'test/baseline/<test-folder>/<test-image>.ps'
dvc add 'test/baseline/<test-folder>'
git add 'test/baseline/<test-folder>.dvc' test/baseline/.gitignore
# Commit changes and push to both the git and dvc remotes
git commit -m "Migrate test to DVC"
git push
dvc push
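
Since the same steps would be repeated for every test, they could be wrapped in a small helper. The following is only a sketch: `run` and `migrate_test` are hypothetical names, `psxy` and `lines` are placeholder folder and image names, and the `mkdir -p` step is an added assumption that the destination folder may not exist yet. With `DRY_RUN=1` the commands are printed rather than executed, so the sequence can be reviewed before touching the repository.

```shell
#!/bin/sh
# Hypothetical wrapper around the per-test migration steps.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: migrate_test <test-folder> <test-image>
migrate_test() {
    folder=$1
    image=$2
    run git rm --cached "test/$folder/$image.ps" &&
    run mkdir -p "test/baseline/$folder" &&
    run mv "test/$folder/$image.ps" "test/baseline/$folder/$image.ps" &&
    run dvc add "test/baseline/$folder" &&
    run git add "test/baseline/$folder.dvc" test/baseline/.gitignore
}

# Preview the commands for a placeholder test without modifying anything
DRY_RUN=1
migrate_test psxy lines
```

The commit and the two pushes (`git push`, `dvc push`) are deliberately left out of the helper so that several tests can be migrated and reviewed before anything is published.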

Pull images from DVC remote (for GitHub Actions CI and local testing)

dvc status # should report any files 'not_in_cache'
dvc pull # pull down files from DVC remote cache (fetch + checkout)
cd <build-dir>
ctest
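
For CI, it may help to fail early when the baseline images have not been pulled rather than letting every test fail on a missing file. Below is a minimal sketch of such a check, assuming each `<name>.dvc` pointer in the baseline directory tracks a sibling directory `<name>` (as in the migration steps above). The function `check_baselines` and the demo folder names are hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-test check: for every <name>.dvc pointer in a baseline directory,
# confirm that the tracked directory <name> exists locally (i.e. `dvc pull` has run).
# Prints missing entries to stderr and returns non-zero if any are absent.
check_baselines() {
    baseline_dir=${1:-test/baseline}
    missing=0
    for pointer in "$baseline_dir"/*.dvc; do
        [ -e "$pointer" ] || continue    # the glob matched no pointer files
        target=${pointer%.dvc}
        if [ ! -e "$target" ]; then
            echo "missing: $target" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Demonstration in a temp dir: one baseline is present, one has not been pulled
demo=$(mktemp -d)
mkdir "$demo/psxy"
touch "$demo/psxy.dvc" "$demo/grdimage.dvc"
check_baselines "$demo" || echo "run 'dvc pull' before ctest"
```

In a workflow, this could sit between `dvc pull` and `ctest` so that a failed or skipped pull is reported once and clearly.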

What about the images for documentation?

The test directory is currently much larger than the documentation directory, so migrating the tests would be a large first step that does not require an established solution for the documentation images. Regardless, my opinion is that we should host the examples/tutorials/animations in a separate repository (#5364 (comment)).

Are you willing to help implement and maintain this feature? Yes
