Description
Proposal for managing test baseline images using data version control (dvc)
This issue proposes a solution to #3470 and a partial solution to #2681 by using data version control to manage the baseline images for testing. @weiji14 led an effort to move PyGMT's tests from git version control to data version control with remotes stored on DAGsHub in GenericMappingTools/pygmt#1036; most of the information here is from Wei Ji's posts for PyGMT (thanks! 🙏 🎉 ).
Motivation for migrating baseline images to dvc
Here's the current breakdown for the GMT repository:
- .git: ~1.1 GB (up from ~720 MB on Feb. 06 2020)
- test: ~115 MB (101 MB from PS files) (up from ~113 MB on Feb. 06 2020)
- doc: ~68 MB (51 MB from PS files; 33 MB from PS in doc/examples; 18 MB from PS in doc/scripts) (down from ~70 MB on Feb. 06 2020)
- share: ~13.5 MB
- src: ~16 MB
The fact that the overall repository size increased by 50% over the past 1.5 years while individual directories have remained the same size supports past developer comments that the repository growth rate due to rewriting PS files is unsustainable.
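For anyone who wants to re-measure these numbers, here is a rough sketch. The helper name `ps_bytes` is made up for illustration and is not part of GMT; it simply adds up the bytes of every `*.ps` file below a directory.

```shell
# Sum the size in bytes of all PostScript files below a directory.
# ps_bytes is a hypothetical helper name, not part of GMT.
ps_bytes() {
    # Concatenate every *.ps file and count the bytes; yields 0 if none exist
    total=$(find "$1" -type f -name '*.ps' -exec cat {} + | wc -c)
    echo $((total))    # arithmetic expansion drops any padding added by wc
}

# Example (run from the top of a GMT checkout):
#   du -sh .git test doc share src
#   ps_bytes test
```

Note that `du` reports disk usage while `ps_bytes` reports file content size, so the two will not match exactly.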
What is data version control
Data version control (dvc) is an open source tool for managing and versioning datasets and models. It works alongside Git and uses very similar command syntax. Rather than storing bulky images in the repository, small .dvc files are stored that contain metadata, including the MD5 hash of the data file. This allows versioning of data files that are stored in a remote location. Options for remote storage include S3, Google Cloud Storage, Azure, an SSH server, and DAGsHub (PyGMT uses DAGsHub).
Steps required
(Based on PyGMT, may need some updating)
- Add DVC as a dependency for developing GMT
- Initialize dvc in the repository (e.g., Initialize data version control for managing test images pygmt#1036)
- Set up DVC remote
- Add instructions for using DVC for image-based testing to the contributing guide
- Set up dvc for CI tests
- Add a workflow to support side-by-side comparison of modified images (e.g., Improve the DVC image diff workflow to support side-by-side comparison of modified images pygmt#1219)
- Migrate existing test baseline images to DVC (e.g., Migrate tests to use dvc-tracked baseline images pygmt#1131)
- Exclude dvc related files from the source distribution (Add .dvc to cmake remove directories step #6206)
- Optionally, add baseline images as a separate release asset for each release (e.g., Add a workflow to upload baseline images as a release asset pygmt#1317)
Initial setup (only needs to be done once for the repository)
Installing DVC for developing GMT
- Add link to dvc install instructions in the wiki.
- Add dvc to the development requirements list in BUILDING.md.
Initialize dvc
dvc init # creates .dvcignore file and .dvc/ folder
# Remove the .dvc/plots folder (if dvc init created one), as it won't be used
rm -rf .dvc/plots
# Optionally configure the repository to not send anonymous usage data
dvc config core.analytics false
# git add only the .dvcignore, .dvc/.gitignore and .dvc/config file
git add .dvcignore .dvc/.gitignore .dvc/config
git commit -m "Initialize data version control"
Set up DVC remote
- Set up a mirror of the GMT repository under the GMT DAGsHub organization
dvc remote add origin https://dagshub.com/GenericMappingTools/gmt.dvc # updates .dvc/config file with remote URL
dvc remote default origin # set default dvc remote to 'origin'
Migrating tests
- Get added as a collaborator on DAGsHub and set up authentication
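Authentication for an HTTP remote such as DAGsHub could be set up along these lines. This is a sketch rather than the agreed procedure: `configure_dvc_auth` is a hypothetical helper, the remote name `origin` follows the setup above, and `DAGSHUB_USER` / `DAGSHUB_TOKEN` are placeholder environment variables. The `--local` flag writes to `.dvc/config.local`, which DVC keeps out of git, so credentials are not committed.

```shell
# Store credentials for a DVC remote in .dvc/config.local (not tracked by git).
# configure_dvc_auth is a hypothetical helper name.
configure_dvc_auth() {
    remote=$1   # name of the DVC remote, e.g. origin
    user=$2     # DAGsHub username
    token=$3    # DAGsHub access token or password
    dvc remote modify "$remote" --local auth basic
    dvc remote modify "$remote" --local user "$user"
    dvc remote modify "$remote" --local password "$token"
}

# Example, with placeholder credentials taken from the environment:
#   configure_dvc_auth origin "$DAGSHUB_USER" "$DAGSHUB_TOKEN"
```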
(based on PyGMT steps, may need updating)
# Sync with git and dvc remotes
git pull
dvc pull
# Generate hash for baseline image and stage the *.dvc file in git
git rm --cached 'test/<test-folder>/<test-image>.ps'
mkdir -p test/baseline/<test-folder> # make sure the destination folder exists
mv test/<test-folder>/<test-image>.ps test/baseline/<test-folder>/<test-image>.ps
dvc add test/baseline/<test-folder>
git add test/baseline/<test-folder>.dvc test/baseline/.gitignore
# Commit changes and push to both the git and dvc remotes
git commit -m "Migrate test to DVC"
git push
dvc push
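Since the steps above would be repeated for many test folders, they could be wrapped in a small helper. The sketch below is one possible shape, not a tested migration script: `migrate_test_folder` is a hypothetical name, it assumes it is run from the repository root, and it assumes every `*.ps` file in the folder is a baseline image. Committing and pushing are left as separate manual steps.

```shell
# Move all PostScript baselines of one test folder under test/baseline/
# and hand them over from git to dvc. Hypothetical helper; run from repo root.
migrate_test_folder() {
    folder=$1
    mkdir -p "test/baseline/$folder"
    for image in "test/$folder"/*.ps; do
        [ -e "$image" ] || continue           # skip folders without PS files
        git rm --cached --quiet "$image"      # stop tracking the image in git
        mv "$image" "test/baseline/$folder/"
    done
    dvc add "test/baseline/$folder"           # writes test/baseline/<folder>.dvc
    git add "test/baseline/$folder.dvc" test/baseline/.gitignore
}

# Example:
#   migrate_test_folder <test-folder>
```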
Pull images from DVC remote (for GitHub Actions CI and local testing)
dvc status # should report any files 'not_in_cache'
dvc pull # pull down files from DVC remote cache (fetch + checkout)
cd <build-dir>
ctest
What about the images for documentation?
The test directory is currently much larger than the documentation directory, so migrating the tests is a large first step that does not require an established solution for the documentation images. Regardless, my opinion is that we should host the examples/tutorials/animations in a separate repository (#5364 (comment)).
References
- https://dvc.org/doc/start/data-versioning
- https://hackernoon.com/how-to-get-started-with-data-version-control-dvc-nn2n31bo
- https://dagshub.com/blog/configure-a-dvc-remote-without-a-devops-degree/
- https://dagshub.com/blog/datasets-should-behave-like-git-repositories/
- https://dagshub.com/blog/data-version-control-tools/
- https://dagshub.com/blog/data-science-pull-requests/
Are you willing to help implement and maintain this feature? Yes