Data Versioning
DPR2 offers features for tracking data versions throughout a project’s lifecycle. This vignette covers those features.
One feature that DPR2 offers is the ability to print the history of data objects from a version-controlled data package and compare them to the current version. Here we use the dpr_data_history function to get a listing of rda files, when they were created, and their git SHA-1 hashes. We can then fetch specific versions using dpr_recall_data_versions.
history <- dpr_data_history(path=path, include_checksums=TRUE)
print(history)
## blob_git_sha1 name author
## 1 8366813e7c1e1dfcf07e65cee2e876877ac44776 irisArea.rda notauser
## 2 360e51e0dc96e37265a7ada42f4d0d5203c5c0c3 mazda.rda notauser
## 3 405308c73f2f360076ba76963ba3b73466b2332b trees.rda notauser
## 4 d73a121b19e03d64d1c304a694fcda1e5c61a95d irisArea.rda notauser
## 5 4924f0aa284e1741a00d34198e168b522461e4b5 mazda.rda notauser
## 6 43eee6c599938b35df657acccfcb913be479fc4d trees.rda notauser
## when object_checksum
## 1 2025-09-24 20:35:09 1a8558f6a03c6e7cd0fde036018388b0
## 2 2025-09-24 20:35:09 70b01998e5dd4243758e76b5312ee606
## 3 2025-09-24 20:35:09 370a7132861fb520bd721d9bcbe008a4
## 4 2025-09-24 20:35:12 e1279ff5a8c8ab9b255ce78f2cb08479
## 5 2025-09-24 20:35:15 8bf6afe5717bceebb76ed9bf069d4274
## 6 2025-09-24 20:35:18 b311633cdf60bc0ab9dd5de7a92bc282
irisAreas <- history[history$name == "irisArea.rda",]
firstIrisHash <- irisAreas[irisAreas$when == min(irisAreas$when), "blob_git_sha1"]
lastIrisHash <- irisAreas[irisAreas$when == max(irisAreas$when), "blob_git_sha1"]
dataversions <- dpr_recall_data_versions(c(firstIrisHash, lastIrisHash), path)
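The selection of the first and last hashes can be tried without a data package. The following is a minimal sketch, not part of DPR2: the history data frame is a hand-built stand-in that copies the two irisArea.rda rows printed above, and the UTC time zone is an assumption, since the printed output does not state one. It uses which.min() and which.max() so that exactly one hash comes back even if two versions share a timestamp.

```r
# Stand-in for the irisArea.rda rows printed above. In a real session this
# table would come from dpr_data_history(); the time zone is assumed.
history <- data.frame(
  blob_git_sha1 = c(
    "8366813e7c1e1dfcf07e65cee2e876877ac44776",
    "d73a121b19e03d64d1c304a694fcda1e5c61a95d"
  ),
  name = "irisArea.rda",
  when = as.POSIXct(
    c("2025-09-24 20:35:09", "2025-09-24 20:35:12"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

irisAreas <- history[history$name == "irisArea.rda", ]

# which.min()/which.max() each return a single row index, so a tie in
# `when` cannot produce more than one hash.
firstIrisHash <- irisAreas$blob_git_sha1[which.min(irisAreas$when)]
lastIrisHash <- irisAreas$blob_git_sha1[which.max(irisAreas$when)]
```

The resulting pair of hashes has the same form as the vector passed to dpr_recall_data_versions in the vignette code.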
We can use the dpr_data_digest function to get a table of the current hashes recorded in the data package. The checksum of the most recent recalled version should match the corresponding entry in the dpr_data_digest output table.
digest::digest(dataversions[[1]][[1]])
## [1] "1a8558f6a03c6e7cd0fde036018388b0"
digest::digest(dataversions[[2]][[1]])
## [1] "e1279ff5a8c8ab9b255ce78f2cb08479"
dpr_data_digest(path)
## name data_digest_md5
## 1 irisArea.rda e1279ff5a8c8ab9b255ce78f2cb08479
## 2 mazda.rda 8bf6afe5717bceebb76ed9bf069d4274
## 3 trees.rda b311633cdf60bc0ab9dd5de7a92bc282
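The visual comparison above can also be written as an assertion. This is a sketch under stated assumptions: digestTable is a hand-built copy of the dpr_data_digest output shown above, and lastChecksum is the printed checksum of the most recent recalled version, so no DPR2 call is made here.

```r
# Checksum of the most recent recalled irisArea version, as printed above.
lastChecksum <- "e1279ff5a8c8ab9b255ce78f2cb08479"

# Stand-in for the table returned by dpr_data_digest(path).
digestTable <- data.frame(
  name = c("irisArea.rda", "mazda.rda", "trees.rda"),
  data_digest_md5 = c(
    "e1279ff5a8c8ab9b255ce78f2cb08479",
    "8bf6afe5717bceebb76ed9bf069d4274",
    "b311633cdf60bc0ab9dd5de7a92bc282"
  ),
  stringsAsFactors = FALSE
)

# Look up the recorded digest for the object of interest.
recorded <- digestTable$data_digest_md5[digestTable$name == "irisArea.rda"]

# Stop with an error if the recalled version and the recorded digest differ.
stopifnot(identical(lastChecksum, recorded))
```

In a real session, lastChecksum would be computed with digest::digest on the recalled object and digestTable would be the result of dpr_data_digest(path).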
We can inspect the changes between the two versions using diffdf::diffdf.
diffdf::diffdf(
dataversions[[1]][[1]],
dataversions[[2]][[1]]
)
## Warning in diffdf::diffdf(dataversions[[1]][[1]], dataversions[[2]][[1]]):
## There are columns in BASE that are not in COMPARE !!
## There are columns in COMPARE that are not in BASE !!
## Differences found between the objects!
##
## Summary of BASE and COMPARE
## ============================================================
## PROPERTY BASE COMP
## ------------------------------------------------------------
## Name dataversions[[1]][[1]] dataversions[[2]][[1]]
## Class data.frame data.frame
## Rows(#) 150 150
## Columns(#) 6 6
## ------------------------------------------------------------
##
##
## There are columns in BASE that are not in COMPARE !!
## ============
## COLUMNS
## ------------
## Sepal.Area
## ------------
##
##
## There are columns in COMPARE that are not in BASE !!
## ============
## COLUMNS
## ------------
## Petal.Area
## ------------
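The column-level part of this report can be reproduced in base R with setdiff(). The sketch below is illustrative only: baseVersion and compareVersion are hypothetical reconstructions built from the built-in iris data, and the area formulas (length times width) are an assumption, not taken from the data package.

```r
# Hypothetical reconstruction of the two versions; the area formulas are
# assumed for illustration and may differ from the package's own code.
baseVersion <- transform(iris, Sepal.Area = Sepal.Length * Sepal.Width)
compareVersion <- transform(iris, Petal.Area = Petal.Length * Petal.Width)

# Columns present in one version but not the other.
onlyInBase <- setdiff(names(baseVersion), names(compareVersion))
onlyInCompare <- setdiff(names(compareVersion), names(baseVersion))

# Columns present in both versions, and whether their values agree.
shared <- intersect(names(baseVersion), names(compareVersion))
sharedEqual <- isTRUE(all.equal(baseVersion[shared], compareVersion[shared]))

print(onlyInBase)
print(onlyInCompare)
print(sharedEqual)
```

This covers only column names and equality of the shared columns; diffdf::diffdf additionally reports differences in values, classes, and attributes.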
Below, we render the package before building it, so that we can compare the checksums of the refreshed objects in the data directory with the checksums recorded in the data digest.
dpr_render(path)
Once we have rendered the package and changes have been detected, we can compare the datasets using dpr_compare_data_digest.
dpr_compare_data_digest(path)
## name data_md5
## 1 irisArea.rda c1bb2fb3a1f22b55ec3a9ce295566370
## 2 mazda.rda 8bf6afe5717bceebb76ed9bf069d4274
## 3 trees.rda b311633cdf60bc0ab9dd5de7a92bc282
## data_digest_md5 same
## 1 e1279ff5a8c8ab9b255ce78f2cb08479 FALSE
## 2 8bf6afe5717bceebb76ed9bf069d4274 TRUE
## 3 b311633cdf60bc0ab9dd5de7a92bc282 TRUE
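To act on this table programmatically, the rows that changed can be filtered out. The following is a minimal sketch: comparison is a hand-built copy of the output above rather than a live call to dpr_compare_data_digest, and the same column is recomputed from the two checksum columns as a cross-check.

```r
# Stand-in for the table returned by dpr_compare_data_digest(path).
comparison <- data.frame(
  name = c("irisArea.rda", "mazda.rda", "trees.rda"),
  data_md5 = c(
    "c1bb2fb3a1f22b55ec3a9ce295566370",
    "8bf6afe5717bceebb76ed9bf069d4274",
    "b311633cdf60bc0ab9dd5de7a92bc282"
  ),
  data_digest_md5 = c(
    "e1279ff5a8c8ab9b255ce78f2cb08479",
    "8bf6afe5717bceebb76ed9bf069d4274",
    "b311633cdf60bc0ab9dd5de7a92bc282"
  ),
  stringsAsFactors = FALSE
)

# TRUE where the rendered object matches the recorded digest.
comparison$same <- comparison$data_md5 == comparison$data_digest_md5

# Names of the datasets whose rendered contents differ from the digest.
changed <- comparison$name[!comparison$same]
print(changed)
```

Here only irisArea.rda is reported, which agrees with the FALSE entry in the same column of the output above.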