The DPR2 Workflow
This vignette demonstrates the DPR2 workflow for building data packages. We will go over key DPR2 functions and how to build a data package from start to finish.
DPR2 workflow overview
- Create a folder with the name of the new data package.
- Start an R session with that folder set as the working directory.
- Load the DPR2 library with library(DPR2).
- Initialize the data package with dpr_init().
- Add any necessary source data to the folder inst/extdata.
- Add your processing scripts to the folder processing.
- Build the data package with dpr_build().
In this workflow, processing scripts are written to transform source
data into analysis data. Static vignettes and documentation are rendered
from those scripts, and any data that is generated can be saved to the
package using dpr_save().
The DPR2 YAML and API
DPR2 was designed so that any configuration of package-building behavior
can be stored in the package’s datapackager.yml file, which lives at the
root directory of the data package source. This file is generated when the
data package is initialized and is maintained by the user. The DPR2 API is
designed around this concept, much like the original DataPackageR; however,
DPR2 also allows the values in that file to be overridden ad hoc through
the ellipsis (...) argument of every function that reads the configuration
YAML file.
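The override behavior described here can be pictured with a small base R sketch. It is not DPR2's implementation: resolve_config, the in-memory config list, and the key some_other_option are hypothetical names used only for illustration, with process_on_build borrowed from the example in this vignette.

```r
# Hypothetical stand-in for values that would be read from datapackager.yml.
config <- list(
  process_on_build  = "my_process.Rmd",
  some_other_option = TRUE  # hypothetical key, for illustration only
)

# Hypothetical helper: values passed through `...` replace stored values
# for this call only; the stored list itself is not modified.
resolve_config <- function(config, ...) {
  overrides <- list(...)
  utils::modifyList(config, overrides)
}

effective <- resolve_config(config, process_on_build = "other_script.Rmd")

effective$process_on_build  # value used for this call
config$process_on_build     # stored value, unchanged
```

The stored list is left untouched and only the value returned for that call differs, which is the sense in which the override is ad hoc.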
Initializing a DPR2 package
library(DPR2)
dpr_init(yaml = dpr_yaml_init(process_on_build = 'my_process.Rmd'))
There are two ways to initialize a new DPR2 package: dpr_init or
dpr_create. dpr_init is simply a wrapper around dpr_create: dpr_init
creates a new DPR2 package in the current working directory, while
dpr_create creates a new DPR2 package at a specified path. When the
package is initialized, DPR2 checks whether the files it needs are
already present and creates any that are missing. This is handy when a
package already exists and a user would like to maintain it using DPR2.
The datapackager.yml and DESCRIPTION files of a DPR2 package can be
initialized with specific values using the yaml and desc arguments. Each
argument takes a named list generated by dpr_yaml_init() or
dpr_description_init(), respectively. These functions add any required
key-value pairs that the user did not supply, so that the resulting
files are valid. For the default values of each, see ?dpr_yaml_defaults
and ?dpr_description_defaults.
Building a DPR2 package
Before a package is built, a processing script must be added to the
processing directory.
Adding an example processing script
#> ---
#> title: example processing script
#> ---
#> Example text
#>
#> ```{r}
#> library(DPR2)
#> my_df <- data.frame(a = 1:5, b = letters[1:5])
#> dpr_save('my_df')
#> ```
Within processing scripts, dpr_path() is used to refer to other paths
and dpr_save() is used to save objects to the data directory. Objects
saved this way always follow DPR2 best practices, including saving as
single-object rda files.
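What a single-object rda file means can be shown with base R alone. This sketch calls save() and load() directly rather than dpr_save(), writes to a temporary directory, and makes no claim about how dpr_save() is implemented.

```r
# The same example object used in the processing script above.
my_df <- data.frame(a = 1:5, b = letters[1:5])

# Write exactly one object to an rda file named after that object.
rda_path <- file.path(tempdir(), "my_df.rda")
save(my_df, file = rda_path)

# Load into a separate environment so the file's contents can be inspected
# without touching the global environment.
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)

loaded_names                        # names of the objects stored in the file
identical(restored$my_df, my_df)    # does the round trip preserve the object?
```

Because each file holds one object, the file name is enough to tell which object it contains.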
Once a script is added to the processing directory, DPR2 must be told that the script should be processed when the package is built.
dpr_add_scripts("my_process.Rmd")
dpr_build and dpr_render
dpr_build and dpr_render are the functions used for building a data
package from processing scripts. dpr_build wraps dpr_render but in
addition may do the following:
- save the rendered processing scripts to vignettes in the data package source
- update the data digest with new checksums
- build the package tarball
- install the package in the current R environment
dpr_render can be used before dpr_build if the user doesn’t immediately
want to build the package, but instead is interested in checking whether
the package can be built successfully or whether the associated
vignettes are rendered correctly.
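A check of this kind might look like the following sketch. It assumes the working directory is an initialized data package and that process_on_build can be supplied through the ellipsis, as this vignette describes. The guard on DPR2 being installed and datapackager.yml being present is an addition for safety, not part of the DPR2 workflow.

```r
# Only attempt a render when DPR2 is installed and the working directory
# looks like an initialized data package (guard added for illustration).
can_render <- requireNamespace("DPR2", quietly = TRUE) &&
  file.exists("datapackager.yml")

if (can_render) {
  # Assumption: process_on_build is accepted via `...` and overrides the
  # value stored in datapackager.yml for this call.
  DPR2::dpr_render(process_on_build = "my_process.Rmd")
}
```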
When dpr_render is called with process_on_build supplied through the
ellipsis argument, the value of process_on_build in the datapackager.yml
file is temporarily overridden.
Build the data package
After checking that dpr_render renders the processing scripts without
error, dpr_build can be used to build and install the package. Users can
also choose to skip rendering and build the package as is.
dpr_build()
# examine data directory
list.files("data")
#> [1] "my_df.rda"
Examine the data digest
As mentioned in Introduction to DPR2, the data digest is formatted as a
directory of files. Each of these files contains the md5 checksum of an
object from the data directory as loaded in memory. This manner of
hashing is consistent with the original DataPackageR. The digest can be
viewed at any time as a single table using dpr_data_digest():
dpr_data_digest()
#>        name                  data_digest_md5
#> 1 my_df.rda ac3f4afc00920b7b099f2051a48956b1
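To make the idea of a checksum of the object in memory concrete, here is a base R sketch. object_md5 is a hypothetical helper, and the serialization it uses is an assumption, so its output is not expected to match the values reported by dpr_data_digest().

```r
# Hypothetical helper: md5 of an object's serialized in-memory form.
# tools::md5sum() hashes files, so the serialized bytes go to a temp file.
object_md5 <- function(obj) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(serialize(obj, connection = NULL), tmp)
  unname(tools::md5sum(tmp))
}

my_df        <- data.frame(a = 1:5, b = letters[1:5])
same_content <- data.frame(a = 1:5, b = letters[1:5])
changed      <- transform(my_df, a = a + 1L)

# Objects with the same content hash the same; a changed object does not.
identical(object_md5(my_df), object_md5(same_content))
identical(object_md5(my_df), object_md5(changed))
```

Hashing the object rather than the rda file means the checksum tracks changes to the data itself.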
Examine the saved data object
Once a package is built it can be installed, and its packaged data and
vignettes can be accessed using the base R functions data() and
vignette().
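As a sketch of that access pattern, assuming the data package was built and installed under the hypothetical name myDataPackage and contains the my_df object saved earlier:

```r
# Hypothetical package name; substitute the name of your own data package.
pkg <- "myDataPackage"

# Guard added for illustration so the snippet is harmless when the
# package is not installed.
if (requireNamespace(pkg, quietly = TRUE)) {
  # Load the packaged data object into the current environment.
  data("my_df", package = pkg)
  print(head(my_df))

  # List the vignettes shipped with the package.
  print(vignette(package = pkg))
}
```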
Using Git
It is highly recommended that Git be used alongside DPR2. Git version control lets users store versions of data and processing scripts that are easy to recall and revert to. DPR2 was built to allow parallel versions of an analysis to be merged together, reconciling differences using Git. DPR2 offers a handful of tools to simplify recalling different versions of a dataset, which are explained in the Data Versioning vignette. To learn how to use Git in general, see the free book Pro Git.
Using Renv
DPR2 may be initialized with renv for dependency version control, and doing so is recommended. Using renv helps address unstable and non-reproducible environments. Once the data package is initialized with renv, renv ensures that when a user installs a dependency, the same version of that dependency is used when building the data package in the future, and it informs users whether they need to install a particular version of a package dependency. This is useful when multiple users collaborate on a data package, as it keeps dependency versions the same across users. For more details on renv, refer to the renv documentation.
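For orientation, the renv calls a collaborator would typically use look like the sketch below. These are functions from the renv package itself, not from DPR2, and how DPR2 sets up renv during initialization is not shown here. The guard on renv.lock is an assumption added so the snippet does nothing outside a project that already uses renv.

```r
# Guard added for illustration: run only where renv is installed and the
# working directory already has a lockfile.
has_lockfile <- requireNamespace("renv", quietly = TRUE) &&
  file.exists("renv.lock")

if (has_lockfile) {
  # Report whether the project library and renv.lock are in sync.
  renv::status()

  # Typical follow-up calls, left commented out because they change state:
  # renv::restore()   # install the dependency versions recorded in renv.lock
  # renv::snapshot()  # record the currently used versions in renv.lock
}
```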