This vignette demonstrates the DPR2 workflow for building data packages. We will go over key DPR2 functions and how to build a data package from start to finish.

DPR2 workflow overview

  1. Create a folder with the name of the new data package.
  2. Start an R session with that folder set as the working directory.
  3. Load the DPR2 package with library(DPR2).
  4. Initialize the data package with dpr_init().
  5. Add any necessary source data to the folder inst/extdata.
  6. Add your processing scripts to the folder processing.
  7. Build the data package with dpr_build().

In this workflow, processing scripts are written to transform source data into analysis data. Static vignettes and documentation are rendered from those scripts, and any data that is generated can be saved to the package using dpr_save().

The DPR2 YAML and API

DPR2 is designed so that all package-building configuration is stored in the package’s datapackager.yml file, which lives in the root directory of the data package source. This file is generated when the data package is initialized and is maintained by the user. The DPR2 API is built around this concept, much like the original DataPackageR. However, DPR2 also lets users override values in that file ad hoc: every function that reads the configuration YAML file accepts overrides through its ellipsis (...) argument.
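
The override pattern can be sketched in base R. This is only an illustration of the idea, not DPR2's implementation: read_config() and effective_config() are hypothetical names, the in-memory list stands in for values read from datapackager.yml, and example_option is a made-up key.

```r
# Hypothetical stand-in for the values stored in datapackager.yml.
read_config <- function() {
  list(process_on_build = "my_process.Rmd", example_option = TRUE)
}

# Hypothetical helper: anything passed through `...` replaces the stored
# value for this call only; the stored configuration is left untouched.
effective_config <- function(...) {
  overrides <- list(...)
  modifyList(read_config(), overrides)
}

stored <- effective_config()
adhoc  <- effective_config(process_on_build = "other_process.Rmd")

stored$process_on_build
adhoc$process_on_build
```

The design choice this illustrates is that the YAML file remains the single source of truth, while one-off experiments do not require editing it.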

Initializing a DPR2 package

library(DPR2)
dpr_init(yaml = dpr_yaml_init(process_on_build = 'my_process.Rmd'))

There are two ways to initialize a new DPR2 package: dpr_init and dpr_create. dpr_init is a wrapper around dpr_create: dpr_init creates a new DPR2 package in the current working directory, whereas dpr_create creates one at a specified path. When a package is initialized, DPR2 checks whether the files it needs are already present and creates any that are missing. This is handy when a package already exists and a user would like to maintain it using DPR2.

The datapackager.yml and DESCRIPTION files of a DPR2 package can be initialized with specific values using the yaml and desc arguments. Each argument takes a named list generated by dpr_yaml_init() and dpr_description_init(), respectively. These functions add any required key-value pairs that the user did not supply, ensuring the resulting files are valid. For the default values, see ?dpr_yaml_defaults and ?dpr_description_defaults.
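
A sketch of supplying both arguments at once might look like the following. It assumes DPR2 is installed and that dpr_description_init() accepts DESCRIPTION fields as key-value pairs in the same way dpr_yaml_init() accepts YAML keys; the Title and Version values are invented for illustration, and init_example_package() is a hypothetical wrapper used here so that nothing is created until it is called.

```r
# Assumption: both init helpers take key-value pairs through `...`.
# The DESCRIPTION values below are placeholders, not DPR2 defaults.
init_example_package <- function() {
  DPR2::dpr_init(
    yaml = DPR2::dpr_yaml_init(process_on_build = "my_process.Rmd"),
    desc = DPR2::dpr_description_init(
      Title = "An Example Data Package",
      Version = "0.1.0"
    )
  )
}

# Call from the new package folder, set as the working directory:
# init_example_package()
```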

Building a DPR2 package

Before a package is built, a processing script must be added to the processing directory.

Adding an example processing script

cat(
  readLines(
    dpr_path('processing', 'my_process.Rmd')
  ), sep = '\n'
)
#> ---
#> title: example processing script
#> ---
#> Example text
#> 
#> ```{r}
#> library(DPR2)
#> my_df <- data.frame(a = 1:5, b = letters[1:5])
#> dpr_save('my_df')
#> ```
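
The listing above prints the script's contents. One way to create such a file with base R is sketched below; a temporary folder stands in for the data package root (an assumption made so the sketch does not touch a real package), and the code fence is built with strrep() only to avoid writing literal backtick fences inside this example.

```r
# A temporary folder stands in for the data package root.
pkg_root <- file.path(tempdir(), "my_data_package")
dir.create(file.path(pkg_root, "processing"),
           recursive = TRUE, showWarnings = FALSE)

fence <- strrep("`", 3)  # the three-backtick R Markdown chunk delimiter

script_lines <- c(
  "---",
  "title: example processing script",
  "---",
  "Example text",
  "",
  paste0(fence, "{r}"),
  "library(DPR2)",
  "my_df <- data.frame(a = 1:5, b = letters[1:5])",
  "dpr_save('my_df')",
  fence
)

# Write the script into the processing directory.
script_path <- file.path(pkg_root, "processing", "my_process.Rmd")
writeLines(script_lines, script_path)
```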

dpr_path() is used to refer to other paths within processing scripts and dpr_save() to save objects to the data directory. Objects saved this way always follow DPR2 best practices, including saving as single-object rda files.
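
To make "single-object rda file" concrete, here is a base R sketch of the convention. It uses save() and load() directly and a temporary path; it illustrates the file layout only and is not how dpr_save() is implemented.

```r
my_df <- data.frame(a = 1:5, b = letters[1:5])

# Save exactly one object to one .rda file named after that object.
rda_path <- file.path(tempdir(), "my_df.rda")
save(my_df, file = rda_path)

# Load into a fresh environment to inspect what the file contains.
loaded <- new.env()
object_names <- load(rda_path, envir = loaded)
object_names
```

Keeping one object per file means each file name maps to exactly one dataset, which is what lets data('my_df') load a single, predictable object.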

Once a script is added to the processing directory, DPR2 must be told that the script should be processed when the package is built.

dpr_add_scripts("my_process.Rmd")

dpr_build and dpr_render

dpr_build and dpr_render are the functions used for building a data package from processing scripts. dpr_build wraps dpr_render but in addition may do the following:

  • save the rendered processing scripts to vignettes in the data package source
  • update the data digest with new checksums
  • build the package tarball
  • install the package in the current R environment

dpr_render can be used before dpr_build if the user doesn’t immediately want to build the package, but instead is interested in checking if the package can be built successfully or if the associated vignettes are rendered correctly.

A call to dpr_render can make use of the ellipsis argument: passing process_on_build directly to dpr_render temporarily overrides the value of process_on_build in the datapackager.yml file for that call.
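
The dpr_render call is not reproduced in this text. A call of that shape might look like the sketch below, which assumes the override argument carries the same name as the YAML key (as it does in the dpr_yaml_init() call earlier). The guard is an addition for safety: it runs the call only when DPR2 is installed and the working directory contains datapackager.yml.

```r
# Only attempt the render from the root of an initialized data package.
can_render <- requireNamespace("DPR2", quietly = TRUE) &&
  file.exists("datapackager.yml")

if (can_render) {
  # Overrides process_on_build for this call only; the YAML file is unchanged.
  DPR2::dpr_render(process_on_build = "my_process.Rmd")
}
```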

Build the data package

After confirming that dpr_render renders the processing scripts without error, use dpr_build to build and install the package. Users can also choose to skip rendering and build the package as is.

dpr_build()

# examine data directory
list.files("data")
#> [1] "my_df.rda"

Examine the data digest

As mentioned in Introduction to DPR2, the data digest is stored as a directory of files. Each file contains the md5 checksum of a data object, computed on the object as loaded into memory from the data directory. This manner of hashing is consistent with the original DataPackageR. The digest can be viewed at any time as a single table using dpr_data_digest().

dpr_data_digest()
#>        name                  data_digest_md5
#> 1 my_df.rda ac3f4afc00920b7b099f2051a48956b1
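
The idea of hashing an object as it exists in memory, rather than hashing the rda file on disk, can be illustrated in base R. This is a rough sketch under stated assumptions: md5_of_object() is a hypothetical helper, and because DPR2's exact hashing procedure is not described here, the value it produces should not be expected to match the checksum shown above.

```r
# Hypothetical helper: serialize the in-memory object to a temporary file
# and take the MD5 of those bytes.
md5_of_object <- function(obj) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(serialize(obj, connection = NULL), tmp)
  unname(tools::md5sum(tmp))
}

my_df <- data.frame(a = 1:5, b = letters[1:5])
checksum <- md5_of_object(my_df)

# A change in the data yields a different checksum.
changed <- md5_of_object(transform(my_df, a = a + 1L))
```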

Examine the saved data object

Once a package is built, it can be installed, and its packaged data and vignettes can then be accessed using the data() and vignette() functions.

data('my_df')
vignette(package="my_data_package")

Using Git

Using git alongside DPR2 is highly recommended. Git version control lets users store versions of data and processing scripts that are easy to recall and revert to. DPR2 was built so that parallel versions of an analysis can be merged together, with differences reconciled using git. DPR2 also offers a handful of tools that simplify recalling different versions of a dataset; these are explained in the Data Versioning vignette. To learn how to use git in general, see the free book Pro Git.

Using renv

DPR2 may be initialized with renv for dependency version control, and doing so is recommended. Using renv after initializing a DPR2 package helps avoid unstable and non-reproducible environments. Once the data package is initialized with renv, renv records the version of each dependency a user installs, uses that same version when the data package is built in the future, and informs users when they need to install a particular version of a dependency. This is especially useful when multiple users collaborate on a data package, because it keeps dependency versions the same across users. For more details, refer to the renv documentation.
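
A minimal sketch of that renv workflow, run after the data package has been initialized, is shown below. It assumes the renv package is installed and calls renv directly rather than through any DPR2 option; use_renv() and restore_renv() are hypothetical wrappers so that nothing is changed until they are called.

```r
# For the package maintainer: set up renv in the data package root.
use_renv <- function(pkg_root = ".") {
  renv::init(project = pkg_root)      # create a project library and renv.lock
  renv::snapshot(project = pkg_root)  # record the dependency versions in use
}

# For a collaborator working from a copy of the data package source.
restore_renv <- function(pkg_root = ".") {
  renv::status(project = pkg_root)    # report differences from renv.lock
  renv::restore(project = pkg_root)   # install the versions in renv.lock
}

# use_renv()      # run once in the package root
# restore_renv()  # run by each collaborator
```

Committing renv.lock to git alongside the processing scripts keeps the recorded dependency versions tied to each version of the data package.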