Module 3: A Real-World Workflow

Walkthrough of the ww-sra-salmon pipeline — downloading RNA-seq data and quantifying with Salmon.


Everything so far has been relatively straightforward, but you might be wondering: why would I use WDL over a shell script? The true benefits of WDL become clear when you start to scale things up — in both complexity and volume. Let's take a look at a real-world example to get a better idea of that.

The Big Picture

Don't worry too much about the exact bioinformatics context here. The general idea is:

  1. Download a list of datasets from a public database
  2. Post-process each dataset automatically
  3. Save the final results to a location of our choosing

For the bioinformatics nerds out there: we're downloading raw RNA sequencing data from SRA and quantifying it with Salmon.

Walking Through the Workflow

Here's the workflow (simplified for readability — we'll highlight the key parts):

version 1.0

import ".../ww-sra.wdl" as sra_tasks
import ".../ww-salmon.wdl" as salmon_tasks

workflow sra_salmon {
  input {
    Array[String] sra_id_list
    File transcriptome_fasta
    Int ncpu = 8
    Int memory_gb = 16
  }

  call salmon_tasks.build_index { input:
      transcriptome_fasta = transcriptome_fasta,
      ...
  }

  scatter (id in sra_id_list) {
    call sra_tasks.fastqdump { input:
        sra_id = id,
        ...
    }

    call salmon_tasks.quantify { input:
        salmon_index_dir = build_index.salmon_index,
        fastq_r1 = fastqdump.r1_end,
        fastq_r2 = fastqdump.r2_end,
        ...
    }
  }

  call salmon_tasks.merge_results { input:
      salmon_quant_dirs = quantify.salmon_quant_dir,
      ...
  }

  output {
    Array[File] salmon_quant_dirs = quantify.salmon_quant_dir
    File tpm_matrix = merge_results.tpm_matrix
    File counts_matrix = merge_results.counts_matrix
  }
}

Let's walk through it:

Inputs

We start by defining our inputs — in particular, the dataset IDs we want to download (sra_id_list) and a reference transcriptome to align against during post-processing (transcriptome_fasta). Notice that sra_id_list is an Array[String] — a list of IDs, not just a single one. This becomes important later.
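When you actually run the workflow, these inputs are typically supplied as a JSON file whose keys are namespaced by the workflow name. A hypothetical example (the accessions and file path below are placeholders, not real data):

```json
{
  "sra_salmon.sra_id_list": ["SRR000001", "SRR000002", "SRR000003"],
  "sra_salmon.transcriptome_fasta": "/path/to/transcriptome.fa",
  "sra_salmon.ncpu": 8,
  "sra_salmon.memory_gb": 16
}
```

Note that ncpu and memory_gb have defaults in the workflow, so they can be omitted entirely; sra_id_list and transcriptome_fasta have no defaults and must be provided.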

Task Calls and Data Flow

From there, we call tasks to do the actual work. Each task's outputs can be referenced directly as inputs to later tasks: for example, fastqdump.r1_end (the downloaded read data) feeds straight into salmon_tasks.quantify. This creates a clear chain of logic: download → quantify → merge results.
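The fastqdump.r1_end syntax works because the imported task declares named outputs. Here's a simplified sketch of what such a task might look like (the actual task in ww-sra.wdl may differ in its details; the file names assume fasterq-dump's default naming):

```wdl
task fastqdump {
  input {
    String sra_id
  }

  command <<<
    # Download the accession and split it into paired-end FASTQ files
    fasterq-dump --split-files ~{sra_id}
  >>>

  output {
    # These names are what callers reference as fastqdump.r1_end / r2_end
    File r1_end = "~{sra_id}_1.fastq"
    File r2_end = "~{sra_id}_2.fastq"
  }
}
```

Any declaration in a task's output block becomes addressable from the workflow as call_name.output_name.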

Importing Modules

But where are those tasks defined? You'll notice we're not defining them in this file. Instead, we import them from external WDL scripts:

import ".../ww-sra.wdl" as sra_tasks
import ".../ww-salmon.wdl" as salmon_tasks

You can import tasks from other local files or even directly from GitHub repositories. In fact, the WILDS WDL Library maintains a large collection of pre-built, well-tested tasks that you can import into your own workflows. If you're familiar with Python, this is very similar to importing modules — or loading libraries in R. For more details on what's available, see the WILDS WDL Library documentation on SciWiki.
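An import from a remote repository looks just like a local one, but with a URL (the URL below is a placeholder for illustration, not the library's actual location):

```wdl
# Local file, relative to this WDL script
import "tasks/ww-sra.wdl" as sra_tasks

# Remote file, fetched over HTTPS (hypothetical URL)
import "https://raw.githubusercontent.com/example-org/example-repo/main/ww-sra.wdl" as sra_tasks
```

The as alias determines how you reference the imported tasks, e.g. sra_tasks.fastqdump.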

Outputs

Finally, we specify which outputs to save. In this case, we're interested in the quantification results (both per-sample directories and merged matrices across all samples). You could easily add intermediate outputs here too if you wanted to keep the raw sequencing data.
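For instance, keeping the raw sequencing data would just mean adding the download task's outputs to the output block (assuming the fastqdump task exposes outputs named r1_end and r2_end, as in the calls above):

```wdl
output {
  Array[File] salmon_quant_dirs = quantify.salmon_quant_dir
  File tpm_matrix = merge_results.tpm_matrix
  File counts_matrix = merge_results.counts_matrix
  # Also keep the raw downloaded reads, one pair per sample
  Array[File] raw_fastq_r1 = fastqdump.r1_end
  Array[File] raw_fastq_r2 = fastqdump.r2_end
}
```

Anything not listed here is treated as intermediate and may be cleaned up by the runner, so the output block is effectively your "keep this" list.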


Scaling with Scatter-Gather

All good so far, but this would only process a single dataset. What if you have a list of 100 datasets that you want to download and process? How do you make that scale?

With WDL, scaling is actually pretty straightforward using what's called a scatter-gather pattern. It's similar to a for loop, but instead of running iterations one after another, each iteration can be dispatched to its own compute node so they all run in parallel:

scatter (id in sra_id_list) {
  call sra_tasks.fastqdump { input:
      sra_id = id,
      ...
  }

  call salmon_tasks.quantify { input:
      fastq_r1 = fastqdump.r1_end,
      fastq_r2 = fastqdump.r2_end,
      ...
  }
}

Since we provided sra_id_list as an array of IDs, the workflow automatically scatters across the entire list and sends each sample to its own node for downloading and quantification in parallel. So if you have 100 samples that each take an hour, 100 hours sequentially turns into ~1 hour in parallel. The only limit is the amount of resources available to you, and on the Fred Hutch cluster, you have quite a lot of nodes to play with.

The workflow also nicely takes care of the handoff between tasks — when sample A finishes downloading, WDL automatically hands off the results to the quantification step. No need to wait for the rest of the samples to finish downloading first, and no need to manually babysit.

After all the scattered tasks complete, the results are automatically gathered back together. That's why the merge_results task (which sits outside the scatter block) can receive the full array of quantification outputs and combine them into a single matrix.
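In WDL terms, an output referenced inside the scatter block has its plain type, while the very same reference outside the block is implicitly an array with one element per iteration. A sketch (using the same elision style as the workflow above):

```wdl
scatter (id in sra_id_list) {
  # Inside the block, quantify.salmon_quant_dir is a single File
  call salmon_tasks.quantify { input:
      ...
  }
}

# Outside the block, quantify.salmon_quant_dir is an Array[File],
# one element per scattered id -- exactly what merge_results receives
```

This implicit promotion is the "gather" half of the pattern: no explicit collection step is needed.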


In the next section, we'll look at how to customize workflows like this one to suit your specific research needs.


Previous: ← Hello World Workflow | Next: Customizing Workflows →