Lecture 3: Data and Project Management

Rasi Subramaniam

Reminders

The homework assignment is available today (October 6) and is due October 13 at 3:20pm.

Learning objectives

Identify minimum requirements for a reproducible computational project
Apply good practices for file organization
Use tidy principles for tabular (spreadsheet-style) data

Outline

Elements of reproducibility
File organization
Tidy data

This class requires Microsoft Excel or LibreOffice Calc (for opening .xlsx or .csv files).

Minimum elements for computational reproducibility

Annotated data from experiments or simulations
Documented code for data analyses
Defined software environments
Standardized organization of the above three elements

1 project = 1 GitHub repository

Code
Data*
Lab notebook
Presentations
Manuscripts
Grants & fellowships
Discussion

Different locations for data at different analysis stages

Raw data: can be very large; store in the cloud (e.g. AWS S3) or in public repositories (e.g. Zenodo, SRA)
Intermediate data: can be large or small; store in temporary scratch space
Tables underlying figures or samples: small; store in GitHub

Store sample annotations with data

Raw data without annotations cannot be analyzed

data/sample1.fastq
data/sample2.fastq
data/sample3.fastq
data/sample_annotations.tsv

sample1.fastq

@SRR21277963.1
GGAGTAACAGAAGTGAGAACCAGCTTATCAGAAAAAAAGTTTGAATTATG
+SRR21277963.1
AAGAGGGGAGGGAGGGGIAGGGGGGA.GGGGAGGGGIGIGGII<A<GGAA

sample_annotations.tsv

sample  srr_id       sample_id  sample_name
sample1 SRR21277963  104p8      dicodon_facs_off_low_2
sample2 SRR21277964  104p7      dicodon_facs_off_high_2
sample3 SRR21277965  104p6      dicodon_facs_on_low_2
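
As a minimal sketch of why the annotation table matters, the Python below links a FASTQ file name to its annotation row. The table contents are copied from above; the function names `load_annotations` and `annotate_fastq` are hypothetical, and the sketch assumes the columns are tab-separated, as the .tsv extension suggests.

```python
import csv
import io
from pathlib import PurePosixPath

# Contents of data/sample_annotations.tsv, inlined so the sketch is self-contained.
# Assumption: columns are separated by tabs.
ANNOTATIONS_TSV = (
    "sample\tsrr_id\tsample_id\tsample_name\n"
    "sample1\tSRR21277963\t104p8\tdicodon_facs_off_low_2\n"
    "sample2\tSRR21277964\t104p7\tdicodon_facs_off_high_2\n"
    "sample3\tSRR21277965\t104p6\tdicodon_facs_on_low_2\n"
)


def load_annotations(handle):
    """Return a dict mapping each `sample` value to its annotation row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: row for row in reader}


def annotate_fastq(fastq_path, annotations):
    """Look up the annotation row for a file named like data/sample1.fastq."""
    sample = PurePosixPath(fastq_path).stem  # "data/sample1.fastq" -> "sample1"
    return annotations[sample]


annotations = load_annotations(io.StringIO(ANNOTATIONS_TSV))
row = annotate_fastq("data/sample1.fastq", annotations)
print(row["srr_id"], row["sample_name"])
```

Without the table, nothing in sample1.fastq itself says which experimental condition the reads came from.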

Benefits of GitHub

Version control (file history, track changes)
Collaboration (branches, merging, issue comments, discussion)
Project management tools (project board, issues, milestones, labels)
Link to Slack channel for notifications

Example GitHub repositories:
https://github.com/rasilab/ribosome_collisions_yeast (public)
https://github.com/rasilab/bottorff_2022 (public)
https://github.com/rasilab/micropeptide_immunity (private)

Organization of GitHub repository

project_name
  |-- analysis/
  |-- experiments/
  |-- grants/
  |-- presentations/
  |-- manuscripts/
  |-- .devcontainer/
  |-- .gitignore
  |-- README.md

Use README.md to give an overview of the project and file organization
Use .gitignore to ignore files that should not be tracked by git
Use .devcontainer/ (also called .install/) to define the software environment for analysis; the .devcontainer/ name is specific to VSCode
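
One possible way to start a new project with this layout is a small scaffolding script. The sketch below is only an illustration: the folder and file names come from the tree above, while the function name `scaffold_project` and the use of a temporary directory for the demonstration are assumptions.

```python
import tempfile
from pathlib import Path

# Folder and file names are taken from the repository layout above.
FOLDERS = ["analysis", "experiments", "grants", "presentations",
           "manuscripts", ".devcontainer"]
FILES = [".gitignore", "README.md"]


def scaffold_project(root):
    """Create the top-level folders and empty placeholder files under `root`."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() creates the file if missing and leaves existing content alone
        (root / name).touch()
    return sorted(p.name for p in root.iterdir())


# Demonstration in a throwaway directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_project(Path(tmp) / "project_name")
    print(created)
```

Git does not track empty folders, so each folder shows up in the repository only once it contains at least one file.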

Organization of `analysis` folder

analysis
    |--USER
        |--ANALYSIS_TYPE (e.g. riboseq)
            |--YYYY-MM-DD_short_desc
                |--README.md
                |--data 
                    |--gencode
                        |--gencode.v26.gtf.gz
                    |--fastq
                        |--SRRnnnnnn.fastq
                |--scripts
                    |--analyze_riboseq.ipynb
                    |--download_from_sra.ipynb
                    |--run_analysis_pipeline.smk
                |--annotations 
                    |--sample_annotations.csv
                |--tables
                    |--summary_table_1.csv
                    |--summary_table_2.csv
                |--figures
                    |--summary_figure_1.pdf
                    |--summary_figure_2.pdf

.gz and .fastq files are usually listed in .gitignore
README.md should give an overview of the analysis and its data sources, and explain how to reproduce it
Ideally, every data file should be downloaded programmatically from permalinks
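
Programmatic download can be sketched with the Python standard library alone. The sketch below is written under stated assumptions: `fetch` and `sha256sum` are hypothetical helper names, the checksum step is an added safeguard not required by the lecture, and a local file:// URL stands in for a real permalink so the example runs offline.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path


def sha256sum(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url, dest, expected_sha256=None):
    """Download `url` to `dest` unless it already exists, then verify it."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    if expected_sha256 is not None and sha256sum(dest) != expected_sha256:
        raise ValueError(f"checksum mismatch for {dest}")
    return dest


# Offline demonstration: a local file:// URL stands in for a real permalink.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.txt"
    source.write_bytes(b"example contents\n")
    local_copy = fetch(source.as_uri(), Path(tmp) / "data" / "copy.txt",
                       expected_sha256=sha256sum(source))
    downloaded = local_copy.read_bytes()
```

Skipping files that already exist makes the script safe to rerun, and recording the expected checksum next to the URL helps detect a source that has changed since the analysis was first run.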

Naming conventions

Project repo: Short, descriptive, understandable
File names
- No caps, no spaces, no special characters other than _ and -
- Date format: YYYY-MM-DD
- No version numbers or author names such as rasi_v20 (Git and GitHub track file history automatically)
Experiment labels
- exp001, exp002 etc.
- Use in filenames, sample annotations, issues
Sample labels
- 16p1, 16p2, …, 20t1, 20t2, etc.
- Include experiment number (16, 20), type of sample (p, t), and sample number (1, 2)
- Use on Eppendorf tubes, lab notebook etc.
- Create a table of sample annotations in your lab notebook record
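
These conventions are regular enough to check automatically. The following is a rough sketch, not a complete validator: the regular expressions encode one reading of the rules above, dots are assumed to be allowed only as extension separators, and the function names are hypothetical. Conventional names such as README.md are exceptions that this sketch would flag, and the rule against version numbers is not checked.

```python
import re
from datetime import date

# Lowercase letters, digits, _ and -, followed by optional dot-separated
# extensions. Assumption: dots appear only as extension separators.
FILE_NAME = re.compile(r"[a-z0-9_-]+(\.[a-z0-9]+)*")
# Dated prefix in YYYY-MM-DD form, followed by an underscore.
DATE_PREFIX = re.compile(r"\d{4}-\d{2}-\d{2}_")
# Sample label: experiment number, sample type letter, sample number.
SAMPLE_LABEL = re.compile(r"(?P<experiment>\d+)(?P<type>[a-z])(?P<number>\d+)")


def is_valid_file_name(name):
    """True if the name has no caps, spaces, or special characters."""
    return FILE_NAME.fullmatch(name) is not None


def starts_with_iso_date(name):
    """True if the name starts with a real calendar date as YYYY-MM-DD_."""
    if DATE_PREFIX.match(name) is None:
        return False
    try:
        date.fromisoformat(name[:10])
    except ValueError:
        return False
    return True


def parse_sample_label(label):
    """Split a label such as 16p1 into experiment, type, and sample number."""
    match = SAMPLE_LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not a sample label: {label!r}")
    return {
        "experiment": int(match["experiment"]),
        "type": match["type"],
        "number": int(match["number"]),
    }


print(is_valid_file_name("summary_table_1.csv"))
print(starts_with_iso_date("2022-10-06_short_desc"))
print(parse_sample_label("16p1"))
```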

Tidy data

A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual.

Hadley Wickham

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table
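
To make the three principles concrete, here is a small sketch that reshapes a wide table into tidy form using only the Python standard library. The table, its gene and sample names, and the function name `wide_to_tidy` are all hypothetical and chosen purely for illustration.

```python
import csv
import io

# Hypothetical wide table: one row per gene, one column per sample.
# The column headers sample1 and sample2 are values of a variable
# ("sample"), so this table is not tidy.
WIDE_CSV = (
    "gene,sample1,sample2\n"
    "gene_a,10,12\n"
    "gene_b,0,3\n"
)


def wide_to_tidy(handle, id_column, variable_name, value_name):
    """Yield one dict per cell, so that each observation forms a row."""
    reader = csv.DictReader(handle)
    for row in reader:
        for column, value in row.items():
            if column == id_column:
                continue
            yield {
                id_column: row[id_column],
                variable_name: column,
                value_name: value,  # kept as text; convert types as needed
            }


tidy_rows = list(wide_to_tidy(io.StringIO(WIDE_CSV), "gene", "sample", "count"))
for tidy_row in tidy_rows:
    print(tidy_row)
```

In the tidy version each row is a single observation (one gene in one sample) and each of gene, sample, and count is its own column. If pandas is available, `DataFrame.melt` performs the same reshaping.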

Tidy data examples

Examples from Park & Subramaniam 2019: data, annotations
Example from Table 2 in Bedford et al. 2014, available as an Excel table in the course repo
Follow the same naming principles for column names as for file names: no caps, no spaces, and no special characters other than _ and -.

Exercise on tidy data

Split into small groups of 3-4 people, start from an HI (haemagglutination-inhibition) table, and convert it to tidy data. The data are available as an Excel table in the course repo.