The homework assignment is available today (October 6) and is due October 13 at 3:20pm.
tidy principles for tabular (spreadsheet-style) data
This class requires Microsoft Excel or LibreOffice Calc (for opening .xlsx or .csv files).
Raw data without annotations cannot be analyzed
data/sample1.fastq
data/sample2.fastq
data/sample3.fastq
data/sample_annotations.tsv
sample1.fastq
@SRR21277963.1
GGAGTAACAGAAGTGAGAACCAGCTTATCAGAAAAAAAGTTTGAATTATG
+SRR21277963.1
AAGAGGGGAGGGAGGGGIAGGGGGGA.GGGGAGGGGIGIGGII<A<GGAA
sample_annotations.tsv
sample srr_id sample_id sample_name
sample1 SRR21277963 104p8 dicodon_facs_off_low_2
sample2 SRR21277964 104p7 dicodon_facs_off_high_2
sample3 SRR21277965 104p6 dicodon_facs_on_low_2
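The pairing of raw files with their annotations can be sketched in Python. This is a minimal illustration, assuming the annotation table is tab-separated with the header shown above; the helper name `load_annotations` is hypothetical, and the table is embedded as a string so the snippet runs without the actual files.

```python
import csv
import io
from pathlib import PurePosixPath

# Annotation rows copied from sample_annotations.tsv above (assumed tab-separated).
ANNOTATIONS_TSV = (
    "sample\tsrr_id\tsample_id\tsample_name\n"
    "sample1\tSRR21277963\t104p8\tdicodon_facs_off_low_2\n"
    "sample2\tSRR21277964\t104p7\tdicodon_facs_off_high_2\n"
    "sample3\tSRR21277965\t104p6\tdicodon_facs_on_low_2\n"
)


def load_annotations(handle):
    """Return a dict mapping each value in the `sample` column to its row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: row for row in reader}


annotations = load_annotations(io.StringIO(ANNOTATIONS_TSV))

fastq_files = ["data/sample1.fastq", "data/sample2.fastq", "data/sample3.fastq"]

# Look up each raw file by its base name (e.g. "sample1") to find out what it contains.
labeled = {}
for path in fastq_files:
    sample = PurePosixPath(path).stem
    labeled[path] = annotations[sample]["sample_name"]
    print(path, annotations[sample]["srr_id"], labeled[path], sep="\t")
```

Without the annotation table, the file name `sample1.fastq` alone says nothing about the experimental condition; the lookup is what makes the raw data interpretable.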
Example GitHub repositories:
https://github.com/rasilab/ribosome_collisions_yeast (public)
https://github.com/rasilab/bottorff_2022 (public)
https://github.com/rasilab/micropeptide_immunity (private)
project_name
|-- analysis/
|-- experiments/
|-- grants/
|-- presentations/
|-- manuscripts/
|-- .devcontainer/
|-- .gitignore
|-- README.md
README.md to give an overview of the project and file organization
.gitignore to ignore files that should not be tracked by git
.devcontainer/ to define the software environment for analysis (specific to VSCode), also called .install/
analysis folder
analysis
|--USER
   |--ANALYSIS_TYPE (e.g. riboseq)
      |--YYYY-MM-DD_short_desc
         |--README.md
         |--data
            |--gencode
               |--gencode.v26.gtf.gz
            |--fastq
               |--SRRnnnnnn.fastq
         |--scripts
            |--analyze_riboseq.ipynb
            |--download_from_sra.ipynb
            |--run_analysis_pipeline.smk
         |--annotations
            |--sample_annotations.csv
         |--tables
            |--summary_table_1.csv
            |--summary_table_2.csv
         |--figures
            |--summary_figure_1.pdf
            |--summary_figure_2.pdf
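A folder skeleton like the one above can be created with a few lines of Python. This is only a sketch: the function name `create_analysis_folder`, the placeholder user name, and the example date are hypothetical, and the snippet writes into a temporary directory so it does not touch a real project.

```python
import tempfile
from datetime import date
from pathlib import Path

# Subfolders taken from the layout above.
SUBFOLDERS = ["data", "scripts", "annotations", "tables", "figures"]


def create_analysis_folder(root, user, analysis_type, short_desc, day=None):
    """Create analysis/USER/ANALYSIS_TYPE/YYYY-MM-DD_short_desc with its subfolders."""
    day = day or date.today()
    folder = (
        Path(root) / "analysis" / user / analysis_type
        / f"{day.isoformat()}_{short_desc}"  # isoformat() gives YYYY-MM-DD
    )
    for name in SUBFOLDERS:
        (folder / name).mkdir(parents=True, exist_ok=True)
    (folder / "README.md").touch()  # empty placeholder, to be filled in by hand
    return folder


# Hypothetical user, description, and date, chosen only for illustration.
with tempfile.TemporaryDirectory() as tmp:
    folder = create_analysis_folder(
        tmp, "user", "riboseq", "short_desc", date(2022, 10, 6)
    )
    folder_name = folder.name
    created = sorted(p.name for p in folder.iterdir())
    print(folder_name, created)
```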
.gz and .fastq files are usually in .gitignore
README.md should give an overview of the analysis, data source etc. and how to reproduce it
No caps, no spaces, no special characters; use only _ and - if necessary
YYYY-MM-DD
rasi_v20 (GitHub does this automatically)
exp001, exp002 etc.
16p1, 16p2 … 20t1, 20t2 etc…
A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual.
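The data-matrix layout described above (one row per observation, one column per variable) can be illustrated by reshaping a small wide table. This is a minimal sketch using only the standard library; the function name `to_tidy`, the virus and serum names, and the titer values are all made up for illustration and do not come from any of the cited datasets.

```python
# Hypothetical wide table: one row per virus, one column per serum.
# Names and titer values are invented for illustration only.
wide_rows = [
    {"virus": "virus_a", "serum_1": 640, "serum_2": 80},
    {"virus": "virus_b", "serum_1": 160, "serum_2": 320},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Turn each non-id column of a wide table into its own row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy


tidy_rows = to_tidy(wide_rows, "virus", "serum", "titer")
for tidy_row in tidy_rows:
    print(tidy_row)
```

In the wide form the serum identity is hidden in the column headers; in the tidy form `virus`, `serum`, and `titer` are each a column, and every row is a single measurement.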
Examples from Park & Subramaniam 2019: data, annotations
Example from Table 2 in Bedford et al. 2014, available as an Excel table in the course repo
Follow the same naming principles for columns as for files: no caps, no spaces, no special characters, use only _ and - if necessary.
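These naming principles can be applied mechanically. The helper below is a hypothetical sketch (the name `clean_name` and the example inputs are assumptions, not part of the course material): it lowercases a name, replaces every run of characters other than ASCII lowercase letters, digits, `_`, and `-` with a single `_`, and trims leading or trailing underscores.

```python
import re


def clean_name(name):
    """Return a lowercase name containing only a-z, 0-9, _ and -."""
    name = name.strip().lower()
    # Replace each run of disallowed characters (spaces, brackets, etc.) with "_".
    name = re.sub(r"[^a-z0-9_-]+", "_", name)
    # Collapse repeated underscores and drop any at the ends.
    name = re.sub(r"_+", "_", name)
    return name.strip("_")


for raw in ["Sample Name", "Titer (HI)", "Collection-Date", "srr_id"]:
    print(raw, "->", clean_name(raw))
```

Note that this sketch also replaces non-ASCII letters with `_`, which may or may not be what a given dataset needs.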
Split into small groups of 3-4 people to work from an HI (haemagglutination-inhibition) table and convert to tidy data. Data available as an Excel table in the course repo.