Downloading and Preprocessing TCGA Data with the R Programming Language

Download clinical information and omics data profiles from the TCGA project

Introduction

The TCGA project has produced multi-omics data for cancer research, and these public data can be accessed from R. This pipeline downloads the clinical information and the omics data profiles, then reshapes them into a form that analysis models can use directly.

All the scripts are stored in my GitHub repository named “bioinformatics_pipeline”.

How it works

The pipeline is split into four scripts, one per stage:

  1. 01.Download.R: downloads the clinical information and saves the omics data as SummarizedExperiment objects

  2. 02.Prerpocess_clinical.R: processes the clinical information

  3. 03.Prerpocess_omics.R: processes the omics data profiles

  4. 04.ExpressionSet.R: combines the phenotype from the clinical information with an omics profile into an ExpressionSet

Materials

Two reference files are needed. The copy number variation (CNV) step requires the CNV marker reference file from GDC, stored as util/snp6.na35.remap.hg38.subset.txt.gz, which must be downloaded beforehand. In addition, the util/human_gene_all.tsv file, generated with the biomaRt R package, is used to convert Ensembl gene IDs into gene symbols, which is convenient for downstream analysis.
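Because later steps depend on these two files, it can help to confirm they are in place before starting. The following is a minimal sketch, not part of the original scripts: `check_reference_files` is a hypothetical helper name, and it only checks that each path exists and is non-empty.

```shell
# Hypothetical helper (not in the repository): report missing reference files.
# Prints each missing or empty path to stderr and returns non-zero if any is absent.
check_reference_files() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (commented out so the sketch has no side effects):
# check_reference_files util/snp6.na35.remap.hg38.subset.txt.gz util/human_gene_all.tsv || exit 1
```

Run from the pipeline root, the example call stops a script early instead of letting a preprocessing step fail midway.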

Command line

Because these R scripts are run in a Linux environment, the following commands show how to execute the pipeline, using TCGA-PAAD as the example.

# step1: Download the multiple omics data from TCGA
Rscript 01.Download.R -t TCGA-PAAD -o CNV
Rscript 01.Download.R -t TCGA-PAAD -o mRNA
Rscript 01.Download.R -t TCGA-PAAD -o miRNA
Rscript 01.Download.R -t TCGA-PAAD -o DNA_Methylation

# step2: Curate the original raw data, including clinical information and omics data
Rscript 02.Prerpocess_clinical.R -p Clinical/TCGA-PAAD_clinical_origin.csv -t TCGA-PAAD

Rscript 03.Prerpocess_omics.R -e Omics/TCGA-PAAD_CNV.RDS -t CNV -g util/snp6.na35.remap.hg38.subset.txt.gz -p TCGA-PAAD-post
Rscript 03.Prerpocess_omics.R -e Omics/TCGA-PAAD_mRNA.RDS -t mRNA -g util/human_gene_all.tsv -p TCGA-PAAD-post
Rscript 03.Prerpocess_omics.R -e Omics/TCGA-PAAD_miRNA.RDS -t miRNA -p TCGA-PAAD-post
Rscript 03.Prerpocess_omics.R -e Omics/TCGA-PAAD_DNA_Methylation.RDS -t DNA_Methylation -p TCGA-PAAD-post

# step3: Convert clinical information and Gene expression data into ExpressionSet for further analysis
Rscript 04.ExpressionSet.R -p Clinical/TCGA-PAAD-post_clinical.csv -e Clean/TCGA-PAAD-post_mRNA_profile.tsv -t mRNA -o TCGA-PAAD-post
Rscript 04.ExpressionSet.R -p Clinical/TCGA-PAAD-post_clinical.csv -e Clean/TCGA-PAAD-post_miRNA_PKM.tsv -t miRNA -o TCGA-PAAD-post-PKM
Rscript 04.ExpressionSet.R -p Clinical/TCGA-PAAD-post_clinical.csv -e Clean/TCGA-PAAD-post_miRNA_count.tsv -t miRNA -o TCGA-PAAD-post-count

Note: The TCGAbiolinks R package reaches the TCGA data through the GDC API, and download requests are sometimes rejected. The workaround is to rerun the download until it succeeds.
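The retry advice can be automated with a small wrapper. This is a sketch under assumptions rather than part of the original pipeline: `retry`, `MAX_TRIES`, and `WAIT_SECONDS` are hypothetical names, and the attempt cap is an addition so that a persistent failure does not loop forever.

```shell
# Hypothetical helper (not in the repository): run a command until it
# succeeds or the attempt limit is reached.
# MAX_TRIES and WAIT_SECONDS are illustrative defaults, not tuned values.
retry() {
  max_tries="${MAX_TRIES:-5}"
  wait_seconds="${WAIT_SECONDS:-30}"
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_tries" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${wait_seconds}s" >&2
    attempt=$((attempt + 1))
    sleep "$wait_seconds"
  done
}

# Example (commented out so the sketch has no side effects):
# for omic in CNV mRNA miRNA DNA_Methylation; do
#   retry Rscript 01.Download.R -t TCGA-PAAD -o "$omic"
# done
```

The wrapper relies on the wrapped command returning a non-zero exit status on failure; whether 01.Download.R does so when a download is rejected is an assumption to verify.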

Summary

The clinical information covers every patient’s phenotype, but it carries no information about the normal samples that appear in the omics data profiles. That extra information about normal samples is therefore taken from the SummarizedExperiment object. Care is also needed when separating tumor from normal samples: the pipeline reads the two-digit code at characters 14 to 15 of each sample barcode in the SummarizedExperiment object, labels “01” as tumor, and labels every other code as normal. Strictly, the TCGA barcode convention reserves codes 01 to 09 for tumor sample types and 10 to 19 for normal ones, so this rule is a simplification.
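The barcode rule is easy to check on the command line. The sketch below applies the rule exactly as this post states it (“01” is tumor, anything else is normal); `sample_type` is a hypothetical helper name, and the barcodes in the example are made-up placeholders in TCGA format, not real samples.

```shell
# Hypothetical helper (not in the repository): label a TCGA sample barcode
# as tumor or normal from the two-digit code at characters 14-15,
# using the rule from this post ("01" is tumor, any other code is normal).
sample_type() {
  code=$(printf '%s' "$1" | cut -c14-15)
  if [ "$code" = "01" ]; then
    echo "tumor"
  else
    echo "normal"
  fi
}

# Example with placeholder barcodes (commented out to avoid side effects):
# sample_type TCGA-XX-0000-01A-11R-0000-07
# sample_type TCGA-XX-0000-11A-11R-0000-07
```

In the first placeholder, characters 14 to 15 are “01”; in the second they are “11”, which the rule treats as normal.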

File Structure

The following file tree shows the complete result of downloading, preprocessing, and integrating the TCGA-PAAD project data with R.

The final files for further analysis are those in the Clean and Clinical directories with the RDS suffix or with “post” in their names.
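To pick out the final files after a run, a `find` filter is enough. This is a minimal sketch assuming the directory layout described in this post; `list_final_files` is a hypothetical helper name, and matching on “post” anywhere in the file name is one interpretation of the naming rule.

```shell
# Hypothetical helper (not in the repository): list the final files under a
# pipeline root, i.e. files in Clean/ and Clinical/ whose names end in .RDS
# or contain "post". Output is sorted for stable reading.
list_final_files() {
  root="${1:-.}"
  find "$root/Clean" "$root/Clinical" -type f \
    \( -name '*.RDS' -o -name '*post*' \) | sort
}

# Example (commented out so the sketch has no side effects):
# list_final_files TCGA_download_pipeline
```

Intermediate files such as Clinical/TCGA-PAAD_clinical_origin.csv are left out because they match neither pattern.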

TCGA_download_pipeline
├── 01.Download.R
├── 02.Prerpocess_clinical.R
├── 03.Prerpocess_omics.R
├── 04.ExpressionSet.R
├── CNV
│ └── TCGA-PAAD
│     └── harmonized
├── Clean
│ ├── TCGA-PAAD-post-PKM_miRNA_ExprSet_clinical.RDS
│ ├── TCGA-PAAD-post-count_miRNA_ExprSet_clinical.RDS
│ ├── TCGA-PAAD-post_CNV_tumor.tsv
│ ├── TCGA-PAAD-post_CNV_untumor.tsv
│ ├── TCGA-PAAD-post_DNA_Methylation_ExprSet.RDS
│ ├── TCGA-PAAD-post_DNA_Methylation_clinical.csv
│ ├── TCGA-PAAD-post_DNA_Methylation_profile.tsv
│ ├── TCGA-PAAD-post_mRNA_ExprSet.RDS
│ ├── TCGA-PAAD-post_mRNA_ExprSet_clinical.RDS
│ ├── TCGA-PAAD-post_mRNA_clinical.csv
│ ├── TCGA-PAAD-post_mRNA_profile.tsv
│ ├── TCGA-PAAD-post_miRNA_PKM.tsv
│ ├── TCGA-PAAD-post_miRNA_count.tsv
│ └── snp6.na35.remap.hg38.tsv
├── Clinical
│ ├── TCGA-PAAD-post_clinical.csv
│ └── TCGA-PAAD_clinical_origin.csv
├── DNA_Methylation
│ └── TCGA-PAAD
│     └── legacy
├── MANIFEST.txt
├── Omics
│ ├── TCGA-PAAD_CNV.RDS
│ ├── TCGA-PAAD_DNA_Methylation.RDS
│ ├── TCGA-PAAD_mRNA.RDS
│ └── TCGA-PAAD_miRNA.RDS
├── README.md
├── mRNA
│ └── TCGA-PAAD
│     └── harmonized
├── miRNA
│ └── TCGA-PAAD
│     └── harmonized
├── util
│ ├── human_gene_all.tsv
│ └── snp6.na35.remap.hg38.subset.txt.gz
└── work.sh
