Downloading and preprocessing TCGA data with the R programming language
Download clinical information and omics data profiles from the TCGA project
The TCGA project has produced multiple types of omics data for cancer research, and the public data can be accessed from R. This pipeline acquires the clinical information and omics data profiles and organizes them into a form suitable for analysis models.
All the scripts are stored in my GitHub repository named “bioinformatics_pipeline”.
How it works
Four scripts split the pipeline into four parts:
01.Download.R: download the clinical information and save the omics data as a SummarizedExperiment object
02.Prerpocess_clinical.R: process the clinical information
03.Prerpocess_omics.R: process the omics data profiles
04.ExpressionSet.R: combine the phenotype from the clinical information and the profile into an ExpressionSet
To obtain the CNV markers for the copy number variation data, a reference file must be downloaded from GDC; it is named util/snp6.na35.remap.hg38.subset.txt.gz. In addition, the util/human_gene_all.tsv file, generated with the biomaRt R package, is used to convert Ensembl gene IDs into gene symbols, which is convenient for subsequent analysis.
These R scripts are run in a Linux environment. The following commands show how to execute the pipeline, using TCGA-PAAD as the example.
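The ID conversion described above can be sketched in base R. This is a minimal illustration, not the pipeline's actual code: the lookup table below is a toy stand-in for util/human_gene_all.tsv, and its column names (ensembl_gene_id, gene_symbol) and the helper ensembl_to_symbol are assumptions. The version suffix is stripped because Ensembl IDs in GDC files often carry one.

```r
# Toy lookup table (hypothetical column names); in practice this would be
# read from util/human_gene_all.tsv, whose real layout may differ.
gene_map <- data.frame(
  ensembl_gene_id = c("ENSG00000141510", "ENSG00000133703"),
  gene_symbol     = c("TP53", "KRAS"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map Ensembl gene IDs to symbols.
# IDs that are absent from the table come back as NA.
ensembl_to_symbol <- function(ids, map) {
  # Drop a trailing version suffix, e.g. "ENSG00000141510.15"
  bare <- sub("\\.[0-9]+$", "", ids)
  map$gene_symbol[match(bare, map$ensembl_gene_id)]
}

ids <- c("ENSG00000141510.15", "ENSG00000133703", "ENSG00000000000")
symbols <- ensembl_to_symbol(ids, gene_map)
print(symbols)
```

Using match() keeps the output in the same order as the input IDs, so the result can be attached directly as a column next to the expression matrix rows.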
# step3: Convert clinical information and Gene expression data into ExpressionSet for further analysis
Rscript 04.ExpressionSet.R -p Clinical/TCGA-PAAD-post_clinical.csv -e Clean/TCGA-PAAD-post_mRNA_profile.tsv -t mRNA -o TCGA-PAAD-post
Rscript 04.ExpressionSet.R -p Clinical/TCGA-PAAD-post_clinical.csv -e Clean/TCGA-PAAD-post_miRNA_PKM.tsv -t miRNA -o TCGA-PAAD-post-PKM
Rscript 04.ExpressionSet.R -p Clinical/TCGA-PAAD-post_clinical.csv -e Clean/TCGA-PAAD-post_miRNA_count.tsv -t miRNA -o TCGA-PAAD-post-count
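What the step3 commands do conceptually is pair a phenotype table with an expression profile. The sketch below shows one plausible way to do this, not the actual contents of 04.ExpressionSet.R: the toy inputs, the helper align_samples, and the assumption that samples are matched by name are all illustrative. The final ExpressionSet is built only if the Biobase package is installed.

```r
# Toy inputs standing in for the "-p" (phenotype) and "-e" (profile) files.
pheno <- data.frame(
  Group = c("tumor", "tumor", "normal"),
  row.names = c("S1", "S2", "S3"),
  stringsAsFactors = FALSE
)
expr <- matrix(
  1:8, nrow = 2,
  dimnames = list(c("geneA", "geneB"), c("S3", "S1", "S4", "S2"))
)

# Hypothetical helper: keep only samples present in both tables,
# in the same order, so rows of pheno line up with columns of expr.
align_samples <- function(pheno, expr) {
  shared <- intersect(rownames(pheno), colnames(expr))
  list(
    pheno = pheno[shared, , drop = FALSE],
    expr  = expr[, shared, drop = FALSE]
  )
}

aligned <- align_samples(pheno, expr)

# Build the ExpressionSet only when Biobase is available.
if (requireNamespace("Biobase", quietly = TRUE)) {
  eset <- Biobase::ExpressionSet(
    assayData = aligned$expr,
    phenoData = Biobase::AnnotatedDataFrame(aligned$pheno)
  )
  print(eset)
}
```

Aligning first matters because an ExpressionSet requires the sample names of the expression matrix and the phenotype table to agree.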
Notes: The TCGAbiolinks R package accesses the TCGA data through the GDC API, and download requests are sometimes rejected. The workaround is to retry until the download succeeds.
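The retry advice can be automated with a small wrapper. The following is a minimal sketch in base R; the helper with_retry, its parameters, and the simulated flaky_download function are hypothetical and are not part of the pipeline or of TCGAbiolinks.

```r
# Hypothetical helper: call fn() until it succeeds or max_tries is reached.
with_retry <- function(fn, max_tries = 5, wait_seconds = 0) {
  for (attempt in seq_len(max_tries)) {
    # Capture an error as a value instead of aborting
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message(sprintf("Attempt %d/%d failed: %s",
                    attempt, max_tries, conditionMessage(result)))
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  stop("All ", max_tries, " attempts failed")
}

# Simulated download that is rejected twice before succeeding.
calls <- 0
flaky_download <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("request rejected")
  "downloaded"
}

status <- with_retry(flaky_download)
print(status)
```

In a real run, the wrapped function would be the download call itself, for example with_retry(function() TCGAbiolinks::GDCdownload(query), wait_seconds = 30), where query is assumed to be an existing GDCquery result.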
The clinical information covers every patient’s phenotype, but it contains no entries for the normal samples that appear in the omics data profiles. The extra information about normal samples is therefore obtained from the SummarizedExperiment object. Take care when distinguishing tumor from normal samples: the pipeline labels a sample “tumor” when characters 14 to 15 of its barcode in the SummarizedExperiment object are “01”, and “normal” otherwise.
The following file tree shows the final results of downloading, preprocessing, and integrating the TCGA-PAAD project data from the TCGA website with R.
The final files for further analysis carry an RDS or post suffix and are located in the Clean and Clinical directories.
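The barcode rule can be written as a short base R function. This is a sketch of the rule as stated here, with made-up barcodes in TCGA format and a hypothetical helper name, classify_sample. Note that the official TCGA sample type codes reserve 01 to 09 for tumor types and 10 to 19 for normal types, so this simple rule would also label codes such as “06” as normal.

```r
# Hypothetical helper implementing the rule described in the text:
# characters 14-15 equal to "01" mean tumor, anything else means normal.
classify_sample <- function(barcodes) {
  code <- substr(barcodes, 14, 15)
  ifelse(code == "01", "tumor", "normal")
}

# Illustrative barcodes (not real samples)
barcodes <- c(
  "TCGA-XX-0001-01A-11R-0000-07",
  "TCGA-XX-0002-11A-11R-0000-07"
)

groups <- classify_sample(barcodes)
print(data.frame(barcode = barcodes, group = groups))
```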