User Manual

INsPeCT: INtegrative Platform for Cancer Transcriptomics

Piyush B. Madhamshettiwar1, Stefan R. Maetschke1, Melissa J. Davis1, Antonio Reverter2 and Mark
A. Ragan1,*

1 The University of Queensland, Institute for Molecular Bioscience, 306 Carmody Road, St Lucia,
Brisbane, Queensland 4072, Australia
2 CSIRO Livestock Industries, 306 Carmody Road, St Lucia, Brisbane, Queensland 4072, Australia

* To whom correspondence should be addressed. Tel: +61-7-3346-2616; Fax: +61-7-3346-2101;
Email: [email protected]

Sample datasets

We provide sample datasets for each of INsPeCT’s frameworks. Before working through this tutorial,
please download all the files required for the analysis you intend to run. We strongly suggest
reading this user manual carefully so that you select the appropriate file formats and analysis
options. Please enter all the required information in every required section; otherwise your
analysis may fail. Please note that the sample dataset provided for ChIP-seq data analysis is large
(~1GB) and may take a long time to download and upload. Along with the sample data for each
analysis, we also provide the parameters required for that analysis. For example, for the online
data import section of microarray data analysis we give the chip type of the dataset as
“hgu133plus2” and the control condition as “Normal”, and for WGCNA we give both the survival time
variable “survivaltime” and the event variable “death”.

Workflows

The Workflows section provides two main workflows: the first shows a schematic representation of
INsPeCT and the second shows the novel analytical framework RMaNI.

Process info and results folder

In all of INsPeCT’s analysis frameworks, we provide timely updates on the stage and steps of your
analysis after you submit the job. Once the analysis is done, we provide a link at the bottom of the
page to download your results. Clicking on the link shows the individual result files. You can
either download individual files or download the complete results folder, with or without the input
data. In the results folder we also provide a file listing all the parameters you selected for that
framework.

Analysis Frameworks

Here we provide a brief guide to performing each analysis. For a detailed description of the
individual frameworks, please refer to the manuscript.

Microarray Data Analysis

Analysis name

This field allows you to enter a short name for your analysis. Entering a name is important because
the generated result files include it in their file names for your convenience. For instance, if you
enter “ovary” in this field, the generated files will have names like “results_ovary.csv”. This
field is required.

Data import

Import online

Selecting the “onlineimport” option allows you to import publicly available microarray gene
expression data from the NCBI Gene Expression Omnibus (GEO) repository.

To do this, enter the GEO accession number of the dataset. We support import of both series (GSE)
and datasets (GDS), so the accession number must start with GSE or GDS, e.g. “GSE14407” or
“GDS3592”. At this stage only data from Affymetrix chip types are supported.
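
The accession rule above can be checked before submitting a job. The following is a minimal sketch,
not part of INsPeCT: the function name is hypothetical, and the pattern (the prefix followed only by
digits) is an assumption based on the two example accessions.

```python
import re

# Assumed pattern: "GSE" or "GDS" followed only by digits, as in the examples above.
GEO_ACCESSION = re.compile(r"(GSE|GDS)\d+")


def is_supported_accession(accession: str) -> bool:
    """Return True if the text looks like a GEO series (GSE) or dataset (GDS) accession."""
    return GEO_ACCESSION.fullmatch(accession.strip()) is not None


print(is_supported_accession("GSE14407"))   # a series accession
print(is_supported_accession("GSM359972"))  # a sample accession, not supported here
```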

After entering the accession number, upload the sample annotation file in the “Upload sample
annotation file” field. The file must be in “.csv” format with column headers. The first column must
be named “samples” and contain the sample names as they appear in the data file; the second column
must be named “condition” (e.g. normal/cancer). This is very important for all the downstream
analyses because INsPeCT needs to know which samples belong to which condition, e.g. “normal” or
“cancer”. An example sample annotation file is provided in the sample datasets section. This field
is required.

Snapshot of the sample annotation file:

samples,condition
GSM359972,Normal
GSM359973,Normal
GSM359974,Normal
GSM359984,Cancer
GSM360039,Cancer
GSM360040,Cancer
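
A file in this layout can be checked locally before upload. The sketch below is illustrative only:
the function name is hypothetical, and the checks (header names, no duplicate samples) are our
reading of the rules above rather than INsPeCT's own validation.

```python
import csv
import io


def read_sample_annotation(handle):
    """Parse a sample annotation CSV and return a {sample: condition} dictionary."""
    reader = csv.reader(handle)
    header = next(reader)
    if header[:2] != ["samples", "condition"]:
        raise ValueError("first two columns must be named 'samples' and 'condition'")
    annotation = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) < 2:
            raise ValueError(f"missing condition for sample: {row[0]}")
        sample, condition = row[0].strip(), row[1].strip()
        if sample in annotation:
            raise ValueError(f"duplicate sample name: {sample}")
        annotation[sample] = condition
    return annotation


example = io.StringIO(
    "samples,condition\n"
    "GSM359972,Normal\n"
    "GSM359984,Cancer\n"
)
print(read_sample_annotation(example))
```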

Import raw data (celfiles)

Selecting the “celfiles” option allows you to import raw microarray gene expression data from your
local computer. These are the raw data files with the “.cel” extension. To import them, upload a zip
file containing all the “.cel” files. An example raw dataset is provided in the sample datasets
section.

Because you have uploaded raw data, you need to select a data normalisation method. We provide five
widely accepted methods: “mas5”, “rma”, “gcrma”, “plier” and “dChip”. The default is “mas5”.
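
Before uploading, it can help to confirm that the zip archive holds only “.cel” files. The sketch
below uses Python's standard zipfile module; the function name is hypothetical, and treating the
extension as case-insensitive is an assumption.

```python
import io
import zipfile


def list_non_cel_members(zip_source):
    """Return the names of archive members that do not end in '.cel' (case-insensitive)."""
    with zipfile.ZipFile(zip_source) as archive:
        return [
            info.filename
            for info in archive.infolist()
            if not info.is_dir() and not info.filename.lower().endswith(".cel")
        ]


# Build a small in-memory archive for demonstration; a real path works the same way.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("GSM359972.CEL", b"")
    archive.writestr("notes.txt", b"")
buffer.seek(0)
print(list_non_cel_members(buffer))
```

An empty list means every file in the archive carries the expected extension.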

After uploading the zip file and selecting the data normalisation method, upload the sample
annotation file in the “Upload sample annotation file” field. The file must be in “.csv” format with
column headers. The first column must be named “samples” and contain sample names exactly matching
those in the data file; the second column must be named “condition” (e.g. normal/cancer). This is
very important for all the downstream analyses because INsPeCT needs to know which samples belong to
which condition, e.g. “normal” or “cancer”. An example sample annotation file is provided in the
sample datasets section. This field is required.

Snapshot of the sample annotation file:

samples,condition
GSM359972,Normal
GSM359973,Normal
GSM359974,Normal
GSM359984,Cancer
GSM360039,Cancer
GSM360040,Cancer

Import processed data

Selecting the “procdata” option allows you to import normalised microarray gene expression data from
your local computer. To do this, upload a “.csv” file containing the processed data, with column
headers. The first column must be named “probe” and contain the probe ids as identifiers; the
remaining columns must be samples. An example processed dataset is provided in the sample datasets
section.

Snapshot of the processed data:

probe,GSM359972,GSM359973,GSM359974,GSM359984,GSM360039,GSM360040
117_at,4.9038,6.9663,7.2027,7.2027,6.6135,5.3785
1255_g_at,3.6934,4.8820,5.8916,5.8916,4.2432,3.4529
1438_at,3.2685,3.9953,3.4732,3.4732,4.7258,3.1192
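
Because the sample names in the data file and in the sample annotation file must match exactly, a
quick cross-check can save a failed run. The following sketch is an illustration under that
reading; the function name is hypothetical.

```python
import csv
import io


def compare_samples(data_handle, annotation_handle):
    """Return (samples missing from the annotation, samples missing from the data)."""
    data_header = next(csv.reader(data_handle))
    if data_header[0] != "probe":
        raise ValueError("first column of the data file must be named 'probe'")
    data_samples = set(data_header[1:])
    annotated = {row["samples"] for row in csv.DictReader(annotation_handle)}
    return sorted(data_samples - annotated), sorted(annotated - data_samples)


data = io.StringIO("probe,GSM359972,GSM359984\n117_at,4.9038,7.2027\n")
annotation = io.StringIO("samples,condition\nGSM359972,Normal\nGSM359984,Cancer\n")
print(compare_samples(data, annotation))
```

Two empty lists indicate that both files name the same set of samples.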

After uploading the processed data file, upload the sample annotation file in the “Upload sample
annotation file” field. The file must be in “.csv” format with column headers. The first column must
be named “samples” and contain sample names exactly matching those in the data file; the second
column must be named “condition” (e.g. normal/cancer). This is very important for all the downstream
analyses because INsPeCT needs to know which samples belong to which condition, e.g. “normal” or
“cancer”. An example sample annotation file is provided in the sample datasets section. This field
is required.

Snapshot of the sample annotation file:

samples,condition
GSM359972,Normal
GSM359973,Normal
GSM359974,Normal
GSM359984,Cancer
GSM360039,Cancer
GSM360040,Cancer

Upload sample annotation

As mentioned in the previous sections, “Upload sample annotation file” allows you to upload sample
annotations. The file must be in “.csv” format with column headers. The first column must be named
“samples” and contain sample names exactly matching those in the data file; the second column must
be named “condition” (e.g. normal/cancer). This is very important for all the downstream analyses
because INsPeCT needs to know which samples belong to which condition, e.g. “normal” or “cancer”. An
example sample annotation file is provided in the sample datasets section. This field is required.

Snapshot of the sample annotation file:

samples,condition
GSM359972,Normal
GSM359973,Normal
GSM359974,Normal
GSM359984,Cancer
GSM360039,Cancer
GSM360040,Cancer

Select chip-type

“Select chip type” allows you to select the chip type of your dataset. This is very important
because it affects the annotation of the probes to other gene identifiers and the downstream
enrichment analyses. Please confirm the chip type of your data and select the appropriate option
from the list of 13 Affymetrix chip types. Providing the wrong chip type will cause your analysis to
fail or produce incorrect results.

Enter control condition

In the control condition text box, enter the name of the control condition for your dataset. For the
sample datasets provided, the control condition is “Normal”. This name must match the name used in
the sample annotation file exactly.
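
A common slip is entering “normal” when the annotation file says “Normal”. The sketch below flags
that kind of mismatch; it assumes the comparison is case-sensitive, which the manual implies but
does not state, and the function name is hypothetical.

```python
def check_control_condition(control, conditions):
    """Return None if `control` matches a condition exactly, otherwise a hint string."""
    if control in conditions:
        return None
    # Look for a condition that differs only in letter case or surrounding spaces.
    near = [c for c in conditions if c.lower() == control.strip().lower()]
    if near:
        return f"'{control}' not found; did you mean '{near[0]}'?"
    return f"'{control}' not found among: {', '.join(sorted(set(conditions)))}"


conditions = ["Normal", "Normal", "Cancer"]
print(check_control_condition("Normal", conditions))
print(check_control_condition("normal", conditions))
```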

Select top genes

Often we need a set of genes selected by their behaviour across all the samples rather than by
differential expression between conditions. For example, one could ask which genes in the dataset
show the highest average expression or the maximum coefficient of variation. “Select top genes”
allows you to select and download the top 1000, 2000 or 4000 genes showing the “highest average
expression” or “maximum coefficient of variation” across all samples in the dataset.
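
To make the two ranking criteria concrete, here is a minimal sketch that ranks probes by mean
expression or by coefficient of variation (sample standard deviation divided by the mean). It is
an illustration only; how INsPeCT computes these values internally is not described here, and the
function and argument names are hypothetical.

```python
import statistics


def top_genes(expression, n, criterion="mean"):
    """Rank probes by mean expression or coefficient of variation and keep the top n.

    expression maps each probe id to its list of expression values across samples.
    """
    def score(values):
        mean = statistics.mean(values)
        if criterion == "mean":
            return mean
        if criterion == "cv":
            # Coefficient of variation = sample standard deviation / mean.
            return statistics.stdev(values) / mean if mean != 0 else 0.0
        raise ValueError("criterion must be 'mean' or 'cv'")

    ranked = sorted(expression, key=lambda probe: score(expression[probe]), reverse=True)
    return ranked[:n]


# Values taken from the processed data snapshot in this manual.
expression = {
    "117_at": [4.9038, 6.9663, 7.2027, 7.2027, 6.6135, 5.3785],
    "1255_g_at": [3.6934, 4.8820, 5.8916, 5.8916, 4.2432, 3.4529],
    "1438_at": [3.2685, 3.9953, 3.4732, 3.4732, 4.7258, 3.1192],
}
print(top_genes(expression, 2, "mean"))
print(top_genes(expression, 1, "cv"))
```

The two criteria can disagree: in this snapshot the probe with the highest mean is not the one
with the highest coefficient of variation.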

Differential gene expression analysis

“Perform differential gene expression data analysis” allows you to identify genes that are
differentially expressed between the conditions. For this analysis you need to enter the name of the
control condition from your sample annotation file (e.g. “normal”, “untreated”, “wildtype”) in the
“enter control condition from you sample annotation file” text box. Once you have entered the
control condition, you can select the method for detecting differential gene expression, “LIMMA” or
“SAM”. “LIMMA” is the default. We declare a gene significantly differentially expressed if its
Benjamini-Hochberg (BH) adjusted p-value is at most 0.05. This section is required if you would like
to perform any downstream analysis.
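
For readers unfamiliar with the BH adjustment, the sketch below implements the standard
Benjamini-Hochberg procedure in plain Python and applies the 0.05 cut-off. It is an illustration
of the criterion, not INsPeCT's code, and the p-values are made up for the example.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


p_values = [0.001, 0.008, 0.039, 0.041, 0.60]  # illustrative raw p-values
adjusted = benjamini_hochberg(p_values)
significant = [p <= 0.05 for p in adjusted]
print(adjusted)
print(significant)
```

In this example the third and fourth raw p-values are below 0.05, but their adjusted values are
not, so only the first two genes would be reported as significant.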

Analysis of differentially expressed genes

The analysis options in this section allow you to analyse the differentially expressed genes. These
options are described in detail in the manuscript.

Gene-list analysis framework

The analysis options in this section allow you to carry your differentially expressed genes forward
into functional analysis. These options are described in detail in the manuscript.

ChIP-seq Data Analysis

Analysis name

This field allows you to enter a short name for your analysis. Entering a name is important because
the generated result files include it in their file names for your convenience. For instance, if you
enter “ovary” in this field, the generated files will have names like “results_ovary.csv”. This
field is required.

Data import

This section allows you to import raw ChIP-seq data from your local computer. We support the four
most common file formats: “sra”, “fastq”, “sam” and “bam”. Select the file type that matches your
data; otherwise your analysis will fail. Once you have entered the analysis name and selected the
file type, use “Upload zip file containing all the chip-seq data files of the filetype selected
above” to upload a zip file containing your data. We support uploads of up to 2GB; please note that
uploading may take a long time. An example zip file is provided in the sample datasets section. This
is a very large file (~1.3GB), so please be patient while downloading it. The standard file format
for raw ChIP-seq reads used as input to Bowtie is FASTQ, which we then convert to SAM, BAM and
sorted BAM formats for further analyses. For data available in SRA format, we provide functionality
to convert from SRA to FASTQ.
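
The format chain described above can be pictured as an ordered list in which the uploaded file type
decides where processing starts. The sketch below is only a mental model: it assumes that uploads in
a later format skip the earlier conversions, which the manual does not state explicitly, and the
names are hypothetical.

```python
# Order of formats as described in the paragraph above.
FORMAT_CHAIN = ["sra", "fastq", "sam", "bam", "sorted bam"]


def remaining_conversions(file_type):
    """List the conversion steps still needed, starting from the uploaded file type."""
    if file_type not in FORMAT_CHAIN[:-1]:
        raise ValueError(f"unsupported file type: {file_type}")
    start = FORMAT_CHAIN.index(file_type)
    steps = zip(FORMAT_CHAIN[start:], FORMAT_CHAIN[start + 1:])
    return [f"{source} -> {target}" for source, target in steps]


print(remaining_conversions("fastq"))
```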

Upload sample annotation

As mentioned in the previous sections, “Upload sample annotation file” allows you to upload sample
annotations. The file must be in “.csv” format with column headers. The first column must be named
“samples” and contain sample names exactly matching those in the data file; the second column must
be named “condition”. Currently, we support analysis of data with two conditions with replicates.
The conditions must be “bgr1” for condition 1 and “sig1” for condition 2. This is very important for
all the downstream analyses because INsPeCT needs to know which samples belong to the “bgr1”
condition and which to the “sig1” condition. An example sample annotation file is provided in the
sample datasets section. This field is required.

Snapshot of the ChIP-seq sample annotation file:

samples,condition
SRR167632,bgr1
SRR167633,bgr1
SRR167638,sig1
SRR167639,sig1
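
The ChIP-seq annotation rules are stricter than those for microarrays, so a local check may be
useful. The sketch below is illustrative: the function name is hypothetical, and reading “with
replicates” as at least two samples per condition is an assumption.

```python
import csv
import io
from collections import Counter

ALLOWED_CONDITIONS = {"bgr1", "sig1"}


def check_chipseq_annotation(handle):
    """Return a list of problems found in a ChIP-seq sample annotation CSV."""
    rows = list(csv.DictReader(handle))
    counts = Counter(row["condition"] for row in rows)
    problems = []
    for condition in counts:
        if condition not in ALLOWED_CONDITIONS:
            problems.append(f"unexpected condition: {condition}")
    for condition in sorted(ALLOWED_CONDITIONS):
        # Assumption: "with replicates" means at least two samples per condition.
        if counts[condition] < 2:
            problems.append(f"{condition} has fewer than two samples")
    return problems


snapshot = io.StringIO(
    "samples,condition\n"
    "SRR167632,bgr1\n"
    "SRR167633,bgr1\n"
    "SRR167638,sig1\n"
    "SRR167639,sig1\n"
)
print(check_chipseq_annotation(snapshot))
```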

Quality control

After mapping the reads, we run quality control on the data using the FastQC tool. Selecting
“Perform quality control of the uploaded data. Report will be generated and saved in zip file”
performs quality control of the uploaded data and provides a comprehensive report for each sample.

GO enrichment analysis of the enriched peaks

This option allows you to perform GO enrichment analysis of the enriched peaks.

Extract enriched peak sequences

This option allows you to extract the enriched peak sequences in FASTA format.

Submit enriched peaks to MEME-ChIP

Motif discovery is of obvious relevance in ChIP-seq analysis. INsPeCT integrates the widely used
MEME Suite of tools for motif discovery, comparison and analysis. We use MEME-ChIP, which was
specifically designed for the analysis of ChIP-seq data. MEME-ChIP performs several motif analyses
on the input data and includes the MEME, TOMTOM, SPAMO, DREME, CENTRIMO and AME tools. An
interactive HTML file is provided that summarises the results and links to the results of each
program. It also displays interactive plots for visual inspection. For this analysis you can select
the upstream and downstream promoter regions. For the upstream promoter region we provide five
choices: 100, 200, 500, 1000 (default) and 2000 bp. For the downstream region we provide three
choices: 50, 100 (default) and 200 bp.

Differential binding analysis

This option allows you to perform differential binding analysis of the enriched peaks. We provide
three common and widely accepted methods, “DEseq”, “edgeR” and “foldchange”; selecting “allabove”
runs all three.

Gene-list analysis framework

The analysis options in this section allow you to carry the significant genes from the differential
binding analysis forward into functional analysis. These options are described in detail in the
manuscript.

RNA-seq Data Analysis

Analysis name

This field allows you to enter a short name for your analysis. Entering a name is important because
the generated result files include it in their file names for your convenience. For instance, if you
enter “ovary” in this field, the generated files will have names like “results_ovary.csv”. This
field is required.

Data import

For RNA-seq data analysis we only support processed read counts. The file format for RNA-seq data is
similar to that for microarray gene expression data. “Please upload a file containing processed
data” allows you to import processed RNA-seq data from your local computer. The file must be in
“.csv” format with column headers. The first column must be named “RefSeqID” and contain RefSeq
identifiers only; the remaining columns must be samples. An example processed dataset is provided in
the sample datasets section.

Snapshot of the processed data:

RefSeqID,8N,8T,33N,33T,51N,51T
NM_182502,2592,3,7805,321,3372,9
NM_003280,1684,0,1787,7,4894,559
NM_152381,9915,15,10396,48,23309,7181
NM_022438,2496,2,3585,239,1596,7
NM_001100112,4389,7,7944,16,9262,1818
NM_017534,4402,7,7943,16,9244,1815
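
A count table in this layout can be sanity-checked before upload. The sketch below is illustrative:
the function name is hypothetical, the identifier pattern is an assumption modelled on accessions
such as NM_182502 (with an optional version suffix), and requiring non-negative integers follows
from the data being read counts.

```python
import csv
import io
import re

# Assumed pattern for RefSeq transcript accessions such as NM_182502.
REFSEQ_ID = re.compile(r"[NX][MR]_\d+(\.\d+)?")


def check_count_table(handle):
    """Return a list of problems found in a processed RNA-seq count table."""
    problems = []
    reader = csv.reader(handle)
    header = next(reader)
    if header[0] != "RefSeqID":
        problems.append("first column must be named 'RefSeqID'")
    for line_number, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore blank lines
        if REFSEQ_ID.fullmatch(row[0]) is None:
            problems.append(f"line {line_number}: not a RefSeq id: {row[0]}")
        if not all(value.isdigit() for value in row[1:]):
            problems.append(f"line {line_number}: counts must be non-negative integers")
    return problems


table = io.StringIO(
    "RefSeqID,8N,8T,33N,33T,51N,51T\n"
    "NM_182502,2592,3,7805,321,3372,9\n"
    "NM_003280,1684,0,1787,7,4894,559\n"
)
print(check_count_table(table))
```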

Upload sample annotation

After uploading the processed data file, upload the sample annotation file in the “Upload sample
annotation file” field. The file must be in “.csv” format with column headers. The first column must
be named “samples” and contain sample names exactly matching those in the data file; the second
column must be named “condition”. This is very important for all the downstream analyses because
INsPeCT needs to know which samples belong to which condition, e.g. “Normal” or “Tumor”. An example
sample annotation file is provided in the sample datasets section. This field is required.

Snapshot of the sample annotation file:

samples,condition
8N,Normal
8T,Tumor
33N,Normal
33T,Tumor
51N,Normal
51T,Tumor

Differential gene expression analysis

“Perform differential gene expression data analysis” allows you to identify genes that are
differentially expressed between the conditions, based on the read counts. For this analysis you
need to enter the name of the control condition from your sample annotation file (e.g. “Normal”,
“untreated”, “wildtype”) in the “enter control condition from you sample annotation file” text box.
Once you have entered the control condition, you can select the method for detecting differential
gene expression: “edgeR”, “DEseq”, or both via “allabove”. “edgeR” is the default. We declare a gene
significantly differentially expressed if its Benjamini-Hochberg (BH) adjusted p-value is at most
0.05. This section is required if you would like to perform any downstream functional analysis.

Gene-list analysis framework

The analysis options in this section allow you to carry your differentially expressed genes forward
into functional analysis. These options are described in detail in the manuscript.

Genelist Analysis

Analysis name

This field allows you to enter a short name for your analysis. Entering a name is important because
the generated result files include it in their file names for your convenience. For instance, if you
enter “ovary” in this field, the generated files will have names like “results_ovary.csv”. This
field is required.

File upload

If you have already analysed your data and have a list of genes, this framework allows you to
perform functional analysis of your genes of interest. The file upload box allows you to upload a
gene list file. The file must be in csv format without a header. Please note that the gene ids must
be Entrez gene ids.
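
A header-less gene list can be checked with a few lines of Python. This sketch assumes one Entrez
gene id per row, which the manual does not spell out, and relies on Entrez gene ids being plain
integers; the function name is hypothetical and the ids shown are illustrative.

```python
import csv
import io


def read_entrez_ids(handle):
    """Read a header-less CSV with one Entrez gene id per row (assumed layout)."""
    ids = []
    for line_number, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # ignore blank lines
        value = row[0].strip()
        if not value.isdigit():
            # A header row or a gene symbol ends up here.
            raise ValueError(f"line {line_number}: '{value}' is not a numeric Entrez id")
        ids.append(value)
    return ids


gene_list = io.StringIO("7157\n672\n675\n")  # illustrative Entrez gene ids
print(read_entrez_ids(gene_list))
```

A file that still contains a header line, or gene symbols instead of ids, raises an error that
names the offending line.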

Analysis options

The analysis options in this section allow you to carry the uploaded Entrez gene ids forward into
functional analysis. These options are described in detail in the manuscript.

Automated Workflows

RMaNI – Regulatory Module Network Inference

RMaNI is a novel analytical workflow developed for the inference and analysis of cancer
subtype-specific transcriptional module networks. It uses the Learning Module Networks (LeMoNe)
algorithm for model-based co-clustering of the expression data, and Regulatory Impact Factors (RIF)
to identify potential regulators of the inferred modules. We provide a very simple web interface to
this complex analysis workflow. Please see the workflows section for the detailed RMaNI workflow.

Analysis name

This field allows you to enter a short name for your analysis. Entering a name is important because
the generated result files include it in their file names for your convenience. For instance, if you
enter “ovary” in this field, the generated files will have names like “results_ovary.csv”. This
field is required.

Data import

The file upload box in the data import option allows you to import normalised microarray gene
expression data from your local computer. To do this, upload a “.csv” file containing the processed
data, with column headers. The first column must be named “probe” and contain the probe ids as
identifiers; the remaining columns must be samples. An example processed dataset is provided in the
sample datasets section.

Snapshot of the processed data:

probe,GSM359972,GSM359973,GSM359984,GSM360039
117_at,4.9038,6.9663,4.3324,6.8817
1255_g_at,3.6934,4.882,7.1288,6.9821
1438_at,3.2685,3.9953,6.6480,3.4728

Upload sample annotation

“Upload sample annotation file” allows you to upload sample annotations. The file must be in “.csv”
format with column headers. The first column must be named “samples” and contain sample names
exactly matching those in the data file; the second column must be named “condition” (e.g.
normal/cancer). This is very important for all the downstream analyses because INsPeCT needs to know
which samples belong to the control condition, e.g. “normal”, and which belong to the different
subtypes/conditions, e.g. “HCC”, “cirrhosisHCC” or “cirrhosis”. An example sample annotation file is
provided in the sample datasets section. This field is required.

Snapshot of the sample annotation file for RMaNI:

samples,condition
GSM358209,Normal
GSM358210,Normal
GSM358114,HCC
GSM358115,HCC
GSM358113,cirrhosisHCC
GSM358116,cirrhosisHCC
GSM358119,cirrhosis
GSM358123,cirrhosis

Enter control condition from sample annotation file

For this analysis you must enter the name of the control condition from your sample annotation file
(e.g. “Normal”, “untreated”, “wildtype”) in the control condition text box. This is required for
comparing the different subtypes/conditions with the control condition. If this information is not
provided, the analysis will fail. For the sample dataset provided in this tutorial, the control
condition is “Normal”.

Select chip-type

“Select chip type” allows you to select the chip type of your dataset. This is very important
because it affects the annotation of the probes to other gene identifiers and the downstream
analyses. Please confirm the chip type of your data and select the appropriate option from the list
of 13 Affymetrix chip types. Providing the wrong chip type will cause your analysis to fail or
produce incorrect results.

Select number of genes

Here you can select the number of genes used as input to RMaNI. RMaNI will automatically select this
many top genes from the input dataset (see the RMaNI workflow for the detailed procedure). Your
processed data must contain all the probes in the dataset, e.g. ~54000 probes for a dataset with the
hgu133plus2 chip type. The default is 1000 genes.

WGCNA – Weighted Gene Co-expression Network Analysis

This workflow is based on a general framework for WGCNA. It finds modules of highly correlated genes
across microarray samples that are associated with external sample traits, e.g. overall survival,
relapse-free survival or metastasis-free survival. You do not need a control condition for this
analysis. This analysis can be applied to a microarray dataset if survival information is available.

Analysis Name

This field allows you to enter a short name for your analysis. Entering a name is important because
the generated result files include it in their file names for your convenience. For instance, if you
enter “ovary” in this field, the generated files will have names like “results_ovary.csv”. This
field is required.

Data import

The file upload box in the data import option allows you to import normalised microarray gene
expression data from your local computer. To do this, upload a “.csv” file containing the processed
data, with column headers. The first column must be named “probe” and contain the probe ids as
identifiers; the remaining columns must be samples. An example processed dataset is provided in the
sample datasets section.

Snapshot of the processed data:

probe,GSM359972,GSM359973,GSM359984,GSM360039
117_at,4.9038,6.9663,4.3324,6.8817
1255_g_at,3.6934,4.882,7.1288,6.9821
1438_at,3.2685,3.9953,6.6480,3.4728

Upload sample annotation

“Upload sample annotation file” allows you to upload sample annotations. The file must be in “.csv”
format with column headers. The first column must be named “samples” and contain sample names
exactly matching those in the data file; the remaining two columns must contain the survival time
and the event, respectively. This is very important for all the downstream analyses because INsPeCT
needs the event information; e.g. the event “death” records which patients have died of the disease
(1) and which are alive (0). An example sample annotation file is provided in the sample datasets
section. Survival time must be given in months as a plain number. Event must be boolean [0/1]: 0
represents no event (e.g. alive/no relapse/no metastasis) and 1 represents an event (e.g.
dead/relapse/metastasis). This field is required.

Snapshot of the sample annotation file for WGCNA:

samples,survivaltime,death
GSM368662,35,0
GSM368664,14,0
GSM368665,46,0
GSM368675,30,1
GSM368679,31,1
GSM368683,12,1
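
The survival annotation rules, together with the requirement that the column names you enter match
the file, can be checked locally. The sketch below is an illustration only; the function and
argument names are hypothetical and it is not INsPeCT's own validation.

```python
import csv
import io


def check_survival_annotation(handle, time_column, event_column):
    """Return a list of problems found in a WGCNA sample annotation CSV."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    problems = [
        f"column not found: {name}"
        for name in (time_column, event_column)
        if name not in fieldnames
    ]
    if problems:
        return problems
    for row in reader:
        try:
            float(row[time_column])  # survival time must be a plain number
        except ValueError:
            problems.append(f"{row['samples']}: survival time is not a number")
        if row[event_column] not in ("0", "1"):
            problems.append(f"{row['samples']}: event must be 0 or 1")
    return problems


snapshot = io.StringIO(
    "samples,survivaltime,death\n"
    "GSM368662,35,0\n"
    "GSM368675,30,1\n"
)
print(check_survival_annotation(snapshot, "survivaltime", "death"))
```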

Select background annotation type for your data

“Select chip type” allows you to select the chip type of your dataset. This is very important
because it affects the annotation of the probes to other gene identifiers and the downstream
analyses. Please confirm the chip type of your data and select the appropriate option from the list
of 13 Affymetrix chip types. Providing the wrong chip type will cause your analysis to fail or
produce incorrect results.

Name of the column in sample annotation file which represents survival time information

Please confirm the name of the column in the sample annotation file that holds the survival time
information, e.g. survivaltime (refer to the sample annotation file), and enter it in this text box.
If this name does not match the column name in the sample annotation file, the analysis will fail.

Name of the column in sample annotation file which represents the event information

Please confirm the name of the column in the sample annotation file that holds the event
information, e.g. death (refer to the sample annotation file), and enter it in this text box. If
this name does not match the column name in the sample annotation file, the analysis will fail.
