Computation Protocols

Data Management

To manage the genomics and proteomics data for the innate immunity contract a content repository is used. All information generated from these cores is fed into the content repository so that it can be managed effectively. Content management systems are used in a wide variety of fields and provide a convenient and standardized system for storing data in a reliable and robust manner.

The content repository we are using is built to the JCR (Java Content Repository) standard. A JCR implementation offers a range of services, which are categorized by levels: level one provides read functionality (e.g. basic querying and structured content); level two provides simple write functionality (security and history); and level three which provides advanced features (transactional control and structured searching). A JCR implementation allows for the definition of an organizational structure (as interlinked nodes of different types) to which properties (annotations and data items) can be added. Standardized services are provided for versioning and transactional behavior, as well as more advanced services for both unstructured (e.g. lucene based information retrieval) and structured querying (e.g. SQL or xpath based searching).

Content management systems are designed to handle generic data items. They are suitable for dealing with raw and processed experiment data files, which can also be versioned and annotated. With these systems a series of higher level organizational data structures can be overlaid to make navigation (and searching) easier.

Through the JCR we are able to dynamically organize the data into hierarchies, and directly associate metadata with this information. The metadata can be searched to retrieve scientific results, and provide information about the data provenance and experiment specific conditions.

Any information that is added to the content repository can instantly be made available through the web portal. The JCR implementation used for the portal is Jackrabbit, and is open-source.

Data Processing and Pipelines

The purpose of the data processing and data pipelines proposed under this contract is to aggregate all the various all relevant data generated by the individual Core facilities. The goal of the data processing is to provide the means to reliably and accurately store the data within the data repository such that it can be shared within the consortium and made easily available through the web portal.

A major focus of year one has been to implement data processing pipelines related to the high-throughput array data generated by the Genomics Core at ISB. Off-the-shelf sofware (Gene Pattern, MIT/Broad Institute) has been customized and interfaced to the content repository (JCR) adopted in the contract. The principle example of this is a flexible data processing pipeline for the analysis of Affymetrix 3' Expression arrays, where users can select their choice of arrays and the type of normalizations performed. Similar pipelines have been instigated for ChIP-on-chip (ISB Customized Affymetrix array and Affymetrix Mouse Promoter Gene Chip 1.0R array) and Affymetrix Mouse Exon arrays.

Affymetrix 3' Expression Array Data Processing Pipeline

Hybridization intensities from Affymetrix Mouse GeneChip 430 2.0 3' expression microarrays are processed and combined through a modular software pipeline to produce gene level-summarized, replicate-combined, annotated expression intensities. The use of an automated software pipeline helps ensure that data are processed in a manner that is consistent and reproducible, and that the specific parameters and combination of input data files for each processing task are recorded for future reference. An overview of the steps in the pipeline is shown in Fig. 1 below. The data processing pipeline produces a spreadsheet of replicate-combined, gene level-summarized intensity data along with annotation information for each gene. A sample screen shot of a portion of this spreadsheet illustrating the data format and gene annotation, is shown in Fig. 2 below.

The pipeline functions in three steps. The first step (leftmost yellow rectangle in Fig. 1) includes background adjustment, probe level summarization, and probeset summarization. These are carried out using the Robust Multichip Average (RMA) procedure, as described in Irizarry et al., Biostatistics 4(2), 249-64 (2003) and in Irizarry et al., Bioinform 22(7), 789-94 (2006). In the second step (middle yellow rectangle in Fig. 1), probeset-level intensities for hybridizations representing biological replicates for a particular combination of strain, cell type, stimulus, and time point are combined using the arithmetic mean (the standard deviation is also recorded for assessment of variability). In the third step (rightmost yellow rectangle in Fig. 1), a representative probeset is selected for each unique gene listed in the Affymetrix GeneChip annotation dataset. All probesets ending in "_x_at" or "_s_at", and all probesets representing Affymetrix on-chip controls, are eliminated from consideration (the first two suffixes indicating probesets that are annotated as having probes that cross-hybridize to transcripts from different genes). Furthermore, all probesets that do not possess gene annotations, are eliminated. Following these probeset filtering steps, when there are multiple remaining probesets that map to the same NCBI Entrez gene identifier, the probeset with the highest absolute intensity in any one replicate-combined sample, is selected as representative (this choice is based on the observation that, on average, a higher intensity probeset tends to have a lower coefficient of variation, for linear scale intensity). Each representative probeset is then annotated using gene annotation information from Affymetrix and from the Jackson Laboratories' Mouse Genome Informatics database. The resulting list of representative probesets is then ordered by gene symbol, and a datasheet containing the replicate-combined intensity for each probeset across each condition (combination of strain, cell type, stimulus, and time point) is generated. This datasheet is archived in a repository along with the list of hybridizations that were processed to obtain the datasheet. The data processing procedure is described in more detail in the Materials and Methods section of the article by Ramsey et al., PLoS Comp Biol, 4(3), e1000021 (2008).

Fig. 1 - Affymetrix 3' expression array data processing pipeline overview

Input data (or data files) are shown in olive. Processing steps are shown in yellow rectangles. Intermediate data files are shown in blue. The output data file is shown in pink. The light blue background coloring shows the portion of the pipeline that is executed within the GenePattern system. Arrows indicate data flow through the pipeline. Arrows into a yellow rectangle indicate input data for that processing step; arrows out of a yellow rectangle indicate output data for that step. Additional software modules that obtain the sample metadata and gene annotation information, are not shown (to simplify the diagram). The ".CEL" and ".CDF" file formats are described at the Affymetrix website. The ".xls" file format indicates a Microsoft Excel spreadsheet. The acronym "RMA" refers to the Robust Multichip Average data processing procedure. The RMA processing step is performed using the BioConductor software.

Fig. 2 - Screen shot of the replicate-combined expression data spreadsheet

Each row represents a gene, for which a particular representative Affymetrix probeset was chosen. Column A contains the NCBI gene symbol of the gene. Column B contains the NCBI Entrez ID of the gene. Column C indicates the Affymetrix probeset selected as representative for this gene. Column D contains the gene name (title). Columns E-G contain the Gene Ontology annotation information for the gene. Columns H-N show log2 gene expression intensities across various time points in an experiment involving LPS stimulation of wild-type BL/6 bone marrow-derived macrophages.

Computational predictions of transcription factor binding sites was performed for regions flanking the transcription start sites of approximately 28,000 mouse genes, based on the July 2007 release of the mouse genome assembly. The regions scanned are a merger of 3 different sources: MGI gene coordinates, genomic regions tiled on the ISB customized array and promoter regions tiled on the Affymetrix Mouse Promoter Gene Chip 1.0R array. In total this represents ~12% of the mouse genome. Scanning was performed with the software program MotifLocator, using a collection of transcription factor motif position weight matrices from TRANSFAC Professional and JASPAR_PHYLOFACTS. Repetitive elements were identified in sequence regions using RepeatMasker prior to scanning. A p-value score was assigned for each motif match, enabling straightforward post-filtering of the motif match results based on user-specified stringency criteria.

Mouse Genome Assembly Chronology and Tagging

Mouse assemblies will be used referenced using the mm* notation to describe the specific assembly. In full:

Selection of Genomic Regions to Scan

Three sets of gene/promoter regions were selected and merged into one large set. The sources are genomic coordinates are:

1. Longest VEGA gene if available.
2. Longest NCBI or EnsEMBL coordinates. If they are the same length,
the NCBI coordinates are chosen.

(Source: personal communication with MGI website maintainers.)

Preparation of Genomic Regions to Scan

The start and end coordinates of the genes were simply parsed from the file downloaded from the MGI website.

Both of these arrays are based on Affymetrix's 25mer oligo probe technology. The probes laid on both arrays were specified based on the mm5 mouse genome assembly. Hence a three stage process was derived to stringently re-map probes from assembly mm5 to mm9.

1. Removal of Redundant Probes by Exhaustive Search on Assembly mm9

An exhaustive search was made using all 25mer oligo probes on the mouse assembly mm9, taking each probe sequence and performing a simple, regular expression search.

Probes were discarded if they mapped to multiple genomic locations or could not be mapped onto mm9.

2. Removal of Probes Over-lapping mm9 Repeat Features

Repeat features reported by RepeatMasker and the Tandem Repeat Finder, termed simple repeat features, were downloaded from the UCSC genome website on 16th December 2007. These features were mapped onto the mm9 assembly coordinates. (The simple repeat features are predominantly a sub-set of RepeatMasker repeats.)

Taking the unique probe locations identified in the previous step, the location of each probe was checked against the two sets of repeats and discarded if any of the following criteria were met:

3. Comparison of Mapped Probes with the UCSC liftOver Software Tool

The following step was only peformed on the ISB Customized tiling array as it was not possible to obtain a completely reliable set of probe coordinates for the Affymetrix Mouse Promoter Gene Chip 1.0R array.

The UCSC genome browser website provide a software tool for mapping genomic coordinates from one assembly to another, called liftOver. This is a useful tool to identify how DNA regions are relocated across assemblies, with possible relocations to different chromosomes. It also can identify chromosomal regions that may have been fully or partially deleted when transitioning from an old to new assembly.

It is used in the mapping procedure to complement the exhaustive search. While a 25mer oligo may map to a unique location on a newer assembly, this may not be the genomic region with which the probe was originally associated with. The probe may be an outlier in this case, located in isolation.

The liftOver tool was used to assess the probe remapping from mm5 to mm9 using the following criteria:

The analysis was performed on the set of all probes mapped to mm9, including the repeat and multiple genomic locations analyses. The aim here is to identify those probes that were still classified as unique on mm9 after all the prior analyses but that fail the liftOver criteria.

At the end of the mapping process the following unique probes were retained:

Merging and Creation of Genomic Regions

The three set of genomic regions were overlain with each other and merged to produce the largest, contiguous regions of DNA sequence. For example, for a gene that is completely tiled on the ISB Customized Affymetrix Array, one contiguous regions may contain three sources of coordinates; the Affymetrix Mouse GeneChip Promoter array, the ISB Customised array and the promoter region as defined by the MGI representative genes. In other cases, only MGI promoter regions are found.

The merging process created 30,366 genomic sequence regions for scanning containing 338,012,214 basepairs (approximately 12% of the mouse mm9 assembly golden path length).

The sequences were extracted from the EnsEMBL mouse core database mus_musculus_core_49_37b, which uses the mm9 assembly. Regions were retrieved from the database following RepeatMasking. A small percentage of retrieved regions were entirely repeat sequences and were thus discarded from scanning. This criteria reduced the number of selected genomic regions to 30,355. These regions cover 337,525,505 basepairs, of which 110,863,198 basepairs are repeats (~33%).

Three sets of position weight matrices were used in the scanning.

A set of 389 matrices were extracted from the entire 815 matrices that were tagged specifically as being related to mouse.

TRANSFAC also characterises a set of matrices that are termed non-redundant matrices in vertebrates. In essence for a matrix that has multiple copies, a selection process is applied such that a single, representative matrix is chosen. This set consists of 151 matrices. Of these, 63 matrices were not in the mouse-specific set of TRANSFAC matrices described above. Hence, the "missing" 63 added to the mouse-specific set to create a combined set of 452.

The JASPAR matrices used in the scans are from the open-source JASPAR database. The set specifically incorporated scanned with are the 174 matrices from the PHYLOFACTS set, which is derived from the paper:

# Xie et al., Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals, Nature 434, 338-345 (2005) and supplementary material

The following paragraph is the description of the matrices from the JASPAR help page:

In short, the authors used the following strategy. Promoters (defined as the 4-kb region around the TSS) of human genes from the RefSeq database were aligned against the genomes of mouse, rat and dog. Every consensus sequence of length between 6 and 26, defined over an alphabet of 4 unique (A,C,G,T) and 7 degenerate (R, Y, K, M, S, W, N) nucleotides, was scanned over the alignments. A motif is regarded as conserved when it appears in the alignment both for the human and for the other three mammalian species. The conservation rate p is defined as the number of times a motif is conserved divided by the number of times it occurs in man only. This conservation rate is compared to the expected conservation rate p0, estimated from random motifs, which gives the motif conservation score MCS. Only motifs with an MCS>6 were retained, resulting in a list of 174 highly conserved motifs (see supplementary Table S2 of Xie et al.). The count matrices for these 174 motifs were extracted from the downloaded alignments. They were further annotated according to their resemblance with TRANSFAC and JASPAR CORE motifs. For TRANSFAC, the annotation of Xie et al. was used. For comparing to the JASPAR CORE matrices, the Pearson Correlation Coefficient (PCC) was used to define matrix similarity. All PHYLOFACTS matrices were scanned against the JASPAR CORE matrices, and matrices were regarded as being similar when PCC>0.8. When multiple hits were found, only the one with the highest PCC was retained.

Where it was determined by previous analyses and literature searches that specific transcription factors were not contained in the previous two sets, hand-curated PWMs were added. In the current implementation only a single matrix is incorporated for ISRE.

A combined total of 627 matrices were used in the scanning process.

Preprocessing of PWMs

During initial, exploratory analyses using MotifLocator, on the odd occasion a known binding site would be missed, even though there was a labelled matrix that should have hit. Examining these individual cases it appeared that the specific matrix did not report the hit as it was too stringent in a single position i.e., the matrix expected an A at 100%, whereas a C was observed.

In order to reduce this particular stringency, a low-level of noise was added to the matrices. Specifically, if the expectation of a nucleotide is 0 then 0.01 is added to that position and the other expectations are adjusted accordingly. For example,

Before: A C G T

     1.00  0.00  0.00  0.00

After: A C G T

     0.97  0.01  0.01  0.01

In essence this is akin to the methodology of adding a pseudocount to the nucleotide expectations.

To reduce the number of potential false positive predictions made by the matrix scanning process, a method was implemented to select only the most stringent predictions. The composition of matrices can vary greatly from those that are very tightly defined (i.e. a position that is always 100% an A) to those that are very loose (i.e. any of the four base nucleotides and so an N), so a single global threshold value is not always appropriate. Therefore a scanning procedure was designed that produced a distribution for each indiviudal matrix against random sequence and selected threshold values that gaves us the topmost, significant quantiles. It is described in more detail below.

To assess the statistical significance of MotifLocator scores, we evaluated scores on randomized sequence. We obtained the random distributions for each binding site matrix model individually, as the distributions and thresholds were sensitive to the choice of matrix (see below). The random sequences were constructed as follows. 200kBs of sequence was sampled from upstream regions of ~100 immune related genes. This sequence underwent a random shuffling and was then split into four equal seqments of 50kB each, to be accommodated by the scanning algorithm. Scanning was then performed, on both + and - strands. The randomization procedure was repeated twice for a total of 4x(2?50)x(1+2)=1200 random scores per matrix. An example distribution is given below. This distribution allows us to convert any MotifLocator score on true sequence to a p-value reflecting the probability that that score was obtained by chance. These distributions can also be used to set thresholds for filtering scanning results, for example corresponding to defined quantiles of the random distribution. Quantile values vary considerably between matrices. For example, the 0.1% topmost quantile ranges from 0.711 for TRANSFAC PAX1_B, to 1.087 for TRANSFAC GATA_01.

Scanning was peformed to report predictions with a frequency of one prediction every 1 kb. This is equivalent in reporting a prediction with a p-value of 0.001, which is equivalent to selecting a prediction in the 0.1% topmost quantile when compared to the random distributions. More stringent predictions can be selected by choosing smaller p-values e.g. for reporting predictions with a frequency of occurrence in 1 kb of once in 10 kb, is equivalent to a p-value of 0.0001 or less.

Web Portal Development

The Web Portal is built up of a series of modular components called "portlets". Portlets are modular web applications which can be embedded within a web application and can communicate with each other. Portlets allow people to design their own pages by choosing which portlets to show and directly editing their behavior. Each portlet typically represents a single data source, or performs a simple function.

As portlets can be added and arranged on a web page interactively, different people can layout the portlets to suit their individual needs. Additionally, as portlets within a page can communicate with each other, sophisticated applications can be built up using groups of portlets that send messages to each other.

We have developed portlets for: gene expression viewing; communication with genome annotation sites (e.g. NCBI and Ensembl); searching for genes; viewing binding information (ChIPChip); exploring binding site predications (TRANSFAC); and collecting/browsing of gene annotations.

The life cycle of the portlets, including their look-n-feel, are managed by a portlet container. The portlet container used for the portal is Liferay, and is open-source.