RCADE: Recognition Code-Assisted Discovery of regulatory Elements

 

Motif discovery from ChIP-seq data is often limited by presence of non-targeted transcription factor motifs, as well as similarity of peak sequences due to common ancestry rather than common binding factors. The latter aspect particularly affects a large number of proteins from the Cys2His2 zinc finger (C2H2-ZF) class of transcription factors, as their binding sites are often dominated by endogenous repeat elements (EREs) that have highly similar sequences. To overcome these limits, RCADE combines predictions from a DNA recognition code of C2H2-ZFs with ChIP-seq data to identify models that represent the genuine DNA binding prefer-ences of C2H2-ZF proteins.

 

 

RCADE webserver

 

For motif discovery using RCADE, you will need the sequence of the C2H2-ZF protein that was targeted in the ChIP-seq experiment, as well as the sequences of the peaks identified from ChIP-seq. We suggest that you use the top 500 peaks (based on peak p-values), and center the sequences on peak summits. You can either copy/paste the sequences in FASTA format in the provided text boxes, or upload the FASTA files.

 

Once the sequences/files are selected, click on the ÒSubmitÓ button. This will take you to the results page, which provides you with the links for downloading the RCADE output files, and also shows an in-depth log of activities, which is mostly helpful for debugging purposes. In case you encounter an error running RCADE, please include this log text along with your input files in an email to hamed.najafabadi@utoronto.ca.

 

Here are the output files that you can download on the results page:

-       Optimization summary (PS): A postscript file that visualizes a summary of the optimization results. RCADE identifies several motifs from the ChIP-seq data, which are sorted in this file based on their AUROC values for distinguishing ChIP-seq peaks from dinucleotide-shuffled sequences. For each motif, the corresponding zinc fingers are shown on the top (for example, CTCF:3-7 means that zinc fingers 3-7 of the CTCF protein are used for predicting the initial seed motif that is then optimized). The seed motif that is directly predicted from protein sequence is then shown, followed by the motif that is optimized based on ChIP-seq data. The AUROC value for each motif, the associated p-value, as well as the Pearson similarity of the seed and optimized motifs are also shown.

-       Optimization summary for the top motif (PS): Same as the above output, except that it only includes the top-scoring optimized motif.

-       Top optimized motif (CisBP format): A text file containing the PFM of the top-scoring optimized motif, in a format similar to what is used in the CisBP database (http://cisbp.ccbr.utoronto.ca/).

-       Top optimized motif (MEME format): A text file containing the PFM of the top-scoring optimized motif, in a format suitable for the MEME suite (http://meme.nbcr.net/meme/).

-       All seeds and optimized motifs (CisBP format): A text file containing all seed motifs and their optimized versions (the optimized motif names end with the phrase ÒoptÓ). The motifs are in CisBP format.

-       Log: The log file showing all the messages produced by RCADE.

 

 

 

RCADE source code

 

The source code for RCADE can be downloaded here.

 

Requirements

Unix-compatible OS

R version 3.0.1 or later (http://www.r-project.org/)

R ÒrandomForestÓ library (http://cran.r-project.org/web/packages/randomForest/index.html)

GNU-compatible MAKE software (https://gcc.gnu.org/)

MEME Suite (http://meme.nbcr.net/meme/downloads.html)

 

Installation

Step 1. To install the program, extract the package, and run the "make" command.

Step 2. Change the value of line 7 of the ÒRCOpt.shÓ script to where the executable MEME files are located on your computer.

 

To test the pipeline, execute this command:

bash RCOpt.sh MyTestJob examples/CTCF/CTCF.fasta examples/CTCF/GSM1407629.500bp.fasta

This should create a Ò./out/MyTestJobÓ folder, with the RCADE output files described above.

 

Usage

Use the RCOpt.sh script to run RCADE on your dataset:

bash RCOpt.sh <job_name> <C2H2_fasta> <ChIP_fasta>