NRSA (Nascent RNA Sequencing Analysis) is a bioinformatics tool dedicated to analyze nascent transcription profiles generated by PRO-seq and GRO-seq data. NRSA not only quantifies nascent transcription for known genes, but also detects, annotates, and quantifies active enhancers, and predict their targets based on the closest TSS (transcriptional start site), within a distance (default: 50kb) strategy and enhancer-TSS associations from FANTOM5 and 4DGenome if available. In addition, NRSA smoothly integrates other genomic data to prioritize enhancers.
Known genes: Quantify transcriptional pausing and detect condition-dependent transcriptional changes in proximal-promoter and gene body regions.
Active enhancers: Identify, prioritize and annotate active enhancers, predict enhancer targets, and quantify condition-dependent changes of eRNAs.
Software: NRSA.zip
Hdac3 wild type:
YZ-1.bed
YZ-2.bed
Hdac3 knock-out:
YZ-3.bed
YZ-4.bed
K562 PRO-seq: K562-PROseq.bam
K562 GRO-seq: K562-GROseq.bam
U2OS GRO-seq: GSM1634453.bam; GSM1634455.bam
GM12878 GRO-seq: GM12878-GROseq.bam
1. Requirements: Perl, R, HOMER and bedtools
Install Perl, R, HOMER and bedtools (v2.22.1), and add /bin directory to your executable path.
2. Install NRSA
unzip NRSA.zip #Unzip the file
cd NRSA/ #Change directories into the folder
chmod 755 bin/*.pl #Change the mode of executable files
chmod 755 bin/*.modules #Change the mode of executable files
#Add NRSA scripts to Shell searching path ($PATH). This step is optional.
#If your NRSA is installed at /home/usrname/NRSA
export PATH=/home/usrname/NRSA/bin/:$PATH
3. Install required perl packages
#Check whether all the packages needed are installed by running test.modules,
./bin/test.modules
#the output of test.modules looks like,
ok Cwd
ok File::Basename
ok File::Path
ok Getopt::Long
fail Statistics::Basic
fail Statistics::R
#If “fail” shows up before the package name, that means users don’t have this required package. In this case, Statistics::Basic and Statistics::R need to be installed.
Users can install missing packages by running "./bin/install.modules packagename".
#e.g
./bin/install.modules Statistics::Basic
#Add the NRSA lib to your PERL5LIB environment variable,
#If your NRSA is installed at /home/usrname/NRSA
export PERL5LIB=$PERL5LIB:/home/usrname/NRSA/lib/
4. Install required R packages
gplots and DESeq2 (v1.2.10)
#In your R shell, type the following commands:
#gplots
install.packages("gplots")
#DESeq2
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")
1. Input
NRSA takes alignment files in bed or bam format as input. If each condition has more than one samples, they should be given as a space-separated list. It is not required to have the same number of samples for each condition. Spike-in sequences should be removed before starting the analysis.
2. Output
NRSA generates various kinds of results, including nascent RNA abundance in proximal-promoter and gene body regions, pausing index and pausing significance of known genes, pausing index changes across conditions, as well as identified active enhancers and long eRNAs and their quantification. Additionally, NRSA generates heatmap and boxplot figures providing a global view of the data. Users can also use tools provided by NRSA to customize the figures they want to generate.
To identify function important enhancers, NRSA introduced a new algorithm to prioritize enhancers based on the assumption that enhancers bound by the regulator of interest and affecting the transcription of target genes are highly likely to mediate the regulator's function. Integrating GRO/PRO-seq data with ChIP-seq peaks of the regulator, NRSA combines the binding and the functional evidence to rank each enhancer:
Where Es is the combined score for the enhancer, Bs is the upstream binding effect score and Fs is the downstream functional activity score, w is a weight to balance the relative impact of binding and functional evidence. These three scores for each enhancer are included in the output file Enhancer.txt.
There are two conditions, Hdac3 wild type and Hdac3 knock-out in mouse liver. Each condition has two replicates. Please download them from the Download section and save them in the data/ folder of NRSA.
Hdac3 wild type:
YZ-1.bedHdac3 knock-out:
YZ-3.bedNote: please remove spike-ins before running NRSA if you have.
perl ./bin/pause_PROseq.pl -o ./pause_out/ -in1 ./data/YZ-1.bed ./data/YZ-2.bed -in2 ./data/YZ-3.bed ./data/YZ-4.bed -m mm10
BED/BAM files for each condition are separated by space. BED/BAM files for the first condition are specified by -in1, while those for the second condition are specified by -in2, and the output directory is specified by -o (Note that this directory will be used as the work directory in step 2). The organism is defined by -m. hg19, mm10, dm3 and ce10 are available, default is hg19. Make sure the genome version is consistent with the one reads were aligned to. The differential analysis is performed on condition 2 vs. 1.
If you have added NRSA scripts to Shell searching path ($PATH), please use the command "pause_PROseq.pl -o ./pause_out/ -in1 ./data/YZ-1.bed ./data/YZ-2.bed -in2 ./data/YZ-3.bed ./data/YZ-4.bed -m mm10" instead.
The main results are put into the folder known-gene/ and intermediate results are into the folder intermediate/.
Main results (known-gene/):
Two files: quantification of transcription for each gene in proximal-promoter and gene body regionspindex.txt | pausing index and significance for each gene in all samples |
normalized_pp_gb.txt | normalized reads count in promoter-proximal and gene body regions for each gene in all samples |
pp_change.txt | transcription changes of promoter-proximal regions across two conditions |
gb_change.txt | transcription changes of gene body regions across two conditions |
pindex_change.txt | pausing index changes across two conditions |
boxplot_ppdensity.pdf | box plot of normalized read density of promoter-proximal regions for each sample |
boxplot_gbdensity.pdf | box plot of normalized read density of gene body regions for each sample |
boxplot_pausingIndex.pdf | box plot of pausing index for each sample |
pindex_change.pdf | heatmap of pausing index change across two conditions for genes with adjp<0.05 |
heatmap.pdf | heatmap of condition-dependent transcription changes around TSS for active genes |
Reps-condition1.tif | histogram for transcriptional changes between replicates in condition 1 |
Reps-condition2.tif | histogram for transcriptional changes between replicates in condition 2 |
Intermediate results that users might be interested in (intermediate/):
(samplename)_raw.txt | raw counts, density, summit and pausing index for each sample |
(samplename)_FDR.txt | same to (samplename)_raw.txt, but adding significant information of pausing index |
count_pp_gb.txt | raw counts for all samples |
active_gene.txt | active gene list |
nf.txt | normalization factors of each sample |
perl ./bin/eRNA.pl -w ./pause_out/ -in1 ./data/YZ-1.bed ./data/YZ-2.bed -in2 ./data/YZ-3.bed ./data/YZ-4.bed -m mm10 -pri 1 -dir 0 -peak ../Hdac3_ZT10_peaks.narrowPeak -lk pp -cf 0.001
BED/BAM files for each condition are separated by space. Data for the first condition are specified by -in1, while those for the second condition are specified by -in2 (the input files should be the same with step 1), and the organism is defined by -m. Here, -w is used to provide the work directory of step 2 (it should be the same as the output directory of step 1. Even if users only want to run enhancer analysis, the first step should be run first to generate information required for this step).
If you have added NRSA scripts to Shell searching path ($PATH), please use the command "eRNA.pl -w ./pause_out/ -in1 ./data/YZ-1.bed ./data/YZ-2.bed -in2 ./data/YZ-3.bed ./data/YZ-4.bed -m mm10 -pri 1 -dir 0 -peak ../Hdac3_ZT10_peaks.narrowPeak -lk pp -cf 0.001" instead.
The outputs of step 2 are put into two folders: eRNA/ and intermediate/. Here the intermediate/ folder is the same one of step 1.
Main results (eRNA/):
Three files: identification/quantification for active enhancersEnhancer.txt | list of identified enhancers with annotation, predicted target genes from different strategies, and rank scores |
Enhancer_center.txt | list of enhancer centers |
normalized_count_enhancer.txt | normalized counts for each enhancer |
Enhancer_change.txt | transcription change of enhancers |
signal_around_ehancer-center.pdf | PRO-seq signal around enhancer center for all samples |
long_eRNA.txt | indentified long eRNAs (default: length > 10 Kb) |
longeRNA-pindex.txt | pausing information of long eRNAs for all samples |
longeRNA-normalized_pp_gb.txt | normalized reads count in promoter-proximal and gene body regions of long eRNAs |
longeRNA-pp_change.txt | transcription changes of promoter-proximal regions of long eRNAs across two conditions |
longeRNA-gb_change.txt | transcription changes of gene body regions of long eRNAs across two conditions |
longeRNA-pindex_change.txt | pausing index changes of long eRNAs across two conditions |
Intermediate results (intermediate/):
(samplename)_longeRNA_raw.txt | raw counts, density, summit and pausing index of long eRNAs for each sample, one file per sample |
(samplename)_longeRNA_FDR.txt | same to (samplename)_raw.txt, but adding significant information of pausing index |
count_enhancer.txt | raw counts of enhancers for all samples |
longeRNA-count_pp_gb.txt | raw counts or long eRNAs for all samples |
PROseq_signal.txt | raw signal around enhancer center for plot (not normalized) |
If you added NRSA scripts to your Shell searching path ($PATH), please omit "perl" for the following commands. e.g. for pause_PROseq.pl, please use "pause_PROseq.pl options...." instead of "perl pause_PROseq.pl options....".
Usage: perl pause_PROseq.pl [options] -in1 bed/bam files -in2 bed/bam files
e.g: perl pause_PROseq.pl -o pause_out/ -in1 control1.bed control2.bed -in2 case1.bed case2.bed
-m [string] | defines the genome: hg19, mm10, dm3 or ce10, default: hg19 (Make sure the genome version is consistent with the one the reads were aligned to.) |
-in1 [bed/bam] | required, read alignment files in bed (6 columns) or bam format for condition1, each file is separated by space |
-in2 [bed/bam] | read alignment files in bed (6 columns) or bam format for condition2, each file is separated by space (It is NOT required to have the same number of samples for each condition. The differential analysis is performed on condition 2 vs. 1.) |
-f1 [string] | normalization factors for samples of condition1, separated by space. If not specified, we will use DESeq2 default method to normalize |
-f2 [string] | normalization factors for samples of condition2, same to -f1 |
-o [string] | required, output/work directory |
-u [int] | defines the upstream of TSS as promoter (bp, default: 500) |
-d [int] | defines the downstream of TSS as promoter (bp, default: 500) |
-b [int] | defines the start of gene body density calculation (bp, default: 1000) |
-l [int] | defines the minimum length of a gene to perform the analysis (bp, default: 1000). Genes with length less than this will be ignored. |
-w [int] | defines the window size (bp, default: 50) |
-s [int] | defines the step size (bp, default: 5) |
-h | help message |
Column ID | Description |
Transcript | Transcript ID |
Gene | Gene symbol |
ppc | reads count in promoter-proximal region |
ppm | mappable sites in promoter-proximal region |
ppd | reads density of promoter-proximal region |
pps | the site position with highest reads count in promoter-proximal region (pausing site) |
gbc | reads count in gene body region |
gbm | mappable sites in gene body region |
gbd | reads density of gene body region |
pauseIndex | pausing index |
Usage: perl eRNA.pl [options] -in1 bed/bam files -in2 bed/bam files
e.g: perl eRNA.pl -w pause_out/ -in1 control1.bed control2.bed -in2 case1.bed case2.bed
-m [string] | defines the genome: hg19, mm10, dm3 or ce10, default: hg19 (Make sure the genome version is consistent with the one reads were aligned to.) |
-c [int] | distance cutoff for divergent transcripts for enhancer detection (bp, default: 400) |
-d [int] | distance within which two eRNAs are merged (bp, default: 500) |
-w [string] | required, working directory of eRNA.pl, should be the same of pause_PROseq.pl’s output directory |
-le [int] | length cutoff for long-eRNA identification (bp, default: 10000) |
-in1 [bed/bam] | required, read alignment files in bed (6 columns) or bam format for condition1, each file is separated by space |
-in2 [bed/bam] | read alignment files in bed (6 columns) or bam format for condition2, each file is separated by space (same files should be used for pause_PROseq.pl and eRNA.pl) |
-f [int] | whether to filter for enhancers: 0 (not filter) or 1 (filter), deflault: 1 |
-dtss [int] | if filter enhancers, the minimum distance from enhancer to TSS (bp, default: 2000) |
-dtts [int] | if filter enhancers, the minimum distance from enhancer to TTS (bp, default: 20000) |
-wd [int] | distance within which associate active genes of enhancer (bp, default: 50000) |
-pri [int] | prioritize enhancer or not. 1: yes; 0: no (default: 0) |
-peak [string] | required if -pri set to 1. ChIP-seq peak file for the regulator of interest in bed format |
-lk [string] | valid when -pri set to 1 for two conditions. Transcriptional changes to be used for enhancer prioritization, the value can be pp (changes in promoter-proximal region), gb (changes in gene body region), or pindex (changes in pausing index) (default: gb) |
-dir [int] | valid when -pri set to 1 for two conditions. The expected simultaneous change of expression between enhancers and target genes. 1: the expression changes of enhancers and target genes are in the same direction; -1: the expression changes of enhancers and target genes are in the opposite direction; 0: the expression changes of enhancers and target genes are either in the same or in the opposite direction (default: 0) |
-wt [real] | valid when -pri set to 1 for two conditions. Weight to balance the impact of binding and function evidence. The higher the weight, the bigger impact does the binding evidence have (default: 0.5, i.e., binding and functional evidence have equal impact on enhancer prioritization) |
-cf [real] | valid when -pri set to 1 for two conditions. Cutoff to select significant transcriptional changes (default: FDR < 0.05). Use Foldchange instead if no FDR is generated (default: Foldchange > 1.5) |
-h | help message |
Column ID | Description |
chr | chromosome location of enhancers |
start | the start position of enhancers |
end | the end position of enhancers |
center | the location of enhancer centers |
FANTOM5 | whether enhancer centers exist in FANTOM5 |
associated_gene-FANTOM5 | associated genes in FANTOM5 |
associated_gene-50kb | associated genes within 50 Kb distance |
associated_gene-4DGenome | associated genes in 4DGenome |
Cell/Tissue | Cell/Tissue associated genes in 4DGenome detected in |
detection_method | detection method associated genes in 4DGenome |
PMID | PMID corresponding associated genes in 4DGenome |
closest_gene | the closest gene to each enhancer |
distance | the distance from the closest gene to the enhancer |
Fscore | enhancer prioritizing score for downstream functional activity |
Bscore | enhancer prioritizing score for upstream binding effect |
Score | combined enhancer prioritizing score |
It generates three boxplots: reads density in promoter-proximal regions, reads density in gene body regions, and pausing index for each sample.
Usage: perl boxplot.pl options
e.g: perl boxplot.pl -i genes_for_boxplot.txt -w pausing_result/ -n boxplot -rep1 2 -rep2 2
-i [string] | required, file with a list of genes (RefSeq id like, NM_001005277) user wants to plot |
-w [string] | required, work directory, should be the same of pause_PROseq.pl’s output/work directory |
-n [string] | required, prefix name of output figures |
-rep1 [int] | required, the number of replicates in the first condition |
-rep2 [int] | the number of replicates in the second condition |
-h | help message |
It generates a heatmap for transcription changes in condition2 vs condition1 around TSS for genes of interest.
Usage: perl heatmap.pl options
e.g: perl heatmap.pl -i genes_for_heatmap.txt -w pausing_result/ -n heatmap -in1 control1.bed control2.bed -in2 case1.bed case2.bed
-i [string] | required, file with a list of genes (RefSeq id like, NM_001005277) user wants to plot |
-m [string] | define the genome: hg19, mm10, dm3, or ce10, default: hg19 |
-in1 [bed/bam] | required, read alignment files in bed/bam format for condition1, each file is separated by space |
-in2 [bed/bam] | read alignment files in bed/bam format for condition2, each file is separated by space |
-r [int] | upstream and downstream distance relative to TSS for plotting PRO-seq signal (bp, default: 5000), should can be divided by the bin size |
-b [int] | bin size (bp, default: 200), should can be divided by 2 |
-w [string] | required, work directory, should be the same of pause_PROseq.pl’s output/work directory |
-n [string] | required, name of output figure |
-h | help message |
It generates a figure for pausing index changes for genes of interest in condition2 vs condition1.
Usage: perl pindex_change-plot.pl options
e.g: perl pindex_change-plot.pl -w pausing_result/ -n pindex-change -cutajp 0.001 -cutfc 2
-w [string] | required, work directory, should be the same of pause_PROseq.pl’s output/work directory |
-n [string] | required, name of output figure |
-cutajp [value] | cutoff of adjusted pvalue for seleting genes, default "0.001" |
-cutfc [value] | cutoff of absolute value of log2 fold change for selecting genes, default "0" |
-h | help message |
This tool smooths the integration of PRO/GRO-seq results with other genomic data. It generates a histogram showing the PRO/GRO-seq signals or other sequencing data (e.g., binding or histone modification signals from ChIP-seq data) around identified enhancers or regions of interest.
Usage: perl signal.pl options
perl signal.pl -i enhancer_center.txt -w pause_out/ -n signal -in1 control1.bed control2.bed -in2 case1.bed case2.bed
-i [string] | required, enhancer center position file (Enhancer_center.txt-- the output of eRNA.pl), or the file of chromation location of interest (should contains at least two columns separated by tab or space, e.g "chr11 83078317") |
-d [int] | up and down relative distance from the center for plotting PRO-seq or other signal (bp, default: 2000), should can be divided by bin size |
-b [int] | bin size for smoothing (bp, default: 20), must be even number |
-in1 [bed/bam] | required, read alignment files in bed/bam format for condition1, each file is separated by space |
-in2 [bed/bam] | read alignment files in bed/bam format for condition2, each file is separated by space |
-w [string] | required, work directory, should be the same of pause_PROseq.pl’s output directory |
-n [string] | required, name of output figure |
-s [char] | option to generate PRO/GRO-seq signal (p), or signal for other data such as histone modification (o) (default: p) e.g. generates PRO-seq signal around enhancer centers perl signal.pl -i enhancer_center.txt -s p -w pause_out/ -n PROseq-signal -in1 YZ-1.bed YZ-2.bed -in2 YZ-3.bed YZ-4.bed e.g. generates H3K4me1 signal around enhancer centers perl signal.pl -i enhancer_center.txt -s o -w pause_out/ -n H3K4me1-signal -in1 H3K4me1-rep1.bam H3K4me1-rep2.bam |
-h | help message |
Nascent RNA sequencing analysis (NRSA) provides insights into the transcriptional regulation of eRNAs. Under revision.