CentriMo outputs a tab-separated values (TSV) file ('centrimo.tsv') that contains one line for each
region found to be significantly enriched for a motif.
The lines are sorted in order of decreasing statistical significance.
The first line in the file contains the (tab-separated) names of the fields.
Your command line is given at the end of the file in a comment line starting with the
character '#'.
The names and meanings of each of the fields, which depend on whether or not you provide
control sequences to CentriMo, are described in the table below.
field
name
contents
1
db_index
The index of the motif file that contains the motif. Motif
files are numbered in the order they appeared on the command line.
2
motif_id
The name of the motif, which is unique in the motif database file.
If more than one motif has the same ID, CentriMo uses only the first such motif.
The name is single-quoted and preceded with '+' or '-' if you scanned separately with
the reverse complement motif (using the --sep option).
3
motif_alt_id
An alternate name for the motif that may be provided in the motif database file.
4
consensus
A consensus sequence computed from the motif (as described below).
5
E-value
The expected number of motifs that would have
at least one region as enriched for best matches to the motif as the reported region
(or would have optimal average distance to the sequence center as low as observed,
if you used the --cd option).
The E-value is the adjusted p-value multiplied by the number of motifs in the
input file(s).
6
adj_p-value
The statistical significance of the enrichment of the motif, adjusted for multiple tests.
By default, a p-value is calculated by using the one-tailed binomial
test on the number of sequences with a match to the motif
that have their best match in the reported region;
if you provided control sequences, the p-value of Fisher's exact test on the enrichment of
best matches in the positive sequences relative to the negative sequences is computed instead;
if you used the --cd option, the p-value is the probability that the average
distance between the best site and the sequence center would be as low or lower than observed,
computed using the cumulative Bates distribution, optimized over different score thresholds.
In all cases, the reported p-value has been adjusted for the number of regions
and/or score thresholds tested.
7
log_adj_p-value
Log of adjusted p-value.
8
bin_location
Location of the center of the most enriched region, or
0 if you used the --cd option.
9
bin_width
The width (in sequence positions) of the most enriched region (default),
or two times the average distance between the center of the best site
and the sequence center if you used the option --cd.
A best match to the motif is counted as being in the region if
the center of the motif falls in the region.
10
total_width
The maximum number of regions possible for this motif:
round(sequence_length - motif_length + 1)/2,
or the number of places the motif will fit if you used the --cd option.
11
sites_in_bin
The number of (positive) sequences whose best match to the motif falls in the reported region (default) or anywhere in the sequence (if you used the option --cd).
Note: This number may be less than the number of
(positive) sequences that have a best match in the region.
The reason for this is that a sequence may have many matches that score
equally best. If n matches have the best score in a sequence, 1/n is added to the
appropriate bin for each match.
12
total_sites
The number of sequences containing a match to the motif
above the score threshold.
13
p_success
The probability of a random match falling into the enriched region:
bin_width / total_width
14
p-value
The uncorrected p-value before it gets adjusted for the
number of multiple tests to give the adjusted p-value.
15
mult_tests
This is the number of multiple tests (n) done for this motif.
It was used to adjust the p-value of a region for
multiple tests using the formula:
p' = 1 - (1 - p)^n, where p is the unadjusted p-value.
The number of multiple tests is the number of regions
considered times the number of score thresholds considered.
It depends on the motif length, sequence length, and the type of
optimizations being done (central enrichment, local enrichment, central distance or
score optimization).
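The relationships among the p-value, adj_p-value, E-value, and p_success fields can be sketched in a few lines of Python. This is only an illustration of the formulas given in the table, not CentriMo's own code; the function names are made up for this example, and the use of log1p/expm1 for numerical stability is an assumption about how you might want to evaluate the formula for very small p-values.

```python
import math

def adjust_p_value(p, n_tests):
    """Return p' = 1 - (1 - p)^n for an unadjusted p-value p and n tests."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 1.0:
        return 1.0
    # log1p/expm1 keep precision when p is far smaller than machine epsilon.
    return -math.expm1(n_tests * math.log1p(-p))

def e_value(adj_p, n_motifs):
    """E-value: the adjusted p-value times the number of motifs in the input."""
    return adj_p * n_motifs

def p_success(bin_width, total_width):
    """Probability of a random match falling into the region."""
    return bin_width / total_width
```

For example, an unadjusted p-value of 1e-20 with 100 tests gives an adjusted p-value of about 1e-18, which a direct evaluation of 1 - (1 - p)^n in floating point would round to 0.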
The following additional columns are present when you provide control sequences to CentriMo
(using the --neg option).
16
neg_sites_in_bin
The number of negative sequences where the best
match to the motif falls in the reported region.
This value is rounded but the underlying value may contain fractional counts.
Note: This number may be less than the number of negative sequences that have a best match in the region.
The reason for this is that a sequence may have many matches that score equally best.
If n matches have the best score in a sequence, 1/n is added to the
appropriate bin for each match.
17
neg_sites
The number of negative sequences containing a match to the
motif above the minimum score threshold.
When score optimization is enabled the score threshold may be raised
higher than the minimum.
18
neg_adj_pvalue
The probability that any tested region in the negative
sequences would be as enriched for best matches to this motif
according to the Binomial test.
19
log_neg_adj_pvalue
Log of negative adjusted p-value.
20
fisher_adj_pvalue
Fisher adjusted p-value before it gets adjusted for the
number of motifs in the input file(s).
21
log_fisher_adj_pvalue
Log of Fisher adjusted p-value.
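If you want to read 'centrimo.tsv' programmatically, a minimal Python sketch might look like the following. It relies only on what is described above: a tab-separated header line, one line per region, and a trailing comment line starting with '#'. The function name and the sample values (including the motif ID) are hypothetical, and skipping blank lines is a defensive assumption.

```python
import csv
import io

def read_centrimo_tsv(handle):
    """Yield one dict per region line, keyed by the field names in the header."""
    # Skip blank lines and '#' comment lines (the command line is stored in one).
    lines = (ln for ln in handle if ln.strip() and not ln.startswith("#"))
    yield from csv.DictReader(lines, delimiter="\t")

# Hypothetical three-column excerpt; a real file has all the fields listed above.
SAMPLE = (
    "db_index\tmotif_id\tE-value\n"
    "1\tMOTIF_A\t1.5e-10\n"
    "\n"
    "# the command line would appear here\n"
)

rows = list(read_centrimo_tsv(io.StringIO(SAMPLE)))
```

With a real file, you would pass an open file handle instead of the in-memory sample, for instance `open("centrimo.tsv", newline="")`.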
A consensus sequence is constructed from each column in a
motif's frequency matrix using the "50% rule"
as follows:
The letter frequencies in the column are sorted in decreasing order.
Letters with frequency less than 50% of the maximum are discarded.
The letter used in this position in the consensus sequence is determined
by the first rule below that applies:
If there is only one letter left, or if the remaining letters exactly match
an ambiguous symbol in the alphabet, the letter or ambiguous symbol,
respectively, is used.
Otherwise, if the remaining set contains at least 50% of the core
symbols in the alphabet, the alphabet's wildcard
(e.g., "N" for DNA or RNA, and "X" for protein) is used.
Otherwise, the letter with the maximum frequency is used.
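The "50% rule" above can be sketched in Python as follows. This is an illustrative reading of the rules, not the code CentriMo uses: the function names are invented for this example, the ambiguity table is the standard IUPAC DNA code set, and the input is assumed to be one dict of letter frequencies per motif column.

```python
# Standard IUPAC DNA ambiguity codes, keyed by the core letters they stand for.
DNA_AMBIGUOUS = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def consensus_letter(freqs, ambiguous=DNA_AMBIGUOUS, wildcard="N"):
    """Pick the consensus symbol for one column (freqs: core letter -> frequency)."""
    # Sort letter frequencies in decreasing order.
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[0][1]
    # Discard letters with frequency less than 50% of the maximum.
    kept = [letter for letter, f in ranked if f >= 0.5 * top]
    # Rule 1: a single letter, or an exact match to an ambiguous symbol.
    if len(kept) == 1:
        return kept[0]
    symbol = ambiguous.get(frozenset(kept))
    if symbol is not None:
        return symbol
    # Rule 2: at least 50% of the core symbols remain, so use the wildcard.
    if len(kept) >= 0.5 * len(freqs):
        return wildcard
    # Rule 3: fall back to the most frequent letter.
    return ranked[0][0]

def consensus(columns, **kwargs):
    """Build the consensus sequence for a list of per-column frequency dicts."""
    return "".join(consensus_letter(col, **kwargs) for col in columns)
```

For a protein motif you would pass an ambiguity table for that alphabet (or an empty dict) and `wildcard="X"`.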
CentriMo outputs a text file ('site_counts.txt') that contains,
for each motif, pairs of values (bin_position, site_count),
or triples of values (bin_position, site_count, neg_site_count) if
you provided control sequences to CentriMo.
This data can be used to plot the density of motif best matches (sites)
along the input sequences. Fractional counts are possible if multiple (n)
bins contain the best match for a given sequence, with each bin
receiving an incremental count of 1/n.
The data for each motif begins with a header line with the format:
DB <db_number> MOTIF <id> <alt>
where <id> and <alt> are as described above.
The following lines (up to the next header line)
each contain a single value-pair or value-triple for the motif
named in the header line.
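A small Python sketch for reading 'site_counts.txt' is shown below. It assumes that the values on each data line are separated by whitespace and that the <alt> field may be absent from a header line; neither detail is stated above, so check them against your own file. The function name is hypothetical.

```python
def parse_site_counts(lines):
    """Group value-pairs or value-triples under the motif named in each header.

    Returns a dict mapping (db_number, motif_id, alt_id) to a list of tuples
    (bin_position, site_count) or (bin_position, site_count, neg_site_count).
    """
    data = {}
    current = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # ignore blank lines
        if fields[0] == "DB" and len(fields) >= 4 and fields[2] == "MOTIF":
            # Header line: DB <db_number> MOTIF <id> <alt>
            alt = fields[4] if len(fields) > 4 else None
            current = (fields[1], fields[3], alt)
            data[current] = []
        elif current is not None:
            data[current].append(tuple(float(v) for v in fields))
    return data
```

Because fractional counts are possible, the counts are kept as floats rather than converted to integers.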
Each "motif probability curve" shows the (estimated) probability of the
best match to a given motif occurring at a given position in the
input sequences. This estimated probability is based only on sequences that
contain at least one match with score above
the score threshold defined for this motif,
and is the maximum likelihood estimate of the conditional probability shown below.
Points (X,Y) on the plot are:
Y = Pr(best match occurs at position X | sequence contains a match)
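As a rough illustration, the unsmoothed curve can be estimated from (position, count) pairs by dividing each count by the total count. The sketch below assumes that the counts include only sequences with a match above the score threshold, so that the counts sum to the number of such sequences; the function name is invented for this example.

```python
def match_probabilities(site_counts):
    """Convert (position, count) pairs into (position, probability) pairs."""
    total = sum(count for _, count in site_counts)
    if total == 0:
        # No sequence contains a match, so there is nothing to estimate.
        return [(pos, 0.0) for pos, _ in site_counts]
    # Maximum likelihood estimate: the fraction of best matches at each position.
    return [(pos, count / total) for pos, count in site_counts]
```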
Note: The plots are smoothed according to the function
selected from the "Smoothing" menu on the right. Setting the smoothing
window size to 1 turns off smoothing.
If a negative dataset has been supplied then two curves are drawn for
each motif, one for each dataset. The distribution of the motif in the
primary dataset is plotted with a single unbroken curve, whereas the distribution in
the negative dataset is plotted with a dashed curve.
This shows a listing of all motifs currently plotted on the graph.
The color used to plot a motif can be changed by clicking on the
color swatch next to the motif you want to change, followed by clicking
on the color swatch you wish to swap it with.
Allows selection of the smoothing function applied to the graph.
The weighted moving average option uses weights shaped as an isosceles
triangle where the central point (or points in an even sized window)
get the maximum weight.
The moving average simply weights all points in the smoothing window
equally.
Note: Setting the smoothing window size to 1 turns off
smoothing.
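The two smoothing functions can be sketched in Python as shown below. This is a simplified illustration, not the code used by the plot: the function names are hypothetical, and returning values only where the full window fits is an assumption about edge handling.

```python
def window_weights(size, weighted=False):
    """Return normalized weights for a smoothing window of the given size."""
    if size < 1:
        raise ValueError("window size must be at least 1")
    if weighted:
        # Isosceles triangle: 1, 2, ..., peak, ..., 2, 1.
        # An even-sized window has two central points sharing the maximum.
        raw = [min(i + 1, size - i) for i in range(size)]
    else:
        # Plain moving average: every point in the window counts equally.
        raw = [1] * size
    total = sum(raw)
    return [w / total for w in raw]

def smooth(values, size, weighted=False):
    """Smooth values; only windows that fit entirely inside the data are used."""
    weights = window_weights(size, weighted)
    return [
        sum(w * v for w, v in zip(weights, values[i:i + size]))
        for i in range(len(values) - size + 1)
    ]
```

With a window size of 1 the single weight is 1, so the values are returned unchanged.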
Window
The window size used to smooth the graph. The larger the smoothing
window size, the smoother the graph, at the cost of hiding detail.
Below a smoothing window size of 10, thinner lines are used on the
graph to allow more detail to be visible.
Note: Remember to press "return" or "enter" after changing
the number in the input box in order to see the effect of the new
smoothing window size.
Legend
Choose to display/disable the on-graph legend. The legend can be
moved by clicking on the graph.
Negative Sequences
Choose whether to plot the motif probability curve(s) for the
negative sequences (if provided). The curve(s) are plotted as dashed
lines, using the same color as the corresponding curve for the positive
sequences.
Zoom
Drag a range on the graph to zoom into that section. Clicking
"Undo Zoom" will return the view to the previously displayed part of the
graph and clicking "Center on 0" will move the view so 0 is in the
center.
Download EPS
Download the graph that you are currently viewing as an
encapsulated postscript (EPS) image. EPS images are scalable making them
suitable for publication.
List only enriched motifs that meet the selected filter criteria below.
Selected motifs are always listed; deselect all motifs first by clicking on
the "X" above the color swatches if you wish to filter all motifs.
To filter on "ID" or "Name", you can enter any JavaScript regular
expression pattern. See a JavaScript reference
for documentation on JavaScript regular expression patterns.
Sorting is applied after filtering where possible (the exception being
the "Top" filter) so the filters applied will affect the sort. You can
choose the motif sorting feature using the "Motifs:" menu.
If CentriMo is searching for locally enriched regions (not just centrally
enriched regions), then multiple regions may be found per motif, and
the "Regions:" menu will also be displayed. In this case,
CentriMo first sorts all regions using the feature
shown in the "Regions:" menu, and then it sorts the highest-ranked
region of each motif according to the feature shown in the "Motifs:" menu.
Unless you check the box next to the "Regions:" menu, it will automatically
show the same feature as the "Motifs:"
menu (or "E-value" if a motif-only feature is chosen in the "Motifs:" menu).
Note: The motif p-value shown in the plot legend will always be for
the region with the lowest p-value, and therefore may not match the value
shown in the table "p-value" column
when the "Regions:" menu is not set to "p-value".
The expected number of motifs that would have at least one
region as comparatively enriched for
best matches to the motif as the reported region in the
positive sequences compared with the negative
sequences.
The Fisher E-value is the p-value of the one-sided Fisher's exact test
for enrichment of best matches in the region among the positive sequences
that contain at least one match, relative to the negative sequences,
multiplied by the number of motifs in the input database(s).
The Fisher's exact test p-value is corrected for the number
of regions and score thresholds tested ("Multiple Tests").
Fisher's exact test assumes that the probability that the best match
(if any) falls into a given region is the same for all
positive and negative sequences.
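A one-sided Fisher's exact test on such counts can be sketched as a hypergeometric tail sum. The code below is a generic textbook version, not CentriMo's implementation; the function and parameter names are hypothetical, and it assumes integer counts, so any fractional match counts would need to be rounded first.

```python
from math import comb

def fisher_one_sided(pos_in, pos_total, neg_in, neg_total):
    """P-value for seeing at least pos_in positive sequences in the region.

    pos_in, neg_in: sequences whose best match falls in the region.
    pos_total, neg_total: sequences that contain at least one match.
    """
    in_total = pos_in + neg_in      # all sequences with a best match in the region
    n = pos_total + neg_total       # all sequences with a match
    upper = min(pos_total, in_total)
    # Sum the hypergeometric probabilities of outcomes at least as extreme.
    tail = sum(
        comb(in_total, k) * comb(n - in_total, pos_total - k)
        for k in range(pos_in, upper + 1)
    )
    return tail / comb(n, pos_total)
```

The resulting p-value would still need to be corrected for the number of tests and multiplied by the number of motifs to obtain an E-value.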
The statistical significance of the enrichment of the motif, adjusted for multiple tests.
The enrichment p-value of a motif is calculated by using the one-tailed binomial test on the number of sequences with a match to the motif ("Sequence Matches") that have their best match in the reported region ("Region Matches"), corrected for the number of regions and score thresholds tested ("Multiple Tests"). The test assumes that the probability that the best match in a sequence falls in the region is the region width divided by the number of places a motif can align in the sequence (sequence length minus motif width plus 1).
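The uncorrected binomial p-value described above can be sketched as follows. This is a naive direct sum that is adequate for small inputs; for thousands of sequences a log-space computation would be preferable. The function names are invented for this example, and integer match counts are assumed.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def region_p_value(region_matches, sequence_matches,
                   region_width, sequence_length, motif_width):
    """One-tailed binomial p-value for a region, before multiple-test correction."""
    # Chance that a best match lands in the region under the null model:
    # region width over the number of places the motif can align.
    p_success = region_width / (sequence_length - motif_width + 1)
    return binomial_upper_tail(region_matches, sequence_matches, p_success)
```

For example, with 2 of 2 matching sequences in a 50-position region, sequence length 109, and motif width 10, the success probability is 50/100 and the p-value is 0.25.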
The expected number of motifs that would have at least one region as enriched for best matches to the motif as the reported region.
The E-value is the adjusted p-value multiplied by the number of motifs in the
input file(s).
The Matthews Correlation Coefficient (MCC) gives a measure of the ability
of the motif to discriminate the positive sequences from the negative sequences:
MCC = (TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)), where
TP is the number of positive sequences with a best match in the reported region,
FP is the number of negative sequences with a best match in the reported region,
TN is the number of negative sequences without a best match in the reported region, and
FN is the number of positive sequences without a best match in the reported region.
MCC ranges from -1 to +1, where a +1 result indicates that the occurrence
of the best match to the motif in the reported region perfectly discriminates positive
sequences from negative sequences.
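For reference, the standard MCC definition can be computed from the four counts as in the Python sketch below. The function name is hypothetical, and returning 0 when the denominator is zero is a common convention assumed here rather than documented CentriMo behavior.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Undefined when a whole row or column of the table is empty;
        # 0 is the conventional fallback (an assumption in this sketch).
        return 0.0
    return (tp * tn - fp * fn) / denom
```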
The number of (positive) sequences whose best match to the motif is in the reported region. Note: This number may be less than the number of
(positive) sequences that have a best match in the region.
The reason for this is that a sequence may have many matches that score
equally best. If n matches have the best score in a sequence, 1/n is added to the
appropriate bin for each match.
The number of negative sequences where the best match to
the motif falls in the reported region. This value is rounded but the
underlying value may contain fractional counts.
Note: This number may be less than the number of negative sequences that
have a best match in the region. The reason for this is that a sequence may
have many matches that score equally best. If n matches have the
best score in a sequence, 1/n is added to the appropriate bin
for each match.
The number of negative sequences containing a match to the motif
above the score threshold. When score optimization is enabled the
score threshold may be raised higher than the minimum.
The maximum probability that the best match occurs at any single sequence position.
If the smoothing window size ("Window:", to right of graph) is set to "1", then this value is
the maximum value of the match-probability curve.
Concentration is defined as the total probability of all the positions in the central
region whose width is the same as the size of the "smoothing window".
You can change the size of the smoothing window using the "Window:"
input field in the Graph options section, above.
(A value of "NaN" indicates that the smoothing window size is too
large for the motif.)
The "concentration" of the motif sites in the central window
can sometimes be more informative than the E-value. For example,
in some ChIP-seq datasets, motifs for cofactors show more significant
enrichment overall (smaller E-value), but are less
concentrated in a small (20 to 50bp) window than the motif for
the ChIP-ed transcription factor. In such cases, you may wish
to sort the motifs by Concentration, using the Sort
menu, below.
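A rough Python sketch of the Concentration calculation is given below. It assumes that positions are measured relative to the sequence center and that a position belongs to the central window when its distance from the center is less than half the window size; the exact boundary handling and the "NaN" case are not modeled, and the function name is hypothetical.

```python
def concentration(probabilities, window):
    """Sum the match probabilities of the positions inside the central window.

    probabilities: (position, probability) pairs, with position 0 at the
    sequence center (an assumption of this sketch).
    """
    half = window / 2
    return sum(p for pos, p in probabilities if abs(pos) < half)
```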
The text box lists the sequence identifiers for sequences that
have at least one of their best matches in the most significant region of all the
selected motifs.
The "Intersection" subheading gives the number of identifiers in the
text box and their percentage out of the total number of input sequences.
The "Union" subheading gives the number of
sequences that have at least one of their best matches in the most
significant region of any of the selected motifs, and their
percentage out of the total number of input sequences.
Note that the number of sequences with a match to a given motif in
its best region may be larger than the value of "Region Matches". This is because
a sequence may have multiple equally best matches and in that case a
fractional match count is assigned to each of them when "Region Matches" is computed.
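The "Intersection" and "Union" summaries amount to simple set operations, which can be sketched in Python as follows. The function name and the input layout (a dict mapping each selected motif to the identifiers of sequences with a best match in its most significant region) are assumptions made for this example.

```python
def summarize_sequence_sets(ids_by_motif, total_sequences):
    """Return intersection and union summaries for the selected motifs."""
    sets = [set(ids) for ids in ids_by_motif.values()]
    if not sets:
        common, combined = set(), set()
    else:
        common = set.intersection(*sets)   # in the best region of all motifs
        combined = set.union(*sets)        # in the best region of any motif

    def percent(group):
        return 100.0 * len(group) / total_sequences

    return {
        "intersection_ids": sorted(common),
        "intersection_percent": percent(common),
        "union_count": len(combined),
        "union_percent": percent(combined),
    }
```

Each sequence is counted once per motif here regardless of how many equally best matches it has, which is why these counts can exceed the fractional "Region Matches" value.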
Sequence position where the (unsmoothed) match-probability curve for this motif
attains its maximum. Set the smoothing window size ("Window:", to right of graph) to
"1" to see the unsmoothed match probability curve.
If you use CentriMo in your research, please cite the following paper:
Timothy L. Bailey and Philip Machanick,
"Inferring direct DNA binding from ChIP-seq",
Nucleic Acids Research, 40:e128, 2012.