blockify package¶
Submodules¶
blockify.algorithms module¶
class blockify.algorithms.Algorithm(p0=0.05, gamma=None, ncp_prior=None)[source]¶
Bases: object
Base class for Bayesian blocks algorithm functions.
Derived classes should overload the following method:
algorithm(self, **kwargs): Compute the algorithm given a set of named arguments. Arguments accepted by algorithm must be among [T_k, N_k, a_k, b_k, c_k] (see [1] for details on the meaning of these parameters).
Additionally, other methods may be overloaded as well:
__init__(self, **kwargs): Initialize the algorithm function with any parameters beyond the normal p0 and gamma.
validate_input(self, t, x, sigma): Enable specific checks of the input data (t, x, sigma) to be performed prior to the fit.
compute_ncp_prior(self, N): If ncp_prior is not defined explicitly, this function is called in order to define it before fitting. This may be calculated from gamma, p0, or whatever method you choose.
p0_prior(self, N): Specify the form of the prior given the false-alarm probability p0 (see [1] for details).
For examples of implemented algorithm functions, see BayesianBlocks, OptimalPartitioning, and PELT.
References
[1] Scargle, J. et al. (2012) http://adsabs.harvard.edu/abs/2012arXiv1207.5578S
p0_prior(N)[source]¶
Empirical prior, parametrized by the false alarm probability p0.
See eq. 21 in Scargle (2012). Note that there was an error in this equation in the original Scargle paper (the “log” was missing). The following corrected form is taken from http://arxiv.org/abs/1304.2818
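That corrected form (the same one used in astropy’s Bayesian blocks implementation) is
\({\tt ncp\_prior} = 4 - \ln(73.53 \, p_0 \, N^{-0.478})\)
where \(p_0\) is the false alarm probability and \(N\) is the number of data points.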
static validate_input(t, x=None, sigma=None)[source]¶
Validate inputs to the model.
- Parameters
t (array_like) – times of observations
x (array_like (optional)) – values observed at each time
sigma (float or array_like (optional)) – errors in values x
- Returns
t, x, sigma – validated and perhaps modified versions of inputs
- Return type
array_like, float or None
class blockify.algorithms.BayesianBlocks(p0=0.05, gamma=None, ncp_prior=None)[source]¶
Bases: blockify.algorithms.OptimalPartitioning
Bayesian blocks algorithm for binned or unbinned events
- Parameters
p0 (float (optional)) – False alarm probability, used to compute the prior on \(N_{\rm blocks}\) (see eq. 21 of Scargle 2012). For event-type data, p0 does not seem to be an accurate representation of the actual false alarm probability. If you are using this algorithm function for a triggering-type condition, it is recommended that you run statistical trials on signal-free noise to determine an appropriate value of gamma or ncp_prior to use for a desired false alarm rate.
gamma (float (optional)) – If specified, then use this gamma to compute the general prior form, \(p \sim {\tt gamma}^{N_{\rm blocks}}\). If gamma is specified, p0 is ignored.
ncp_prior (float (optional)) – If specified, use the value of ncp_prior to compute the prior as above, using the definition \({\tt ncp\_prior} = -\ln({\tt gamma})\). If ncp_prior is specified, gamma and p0 are ignored.
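Example (a minimal usage sketch with synthetic event times; segment() is inherited from OptimalPartitioning):

    import numpy as np
    from blockify.algorithms import BayesianBlocks

    # Synthetic unbinned event times
    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0, 100, size=500))

    # segment() returns the (M+1) edges defining the M optimal blocks
    edges = BayesianBlocks(p0=0.05).segment(t)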
class blockify.algorithms.OptimalPartitioning(p0=0.05, gamma=None, ncp_prior=None)[source]¶
Bases: blockify.algorithms.Algorithm
Bayesian blocks algorithm for regular events
This is for data which has a fundamental “tick” length, so that all measured values are multiples of this tick length. In each tick, there are either zero or one counts.
- Parameters
dt (float) – tick rate for data
p0 (float (optional)) – False alarm probability, used to compute the prior on \(N_{\rm blocks}\) (see eq. 21 of Scargle 2012). If gamma is specified, p0 is ignored.
ncp_prior (float (optional)) – If specified, use the value of ncp_prior to compute the prior as above, using the definition \({\tt ncp\_prior} = -\ln({\tt gamma})\). If ncp_prior is specified, gamma and p0 are ignored.
segment(t, x=None, sigma=None)[source]¶
Fit the Bayesian Blocks model given the specified algorithm function.
- Parameters
t (array_like) – data times (one dimensional, length N)
x (array_like (optional)) – data values
sigma (array_like or float (optional)) – data errors
- Returns
edges – array containing the (M+1) edges defining the M optimal bins
- Return type
ndarray
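Example (a sketch segmenting binned data by passing per-bin values x alongside positions t; the data here are synthetic):

    import numpy as np
    from blockify.algorithms import OptimalPartitioning

    t = np.arange(100)                            # bin positions
    x = np.random.default_rng(1).poisson(5, 100)  # synthetic counts per bin

    op = OptimalPartitioning(p0=0.05)
    edges = op.segment(t, x)  # (M+1) edges defining the M optimal bins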
class blockify.algorithms.PELT(p0=0.05, gamma=None, ncp_prior=None)[source]¶
Bases: blockify.algorithms.OptimalPartitioning
Bayesian blocks algorithm for point measures
- Parameters
p0 (float (optional)) – False alarm probability, used to compute the prior on \(N_{\rm blocks}\) (see eq. 21 of Scargle 2012). If gamma is specified, p0 is ignored.
ncp_prior (float (optional)) – If specified, use the value of ncp_prior to compute the prior as above, using the definition \({\tt ncp\_prior} = -\ln({\tt gamma})\). If ncp_prior is specified, gamma and p0 are ignored.
segment(t, x=None, sigma=None)[source]¶
Fit the Bayesian Blocks model given the specified algorithm function.
- Parameters
t (array_like) – data times (one dimensional, length N)
x (array_like (optional)) – data values
sigma (array_like or float (optional)) – data errors
- Returns
edges – array containing the (M+1) edges defining the M optimal bins
- Return type
ndarray
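Since PELT subclasses OptimalPartitioning and exposes the same segment() interface, it can be used as a drop-in replacement; a sketch with synthetic data:

    import numpy as np
    from blockify.algorithms import PELT

    t = np.arange(100)
    x = np.random.default_rng(1).poisson(5, 100)  # synthetic counts
    edges = PELT(p0=0.05).segment(t, x)  # same call signature as OptimalPartitioning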
blockify.annotation module¶
blockify.annotation.annotate(input_file, regions_bed, background_file, measure='enrichment', intermediate=None, alpha=None, correction=None, p_value=None, distance=None, min_size=None, max_size=None, pseudocount=1, tight=False, summit=False)[source]¶
Core annotation and peak calling method.
- Parameters
input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data
regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are annotating/calling peaks
background_file (BedTool object) – BedTool object (instantiated from pybedtools) used to parameterize the background model
measure (str) – Either “enrichment” or “depletion” to indicate which direction of effect to test for
intermediate (bool) – Whether or not to return intermediate calculations during peak calling
alpha (float or None) – Multiple-hypothesis adjusted threshold for calling significance
correction (str or None) – Multiple hypothesis correction to perform (see statsmodels.stats.multitest for valid values)
p_value (float or None) – Straight p-value cutoff (unadjusted) for calling significance
distance (int or None) – Merge significant features within specified distance cutoff
min_size (int or None) – Minimum size cutoff for peaks
max_size (int or None) – Maximum size cutoff for peaks
pseudocount (float) – Pseudocount added to adjust background model
tight (bool) – Whether to tighten the regions in regions_bed
summit (bool) – Whether to return peak summits instead of full peaks
- Returns
out_bed (BedTool object) – Set of peaks in BED6 format
df (pandas DataFrame or None) – If intermediate is specified, DataFrame containing intermediate calculations during peak calling
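Example (a sketch with hypothetical file names; any pybedtools-readable inputs work, "fdr_bh" is one of the statsmodels.stats.multitest method names, and the two documented return values are unpacked as a tuple):

    import pybedtools
    from blockify.annotation import annotate

    input_file = pybedtools.BedTool("experiment.qbed")  # hypothetical paths
    regions = pybedtools.BedTool("blocks.bed")
    background = pybedtools.BedTool("background.qbed")

    peaks, df = annotate(input_file, regions, background,
                         measure="enrichment", alpha=0.05,
                         correction="fdr_bh",  # a statsmodels.stats.multitest method
                         intermediate=True)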
blockify.annotation.annotate_from_command_line(args)[source]¶
Wrapper function for the command line function blockify call
- Parameters
args (argparse.Namespace object) – Input from command line
- Returns
out_bed (BedTool object) – Set of peaks in BED6 format
df (pandas DataFrame or None) – If intermediate is specified, DataFrame containing intermediate calculations during peak calling
blockify.annotation.getPeakSummits(df, metric='pValue')[source]¶
From a list of peaks, get a set of peak summits
- Parameters
df (pandas DataFrame) – Set of peaks from annotate as a DataFrame
metric (str) – Metric to use when filtering for summits. One of “pValue” or “density”
- Returns
summits – Set of peak summits as a DataFrame
- Return type
pandas DataFrame
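Example (a sketch chaining annotate and getPeakSummits; file names are hypothetical, and intermediate=True is needed so a DataFrame is returned):

    import pybedtools
    from blockify.annotation import annotate, getPeakSummits

    peaks, df = annotate(pybedtools.BedTool("experiment.qbed"),  # hypothetical paths
                         pybedtools.BedTool("blocks.bed"),
                         pybedtools.BedTool("background.qbed"),
                         alpha=0.05, correction="fdr_bh",
                         intermediate=True)
    summits = getPeakSummits(df, metric="pValue")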
blockify.annotation.parcelConsecutiveBlocks(df)[source]¶
Concatenates consecutive blocks into a DataFrame. If there are multiple non-contiguous sets of consecutive blocks, creates one DataFrame per set.
- Parameters
df (pandas DataFrame) – Input set of blocks as a DataFrame
- Returns
outlist – List of DataFrames, each of which is a set of consecutive blocks
- Return type
list of pandas DataFrames
blockify.annotation.sizeFilter(bed, min_size, max_size)[source]¶
Filter peaks by size.
- Parameters
bed (BedTool object) – Input data file
min_size (int) – Lower bound for peak size
max_size (int) – Upper bound for peak size
- Returns
filtered_peaks – Peaks after size selection
- Return type
BedTool object
blockify.annotation.tighten(data)[source]¶
Tightens block boundaries in a BedTool file. This function modifies block boundaries so that they coincide with data points.
- Parameters
data (BedTool object) – Input file of block boundaries
- Returns
refined – BedTool of tightened blocks
- Return type
BedTool object
blockify.annotation.validateAnnotationArguments(input_file, regions_bed, background_file, measure, alpha, correction, p_value, distance, min_size, max_size, pseudocount)[source]¶
Validates parameters passed via the command line.
- Parameters
input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data
regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are annotating/calling peaks
background_file (BedTool object) – BedTool object (instantiated from pybedtools) used to parameterize the background model
measure (str) – Either “enrichment” or “depletion” to indicate which direction of effect to test for
alpha (float or None) – Multiple-hypothesis adjusted threshold for calling significance
correction (str or None) – Multiple hypothesis correction to perform (see statsmodels.stats.multitest for valid values)
p_value (float or None) – Straight p-value cutoff (unadjusted) for calling significance
distance (int or None) – Merge significant features within specified distance cutoff
min_size (int or None) – Minimum size cutoff for peaks
max_size (int or None) – Maximum size cutoff for peaks
pseudocount (float) – Pseudocount added to adjust background model
- Returns
None
- Return type
None
blockify.downsampling module¶
blockify.downsampling.downsample(input_file, n, seed=None, naive=False)[source]¶
Core downsampling method.
- Parameters
input_file (pandas DataFrame) – Input data (e.g. BED, qBED, CCF) as a pandas DataFrame
n (int) – Number of entries to sample
seed (int) – Seed for random number generator
naive (bool) – Choose whether to sample each entry with equal probability (True) or weighted by the value in the fourth column (if supplied)
- Returns
downsampled_file – Input file after downsampling
- Return type
BedTool object
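Example (a sketch with a hypothetical qBED-style file of four columns: chrom, start, end, value; with the default naive=False, the fourth column weights the sampling):

    import pandas as pd
    from blockify.downsampling import downsample

    # Hypothetical tab-separated qBED file: chrom, start, end, value
    df = pd.read_csv("experiment.qbed", sep="\t", header=None)

    # Keep 10,000 entries, sampled proportionally to the fourth column
    subsampled = downsample(df, n=10000, seed=42)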
blockify.normalization module¶
blockify.normalization.normalize(input_file, regions_bed, libraryFactor, lengthFactor)[source]¶
Core normalization method.
- Parameters
input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data
regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are normalizing input_file
libraryFactor (float) – Scalar to normalize by input_file’s library size.
lengthFactor (float or None) – Scalar to normalize by each block’s length. If None, no length normalization is performed.
- Returns
bedgraph – A BedTool object in bedGraph format, using the intervals supplied in regions_bed
- Return type
BedTool
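Example (a sketch with hypothetical inputs; the scaling factors are illustrative choices, e.g. 1e6 for a per-million library scaling, not prescribed values):

    import pybedtools
    from blockify.normalization import normalize

    input_file = pybedtools.BedTool("experiment.qbed")  # hypothetical paths
    blocks = pybedtools.BedTool("blocks.bed")

    # Illustrative factors: scale by library size (per million) and block length (per kb)
    bedgraph = normalize(input_file, blocks, libraryFactor=1e6, lengthFactor=1e3)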
blockify.normalization.normalize_from_command_line(args)[source]¶
Wrapper function for the command line function blockify normalize
- Parameters
args (argparse.Namespace object) – Input from command line
- Returns
bedgraph – Normalized command line data in bedGraph format
- Return type
BedTool
blockify.normalization.validateNormalizationArguments(input_file, regions_bed, libraryFactor, lengthFactor)[source]¶
Validates parameters passed via the command line.
- Parameters
input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data
regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are normalizing input_file
libraryFactor (float) – Scalar to normalize by input_file’s library size.
lengthFactor (float or None) – Scalar to normalize by each block’s length. If None, no length normalization is performed.
- Returns
None
- Return type
None
blockify.parsers module¶
blockify.segmentation module¶
class blockify.segmentation.SegmentationRecord[source]¶
Bases: object
A class to store a single Bayesian block genomic segmentation.
blockify.segmentation.blocksToDF(chrom, ranges)[source]¶
Convert a set of contiguous Bayesian blocks to pandas DataFrame format.
- Parameters
chrom (str) – String specifying the chromosome
ranges (array) – Array whose entries specify the coordinates of block boundaries
- Returns
output
- Return type
pandas DataFrame
blockify.segmentation.segment(input_file, method, p0=None, prior=None)[source]¶
Core segmentation method.
- Parameters
input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data
method (str) – String specifying whether to use OP or PELT for the segmentation
p0 (float, optional) – Float used to parameterize the prior on the total number of blocks; must be in the interval [0, 1]. Default: 0.05
prior (float, optional) – Explicit value for the prior on the total number of blocks (specifying this is not recommended)
- Returns
segmentation – A SegmentationRecord from segmenting the provided data
- Return type
SegmentationRecord
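Example (a sketch with a hypothetical input file; the method string names one of the algorithms above):

    import pybedtools
    from blockify.segmentation import segment

    input_file = pybedtools.BedTool("experiment.qbed")   # hypothetical path
    segmentation = segment(input_file, "PELT", p0=0.05)  # method: "OP" or "PELT"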
blockify.utilities module¶
blockify.utilities.file_len(fname)[source]¶
Fast method for getting the number of lines in a file. For BED files, this is much faster than calling len() on a BedTool object. From https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python
- Parameters
fname (str) – Input (text) filename
- Returns
length – Length of fname
- Return type
int
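The linked approach amounts to roughly the following (a sketch, not necessarily the exact implementation):

    def file_len(fname):
        # Stream through the file once, keeping the index of the last line
        i = -1
        with open(fname) as f:
            for i, _ in enumerate(f):
                pass
        return i + 1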
blockify.utilities.getChromosomesInDF(df)[source]¶
Helper function to get a list of unique chromosomes in a pandas DataFrame.
- Parameters
df (pandas DataFrame) – Input genomic data (e.g. BED, qBED, CCF) as a DataFrame
- Returns
chroms – List of chromosomes
- Return type
list