blockify package¶
Submodules¶
blockify.algorithms module¶

class blockify.algorithms.Algorithm(p0=0.05, gamma=None, ncp_prior=None)[source]¶
Bases: object
Base class for Bayesian blocks algorithm functions.
Derived classes should overload the following method:
algorithm(self, **kwargs)
: Compute the algorithm given a set of named arguments. Arguments accepted by algorithm must be among [T_k, N_k, a_k, b_k, c_k] (see [1] for details on the meaning of these parameters).
Additionally, other methods may be overloaded as well:
__init__(self, **kwargs)
: Initialize the algorithm function with any parameters beyond the normal p0 and gamma.
validate_input(self, t, x, sigma)
: Enable specific checks of the input data (t, x, sigma) to be performed prior to the fit.
compute_ncp_prior(self, N)
: If ncp_prior is not defined explicitly, this function is called to define it before fitting. This may be calculated from gamma, p0, or whatever method you choose.
p0_prior(self, N)
: Specify the form of the prior given the false alarm probability p0 (see [1] for details).
For examples of implemented algorithm functions, see Events, RegularEvents, and PointMeasures.
References
[1] (1,2) Scargle, J et al. (2012) http://adsabs.harvard.edu/abs/2012arXiv1207.5578S

p0_prior(N)[source]¶
Empirical prior, parametrized by the false alarm probability p0.
See eq. 21 in Scargle (2012). Note that there was an error in this equation in the original Scargle paper (the “log” was missing). The following corrected form is taken from http://arxiv.org/abs/1304.2818
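As a quick illustration, the corrected form of eq. 21 can be sketched as a standalone function (a sketch only; in the package this is a method on the class):

```python
import math

def p0_prior(N, p0=0.05):
    # Corrected empirical prior from eq. 21 of Scargle (2012),
    # with the missing "log" restored (see arXiv:1304.2818).
    # N is the number of data points; p0 the false alarm probability.
    return 4 - math.log(73.53 * p0 * N ** -0.478)

# The prior (change-point penalty) grows slowly with N.
penalty = p0_prior(100)
```

A larger value of this prior penalizes each additional change point more strongly, yielding fewer blocks.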

static validate_input(t, x=None, sigma=None)[source]¶
Validate inputs to the model.
 Parameters
t (array_like) – times of observations
x (array_like (optional)) – values observed at each time
sigma (float or array_like (optional)) – errors in values x
 Returns
t, x, sigma – validated and perhaps modified versions of inputs
 Return type
array_like, float or None
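The kind of checks and normalizations performed here can be sketched as follows (illustrative only, not the library's actual implementation; the duplicate-handling behavior is an assumption):

```python
import numpy as np

def validate_input(t, x=None, sigma=None):
    # Coerce times to a 1-D float array and sort/deduplicate them.
    t = np.asarray(t, dtype=float)
    if t.ndim != 1:
        raise ValueError("t must be one dimensional")
    unq_t, unq_ind = np.unique(t, return_index=True)
    if x is None:
        # Unbinned events: count the multiplicity of each unique time.
        x = np.bincount(np.searchsorted(unq_t, t)).astype(float)
    else:
        x = np.asarray(x, dtype=float)[unq_ind]
    if sigma is None:
        sigma = 1.0
    return unq_t, x, sigma

t, x, sigma = validate_input([3.0, 1.0, 2.0, 2.0])
```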

class blockify.algorithms.BayesianBlocks(p0=0.05, gamma=None, ncp_prior=None)[source]¶
Bases: blockify.algorithms.OptimalPartitioning
Bayesian blocks algorithm for binned or unbinned events.
 Parameters
p0 (float (optional)) – False alarm probability, used to compute the prior on \(N_{\rm blocks}\) (see eq. 21 of Scargle 2012). For the Events type data, p0 does not seem to be an accurate representation of the actual false alarm probability. If you are using this algorithm function for a triggering type condition, it is recommended that you run statistical trials on signal-free noise to determine an appropriate value of gamma or ncp_prior to use for a desired false alarm rate.
gamma (float (optional)) – If specified, then use this gamma to compute the general prior form, \(p \sim {\tt gamma}^{N_{\rm blocks}}\). If gamma is specified, p0 is ignored.
ncp_prior (float (optional)) – If specified, use the value of ncp_prior to compute the prior as above, using the definition \({\tt ncp\_prior} = \ln({\tt gamma})\). If ncp_prior is specified, gamma and p0 are ignored.
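The precedence among the three prior parameters (ncp_prior over gamma over p0) could be sketched like this (a hypothetical helper, not part of the package API; the ln(gamma) relation follows the definition stated above):

```python
import math

def resolve_ncp_prior(N, p0=0.05, gamma=None, ncp_prior=None):
    # ncp_prior wins if given explicitly; otherwise derive it from
    # gamma via ncp_prior = ln(gamma); otherwise fall back to the
    # empirical p0-based prior (eq. 21 of Scargle 2012).
    if ncp_prior is not None:
        return ncp_prior
    if gamma is not None:
        return math.log(gamma)
    return 4 - math.log(73.53 * p0 * N ** -0.478)
```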

class blockify.algorithms.OptimalPartitioning(p0=0.05, gamma=None, ncp_prior=None)[source]¶
Bases: blockify.algorithms.Algorithm
Bayesian blocks algorithm for regular events.
This is for data which has a fundamental “tick” length, so that all measured values are multiples of this tick length. In each tick, there are either zero or one counts.
 Parameters
dt (float) – tick rate for data
p0 (float (optional)) – False alarm probability, used to compute the prior on \(N_{\rm blocks}\) (see eq. 21 of Scargle 2012). If gamma is specified, p0 is ignored.
ncp_prior (float (optional)) – If specified, use the value of ncp_prior to compute the prior as above, using the definition \({\tt ncp\_prior} = \ln({\tt gamma})\). If ncp_prior is specified, gamma and p0 are ignored.

segment(t, x=None, sigma=None)[source]¶
Fit the Bayesian Blocks model given the specified algorithm function.
 Parameters
t (array_like) – data times (one dimensional, length N)
x (array_like (optional)) – data values
sigma (array_like or float (optional)) – data errors
 Returns
edges – array containing the (M+1) edges defining the M optimal bins
 Return type
ndarray
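The returned edges partition the time axis: M+1 edges define M blocks, and every observation falls into exactly one. A sketch of assigning points to blocks (the edge values here are hypothetical, not output of the package):

```python
import bisect

# Hypothetical output of segment(): 4 edges defining 3 blocks.
edges = [0.0, 2.5, 6.0, 10.0]
times = [0.5, 1.0, 3.0, 7.5, 9.9]

# Block index for each time: locate t among the interior edges,
# so points at or beyond the outer edges clamp to the end blocks.
blocks = [bisect.bisect_right(edges, t, 1, len(edges) - 1) - 1 for t in times]
```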

class blockify.algorithms.PELT(p0=0.05, gamma=None, ncp_prior=None)[source]¶
Bases: blockify.algorithms.OptimalPartitioning
Bayesian blocks algorithm for point measures.
 Parameters
p0 (float (optional)) – False alarm probability, used to compute the prior on \(N_{\rm blocks}\) (see eq. 21 of Scargle 2012). If gamma is specified, p0 is ignored.
ncp_prior (float (optional)) – If specified, use the value of ncp_prior to compute the prior as above, using the definition \({\tt ncp\_prior} = \ln({\tt gamma})\). If ncp_prior is specified, gamma and p0 are ignored.

segment(t, x=None, sigma=None)[source]¶
Fit the Bayesian Blocks model given the specified algorithm function.
 Parameters
t (array_like) – data times (one dimensional, length N)
x (array_like (optional)) – data values
sigma (array_like or float (optional)) – data errors
 Returns
edges – array containing the (M+1) edges defining the M optimal bins
 Return type
ndarray
blockify.annotation module¶

blockify.annotation.annotate(input_file, regions_bed, background_file, measure='enrichment', intermediate=None, alpha=None, correction=None, p_value=None, distance=None, min_size=None, max_size=None, pseudocount=1, tight=False, summit=False)[source]¶
Core annotation and peak calling method.
 Parameters
input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data
regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are annotating/calling peaks
background_file (BedTool object) – BedTool object (instantiated from pybedtools) used to parameterize the background model
measure (str) – Either “enrichment” or “depletion” to indicate which direction of effect to test for
intermediate (bool) – Whether or not to return intermediate calculations during peak calling
alpha (float or None) – Multiple-hypothesis-adjusted threshold for calling significance
correction (str or None) – Multiple hypothesis correction to perform (see statsmodels.stats.multitest for valid values)
p_value (float or None) – Straight p-value cutoff (unadjusted) for calling significance
distance (int or None) – Merge significant features within specified distance cutoff
min_size (int or None) – Minimum size cutoff for peaks
max_size (int or None) – Maximum size cutoff for peaks
pseudocount (float) – Pseudocount added to adjust background model
tight (bool) – Whether to tighten the regions in regions_bed
summit (bool) – Whether to return peak summits instead of full peaks
 Returns
out_bed (BedTool object) – Set of peaks in BED6 format
df (pandas DataFrame or None) – If intermediate is specified, a DataFrame containing intermediate calculations during peak calling
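The interplay between the alpha (adjusted) and p_value (raw) cutoffs can be sketched as follows. This is illustrative only: the helper name is hypothetical, and the package delegates the actual correction to statsmodels.stats.multitest; only a Bonferroni correction is shown here.

```python
def significant(pvals, alpha=None, correction="bonferroni", p_value=None):
    # If a raw p-value cutoff is given, use it directly; otherwise
    # apply a multiple-testing correction and compare against alpha.
    if p_value is not None:
        return [p <= p_value for p in pvals]
    if correction == "bonferroni":
        adjusted = [min(1.0, p * len(pvals)) for p in pvals]
    else:
        raise ValueError("unsupported correction in this sketch")
    return [p <= alpha for p in adjusted]

calls = significant([0.001, 0.02, 0.5], alpha=0.05)
```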

blockify.annotation.annotate_from_command_line(args)[source]¶
Wrapper function for the command line function blockify call
 Parameters
args (argparse.Namespace object) – Input from command line
 Returns
out_bed (BedTool object) – Set of peaks in BED6 format
df (pandas DataFrame or None) – If intermediate is specified, a DataFrame containing intermediate calculations during peak calling

blockify.annotation.getPeakSummits(df, metric='pValue')[source]¶
From a list of peaks, get a set of peak summits.
 Parameters
df (pandas DataFrame) – Set of peaks from annotate as a DataFrame
metric (str) – Metric to use when filtering for summits. One of “pValue” or “density”
 Returns
summits – Set of peak summits as a DataFrame
 Return type
pandas DataFrame
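Summit selection with the "pValue" metric can be sketched with a small pandas example (the column names below are assumptions for illustration, not the package's actual schema):

```python
import pandas as pd

# Hypothetical peak table: two blocks for peak1, one for peak2.
peaks = pd.DataFrame({
    "name":   ["peak1", "peak1", "peak2"],
    "start":  [100, 150, 400],
    "end":    [150, 200, 450],
    "pValue": [1e-8, 1e-3, 1e-5],
})

# Summit per peak: the row with the smallest p-value.
summits = peaks.loc[peaks.groupby("name")["pValue"].idxmin()]
```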

blockify.annotation.parcelConsecutiveBlocks(df)[source]¶
Concatenates consecutive blocks into a DataFrame. If there are multiple non-contiguous sets of consecutive blocks, creates one DataFrame per set.
 Parameters
df (pandas DataFrame) – Input set of blocks as a DataFrame
 Returns
outlist – List of DataFrames, each of which is a set of consecutive blocks
 Return type
list of pandas DataFrames
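The grouping logic can be sketched on plain (start, end) tuples (an illustrative sketch, assuming "consecutive" means each block begins exactly where the previous one ended):

```python
def parcel_consecutive(blocks):
    # Split a sorted list of (start, end) blocks into runs of
    # blocks that abut one another; each run becomes one parcel.
    runs = []
    for block in blocks:
        if runs and runs[-1][-1][1] == block[0]:
            runs[-1].append(block)
        else:
            runs.append([block])
    return runs

runs = parcel_consecutive([(0, 10), (10, 25), (40, 50), (50, 60)])
```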

blockify.annotation.sizeFilter(bed, min_size, max_size)[source]¶
Filter peaks by size.
 Parameters
bed (BedTool object) – Input data file
min_size (int) – Lower bound for peak size
max_size (int) – Upper bound for peak size
 Returns
filtered_peaks – Peaks after size selection
 Return type
BedTool object

blockify.annotation.tighten(data)[source]¶
Tightens block boundaries in a BedTool file. This function modifies block boundaries so that they coincide with data points.
 Parameters
data (BedTool object) – Input file of block boundaries
 Returns
refined – BedTool of tightened blocks
 Return type
BedTool object
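Boundary tightening for a single block can be sketched as snapping each edge to the outermost data point the block contains (an illustrative sketch, not the BedTool-based implementation):

```python
import bisect

def tighten(block, points):
    # Shrink a (start, end) block so its boundaries coincide with
    # the first and last data points inside it; points must be sorted.
    start, end = block
    lo = bisect.bisect_left(points, start)
    hi = bisect.bisect_right(points, end) - 1
    if lo > hi:
        return None  # block contains no data points
    return (points[lo], points[hi])

tight = tighten((0, 100), [5, 17, 62, 150])
```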

blockify.annotation.validateAnnotationArguments(input_file, regions_bed, background_file, measure, alpha, correction, p_value, distance, min_size, max_size, pseudocount)[source]¶
Validates parameters passed via the command line.
 Parameters
input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data
regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are annotating/calling peaks
background_file (BedTool object) – BedTool object (instantiated from pybedtools) used to parameterize the background model
measure (str) – Either “enrichment” or “depletion” to indicate which direction of effect to test for
alpha (float or None) – Multiple-hypothesis-adjusted threshold for calling significance
correction (str or None) – Multiple hypothesis correction to perform (see statsmodels.stats.multitest for valid values)
p_value (float or None) – Straight p-value cutoff (unadjusted) for calling significance
distance (int or None) – Merge significant features within specified distance cutoff
min_size (int or None) – Minimum size cutoff for peaks
max_size (int or None) – Maximum size cutoff for peaks
pseudocount (float) – Pseudocount added to adjust background model
 Returns
None
 Return type
None
blockify.downsampling module¶

blockify.downsampling.downsample(input_file, n, seed=None, naive=False)[source]¶
Core downsampling method.
 Parameters
input_file (pandas DataFrame) – Input data (e.g. BED, qBED, CCF) as a pandas DataFrame
n (int) – Number of entries to sample
seed (int) – Seed for random number generator
naive (bool) – Choose whether to sample each entry with equal probability (True) or weighted by the value in the fourth column (if supplied)
 Returns
downsampled_file – Input file after downsampling
 Return type
BedTool object
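The naive/weighted distinction can be sketched on plain tuples (an illustrative sketch; whether the package samples with or without replacement is an assumption not confirmed by this docstring):

```python
import random

def downsample(rows, n, seed=None, naive=False):
    # rows: (chrom, start, end, value) tuples. With naive=True every
    # row is equally likely; otherwise rows are weighted by the value
    # in the fourth column. This sketch samples with replacement.
    rng = random.Random(seed)
    weights = None if naive else [r[3] for r in rows]
    return rng.choices(rows, weights=weights, k=n)

rows = [("chr1", 0, 10, 1), ("chr1", 10, 20, 99)]
sample = downsample(rows, 5, seed=0)
```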
blockify.normalization module¶

blockify.normalization.normalize(input_file, regions_bed, libraryFactor, lengthFactor)[source]¶
Core normalization method.
 Parameters
input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data
regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are normalizing input_file
libraryFactor (float) – Scalar to normalize by input_file’s library size.
lengthFactor (float or None) – Scalar to normalize by each block’s length. If None, no length normalization is performed.
 Returns
bedgraph – A BedTool object in bedGraph format, using the intervals supplied in regions_bed
 Return type
BedTool
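The docstring does not give the exact formula, so the following is only a plausible sketch of how the two scaling factors might combine: library-size scaling divides each block's count by libraryFactor, and length scaling additionally divides by the block's length (the role of lengthFactor as a multiplier on the length term is an assumption for illustration).

```python
def normalize_block(count, length, libraryFactor, lengthFactor=None):
    # Scale by library size; optionally also by block length.
    # ASSUMPTION: lengthFactor multiplies the length term; the
    # package's actual formula may differ.
    score = count / libraryFactor
    if lengthFactor is not None:
        score /= length * lengthFactor
    return score

score = normalize_block(200, 100, libraryFactor=2.0)
```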

blockify.normalization.normalize_from_command_line(args)[source]¶
Wrapper function for the command line function blockify normalize
 Parameters
args (argparse.Namespace object) – Input from command line
 Returns
bedgraph – Normalized command line data in bedGraph format
 Return type
BedTool

blockify.normalization.validateNormalizationArguments(input_file, regions_bed, libraryFactor, lengthFactor)[source]¶
Validates parameters passed via the command line.
 Parameters
input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data
regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are normalizing input_file
libraryFactor (float) – Scalar to normalize by input_file’s library size.
lengthFactor (float or None) – Scalar to normalize by each block’s length. If None, no length normalization is performed.
 Returns
None
 Return type
None
blockify.parsers module¶
blockify.segmentation module¶

class blockify.segmentation.SegmentationRecord[source]¶
Bases: object
A class to store a single Bayesian block genomic segmentation.

blockify.segmentation.blocksToDF(chrom, ranges)[source]¶
Convert a set of contiguous Bayesian blocks to pandas DataFrame format.
 Parameters
chrom (str) – String specifying the chromosome
ranges (array) – Array whose entries specify the coordinates of block boundaries
 Returns
output
 Return type
pandas DataFrame
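Since adjacent entries in ranges are block boundaries, N+1 boundary coordinates yield N blocks. The conversion can be sketched without pandas (a hypothetical helper, returning plain tuples instead of a DataFrame):

```python
def blocks_to_rows(chrom, ranges):
    # Pair each boundary with the next one to form (chrom, start, end)
    # rows, one per block.
    return [(chrom, start, end) for start, end in zip(ranges[:-1], ranges[1:])]

rows = blocks_to_rows("chr1", [0, 100, 250, 400])
```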

blockify.segmentation.segment(input_file, method, p0=None, prior=None)[source]¶
Core segmentation method.
 Parameters
input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data
method (str) – String specifying whether to use OP or PELT for the segmentation
p0 (float, optional) – Float used to parameterize the prior on the total number of blocks; must be in the interval [0, 1]. Default: 0.05
prior (float, optional) – Explicit value for the prior on the total number of blocks (specifying this is not recommended)
 Returns
segmentation – A SegmentationRecord from segmenting the provided data
 Return type
SegmentationRecord
blockify.utilities module¶

blockify.utilities.file_len(fname)[source]¶
Fast method for getting the number of lines in a file. For BED files, much faster than calling len() on a BedTool object. From https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python
 Parameters
fname (str) – Input (text) filename
 Returns
length – Length of fname
 Return type
int
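The linked Stack Overflow recipe iterates over the file object instead of loading it into memory, which is what makes it cheap for large BED files. A sketch along those lines:

```python
import os
import tempfile

def file_len(fname):
    # Enumerate lines lazily; i holds the index of the last line,
    # so i + 1 is the line count (0 for an empty file).
    i = -1
    with open(fname) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

# Demonstrate on a small temporary two-line BED-like file.
with tempfile.NamedTemporaryFile("w", suffix=".bed", delete=False) as tmp:
    tmp.write("chr1\t0\t100\nchr1\t100\t250\n")
length = file_len(tmp.name)
os.remove(tmp.name)
```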

blockify.utilities.getChromosomesInDF(df)[source]¶
Helper function to get a list of unique chromosomes in a pandas DataFrame.
 Parameters
df (pandas DataFrame) – Input genomic data (e.g. BED, qBED, CCF) as a DataFrame
 Returns
chroms – List of chromosomes
 Return type
list