blockify package

Submodules

blockify.algorithms module

class blockify.algorithms.Algorithm(p0=0.05, gamma=None, ncp_prior=None)[source]

Bases: object

Base class for Bayesian blocks algorithm functions

Derived classes should overload the following method:

fitness(self, **kwargs):

Compute the fitness function given a set of named arguments. Arguments accepted by fitness must be among [T_k, N_k, a_k, b_k, c_k] (see [1] for details on the meaning of these parameters).

Additionally, other methods may be overloaded as well:

__init__(self, **kwargs):

Initialize the algorithm function with any parameters beyond the normal p0 and gamma.

validate_input(self, t, x, sigma):

Enable specific checks of the input data (t, x, sigma) to be performed prior to the fit.

compute_ncp_prior(self, N):

If ncp_prior is not defined explicitly, this function is called in order to define it before fitting. This may be calculated from gamma, p0, or whatever method you choose.

p0_prior(self, N):

Specify the form of the prior given the false-alarm probability p0 (see [1] for details).

For examples of implemented algorithm functions, see BayesianBlocks, OptimalPartitioning, and PELT; a minimal custom subclass is sketched below.
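For instance, a new algorithm can be defined by subclassing Algorithm and overloading fitness. The sketch below is illustrative (the class name MyEvents is hypothetical) and assumes the Events-style block log-likelihood of eq. 19 in Scargle (2012):

    import numpy as np
    from blockify.algorithms import Algorithm

    class MyEvents(Algorithm):
        def fitness(self, N_k, T_k):
            # Events-style log-likelihood of each block, where N_k is the
            # number of events in block k and T_k is the block's width
            return N_k * (np.log(N_k) - np.log(T_k))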

References

[1]

Scargle, J. et al. (2012), "Studies in Astronomical Time Series Analysis. VI. Bayesian Block Representations", http://adsabs.harvard.edu/abs/2012arXiv1207.5578S

compute_ncp_prior(N)[source]

If ncp_prior is not explicitly defined, compute it from gamma or p0.

fitness(**kwargs)[source]
p0_prior(N)[source]

Empirical prior, parametrized by the false alarm probability p0. See eq. 21 in Scargle (2012).

Note that there was an error in this equation in the original Scargle paper (the “log” was missing). The following corrected form is taken from http://arxiv.org/abs/1304.2818
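As a reference, here is a standalone sketch of the corrected form, matching the expression used in astropy's Bayesian blocks implementation (the function name is illustrative):

    import numpy as np

    def empirical_ncp_prior(N, p0=0.05):
        # corrected eq. 21 of Scargle (2012):
        # ncp_prior = 4 - ln(73.53 * p0 * N**(-0.478))
        return 4 - np.log(73.53 * p0 * N ** -0.478)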

segment(t, x=None, sigma=None)[source]
static validate_input(t, x=None, sigma=None)[source]

Validate inputs to the model.

Parameters
  • t (array_like) – times of observations

  • x (array_like (optional)) – values observed at each time

  • sigma (float or array_like (optional)) – errors in values x

Returns

t, x, sigma – validated and perhaps modified versions of inputs

Return type

array_like, float or None

class blockify.algorithms.BayesianBlocks(p0=0.05, gamma=None, ncp_prior=None)[source]

Bases: blockify.algorithms.OptimalPartitioning

Bayesian blocks algorithm for binned or unbinned events

Parameters
  • p0 (float (optional)) – False alarm probability, used to compute the prior on \(N_{\rm blocks}\) (see eq. 21 of Scargle 2012). For Events-type data, p0 does not appear to be an accurate representation of the actual false alarm probability. If you are using this algorithm function for a triggering-type condition, it is recommended that you run statistical trials on signal-free noise to determine an appropriate value of gamma or ncp_prior for a desired false alarm rate.

  • gamma (float (optional)) – If specified, then use this gamma to compute the general prior form, \(p \sim {\tt gamma}^{N_{\rm blocks}}\). If gamma is specified, p0 is ignored.

  • ncp_prior (float (optional)) – If specified, use the value of ncp_prior to compute the prior as above, using the definition \({\tt ncp\_prior} = -\ln({\tt gamma})\). If ncp_prior is specified, gamma and p0 are ignored.

fitness(N_k, T_k)[source]
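A minimal usage sketch on synthetic event times:

    import numpy as np
    from blockify.algorithms import BayesianBlocks

    t = np.sort(np.random.default_rng(0).uniform(0, 100, size=500))  # synthetic data
    bb = BayesianBlocks(p0=0.01)
    # equivalently, pass ncp_prior=-np.log(gamma) for a chosen gamma
    edges = bb.segment(t)  # (M+1) edges defining the M optimal blocks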
class blockify.algorithms.OptimalPartitioning(p0=0.05, gamma=None, ncp_prior=None)[source]

Bases: blockify.algorithms.Algorithm

Bayesian blocks algorithm for regular events

This is for data which has a fundamental “tick” length, so that all measured values are multiples of this tick length. In each tick, there are either zero or one counts.

Parameters
  • p0 (float (optional)) – False alarm probability, used to compute the prior on \(N_{\rm blocks}\) (see eq. 21 of Scargle 2012). If gamma is specified, p0 is ignored.

  • gamma (float (optional)) – If specified, then use this gamma to compute the general prior form, \(p \sim {\tt gamma}^{N_{\rm blocks}}\). If gamma is specified, p0 is ignored.

  • ncp_prior (float (optional)) – If specified, use the value of ncp_prior to compute the prior as above, using the definition \({\tt ncp\_prior} = -\ln({\tt gamma})\). If ncp_prior is specified, gamma and p0 are ignored.

fitness(T_k, N_k)[source]
static get_change_points(N, edges, last)[source]
segment(t, x=None, sigma=None)[source]

Fit the Bayesian Blocks model given the specified algorithm function.

Parameters
  • t (array_like) – data times (one dimensional, length N)

  • x (array_like (optional)) – data values

  • sigma (array_like or float (optional)) – data errors

Returns

edges – array containing the (M+1) edges defining the M optimal bins

Return type

ndarray
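For example, on synthetic event times (the conversion of edges to per-block counts is illustrative):

    import numpy as np
    from blockify.algorithms import OptimalPartitioning

    t = np.sort(np.random.default_rng(1).uniform(0, 100, size=500))  # synthetic data
    op = OptimalPartitioning(p0=0.05)
    edges = op.segment(t)
    counts, _ = np.histogram(t, bins=edges)  # events per optimal block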

class blockify.algorithms.PELT(p0=0.05, gamma=None, ncp_prior=None)[source]

Bases: blockify.algorithms.OptimalPartitioning

Bayesian blocks algorithm for point measures

Parameters
  • p0 (float (optional)) – False alarm probability, used to compute the prior on \(N_{\rm blocks}\) (see eq. 21 of Scargle 2012). If gamma is specified, p0 is ignored.

  • gamma (float (optional)) – If specified, then use this gamma to compute the general prior form, \(p \sim {\tt gamma}^{N_{\rm blocks}}\). If gamma is specified, p0 is ignored.

  • ncp_prior (float (optional)) – If specified, use the value of ncp_prior to compute the prior as above, using the definition \({\tt ncp\_prior} = -\ln({\tt gamma})\). If ncp_prior is specified, gamma and p0 are ignored.

fitness(N_k, T_k)[source]
segment(t, x=None, sigma=None)[source]

Fit the Bayesian Blocks model given the specified algorithm function.

Parameters
  • t (array_like) – data times (one dimensional, length N)

  • x (array_like (optional)) – data values

  • sigma (array_like or float (optional)) – data errors

Returns

edges – array containing the (M+1) edges defining the M optimal bins

Return type

ndarray
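Usage mirrors OptimalPartitioning; a minimal sketch on synthetic event times:

    import numpy as np
    from blockify.algorithms import PELT

    t = np.sort(np.random.default_rng(2).uniform(0, 100, size=500))  # synthetic data
    pelt = PELT(p0=0.05)
    edges = pelt.segment(t)  # same edge-array convention as above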

blockify.annotation module

blockify.annotation.annotate(input_file, regions_bed, background_file, measure='enrichment', intermediate=None, alpha=None, correction=None, p_value=None, distance=None, min_size=None, max_size=None, pseudocount=1, tight=False, summit=False)[source]

Core annotation and peak calling method.

Parameters
  • input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data

  • regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are annotating/calling peaks

  • background_file (BedTool object) – BedTool object (instantiated from pybedtools) used to parameterize the background model

  • measure (str) – Either “enrichment” or “depletion” to indicate which direction of effect to test for

  • intermediate (bool) – Whether or not to return intermediate calculations during peak calling

  • alpha (float or None) – Multiple-hypothesis adjusted threshold for calling significance

  • correction (str or None) – Multiple hypothesis correction to perform (see statsmodels.stats.multitest for valid values)

  • p_value (float or None) – Straight p-value cutoff (unadjusted) for calling significance

  • distance (int or None) – Merge significant features within specified distance cutoff

  • min_size (int or None) – Minimum size cutoff for peaks

  • max_size (int or None) – Maximum size cutoff for peaks

  • pseudocount (float) – Pseudocount added to adjust background model

  • tight (bool) – Whether to tighten the regions in regions_bed

  • summit (bool) – Whether to return peak summits instead of full peaks

Returns

  • out_bed (BedTool object) – Set of peaks in BED6 format

  • df (pandas DataFrame or None) – If intermediate specified, DataFrame containing intermediate calculations during peak calling
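A usage sketch; the file names are hypothetical, and "fdr_bh" is one of the correction methods accepted by statsmodels.stats.multitest:

    from pybedtools import BedTool
    from blockify.annotation import annotate

    input_file = BedTool("experiment.qbed")    # hypothetical input data
    regions = BedTool("blocks.bed")            # hypothetical segmentation output
    background = BedTool("background.qbed")    # hypothetical background model

    # returns (peaks as a BedTool, intermediate DataFrame) per the docs above
    peaks, df = annotate(
        input_file, regions, background,
        measure="enrichment",
        intermediate=True,     # also return intermediate calculations
        alpha=0.05,
        correction="fdr_bh",
    )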

blockify.annotation.annotate_from_command_line(args)[source]

Wrapper function for the command line function blockify call

Parameters

args (argparse.Namespace object) – Input from command line

Returns

  • out_bed (BedTool object) – Set of peaks in BED6 format

  • df (pandas DataFrame or None) – If intermediate specified, DataFrame containing intermediate calculations during peak calling

blockify.annotation.getPeakSummits(df, metric='pValue')[source]

From a list of peaks, get a set of peak summits

Parameters
  • df (pandas DataFrame) – Set of peaks from annotate as a DataFrame

  • metric (str) – Metric to use when filtering for summits. One of “pValue” or “density”

Returns

summits – Set of peak summits as a DataFrame

Return type

pandas DataFrame

blockify.annotation.parcelConsecutiveBlocks(df)[source]

Concatenates consecutive blocks into a DataFrame. If there are multiple non-contiguous sets of consecutive blocks, creates one DataFrame per set.

Parameters

df (pandas DataFrame) – Input set of blocks as a DataFrame

Returns

outlist – List of DataFrames, each of which is a set of consecutive blocks

Return type

list of pandas DataFrames

blockify.annotation.sizeFilter(bed, min_size, max_size)[source]

Filter peaks by size.

Parameters
  • bed (BedTool object) – Input data file

  • min_size (int) – Lower bound for peak size

  • max_size (int) – Upper bound for peak size

Returns

filtered_peaks – Peaks after size selection

Return type

BedTool object

blockify.annotation.tighten(data)[source]

Tightens block boundaries in a BedTool file. This function modifies block boundaries so that they coincide with data points.

Parameters

data (BedTool object) – Input file of block boundaries

Returns

refined – BedTool of tightened blocks

Return type

BedTool object

blockify.annotation.validateAnnotationArguments(input_file, regions_bed, background_file, measure, alpha, correction, p_value, distance, min_size, max_size, pseudocount)[source]

Validates parameters passed via the command line.

Parameters
  • input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data

  • regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are annotating/calling peaks

  • background_file (BedTool object) – BedTool object (instantiated from pybedtools) used to parameterize the background model

  • measure (str) – Either “enrichment” or “depletion” to indicate which direction of effect to test for

  • alpha (float or None) – Multiple-hypothesis adjusted threshold for calling significance

  • correction (str or None) – Multiple hypothesis correction to perform (see statsmodels.stats.multitest for valid values)

  • p_value (float or None) – Straight p-value cutoff (unadjusted) for calling significance

  • distance (int or None) – Merge significant features within specified distance cutoff

  • min_size (int or None) – Minimum size cutoff for peaks

  • max_size (int or None) – Maximum size cutoff for peaks

  • pseudocount (float) – Pseudocount added to adjust background model

Returns

None

Return type

None

blockify.downsampling module

blockify.downsampling.downsample(input_file, n, seed=None, naive=False)[source]

Core downsampling method

Parameters
  • input_file (pandas DataFrame) – Input data (e.g. BED, qBED, CCF) as a pandas DataFrame

  • n (int) – Number of entries to sample

  • seed (int) – Seed for random number generator

  • naive (bool) – Whether to sample each entry with equal probability (True) or weighted by the value in the fourth column, if supplied (False)

Returns

downsampled_file – Input file after downsampling

Return type

BedTool object
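A sketch, assuming a tab-separated qBED-like input (the path is hypothetical):

    import pandas as pd
    from blockify.downsampling import downsample

    data = pd.read_csv("experiment.qbed", sep="\t", header=None)  # hypothetical path
    subset = downsample(data, n=10000, seed=42)  # weighted by column 4, if present
    subset_uniform = downsample(data, n=10000, seed=42, naive=True)  # equal probability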

blockify.downsampling.downsample_from_command_line(args)[source]

Wrapper function for the command line function blockify downsample

Parameters

args (argparse.Namespace object) – Input from command line

Returns

downsampled_file – Downsampled command line data

Return type

BedTool

blockify.normalization module

blockify.normalization.normalize(input_file, regions_bed, libraryFactor, lengthFactor)[source]

Core normalization method

Parameters
  • input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data

  • regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are normalizing input_file

  • libraryFactor (float) – Scalar to normalize by input_file’s library size.

  • lengthFactor (float or None) – Scalar to normalize by each block’s length. If None, no length normalization is performed.

Returns

bedgraph – A BedTool object in bedGraph format, using the intervals supplied in regions_bed

Return type

BedTool
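A usage sketch; the file names are hypothetical, and the factors shown (per million insertions, per kilobase) are illustrative choices:

    from pybedtools import BedTool
    from blockify.normalization import normalize

    input_file = BedTool("experiment.qbed")  # hypothetical input data
    blocks = BedTool("blocks.bed")           # hypothetical block intervals
    bedgraph = normalize(input_file, blocks, libraryFactor=1e6, lengthFactor=1e3)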

blockify.normalization.normalize_from_command_line(args)[source]

Wrapper function for the command line function blockify normalize

Parameters

args (argparse.Namespace object) – Input from command line

Returns

bedgraph – Normalized command line data in bedGraph format

Return type

BedTool

blockify.normalization.validateNormalizationArguments(input_file, regions_bed, libraryFactor, lengthFactor)[source]

Validates parameters passed via the command line.

Parameters
  • input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data

  • regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are normalizing input_file

  • libraryFactor (float) – Scalar to normalize by input_file’s library size.

  • lengthFactor (float or None) – Scalar to normalize by each block’s length. If None, no length normalization is performed.

Returns

None

Return type

None

blockify.parsers module

blockify.segmentation module

class blockify.segmentation.SegmentationRecord[source]

Bases: object

A class to store a single Bayesian block genomic segmentation.

finalize()[source]

Store post hoc summary statistics of the segmentation.

blockify.segmentation.blocksToDF(chrom, ranges)[source]

Convert a set of contiguous Bayesian blocks to pandas DataFrame format.

Parameters
  • chrom (str) – String specifying the chromosome

  • ranges (array) – Array whose entries specify the coordinates of block boundaries

Returns

output – the blocks in DataFrame format

Return type

pandas DataFrame

blockify.segmentation.segment(input_file, method, p0=None, prior=None)[source]

Core segmentation method.

Parameters
  • input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data

  • method (str) – String specifying whether to use OP or PELT for the segmentation

  • p0 (float, optional) – Float used to parameterize the prior on the total number of blocks; must be in the interval [0, 1]. Default: 0.05

  • prior (float, optional) – Explicit value for the prior on the total number of blocks (specifying this is not recommended)

Returns

segmentation – A SegmentationRecord from segmenting the provided data

Return type

SegmentationRecord
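A usage sketch (the file name is hypothetical; the method strings follow the description above):

    from pybedtools import BedTool
    from blockify.segmentation import segment

    data = BedTool("experiment.qbed")  # hypothetical path
    segmentation = segment(data, method="PELT", p0=0.05)  # or method="OP"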

blockify.segmentation.segment_from_command_line(args)[source]

Wrapper function for the command line function blockify segment

Parameters

args (argparse.Namespace object) – Input from command line

Returns

segmentation – A SegmentationRecord from segmenting the command line data

Return type

SegmentationRecord

blockify.segmentation.validateSegmentationArguments(input_file, p0, prior)[source]

Validates parameters passed via the command line.

Parameters
  • input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data

  • p0 (float) –

  • prior (float) –

Returns

None

Return type

None

blockify.utilities module

blockify.utilities.file_len(fname)[source]

Fast method for getting number of lines in a file. For BED files, much faster than calling len() on a BedTool object. From https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python

Parameters

fname (str) – Input (text) filename

Returns

length – Number of lines in fname

Return type

int
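The underlying idiom from the linked Stack Overflow answer, sketched standalone:

    def count_lines(fname):
        # iterate once over the file, keeping only the running line count
        n = 0
        with open(fname) as f:
            for n, _ in enumerate(f, start=1):
                pass
        return n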

blockify.utilities.getChromosomesInDF(df)[source]

Helper function to get a list of unique chromosomes in a pandas DataFrame.

Parameters

df (pandas DataFrame) – Input genomic data (e.g. BED, qBED, CCF) as a DataFrame

Returns

chroms – List of chromosomes

Return type

list

blockify.utilities.isSortedBEDFile(bed_file_path)[source]

Wrapper function to feed file paths to isSortedBEDObject.

Parameters

bed_file_path (str) – Path to BED/qBED/CCF data file

Returns

is_sorted – whether the BED file is sorted

Return type

bool

blockify.utilities.isSortedBEDObject(bed_object)[source]

Tests whether a BedTool object is sorted.

Parameters

bed_object (BedTool object) – Input data as a BedTool object

Returns

is_sorted – whether the BedTool object is sorted

Return type

bool

Module contents