blockify package

Submodules

blockify.algorithms module

class blockify.algorithms.Algorithm(p0=0.05, gamma=None, ncp_prior=None)[source]

Bases: object

Base class for Bayesian blocks algorithm functions

Derived classes should overload the following method:

fitness(self, **kwargs):

Compute the fitness function given a set of named arguments. Arguments accepted by fitness must be among [T_k, N_k, a_k, b_k, c_k] (see [1] for details on the meaning of these parameters).

Additionally, other methods may be overloaded as well:

__init__(self, **kwargs):

Initialize the algorithm function with any parameters beyond the normal p0 and gamma.

validate_input(self, t, x, sigma):

Enable specific checks of the input data (t, x, sigma) to be performed prior to the fit.

compute_ncp_prior(self, N):

If ncp_prior is not defined explicitly, this function is called in order to define it before fitting. This may be calculated from gamma, p0, or whatever method you choose.

p0_prior(self, N):

Specify the form of the prior given the false-alarm probability p0 (see [1] for details).

For examples of implemented algorithm functions, see BayesianBlocks, OptimalPartitioning, and PELT; a minimal custom subclass is sketched below.
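For instance, a new algorithm can be defined by subclassing Algorithm and overloading fitness. The sketch below is illustrative (the class name MyEvents is hypothetical) and assumes the Events-style block log-likelihood of eq. 19 in Scargle (2012):

    import numpy as np
    from blockify.algorithms import Algorithm

    class MyEvents(Algorithm):
        def fitness(self, N_k, T_k):
            # Events-style log-likelihood of each block, where N_k is the
            # number of events in block k and T_k is the block's width
            return N_k * (np.log(N_k) - np.log(T_k))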

References

[1]

Scargle, J. et al. (2012), "Studies in Astronomical Time Series Analysis. VI. Bayesian Block Representations", http://adsabs.harvard.edu/abs/2012arXiv1207.5578S

compute_ncp_prior(N)[source]

If ncp_prior is not explicitly defined, compute it from gamma or p0.

fitness(**kwargs)[source]
p0_prior(N)[source]

Empirical prior, parametrized by the false alarm probability p0. See eq. 21 in Scargle (2012).

Note that there was an error in this equation in the original Scargle paper (the “log” was missing). The following corrected form is taken from http://arxiv.org/abs/1304.2818
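As a reference, here is a standalone sketch of the corrected form, matching the expression used in astropy's Bayesian blocks implementation (the function name is illustrative):

    import numpy as np

    def empirical_ncp_prior(N, p0=0.05):
        # corrected eq. 21 of Scargle (2012):
        # ncp_prior = 4 - ln(73.53 * p0 * N**(-0.478))
        return 4 - np.log(73.53 * p0 * N ** -0.478)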

segment(t, x=None, sigma=None)[source]
static validate_input(t, x=None, sigma=None)[source]

Validate inputs to the model.

Parameters
  • t (array_like) – times of observations

  • x (array_like (optional)) – values observed at each time

  • sigma (float or array_like (optional)) – errors in values x

Returns

t, x, sigma – validated and perhaps modified versions of inputs

Return type

array_like, float or None

class blockify.algorithms.BayesianBlocks(p0=0.05, gamma=None, ncp_prior=None)[source]

Bases: blockify.algorithms.OptimalPartitioning

Bayesian blocks algorithm for binned or unbinned events

Parameters
  • p0 (float (optional)) – False alarm probability, used to compute the prior on \(N_{\rm blocks}\) (see eq. 21 of Scargle 2012). For Events-type data, p0 does not appear to be an accurate representation of the actual false alarm probability. If you are using this algorithm function for a triggering-type condition, it is recommended that you run statistical trials on signal-free noise to determine an appropriate value of gamma or ncp_prior for a desired false alarm rate.

  • gamma (float (optional)) – If specified, then use this gamma to compute the general prior form, \(p \sim {\tt gamma}^{N_{\rm blocks}}\). If gamma is specified, p0 is ignored.

  • ncp_prior (float (optional)) – If specified, use the value of ncp_prior to compute the prior as above, using the definition \({\tt ncp\_prior} = -\ln({\tt gamma})\). If ncp_prior is specified, gamma and p0 are ignored.

fitness(N_k, T_k)[source]
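A minimal usage sketch on synthetic event times:

    import numpy as np
    from blockify.algorithms import BayesianBlocks

    t = np.sort(np.random.default_rng(0).uniform(0, 100, size=500))  # synthetic data
    bb = BayesianBlocks(p0=0.01)
    # equivalently, pass ncp_prior=-np.log(gamma) for a chosen gamma
    edges = bb.segment(t)  # (M+1) edges defining the M optimal blocks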
class blockify.algorithms.OptimalPartitioning(p0=0.05, gamma=None, ncp_prior=None)[source]

Bases: blockify.algorithms.Algorithm

Bayesian blocks algorithm for regular events

This is for data which has a fundamental “tick” length, so that all measured values are multiples of this tick length. In each tick, there are either zero or one counts.

Parameters
  • p0 (float (optional)) – False alarm probability, used to compute the prior on \(N_{\rm blocks}\) (see eq. 21 of Scargle 2012). If gamma is specified, p0 is ignored.

  • gamma (float (optional)) – If specified, then use this gamma to compute the general prior form, \(p \sim {\tt gamma}^{N_{\rm blocks}}\). If gamma is specified, p0 is ignored.

  • ncp_prior (float (optional)) – If specified, use the value of ncp_prior to compute the prior as above, using the definition \({\tt ncp\_prior} = -\ln({\tt gamma})\). If ncp_prior is specified, gamma and p0 are ignored.

fitness(T_k, N_k)[source]
static get_change_points(N, edges, last)[source]
segment(t, x=None, sigma=None)[source]

Fit the Bayesian Blocks model given the specified algorithm function.

Parameters
  • t (array_like) – data times (one dimensional, length N)

  • x (array_like (optional)) – data values

  • sigma (array_like or float (optional)) – data errors

Returns

edges – array containing the (M+1) edges defining the M optimal bins

Return type

ndarray
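For example, on synthetic event times (the conversion of edges to per-block counts is illustrative):

    import numpy as np
    from blockify.algorithms import OptimalPartitioning

    t = np.sort(np.random.default_rng(1).uniform(0, 100, size=500))  # synthetic data
    op = OptimalPartitioning(p0=0.05)
    edges = op.segment(t)
    counts, _ = np.histogram(t, bins=edges)  # events per optimal block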

class blockify.algorithms.PELT(p0=0.05, gamma=None, ncp_prior=None)[source]

Bases: blockify.algorithms.OptimalPartitioning

Bayesian blocks algorithm for point measures

Parameters
  • p0 (float (optional)) – False alarm probability, used to compute the prior on \(N_{\rm blocks}\) (see eq. 21 of Scargle 2012). If gamma is specified, p0 is ignored.

  • gamma (float (optional)) – If specified, then use this gamma to compute the general prior form, \(p \sim {\tt gamma}^{N_{\rm blocks}}\). If gamma is specified, p0 is ignored.

  • ncp_prior (float (optional)) – If specified, use the value of ncp_prior to compute the prior as above, using the definition \({\tt ncp\_prior} = -\ln({\tt gamma})\). If ncp_prior is specified, gamma and p0 are ignored.

fitness(N_k, T_k)[source]
segment(t, x=None, sigma=None)[source]

Fit the Bayesian Blocks model given the specified algorithm function.

Parameters
  • t (array_like) – data times (one dimensional, length N)

  • x (array_like (optional)) – data values

  • sigma (array_like or float (optional)) – data errors

Returns

edges – array containing the (M+1) edges defining the M optimal bins

Return type

ndarray
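Usage mirrors OptimalPartitioning; a minimal sketch on synthetic event times:

    import numpy as np
    from blockify.algorithms import PELT

    t = np.sort(np.random.default_rng(2).uniform(0, 100, size=500))  # synthetic data
    pelt = PELT(p0=0.05)
    edges = pelt.segment(t)  # same edge-array convention as above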

blockify.annotation module

blockify.annotation.annotate(input_file, regions_bed, background_file, measure='enrichment', intermediate=None, alpha=None, correction=None, p_value=None, distance=None, min_size=None, max_size=None, pseudocount=1, tight=False, summit=False)[source]

Core annotation and peak calling method.

Parameters
  • input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data

  • regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are annotating/calling peaks

  • background_file (BedTool object) – BedTool object (instantiated from pybedtools) used to parameterize the background model

  • measure (str) – Either “enrichment” or “depletion” to indicate which direction of effect to test for

  • intermediate (bool) – Whether or not to return intermediate calculations during peak calling

  • alpha (float or None) – Multiple-hypothesis adjusted threshold for calling significance

  • correction (str or None) – Multiple hypothesis correction to perform (see statsmodels.stats.multitest for valid values)

  • p_value (float or None) – Straight p-value cutoff (unadjusted) for calling significance

  • distance (int or None) – Merge significant features within specified distance cutoff

  • min_size (int or None) – Minimum size cutoff for peaks

  • max_size (int or None) – Maximum size cutoff for peaks

  • pseudocount (float) – Pseudocount added to adjust background model

  • tight (bool) – Whether to tighten the regions in regions_bed

  • summit (bool) – Whether to return peak summits instead of full peaks

Returns

  • out_bed (BedTool object) – Set of peaks in BED6 format

  • df (pandas DataFrame or None) – If intermediate specified, DataFrame containing intermediate calculations during peak calling
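A usage sketch; the file names are hypothetical, and "fdr_bh" is one of the correction methods accepted by statsmodels.stats.multitest:

    from pybedtools import BedTool
    from blockify.annotation import annotate

    input_file = BedTool("experiment.qbed")    # hypothetical input data
    regions = BedTool("blocks.bed")            # hypothetical segmentation output
    background = BedTool("background.qbed")    # hypothetical background model

    # returns (peaks as a BedTool, intermediate DataFrame) per the docs above
    peaks, df = annotate(
        input_file, regions, background,
        measure="enrichment",
        intermediate=True,     # also return intermediate calculations
        alpha=0.05,
        correction="fdr_bh",
    )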

blockify.annotation.annotate_from_command_line(args)[source]

Wrapper function for the command line function blockify call

Parameters

args (argparse.Namespace object) – Input from command line

Returns

  • out_bed (BedTool object) – Set of peaks in BED6 format

  • df (pandas DataFrame or None) – If intermediate specified, DataFrame containing intermediate calculations during peak calling

blockify.annotation.getPeakSummits(df, metric='pValue')[source]

From a list of peaks, get a set of peak summits

Parameters
  • df (pandas DataFrame) – Set of peaks from annotate as a DataFrame

  • metric (str) – Metric to use when filtering for summits. One of “pValue” or “density”

Returns

summits – Set of peak summits as a DataFrame

Return type

pandas DataFrame

blockify.annotation.parcelConsecutiveBlocks(df)[source]

Concatenates consecutive blocks into a DataFrame. If there are multiple non-contiguous sets of consecutive blocks, creates one DataFrame per set.

Parameters

df (pandas DataFrame) – Input set of blocks as a DataFrame

Returns

outlist – List of DataFrames, each of which is a set of consecutive blocks

Return type

list of pandas DataFrames

blockify.annotation.sizeFilter(bed, min_size, max_size)[source]

Filter peaks by size.

Parameters
  • bed (BedTool object) – Input data file

  • min_size (int) – Lower bound for peak size

  • max_size (int) – Upper bound for peak size

Returns

filtered_peaks – Peaks after size selection

Return type

BedTool object

blockify.annotation.tighten(data)[source]

Tightens block boundaries in a BedTool file. This function modifies block boundaries so that they coincide with data points.

Parameters

data (BedTool object) – Input file of block boundaries

Returns

refined – BedTool of tightened blocks

Return type

BedTool object

blockify.annotation.validateAnnotationArguments(input_file, regions_bed, background_file, measure, alpha, correction, p_value, distance, min_size, max_size, pseudocount)[source]

Validates parameters passed via the command line.

Parameters
  • input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data

  • regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are annotating/calling peaks

  • background_file (BedTool object) – BedTool object (instantiated from pybedtools) used to parameterize the background model

  • measure (str) – Either “enrichment” or “depletion” to indicate which direction of effect to test for

  • alpha (float or None) – Multiple-hypothesis adjusted threshold for calling significance

  • correction (str or None) – Multiple hypothesis correction to perform (see statsmodels.stats.multitest for valid values)

  • p_value (float or None) – Straight p-value cutoff (unadjusted) for calling significance

  • distance (int or None) – Merge significant features within specified distance cutoff

  • min_size (int or None) – Minimum size cutoff for peaks

  • max_size (int or None) – Maximum size cutoff for peaks

  • pseudocount (float) – Pseudocount added to adjust background model

Returns

None

Return type

None

blockify.downsampling module

blockify.downsampling.downsample(input_file, n, seed=None, naive=False)[source]

Core downsampling method

Parameters
  • input_file (pandas DataFrame) – Input data (e.g. BED, qBED, CCF) as a pandas DataFrame

  • n (int) – Number of entries to sample

  • seed (int) – Seed for random number generator

  • naive (bool) – Whether to sample each entry with equal probability (True) or weighted by the value in the fourth column, if supplied (False)

Returns

downsampled_file – Input file after downsampling

Return type

BedTool object
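A sketch, assuming a tab-separated qBED-like input (the path is hypothetical):

    import pandas as pd
    from blockify.downsampling import downsample

    data = pd.read_csv("experiment.qbed", sep="\t", header=None)  # hypothetical path
    subset = downsample(data, n=10000, seed=42)  # weighted by column 4, if present
    subset_uniform = downsample(data, n=10000, seed=42, naive=True)  # equal probability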

blockify.downsampling.downsample_from_command_line(args)[source]

Wrapper function for the command line function blockify downsample

Parameters

args (argparse.Namespace object) – Input from command line

Returns

downsampled_file – Downsampled command line data

Return type

BedTool

blockify.normalization module

blockify.normalization.normalize(input_file, regions_bed, libraryFactor, lengthFactor)[source]

Core normalization method

Parameters
  • input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data

  • regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are normalizing input_file

  • libraryFactor (float) – Scalar to normalize by input_file’s library size.

  • lengthFactor (float or None) – Scalar to normalize by each block’s length. If None, no length normalization is performed.

Returns

bedgraph – A BedTool object in bedGraph format, using the intervals supplied in regions_bed

Return type

BedTool
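A usage sketch; the file names are hypothetical, and the factors shown (per million insertions, per kilobase) are illustrative choices:

    from pybedtools import BedTool
    from blockify.normalization import normalize

    input_file = BedTool("experiment.qbed")  # hypothetical input data
    blocks = BedTool("blocks.bed")           # hypothetical block intervals
    bedgraph = normalize(input_file, blocks, libraryFactor=1e6, lengthFactor=1e3)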

blockify.normalization.normalize_from_command_line(args)[source]

Wrapper function for the command line function blockify normalize

Parameters

args (argparse.Namespace object) – Input from command line

Returns

bedgraph – Normalized command line data in bedGraph format

Return type

BedTool

blockify.normalization.validateNormalizationArguments(input_file, regions_bed, libraryFactor, lengthFactor)[source]

Validates parameters passed via the command line.

Parameters
  • input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data

  • regions_bed (BedTool object) – BedTool object (instantiated from pybedtools) for regions over which we are normalizing input_file

  • libraryFactor (float) – Scalar to normalize by input_file’s library size.

  • lengthFactor (float or None) – Scalar to normalize by each block’s length. If None, no length normalization is performed.

Returns

None

Return type

None

blockify.parsers module

blockify.segmentation module

class blockify.segmentation.SegmentationRecord[source]

Bases: object

A class to store a single Bayesian block genomic segmentation.

finalize()[source]

Store post hoc summary statistics of the segmentation.

blockify.segmentation.blocksToDF(chrom, ranges)[source]

Convert a set of contiguous Bayesian blocks to pandas DataFrame format.

Parameters
  • chrom (str) – String specifying the chromosome

  • ranges (array) – Array whose entries specify the coordinates of block boundaries

Returns

output – the blocks in DataFrame format

Return type

pandas DataFrame

blockify.segmentation.segment(input_file, method, p0=None, prior=None)[source]

Core segmentation method.

Parameters
  • input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data

  • method (str) – String specifying whether to use OP or PELT for the segmentation

  • p0 (float, optional) – Float used to parameterize the prior on the total number of blocks; must be in the interval [0, 1]. Default: 0.05

  • prior (float, optional) – Explicit value for the prior on the total number of blocks (specifying this is not recommended)

Returns

segmentation – A SegmentationRecord from segmenting the provided data

Return type

SegmentationRecord
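A usage sketch (the file name is hypothetical; the method strings follow the description above):

    from pybedtools import BedTool
    from blockify.segmentation import segment

    data = BedTool("experiment.qbed")  # hypothetical path
    segmentation = segment(data, method="PELT", p0=0.05)  # or method="OP"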

blockify.segmentation.segment_from_command_line(args)[source]

Wrapper function for the command line function blockify segment

Parameters

args (argparse.Namespace object) – Input from command line

Returns

segmentation – A SegmentationRecord from segmenting the command line data

Return type

SegmentationRecord

blockify.segmentation.validateSegmentationArguments(input_file, p0, prior)[source]

Validates parameters passed via the command line.

Parameters
  • input_file (BedTool object) – BedTool object (instantiated from pybedtools) for input data

  • p0 (float) –

  • prior (float) –

Returns

None

Return type

None

blockify.utilities module

blockify.utilities.file_len(fname)[source]

Fast method for getting number of lines in a file. For BED files, much faster than calling len() on a BedTool object. From https://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python

Parameters

fname (str) – Input (text) filename

Returns

length – Number of lines in fname

Return type

int
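The underlying idiom from the linked Stack Overflow answer, sketched standalone:

    def count_lines(fname):
        # iterate once over the file, keeping only the running line count
        n = 0
        with open(fname) as f:
            for n, _ in enumerate(f, start=1):
                pass
        return n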

blockify.utilities.getChromosomesInDF(df)[source]

Helper function to get a list of unique chromosomes in a pandas DataFrame.

Parameters

df (pandas DataFrame) – Input genomic data (e.g. BED, qBED, CCF) as a DataFrame

Returns

chroms – List of chromosomes

Return type

list

blockify.utilities.isSortedBEDFile(bed_file_path)[source]

Wrapper function to feed file paths to isSortedBEDObject.

Parameters

bed_file_path (str) – Path to BED/qBED/CCF data file

Returns

is_sorted – whether the BED file is sorted

Return type

bool

blockify.utilities.isSortedBEDObject(bed_object)[source]

Tests whether a BedTool object is sorted.

Parameters

bed_object (BedTool object) – Input data as a BedTool object

Returns

is_sorted – whether the BedTool object is sorted

Return type

bool

Module contents