HErmes.selection package

Submodules

HErmes.selection.categories module

Categories of data, like “signal” or “background” etc.

class HErmes.selection.categories.AbstractBaseCategory(name)[source]

Bases: object

Stands for a specific type of data, e.g. detector data in a specific configuration, simulated data etc.

add_cut(cut)[source]

Add a cut without applying it yet

Parameters:cut (pyevsel.variables.cut.Cut) – Append this cut to the internal cutlist
add_livetime_weighted(other, self_livetime=None, other_livetime=None)[source]

Combine two datasets livetime weighted. If it is simulated data, then in general it does not know about the detector livetime. In this case the livetimes for the two datasets can be given

Parameters:

other (pyevsel.categories.Category) – Add this dataset

Keyword Arguments:
 
  • self_livetime (float) – the data livetime for this dataset
  • other_livetime (float) – the data livetime for the other dataset
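
A minimal sketch of what a livetime-weighted combination could look like, assuming that each dataset's per-event weights are scaled by its share of the total livetime before the two are concatenated. The function name and the plain-list representation are hypothetical and are not taken from the HErmes source.

```python
def combine_livetime_weighted(weights_a, weights_b, livetime_a, livetime_b):
    """Merge two per-event weight lists, scaling each by its livetime fraction.

    Assumption: the combined weights are w * T_own / (T_a + T_b).
    """
    total = livetime_a + livetime_b
    frac_a = livetime_a / total
    frac_b = livetime_b / total
    # Each dataset contributes in proportion to how long it was recorded.
    return [w * frac_a for w in weights_a] + [w * frac_b for w in weights_b]


combined = combine_livetime_weighted([1.0, 1.0], [2.0], livetime_a=30.0, livetime_b=10.0)
```
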
add_plotoptions(options)[source]

Add options on how to plot this category. If available, they will be used.

Parameters:options (dict) – For the names which are currently supported, please see the example file
add_variable(variable)[source]

Add a variable to this category

Parameters:variable (pyevsel.variables.variables.Variable) – A Variable instance
apply_cuts(inplace=False)[source]

Apply the added cuts.

Keyword Arguments:
 inplace (bool) – If True, cut the internal variable buffer (cannot be undone unless the variable is reloaded)
calculate_weights(model, model_args=None)[source]
declare_harvested()[source]

Set the flag that all the variables have been read out

delete_cuts()[source]

Get rid of previously added cuts and undo them

delete_variable(varname)[source]

Remove a variable entirely from the category

Parameters:varname (str) – The name of the variable as stored in self.variable dict
Returns:None
distribution(varname, bins=None, color=None, alpha=0.5, fig=None, xlabel=None, norm=False, filled=None, legend=True, style='line', log=False, transform=None, extra_weights=None, figure_factory=None, return_histo=False)[source]

Plot the distribution of variable in the category

Parameters:

varname (str) – The name of the variable in the category

Keyword Arguments:
 
  • bins (int/np.ndarray) – Bins for the distribution
  • color (str/int) – A color identifier, either number 0-5 or matplotlib compatible
  • alpha (float) – 0-1 alpha value for histogram
  • fig (matplotlib.figure.Figure) – Canvas for plotting, if None an empty one will be created
  • xlabel (str) – xlabel for the plot. If None, default is used
  • norm (str) – “n” or “density” - make normed histogram
  • style (str) – Either “line” or “scatter”
  • filled (bool) – Draw filled histogram
  • legend (bool) – if available, plot a legend
  • transform (callable) – Apply transformation to the data before plotting
  • log (bool) – Plot yaxis in log scale
  • extra_weights (numpy.ndarray) – Use this for weighting. Will overwrite any other weights in the dataset
  • figure_factory (func) – Must return a single matplotlib.Figure, NOTE: figure_factory has priority over fig keyword
  • return_histo (bool) – Return the histogram instead of the figure. WARNING: changes return type!
Returns:

matplotlib.figure.Figure or dashi.histogram.hist1d

distribution2d(varnames, bins=None, figure_factory=None, fig=None, norm=False, log=True, cmap=<Mock name='mock.get_cmap()' id='140553631437208'>, interpolation='gaussian', cblabel='events', weights=None, despine=False, return_histo=False)[source]

Draw a 2d distribution of 2 variables in the same category.

Parameters:

varnames (tuple(str,str)) – The names of the variables in the category

Keyword Arguments:
 
  • bins (tuple(int/np.ndarray)) – Bins for the distribution
  • cmap – A colormap
  • alpha (//) – 0-1 alpha value for histogram
  • fig (matplotlib.figure.Figure) – Canvas for plotting, if None an empty one will be created
  • xlabel (//) – xlabel for the plot. If None, default is used
  • norm (str) – “n” or “density” - make normed histogram
  • style (//) – Either “line” or “scatter”
  • transform (callable) – Apply transformation to the data before plotting
  • log (bool) – Plot yaxis in log scale
  • figure_factory (func) – Must return a single matplotlib.Figure, NOTE: figure_factory has priority over fig keyword
  • return_histo (bool) – Return the histogram instead of the figure. WARNING: changes return type!
Returns:

matplotlib.figure.Figure or dashi.histogram.hist1d

drop_empty_variables()[source]

Delete variables which have no len

Returns:None
explore_files()[source]

Get a sneak preview of what variables are available for readout

Returns:list
get(varkey, uncut=False)[source]

Retrieve the data of a variable

Parameters:varkey (str) – The name of the variable
Keyword Arguments:
 uncut (bool) – never return cut values
get_datacube()[source]
get_files(*args, **kwargs)[source]

Load files for this category. Uses HErmes.utils.files.harvest_files.

Parameters:

*args (list of strings) – Path to possible files

Keyword Arguments:
 
  • datasets (dict(dataset_id -> nfiles)) – if given, load only files from dataset dataset_id and set the nfiles parameter to the number of L2 files the loaded files will represent
  • force (bool) – forcibly reload filelist (pre-readout vars will be lost)
  • append (bool) – keep the already acquired files and only append the new ones
  • all other kwargs will be passed to utils.files.harvest_files
harvested
integrated_rate

Calculate the total eventrate of this category (requires weights)

Returns (tuple): rate and quadratic error
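
A short sketch of how a rate with a “quadratic error” is commonly computed from per-event weights: the rate is the sum of the weights and the error is the square root of the sum of the squared weights. That the property works exactly this way is an assumption; the function below is a stand-alone illustration and not the HErmes implementation.

```python
import math


def integrated_rate_sketch(weights):
    """Return (rate, error) for a list of per-event weights."""
    rate = sum(weights)
    # Statistical error of a weighted count: sqrt of the sum of squared weights.
    error = math.sqrt(sum(w * w for w in weights))
    return rate, error


rate, error = integrated_rate_sketch([3.0, 4.0])
```
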

load_vardefs(module)[source]

Load the variable definitions from a module

Parameters:module (python module) – Needs to contain variable definitions
raw_count

Gives a number of “how many events are actually there”

Returns:int
read_variables(names=None, max_cpu_cores=6)[source]

Harvest the variables in self.vardict

Keyword Arguments:
 
  • names (list) – harvest only these variables
  • max_cpu_cores (int) – use a maximum of X cores of the cpu
show()[source]

Print out the names of the loaded variables

Returns:dict (name, len)
undo_cuts()[source]

Conveniently undo a previous “apply_cuts”
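
The interplay of apply_cuts, get(..., uncut=True) and undo_cuts can be pictured with a boolean mask that is kept next to the untouched data. The toy class below is a sketch under that assumption; its name and methods are hypothetical and only mimic the behaviour described for the non-inplace case.

```python
class MaskedColumn:
    """Toy container: keeps the full data and an optional boolean mask."""

    def __init__(self, values):
        self._values = list(values)
        self._mask = None  # None means "no cut applied"

    def apply_cut(self, condition):
        # Store which entries pass; the underlying data stays untouched.
        self._mask = [condition(v) for v in self._values]

    def undo_cut(self):
        # Dropping the mask restores the full view of the data.
        self._mask = None

    def get(self, uncut=False):
        if uncut or self._mask is None:
            return list(self._values)
        return [v for v, keep in zip(self._values, self._mask) if keep]


energy = MaskedColumn([1.0, 5.0, 12.0])
energy.apply_cut(lambda e: e > 2.0)
after_cut = energy.get()
everything = energy.get(uncut=True)
energy.undo_cut()
restored = energy.get()
```

Because only the mask changes, undoing the cut is cheap; an in-place cut would instead discard the rejected entries, which is why it cannot be reverted without reading the data again.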

variablenames
weights
weightvarname = None
class HErmes.selection.categories.CombinedCategory(name, categories)[source]

Bases: object

Create a combined category out of several others. This is mainly useful for plotting. FIXME: should this inherit from category as well? The difference compared to the dataset is that this is flat.

add_plotoptions(options)[source]

Add options on how to plot this category. If available, they will be used.

Parameters:options (dict) – For the names which are currently supported, please see the example file
get(varname)[source]
integrated_rate

Calculate the total eventrate of this category (requires weights)

Returns (tuple): rate and quadratic error

vardict
weights
class HErmes.selection.categories.Data(name)[source]

Bases: HErmes.selection.categories.AbstractBaseCategory

An interface to real time event data. Simplified weighting only.

calculate_weights(model=None, model_args=None)[source]

Calculate weights as rate, that is number of events per livetime

Keyword Args: for compatibility…
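
As an illustration of “number of events per livetime”: if every recorded event gets the weight 1/livetime, the sum of all weights is the event rate. The helper below is a hypothetical stand-alone sketch of that idea, not the method itself.

```python
def rate_weights(n_events, livetime):
    """Give every recorded event the same weight 1/livetime."""
    if livetime <= 0:
        raise ValueError("livetime must be positive")
    return [1.0 / livetime] * n_events


weights = rate_weights(n_events=4, livetime=2.0)
total_rate = sum(weights)  # events per unit of livetime
```
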

estimate_livetime(force=False)[source]

Calculate the livetime from run start/stop times, account for gaps

Keyword Arguments:
 force (bool) – override existing livetime
livetime
set_livetime(livetime)[source]

Override the private _livetime member

Parameters:livetime – The time needed for data-taking
Returns:None
set_run_start_stop(runstart_var=<Variable: None>, runstop_var=<Variable: None>)[source]

Let the data category know which variables describe the start and stop of a run

Keyword Arguments:
 
  • runstart_var (pyevself.variables.variables.Variable/str) – beginning of a run
  • runstop_var (pyevself.variables.variables.Variable/str) – end of a run
set_weightfunction(func)[source]
class HErmes.selection.categories.ReweightedSimulation(name, mother)[source]

Bases: HErmes.selection.categories.Simulation

A proxy for simulation dataset, when only the weighting differs

add_livetime_weighted(other)[source]

Combine two datasets livetime weighted. If it is simulated data, then in general it does not know about the detector livetime. In this case the livetimes for the two datasets can be given

Parameters:

other (pyevsel.categories.Category) – Add this dataset

Keyword Arguments:
 
  • self_livetime (float) – the data livetime for this dataset
  • other_livetime (float) – the data livetime for the other dataset
datasets
files
get(varname, uncut=False)[source]

Retrieve the data of a variable

Parameters:varname (str) – The name of the variable
Keyword Arguments:
 uncut (bool) – never return cut values
harvested
mother
raw_count

Gives a number of “how many events are actually there”

Returns:int
read_mc_primary(energy_var='mc_p_en', type_var='mc_p_ty', zenith_var='mc_p_ze', weight_var='mc_p_we')[source]

Trigger the readout of MC Primary information. Rename variables to magic keywords if necessary.

Keyword Arguments:
 
  • energy_var (str) – simulated primary energy
  • type_var (str) – simulated primary type
  • zenith_var (str) – simulated primary zenith
  • weight_var (str) – a weight, e.g. interaction probability
read_variables(names=None, max_cpu_cores=6)[source]

Harvest the variables in self.vardict

Keyword Arguments:
 
  • names (list) – harvest only these variables
  • max_cpu_cores (int) – use a maximum of X cores of the cpu
setter(other)
vardict
class HErmes.selection.categories.Simulation(name, weightvarname=None)[source]

Bases: HErmes.selection.categories.AbstractBaseCategory

An interface to variables from simulated data. Allows weighting of the events.

calculate_weights(model=None, model_args=None)[source]

Walk the variables of this category and identify the weighting variables and calculate them.

Usage example: calculate_weights(model=lambda x: np.power(x, -2.), model_args=["primary_energy"])

Keyword Arguments:
 
  • model (func) – The target flux to weight to, if None, generated flux is used for weighting
  • model_args (list) – The variables the model should be applied to from the variable dict
Returns:

np.ndarray
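
To make the role of model and model_args concrete, the sketch below evaluates a model function event by event on the columns named in model_args. It covers only this lookup-and-apply step; how the result is combined with generator weights is not shown, and the helper name and the plain dict of lists are assumptions made for illustration.

```python
def apply_model(model, model_args, variables):
    """Evaluate `model` event by event on the columns named in `model_args`."""
    columns = [variables[name] for name in model_args]
    # zip(*columns) yields one tuple of argument values per event.
    return [model(*row) for row in zip(*columns)]


def power_law(energy):
    # Target flux proportional to E^-2, as in the usage example above.
    return energy ** -2.0


variables = {"primary_energy": [1.0, 2.0, 4.0]}
flux = apply_model(power_law, ["primary_energy"], variables)
```
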

livetime
mc_p_readout
read_mc_primary(energy_var='mc_p_en', type_var='mc_p_ty', zenith_var='mc_p_ze', weight_var='mc_p_we')[source]

Trigger the readout of MC Primary information. Rename variables to magic keywords if necessary.

Keyword Arguments:
 
  • energy_var (str) – simulated primary energy
  • type_var (str) – simulated primary type
  • zenith_var (str) – simulated primary zenith
  • weight_var (str) – a weight, e.g. interaction probability
HErmes.selection.categories.cut_with_nans(data, cutmask)[source]

Cut the individual fields of a 2d array and keep the shape by filling up with nans

Parameters:
  • data (np.ndarray) – The array to cut
  • cutmask (np.ndarray) – Cut with this boolean array
Returns:

data with applied cuts

Return type:

np.ndarray
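
One way to read “keep the shape by filling up with nans” is sketched below with nested lists instead of numpy arrays: entries whose mask value is False are replaced by nan, so the 2d layout survives the cut. Both the element-wise mask and the function name are assumptions for illustration.

```python
import math


def cut_with_nans_sketch(data, cutmask):
    """Replace entries where `cutmask` is False by nan, keeping the 2d shape."""
    return [
        [value if keep else math.nan for value, keep in zip(row, mask_row)]
        for row, mask_row in zip(data, cutmask)
    ]


data = [[1.0, 2.0], [3.0, 4.0]]
mask = [[True, False], [False, True]]
result = cut_with_nans_sketch(data, mask)
```
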

HErmes.selection.cut module

Remove the part of the data which falls below a certain criterion.

class HErmes.selection.cut.Cut(*cuts, **kwargs)[source]

Bases: object

Cuts are basically conditions on a set of parameters.

variablenames

The names of the variables the cut will be applied to
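
The idea of a cut as a set of conditions on named variables can be sketched as follows. The (variable, operator, value) tuple format and the helper below are hypothetical and chosen for illustration only; they are not the documented Cut constructor.

```python
import operator

# Map operator symbols to comparison functions from the standard library.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def passes(event, conditions):
    """True if the event satisfies every (variable, operator, value) condition."""
    return all(OPERATORS[op](event[name], value) for name, op, value in conditions)


conditions = [("energy", ">", 5.0), ("zenith", "<=", 1.0)]
events = [{"energy": 10.0, "zenith": 0.5}, {"energy": 3.0, "zenith": 0.2}]
selected = [e for e in events if passes(e, conditions)]
# The variables a cut acts on follow directly from its conditions.
variablenames = sorted({name for name, _, _ in conditions})
```
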

HErmes.selection.dataset module

Datasets group categories together. Method calls on datasets invoke the individual methods on the individual categories. Cuts applied to datasets will act on each individual category.

class HErmes.selection.dataset.Dataset(*args, **kwargs)[source]

Bases: object

Holds different categories, relays calls to each of them.

add_category(category)[source]

Add another category to the dataset

Parameters:category (HErmes.selection.categories.Category) – add this category
add_cut(cut)[source]

Add a cut without applying it yet

Parameters:cut (HErmes.selection.variables.cut.Cut) – Append this cut to the internal cutlist
add_variable(variable)[source]

Add a variable to this category

Parameters:variable (HErmes.selection.variables.variables.Variable) – A Variable instance
apply_cuts(inplace=False)[source]

Apply them all!

calc_ratio(nominator=None, denominator=None)[source]

Calculate a ratio of the given categories

Parameters:
  • nominator (list) –
  • denominator (list) –
Returns:

tuple
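
A sketch of a ratio with propagated uncertainty, assuming the returned tuple holds the ratio and its error and that nominator and denominator are independent, non-zero summed rates. This is standard error propagation for a quotient, not a transcription of the HErmes code.

```python
import math


def ratio_with_error(num, num_err, den, den_err):
    """Return (ratio, error) for num/den with independent uncertainties."""
    ratio = num / den
    # Relative errors add in quadrature for a quotient of independent quantities.
    relative = math.sqrt((num_err / num) ** 2 + (den_err / den) ** 2)
    return ratio, ratio * relative


ratio, error = ratio_with_error(10.0, 1.0, 5.0, 0.5)
```
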

calculate_weights(model=None, model_args=None)[source]

Calculate the weights for all categories

Keyword Arguments:
 
  • model (dict/func) – Either a dict catname -> func or a single func. If it is a single func, it will be applied to all categories
  • model_args (dict/list) – variable names as arguments for the function
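
The dict-or-function convention can be sketched like this. The helper name is hypothetical, and skipping categories that are missing from the dict is an assumption, not documented behaviour.

```python
def resolve_models(model, categorynames):
    """Return a dict catname -> func, whether `model` is a dict or one callable."""
    if isinstance(model, dict):
        # Assumption: categories without an entry are simply left out.
        return {name: model[name] for name in categorynames if name in model}
    return {name: model for name in categorynames}


def power_law(energy):
    return energy ** -2.0


shared = resolve_models(power_law, ["signal", "background"])
per_category = resolve_models({"signal": power_law}, ["signal", "background"])
```
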
categorynames
combined_categorynames
delete_cuts()[source]

Completely purge all cuts from this dataset

delete_variable(varname)[source]

Delete a variable entirely from the dataset

Parameters:varname (str) – the name of the variable
Returns:None
distribution(name, ratio=([], []), cumulative=True, log=False, transform=None, color_palette='dark', normalized=False, styles={}, style='classic', ylabel='rate/bin [1/s]', axis_properties=None, ratiolabel='data/$\\Sigma$ bg', bins=None, external_weights=None, figure_factory=None)[source]

One shot short-cut for one of the most used plots in eventselections

Parameters:

name (string) – The name of the variable to plot

Keyword Arguments:
 
  • path (str) – The path under which the plot will be saved.
  • ratio (list) – A ratio plot of these categories will be created
  • color_palette (str) – A predefined color palette (from seaborn or HErmes.plotting.colors)
  • normalized (bool) – Normalize the histogram by number of events
  • transform (callable) – Apply this transformation before plotting
  • styles (dict) – plot styling options
  • ylabel (str) – general label for y-axis
  • ratiolabel (str) – different label for the ratio part of the plot
  • bins (np.ndarray) – binning, if None binning will be deduced from the variable definition
  • figure_factory (func) – factory function which return a matplotlib.Figure
  • style (string) – TODO “modern” || “classic” || “modern-cumul” || “classic-cumul”
  • external_weights (dict) – supply external weights - this will OVERRIDE ANY INTERNALLY CALCULATED WEIGHTS and use the supplied weights instead. Must be in the form {"categoryname": weights}
  • axis_properties (dict) –

    Manually define a plot layout with up to three axes. For example, it can look like this:

    {
        "top":    {"type": "h",    # histogram
                   "height": 0.4,  # height in percent
                   "index": 2},    # used internally
        "center": {"type": "r",    # ratio plot
                   "height": 0.2,
                   "index": 1},
        "bottom": {"type": "c",    # cumulative histogram
                   "height": 0.2,
                   "index": 0}
    }

Returns:

HErmes.selection.variables.VariableDistributionPlot

drop_empty_variables()[source]

Delete variables which have no len

Returns:None
files
get_category(categoryname)[source]

Get a reference to a category.

Parameters:categoryname – A name which has to be associated with a category
Returns:HErmes.selection.categories.Category
get_sparsest_category(omit_empty_cat=True)[source]

Find out which category of the dataset has the least statistical power

Keyword Arguments:
 omit_empty_cat (bool) – if a category has no entries at all, omit
Returns:category name
Return type:str
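
A minimal sketch of the selection logic, assuming that “least statistical power” means the smallest raw event count. The function and the example counts are hypothetical.

```python
def sparsest_category(raw_counts, omit_empty_cat=True):
    """Return the name of the category with the fewest events."""
    candidates = {
        name: count
        for name, count in raw_counts.items()
        if count > 0 or not omit_empty_cat
    }
    # min() raises ValueError if no category is left to choose from.
    return min(candidates, key=candidates.get)


counts = {"signal": 1200, "background": 85, "data": 0}
with_omit = sparsest_category(counts)
without_omit = sparsest_category(counts, omit_empty_cat=False)
```
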
get_variable(varname)[source]

Get a pandas dataframe for all categories

Parameters:varname (str) – A name of a variable
Returns:A 2d dataframe category -> variable
Return type:pandas.DataFrame
integrated_rate

Integrated rate for each category

Returns:rate with error
Return type:pandas.Panel
load_vardefs(vardefs)[source]

Load the variable definitions from a module

Parameters:vardefs (python module/dict) – A module needs to contain variable definitions. It can also be a dictionary of categoryname->module
read_variables(names=None, max_cpu_cores=6)[source]

Read out the variable for all categories

Keyword Arguments:
 
  • names (str) – Readout only these variables if given
  • max_cpu_cores (int) – Maximum number of cpu cores which will be used
Returns:

None

set_default_plotstyles(styledict)[source]

Define a standard for each category how it should appear in plots

Parameters:styledict (dict) –
set_livetime(livetime)[source]

Define a livetime for this dataset.

Parameters:livetime (float) – Time interval the data was taken in. (Used for rate calculation)
Returns:None
set_weightfunction(weightfunction=<function Dataset.<lambda>>)[source]

Defines a function which is used for weighting

Parameters:weightfunction (func or dict) – if a func is provided, it is set for all categories. If needed, provide a dict cat.name -> func for individual settings
Returns:None
sum_rate(categories=None)[source]

Sum up the integrated rates for categories

Parameters:categories – categories considered background
Returns:rate with error
Return type:tuple
tinytable(signal=None, background=None, layout='v', format='html', order_by=<function Dataset.<lambda>>, livetime=1.0)[source]

Use dashi.tinytable.TinyTable to render a nice html representation of a rate table

Parameters:
  • signal (list) – summing up signal categories to calculate total signal rate
  • background (list) – summing up background categories to calculate total background rate
  • layout (str) – “v” for vertical, “h” for horizontal
  • format (str) – “html”,”latex”,”wiki”
Returns:

formatted table in desired markup

Return type:

str

undo_cuts()[source]

Undo previously done cuts, but keep them so that they can be re-applied

variablenames
weights

Get the weights for all categories in this dataset

HErmes.selection.dataset.get_label(category)[source]

Get the label for labeling plots from a category’s plot_options dictionary.

Parameters:category (HErmes.selection.categories.category) – Query the category’s plot_options dict, if not fall back to category.name
Returns:string

HErmes.selection.magic_keywords module

All magic keywords shall summon here

HErmes.selection.variables module

Container classes for variables

class HErmes.selection.variables.AbstractBaseVariable[source]

Bases: object

Read out tagged numerical data from files

ROLES

alias of VariableRole

bins
calculate_fd_bins()[source]

Calculate a reasonable binning

Returns:Freedman Diaconis bins
Return type:numpy.ndarray
data
declare_harvested()[source]
harvest(*files)[source]

Hook to the harvest method. Don’t use in case of multiprocessing!

Parameters:*files – walk through these files and read them out

harvested
ndim
rewire_variables(vardict)[source]
class HErmes.selection.variables.CompoundVariable(name, variables=None, label='', bins=None, operation=<function CompoundVariable.<lambda>>, role=<VariableRole.SCALAR: 10>)[source]

Bases: HErmes.selection.variables.AbstractBaseVariable

Calculate a variable from other variables. This kind of variable will not read any file.

harvest(*filenames)[source]

Hook to the harvest method. Don’t use in case of multiprocessing!

Parameters:*files – walk through these files and read them out

rewire_variables(vardict)[source]

Use this to avoid having to read out variables twice. As the variables are copied over by the categories, the reference is lost. It can be rewired, though.

class HErmes.selection.variables.Variable(name, definitions=None, bins=None, label='', transform=<function Variable.<lambda>>, role=<VariableRole.SCALAR: 10>, nevents=None, reduce_dimension=None)[source]

Bases: HErmes.selection.variables.AbstractBaseVariable

A hook to a single variable read out from a file

rewire_variables(vardict)[source]

Make sure all the variables are connected properly. This is only needed for combined/compound variables

Returns:None
class HErmes.selection.variables.VariableList(name, variables=None, label='', bins=None, role=<VariableRole.SCALAR: 10>)[source]

Bases: HErmes.selection.variables.AbstractBaseVariable

A list of variables. Cannot be read out from files.

data
harvest(*filenames)[source]

Hook to the harvest method. Don’t use in case of multiprocessing!

Parameters:*files – walk through these files and read them out

rewire_variables(vardict)[source]

Use this to avoid having to read out variables twice. As the variables are copied over by the categories, the reference is lost. It can be rewired, though.

class HErmes.selection.variables.VariableRole[source]

Bases: enum.Enum

Define roles for variables. Some variables used in a special context (like weights) are easily recognizable by this flag.

ARRAY = 20
ENDTIME = 70
EVENTID = 50
GENERATORWEIGHT = 30
RUNID = 40
SCALAR = 10
STARTIME = 60
UNKNOWN = 0
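
The sketch below re-declares the enum with the values listed above, so that it runs without HErmes installed, and shows how a role flag could be used to pick out weight variables. The variable names in the dictionary are hypothetical.

```python
import enum


class VariableRole(enum.Enum):
    # Values copied from the member list above.
    UNKNOWN = 0
    SCALAR = 10
    ARRAY = 20
    GENERATORWEIGHT = 30
    RUNID = 40
    EVENTID = 50
    STARTIME = 60
    ENDTIME = 70


roles = {"energy": VariableRole.SCALAR, "mc_weight": VariableRole.GENERATORWEIGHT}
weight_vars = [
    name for name, role in roles.items() if role is VariableRole.GENERATORWEIGHT
]
```
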
HErmes.selection.variables.extract_from_root(filename, definitions, nevents=None, reduce_dimension=None)[source]

Use the uproot system to get information from rootfiles. Supports a basic tree-like structure of primitive datatypes.

Parameters:
  • filename (str) – datafile
  • definitions (list) – tree and branch addresses
Keyword Arguments:
 
  • nevents (int) – number of events to read out
  • reduce_dimension (int) – If data is vector-type, reduce it by taking the n-th element
HErmes.selection.variables.freedman_diaconis_bins(data, leftedge, rightedge, minbins=20, maxbins=70, fallbackbins=70)[source]

Get a number of bins for a histogram following Freedman/Diaconis

Parameters:
  • leftedge (float) – left bin edge
  • rightedge (float) – right bin edge
  • minbins (int) – the minimum number of bins
  • maxbins (int) – the maximum number of bins
  • fallbackbins (int) – a number of bins which is returned if the calculation fails
Returns:

number of bins, minbins < bins < maxbins

Return type:

nbins (int)
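
For reference, the Freedman-Diaconis rule sets the bin width to 2·IQR·n^(-1/3). The sketch below derives a bin count from that width, clamps it to the given limits and falls back to fallbackbins when the width cannot be computed. It uses only the standard library; whether HErmes clamps and falls back in exactly this way is an assumption.

```python
import statistics


def fd_bins_sketch(data, leftedge, rightedge, minbins=20, maxbins=70, fallbackbins=70):
    """Number of bins following Freedman/Diaconis, clamped to [minbins, maxbins]."""
    try:
        q1, _, q3 = statistics.quantiles(data, n=4)
        # Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
        width = 2.0 * (q3 - q1) / len(data) ** (1.0 / 3.0)
        nbins = int((rightedge - leftedge) / width)
    except (statistics.StatisticsError, ZeroDivisionError):
        # Too little data, or zero interquartile range.
        return fallbackbins
    return max(minbins, min(maxbins, nbins))


sample = [float(i) for i in range(1, 101)]
nbins = fd_bins_sketch(sample, 0.0, 100.0)
```
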

HErmes.selection.variables.harvest(filenames, definitions, **kwargs)[source]

Read variables from files into memory. Will be used by HErmes.selection.variables.Variable.harvest. This will be run multi-threaded. Keep in mind that arguments have to be picklable, and everything which is read out must be picklable as well. Lambda functions are NOT picklable.

Parameters:
  • filenames (list) – the files to extract the variables from. currently supported: hdf
  • definitions (list) – where to find the data in the files. They usually have some tree-like structure, so this is a list of leaf-value pairs. If there is more than one, all of them will be tried (as some files might use a different naming scheme). Example: [("hello_reconstruction", "x"), ("hello_reconstruction", "y")]
Keyword Arguments:
 
  • transformation (func) – After the data is read out from the files, transformation will be applied, e.g. the log to the energy.
  • fill_empty (bool) – Fill empty fields with zeros
  • nevents (int) – ROOT only - read out only nevents from the files
  • reduce_dimension (str) – ROOT only - multidimensional data can be reduced by only using the index given by reduce_dimension. E.g. in case of a TVector3, if we want to have only x, that would be 0, y -> 1 and z -> 2.
  • precision (int) – Precision in bit. FIXME: Not implemented yet!
Returns:

pd.Series or pd.DataFrame
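
The picklability requirement can be checked up front with the standard pickle module. The helper below is a hypothetical convenience, not part of HErmes. It illustrates why a function that can be imported by name, such as math.log10, is a safer choice for a transformation than a lambda.

```python
import math
import pickle


def is_picklable(obj):
    """Return True if `obj` survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


named_ok = is_picklable(math.log10)       # importable by name
lambda_ok = is_picklable(lambda x: x)     # anonymous, cannot be looked up by name
```
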

Module contents

Provides containers for in-memory variables. These containers are called “categories”, and they represent a set of variables for a certain type of data. Categories can be further grouped into “Datasets”. Variables can be read out from files and stored in memory in the form of numpy arrays or pandas Series/DataFrames. Selection criteria can be applied simultaneously (and reversibly) to all categories in a dataset with the “Cut” class.

HErmes.selection provides the following submodules:

  • categories : Container classes for variables.
  • dataset : Grouping categories together.
  • cut : Apply selection criteria on variables in a category.
  • variables : Variable definition. Harvest variables from files.
  • magic_keywords : A bunch of fixed names for automatic weight calculation.
HErmes.selection.load_dataset(config, variables=None, max_cpu_cores=6)[source]

Read a json configuration file and load a dataset populated with variables from the files given in the configuration file.

Parameters:

config (str/dict) – json style config file or dict

Keyword Arguments:
 
  • variables (list) – list of strings of variable names to read out
  • max_cpu_cores (int) – maximum number of cpu cores to use for variable readout
Returns:

HErmes.selection.dataset.Dataset