HErmes.selection package¶

Submodules¶

HErmes.selection.categories module¶

Categories of data, like “signal” or “background” etc.

class HErmes.selection.categories.AbstractBaseCategory(name)[source]¶

Bases: object

Stands for a specific type of data, e.g. detector data in a specific configuration, simulated data etc.

add_cut(cut)[source]¶

Add a cut without applying it yet

Parameters:	cut (pyevsel.variables.cut.Cut) – Append this cut to the internal cutlist

add_livetime_weighted(other, self_livetime=None, other_livetime=None)[source]¶

Combine two datasets, weighted by livetime. If it is simulated data, then in general it does not know about the detector livetime. In this case the livetimes for the two datasets can be given.

Parameters:	other (pyevsel.categories.Category) – Add this dataset
Keyword Arguments:
	self_livetime (float) – the data livetime for this dataset
	other_livetime (float) – the data livetime for the other dataset

add_plotoptions(options)[source]¶

Add options on how to plot this category. If available, they will be used.

Parameters:	options (dict) – For the names which are currently supported, please see the example file

add_variable(variable)[source]¶

Add a variable to this category

Parameters:	variable (pyevsel.variables.variables.Variable) – A Variable instance

apply_cuts(inplace=False)[source]¶

Apply the added cuts.

Keyword Arguments:
	inplace (bool) – If True, cut the internal variable buffer (cannot be undone unless the variable is reloaded)

calculate_weights(model, model_args=None)[source]¶

declare_harvested()[source]¶: Set the flag that all the variables have been read out

delete_cuts()[source]¶: Get rid of previously added cuts and undo them

delete_variable(varname)[source]¶

Remove a variable entirely from the category

Parameters:	varname (str) – The name of the variable as stored in self.variable dict
Returns:	None

distribution(varname, bins=None, color=None, alpha=0.5, fig=None, xlabel=None, norm=False, filled=None, legend=True, style='line', log=False, transform=None, extra_weights=None, figure_factory=None, return_histo=False)[source]¶

Plot the distribution of a variable in the category

Parameters:	varname (str) – The name of the variable in the category
Keyword Arguments:
	bins (int/np.ndarray) – Bins for the distribution
	color (str/int) – A color identifier, either a number 0-5 or matplotlib compatible
	alpha (float) – 0-1 alpha value for histogram
	fig (matplotlib.figure.Figure) – Canvas for plotting, if None an empty one will be created
	xlabel (str) – xlabel for the plot. If None, default is used
	norm (str) – “n” or “density” - make normed histogram
	style (str) – Either “line” or “scatter”
	filled (bool) – Draw filled histogram
	legend (bool) – if available, plot a legend
	transform (callable) – Apply transformation to the data before plotting
	log (bool) – Plot yaxis in log scale
	extra_weights (numpy.ndarray) – Use this for weighting. Will overwrite any other weights in the dataset
	figure_factory (func) – Must return a single matplotlib.Figure, NOTE: figure_factory has priority over fig keyword
	return_histo (bool) – Return the histogram instead of the figure. WARNING: changes return type!
Returns:	matplotlib.figure.Figure or dashi.histogram.hist1d

distribution2d(varnames, bins=None, figure_factory=None, fig=None, norm=False, log=True, cmap=<Mock name='mock.get_cmap()' id='140553631437208'>, interpolation='gaussian', cblabel='events', weights=None, despine=False, return_histo=False)[source]¶

Draw a 2d distribution of 2 variables in the same category.

Parameters:	varnames (tuple(str,str)) – The names of the variables in the category

Keyword Arguments:

bins (tuple(int/np.ndarray)) – Bins for the distribution
cmap – A colormap
alpha (//) – 0-1 alpha value for histogram
fig (matplotlib.figure.Figure) – Canvas for plotting, if None an empty one will be created
xlabel (//) – xlabel for the plot. If None, default is used
norm (str) – “n” or “density” - make normed histogram
style (//) – Either “line” or “scatter”
transform (callable) – Apply transformation to the data before plotting
log (bool) – Plot yaxis in log scale
figure_factory (func) – Must return a single matplotlib.Figure, NOTE: figure_factory has priority over fig keyword
return_histo (bool) – Return the histogram instead of the figure. WARNING: changes return type!

Returns:

matplotlib.figure.Figure or dashi.histogram.hist1d

drop_empty_variables()[source]¶

Delete variables which have no len

Returns:	None

explore_files()[source]¶

Get a sneak preview of which variables are available for readout

Returns:	list

get(varkey, uncut=False)[source]¶

Retrieve the data of a variable

Parameters:	varkey (str) – The name of the variable
Keyword Arguments:
	uncut (bool) – never return cut values

get_datacube()[source]¶

get_files(*args, **kwargs)[source]¶

Load files for this category. Uses HErmes.utils.files.harvest_files.

Parameters:

*args (list of strings) – Path to possible files

Keyword Arguments:

datasets (dict(dataset_id -> nfiles)) – if given, load only files from dataset dataset_id and set the nfiles parameter to the number of L2 files the loaded files will represent
force (bool) – forcibly reload filelist (pre-readout vars will be lost)
append (bool) – keep the already acquired files and only append the new ones
all other kwargs will be passed to utils.files.harvest_files

harvested¶

integrated_rate¶

Calculate the total eventrate of this category (requires weights)

Returns (tuple): rate and quadratic error
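
As a rough illustration of what "rate and quadratic error" can mean for weighted events, the sketch below sums the per-event weights and uses the square root of the summed squared weights as the uncertainty. This convention is an assumption, and the helper name integrated_rate_sketch is hypothetical and not part of HErmes.

```python
import math


def integrated_rate_sketch(weights):
    """Return (rate, error) for a sequence of per-event weights.

    Assumption: the rate is the sum of the weights and the "quadratic
    error" is sqrt(sum(w**2)). HErmes may compute this differently.
    """
    rate = math.fsum(weights)
    error = math.sqrt(math.fsum(w * w for w in weights))
    return rate, error


# Four events recorded in 2 s of livetime, each weighted with 1/livetime
rate, error = integrated_rate_sketch([0.5, 0.5, 0.5, 0.5])
print(rate, error)
```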

load_vardefs(module)[source]¶

Load the variable definitions from a module

Parameters:	module (python module) – Needs to contain variable definitions

raw_count¶

Gives a number of “how many events are actually there”

Returns:	int

read_variables(names=None, max_cpu_cores=6)[source]¶

Harvest the variables in self.vardict

Keyword Arguments:
	names (list) – harvest only these variables
	max_cpu_cores (int) – use a maximum of X cores of the cpu

show()[source]¶

Print out the names of the loaded variables

Returns:	dict (name, len)

undo_cuts()[source]¶: Conveniently undo a previous “apply_cuts”

variablenames¶

weights¶

weightvarname = None¶

class HErmes.selection.categories.CombinedCategory(name, categories)[source]¶

Bases: object

Create a combined category out of several others. This is mainly useful for plotting. FIXME: should this inherit from category as well? The difference compared to the dataset is that this is flat.

add_plotoptions(options)[source]¶

Add options on how to plot this category. If available, they will be used.

Parameters:	options (dict) – For the names which are currently supported, please see the example file

get(varname)[source]¶

integrated_rate¶

Calculate the total eventrate of this category (requires weights)

Returns (tuple): rate and quadratic error

vardict¶

weights¶

class HErmes.selection.categories.Data(name)[source]¶

Bases: HErmes.selection.categories.AbstractBaseCategory

An interface to real time event data. Simplified weighting only.

calculate_weights(model=None, model_args=None)[source]¶

Calculate weights as rate, that is number of events per livetime

Keyword Args: for compatibility…

estimate_livetime(force=False)[source]¶

Calculate the livetime from run start/stop times, account for gaps

Keyword Arguments:
	force (bool) – override existing livetime
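
One way to account for gaps, sketched here under assumptions, is to sum the duration of each run separately so that the time between runs never enters the total. The helper estimate_livetime_sketch and its arguments are hypothetical; the actual method works on the run start/stop variables of the category.

```python
def estimate_livetime_sketch(run_starts, run_stops):
    """Sum the duration of each run; time between runs is not counted.

    Assumption: starts and stops are paired per run and given in seconds.
    """
    if len(run_starts) != len(run_stops):
        raise ValueError("need one stop time per start time")
    livetime = 0.0
    for start, stop in zip(run_starts, run_stops):
        if stop < start:
            raise ValueError("run stops before it starts")
        livetime += stop - start
    return livetime


# Two runs of 100 s and 50 s, separated by a 20 s gap
print(estimate_livetime_sketch([0.0, 120.0], [100.0, 170.0]))
```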

livetime¶

set_livetime(livetime)[source]¶

Override the private _livetime member

Parameters:	livetime – The time needed for data-taking
Returns:	None

set_run_start_stop(runstart_var=<Variable: None>, runstop_var=<Variable: None>)[source]¶

Let the category know which variables describe the start and the stop of a run

Keyword Arguments:
	runstart_var (pyevsel.variables.variables.Variable/str) – beginning of a run
	runstop_var (pyevsel.variables.variables.Variable/str) – end of a run

set_weightfunction(func)[source]¶

class HErmes.selection.categories.ReweightedSimulation(name, mother)[source]¶

Bases: HErmes.selection.categories.Simulation

A proxy for simulation dataset, when only the weighting differs

add_livetime_weighted(other)[source]¶

Combine two datasets, weighted by livetime. If it is simulated data, then in general it does not know about the detector livetime. In this case the livetimes for the two datasets can be given.

Parameters:	other (pyevsel.categories.Category) – Add this dataset
Keyword Arguments:
	self_livetime (float) – the data livetime for this dataset
	other_livetime (float) – the data livetime for the other dataset

datasets¶

files¶

get(varname, uncut=False)[source]¶

Retrieve the data of a variable

Parameters:	varname (str) – The name of the variable
Keyword Arguments:
	uncut (bool) – never return cut values

harvested¶

mother¶

raw_count¶

Gives a number of “how many events are actually there”

Returns:	int

read_mc_primary(energy_var='mc_p_en', type_var='mc_p_ty', zenith_var='mc_p_ze', weight_var='mc_p_we')[source]¶

Trigger the readout of MC Primary information. Rename variables to magic keywords if necessary.

Keyword Arguments:
	energy_var (str) – simulated primary energy
	type_var (str) – simulated primary type
	zenith_var (str) – simulated primary zenith
	weight_var (str) – a weight, e.g. interaction probability

read_variables(names=None, max_cpu_cores=6)[source]¶

Harvest the variables in self.vardict

Keyword Arguments:
	names (list) – harvest only these variables
	max_cpu_cores (int) – use a maximum of X cores of the cpu

setter(other)¶

vardict¶

class HErmes.selection.categories.Simulation(name, weightvarname=None)[source]¶

Bases: HErmes.selection.categories.AbstractBaseCategory

An interface to variables from simulated data. Allows weighting of the events.

calculate_weights(model=None, model_args=None)[source]¶

Walk the variables of this category and identify the weighting variables and calculate them.

Usage example: calculate_weights(model=lambda x: np.power(x, -2.), model_args=["primary_energy"])

Keyword Arguments:
	model (func) – The target flux to weight to, if None, generated flux is used for weighting
	model_args (list) – The variables the model should be applied to from the variable dict
Returns:	np.ndarray
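
To make the role of model and model_args more concrete, the following sketch looks up the named variables in a plain dictionary and evaluates the model event by event. It covers only this lookup-and-evaluate step; how HErmes combines the result with the generated flux is not shown, and apply_model_sketch is a hypothetical helper.

```python
def apply_model_sketch(vardict, model, model_args):
    """Evaluate model event by event on the variables named in model_args.

    Assumption: vardict maps variable names to equally long sequences.
    """
    columns = [vardict[name] for name in model_args]
    return [model(*values) for values in zip(*columns)]


# An E^-2 power law applied to a made-up primary energy variable
variables = {"primary_energy": [1.0, 10.0, 100.0]}
flux = apply_model_sketch(variables, lambda x: x ** -2.0, ["primary_energy"])
print(flux)
```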

livetime¶

mc_p_readout¶

read_mc_primary(energy_var='mc_p_en', type_var='mc_p_ty', zenith_var='mc_p_ze', weight_var='mc_p_we')[source]¶

Trigger the readout of MC Primary information. Rename variables to magic keywords if necessary.

Keyword Arguments:
	energy_var (str) – simulated primary energy
	type_var (str) – simulated primary type
	zenith_var (str) – simulated primary zenith
	weight_var (str) – a weight, e.g. interaction probability

HErmes.selection.categories.cut_with_nans(data, cutmask)[source]¶

Cut the individual fields of a 2d array and keep the shape by filling up with nans

Parameters:	data (np.ndarray) – The array to cut
	cutmask (np.ndarray) – Cut with this boolean array
Returns:	data with applied cuts
Return type:	np.ndarray

HErmes.selection.cut module¶

Remove the part of the data which falls below a certain criterion.

class HErmes.selection.cut.Cut(*cuts, **kwargs)[source]¶

Bases: object

Cuts are basically conditions on a set of parameters.

variablenames¶: The names of the variables the cut will be applied to
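
The idea of a cut as a set of conditions can be sketched with plain Python lists. The (variable name, operator, threshold) tuple format and the helpers build_mask and apply_mask are assumptions made for this illustration; they are not the format the Cut class accepts.

```python
import operator


def build_mask(vardict, conditions):
    """Combine all conditions with a logical AND into one boolean mask."""
    nevents = len(next(iter(vardict.values())))
    mask = [True] * nevents
    for varname, op, threshold in conditions:
        values = vardict[varname]
        mask = [keep and op(v, threshold) for keep, v in zip(mask, values)]
    return mask


def apply_mask(values, mask):
    """Return a new list with the selected events; the input is untouched."""
    return [v for v, keep in zip(values, mask) if keep]


variables = {"energy": [1.0, 5.0, 20.0, 50.0], "zenith": [0.1, 1.2, 0.4, 2.0]}
conditions = [("energy", operator.gt, 2.0), ("zenith", operator.lt, 1.5)]
mask = build_mask(variables, conditions)
selected = apply_mask(variables["energy"], mask)
print(mask, selected)
```

Because the mask is applied to a copy, the original values stay available, which is what makes a cut reversible in this sketch.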

HErmes.selection.dataset module¶

Datasets group categories together. Method calls on datasets invoke the individual methods on the individual categories. Cuts applied to datasets will act on each individual category.

class HErmes.selection.dataset.Dataset(*args, **kwargs)[source]¶

Bases: object

Holds different categories, relays calls to each of them.
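
The relaying behaviour can be pictured as simple delegation. The classes ToyCategory and ToyDataset below are hypothetical stand-ins written for this sketch and share nothing with the real implementation beyond the method name add_cut.

```python
class ToyCategory:
    """Hypothetical stand-in for a category; only stores cuts."""

    def __init__(self, name):
        self.name = name
        self.cuts = []

    def add_cut(self, cut):
        self.cuts.append(cut)


class ToyDataset:
    """Holds categories and forwards calls to each of them."""

    def __init__(self, *categories):
        self.categories = list(categories)

    def add_cut(self, cut):
        # One call on the dataset reaches every category it holds
        for category in self.categories:
            category.add_cut(cut)


signal, background = ToyCategory("signal"), ToyCategory("background")
dataset = ToyDataset(signal, background)
dataset.add_cut("energy > 2")
print([(c.name, c.cuts) for c in dataset.categories])
```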

add_category(category)[source]¶

Add another category to the dataset

Parameters:	category (HErmes.selection.categories.Category) – add this category

add_cut(cut)[source]¶

Add a cut without applying it yet

Parameters:	cut (HErmes.selection.variables.cut.Cut) – Append this cut to the internal cutlist

add_variable(variable)[source]¶

Add a variable to this category

Parameters:	variable (HErmes.selection.variables.variables.Variable) – A Variable instance

apply_cuts(inplace=False)[source]¶: Apply them all!

calc_ratio(nominator=None, denominator=None)[source]¶

Calculate a ratio of the given categories

Parameters:	nominator (list) –
	denominator (list) –
Returns:	tuple

calculate_weights(model=None, model_args=None)[source]¶

Calculate the weights for all categories

Keyword Arguments:
	model (dict/func) – Either a dict catname -> func or a single func. If it is a single func, it will be applied to all categories
	model_args (dict/list) – variable names as arguments for the function

categorynames¶

combined_categorynames¶

delete_cuts()[source]¶: Completely purge all cuts from this dataset

delete_variable(varname)[source]¶

Delete a variable entirely from the dataset

Parameters:	varname (str) – the name of the variable
Returns:	None

distribution(name, ratio=([], []), cumulative=True, log=False, transform=None, color_palette='dark', normalized=False, styles={}, style='classic', ylabel='rate/bin [1/s]', axis_properties=None, ratiolabel='data/$\\Sigma$ bg', bins=None, external_weights=None, figure_factory=None)[source]¶

One-shot shortcut for one of the most used plots in event selections

Parameters:	name (string) – The name of the variable to plot
Keyword Arguments:
	path (str) – The path under which the plot will be saved.
	ratio (list) – A ratio plot of these categories will be created
	color_palette (str) – A predefined color palette (from seaborn or HErmes.plotting.colors)
	normalized (bool) – Normalize the histogram by number of events
	transform (callable) – Apply this transformation before plotting
	styles (dict) – plot styling options
	ylabel (str) – general label for y-axis
	ratiolabel (str) – different label for the ratio part of the plot
	bins (np.ndarray) – binning, if None binning will be deduced from the variable definition
	figure_factory (func) – factory function which returns a matplotlib.Figure
	style (string) – TODO “modern” || “classic” || “modern-cumul” || “classic-cumul”
	external_weights (dict) – supply external weights - this will OVERRIDE ANY INTERNALLY CALCULATED WEIGHTS and use the supplied weights instead. Must be in the form {"categoryname": weights}
	axis_properties (dict) – Manually define a plot layout with up to three axes. For example, it can look like this:
		{
		  "top":    {"type": "h",      # histogram
		             "height": 0.4,    # height in percent
		             "index": 2},      # used internally
		  "center": {"type": "r",      # ratio plot
		             "height": 0.2,
		             "index": 1},
		  "bottom": {"type": "c",      # cumulative histogram
		             "height": 0.2,
		             "index": 0}
		}
Returns:	HErmes.selection.variables.VariableDistributionPlot

drop_empty_variables()[source]¶

Delete variables which have no len

Returns:	None

files¶

get_category(categoryname)[source]¶

Get a reference to a category.

Parameters:	categoryname – A name which has to be associated to a category
Returns:	HErmes.selection.categories.Category

get_sparsest_category(omit_empty_cat=True)[source]¶

Find out which category of the dataset has the least statistical power

Keyword Arguments:
	omit_empty_cat (bool) – if a category has no entries at all, omit
Returns:	category name
Return type:	str
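
A minimal sketch of such a lookup, assuming that "least statistical power" simply means the smallest raw event count. The helper sparsest_category_sketch and the dictionary of counts are hypothetical.

```python
def sparsest_category_sketch(raw_counts, omit_empty_cat=True):
    """Return the name of the category with the fewest events.

    Assumption: statistical power is judged by the raw event count only.
    """
    candidates = {
        name: count
        for name, count in raw_counts.items()
        if count > 0 or not omit_empty_cat
    }
    if not candidates:
        raise ValueError("no category left to compare")
    return min(candidates, key=candidates.get)


counts = {"signal": 120, "background": 15, "exotic": 0}
print(sparsest_category_sketch(counts))
print(sparsest_category_sketch(counts, omit_empty_cat=False))
```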

get_variable(varname)[source]¶

Get a pandas dataframe for all categories

Parameters:	varname (str) – A name of a variable
Returns:	A 2d dataframe category -> variable
Return type:	pandas.DataFrame

integrated_rate¶

Integrated rate for each category

Returns:	rate with error
Return type:	pandas.Panel

load_vardefs(vardefs)[source]¶

Load the variable definitions from a module

Parameters:	vardefs (python module/dict) – A module needs to contain variable definitions. It can also be a dictionary of categoryname->module

read_variables(names=None, max_cpu_cores=6)[source]¶

Read out the variable for all categories

Keyword Arguments:
	names (str) – Readout only these variables if given
	max_cpu_cores (int) – Maximum number of cpu cores which will be used
Returns:	None

set_default_plotstyles(styledict)[source]¶

Define a standard for each category how it should appear in plots

Parameters:	styledict (dict) –

set_livetime(livetime)[source]¶

Define a livetime for this dataset.

Parameters:	livetime (float) – Time interval the data was taken in. (Used for rate calculation)
Returns:	None

set_weightfunction(weightfunction=<function Dataset.<lambda>>)[source]¶

Defines a function which is used for weighting

Parameters:	weightfunction (func or dict) – if a func is provided, set it for all categories. If needed, provide a dict cat.name -> func for individual settings
Returns:	None

sum_rate(categories=None)[source]¶

Sum up the integrated rates for categories

Parameters:	categories – categories considered background
Returns:	rate with error
Return type:	tuple

tinytable(signal=None, background=None, layout='v', format='html', order_by=<function Dataset.<lambda>>, livetime=1.0)[source]¶

Use dashi.tinytable.TinyTable to render a nice html representation of a rate table

Parameters:	signal (list) – summing up signal categories to calculate total signal rate
	background (list) – summing up background categories to calculate total background rate
	layout (str) – “v” for vertical, “h” for horizontal
	format (str) – “html”, “latex”, “wiki”
Returns:	formatted table in desired markup
Return type:	str

undo_cuts()[source]¶: Undo previously done cuts, but keep them so that they can be re-applied

variablenames¶

weights¶: Get the weights for all categories in this dataset

HErmes.selection.dataset.get_label(category)[source]¶

Get the label for labeling plots from a category’s plot_options dictionary.

Parameters:	category (HErmes.selection.categories.category) – Query the category’s plot_options dict; if no label is found there, fall back to category.name
Returns:	string

HErmes.selection.magic_keywords module¶

All magic keywords shall summon here

HErmes.selection.variables module¶

Container classes for variables

class HErmes.selection.variables.AbstractBaseVariable[source]¶

Bases: object

Read out tagged numerical data from files

ROLES¶: alias of VariableRole

bins¶

calculate_fd_bins()[source]¶

Calculate a reasonable binning

Returns:	Freedman Diaconis bins
Return type:	numpy.ndarray

data¶

declare_harvested()[source]¶

harvest(*files)[source]¶

Hook to the harvest method. Don’t use in case of multiprocessing!

Parameters:	*files – walk through these files and read out

harvested¶

ndim¶

rewire_variables(vardict)[source]¶

class HErmes.selection.variables.CompoundVariable(name, variables=None, label='', bins=None, operation=<function CompoundVariable.<lambda>>, role=<VariableRole.SCALAR: 10>)[source]¶

Bases: HErmes.selection.variables.AbstractBaseVariable

Calculate a variable from other variables. This kind of variable will not read any file.

harvest(*filenames)[source]¶

Hook to the harvest method. Don’t use in case of multiprocessing!

Parameters:	*filenames – walk through these files and read out

rewire_variables(vardict)[source]¶: Use this to avoid having to read out variables twice. As the variables are copied over by the categories, the reference is lost. It can be rewired, though.

class HErmes.selection.variables.Variable(name, definitions=None, bins=None, label='', transform=<function Variable.<lambda>>, role=<VariableRole.SCALAR: 10>, nevents=None, reduce_dimension=None)[source]¶

Bases: HErmes.selection.variables.AbstractBaseVariable

A hook to a single variable read out from a file

rewire_variables(vardict)[source]¶

Make sure all the variables are connected properly. This is only needed for combined/compound variables

Returns:	None

class HErmes.selection.variables.VariableList(name, variables=None, label='', bins=None, role=<VariableRole.SCALAR: 10>)[source]¶

Bases: HErmes.selection.variables.AbstractBaseVariable

A list of variables. Cannot be read out from files.

data¶

harvest(*filenames)[source]¶

Hook to the harvest method. Don’t use in case of multiprocessing!

Parameters:	*filenames – walk through these files and read out

rewire_variables(vardict)[source]¶: Use this to avoid having to read out variables twice. As the variables are copied over by the categories, the reference is lost. It can be rewired, though.

class HErmes.selection.variables.VariableRole[source]¶

Bases: enum.Enum

Define roles for variables. Some variables used in a special context (like weights) are easily recognizable by this flag.

ARRAY = 20¶

ENDTIME = 70¶

EVENTID = 50¶

GENERATORWEIGHT = 30¶

RUNID = 40¶

SCALAR = 10¶

STARTIME = 60¶

UNKNOWN = 0¶
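
How such a role flag helps to pick out special variables can be sketched with a small enum. The class Role is a hypothetical stand-in that mirrors a few of the values listed above, and the variable names are made up.

```python
import enum


class Role(enum.Enum):
    """Hypothetical stand-in mirroring some of the values listed above."""

    UNKNOWN = 0
    SCALAR = 10
    ARRAY = 20
    GENERATORWEIGHT = 30


# Made-up variables, each tagged with a role
roles = {"energy": Role.SCALAR, "pulses": Role.ARRAY, "ow": Role.GENERATORWEIGHT}

# Variables needed for weighting are found by their role, not by their name
weight_vars = [name for name, role in roles.items() if role is Role.GENERATORWEIGHT]
print(weight_vars)
```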

HErmes.selection.variables.extract_from_root(filename, definitions, nevents=None, reduce_dimension=None)[source]¶

Use the uproot system to get information from rootfiles. Supports a basic tree-like structure of primitive datatypes.

Parameters:	filename (str) – datafile
	definitions (list) – tree and branch addresses
Keyword Arguments:
	nevents (int) – number of events to read out
	reduce_dimension (int) – If data is vector-type, reduce it by taking the n-th element

HErmes.selection.variables.freedman_diaconis_bins(data, leftedge, rightedge, minbins=20, maxbins=70, fallbackbins=70)[source]¶

Get a number of bins for a histogram following Freedman/Diaconis

Parameters:	leftedge (float) – left bin edge
	rightedge (float) – right bin edge
	minbins (int) – the minimum number of bins
	maxbins (int) – the maximum number of bins
	fallbackbins (int) – a number of bins which is returned if the calculation fails
Returns:	number of bins, minbins < bins < maxbins
Return type:	nbins (int)
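
For orientation, the Freedman-Diaconis rule sets the bin width to 2 * IQR / n^(1/3). The sketch below applies it with the standard library only; the clamping to [minbins, maxbins] and the conditions under which fallbackbins is returned are assumptions, and fd_bins_sketch is not the HErmes implementation.

```python
import math
import statistics


def fd_bins_sketch(data, leftedge, rightedge, minbins=20, maxbins=70, fallbackbins=70):
    """Number of bins from the Freedman-Diaconis rule, clamped to a range.

    Assumption: fallbackbins is returned when the rule cannot be
    evaluated (fewer than two points or zero interquartile range).
    """
    if len(data) < 2:
        return fallbackbins
    q1, _, q3 = statistics.quantiles(data, n=4)
    width = 2.0 * (q3 - q1) / len(data) ** (1.0 / 3.0)
    if width <= 0:
        return fallbackbins
    nbins = math.ceil((rightedge - leftedge) / width)
    return max(minbins, min(maxbins, nbins))


data = [float(i) for i in range(1000)]
print(fd_bins_sketch(data, 0.0, 999.0))
```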

HErmes.selection.variables.harvest(filenames, definitions, **kwargs)[source]¶

Read variables from files into memory. Will be used by HErmes.selection.variables.Variable.harvest. This will be run multi-threaded. Keep in mind that the arguments have to be picklable, and everything which is read out must be picklable as well. Lambda functions are NOT picklable.

Parameters:	filenames (list) – the files to extract the variables from. Currently supported: hdf
	definitions (list) – where to find the data in the files. They usually have some tree-like structure, so this is a list of leaf-value pairs. If there is more than one, all of them will be tried (as it might be that a different naming scheme was used in some files). Example: [("hello_reconstruction", "x"), ("hello_reconstruction", "y")]
Keyword Arguments:
	transformation (func) – After the data is read out from the files, the transformation will be applied, e.g. the log to the energy.
	fill_empty (bool) – Fill empty fields with zeros
	nevents (int) – ROOT only - read out only nevents from the files
	reduce_dimension (str) – ROOT only - multidimensional data can be reduced by only using the index given by reduce_dimension. E.g. in case of a TVector3, if we want to have only x, that would be 0, y -> 1 and z -> 2.
	FIXME – Not implemented yet!
	precision (int) – Precision in bit
Returns:	pd.Series or pd.DataFrame
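
The picklability requirement can be checked up front with the standard pickle module. The sketch below compares a function that is importable by name with an equivalent lambda; is_picklable is a hypothetical helper, not part of HErmes.

```python
import math
import pickle


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# A function importable by name can be shipped to a worker process ...
print(is_picklable(math.log10))
# ... while an equivalent lambda cannot
print(is_picklable(lambda energy: math.log10(energy)))
```

For a transformation such as the logarithm of the energy, passing a named function like math.log10 instead of a lambda avoids the problem.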

Module contents¶

Provides containers for in-memory variables. These containers are called “categories”, and they represent a set of variables for a certain type of data. Categories can be further grouped into “Datasets”. Variables can be read out from files and stored in memory in the form of numpy arrays or pandas Series/DataFrames. Selection criteria can be applied simultaneously (and reversibly) to all categories in a dataset with the “Cut” class.

HErmes.selection provides the following submodules:

categories : Container classes for variables.
dataset : Grouping categories together.
cut : Apply selection criteria on variables in a category.
variables : Variable definition. Harvest variables from files.
magic_keywords : A bunch of fixed names for automatic weight calculation.

HErmes.selection.load_dataset(config, variables=None, max_cpu_cores=6)[source]¶

Read a json configuration file and load a dataset populated with variables from the files given in the configuration file.

Parameters:	config (str/dict) – json style config file or dict
Keyword Arguments:
	variables (list) – list of strings of variable names to read out
	max_cpu_cores (int) – maximum number of cpu cores to use for variable readout
Returns:	HErmes.selection.dataset.Dataset