# RSGISLib Stats Tools

## Histogram Thresholds

rsgislib.tools.stats.calc_otsu_threshold(data)

A function to calculate otsu’s threshold for a dataset. Input is expected to be a 1d numpy array.

Parameters

data – 1d numeric numpy array

Returns

float (threshold)

rsgislib.tools.stats.calc_yen_threshold(data)

A function to calculate yen threshold for a dataset. Input is expected to be a 1d numpy array.

Yen J.C., Chang F.J., and Chang S. (1995) “A New Criterion for Automatic Multilevel Thresholding” IEEE Trans. on Image Processing, 4(3): 370-378. DOI:10.1109/83.366472

Parameters

data – 1d numeric numpy array

Returns

float (threshold)

rsgislib.tools.stats.calc_isodata_threshold(data)

A function to calculate inter-means threshold for a dataset. Input is expected to be a 1d numpy array. Histogram-based threshold, known as Ridler-Calvard method or inter-means. Threshold values returned satisfy the following equality:

threshold = (data[data <= threshold].mean() + data[data > threshold].mean()) / 2.0

Ridler, TW & Calvard, S (1978), “Picture thresholding using an iterative selection method” IEEE Transactions on Systems, Man and Cybernetics 8: 630-632, DOI:10.1109/TSMC.1978.4310039

Parameters

data – 1d numeric numpy array

Returns

float (threshold)

rsgislib.tools.stats.calc_hist_cross_entropy(data, threshold)

A function which computes the cross-entropy between distributions above and below a threshold. Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events.

See Li and Lee (1993); this is the objective function `threshold_li` minimizes. This function can be improved but this implementation most closely matches equation 8 in Li and Lee (1993) and equations 1-3 in Li and Tam (1998).

Li C.H. and Lee C.K. (1993) “Minimum Cross Entropy Thresholding” Pattern Recognition, 26(4): 617-625 DOI:10.1016/0031-3203(93)90115-D

Li C.H. and Tam P.K.S. (1998) “An Iterative Algorithm for Minimum Cross Entropy Thresholding” Pattern Recognition Letters, 18(8): 771-776 DOI: 10.1016/S0167-8655(98)00057-9

Parameters
• data – 1d numeric numpy array

• threshold – float spliting to the two parts of the histogram.

Returns

float (cross-entropy target value)

rsgislib.tools.stats.calc_li_threshold(data, tolerance=None, initial_guess=None)

A function which calculates a threshold value by Li’s iterative Minimum Cross Entropy method.

Li C.H. and Lee C.K. (1993) “Minimum Cross Entropy Thresholding” Pattern Recognition, 26(4): 617-625 DOI:10.1016/0031-3203(93)90115-D

Parameters
• data – 1d numeric numpy array

• tolerance – float (optional) - Finish the computation when the change in the threshold in an iteration is less than this value. By default, this is half the smallest difference between data values.

• initial_guess – float (optional) - Li’s iterative method uses gradient descent to find the optimal threshold. If the histogram contains more than two modes (peaks), the gradient descent could get stuck in a local optimum. An initial guess for the iteration can help the algorithm find the globally-optimal threshold.

Returns

float (threshold)

rsgislib.tools.stats.calc_kurt_skew_threshold(data, max_val, min_val, init_thres, low_thres=True, contamination=10.0, only_kurtosis=False)

A function to calculate a threshold either side of the histogram based on

Parameters
• data – 1d numeric numpy array

• max_val

• min_val

• init_thres

• low_thres

• contamination

• only_kurtosis

Returns

## Histograms

rsgislib.tools.stats.get_nbins_histogram(data)

Calculating the number of bins and the width of those bins for a histogram.

Parameters

data – 1-d numpy array.

Returns

(n_bins, bin_width) n_bins: int for the number of bins. bin_width: float with the width of the bins.

rsgislib.tools.stats.get_bin_centres(bin_edges, geometric=False)

A function to calculate the centre points of bins from the bin edges from a histogram e.g., numpy.histogram. My default the arithmetic mean is provided (max+min)/2 but the geometric mean can also be calculated sqrt(min*max), this is useful for logarithmically spaced bins.

Parameters
• bin_edges – numpy array of the bin edges

• geometric – boolean, if False (default) then the arithmetic mean return if True then the geometric mean is returned.

Returns

bin_centres - numpy array

## Feature Selection

rsgislib.tools.stats.corr_feature_selection(df, dep_vars, ind_vars, n_min_clusters=3, n_max_clusters=25)

A function which performs a correlation based feature selection. This analysis will cluster the independent (predictor) variables based on the Pearson correlation distance metric. The Silhouette coefficient (Rousseeuw, 1987) is used to find the optimal number of clusters.

Parameters
• df – pandas dataframe where the columns are the predictor variables

• dep_vars – list of dependent variables within the dataframe

• ind_vars – list of independent (predictor) variables within the dataframe

• n_min_clusters – The minimum number of clusters within the search (Default: 3)

• n_max_clusters – The maximum number of clusters within the search (Default: 25)

Returns

list of column names for good predictor variables

rsgislib.tools.stats.lassolars_feature_selection(df, dep_vars, ind_vars, alpha_val=None)

A function which undertake regularisation-based feature selection using the LassoLars regressor in Scikit-Learn. the Lasso (least absolute shrinkage and selection operator) regression algorithm is linear model that uses L1 regularisation to assign coefficients of zero to uninformative predictor variables (effectively eliminating them from the regression model). The LARS algorithm (Efron et al., 2004) provides a means of estimating which variables to include in the model, as well as their coefficients.

Parameters
• df – pandas dataframe where the columns are the predictor variables

• dep_vars – list of dependent variables within the dataframe

• ind_vars – list of independent (predictor) variables within the dataframe

• alpha_val – Value of the regularization parameter (alpha) for the Lasso estimator. If None then the value will be estimated using Bayes Information criterion (BIC) and cross-validation. (Default: None).

Returns

list of column names for good predictor variables

## Statistical Tests

rsgislib.tools.stats.breusch_pagan_test(x, y)

A function to perform a Breusch-Pagan test for heteroskedasticity in a linear model: H_0 = No heteroskedasticity. H_1 = Heteroskedasticity is present.

Returns a list containing three elements: 1. the Breusch-Pagan test statistic. 2. the p-value for the test. 3. the test result.

Parameters
• x – a numpy.ndarray containing the predictor variables. Shape = (nSamples, nPredictors).

• y – a 1D numpy.ndarray containing the response variable. Shape = (nSamples, ).

Returns

list containing 3 elements (Breusch-Pagan test statistic, p-value for the test, test result)

rsgislib.tools.stats.calc_pandas_vif(df, cols=None)

A function to calculate variance inflation factors to investigate multicollinearity between predictor variables.

Interpretation of VIF scores (somewhat subjective): 1 = No multicollinearity. 1-5 = Moderate multicollinearity. > 5 = High multicollinearity. > 10 = This predictor should be removed from the model.

Parameters
• df – pandas dataframe where the columns are the predictor variables

• cols – list of columns in the dataframe

Returns

A pandas series containing the VIF for each predictor variable.

```df = pandas.read_csv('metrics.csv')
cols = list(df.columns)
# Subset to the column names of interest
ind_vars = cols[6:]
vifs_series = calc_pandas_vif(df, ind_vars)
vifs_series.to_csv('VIF_scores.csv')
```
rsgislib.tools.stats.cqv_threshold(df, cols=None, lowthreshold=0.25, highthreshold=0.75)

A function to remove features with low & high variance based on the coefficient of quartile variation (CQV).

Low CQV == uninformative predictor. High CQV == unstable predictor variable.

Inspired by sklearn.feature_selection.VarianceThreshold(), but more robust since the coefficient of quartile variation is independent of feature scaling. It is also less susceptible to outliers than the coefficient of variation.

Parameters
• df – pandas dataframe where the columns are the predictor variables

• cols – list of columns in the dataframe

• lowthreshold – Float defining the CQV below which the predictors are uninfomative

• highthreshold – Float defining the CQV above which the predictors are unreliable

Returns

list of column names for good predictor variables

## Summary Statistics

rsgislib.tools.stats.bin_accuracy_scores_prob(y_true, y_prob)

A function to calculate accuracy measures for probabilistic responses with sklearn and scipy. Function written by Osian Roberts.

Parameters
• y_true – binary class labels, where 0 is absence and 1 is presence.

• y_prob – probability of presence scores e.g., generated by a species distribution model.

Returns

a list containing two arrays - metrics = names of test metrics. scores = test scores for each metric.

rsgislib.tools.stats.accuracy_scores_binary(y_true, y_pred)

A function to calculate accuracy measures for a binary classification. Function written by Osian Roberts.

Parameters: :param y_true: observed binary labels, where 0 is absence and 1 is presence. :param y_pred: predicted binary labels, where 0 is absence and 1 is presence. :returns: a list containing two numpy.arrays - (metrics: name of test metrics, scores: test scores for each metric)

Reference: See pages 253 - 255 in: Guisan et al. (2017). Habitat suitability and distribution models: with applications in R.

rsgislib.tools.stats.bias_score(y_true, y_pred)

A function to calculate the regression model bias for each response variable in a multivariate regression.

Parameters
• y_true – numpy.ndarray of true y values, shape=(n_samples, n_responses).

• y_true – numpy.ndarray of predicted y values, shape=(n_samples, n_responses).

Returns

bias (absolute bias), norm_bias (normalised bias (aka percentage bias))

rsgislib.tools.stats.decompose_bias_variance(y_true, y_pred)

A function to perform bias-variance decomposition. Decompose the mean squared error into: Bias^2 + Variance + Irreducible Error. Useful example: https://towardsdatascience.com/mse-and-bias-variance-decomposition-77449dd2ff55

Parameters
• y_true – numpy.ndarray of true y values, shape=(n_samples, n_responses).

• y_true – numpy.ndarray of predicted y values, shape=(n_samples, n_responses).

Returns

set (mse, bias_squared, variance, noise)

## Numpy Array Manipulation

rsgislib.tools.stats.mask_data_to_valid(data: numpy.array, lower_limit: Optional[float] = None, upper_limit: Optional[float] = None)

A function which removes rows from an nxm numpy array which are not finite or outside of the upper and lower thresholds provided.

Parameters
• data – nxm numpy array (n rows, m features)

• lower_limit – lower threshold valid data is greater than this value.

• upper_limit – upper threshold valid data is lower than this value.