RSGISLib Modelling
Species Distribution Modelling
Generate Samples
- rsgislib.modelling.species_distribution.gen_pseudo_absences_smpls(in_msk_img: str, img_msk_val: int, out_vec_file: str, out_vec_lyr: str, n_smpls: int = 10000, xtra_n_smpls: int = 30000, presence_smpls_vec_file: str = None, presence_smpls_vec_lyr: str = None, presence_smpls_dist_thres: float = 1000, out_format: str = 'GeoJSON', rnd_seed: int = None)
A function which generates pseudo-absence samples for species distribution modelling. Note that all input files are expected to have the same projection.
- Parameters:
in_msk_img – File path to a valid mask image which defines the region of interest.
img_msk_val – The value within the mask image which defines the region of interest
out_vec_file – Output vector file path
out_vec_lyr – Output vector layer name
n_smpls – the maximum number of samples to be outputted
xtra_n_smpls – the number of extra samples to be produced internally so after the various masking steps the final number is likely to be near or higher than n_smpls
presence_smpls_vec_file – Optionally a set of presence samples can be provided and this is the file path to that file. If None then ignored (Default: None).
presence_smpls_vec_lyr – Optionally a set of presence samples can be provided and this is the layer name of the vector.
presence_smpls_dist_thres – If presence samples are provided, this is the distance threshold to the presence samples below which absence points are removed. The unit is that of the projection of the input files.
out_format – The output vector format (Default: GeoJSON)
rnd_seed – Optionally provide a random seed for the random number generator. Default: None.
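The distance filtering described for presence_smpls_dist_thres can be pictured with a small standalone sketch. This is not the RSGISLib implementation: the helper name filter_pseudo_absences, the toy coordinates and the uniform random candidates are assumptions made for illustration, and the real function also restricts candidates using in_msk_img.

```python
import math
import random


def filter_pseudo_absences(candidates, presences, dist_thres, n_smpls):
    """Keep candidates that are not closer than dist_thres to any presence.

    Hypothetical helper for illustration only; points are (x, y) tuples in
    the same projected units as dist_thres.
    """
    kept = []
    for cx, cy in candidates:
        too_close = any(
            math.hypot(cx - px, cy - py) < dist_thres for px, py in presences
        )
        if not too_close:
            kept.append((cx, cy))
        if len(kept) == n_smpls:
            break  # n_smpls is the maximum number of samples returned
    return kept


rng = random.Random(42)  # fixed seed, analogous to rnd_seed
presences = [(5000.0, 5000.0)]
# Draw more candidates than needed, analogous to xtra_n_smpls.
candidates = [(rng.uniform(0, 10000), rng.uniform(0, 10000)) for _ in range(400)]
absences = filter_pseudo_absences(
    candidates, presences, dist_thres=1000.0, n_smpls=100
)
```

Generating surplus candidates first is what makes it likely that enough points survive the filtering to approach n_smpls.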
- rsgislib.modelling.species_distribution.combine_presence_absence_data(presence_smpls_vec_file: str, presence_smpls_vec_lyr: str, absence_smpls_vec_file: str, absence_smpls_vec_lyr: str, env_vars: Dict[str, EnvVarInfo], out_vec_file: str, out_vec_lyr: str, out_format: str = 'GPKG', equalise_smpls: bool = False, cls_col: str = 'clsid', rnd_seed: int = None) → List[str]
A function which combines the presence and absence data into a single set with the option to equalise the number of samples within the two classes. The output file will only have the columns listed within env_vars and the classification column.
- Parameters:
presence_smpls_vec_file – path to the vector file with the presence data.
presence_smpls_vec_lyr – layer name of the vector file with the presence data.
absence_smpls_vec_file – path to the vector file with the absence data.
absence_smpls_vec_lyr – layer name of the vector file with the absence data.
env_vars – A dictionary of environment variables populated onto both the presence and absence data.
out_vec_file – the output vector file populated with presence and absence data.
out_vec_lyr – the output vector layer name.
out_format – the output vector file format (Default: GPKG)
equalise_smpls – optionally equalise the number of samples in the two classes. This would normally be done if you are using tree-based modelling (e.g., random forests).
cls_col – the name of the classification column.
rnd_seed – A seed for the random selection. Default: None.
- Returns:
a list of the variables in the order in which they are present.
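As a rough picture of what equalise_smpls might do, the sketch below randomly subsamples the larger class down to the size of the smaller one. This is only one plausible reading of the option, not a statement of how RSGISLib implements it; the helper name equalise_classes and the toy records are hypothetical.

```python
import random


def equalise_classes(presences, absences, rnd_seed=None):
    """Randomly subsample both classes to the size of the smaller one.

    Hypothetical helper; assumes equalising means down-sampling.
    """
    rng = random.Random(rnd_seed)
    n = min(len(presences), len(absences))
    return rng.sample(presences, n), rng.sample(absences, n)


# Toy records: 30 presences (clsid 1) and 100 absences (clsid 0).
presences = [{"clsid": 1, "id": i} for i in range(30)]
absences = [{"clsid": 0, "id": i} for i in range(100)]
eq_pres, eq_abs = equalise_classes(presences, absences, rnd_seed=1)
```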
- rsgislib.modelling.species_distribution.create_train_test_sets(vec_file: str, vec_lyr: str, train_vec_file: str, train_vec_lyr: str, test_vec_file: str, test_vec_lyr: str, split_prop: float = 0.8, cls_col: str = 'clsid', out_format: str = 'GPKG', rnd_seed: int = None)
A function which splits the input vector layer into training and testing sets. The presences (1) and absences (0) are processed separately so that the proportions of presences and absences are maintained in the training and testing sets.
- Parameters:
vec_file – Input vector file path.
vec_lyr – Input vector layer name.
train_vec_file – Output training vector file path.
train_vec_lyr – Output training vector layer name.
test_vec_file – Output testing vector file path.
test_vec_lyr – Output testing vector layer name.
split_prop – The proportion of the data to be used as the training set. Default is 0.8 (i.e., 80 % for training and 20 % for testing).
cls_col – the classification column name.
out_format – the output vector file format
rnd_seed – Optionally provide a random seed for the random number generator. Default: None.
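The class-wise split described above can be sketched in plain Python as follows. The helper name stratified_split and the toy rows are assumptions for illustration; the library function works on vector layers rather than lists of dictionaries.

```python
import random


def stratified_split(rows, split_prop=0.8, cls_col="clsid", rnd_seed=None):
    """Split presences (1) and absences (0) separately, then merge.

    Hypothetical helper; each class contributes split_prop of its rows
    to the training set so class proportions are preserved.
    """
    rng = random.Random(rnd_seed)
    train, test = [], []
    for cls_val in (1, 0):
        cls_rows = [r for r in rows if r[cls_col] == cls_val]
        rng.shuffle(cls_rows)
        n_train = int(round(len(cls_rows) * split_prop))
        train.extend(cls_rows[:n_train])
        test.extend(cls_rows[n_train:])
    return train, test


rows = [{"clsid": 1} for _ in range(50)] + [{"clsid": 0} for _ in range(200)]
train, test = stratified_split(rows, split_prop=0.8, rnd_seed=7)
```

With these toy counts both sets keep the original 1:4 ratio of presences to absences.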
Populate Samples
- rsgislib.modelling.species_distribution.extract_env_var_data(env_vars: Dict[str, EnvVarInfo], smpls_vec_file: str, smpls_vec_lyr: str, out_vec_file: str, out_vec_lyr: str, out_format: str = 'GPKG', out_no_data_val: float = -9999, replace: bool = False)
A function which extracts the environment variable values for a set of samples (either presences or absences).
- Parameters:
env_vars – A dictionary of environment variables.
smpls_vec_file – path to the vector file.
smpls_vec_lyr – layer name of the vector file.
out_vec_file – output vector file populated with environment variables.
out_vec_lyr – output vector layer name.
out_format – output vector file format (Default: GPKG)
out_no_data_val – output no data value (Default: -9999)
- rsgislib.modelling.species_distribution.pop_normalise_coeffs(env_vars: Dict[str, EnvVarInfo], vec_file: str, vec_lyr: str)
A function which populates the env_vars dictionary of EnvVarInfo objects with the normalisation coefficients (mean and standard deviation) calculated from the inputted vector data. The inputted vector data is expected to be the combined presences and absences data.
- Parameters:
env_vars – A dictionary of environment variables populated onto both the presence and absence data. The EnvVarInfo will be populated with the mean and standard deviation.
vec_file – the input vector file path
vec_lyr – the input vector layer name
- rsgislib.modelling.species_distribution.apply_normalise_coeffs(env_vars: Dict[str, EnvVarInfo], vec_file: str, vec_lyr: str, out_vec_file: str, out_vec_lyr: str, out_format: str = 'GPKG')
A function which normalises the continuous variables using the mean and standard deviation provided within the env_vars dictionary.
- Parameters:
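env_vars – A dictionary of environment variables. The EnvVarInfo objects provide the mean and standard deviation used for the normalisation.
vec_file – the input vector file path
vec_lyr – the input vector layer name
out_vec_file – the output vector file path
out_vec_lyr – the output vector layer name
out_format – the output vector file format (Default: GPKG)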
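Normalising by mean and standard deviation, as described for pop_normalise_coeffs and apply_normalise_coeffs, amounts to a z-score transform. The sketch below shows the arithmetic only; the helper names are hypothetical, and the use of the population standard deviation is an assumption (the library may use the sample form).

```python
import statistics


def calc_norm_coeffs(values):
    """Return (mean, standard deviation) for one continuous variable."""
    return statistics.mean(values), statistics.pstdev(values)


def apply_norm(values, norm_mean, norm_std):
    """Apply the z-score transform (value - mean) / std to each value."""
    return [(v - norm_mean) / norm_std for v in values]


# Toy values for one continuous variable from the combined samples.
temperature = [10.0, 12.0, 14.0, 16.0, 18.0]
norm_mean, norm_std = calc_norm_coeffs(temperature)
normalised = apply_norm(temperature, norm_mean, norm_std)
```

Keeping the coefficients separate from the transform mirrors the two-step design: the same mean and standard deviation can later be reused when a model is applied to image data.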
- class rsgislib.modelling.species_distribution.EnvVarInfo(name: str = None, file: str = None, band: int = 1, data_type: int = 0, min_vld_val: float = None, max_vld_val: float = None, norm_mean: float = None, norm_std: float = None)
This is a class to store the parameters defining each environment variable.
- Parameters:
name – Name of the variable.
file – Image file path for the variable.
band – Band in the image representing the variable.
data_type – Variable type (Either rsgislib.VAR_TYPE_CONTINUOUS or rsgislib.VAR_TYPE_CATEGORICAL).
min_vld_val – The minimum valid value for the variable.
max_vld_val – The maximum valid value for the variable.
norm_mean – Mean value for normalisation.
norm_std – standard deviation value for normalisation.
Summary Statistics
- rsgislib.modelling.species_distribution.comparison_box_plots(env_vars: Dict[str, EnvVarInfo], vec_file: str, vec_lyr: str, out_dir: str = 'boxplots', cls_col: str = 'clsid', box_plots: bool = False)
Create plots for each of the variables comparing the presence and absence classes.
- Parameters:
env_vars – A dictionary of environment variables populated onto both the presence and absence data.
vec_file – the input vector file path
vec_lyr – the input vector layer name
out_dir – output directory where the plots will be saved
cls_col – the column name of the class variable
box_plots – boolean flag indicating whether to plot boxplots (True) or violin plots (False; Default)
- rsgislib.modelling.species_distribution.correlation_matrix(env_vars: Dict[str, EnvVarInfo], vec_file: str, vec_lyr: str, out_corr_file: str = 'correlation_matrix.csv', out_plt_file: str = 'correlation_matrix.png', fig_width: int = 15, fig_height: int = 14)
Calculate the correlation matrix between all the variables and optionally output a plot of the correlation matrix.
- Parameters:
env_vars – A dictionary of environment variables populated onto both the presence and absence data.
vec_file – the input vector file path
vec_lyr – the input vector layer name
out_corr_file – output correlation matrix CSV file
out_plt_file – output plot file (if None then ignored)
fig_width – The width of the plot figure
fig_height – The height of the plot figure
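A correlation matrix over the variables can be sketched in plain Python as below. The example assumes Pearson correlation, which the text above does not specify, and the helper names and toy columns are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pearson_matrix(columns):
    """Return a nested dict of pairwise correlations keyed by variable name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy columns standing in for variables populated onto the samples.
columns = {
    "elev": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temp": [10.0, 8.0, 6.0, 4.0, 2.0],
    "rain": [2.0, 4.0, 5.0, 4.0, 5.0],
}
corr = pearson_matrix(columns)
```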
- rsgislib.modelling.species_distribution.calc_vif_multicollinearity(env_vars: List[str], vec_file: str, vec_lyr: str, out_vif_file: str = 'vif_multicollinearity.csv')
A function to calculate variance inflation factors to investigate multicollinearity between predictor variables.
Interpretation of VIF scores (somewhat subjective): 1 = No multicollinearity. 1-5 = Moderate multicollinearity. > 5 = High multicollinearity. > 10 = This predictor should be removed from the model.
- Parameters:
env_vars – A list of the names of the environment variables populated onto both the presence and absence data.
vec_file – the input vector file path
vec_lyr – the input vector layer name
out_vif_file – output VIF CSV file
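The variance inflation factor for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. With only two predictors, R^2 reduces to the squared Pearson correlation, which keeps the sketch below short. The helper names are hypothetical and the thresholds simply restate the interpretation given above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def vif_two_predictors(x1, x2):
    """VIF for the two-predictor case, where R^2 is the squared correlation."""
    r2 = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r2)


def interpret_vif(vif):
    """Map a VIF score onto the (somewhat subjective) categories above."""
    if vif > 10:
        return "remove predictor"
    if vif > 5:
        return "high"
    if vif > 1:
        return "moderate"
    return "none"


vif = vif_two_predictors([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])
label = interpret_vif(vif)
```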
Model Fitting
- rsgislib.modelling.species_distribution.search_mdl_params(search_obj: BaseSearchCV, train_vec_file: str, train_vec_lyr: str, analysis_vars: List[str], cls_col: str = 'clsid') → Tuple[BaseEstimator, Dict[str, Any]]
A function which will run a scikit-learn search (e.g., GridSearchCV) to find optimal parameters for the model estimator.
- Parameters:
search_obj – A scikit-learn SearchCV object
train_vec_file – file path to a vector file with the training data.
train_vec_lyr – vector layer name for the training data.
analysis_vars – a list of environmental variables to be used for the analysis. The names must be the column names within the vector layer.
cls_col – the name of the column specifying the class within the input vector layer.
- Returns:
returns the estimator initialised with the best parameters and a dictionary of the best parameters.
- rsgislib.modelling.species_distribution.fit_sklearn_mdl(est_cls_obj: BaseEstimator, train_vec_file: str, train_vec_lyr: str, test_vec_file: str, test_vec_lyr: str, analysis_vars: List[str], cls_col: str = 'clsid', roc_curve_plot: str = None) → Tuple[float, float, float]
A function which fits a single scikit-learn estimator model returning the accuracy statistics and optionally plotting a ROC curve.
- Parameters:
est_cls_obj – a scikit-learn estimator model
train_vec_file – file path to a vector file with the training data.
train_vec_lyr – vector layer name for the training data.
test_vec_file – file path to a vector file with the testing data.
test_vec_lyr – vector layer name for the testing data.
analysis_vars – a list of environmental variables to be used for the analysis. The names must be the column names within the vector layer.
cls_col – the name of the column specifying the class within the input vector layers.
roc_curve_plot – A file path for the ROC curve plot to be outputted.
- Returns:
returns the training accuracy, testing accuracy and ROC AUC score.
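For readers unfamiliar with the returned statistics, the sketch below computes an accuracy and a ROC AUC score by hand for a toy set of labels and predicted probabilities. It only illustrates the metrics and is not how the library computes them; the helper names and the 0.5 threshold are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a presence outranks an absence.

    Ties between a presence score and an absence score count as half.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
# Threshold the probabilities at 0.5 to obtain class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
test_acc = accuracy(y_true, y_pred)
test_auc = roc_auc(y_true, y_score)
```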
- rsgislib.modelling.species_distribution.fit_sklearn_slg_cls_mdl(est_cls_obj: BaseEstimator, train_vec_file: str, train_vec_lyr: str, test_vec_file: str, test_vec_lyr: str, analysis_vars: List[str], cls_col: str = 'clsid', roc_curve_plot: str = None) → Tuple[float, float]
A function which fits a single scikit-learn outlier estimator model returning the accuracy statistics and optionally plotting a ROC curve.
- Parameters:
est_cls_obj – a scikit-learn estimator model
train_vec_file – file path to a vector file with the training data.
train_vec_lyr – vector layer name for the training data.
test_vec_file – file path to a vector file with the testing data.
test_vec_lyr – vector layer name for the testing data.
analysis_vars – a list of environmental variables to be used for the analysis. The names must be the column names within the vector layer.
cls_col – the name of the column specifying the class within the input vector layers.
roc_curve_plot – A file path for the ROC curve plot to be outputted.
- Returns:
returns the testing accuracy and ROC AUC score.
- rsgislib.modelling.species_distribution.fit_kfold_sklearn_mdls(est_cls_obj: BaseEstimator, train_vec_file: str, train_vec_lyr: str, test_vec_file: str, test_vec_lyr: str, analysis_vars: List[str], n_kfolds: int = 10, fold_prop: float = 0.6, cls_col: str = 'clsid', rnd_seed: int = None, sel_replacement: bool = True) → List[BaseEstimator]
Fit an ensemble of classifiers using K-fold selection of training data subsets.
- Parameters:
est_cls_obj – a scikit-learn estimator model
train_vec_file – file path to a vector file with the training data.
train_vec_lyr – vector layer name for the training data.
test_vec_file – file path to a vector file with the testing data.
test_vec_lyr – vector layer name for the testing data.
analysis_vars – a list of environmental variables to be used for the analysis. The names must be the column names within the vector layer.
n_kfolds – The number of models to fit from training data subsets
fold_prop – The proportion of the training data set to use for each model. The proportion should be between 0 and 1. The subset will be randomly selected.
cls_col – the name of the column specifying the class within the input vector layers.
rnd_seed – Optionally provide a random seed for the random number generator. Default: None.
sel_replacement – Optionally replace the training data when creating the subsets. Default: True.
- Returns:
returns a list of scikit-learn estimator models
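The subset selection controlled by n_kfolds and fold_prop can be pictured as drawing several random subsets of the training rows, one per model. The sketch below assumes each subset is drawn independently from the full training set and does not attempt to model sel_replacement, whose exact semantics are not spelled out above; the helper name is hypothetical.

```python
import random


def draw_training_subsets(n_rows, n_kfolds=10, fold_prop=0.6, rnd_seed=None):
    """Return n_kfolds lists of row indices, each covering fold_prop of rows.

    Hypothetical helper; each subset is sampled independently from all rows.
    """
    rng = random.Random(rnd_seed)
    n_sub = int(round(n_rows * fold_prop))
    return [sorted(rng.sample(range(n_rows), n_sub)) for _ in range(n_kfolds)]


subsets = draw_training_subsets(n_rows=100, n_kfolds=10, fold_prop=0.6, rnd_seed=3)
# One estimator would then be fitted per subset to build the ensemble.
```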
- rsgislib.modelling.species_distribution.fit_kfold_sklearn_sgl_cls_mdls(est_cls_obj: BaseEstimator, train_vec_file: str, train_vec_lyr: str, test_vec_file: str, test_vec_lyr: str, analysis_vars: List[str], n_kfolds: int = 10, fold_prop: float = 0.6, cls_col: str = 'clsid', rnd_seed: int = None, sel_replacement: bool = True) → List[BaseEstimator]
Fit an ensemble of classifiers using K-fold selection of training data subsets.
- Parameters:
est_cls_obj – a scikit-learn estimator model
train_vec_file – file path to a vector file with the training data.
train_vec_lyr – vector layer name for the training data.
test_vec_file – file path to a vector file with the testing data.
test_vec_lyr – vector layer name for the testing data.
analysis_vars – a list of environmental variables to be used for the analysis. The names must be the column names within the vector layer.
n_kfolds – The number of models to fit from training data subsets
fold_prop – The proportion of the training data set to use for each model. The proportion should be between 0 and 1. The subset will be randomly selected.
cls_col – the name of the column specifying the class within the input vector layers.
rnd_seed – Optionally provide a random seed for the random number generator. Default: None.
sel_replacement – Optionally replace the training data when creating the subsets. Default: True.
- Returns:
returns a list of scikit-learn estimator models
Model Explanation
- rsgislib.modelling.species_distribution.shap_sklearn_mdl_explainer(est_cls_obj: BaseEstimator, train_vec_file: str, train_vec_lyr: str, analysis_vars: List[str], shap_summary_plot: str = None, shap_heatmap_plot: str = None, shap_scatter_plots_dir: str = None, shap_depend_plots_dir: str = None, subsample_n_smpls: int = None, use_tree_explainer: bool = False, use_linear_explainer: bool = False)
This function uses SHAP to output feature importance and dependence plots for the est_cls_obj model. If neither use_tree_explainer nor use_linear_explainer is set, then the KernelExplainer will be used, which works with all models.
- Parameters:
est_cls_obj – a scikit-learn estimator model
train_vec_file – file path to a vector file with the training data.
train_vec_lyr – vector layer name for the training data.
analysis_vars – a list of environmental variables to be used for the analysis. The names must be the column names within the vector layer.
shap_summary_plot – Output summary plot of SHAP scores.
shap_heatmap_plot – Output heatmap of SHAP scores.
shap_scatter_plots_dir – Output directory for SHAP scatter plots
shap_depend_plots_dir – Output directory for SHAP dependence plots
subsample_n_smpls – Optionally use a subset of the training samples for the SHAP analysis.
use_tree_explainer – The est_cls_obj is a tree classifier (e.g., Random Forests) and therefore the SHAP TreeExplainer can be used.
use_linear_explainer – The est_cls_obj is a linear classifier (e.g., Logistic Regression) and therefore the SHAP LinearExplainer can be used.
- rsgislib.modelling.species_distribution.salib_sklearn_mdl_sensitity(est_cls_obj: BaseEstimator, train_vec_file: str, train_vec_lyr: str, analysis_vars: List[str], n_samples: int = 10000, sobol_plot_file: str = None, sobol_overall_file: str = None, sobol_first_file: str = None, sobol_second_file: str = None)
This function uses the SALib Sobol sensitivity analysis to assess the feature importance and the first and second order responses within the est_cls_obj model.
- Parameters:
est_cls_obj – a scikit-learn estimator model
train_vec_file – file path to a vector file with the training data.
train_vec_lyr – vector layer name for the training data.
analysis_vars – a list of environmental variables to be used for the analysis. The names must be the column names within the vector layer.
n_samples – the number of samples to be generated and used for the sobol modelling.
sobol_plot_file – Output sobol plot file path.
sobol_overall_file – Output overall sobol variance values CSV file.
sobol_first_file – Output first order sobol variance values CSV file.
sobol_second_file – Output second order sobol variance values CSV file.
- rsgislib.modelling.species_distribution.sklearn_mdl_variable_response_curves(est_cls_obj: BaseEstimator, train_vec_file: str, train_vec_lyr: str, analysis_vars: List[str], output_file: str, response_plots_dir: str, n_samples: int = 1000, cls_col: str = 'clsid', normalised_data: bool = False, env_vars: Dict[str, EnvVarInfo] = None)
A function to generate response curves for a scikit-learn classifier. To generate the response curves, the mean presence values are used for the environment variables and then each variable is varied in turn across the full range of values within the presence and absence training data.
- Parameters:
est_cls_obj – a scikit-learn estimator model
train_vec_file – file path to a vector file with the training data.
train_vec_lyr – vector layer name for the training data.
analysis_vars – a list of environmental variables to be used for the analysis. The names must be the column names within the vector layer.
output_file – output CSV file with the response curve data.
response_plots_dir – output directory where the response curve plots will be outputted.
n_samples – the number of samples to be generated to generate the response curves
cls_col – the name of the column specifying the class within the input vector layers.
normalised_data – boolean specifying whether the input data is normalised. If the data is normalised then, to aid interpretation of the resulting response curves, the original values will be used. Default: False
env_vars – If the data is normalised then a dictionary of environment variables populated onto the train_vec_lyr data if required. The EnvVarInfo will provide the normalisation (mean and standard deviation) values.
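The response curve procedure described above (hold every variable at its mean presence value, then sweep one variable across its full range) can be sketched as follows. The function toy_predict_prob is a hypothetical stand-in for a trained model's probability output, and the helper name and toy rows are likewise assumptions.

```python
import math


def response_curve(predict_prob, presence_rows, all_rows, var_idx, n_samples=5):
    """Vary one variable across its range with the others at presence means."""
    n_vars = len(presence_rows[0])
    base = [
        sum(r[i] for r in presence_rows) / len(presence_rows)
        for i in range(n_vars)
    ]
    lo = min(r[var_idx] for r in all_rows)
    hi = max(r[var_idx] for r in all_rows)
    curve = []
    for k in range(n_samples):
        val = lo + (hi - lo) * k / (n_samples - 1)
        sample = list(base)
        sample[var_idx] = val  # only this variable departs from the mean
        curve.append((val, predict_prob(sample)))
    return curve


def toy_predict_prob(sample):
    """Hypothetical model: presence probability rises with variable 0."""
    return 1.0 / (1.0 + math.exp(-(sample[0] - 5.0)))


presence_rows = [[6.0, 1.0], [8.0, 3.0]]
absence_rows = [[0.0, 0.0], [10.0, 4.0]]
curve = response_curve(
    toy_predict_prob, presence_rows, presence_rows + absence_rows, var_idx=0
)
```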
Apply Models
- rsgislib.modelling.species_distribution.pred_sklearn_mdl_prob(est_cls_obj: BaseEstimator, in_msk_img: str, img_msk_val: int, env_vars: Dict[str, EnvVarInfo], analysis_vars: List[str], output_img: str, gdalformat: str = 'KEA', normalise_data: bool = False, calc_img_stats: bool = True)
A function which calculates the probability of the presence of the species of interest using the trained model. Note that the est_cls_obj must provide the predict_proba function, which is used by this function.
- Parameters:
est_cls_obj – a scikit-learn estimator model
in_msk_img – File path to a valid mask image which defines the region of interest.
img_msk_val – The value within the mask image which defines the region of interest
env_vars – A dictionary of environment variables populated onto both the presence and absence data. The EnvVarInfo will provide the image file path, image band and normalisation (mean and standard deviation) values.
analysis_vars – a list of environmental variables to be used for the analysis - specifies the order of the variables presented to the classifier.
output_img – the output image path
gdalformat – the output image format
normalise_data – boolean specifying whether to normalise the input data. This should be used if the model was trained with normalised data. The mean and standard deviation of variables used for normalisation should be provided through the env_vars dict of EnvVarInfo objects.
calc_img_stats – boolean specifying whether image statistics are calculated and pyramids are built for the output image. Default: True
- rsgislib.modelling.species_distribution.pred_sklearn_mdl_cls(est_cls_obj: BaseEstimator, in_msk_img: str, img_msk_val: int, env_vars: Dict[str, EnvVarInfo], analysis_vars: List[str], output_img: str, gdalformat: str = 'KEA', normalise_data: bool = False, calc_img_stats: bool = True)
A function which calculates the binary classification for the presence of the species of interest using the trained model. Note that the est_cls_obj must provide the predict function, which is used by this function.
- Parameters:
est_cls_obj – a scikit-learn estimator model
in_msk_img – File path to a valid mask image which defines the region of interest.
img_msk_val – The value within the mask image which defines the region of interest
env_vars – A dictionary of environment variables populated onto both the presence and absence data. The EnvVarInfo will provide the image file path, image band and normalisation (mean and standard deviation) values.
analysis_vars – a list of environmental variables to be used for the analysis - specifies the order of the variables presented to the classifier.
output_img – the output image path
gdalformat – the output image format
normalise_data – boolean specifying whether to normalise the input data. This should be used if the model was trained with normalised data. The mean and standard deviation of variables used for normalisation should be provided through the env_vars dict of EnvVarInfo objects.
calc_img_stats – boolean specifying whether image statistics are calculated and pyramids are built for the output image. Default: True
- rsgislib.modelling.species_distribution.pred_ensemble_sklearn_mdls_prob(est_cls_objs: List[BaseEstimator], in_msk_img: str, img_msk_val: int, env_vars: Dict[str, EnvVarInfo], analysis_vars: List[str], output_img: str, gdalformat: str = 'KEA', tmp_dir: str = 'tmp_dir', normalise_data: bool = False, calc_img_stats: bool = True)
A function which runs an ensemble of trained classifier models estimating the probability of the species presence. The ensemble results are combined by calculating the median of the probability from each model.
- Parameters:
est_cls_objs – A list of scikit-learn estimator models which have been trained.
in_msk_img – File path to a valid mask image which defines the region of interest.
img_msk_val – The value within the mask image which defines the region of interest
env_vars – A dictionary of environment variables populated onto both the presence and absence data. The EnvVarInfo will provide the image file path, image band and normalisation (mean and standard deviation) values.
analysis_vars – a list of environmental variables to be used for the analysis - specifies the order of the variables presented to the classifier.
output_img – the output image path
gdalformat – the output image format
tmp_dir – the temporary directory where the intermediate image files will be outputted.
normalise_data – boolean specifying whether to normalise the input data. This should be used if the model was trained with normalised data. The mean and standard deviation of variables used for normalisation should be provided through the env_vars dict of EnvVarInfo objects.
calc_img_stats – boolean specifying whether image statistics are calculated and pyramids are built for the output image. Default: True
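Combining an ensemble by the per-pixel median of the model probabilities can be illustrated with a few toy values. The helper name and the flat lists standing in for image pixels are assumptions; the library function works on image files.

```python
import statistics


def ensemble_median(prob_stacks):
    """Per-pixel median across models.

    prob_stacks holds one list of per-pixel probabilities per model.
    """
    return [statistics.median(px) for px in zip(*prob_stacks)]


# Three models, three pixels each.
model_probs = [
    [0.2, 0.9, 0.5],
    [0.4, 0.7, 0.5],
    [0.9, 0.8, 0.1],
]
median_prob = ensemble_median(model_probs)
```

The median is less sensitive than the mean to a single model producing an outlying probability for a pixel.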
- rsgislib.modelling.species_distribution.pred_ensemble_sklearn_mdls_cls(est_cls_objs: List[BaseEstimator], in_msk_img: str, img_msk_val: int, env_vars: Dict[str, EnvVarInfo], analysis_vars: List[str], output_img: str, gdalformat: str = 'KEA', tmp_dir: str = 'tmp_dir', normalise_data: bool = False, calc_img_stats: bool = True)
A function which runs an ensemble of trained classifier models estimating the binary classification of the species presence. The ensemble results are combined by calculating the number of times a pixel is included in the presences class from each model.
- Parameters:
est_cls_objs – A list of scikit-learn estimator models which have been trained.
in_msk_img – File path to a valid mask image which defines the region of interest.
img_msk_val – The value within the mask image which defines the region of interest
env_vars – A dictionary of environment variables populated onto both the presence and absence data. The EnvVarInfo will provide the image file path, image band and normalisation (mean and standard deviation) values.
analysis_vars – a list of environmental variables to be used for the analysis - specifies the order of the variables presented to the classifier.
output_img – the output image path
gdalformat – the output image format
tmp_dir – the temporary directory where the intermediate image files will be outputted.
normalise_data – boolean specifying whether to normalise the input data. This should be used if the model was trained with normalised data. The mean and standard deviation of variables used for normalisation should be provided through the env_vars dict of EnvVarInfo objects.
calc_img_stats – boolean specifying whether image statistics are calculated and pyramids are built for the output image. Default: True
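For the classification ensemble, the combination step counts how many models place each pixel in the presence class. A minimal sketch with toy values, where the helper name and the flat lists standing in for image pixels are assumptions:

```python
def ensemble_presence_count(cls_stacks):
    """Per-pixel count of models predicting presence (1).

    cls_stacks holds one list of per-pixel binary predictions per model.
    """
    return [sum(px) for px in zip(*cls_stacks)]


# Three models, three pixels each.
model_cls = [
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
]
presence_count = ensemble_presence_count(model_cls)
```

A count equal to the number of models indicates unanimous agreement on presence, so the output can be thresholded afterwards to obtain a consensus map.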
- rsgislib.modelling.species_distribution.pred_ensemble_sklearn_slg_cls_mdls_prob(est_cls_objs: List[BaseEstimator], in_msk_img: str, img_msk_val: int, env_vars: Dict[str, EnvVarInfo], analysis_vars: List[str], output_img: str, gdalformat: str = 'KEA', tmp_dir: str = 'tmp_dir', normalise_data: bool = False, calc_img_stats: bool = True)
A function which runs an ensemble of trained classifier models estimating the probability of the species presence. The ensemble results are combined by calculating the median of the probability from each model.
- Parameters:
est_cls_objs – A list of scikit-learn estimator models which have been trained.
in_msk_img – File path to a valid mask image which defines the region of interest.
img_msk_val – The value within the mask image which defines the region of interest
env_vars – A dictionary of environment variables populated onto both the presence and absence data. The EnvVarInfo will provide the image file path, image band and normalisation (mean and standard deviation) values.
analysis_vars – a list of environmental variables to be used for the analysis - specifies the order of the variables presented to the classifier.
output_img – the output image path
gdalformat – the output image format
tmp_dir – the temporary directory where the intermediate image files will be outputted.
normalise_data – boolean specifying whether to normalise the input data. This should be used if the model was trained with normalised data. The mean and standard deviation of variables used for normalisation should be provided through the env_vars dict of EnvVarInfo objects.
calc_img_stats – boolean specifying whether image statistics are calculated and pyramids are built for the output image. Default: True
- rsgislib.modelling.species_distribution.pred_ensemble_sklearn_sgl_cls_mdls_cls(est_cls_objs: List[BaseEstimator], in_msk_img: str, img_msk_val: int, env_vars: Dict[str, EnvVarInfo], analysis_vars: List[str], output_img: str, gdalformat: str = 'KEA', tmp_dir: str = 'tmp_dir', normalise_data: bool = False, calc_img_stats: bool = True)
A function which runs an ensemble of trained classifier models estimating the binary classification of the species presence. The ensemble results are combined by calculating the number of times a pixel is included in the presences class from each model.
- Parameters:
est_cls_objs – A list of scikit-learn estimator models which have been trained.
in_msk_img – File path to a valid mask image which defines the region of interest.
img_msk_val – The value within the mask image which defines the region of interest
env_vars – A dictionary of environment variables populated onto both the presence and absence data. The EnvVarInfo will provide the image file path, image band and normalisation (mean and standard deviation) values.
analysis_vars – a list of environmental variables to be used for the analysis - specifies the order of the variables presented to the classifier.
output_img – the output image path
gdalformat – the output image format
tmp_dir – the temporary directory where the intermediate image files will be outputted.
normalise_data – boolean specifying whether to normalise the input data. This should be used if the model was trained with normalised data. The mean and standard deviation of variables used for normalisation should be provided through the env_vars dict of EnvVarInfo objects.
calc_img_stats – boolean specifying whether image statistics are calculated and pyramids are built for the output image. Default: True
Other Tools
- rsgislib.modelling.species_distribution.create_finite_mask(in_msk_img: str, img_msk_val: int, env_vars: Dict[str, EnvVarInfo], analysis_vars: List[str], output_img: str, gdalformat: str = 'KEA', tmp_dir: str = 'tmp_msk_dir')