RSGISLib Species Distribution Modelling Module

The Species Distrubution Modelling (SDM) module provides a set of tools and utilities for undertaking a species distribution modelling project.

These functions have been developed by Osian Roberts.

Running a Species Distribution Modelling using RSGISLib

To run a species distribution model you first need to define the input variables and parameters:

import rsgislib.sdm

# Define simulation parameters within a Python dictionary:
ContinuousLayers = ['./Data/Environmental_Predictors/Annual_Insolation.tif', './Data/Environmental_Predictors/Elevation.tif', './Data/Environmental_Predictors/Forest_Distance.tif',
                    './Data/Environmental_Predictors/Summer_Temp.tif', './Data/Environmental_Predictors/Topographic_Wetness.tif', './Data/Environmental_Predictors/Wind_Velocity.tif',
                    './Data/Environmental_Predictors/Winter_Temp.tif']

CategoricalLayers = ['./Data/Environmental_Predictors/Corine_CLC_2018.tif', './Data/Environmental_Predictors/Forest_Type.tif']

simulation = {}
simulation['Name'] = 'Test_Run_Extra_Trees'
simulation['Output Directory'] = './' + simulation['Name']
simulation['Images'] = ContinuousLayers + CategoricalLayers
simulation['Categorical'] = [False for i in ContinuousLayers] + [True for i in CategoricalLayers]  # list of booleans where True == categorical variable.
simulation['Null Value'] = 0  # No data value for the Continuous and Categorical raster layers.
simulation['Valid Mask'] = './Data/Wales_Binary_Mask.tif'
simulation['Environmental Predictors'] = [image.split('/')[-1].replace('.tif', '') for image in simulation['Images']]  # define names of the environmental predictors.

# define the study region using the binary mask image:
geotransform, projection, xsize, ysize, noDataVal = rsgislib.sdm.read_raster_information(simulation['Valid Mask'])
simulation['geotransform'] = geotransform
simulation['projection'] = projection
simulation['rasterxsize'] = xsize
simulation['rasterysize'] = ysize
del geotransform, projection, xsize, ysize, noDataVal

The sighting records then need processing and the pseudo-absences creating:

# read xy coordinates for the presence records:
presence_xy = rsgislib.sdm.read_sightings_vector('./Data/Squirrel_Sightings/Red_Squirrel.geojson')  # import from OGR vector file.
#presence_xy = rsgislib.sdm.read_sightings_csv('./Data/Squirrel_Sightings/Red_Squirrel.csv', sep=',', xfield='x (OSGB)', yfield='y (OSGB)')  # import from CSV file.

# extract values from input images for presence records:
presence_data = rsgislib.sdm.extract_raster_values(simulation, presence_xy)

# generate pseudo-absence x&y coordinates:
absence_xy = rsgislib.sdm.generate_pseudoabsences(simulation, n_points=15000)

# (optional) drop pseudo-absence points that are proximal to presence records (e.g. within 500 metres):
absence_xy = rsgislib.sdm.drop_proximal_records(presence_xy, absence_xy, distance=500)

# (optional) ensure that number of presence records == number of absence records:
# Note: this is an important consideration particularly for decision-tree estimators.
absence_xy = rsgislib.sdm.equalise_records(presence_xy, absence_xy)

# (optional) export pseudo-absence xy coordinates to a CSV:
#rsgislib.sdm.export_pseudoabsence_xy(absence_xy, 'Pseudoabsence_Coords_Extra_Trees.csv')

# extract values from input images for absence records:
absence_data = rsgislib.sdm.extract_raster_values(simulation, absence_xy)
del presence_xy, absence_xy

# drop records with no valid data:
presence_data = rsgislib.sdm.drop_null_records(simulation, presence_data)
absence_data = rsgislib.sdm.drop_null_records(simulation, absence_data)

# combine the presence and absence data into a single array:
presence_absence_data = rsgislib.sdm.combine_records(presence_data, absence_data)
del presence_data, absence_data

Finally, you can run the simulation but you also need to create an instance of the estimator you want use:

Extra-Tree / Random Forests:

# define the estimator:
from sklearn.ensemble import ExtraTreesClassifier as ET
estimator = ET(n_estimators=100, max_depth=6, min_samples_split=4, min_samples_leaf=2, min_weight_fraction_leaf=0.0, max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07, bootstrap=False, n_jobs=4)

# run the simulation:
rsgislib.sdm.apply_decision_tree_estimator(simulation, presence_absence_data, estimator, validation='split-sample', test_fraction=0.1, replicates=10, gdalformat='GTiff', clf_labels='Presence-Absence', Overwrite=True)
del presence_absence_data, simulation

Logistic Regression (MaxEnt):

# define the estimator:
from sklearn.linear_model import LogisticRegression as maxent
estimator = maxent(solver='liblinear', max_iter=500, penalty='l2', tol=0.0001, C=0.1, fit_intercept=True, intercept_scaling=1, n_jobs=1)

# run the simulation:
rsgislib.sdm.apply_meta_estimator(simulation, presence_absence_data, estimator, validation='split-sample', test_fraction=0.1, replicates=10, gdalformat='GTiff', clf_labels='Presence-Absence', Overwrite=True)
del presence_absence_data, simulation

Gradiant Boosting:

# define the estimator:
from sklearn.ensemble import GradientBoostingClassifier as GBT
estimator = GBT(loss='deviance', learning_rate=0.01, n_estimators=100, subsample=1.0, min_samples_split=4, min_samples_leaf=2, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_split=1e-07, max_features=None)

# run the simulation:
rsgislib.sdm.apply_decision_tree_estimator(simulation, presence_absence_data, estimator, validation='split-sample', test_fraction=0.1, replicates=10, gdalformat='GTiff', clf_labels='Presence-Absence', Overwrite=True)
del presence_absence_data, simulation

Docker

If you are running these functions from within Docker (or Singularity) then you may get an error from matplotlib either a ‘Segmentation Fault’ or error related to the ‘Backend’. This is due to fast that there is no graphical interface within a docker image and therefore we need to use an appropriate ‘backend’.

To change to use the Agg backend the following import and function need to run before the rsgislib.sdm import:

import matplotlib
matplotlib.use('Agg')

Python Future Warnings

From the scikit-learn functions you sometimes get future warning printed to the console. You can set the following filters to the top of your script prevent these printing to the console:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Loading Data

rsgislib.sdm.read_sightings_vector(InputVector)

A function to read occurrence records from an OGR supported vector file (e.g. ESRI Shapefile). This must be a point dataset, not a line or polygon.

Parameters

problem – A python dictionary containing the simulation parameters.

Returns

a numpy.ndarray containing the x & y coordinates, shape=(n_samples, 2).

rsgislib.sdm.read_sightings_csv(InputCSV, sep=',', xfield='x', yfield='y')

A function to read occurrence records from a comma separated text file. Returns a numpy.array containing the x & y coordinates, shape=(n_samples, 2).

Parameters
  • InputCSV – Path to the input CSV file.

  • sep – Column separator / delimiter.

  • xfield – String defining the column name for the x coordinates (aka longitude, eastings).

  • yfield – String defining the column name for the y coordinates (aka latitude, northings).

rsgislib.sdm.extract_raster_values(problem, xy)

A function to extract raster values using x & y coordinates.

Parameters
  • problem – A python dictionary containing the simulation parameters.

  • xy – a numpy.ndarray containing x & y coordinates, shape=(n_samples, 2).

Returns

a structured np.recarray containing the extracted raster values.

Manipulate Records

rsgislib.sdm.generate_pseudoabsences(problem, n_points=10000)

A function to generate random pseudoabsence records.

Required Parameters: :param problem: A python dictionary containing the simulation parameters. :param n_points: number of pseudo-absence records to generate. Default is 10000 (same as Maxent). :return: A np.ndarray of tupled x & y coordinates, shape=(n_samples, 2).

rsgislib.sdm.export_pseudoabsence_xy(absence_records, OutputCSV)

A function to export pseudo-absence x & y coordinates to a comma separated text file.

Parameters
  • absence_records – np.ndarray containing x & y coordinates of pseudo-absence records, shape=(n_samples, 2).

  • OutputCSV – Path to the output CSV.

rsgislib.sdm.equalise_records(presence_records, absence_records)

A function to ensure that the number of absence records == presence records. Absence records are removed if the number of absence records > presence records.

Parameters
  • presence_data – a np.ndarray or np.recarray of presence data, shape(n_samples, n_features).

  • absence_data – a np.ndarray or np.recarray of absence data, shape(n_samples, n_features).

Returns

a np.array of absence records where n_absences == n_presences. If the number of absence records < presence records, the original array is returned without modification.

rsgislib.sdm.drop_proximal_records(presence_records, absence_records, distance, k=6)

A function to remove absence records that are close to presence records. This function works best for geographic coordinates (i.e. not projected).

Parameters
  • presence_records – np.ndarray containing x & y coordinates of presence records, shape=(n_samples, 2).

  • absence_records – np.ndarray containing x & y coordinates of absence records, shape=(n_samples, 2).

  • distance – A distance threshold to remove proximal absence records.

  • k – number of nearest neighbours to find. Default = 6.

Returns

Returns x & y coordinates of valid absence records.

rsgislib.sdm.drop_null_records(problem, records)

A function to remove presence or absence records with no valid data.

Parameters
  • problem – A python dictionary containing the simulation parameters.

  • records – a np.recarray containing values extracted from environmental rasters.

Returns

a numpy.recarray containing valid presence-absence records.

rsgislib.sdm.combine_records(presence_data, absence_data)

A function to concatenate and classify presence-absence records. Absences are assigned values = 0, presence = 1.

Parameters
  • presence_data – a np.recarray containing presence data.

  • absence_data – a np.recarray containing absence data.

Returns

a np.recarray containing presence-absence data, shape=(n_samples, n_predictors + label_column).

Run SDM

rsgislib.sdm.apply_decision_tree_estimator(params, presence_absence_data, estimator=None, validation='split-sample', test_fraction=0.15, replicates=10, gdalformat='GTiff', clf_labels='Presence-Absence', Overwrite=True, plot_out_format='png', plot_out_dpi=300)

A function to model species distributions using scikit-learn decision tree classifiers supporting the methods feature_importances_ and predict_proba. Use with: DecisionTreeClassifier, GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier or AdaBoostClassifier.

Parameters
  • params – A python dictionary containing the simulation parameters.

  • presence_absence_data – np.recarray containing presence-absence training data.

  • estimator – An sklearn decision tree classifier with methods: feature_importances_ and predict_proba. If None, uses sklearn.tree.DecisionTreeClassifier by default.

  • validation – ‘bootstrap’ or ‘split-sample’ resampling methods for cross-validation are currently supported.

  • test_fraction – Float defining the proportion of presence-absence data to set aside for validation. Only used in split-sample validation. Default = 0.15 (i.e. 15%).

  • replicates – Number of model iterations to perform (default=10).

  • gdalformat – Format of output images. Valid options “GTiff” or “KEA”.

  • clf_labels – Name of column in presence_absence_data delineating the binary class labels (presence=1, absence=0).

  • Overwrite – Boolean option to overwrite the output directory if it already exists.

  • plot_out_format – Output format for plots (options: png or pdf; default png)

  • plot_out_dpi – resolution of plot output images. Default 300.

rsgislib.sdm.apply_meta_estimator(params, presence_absence_data, estimator=None, validation='split-sample', test_fraction=0.15, replicates=10, max_depth=3, gdalformat='GTiff', clf_labels='Presence-Absence', Overwrite=True, plot_out_format='png', plot_out_dpi=300)

A function to model species distributions using scikit-learn classifiers supporting the method predict_proba. Works with: LogisticRegression (aka maxent) and MLPClassifier (perhaps others?). Is very slow with SVC.

To avoid encoding each categorical variable, this function uses gradient boosted decision trees to perform a feature transform. The leaves within the fitted decision tree are then encoded into sparse binary values and a meta-estimator (e.g. logistic regression / maxent) is then fitted onto the transformed features/variables. This often leads to better results than can be achieved by fitting the meta-estimator directly. This method is related to manifold learning & stacked regression (where a model is trained from the predictions of previous estimators).

References: https://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html https://www.quora.com/Why-do-people-use-gradient-boosted-decision-trees-to-do-feature-transform Breiman, L. (1996). Stacked regressions. Machine learning, 24(1), 49-64.

Parameters
  • params – A python dictionary containing the simulation parameters.

  • presence_absence_data – np.recarray containing presence-absence training data.

  • estimator – An sklearn classifier supporting a predict_proba method. If None, uses sklearn.linear_model.LogisticRegression (Maxent) by default.

  • validation – ‘bootstrap’ or ‘split-sample’ resampling methods for cross-validation are currently supported.

  • test_fraction – Float defining the proportion of presence-absence data to set aside for validation. Only used in split-sample validation. Default = 0.15 (i.e. 15%).

  • replicates – Number of model iterations to perform (default=10).

  • max_depth – Maximum number of nodes in the decision tree used to transform the presence absence data. When max_depth is high, a very detailed transform is undertaken. Conversely, setting max_depth to a low value (e.g. 3) limits overfitting.

  • gdalformat – Format of output images. Valid options “GTiff” or “KEA”.

  • clf_labels – Name of column in presence_absence_data delineating the binary class labels (presence=1, absence=0).

  • Overwrite – Boolean option to overwrite the output directory if it already exists.

  • plot_out_format – Output format for plots (options: png or pdf; default png)

  • plot_out_dpi – resolution of plot output images. Default 300.

Utilities

rsgislib.sdm.read_raster_information(InputImage)
A function to read a GDAL image to determine:
  • geot = Image geotransform.

  • srs = Coordinate system.

  • xsize, ysize = Image dimensions along the x & y axes.

  • noDataVal = NoDataValue for invalid pixels.

Parameters

InputImage – path to the input raster.