RSGISLib Species Distribution Modelling Module¶
The Species Distrubution Modelling (SDM) module provides a set of tools and utilities for undertaking a species distribution modelling project.
These functions have been developed by Osian Roberts.
Running a Species Distribution Modelling using RSGISLib¶
To run a species distribution model you first need to define the input variables and parameters:
import rsgislib.sdm
# Define simulation parameters within a Python dictionary:
ContinuousLayers = ['./Data/Environmental_Predictors/Annual_Insolation.tif', './Data/Environmental_Predictors/Elevation.tif', './Data/Environmental_Predictors/Forest_Distance.tif',
'./Data/Environmental_Predictors/Summer_Temp.tif', './Data/Environmental_Predictors/Topographic_Wetness.tif', './Data/Environmental_Predictors/Wind_Velocity.tif',
'./Data/Environmental_Predictors/Winter_Temp.tif']
CategoricalLayers = ['./Data/Environmental_Predictors/Corine_CLC_2018.tif', './Data/Environmental_Predictors/Forest_Type.tif']
simulation = {}
simulation['Name'] = 'Test_Run_Extra_Trees'
simulation['Output Directory'] = './' + simulation['Name']
simulation['Images'] = ContinuousLayers + CategoricalLayers
simulation['Categorical'] = [False for i in ContinuousLayers] + [True for i in CategoricalLayers] # list of booleans where True == categorical variable.
simulation['Null Value'] = 0 # No data value for the Continuous and Categorical raster layers.
simulation['Valid Mask'] = './Data/Wales_Binary_Mask.tif'
simulation['Environmental Predictors'] = [image.split('/')[-1].replace('.tif', '') for image in simulation['Images']] # define names of the environmental predictors.
# define the study region using the binary mask image:
geotransform, projection, xsize, ysize, noDataVal = rsgislib.sdm.read_raster_information(simulation['Valid Mask'])
simulation['geotransform'] = geotransform
simulation['projection'] = projection
simulation['rasterxsize'] = xsize
simulation['rasterysize'] = ysize
del geotransform, projection, xsize, ysize, noDataVal
The sighting records then need processing and the pseudo-absences creating:
# read xy coordinates for the presence records:
presence_xy = rsgislib.sdm.read_sightings_vector('./Data/Squirrel_Sightings/Red_Squirrel.geojson') # import from OGR vector file.
#presence_xy = rsgislib.sdm.read_sightings_csv('./Data/Squirrel_Sightings/Red_Squirrel.csv', sep=',', xfield='x (OSGB)', yfield='y (OSGB)') # import from CSV file.
# extract values from input images for presence records:
presence_data = rsgislib.sdm.extract_raster_values(simulation, presence_xy)
# generate pseudo-absence x&y coordinates:
absence_xy = rsgislib.sdm.generate_pseudoabsences(simulation, n_points=15000)
# (optional) drop pseudo-absence points that are proximal to presence records (e.g. within 500 metres):
absence_xy = rsgislib.sdm.drop_proximal_records(presence_xy, absence_xy, distance=500)
# (optional) ensure that number of presence records == number of absence records:
# Note: this is an important consideration particularly for decision-tree estimators.
absence_xy = rsgislib.sdm.equalise_records(presence_xy, absence_xy)
# (optional) export pseudo-absence xy coordinates to a CSV:
#rsgislib.sdm.export_pseudoabsence_xy(absence_xy, 'Pseudoabsence_Coords_Extra_Trees.csv')
# extract values from input images for absence records:
absence_data = rsgislib.sdm.extract_raster_values(simulation, absence_xy)
del presence_xy, absence_xy
# drop records with no valid data:
presence_data = rsgislib.sdm.drop_null_records(simulation, presence_data)
absence_data = rsgislib.sdm.drop_null_records(simulation, absence_data)
# combine the presence and absence data into a single array:
presence_absence_data = rsgislib.sdm.combine_records(presence_data, absence_data)
del presence_data, absence_data
Finally, you can run the simulation but you also need to create an instance of the estimator you want use:
Extra-Tree / Random Forests:
# define the estimator:
from sklearn.ensemble import ExtraTreesClassifier as ET
estimator = ET(n_estimators=100, max_depth=6, min_samples_split=4, min_samples_leaf=2, min_weight_fraction_leaf=0.0, max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07, bootstrap=False, n_jobs=4)
# run the simulation:
rsgislib.sdm.apply_decision_tree_estimator(simulation, presence_absence_data, estimator, validation='split-sample', test_fraction=0.1, replicates=10, gdalformat='GTiff', clf_labels='Presence-Absence', Overwrite=True)
del presence_absence_data, simulation
Logistic Regression (MaxEnt):
# define the estimator:
from sklearn.linear_model import LogisticRegression as maxent
estimator = maxent(solver='liblinear', max_iter=500, penalty='l2', tol=0.0001, C=0.1, fit_intercept=True, intercept_scaling=1, n_jobs=1)
# run the simulation:
rsgislib.sdm.apply_meta_estimator(simulation, presence_absence_data, estimator, validation='split-sample', test_fraction=0.1, replicates=10, gdalformat='GTiff', clf_labels='Presence-Absence', Overwrite=True)
del presence_absence_data, simulation
Gradiant Boosting:
# define the estimator:
from sklearn.ensemble import GradientBoostingClassifier as GBT
estimator = GBT(loss='deviance', learning_rate=0.01, n_estimators=100, subsample=1.0, min_samples_split=4, min_samples_leaf=2, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_split=1e-07, max_features=None)
# run the simulation:
rsgislib.sdm.apply_decision_tree_estimator(simulation, presence_absence_data, estimator, validation='split-sample', test_fraction=0.1, replicates=10, gdalformat='GTiff', clf_labels='Presence-Absence', Overwrite=True)
del presence_absence_data, simulation
Docker¶
If you are running these functions from within Docker (or Singularity) then you may get an error from matplotlib either a ‘Segmentation Fault’ or error related to the ‘Backend’. This is due to fast that there is no graphical interface within a docker image and therefore we need to use an appropriate ‘backend’.
To change to use the Agg backend the following import and function need to run before the rsgislib.sdm import:
import matplotlib
matplotlib.use('Agg')
Python Future Warnings¶
From the scikit-learn functions you sometimes get future warning printed to the console. You can set the following filters to the top of your script prevent these printing to the console:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Loading Data¶
-
rsgislib.sdm.
read_sightings_vector
(InputVector)¶ A function to read occurrence records from an OGR supported vector file (e.g. ESRI Shapefile). This must be a point dataset, not a line or polygon.
- Parameters
problem – A python dictionary containing the simulation parameters.
- Returns
a numpy.ndarray containing the x & y coordinates, shape=(n_samples, 2).
-
rsgislib.sdm.
read_sightings_csv
(InputCSV, sep=',', xfield='x', yfield='y')¶ A function to read occurrence records from a comma separated text file. Returns a numpy.array containing the x & y coordinates, shape=(n_samples, 2).
- Parameters
InputCSV – Path to the input CSV file.
sep – Column separator / delimiter.
xfield – String defining the column name for the x coordinates (aka longitude, eastings).
yfield – String defining the column name for the y coordinates (aka latitude, northings).
-
rsgislib.sdm.
extract_raster_values
(problem, xy)¶ A function to extract raster values using x & y coordinates.
- Parameters
problem – A python dictionary containing the simulation parameters.
xy – a numpy.ndarray containing x & y coordinates, shape=(n_samples, 2).
- Returns
a structured np.recarray containing the extracted raster values.
Manipulate Records¶
-
rsgislib.sdm.
generate_pseudoabsences
(problem, n_points=10000)¶ A function to generate random pseudoabsence records.
Required Parameters: :param problem: A python dictionary containing the simulation parameters. :param n_points: number of pseudo-absence records to generate. Default is 10000 (same as Maxent). :return: A np.ndarray of tupled x & y coordinates, shape=(n_samples, 2).
-
rsgislib.sdm.
export_pseudoabsence_xy
(absence_records, OutputCSV)¶ A function to export pseudo-absence x & y coordinates to a comma separated text file.
- Parameters
absence_records – np.ndarray containing x & y coordinates of pseudo-absence records, shape=(n_samples, 2).
OutputCSV – Path to the output CSV.
-
rsgislib.sdm.
equalise_records
(presence_records, absence_records)¶ A function to ensure that the number of absence records == presence records. Absence records are removed if the number of absence records > presence records.
- Parameters
presence_data – a np.ndarray or np.recarray of presence data, shape(n_samples, n_features).
absence_data – a np.ndarray or np.recarray of absence data, shape(n_samples, n_features).
- Returns
a np.array of absence records where n_absences == n_presences. If the number of absence records < presence records, the original array is returned without modification.
-
rsgislib.sdm.
drop_proximal_records
(presence_records, absence_records, distance, k=6)¶ A function to remove absence records that are close to presence records. This function works best for geographic coordinates (i.e. not projected).
- Parameters
presence_records – np.ndarray containing x & y coordinates of presence records, shape=(n_samples, 2).
absence_records – np.ndarray containing x & y coordinates of absence records, shape=(n_samples, 2).
distance – A distance threshold to remove proximal absence records.
k – number of nearest neighbours to find. Default = 6.
- Returns
Returns x & y coordinates of valid absence records.
-
rsgislib.sdm.
drop_null_records
(problem, records)¶ A function to remove presence or absence records with no valid data.
- Parameters
problem – A python dictionary containing the simulation parameters.
records – a np.recarray containing values extracted from environmental rasters.
- Returns
a numpy.recarray containing valid presence-absence records.
-
rsgislib.sdm.
combine_records
(presence_data, absence_data)¶ A function to concatenate and classify presence-absence records. Absences are assigned values = 0, presence = 1.
- Parameters
presence_data – a np.recarray containing presence data.
absence_data – a np.recarray containing absence data.
- Returns
a np.recarray containing presence-absence data, shape=(n_samples, n_predictors + label_column).
Run SDM¶
-
rsgislib.sdm.
apply_decision_tree_estimator
(params, presence_absence_data, estimator=None, validation='split-sample', test_fraction=0.15, replicates=10, gdalformat='GTiff', clf_labels='Presence-Absence', Overwrite=True, plot_out_format='png', plot_out_dpi=300)¶ A function to model species distributions using scikit-learn decision tree classifiers supporting the methods feature_importances_ and predict_proba. Use with: DecisionTreeClassifier, GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier or AdaBoostClassifier.
- Parameters
params – A python dictionary containing the simulation parameters.
presence_absence_data – np.recarray containing presence-absence training data.
estimator – An sklearn decision tree classifier with methods: feature_importances_ and predict_proba. If None, uses sklearn.tree.DecisionTreeClassifier by default.
validation – ‘bootstrap’ or ‘split-sample’ resampling methods for cross-validation are currently supported.
test_fraction – Float defining the proportion of presence-absence data to set aside for validation. Only used in split-sample validation. Default = 0.15 (i.e. 15%).
replicates – Number of model iterations to perform (default=10).
gdalformat – Format of output images. Valid options “GTiff” or “KEA”.
clf_labels – Name of column in presence_absence_data delineating the binary class labels (presence=1, absence=0).
Overwrite – Boolean option to overwrite the output directory if it already exists.
plot_out_format – Output format for plots (options: png or pdf; default png)
plot_out_dpi – resolution of plot output images. Default 300.
-
rsgislib.sdm.
apply_meta_estimator
(params, presence_absence_data, estimator=None, validation='split-sample', test_fraction=0.15, replicates=10, max_depth=3, gdalformat='GTiff', clf_labels='Presence-Absence', Overwrite=True, plot_out_format='png', plot_out_dpi=300)¶ A function to model species distributions using scikit-learn classifiers supporting the method predict_proba. Works with: LogisticRegression (aka maxent) and MLPClassifier (perhaps others?). Is very slow with SVC.
To avoid encoding each categorical variable, this function uses gradient boosted decision trees to perform a feature transform. The leaves within the fitted decision tree are then encoded into sparse binary values and a meta-estimator (e.g. logistic regression / maxent) is then fitted onto the transformed features/variables. This often leads to better results than can be achieved by fitting the meta-estimator directly. This method is related to manifold learning & stacked regression (where a model is trained from the predictions of previous estimators).
References: https://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html https://www.quora.com/Why-do-people-use-gradient-boosted-decision-trees-to-do-feature-transform Breiman, L. (1996). Stacked regressions. Machine learning, 24(1), 49-64.
- Parameters
params – A python dictionary containing the simulation parameters.
presence_absence_data – np.recarray containing presence-absence training data.
estimator – An sklearn classifier supporting a predict_proba method. If None, uses sklearn.linear_model.LogisticRegression (Maxent) by default.
validation – ‘bootstrap’ or ‘split-sample’ resampling methods for cross-validation are currently supported.
test_fraction – Float defining the proportion of presence-absence data to set aside for validation. Only used in split-sample validation. Default = 0.15 (i.e. 15%).
replicates – Number of model iterations to perform (default=10).
max_depth – Maximum number of nodes in the decision tree used to transform the presence absence data. When max_depth is high, a very detailed transform is undertaken. Conversely, setting max_depth to a low value (e.g. 3) limits overfitting.
gdalformat – Format of output images. Valid options “GTiff” or “KEA”.
clf_labels – Name of column in presence_absence_data delineating the binary class labels (presence=1, absence=0).
Overwrite – Boolean option to overwrite the output directory if it already exists.
plot_out_format – Output format for plots (options: png or pdf; default png)
plot_out_dpi – resolution of plot output images. Default 300.
Utilities¶
-
rsgislib.sdm.
read_raster_information
(InputImage)¶ - A function to read a GDAL image to determine:
geot = Image geotransform.
srs = Coordinate system.
xsize, ysize = Image dimensions along the x & y axes.
noDataVal = NoDataValue for invalid pixels.
- Parameters
InputImage – path to the input raster.