RSGISLib Scikit-Learn Clumps Classification Module

The steps to undertake a classification using clumps are:

  • Image segmentation to generate clumps

  • Populate attributes to clumps

  • Generate training and populate to clumps

  • Train the classifier

  • Apply the classifier

  • Collapse to generate a classification.

If you have undertaken an image segmentation and want to use those segments for a classification using RSGISLib then you need to use the image clumps representation. This is described in the paper below:

Clewley, D., Bunting, P., Shepherd, J., Gillingham, S., Flood, N., Dymond, J., Lucas, R., Armston, J., Moghaddam, M. (2014). A Python-Based Open Source System for Geographic Object-Based Image Analysis (GEOBIA) Utilizing Raster Attribute Tables. Remote Sensing 6(7), 6111-6135. https://dx.doi.org/10.3390/rs6076111

Commonly, we would use the Shepherd et al. (2019) segmentation, which is run with the following function:

from rsgislib.segmentation import segutils

input_img = "S2_UVD_27sept_27700_sub.kea"
clumps_img = "s2_uvd_27sept_clumps.kea"
tmp_path = "./tmp"
segutils.runShepherdSegmentation(input_img, clumps_img, tmpath=tmp_path, numClusters=60, minPxls=100, distThres=100, sampling=100, kmMaxIter=200)

Shepherd, J., Bunting, P., Dymond, J. (2019). Operational Large-Scale Segmentation of Imagery Based on Iterative Elimination. Remote Sensing 11(6), 658. https://dx.doi.org/10.3390/rs11060658

To populate the clumps (i.e., segments or objects) with the attribute information used for the classification you need to use the functions within the rsgislib.rastergis module, for example:

import rsgislib.rastergis

# Populate with all statistics (min, max, mean, standard deviation)
bandinfo = []
bandinfo.append(rsgislib.rastergis.BandAttStats(band=1, minField='BlueMin', maxField='BlueMax', meanField='BlueMean', stdDevField='BlueStdev'))
bandinfo.append(rsgislib.rastergis.BandAttStats(band=2, minField='GrnMin', maxField='GrnMax', meanField='GrnMean', stdDevField='GrnStdev'))
bandinfo.append(rsgislib.rastergis.BandAttStats(band=3, minField='RedMin', maxField='RedMax', meanField='RedMean', stdDevField='RedStdev'))
bandinfo.append(rsgislib.rastergis.BandAttStats(band=4, minField='RE1Min', maxField='RE1Max', meanField='RE1Mean', stdDevField='RE1Stdev'))
rsgislib.rastergis.populateRATWithStats(input_img, clumps_img, bandinfo)

# Populate with just mean statistic
bandinfo = []
bandinfo.append(rsgislib.rastergis.BandAttStats(band=1, meanField='BlueMean'))
bandinfo.append(rsgislib.rastergis.BandAttStats(band=2, meanField='GrnMean'))
bandinfo.append(rsgislib.rastergis.BandAttStats(band=3, meanField='RedMean'))
bandinfo.append(rsgislib.rastergis.BandAttStats(band=4, meanField='RE1Mean'))
rsgislib.rastergis.populateRATWithStats(input_img, clumps_img, bandinfo)
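To make concrete what populating the RAT with band statistics involves, the sketch below computes the minimum, maximum, mean and standard deviation of one band for each clump using plain Python lists. It illustrates the idea only and is not RSGISLib's implementation; the function name clump_band_stats, the toy arrays, the treatment of clump ID 0 as no data, and the use of the population standard deviation are all assumptions.

```python
import math
from collections import defaultdict


def clump_band_stats(clumps, band):
    """Summarise one band per clump ID (hypothetical helper, illustration only)."""
    # Gather the pixel values that fall inside each clump.
    values = defaultdict(list)
    for clump_row, band_row in zip(clumps, band):
        for clump_id, value in zip(clump_row, band_row):
            if clump_id != 0:  # assumption: clump ID 0 marks no data
                values[clump_id].append(value)

    stats = {}
    for clump_id, vals in values.items():
        mean = sum(vals) / len(vals)
        # Population standard deviation (an assumption for this sketch).
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        stats[clump_id] = {"min": min(vals), "max": max(vals), "mean": mean, "stddev": std}
    return stats


# Toy 2x4 image: clump IDs and the matching values of a single band.
clumps = [[1, 1, 2, 2],
          [1, 0, 2, 2]]
band = [[10, 20, 5, 7],
        [30, 99, 5, 7]]

stats = clump_band_stats(clumps, band)
print(stats[1]["mean"])  # mean of the three pixels in clump 1
```

Each entry in the returned dictionary corresponds to one row of the attribute table, with one value per requested statistic.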

To train the classifier you need to create a column within the clumps raster attribute table (RAT) specifying the class of each clump used for training. Training data are often provided as vector layers; a ratutils helper function can be used to populate the clumps from them:

import rsgislib.rastergis.ratutils

classes_dict = dict()
classes_dict['Mangroves'] = [1, 'Mangroves.shp']
classes_dict['Other'] = [2, 'Other.shp']
tmp_path = './tmp'
classes_int_col_in = 'ClassInt'
classes_name_col = 'ClassStr'
rsgislib.rastergis.ratutils.populateClumpsWithClassTraining(clumps_img, classes_dict, tmp_path, classes_int_col_in, classes_name_col)
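Because the integer IDs in classes_dict end up in the training column, it can help to check the dictionary before running the function. The sketch below is a hypothetical pre-flight check (check_classes_dict is not part of RSGISLib); it assumes that IDs should be unique positive integers, on the basis that zero denotes no data in the training column.

```python
def check_classes_dict(classes_dict):
    """Return a list of problems found in a {name: [int_id, vector_file]} dict."""
    problems = []
    seen_ids = {}
    for name, entry in classes_dict.items():
        if len(entry) != 2:
            problems.append(f"{name}: expected [int_id, vector_file]")
            continue
        class_id, vec_file = entry
        if not isinstance(class_id, int) or class_id <= 0:
            problems.append(f"{name}: ID must be a positive integer (0 is no data)")
        elif class_id in seen_ids:
            problems.append(f"{name}: ID {class_id} already used by {seen_ids[class_id]}")
        else:
            seen_ids[class_id] = name
        if not isinstance(vec_file, str) or not vec_file:
            problems.append(f"{name}: vector file path is missing")
    return problems


classes_dict = {"Mangroves": [1, "Mangroves.shp"], "Other": [2, "Other.shp"]}
print(check_classes_dict(classes_dict))  # an empty list means no problems were found
```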

To balance the training samples (ensuring there are the same number for each class) you can use the following function:

import rsgislib.classification.classratutils

classes_int_col = 'ClassIntSamp'
rsgislib.classification.classratutils.balanceSampleTrainingRandom(clumps_img, classes_int_col_in, classes_int_col, 100, 200)

To train the classifier you need to use the findClassifierParameters function:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# RAT variables used for the classification
variables = ['BlueMean', 'GrnMean', 'RedMean', 'RE1Mean']

grid_search = GridSearchCV(RandomForestClassifier(), param_grid={'n_estimators':[10,20,50,100], 'max_depth':[2,4,8]})

classifier = rsgislib.classification.classratutils.findClassifierParameters(clumps_img, classes_int_col, variables, preProcessor=None, gridSearch=grid_search)
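A grid search fits the classifier once for every combination of candidate values, and again for each cross-validation fold, so the cost grows with the product of the list lengths. The sketch below enumerates the combinations for the param_grid used above with itertools.product. Here expand_grid is a hypothetical helper, and the fold count of 5 is an assumption about the cross-validation setting rather than something stated in this document.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one {parameter: value} dict per combination in the grid."""
    names = sorted(param_grid)
    for combo in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, combo))


param_grid = {"n_estimators": [10, 20, 50, 100], "max_depth": [2, 4, 8]}
combos = list(expand_grid(param_grid))

n_folds = 5  # assumption: 5-fold cross-validation
print(len(combos), "combinations,", len(combos) * n_folds, "fits")
```

Counting the combinations before starting is a cheap way to judge whether a grid is practical for a large training set.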

To apply the classification you can use either the classifyWithinRAT or the classifyWithinRATTiled function. classifyWithinRAT loads the attribute table columns used for the classification into memory with a single read of the attribute table, so it can be faster for smaller scenes. However, if your RAT contains a large number of clumps, this can require more memory than you have available, and you will need to use the classifyWithinRATTiled function, which steps through the RAT in chunks using only a small amount of memory. If you are unsure, use the classifyWithinRATTiled function, as the extra I/O time will be minimal.
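The two approaches differ only in how rows are read, so the predicted classes should be the same. The sketch below illustrates the pattern with a plain Python list standing in for a RAT column and a simple threshold rule standing in for the classifier. The names iter_blocks and predict, and the block size, are assumptions made for illustration and do not reflect RSGISLib or rios internals.

```python
def iter_blocks(n_rows, block_size):
    """Yield (start, end) row ranges that cover n_rows in order."""
    for start in range(0, n_rows, block_size):
        yield start, min(start + block_size, n_rows)


def predict(rows):
    """Stand-in classifier: class 1 if the value exceeds 0.5, otherwise class 2."""
    return [1 if value > 0.5 else 2 for value in rows]


rat_column = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.6]

# Whole-table approach: one read, all rows held in memory at once.
all_at_once = predict(rat_column)

# Tiled approach: only block_size rows are held in memory at a time.
tiled = []
for start, end in iter_blocks(len(rat_column), block_size=3):
    tiled.extend(predict(rat_column[start:end]))

print(all_at_once == tiled)
```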

Classification using the classifyWithinRATTiled function:

class_colours = dict()
class_colours['Mangroves'] = [0,255,0]
class_colours['Other'] = [100,100,100]

out_class_int_col = 'OutClass'
out_class_str_col = 'OutClassName'
rsgislib.classification.classratutils.classifyWithinRATTiled(clumps_img, classes_int_col, classes_name_col, variables, classifier=classifier, outColInt=out_class_int_col, outColStr=out_class_str_col, classColours=class_colours)

Classification using the classifyWithinRAT function:

class_colours = dict()
class_colours['Mangroves'] = [0,255,0]
class_colours['Other'] = [100,100,100]

out_class_int_col = 'OutClass'
out_class_str_col = 'OutClassName'
rsgislib.classification.classratutils.classifyWithinRAT(clumps_img, classes_int_col, classes_name_col, variables, classifier=classifier, outColInt=out_class_int_col, outColStr=out_class_str_col, classColours=class_colours, preProcessor=None)

Finally, to produce a classification image file (rather than a segmentation), in which each pixel value is the class assigned to its clump, you can use the following function, which ‘collapses’ the RAT to create a classification image:

import rsgislib.classification

# Export to a 'classification image' rather than a RAT...
out_class_img = 's2_uvd_27sept_class.kea'
rsgislib.classification.collapseClasses(clumps_img, out_class_img, 'KEA', out_class_str_col, out_class_int_col)
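Conceptually, collapsing replaces each pixel's clump ID with the class stored for that clump in the RAT. The sketch below shows that lookup on toy data with plain lists. The name collapse_to_classes is hypothetical and mapping unlisted clumps to 0 is an assumption, so treat it as an illustration of the idea rather than a description of what collapseClasses does internally.

```python
def collapse_to_classes(clumps, clump_class, nodata=0):
    """Map every clump ID in a 2D list to its class value via a lookup dict."""
    return [[clump_class.get(clump_id, nodata) for clump_id in row] for row in clumps]


# Toy clumps image (pixel value = clump ID) and the per-clump class column.
clumps = [[1, 1, 2],
          [3, 3, 2],
          [0, 4, 4]]
out_class = {1: 1, 2: 2, 3: 1, 4: 2}  # clump ID -> classified class (e.g. 1 = Mangroves)

class_img = collapse_to_classes(clumps, out_class)
for row in class_img:
    print(row)
```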

Training Functions

rsgislib.classification.classratutils.findClassifierParameters(clumpsImg, classesIntCol, variables, preProcessor=None, gridSearch=GridSearchCV(cv=None, error_score=nan, estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False), iid='deprecated', n_jobs=None, param_grid={}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring=None, verbose=0))

Find the optimal parameters for a classifier using a grid search and return a classifier instance with those optimal parameters.

Parameters
  • clumpsImg – is the clumps image on which the classification is to be performed

  • classesIntCol – is the column with the training data as int values

  • variables – is an array of column names which are to be used for the classification

  • preProcessor – is a scikit-learn preprocessor such as sklearn.preprocessing.MaxAbsScaler() which rescales each input variable independently as it is read in (Default: None; i.e., not in use).

  • gridSearch – is an instance of GridSearchCV parameterised with a classifier and parameters to be searched.

Returns

Instance of the classifier with optimal parameters defined.

Example:

from rsgislib.classification import classratutils
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MaxAbsScaler

clumpsImg = "./LS8_20150621_lat10lon652_r67p233_clumps.kea"
classesIntCol = 'ClassInt'

classParameters = {'kernel':['linear', 'rbf', 'poly', 'sigmoid'], 'C':[1, 2, 3, 4, 5, 10, 100, 400, 500, 1e3, 5e3, 1e4, 5e4, 1e5], 'gamma':[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 'auto'], 'degree':[2, 3, 4, 5, 6, 7, 8], 'class_weight':[None, 'balanced'], 'decision_function_shape':['ovo', 'ovr', None]}
variables = ['BlueRefl', 'GreenRefl', 'RedRefl', 'NIRRefl', 'SWIR1Refl', 'SWIR2Refl']

gSearch = GridSearchCV(SVC(), classParameters)
classifier = classratutils.findClassifierParameters(clumpsImg, classesIntCol, variables, preProcessor=MaxAbsScaler(), gridSearch=gSearch)

rsgislib.classification.classratutils.balanceSampleTrainingRandom(clumpsImg, trainCol, outTrainCol, minNoSamples, maxNoSamples)

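A function to balance the number of training samples across the classes used for classification. Classes with fewer than the minimum number of samples (minNoSamples) are removed, and the remaining classes are randomly sampled so that all have the same number of samples as the smallest class, unless that number is above the set maximum (maxNoSamples), in which case the maximum is used.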

Parameters
  • clumpsImg – is a string with the file path to the input image with RAT

  • trainCol – is a string for the name of the input column specifying the training samples (zero is no data)

  • outTrainCol – is a string with the name of the output column for the balanced training samples.

  • minNoSamples – is an int specifying the minimum number of training samples for a class (if below threshold class is removed).

  • maxNoSamples – is an int specifying the maximum number of training samples per class.
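One way to read the balancing rule described above is sketched below in plain Python: classes with fewer than minNoSamples training clumps are dropped, and every remaining class is randomly subsampled to the size of the smallest remaining class, capped at maxNoSamples. The function balance_samples and its seed argument are hypothetical, and the exact behaviour of the RSGISLib function at the boundaries may differ.

```python
import random
from collections import defaultdict


def balance_samples(train, min_n, max_n, seed=42):
    """Return a balanced copy of a training column (0 = not used for training)."""
    rng = random.Random(seed)

    # Group row indices by class, ignoring the no-data value 0.
    rows_by_class = defaultdict(list)
    for row, class_id in enumerate(train):
        if class_id != 0:
            rows_by_class[class_id].append(row)

    # Drop classes that have too few samples to be useful.
    kept = {c: rows for c, rows in rows_by_class.items() if len(rows) >= min_n}
    out = [0] * len(train)
    if not kept:
        return out

    # Target size: the smallest kept class, but never more than max_n.
    n_target = min(min(len(rows) for rows in kept.values()), max_n)
    for class_id, rows in kept.items():
        for row in rng.sample(rows, n_target):
            out[row] = class_id
    return out


train = [1] * 8 + [2] * 5 + [3] * 1 + [0] * 4
balanced = balance_samples(train, min_n=2, max_n=4)
print({c: balanced.count(c) for c in (1, 2, 3)})
```

In this toy column, class 3 has a single sample and is removed, while classes 1 and 2 are both reduced to the cap of four samples.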

Classify Functions

rsgislib.classification.classratutils.classifyWithinRAT(clumpsImg, classesIntCol, classesNameCol, variables, classifier=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=3, max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=True, random_state=None, verbose=0, warm_start=False), outColInt='OutClass', outColStr='OutClassName', roiCol=None, roiVal=1, classColours=None, preProcessor=None, justFit=False)

A function which will perform a classification within the RAT using a classifier from scikit-learn.

Parameters
  • clumpsImg – is the clumps image on which the classification is to be performed

  • classesIntCol – is the column with the training data as int values

  • classesNameCol – is the column with the training data as string class names

  • variables – is an array of column names which are to be used for the classification

  • classifier – is an instance of a scikit-learn classifier (e.g., RandomForests which is Default)

  • outColInt – is the output column name for the int class representation (Default: ‘OutClass’)

  • outColStr – is the output column name for the class names column (Default: ‘OutClassName’)

  • roiCol – is a column name for a column which specifies the region to be classified. If None, it is ignored (Default: None)

  • roiVal – is an int value used within the roiCol to select a region to be classified (Default: 1)

  • classColours – is a python dict using the class name as the key along with arrays of length 3 specifying the RGB colours for the class.

  • preProcessor – is a scikit-learn preprocessor such as sklearn.preprocessing.MaxAbsScaler() which rescales each input variable independently as it is read in (Default: None; i.e., not in use).

  • justFit – is a boolean specifying that the classifier should just be fitted to the data and not applied (Default: False; i.e., apply classification)
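For readers unfamiliar with MaxAbsScaler, it divides each variable (column) by its largest absolute value, so every column ends up within [-1, 1] without being shifted. The plain-Python sketch below reproduces that arithmetic on toy data. The helper max_abs_scale is hypothetical and shown only for illustration; in practice the scikit-learn class itself would be passed as preProcessor.

```python
def max_abs_scale(rows):
    """Scale each column of a list of rows by that column's maximum absolute value."""
    n_cols = len(rows[0])
    scales = []
    for col in range(n_cols):
        max_abs = max(abs(row[col]) for row in rows)
        scales.append(max_abs if max_abs != 0 else 1.0)  # leave all-zero columns unchanged
    return [[row[col] / scales[col] for col in range(n_cols)] for row in rows]


# Two variables on very different scales, e.g. a reflectance mean and an index.
rows = [[200.0, 0.5],
        [400.0, -0.25],
        [100.0, 0.1]]

scaled = max_abs_scale(rows)
print(scaled[0])
```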

Example:

from sklearn.ensemble import ExtraTreesClassifier
from rsgislib.classification import classratutils

classifier = ExtraTreesClassifier(n_estimators=100, max_features=3, n_jobs=-1, verbose=0)

classColours = dict()
classColours['Forest'] = [0,138,0]
classColours['NonForest'] = [200,200,200]

variables = ['GreenAvg', 'RedAvg', 'NIR1Avg', 'NIR2Avg', 'NDVI']
# clumpsImg, classesIntCol and classesNameCol are assumed to be defined already
classratutils.classifyWithinRAT(clumpsImg, classesIntCol, classesNameCol, variables, classifier=classifier, classColours=classColours)

from sklearn.preprocessing import MaxAbsScaler

# With pre-processor
classratutils.classifyWithinRAT(clumpsImg, classesIntCol, classesNameCol, variables, classifier=classifier, classColours=classColours, preProcessor=MaxAbsScaler())

rsgislib.classification.classratutils.classifyWithinRATTiled(clumpsImg, classesIntCol, classesNameCol, variables, classifier=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=3, max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=True, random_state=None, verbose=0, warm_start=False), outColInt='OutClass', outColStr='OutClassName', roiCol=None, roiVal=1, classColours=None, scaleVarsRange=False, justFit=False)

A function which will perform a classification within the RAT using a classifier from scikit-learn. It uses the rios ratapplier interface, allowing very large RATs to be processed.

Parameters
  • clumpsImg – is the clumps image on which the classification is to be performed

  • classesIntCol – is the column with the training data as int values

  • classesNameCol – is the column with the training data as string class names

  • variables – is an array of column names which are to be used for the classification

  • classifier – is an instance of a scikit-learn classifier (e.g., RandomForests which is Default)

  • outColInt – is the output column name for the int class representation (Default: ‘OutClass’)

  • outColStr – is the output column name for the class names column (Default: ‘OutClassName’)

  • roiCol – is a column name for a column which specifies the region to be classified. If None, it is ignored (Default: None)

  • roiVal – is an int value used within the roiCol to select a region to be classified (Default: 1)

  • classColours – is a python dict using the class name as the key along with arrays of length 3 specifying the RGB colours for the class.

  • scaleVarsRange – will rescale each variable independently to a range of 0-1 (default: False).

  • justFit – is a boolean specifying that the classifier should just be fitted to the data and not applied (Default: False; i.e., apply classification)
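Rescaling to a 0-1 range is usually done with min-max scaling: subtract the column minimum and divide by the column range. The sketch below shows that calculation in plain Python. It is an assumption about what scaleVarsRange does, based only on the parameter description, and range_scale is a hypothetical name.

```python
def range_scale(rows):
    """Rescale each column of a list of rows to the range 0-1 (min-max scaling)."""
    n_cols = len(rows[0])
    mins = [min(row[col] for row in rows) for col in range(n_cols)]
    maxs = [max(row[col] for row in rows) for col in range(n_cols)]
    scaled = []
    for row in rows:
        new_row = []
        for col in range(n_cols):
            span = maxs[col] - mins[col]
            # A constant column has no range; map it to 0.0 to avoid dividing by zero.
            new_row.append((row[col] - mins[col]) / span if span else 0.0)
        scaled.append(new_row)
    return scaled


rows = [[100.0, 0.25],
        [300.0, 0.75],
        [200.0, 0.5]]

print(range_scale(rows))
```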

Example:

from sklearn.ensemble import ExtraTreesClassifier
from rsgislib.classification import classratutils

classifier = ExtraTreesClassifier(n_estimators=100, max_features=3, n_jobs=-1, verbose=0)

classColours = dict()
classColours['Forest'] = [0,138,0]
classColours['NonForest'] = [200,200,200]

variables = ['GreenAvg', 'RedAvg', 'NIR1Avg', 'NIR2Avg', 'NDVI']
classratutils.classifyWithinRATTiled(clumpsImg, classesIntCol, classesNameCol, variables, classifier=classifier, classColours=classColours)

# With range scaling
classratutils.classifyWithinRATTiled(clumpsImg, classesIntCol, classesNameCol, variables, classifier=classifier, classColours=classColours, scaleVarsRange=True)

rsgislib.classification.classratutils.clusterWithinRAT(clumpsImg, variables, clusterer=MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++', init_size=None, max_iter=100, max_no_improvement=10, n_clusters=8, n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0, verbose=0), outColInt='OutCluster', roiCol=None, roiVal=1, clrClusters=True, clrSeed=10, addConnectivity=False, preProcessor=None)

A function which will perform a clustering within the RAT using a clustering algorithm from scikit-learn.

Parameters
  • clumpsImg – is the clumps image on which the clustering is to be performed.

  • variables – is an array of column names which are to be used for the clustering.

  • clusterer – is an instance of a scikit-learn clusterer (e.g., MiniBatchKMeans which is Default; Note with 8 clusters).

  • outColInt – is the output column name identifying the clusters (Default: ‘OutCluster’).

  • roiCol – is a column name for a column which specifies the region to be clustered. If None, it is ignored (Default: None).

  • roiVal – is an int value used within the roiCol to select a region to be clustered (Default: 1).

  • clrClusters – is a boolean specifying whether the colour table should be updated to correspond to the clusters (Default: True).

  • clrSeed – is an integer seeding the random generator used to generate the colours (Default=10; if None provided system time used).

  • addConnectivity – is a boolean which adds a kneighbors_graph to the clusterer (just an option for the AgglomerativeClustering algorithm).

  • preProcessor – is a scikit-learn preprocessor such as sklearn.preprocessing.MaxAbsScaler() which rescales each input variable independently as it is read in (Default: None; i.e., not in use).
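Seeding the random generator is what makes the cluster colours repeatable between runs. The sketch below illustrates the idea with Python's random module. The helper random_colour_table is hypothetical and will not reproduce the colours RSGISLib assigns; it only demonstrates that the same seed gives the same table.

```python
import random


def random_colour_table(cluster_ids, seed=10):
    """Assign a random RGB colour to each cluster ID, repeatable for a given seed."""
    rng = random.Random(seed)  # seed=None falls back to a non-repeatable system source
    return {cid: [rng.randint(0, 255) for _ in range(3)] for cid in sorted(cluster_ids)}


clusters = [3, 1, 2, 1, 3]
table_a = random_colour_table(set(clusters), seed=10)
table_b = random_colour_table(set(clusters), seed=10)
print(table_a == table_b)
```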

Example:

from rsgislib.classification import classratutils
from sklearn.cluster import DBSCAN

sklearnClusterer = DBSCAN(eps=1, min_samples=50)
classratutils.clusterWithinRAT('MangroveClumps.kea', ['MinX', 'MinY'], clusterer=sklearnClusterer, outColInt="OutCluster", roiCol=None, roiVal=1, clrClusters=True, clrSeed=10, addConnectivity=False)

# With pre-processor
from sklearn.preprocessing import MaxAbsScaler
classratutils.clusterWithinRAT('MangroveClumps.kea', ['MinX', 'MinY'], clusterer=sklearnClusterer, outColInt="OutCluster", roiCol=None, roiVal=1, clrClusters=True, clrSeed=10, addConnectivity=False, preProcessor=MaxAbsScaler())