RSGISLib Scikit-Learn Clumps Classification Module
Undertaking a classification using clumps involves the following steps:
1. Segment the image to generate clumps.
2. Populate the clumps with attributes.
3. Generate training data and populate it to the clumps.
4. Train the classifier.
5. Apply the classifier.
6. Collapse the clumps to generate a classification.
If you have undertaken an image segmentation and want to use those segments for a classification using RSGISLib then you need to use the image clumps representation. This is described in the paper below:
Clewley, D., Bunting, P., Shepherd, J., Gillingham, S., Flood, N., Dymond, J., Lucas, R., Armston, J., Moghaddam, M. (2014). A Python-Based Open Source System for Geographic Object-Based Image Analysis (GEOBIA) Utilizing Raster Attribute Tables. Remote Sensing 6(7), 6111-6135. https://dx.doi.org/10.3390/rs6076111
Commonly, we would use the Shepherd et al. (2019) segmentation, which is run with the following function:
from rsgislib.segmentation import segutils
input_img = "S2_UVD_27sept_27700_sub.kea"
clumps_img = "s2_uvd_27sept_clumps.kea"
tmp_path = "./tmp"
segutils.runShepherdSegmentation(input_img, clumps_img, tmpath=tmp_path, numClusters=60, minPxls=100, distThres=100, sampling=100, kmMaxIter=200)
Shepherd, J., Bunting, P., Dymond, J. (2019). Operational Large-Scale Segmentation of Imagery Based on Iterative Elimination. Remote Sensing 11(6), 658. https://dx.doi.org/10.3390/rs11060658
To populate the clumps (i.e., segments or objects) with the attribute information used for the classification you need to use the functions within the rsgislib.rastergis module, for example:
import rsgislib.rastergis
# Populate with all statistics (min, max, mean, standard deviation)
bandinfo = []
bandinfo.append(rsgislib.rastergis.BandAttStats(band=1, minField='BlueMin', maxField='BlueMax', meanField='BlueMean', stdDevField='BlueStdev'))
bandinfo.append(rsgislib.rastergis.BandAttStats(band=2, minField='GrnMin', maxField='GrnMax', meanField='GrnMean', stdDevField='GrnStdev'))
bandinfo.append(rsgislib.rastergis.BandAttStats(band=3, minField='RedMin', maxField='RedMax', meanField='RedMean', stdDevField='RedStdev'))
bandinfo.append(rsgislib.rastergis.BandAttStats(band=4, minField='RE1Min', maxField='RE1Max', meanField='RE1Mean', stdDevField='RE1Stdev'))
rsgislib.rastergis.populateRATWithStats(input_img, clumps_img, bandinfo)
# Populate with just mean statistic
bandinfo = []
bandinfo.append(rsgislib.rastergis.BandAttStats(band=1, meanField='BlueMean'))
bandinfo.append(rsgislib.rastergis.BandAttStats(band=2, meanField='GrnMean'))
bandinfo.append(rsgislib.rastergis.BandAttStats(band=3, meanField='RedMean'))
bandinfo.append(rsgislib.rastergis.BandAttStats(band=4, meanField='RE1Mean'))
rsgislib.rastergis.populateRATWithStats(input_img, clumps_img, bandinfo)
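Conceptually, a mean field such as 'BlueMean' holds, for each clump, the average band value of the pixels that clump covers. The following is a minimal pure-Python sketch of that idea, not RSGISLib's implementation; the small `clumps` and `band` grids and the `clump_means` helper are hypothetical, and treating clump ID 0 as no data is an assumption.

```python
from collections import defaultdict


def clump_means(clumps, band):
    """Return {clump_id: mean band value} for two equally sized 2D grids.

    Clump ID 0 is skipped on the assumption that it marks no data.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for clump_row, band_row in zip(clumps, band):
        for clump_id, value in zip(clump_row, band_row):
            if clump_id == 0:
                continue
            sums[clump_id] += value
            counts[clump_id] += 1
    return {cid: sums[cid] / counts[cid] for cid in sums}


# Hypothetical 3x3 clumps image and a single image band.
clumps = [[1, 1, 2],
          [1, 2, 2],
          [0, 3, 3]]
band = [[10, 20, 40],
        [30, 50, 60],
        [99, 70, 90]]

blue_mean = clump_means(clumps, band)
print(blue_mean)
```

Each key of the returned dictionary corresponds to one row of the RAT, and the value is what would be written to the mean column for that clump.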
To train the classifier you need to create a column within the clumps raster attribute table (RAT) specifying the class of each clump being used for training. Training data are often provided as vector layers; a ratutils helper function can be used to populate the clumps from them:
import rsgislib.rastergis.ratutils
classes_dict = dict()
classes_dict['Mangroves'] = [1, 'Mangroves.shp']
classes_dict['Other'] = [2, 'Other.shp']
tmp_path = './tmp'
classes_int_col_in = 'ClassInt'
classes_name_col = 'ClassStr'
rsgislib.rastergis.ratutils.populateClumpsWithClassTraining(clumps_img, classes_dict, tmp_path, classes_int_col_in, classes_name_col)
To balance the training samples (ensuring there are the same number for each class) you can use the following function:
import rsgislib.classification.classratutils
classes_int_col = 'ClassIntSamp'
rsgislib.classification.classratutils.balanceSampleTrainingRandom(clumps_img, classes_int_col_in, classes_int_col, 100, 200)
To train the classifier you need to use the findClassifierParameters function:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# RAT variables used for the classification
variables = ['BlueMean', 'GrnMean', 'RedMean', 'RE1Mean']
grid_search = GridSearchCV(RandomForestClassifier(), param_grid={'n_estimators':[10,20,50,100], 'max_depth':[2,4,8]})
classifier = rsgislib.classification.classratutils.findClassifierParameters(clumps_img, classes_int_col, variables, preProcessor=None, gridSearch=grid_search)
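A grid search fits the classifier once for every combination of values in param_grid (and again for each cross-validation fold), so the size of the grid drives the training time. As a rough illustration using only the standard library, the combinations in the grid above can be enumerated as follows; the `candidates` list is a hypothetical name used here and is not something scikit-learn or RSGISLib exposes under that name.

```python
from itertools import product

# The same grid as passed to GridSearchCV above.
param_grid = {'n_estimators': [10, 20, 50, 100], 'max_depth': [2, 4, 8]}

# Expand the grid into one dict per parameter combination.
names = list(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

print(len(candidates))
print(candidates[0])
```

Here 4 values of n_estimators and 3 values of max_depth give 12 candidate classifiers. Adding a further parameter with several values multiplies that count, so it is worth keeping the grid small when the number of training clumps is large.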
To apply the classification you can use either the classifyWithinRAT or the classifyWithinRATTiled function. classifyWithinRAT loads the attribute table columns used for the classification into memory with a single read of the attribute table, so it can be faster for smaller scenes. However, if you have a large number of clumps within your RAT, this can use more memory than you have available, and you will need to use the classifyWithinRATTiled function, which steps through the RAT in chunks using only a small amount of memory. If you are unsure, use the classifyWithinRATTiled function, as the extra I/O time will be minimal.
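The trade-off between the two approaches can be illustrated with a small pure-Python sketch. It is not how RSGISLib reads the RAT: `classify_rows` is a hypothetical stand-in for a fitted classifier and `iter_blocks` is a hypothetical helper. It only shows that classifying block by block gives the same result as classifying the whole column at once, while needing just one block of input at a time.

```python
def classify_rows(rows, threshold=0.5):
    """Hypothetical stand-in classifier: class 1 above the threshold, else class 2."""
    return [1 if value > threshold else 2 for value in rows]


def iter_blocks(values, block_size):
    """Yield consecutive slices of at most block_size rows."""
    for start in range(0, len(values), block_size):
        yield values[start:start + block_size]


# Hypothetical RAT column with one value per clump.
rat_column = [0.1, 0.9, 0.4, 0.7, 0.8, 0.2, 0.6]

# Whole column processed in one go (the classifyWithinRAT style).
all_at_once = classify_rows(rat_column)

# Block by block (the classifyWithinRATTiled style). The results are collected
# in a list here only so that the two approaches can be compared.
tiled = []
for block in iter_blocks(rat_column, block_size=3):
    tiled.extend(classify_rows(block))

print(all_at_once == tiled)
```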
Classification using the classifyWithinRATTiled function:
class_colours = dict()
class_colours['Mangroves'] = [0,255,0]
class_colours['Other'] = [100,100,100]
out_class_int_col = 'OutClass'
out_class_str_col = 'OutClassName'
rsgislib.classification.classratutils.classifyWithinRATTiled(clumps_img, classes_int_col, classes_name_col, variables, classifier=classifier, outColInt=out_class_int_col, outColStr=out_class_str_col, classColours=class_colours)
Classification using the classifyWithinRAT function:
class_colours = dict()
class_colours['Mangroves'] = [0,255,0]
class_colours['Other'] = [100,100,100]
out_class_int_col = 'OutClass'
out_class_str_col = 'OutClassName'
rsgislib.classification.classratutils.classifyWithinRAT(clumps_img, classes_int_col, classes_name_col, variables, classifier=classifier, outColInt=out_class_int_col, outColStr=out_class_str_col, classColours=class_colours, preProcessor=None)
Finally, to produce a classification image file, rather than a segmentation, in which each pixel value corresponds to the classified class, you can use the following function, which ‘collapses’ the RAT to create a classification image:
import rsgislib.classification
# Export to a 'classification image' rather than a RAT...
out_class_img = 's2_uvd_27sept_class.kea'
rsgislib.classification.collapseClasses(clumps_img, out_class_img, 'KEA', out_class_str_col, out_class_int_col)
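As a rough picture of what collapsing means, each pixel's clump ID is looked up in the RAT and replaced by that clump's class value. The sketch below is a minimal pure-Python illustration under that assumption and is not RSGISLib's implementation; the `out_class` lookup, the `clumps` grid and the `collapse` helper are all hypothetical.

```python
# Hypothetical RAT column: classified class value for each clump ID.
out_class = {1: 1, 2: 2, 3: 1, 4: 2}

# Hypothetical clumps image: each pixel holds the ID of the clump it belongs to.
clumps = [[1, 1, 2],
          [3, 3, 2],
          [4, 4, 4]]


def collapse(clumps, class_lookup, nodata=0):
    """Replace every clump ID with its class value (nodata if the ID is unknown)."""
    return [[class_lookup.get(clump_id, nodata) for clump_id in row] for row in clumps]


class_img = collapse(clumps, out_class)
for row in class_img:
    print(row)
```

Clumps 1 and 3 share class 1, so they become indistinguishable in the output: the classification image keeps the class of each pixel but no longer records the individual segments.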
Training Functions
rsgislib.classification.classratutils.findClassifierParameters(clumpsImg, classesIntCol, variables, preProcessor=None, gridSearch=GridSearchCV(estimator=RandomForestClassifier(), param_grid={}))
Find the optimal parameters for a classifier using a grid search and return a classifier instance with those optimal parameters.
- Parameters
clumpsImg – is the clumps image on which the classification is to be performed
classesIntCol – is the column with the training data as int values
variables – is an array of column names which are to be used for the classification
preProcessor – is a scikit-learn preprocessor such as sklearn.preprocessing.MaxAbsScaler() which can rescale the input variables independently as they are read in (Default: None; i.e., not in use).
gridSearch – is an instance of GridSearchCV parameterised with a classifier and parameters to be searched.
- Returns
Instance of the classifier with optimal parameters defined.
Example:
from rsgislib.classification import classratutils
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MaxAbsScaler
clumpsImg = "./LS8_20150621_lat10lon652_r67p233_clumps.kea"
classesIntCol = 'ClassInt'
# Parameter values to be searched for the SVC classifier
classParameters = {'kernel':['linear', 'rbf', 'poly', 'sigmoid'], 'C':[1, 2, 3, 4, 5, 10, 100, 400, 500, 1e3, 5e3, 1e4, 5e4, 1e5], 'gamma':[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 'auto'], 'degree':[2, 3, 4, 5, 6, 7, 8], 'class_weight':[None, 'balanced'], 'decision_function_shape':['ovo', 'ovr']}
# RAT columns used as the input variables
variables = ['BlueRefl', 'GreenRefl', 'RedRefl', 'NIRRefl', 'SWIR1Refl', 'SWIR2Refl']
gSearch = GridSearchCV(SVC(), classParameters)
classifier = classratutils.findClassifierParameters(clumpsImg, classesIntCol, variables, preProcessor=MaxAbsScaler(), gridSearch=gSearch)
rsgislib.classification.classratutils.balanceSampleTrainingRandom(clumpsImg, trainCol, outTrainCol, minNoSamples, maxNoSamples)
A function to balance the number of training samples for classification so that every class has at least a minimum number of samples (minNoSamples) and all classes have the same number of samples, equal to that of the class with the fewest samples, unless that number exceeds a set maximum (maxNoSamples).
- Parameters
clumpsImg – is a string with the file path to the input image with RAT
trainCol – is a string for the name of the input column specifying the training samples (zero is no data)
outTrainCol – is a string with the name of the output column for the balanced training samples.
minNoSamples – is an int specifying the minimum number of training samples for a class (if below threshold class is removed).
maxNoSamples – is an int specifying the maximum number of training samples per class.
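The balancing rule described above can be sketched in plain Python. This is one reading of the description, not the RSGISLib source: the `balance_training` helper, its `seed` argument and the example `labels` list are hypothetical. Classes with fewer than the minimum number of samples are dropped, and every remaining class is randomly subsampled to the size of the smallest remaining class, capped at the maximum.

```python
import random


def balance_training(labels, min_samples, max_samples, seed=42):
    """Return a balanced copy of labels (one int class per clump, 0 = no training)."""
    rng = random.Random(seed)

    # Group the row indexes by class, ignoring rows without training data.
    by_class = {}
    for idx, cls in enumerate(labels):
        if cls != 0:
            by_class.setdefault(cls, []).append(idx)

    # Remove classes which fall below the minimum number of samples.
    kept = {cls: idxs for cls, idxs in by_class.items() if len(idxs) >= min_samples}

    balanced = [0] * len(labels)
    if not kept:
        return balanced

    # Sample size: the smallest remaining class, capped at the maximum.
    n_samples = min(min(len(idxs) for idxs in kept.values()), max_samples)
    for cls, idxs in kept.items():
        for idx in rng.sample(idxs, n_samples):
            balanced[idx] = cls
    return balanced


# Hypothetical training column: 8 clumps of class 1, 5 of class 2, 1 of class 3.
labels = [1] * 8 + [2] * 5 + [3] * 1 + [0] * 4
balanced = balance_training(labels, min_samples=2, max_samples=4)
counts = {cls: balanced.count(cls) for cls in (1, 2, 3)}
print(counts)
```

In this example class 3 is removed because it has fewer than two samples, and classes 1 and 2 are each reduced to four samples because the smallest remaining class (five samples) exceeds the maximum of four.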
Classify Functions
rsgislib.classification.classratutils.classifyWithinRAT(clumpsImg, classesIntCol, classesNameCol, variables, classifier=RandomForestClassifier(max_features=3, n_jobs=-1, oob_score=True), outColInt='OutClass', outColStr='OutClassName', roiCol=None, roiVal=1, classColours=None, preProcessor=None, justFit=False)
A function which will perform a classification within the RAT using a classifier from scikit-learn.
- Parameters
clumpsImg – is the clumps image on which the classification is to be performed
classesIntCol – is the column with the training data as int values
classesNameCol – is the column with the training data as string class names
variables – is an array of column names which are to be used for the classification
classifier – is an instance of a scikit-learn classifier (e.g., RandomForests which is Default)
outColInt – is the output column name for the int class representation (Default: ‘OutClass’)
outColStr – is the output column name for the class names column (Default: ‘OutClassName’)
roiCol – is a column name for a column which specifies the region to be classified. If None ignored (Default: None)
roiVal – is an int value used within the roiCol to select a region to be classified (Default: 1)
classColours – is a python dict using the class name as the key along with arrays of length 3 specifying the RGB colours for the class.
preProcessor – is a scikit-learn preprocessor such as sklearn.preprocessing.MaxAbsScaler() which can rescale the input variables independently as they are read in (Default: None; i.e., not in use).
justFit – is a boolean specifying that the classifier should just be fitted to the data and not applied (Default: False; i.e., apply classification)
Example:
from sklearn.ensemble import ExtraTreesClassifier
from rsgislib.classification import classratutils
# clumpsImg, classesIntCol and classesNameCol are assumed to be defined already
classifier = ExtraTreesClassifier(n_estimators=100, max_features=3, n_jobs=-1, verbose=0)
classColours = dict()
classColours['Forest'] = [0,138,0]
classColours['NonForest'] = [200,200,200]
variables = ['GreenAvg', 'RedAvg', 'NIR1Avg', 'NIR2Avg', 'NDVI']
classratutils.classifyWithinRAT(clumpsImg, classesIntCol, classesNameCol, variables, classifier=classifier, classColours=classColours)
# With pre-processor
from sklearn.preprocessing import MaxAbsScaler
classratutils.classifyWithinRAT(clumpsImg, classesIntCol, classesNameCol, variables, classifier=classifier, classColours=classColours, preProcessor=MaxAbsScaler())
rsgislib.classification.classratutils.classifyWithinRATTiled(clumpsImg, classesIntCol, classesNameCol, variables, classifier=RandomForestClassifier(max_features=3, n_jobs=-1, oob_score=True), outColInt='OutClass', outColStr='OutClassName', roiCol=None, roiVal=1, classColours=None, scaleVarsRange=False, justFit=False)
A function which will perform a classification within the RAT using a classifier from scikit-learn, using the rios ratapplier interface, which allows very large RATs to be processed.
- Parameters
clumpsImg – is the clumps image on which the classification is to be performed
classesIntCol – is the column with the training data as int values
classesNameCol – is the column with the training data as string class names
variables – is an array of column names which are to be used for the classification
classifier – is an instance of a scikit-learn classifier (e.g., RandomForests which is Default)
outColInt – is the output column name for the int class representation (Default: ‘OutClass’)
outColStr – is the output column name for the class names column (Default: ‘OutClassName’)
roiCol – is a column name for a column which specifies the region to be classified. If None ignored (Default: None)
roiVal – is an int value used within the roiCol to select a region to be classified (Default: 1)
classColours – is a python dict using the class name as the key along with arrays of length 3 specifying the RGB colours for the class.
scaleVarsRange – will rescale each variable independently to a range of 0-1 (default: False).
justFit – is a boolean specifying that the classifier should just be fitted to the data and not applied (Default: False; i.e., apply classification)
Example:
from sklearn.ensemble import ExtraTreesClassifier
from rsgislib.classification import classratutils
# clumpsImg, classesIntCol and classesNameCol are assumed to be defined already
classifier = ExtraTreesClassifier(n_estimators=100, max_features=3, n_jobs=-1, verbose=0)
classColours = dict()
classColours['Forest'] = [0,138,0]
classColours['NonForest'] = [200,200,200]
variables = ['GreenAvg', 'RedAvg', 'NIR1Avg', 'NIR2Avg', 'NDVI']
classratutils.classifyWithinRATTiled(clumpsImg, classesIntCol, classesNameCol, variables, classifier=classifier, classColours=classColours)
# With range scaling
classratutils.classifyWithinRATTiled(clumpsImg, classesIntCol, classesNameCol, variables, classifier=classifier, classColours=classColours, scaleVarsRange=True)
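The scaleVarsRange option is described as rescaling each variable independently to a range of 0-1. A minimal sketch of that kind of min-max rescaling is given below; the `scale_to_unit_range` helper and the example columns are hypothetical, and the sketch scales over a whole column at once, which may differ from how RSGISLib derives the range internally.

```python
def scale_to_unit_range(column):
    """Linearly rescale a list of numbers so the minimum maps to 0 and the maximum to 1."""
    low, high = min(column), max(column)
    if high == low:
        # A constant column has no range to stretch; map every value to 0.
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]


# Hypothetical RAT columns with very different value ranges.
red_avg = [100, 150, 200, 300]
nir_avg = [1000, 2000, 3000, 5000]

scaled_red = scale_to_unit_range(red_avg)
scaled_nir = scale_to_unit_range(nir_avg)
print(scaled_red)
print(scaled_nir)
```

Because each variable is scaled on its own, both columns end up spanning 0 to 1 regardless of their original units, which matters for classifiers that are sensitive to the relative magnitude of the input variables.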
rsgislib.classification.classratutils.clusterWithinRAT(clumpsImg, variables, clusterer=MiniBatchKMeans(), outColInt='OutCluster', roiCol=None, roiVal=1, clrClusters=True, clrSeed=10, addConnectivity=False, preProcessor=None)
A function which will perform a clustering within the RAT using a clustering algorithm from scikit-learn.
- Parameters
clumpsImg – is the clumps image on which the classification is to be performed.
variables – is an array of column names which are to be used for the clustering.
clusterer – is an instance of a scikit-learn clusterer (e.g., MiniBatchKMeans which is Default; Note with 8 clusters).
outColInt – is the output column name identifying the clusters (Default: ‘OutCluster’).
roiCol – is a column name for a column which specifies the region to be clustered. If None ignored (Default: None).
roiVal – is an int value used within the roiCol to select a region to be clustered (Default: 1).
clrClusters – is a boolean specifying whether the colour table should be updated to correspond to the clusters (Default: True).
clrSeed – is an integer seeding the random generator used to generate the colours (Default=10; if None provided system time used).
addConnectivity – is a boolean which adds a kneighbors_graph to the clusterer (just an option for the AgglomerativeClustering algorithm)
preProcessor – is a scikit-learn preprocessor such as sklearn.preprocessing.MaxAbsScaler() which can rescale the input variables independently as they are read in (Default: None; i.e., not in use).
Example:
from rsgislib.classification import classratutils
from sklearn.cluster import DBSCAN
sklearnClusterer = DBSCAN(eps=1, min_samples=50)
classratutils.clusterWithinRAT('MangroveClumps.kea', ['MinX', 'MinY'], clusterer=sklearnClusterer, outColInt="OutCluster", roiCol=None, roiVal=1, clrClusters=True, clrSeed=10, addConnectivity=False)
# With pre-processor
from sklearn.preprocessing import MaxAbsScaler
classratutils.clusterWithinRAT('MangroveClumps.kea', ['MinX', 'MinY'], clusterer=sklearnClusterer, outColInt="OutCluster", roiCol=None, roiVal=1, clrClusters=True, clrSeed=10, addConnectivity=False, preProcessor=MaxAbsScaler())
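For reference, the MaxAbsScaler pre-processor used above divides each variable by its largest absolute value, so the values fall within -1 to 1 without being shifted. The following pure-Python sketch shows the arithmetic for a single column; the `max_abs_scale` helper and the example values are hypothetical and stand in for what scikit-learn does per feature.

```python
def max_abs_scale(column):
    """Divide every value by the largest absolute value in the column."""
    max_abs = max(abs(value) for value in column)
    if max_abs == 0:
        # An all-zero column is left unchanged to avoid dividing by zero.
        return list(column)
    return [value / max_abs for value in column]


# Hypothetical 'MinX' values for three clumps.
min_x = [-200.0, 100.0, 400.0]
scaled = max_abs_scale(min_x)
print(scaled)
```

Unlike rescaling to a 0-1 range, this keeps the sign of each value and leaves zero at zero, which can be preferable for distance-based clusterers such as DBSCAN when the variables have very different magnitudes.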