RSGISLib Classification Module

The classification module provides classification functionality within RSGISLib.

Image Pixel Classification

class rsgislib.classification.classimgutils.ClassInfoObj(id=None, fileH5=None, red=None, green=None, blue=None)

A class to store the information associated with each class in the classification.

  • id - Output pixel value for this class
  • fileH5 - hdf5 file (from rsgislib.imageutils.extractZoneImageBandValues2HDF) with the training data for the class
  • red - Red colour for visualisation (0-255)
  • green - Green colour for visualisation (0-255)
  • blue - Blue colour for visualisation (0-255)

rsgislib.classification.classimgutils.applyClassifer(classTrainInfo, skClassifier, imgMask, imgMaskVal, imgFileInfo, outputImg, gdalFormat, classClrNames=True)

This function uses a trained classifier and applies it to the provided input image.

  • classTrainInfo - dict (where the key is the class name) of ClassInfoObj objects, as used to train the classifier (i.e., trainClassifier()); it provides the output pixel value (id) and the RGB colour for each class.
  • skClassifier - a trained instance of a scikit-learn classifier (e.g., use trainClassifier or findClassifierParametersAndTrain)
  • imgMask - an image file providing a mask which specifies where the classification should be applied. The simplest mask is all the valid data regions (rsgislib.imageutils.genValidMask)
  • imgMaskVal - the pixel value within the imgMask to limit the region to which the classification is applied. Can be used to create a hierarchical classification.
  • imgFileInfo - a list of rsgislib.imageutils.ImageBandInfo objects (also used within rsgislib.imageutils.extractZoneImageBandValues2HDF) to identify which images and bands are to be used for the classification so it adheres to the training data.
  • outputImg - output image file with the classification. Note that by default a colour table and a class names column are added to the image. If an error is produced, use the HFA or KEA formats.
  • gdalFormat - the output image format; any format supported by GDAL can be specified.
  • classClrNames - default is True, in which case a colour table with the colours specified in classTrainInfo and a ClassName column (from imgFileInfo) will be added to the output file.

rsgislib.classification.classimgutils.findClassifierParametersAndTrain(classTrainInfo, paramSearchSampNum=0, gridSearch=GridSearchCV(cv=None, error_score='raise', estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), fit_params={}, iid=True, n_jobs=1, param_grid={}, pre_dispatch='2*n_jobs', refit=True, return_train_score=True, scoring=None, verbose=0))

A function to find the optimal parameters for classification using a Grid Search (http://scikit-learn.org/stable/modules/grid_search.html). The returned classifier instance will be trained using the input data.

  • classTrainInfo - list of ClassInfoObj objects which will be used to train the classifier.
  • paramSearchSampNum - the number of samples that will be randomly sampled from the training data for each class when applying the grid search (a small sample is typically used because the search can take a long time). A value of 500 would use 500 samples per class.
  • gridSearch - is an instance of sklearn.model_selection.GridSearchCV with an instance of the chosen classifier and the parameters to be searched.
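
To give a feel for why a small paramSearchSampNum is useful, the following sketch counts the parameter combinations a grid search has to evaluate. It uses only the Python standard library; the helper name expand_grid and the grid values are illustrative assumptions and are not part of RSGISLib or scikit-learn.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every parameter combination in a {name: [values]} grid."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[n] for n in names))]


# Hypothetical search space for a random forest classifier.
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [2, 4, 8, None]}
combos = expand_grid(param_grid)
print(len(combos))  # 12
```

Each combination is fitted once per cross-validation fold, so the cost grows quickly with both the grid size and the number of training samples.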

rsgislib.classification.classimgutils.performPerPxlMLClassShpTrain(imageBandInfo=[], classInfo={}, outputImg='classImg.kea', gdalFormat='KEA', tmpPath='./tmp', skClassifier=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), gridSearch=None, paramSearchSampNum=100)

A function which performs a per-pixel based classification of a scene using a machine learning classifier from the scikit-learn library where a single polygon shapefile per class is required to represent the training data.

  • imageBandInfo is a list of rsgislib.imageutils.ImageBandInfo objects specifying the images which should be used.
  • classInfo is a dict of rsgislib.classification.classimgutils.ClassInfoObj objects where the key is the class name. The fileH5 field is used to define the file path to the shapefile with the training data.
  • outputImg is the name and path to the output image file.
  • gdalFormat is the output image file format (e.g., KEA).
  • tmpPath is a temporary file path which can be used during processing.
  • skClassifier is an instance of a scikit-learn classifier appropriately parameterised. If None then the gridSearch object must not be None.
  • gridSearch is an instance of a scikit-learn sklearn.model_selection.GridSearchCV object with the classifier and parameter search space specified. (If None then skClassifier will be used; if both are not None then skClassifier will be used in preference to gridSearch)

Example:

from rsgislib.classification import classimgutils
from rsgislib import imageutils

from sklearn.ensemble import ExtraTreesClassifier

# Input image and the bands to be used for the classification.
imageBandInfo = [imageutils.ImageBandInfo('./LS2MSS_19750620_lat10lon6493_r67p250_rad_srefdem_30m.kea', 'Landsat', [1,2,3,4])]

# One ClassInfoObj per class; here fileH5 holds the path to the training shapefile.
classInfo = dict()
classInfo['Forest'] = classimgutils.ClassInfoObj(id=1, fileH5='./ForestRegions.shp', red=0, green=255, blue=0)
classInfo['Non-Forest'] = classimgutils.ClassInfoObj(id=2, fileH5='./NonForestRegions.shp', red=100, green=100, blue=100)

# Train the classifier and apply it to the image.
skClassifier = ExtraTreesClassifier(n_estimators=20)
classimgutils.performPerPxlMLClassShpTrain(imageBandInfo, classInfo, outputImg='classImg.kea', gdalFormat='KEA', tmpPath='./tmp', skClassifier=skClassifier)

rsgislib.classification.classimgutils.trainClassifier(classTrainInfo, skClassifier)

This function trains the classifier.

Raster GIS

rsgislib.classification.classratutils.balanceSampleTrainingRandom(clumpsImg, trainCol, outTrainCol, minNoSamples, maxNoSamples)

A function to balance the number of training samples for classification so the number is above a minimum threshold (minNoSamples) and all equal to the class with the smallest number of samples unless that is above a set maximum (maxNoSamples).

  • clumpsImg is a string with the file path to the input image with RAT.
  • trainCol is a string for the name of the input column specifying the training samples (zero is no data).
  • outTrainCol is a string with the name of the output column for the balanced training samples.
  • minNoSamples is an int specifying the minimum number of training samples for a class (if below the threshold the class is removed).
  • maxNoSamples is an int specifying the maximum number of training samples per class.
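
The balancing rule described above can be sketched in plain Python as follows. This is an illustration of the rule only, under the assumption that samples are drawn at random without replacement; the function name balance_samples and its arguments are hypothetical and do not reflect how RSGISLib implements the function internally.

```python
import random


def balance_samples(samples_by_class, min_samples, max_samples, seed=0):
    """Balance per-class sample lists following the rule described above."""
    # Classes with fewer than min_samples samples are removed.
    kept = {c: s for c, s in samples_by_class.items() if len(s) >= min_samples}
    if not kept:
        return {}
    # Every remaining class is cut to the size of the smallest class,
    # capped at max_samples.
    target = min(min(len(s) for s in kept.values()), max_samples)
    rng = random.Random(seed)
    return {c: sorted(rng.sample(s, target)) for c, s in kept.items()}


samples = {
    'Forest': list(range(500)),
    'Water': list(range(500, 620)),
    'Urban': list(range(620, 625)),
}
balanced = balance_samples(samples, min_samples=50, max_samples=100)
print({c: len(s) for c, s in balanced.items()})  # {'Forest': 100, 'Water': 100}
```

Here 'Urban' is dropped because it has fewer than 50 samples, and the smallest remaining class (120 samples) exceeds the maximum, so both classes are reduced to 100.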

rsgislib.classification.classratutils.classifyWithinRAT(clumpsImg, classesIntCol, classesNameCol, variables, classifier=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features=3, max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=True, random_state=None, verbose=0, warm_start=False), outColInt='OutClass', outColStr='OutClassName', roiCol=None, roiVal=1, classColours=None, preProcessor=None, justFit=False)

A function which will perform a classification within the RAT using a classifier from scikit-learn

  • clumpsImg is the clumps image on which the classification is to be performed
  • classesIntCol is the column with the training data as int values
  • classesNameCol is the column with the training data as string class names
  • variables is an array of column names which are to be used for the classification
  • classifier is an instance of a scikit-learn classifier (e.g., RandomForests which is Default)
  • outColInt is the output column name for the int class representation (Default: ‘OutClass’)
  • outColStr is the output column name for the class names column (Default: ‘OutClassName’)
  • roiCol is a column name for a column which specifies the region to be classified. If None ignored (Default: None)
  • roiVal is an int value used within the roiCol to select a region to be classified (Default: 1)
  • classColours is a python dict using the class name as the key along with arrays of length 3 specifying the RGB colours for the class.
  • preProcessor is a scikit-learn preprocessor such as sklearn.preprocessing.MaxAbsScaler() which can rescale the input variables independently as they are read in (Default: None; i.e., not in use).
  • justFit is a boolean specifying that the classifier should just be fitted to the data and not applied (Default: False; i.e., apply classification)

Example:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import MaxAbsScaler
from rsgislib.classification import classratutils

# clumpsImg, classesIntCol and classesNameCol are assumed to be defined already
# (the clumps image path and the names of the training data columns).
classifier = ExtraTreesClassifier(n_estimators=100, max_features=3, n_jobs=-1, verbose=0)

classColours = dict()
classColours['Forest'] = [0,138,0]
classColours['NonForest'] = [200,200,200]

variables = ['GreenAvg', 'RedAvg', 'NIR1Avg', 'NIR2Avg', 'NDVI']
classratutils.classifyWithinRAT(clumpsImg, classesIntCol, classesNameCol, variables, classifier=classifier, classColours=classColours)

# With pre-processor
classratutils.classifyWithinRAT(clumpsImg, classesIntCol, classesNameCol, variables, classifier=classifier, classColours=classColours, preProcessor=MaxAbsScaler())

rsgislib.classification.classratutils.classifyWithinRATTiled(clumpsImg, classesIntCol, classesNameCol, variables, classifier=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features=3, max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=True, random_state=None, verbose=0, warm_start=False), outColInt='OutClass', outColStr='OutClassName', roiCol=None, roiVal=1, classColours=None, scaleVarsRange=False, justFit=False)

A function which will perform a classification within the RAT using a classifier from scikit-learn using the rios ratapplier interface allowing very large RATs to be processed.

  • clumpsImg is the clumps image on which the classification is to be performed
  • classesIntCol is the column with the training data as int values
  • classesNameCol is the column with the training data as string class names
  • variables is an array of column names which are to be used for the classification
  • classifier is an instance of a scikit-learn classifier (e.g., RandomForests which is Default)
  • outColInt is the output column name for the int class representation (Default: ‘OutClass’)
  • outColStr is the output column name for the class names column (Default: ‘OutClassName’)
  • roiCol is a column name for a column which specifies the region to be classified. If None ignored (Default: None)
  • roiVal is an int value used within the roiCol to select a region to be classified (Default: 1)
  • classColours is a python dict using the class name as the key along with arrays of length 3 specifying the RGB colours for the class.
  • scaleVarsRange will rescale each variable independently to a range of 0-1 (default: False).
  • justFit is a boolean specifying that the classifier should just be fitted to the data and not applied (Default: False; i.e., apply classification)
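
As an illustration of what rescaling each variable independently to a range of 0-1 means, the sketch below applies a standard min-max rescaling to two columns. It is a generic, assumed formulation written with the standard library only; the function name scale_to_unit_range is hypothetical and the exact computation used by scaleVarsRange is not confirmed here.

```python
def scale_to_unit_range(columns):
    """Rescale each variable (column) independently to the range 0-1."""
    scaled = {}
    for name, values in columns.items():
        lo, hi = min(values), max(values)
        span = hi - lo
        # A constant column has no range; map it to 0 to avoid dividing by zero.
        scaled[name] = [(v - lo) / span if span else 0.0 for v in values]
    return scaled


cols = {'NDVI': [-0.2, 0.3, 0.8], 'RedAvg': [100, 150, 300]}
print(scale_to_unit_range(cols)['RedAvg'])  # [0.0, 0.25, 1.0]
```

Rescaling like this stops variables with large numeric ranges (e.g., reflectance counts) from dominating variables with small ranges (e.g., NDVI) in distance-based classifiers.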

Example:

from sklearn.ensemble import ExtraTreesClassifier
from rsgislib.classification import classratutils

# clumpsImg, classesIntCol and classesNameCol are assumed to be defined already
# (the clumps image path and the names of the training data columns).
classifier = ExtraTreesClassifier(n_estimators=100, max_features=3, n_jobs=-1, verbose=0)

classColours = dict()
classColours['Forest'] = [0,138,0]
classColours['NonForest'] = [200,200,200]

variables = ['GreenAvg', 'RedAvg', 'NIR1Avg', 'NIR2Avg', 'NDVI']
classratutils.classifyWithinRATTiled(clumpsImg, classesIntCol, classesNameCol, variables, classifier=classifier, classColours=classColours)

# With range scaling
classratutils.classifyWithinRATTiled(clumpsImg, classesIntCol, classesNameCol, variables, classifier=classifier, classColours=classColours, scaleVarsRange=True)

rsgislib.classification.classratutils.clusterWithinRAT(clumpsImg, variables, clusterer=MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++', init_size=None, max_iter=100, max_no_improvement=10, n_clusters=8, n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0, verbose=0), outColInt='OutCluster', roiCol=None, roiVal=1, clrClusters=True, clrSeed=10, addConnectivity=False, preProcessor=None)

A function which will perform a clustering within the RAT using a clustering algorithm from scikit-learn

  • clumpsImg is the clumps image on which the classification is to be performed.
  • variables is an array of column names which are to be used for the clustering.
  • clusterer is an instance of a scikit-learn clusterer (e.g., MiniBatchKMeans which is Default; Note with 8 clusters).
  • outColInt is the output column name identifying the clusters (Default: ‘OutCluster’).
  • roiCol is a column name for a column which specifies the region to be clustered. If None ignored (Default: None).
  • roiVal is an int value used within the roiCol to select a region to be clustered (Default: 1).
  • clrClusters is a boolean specifying whether the colour table should be updated to correspond to the clusters (Default: True).
  • clrSeed is an integer seeding the random generator used to generate the colours (Default: 10; if None is provided the system time is used).
  • addConnectivity is a boolean which adds a kneighbors_graph to the clusterer (just an option for the AgglomerativeClustering algorithm).
  • preProcessor is a scikit-learn preprocessor such as sklearn.preprocessing.MaxAbsScaler() which can rescale the input variables independently as they are read in (Default: None; i.e., not in use).

Example:

from rsgislib.classification import classratutils
from sklearn.cluster import DBSCAN

sklearnClusterer = DBSCAN(eps=1, min_samples=50)
classratutils.clusterWithinRAT('MangroveClumps.kea', ['MinX', 'MinY'], clusterer=sklearnClusterer, outColInt="OutCluster", roiCol=None, roiVal=1, clrClusters=True, clrSeed=10, addConnectivity=False)

# With pre-processor
from sklearn.preprocessing import MaxAbsScaler
classratutils.clusterWithinRAT('MangroveClumps.kea', ['MinX', 'MinY'], clusterer=sklearnClusterer, outColInt="OutCluster", roiCol=None, roiVal=1, clrClusters=True, clrSeed=10, addConnectivity=False, preProcessor=MaxAbsScaler())

rsgislib.classification.classratutils.findClassifierParameters(clumpsImg, classesIntCol, variables, preProcessor=None, gridSearch=GridSearchCV(cv=None, error_score='raise', estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), fit_params={}, iid=True, n_jobs=1, param_grid={}, pre_dispatch='2*n_jobs', refit=True, return_train_score=True, scoring=None, verbose=0))

Find the optimal parameters for a classifier using a grid search and return a classifier instance with those optimal parameters.

  • clumpsImg is the clumps image on which the classification is to be performed
  • classesIntCol is the column with the training data as int values
  • variables is an array of column names which are to be used for the classification
  • preProcessor is a scikit-learn preprocessor such as sklearn.preprocessing.MaxAbsScaler() which can rescale the input variables independently as they are read in (Default: None; i.e., not in use).
  • gridSearch is an instance of GridSearchCV parameterised with a classifier and parameters to be searched.

return:

  • Instance of the classifier with optimal parameters defined.

Example:

from rsgislib.classification import classratutils
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MaxAbsScaler

clumpsImg = "./LS8_20150621_lat10lon652_r67p233_clumps.kea"
classesIntCol = 'ClassInt'

# Parameter values to search. class_weight uses None (not an empty string) for
# unweighted classes, and decision_function_shape is limited to 'ovo' and 'ovr',
# which are the values accepted by current scikit-learn releases.
classParameters = {'kernel':['linear', 'rbf', 'poly', 'sigmoid'],
                   'C':[1, 2, 3, 4, 5, 10, 100, 400, 500, 1e3, 5e3, 1e4, 5e4, 1e5],
                   'gamma':[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 'auto'],
                   'degree':[2, 3, 4, 5, 6, 7, 8],
                   'class_weight':[None, 'balanced'],
                   'decision_function_shape':['ovo', 'ovr']}
variables = ['BlueRefl', 'GreenRefl', 'RedRefl', 'NIRRefl', 'SWIR1Refl', 'SWIR2Refl']

# Search the parameter space and return a classifier with the optimal parameters.
gSearch = GridSearchCV(SVC(), classParameters)
classifier = classratutils.findClassifierParameters(clumpsImg, classesIntCol, variables, preProcessor=MaxAbsScaler(), gridSearch=gSearch)

rsgislib.classification.collapseClasses(inputimage, outputimage, gdalformat, classColumn, classIntCol)

Collapses an attribute table with a large number of classified clumps (segments) to an attribute table with a single row per class (i.e., a classification rather than a segmentation).

Where:

  • inputImage is a string containing the name and path of the input file with attribute table.
  • outputImage is a string containing the name and path of the output file.
  • gdalformat is a string with the output image format for the GDAL driver.
  • classColumn is a string with the name of the column with the class names - internally this will be treated as a string column even if a numerical column is specified.
  • classIntCol is a string specifying the name of a column with the integer class representation. This is an optional parameter but if specified then the int representation of the classes will be preserved.
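
The idea of collapsing one row per clump into one row per class can be sketched with plain dictionaries, as below. This is a conceptual illustration only: the function name collapse_classes, the column names and the n_clumps count are hypothetical, and the real function operates on the image attribute table rather than Python lists.

```python
def collapse_classes(clump_rows, class_col, class_int_col=None):
    """Collapse per-clump rows to a single row per class name."""
    collapsed = {}
    for row in clump_rows:
        name = str(row[class_col])  # class names are treated as strings
        if name not in collapsed:
            entry = {class_col: name, 'n_clumps': 0}
            if class_int_col is not None:
                # Keep the integer representation of the class.
                entry[class_int_col] = row[class_int_col]
            collapsed[name] = entry
        collapsed[name]['n_clumps'] += 1
    return list(collapsed.values())


rows = [
    {'ClassName': 'Forest', 'ClassInt': 1},
    {'ClassName': 'Water', 'ClassInt': 2},
    {'ClassName': 'Forest', 'ClassInt': 1},
]
for row in collapse_classes(rows, 'ClassName', 'ClassInt'):
    print(row)
```

Three clump rows become two class rows, with the integer id of each class carried over.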

rsgislib.classification.colour3bands(inputimage, outputimage, gdalformat)

Generates a 3 band colour image from the colour table in the input file.

Where:

  • inputImage is a string containing the name and path of the input file with attribute table.
  • outputImage is a string containing the name and path of the output file.
  • gdalformat is a string with the output image format for the GDAL driver.
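
Conceptually, producing a 3 band colour image from a colour table is a lookup of each pixel value in the table. The sketch below shows that lookup on a tiny nested-list image; the function name colour_three_bands and the data are hypothetical, and it makes no claim about how the library reads or writes image files.

```python
def colour_three_bands(class_image, colour_table):
    """Map each pixel's class id to separate red, green and blue bands."""
    bands = ([], [], [])
    for row in class_image:
        rgb_row = [colour_table[pixel] for pixel in row]
        for channel, band in enumerate(bands):
            band.append([rgb[channel] for rgb in rgb_row])
    return bands


class_image = [[1, 2], [2, 1]]
colour_table = {1: (0, 255, 0), 2: (100, 100, 100)}
red, green, blue = colour_three_bands(class_image, colour_table)
print(green)  # [[255, 100], [100, 255]]
```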

Accuracy Assessment

rsgislib.classification.generateRandomAccuracyPts(inputImage, outputShp, classImgCol, classImgVecCol, classRefVecCol, numPts, seed, force)

Generates a set of random points for accuracy assessment.

Where:

  • inputImage is a string containing the name and path of the input image with attribute table.
  • outputShp is a string containing the name and path of the output shapefile.
  • classImgCol is a string specifying the name of the column in the image file containing the class names.
  • classImgVecCol is a string specifying the output column in the shapefile for the classified class names.
  • classRefVecCol is a string specifying an output column in the shapefile which can be used in the accuracy assessment for the reference data.
  • numPts is an int specifying the total number of points which should be created.
  • seed is an int specifying the seed for the random number generator. (Optional: Default 10)
  • force is a bool, specifying whether to force removal of the output vector if it exists. (Optional: Default False)

rsgislib.classification.generateStratifiedRandomAccuracyPts(inputImage, outputShp, classImgCol, classImgVecCol, classRefVecCol, numPts, seed, force)

Generates a set of stratified random points for accuracy assessment.

Where:

  • inputImage is a string containing the name and path of the input image with attribute table.
  • outputShp is a string containing the name and path of the output shapefile.
  • classImgCol is a string specifying the name of the column in the image file containing the class names.
  • classImgVecCol is a string specifying the output column in the shapefile for the classified class names.
  • classRefVecCol is a string specifying an output column in the shapefile which can be used in the accuracy assessment for the reference data.
  • numPts is an int specifying the number of points for each class which should be created.
  • seed is an int specifying the seed for the random number generator. (Optional: Default 10)
  • force is a bool, specifying whether to force removal of the output vector if it exists. (Optional: Default False)
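
The difference from simple random sampling is that numPts applies to each class rather than to the image as a whole. A minimal sketch of stratified sampling on a nested-list class image is given below; the function name stratified_random_points is hypothetical and the sampling details (without replacement, capped at the class size) are assumptions rather than the library's documented behaviour.

```python
import random


def stratified_random_points(class_image, num_pts, seed=10):
    """Pick up to num_pts random pixel locations from each class."""
    by_class = {}
    for y, row in enumerate(class_image):
        for x, cls in enumerate(row):
            by_class.setdefault(cls, []).append((x, y))
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return {cls: rng.sample(pixels, min(num_pts, len(pixels)))
            for cls, pixels in sorted(by_class.items())}


class_image = [
    [1, 1, 1, 2],
    [1, 1, 2, 2],
    [1, 3, 2, 2],
]
points = stratified_random_points(class_image, num_pts=2)
print({cls: len(pts) for cls, pts in points.items()})  # {1: 2, 2: 2, 3: 1}
```

Class 3 has only one pixel, so it contributes a single point even though two were requested per class.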

rsgislib.classification.generateTransectAccuracyPts(inputImage, inputLinesShp, outputPtsShp, classImgCol, classImgVecCol, classRefVecCol, lineStep, force=False)

A tool for converting a set of lines into point transects and populating the points with the information needed to undertake an accuracy assessment.

Where:

  • inputImage is a string specifying the input image file with classification.
  • inputLinesShp is a string specifying the input lines shapefile path.
  • outputPtsShp is a string specifying the output points shapefile path.
  • classImgCol is a string specifying the name of the column in the image file containing the class names.
  • classImgVecCol is a string specifying the output column in the shapefile for the classified class names.
  • classRefVecCol is an optional string specifying an output column in the shapefile which can be used in the accuracy assessment for the reference data.
  • lineStep is a double specifying the step along the lines between the points.
  • force is an optional boolean specifying whether the output shapefile should be deleted if it already exists (if True it will be deleted; Default is False).
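
The role of lineStep can be illustrated by walking along a polyline and emitting a point every lineStep units. The sketch below assumes the first point sits on the first vertex of the line and that distance carries over between segments; both are assumptions, and the function name transect_points is hypothetical.

```python
import math


def transect_points(line, step):
    """Return points spaced `step` apart along a polyline, starting at its first vertex."""
    if step <= 0:
        raise ValueError('step must be positive')
    points = [line[0]]
    dist_to_next = step  # distance left to travel before the next point
    for (x0, y0), (x1, y1) in zip(line, line[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0  # distance already travelled along this segment
        while seg_len - pos >= dist_to_next:
            pos += dist_to_next
            t = pos / seg_len
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            dist_to_next = step
        # Carry the unused remainder of this segment over to the next one.
        dist_to_next -= seg_len - pos
    return points


line = [(0, 0), (10, 0), (10, 5)]
print(transect_points(line, step=4))
# [(0, 0), (4.0, 0.0), (8.0, 0.0), (10.0, 2.0)]
```

The last point falls 2 units up the second segment because 2 units of the 4 unit step were already used at the end of the first segment.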

rsgislib.classification.popClassInfoAccuracyPts(inputImage, inputShp, classImgCol, classImgVecCol, classRefVecCol)

Populates an existing set of points (inputShp) with the class information from the input image for accuracy assessment.

Where:

  • inputImage is a string containing the name and path of the input image with attribute table.
  • inputShp is a string containing the name and path of the input shapefile.
  • classImgCol is a string specifying the name of the column in the image file containing the class names.
  • classImgVecCol is a string specifying the output column in the shapefile for the classified class names.
  • classRefVecCol is an optional string specifying an output column in the shapefile which can be used in the accuracy assessment for the reference data.
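
Once the points carry both a classified class name and a reference class name, an accuracy assessment reduces to comparing the two columns. The sketch below is one simple way to do that with the standard library; it is not part of RSGISLib, and the function name confusion_counts and the column names 'ClassImg' and 'ClassRef' are hypothetical.

```python
from collections import Counter


def confusion_counts(points, class_col, ref_col):
    """Count (reference, classified) pairs and compute the overall accuracy."""
    if not points:
        raise ValueError('at least one point is required')
    pairs = Counter((p[ref_col], p[class_col]) for p in points)
    correct = sum(n for (ref, cls), n in pairs.items() if ref == cls)
    return pairs, correct / len(points)


points = [
    {'ClassImg': 'Forest', 'ClassRef': 'Forest'},
    {'ClassImg': 'Forest', 'ClassRef': 'Non-Forest'},
    {'ClassImg': 'Non-Forest', 'ClassRef': 'Non-Forest'},
    {'ClassImg': 'Forest', 'ClassRef': 'Forest'},
]
pairs, overall = confusion_counts(points, 'ClassImg', 'ClassRef')
print(overall)  # 0.75
```

The pair counts form the cells of a confusion matrix, from which per-class measures can be derived as well as the overall accuracy.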