RSGISLib Scikit-Learn Unsupervised Pixel Classification Module

To use the unsupervised classification functions, you need to create an instance of a scikit-learn clusterer (https://scikit-learn.org/stable/modules/clustering.html). For example, mini-batch K-Means:

from sklearn.cluster import MiniBatchKMeans
sklclusterer = MiniBatchKMeans(n_clusters=60, init='k-means++', max_iter=100, batch_size=100)

You can then run one of the module functions. First, import the module and define the input and output images:

import rsgislib.classification.clustersklearn

input_img = "S2_UVD_27sept_27700_sub.kea"
output_img = "S2_UVD_27sept_27700_sub_clusters.kea"

Clustering using all the image pixels. This can be time and memory intensive, so it is normally used only for smaller datasets:

rsgislib.classification.clustersklearn.img_pixel_cluster(input_img, output_img, gdalformat='KEA', noDataVal=0, clusterer=sklclusterer)

If you have a larger image, you might want to use one of the other two functions. The first performs the clustering in tiles and therefore produces tile boundaries in the output. The second samples the input image, performs the clustering on the sample and then applies the result to the whole image.

Clustering using tiling:

rsgislib.classification.clustersklearn.img_pixel_tiled_cluster(input_img, output_img, gdalformat='KEA', noDataVal=0, clusterer=sklclusterer)

Clustering using samples. The imgSamp parameter specifies the sampling interval; in the example below every 100th pixel is used (i.e., a 1 percent sample):

rsgislib.classification.clustersklearn.img_pixel_sample_cluster(input_img, output_img, gdalformat='KEA', noDataVal=0, imgSamp=100, clusterer=sklclusterer)

Function Specifications

rsgislib.classification.clustersklearn.img_pixel_sample_cluster(inputImg, outputImg, gdalformat='KEA', noDataVal=0, imgSamp=100, clusterer=MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++', init_size=None, max_iter=100, max_no_improvement=10, n_clusters=60, n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0, verbose=0), calcStats=True, useMeanShiftEstBandWidth=False)

A function which allows a clustering to be performed using the algorithms available within the scikit-learn library. The clusterer is trained on a sample of the input image and then applied to the whole image using the predict function. This function is therefore only compatible with clusterers which implement predict.
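
Whether a given clusterer meets the predict requirement can be checked before a long run is started. The following is a minimal sketch rather than part of RSGISLib: the helper name supports_predict and the two stand-in classes are hypothetical and exist only to make the example self-contained.

```python
def supports_predict(clusterer):
    """Return True if the object exposes a callable predict() method."""
    return callable(getattr(clusterer, "predict", None))


# Hypothetical stand-ins for clusterers, used only to exercise the helper.
class WithPredict:
    def fit(self, X):
        return self

    def predict(self, X):
        # Assign every row to cluster 0; enough for a duck-typing check.
        return [0 for _ in X]


class FitOnly:
    def fit(self, X):
        return self


has_predict = supports_predict(WithPredict())
lacks_predict = supports_predict(FitOnly())
print(has_predict, lacks_predict)
```

With scikit-learn itself, MiniBatchKMeans and KMeans provide predict, whereas DBSCAN and AgglomerativeClustering do not, so the latter two would not pass this check.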

Parameters
  • inputImg – input image file.

  • outputImg – output image file.

  • gdalformat – output image file format.

  • noDataVal – no data value associated with the input image.

  • imgSamp – the input image sampling. (e.g., 100 is every 100th pixel)

  • clusterer – clusterer from scikit-learn which must have a predict function.

  • calcStats – calculate image pixel statistics, histogram and image pyramids. Note that if you are not using a KEA file, the output format needs to support raster attribute tables (RATs) for this option, as the histogram and colour table are written to the RAT.

  • useMeanShiftEstBandWidth – use the mean-shift algorithm as the clusterer (pass None as the clusterer) where the bandwidth is calculated from the data itself.
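
The sample-then-apply idea can be illustrated outside RSGISLib with plain NumPy and scikit-learn. This is a sketch under stated assumptions, not the library's implementation: the pixel array is synthetic, the helper name sample_then_predict is hypothetical, and the sample is taken with a simple stride, which ignores details such as no data handling.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def sample_then_predict(pixels, img_samp, clusterer):
    """Fit the clusterer on every img_samp-th pixel, then label all pixels."""
    sample = pixels[::img_samp]  # assumption: simple stride sampling
    clusterer.fit(sample)
    return clusterer.predict(pixels)


rng = np.random.default_rng(42)
# Synthetic stand-in for image data: 10,000 pixels x 4 bands in two
# well-separated groups (not real imagery).
pixels = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(5000, 4)),
    rng.normal(loc=10.0, scale=0.5, size=(5000, 4)),
])

clusterer = MiniBatchKMeans(n_clusters=2, n_init=3, batch_size=100, random_state=0)
labels = sample_then_predict(pixels, img_samp=100, clusterer=clusterer)
print(labels.shape)
```

Only 100 of the 10,000 rows are used for fitting here, which is where the saving in time and memory comes from; the predict step then assigns every pixel to the nearest fitted cluster centre.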

rsgislib.classification.clustersklearn.img_pixel_tiled_cluster(inputImg, outputImg, gdalformat='KEA', noDataVal=0, clusterer=MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++', init_size=None, max_iter=100, max_no_improvement=10, n_clusters=60, n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0, verbose=0), calcStats=True, useMeanShiftEstBandWidth=False, tileXSize=200, tileYSize=200)

A function which allows a clustering to be performed using the algorithms available within the scikit-learn library. The clusterer is applied to a single tile at a time and therefore produces tile boundaries in the result. However, memory usage is kept under control, whereas processing a whole image in one operation could require an excessive amount of memory.

Parameters
  • inputImg – input image file.

  • outputImg – output image file.

  • gdalformat – output image file format.

  • noDataVal – no data value associated with the input image.

  • clusterer – clusterer from scikit-learn which must have a predict function.

  • calcStats – calculate image pixel statistics, histogram and image pyramids. Note that if you are not using a KEA file, the output format needs to support raster attribute tables (RATs) for this option, as the histogram and colour table are written to the RAT.

  • useMeanShiftEstBandWidth – use the mean-shift algorithm as the clusterer (pass None as the clusterer) where the bandwidth is calculated from the data itself.

  • tileXSize – tile size in the x-axis in pixels.

  • tileYSize – tile size in the y-axis in pixels.
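
To make the tileXSize and tileYSize parameters concrete, the sketch below enumerates the pixel windows that a tiled pass over an image could visit. It is an illustration only: the function name tile_windows is hypothetical, and the order and edge handling used inside RSGISLib may differ.

```python
def tile_windows(x_size, y_size, tile_x_size=200, tile_y_size=200):
    """Yield (x_off, y_off, width, height) windows covering the image.

    Tiles on the right and bottom edges are clipped to the image extent.
    """
    for y_off in range(0, y_size, tile_y_size):
        for x_off in range(0, x_size, tile_x_size):
            width = min(tile_x_size, x_size - x_off)
            height = min(tile_y_size, y_size - y_off)
            yield x_off, y_off, width, height


# Hypothetical 500 x 300 pixel image with the default 200 x 200 tiles.
windows = list(tile_windows(500, 300))
for window in windows:
    print(window)
```

Because each window is clustered on its own, the clusters found in one tile are not tied to those found in its neighbours, which is consistent with the tile boundaries described above.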

rsgislib.classification.clustersklearn.img_pixel_cluster(inputImg, outputImg, gdalformat='KEA', noDataVal=0, clusterer=MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++', init_size=None, max_iter=100, max_no_improvement=10, n_clusters=60, n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0, verbose=0), calcStats=True, useMeanShiftEstBandWidth=False)

A function which allows a clustering to be performed using the algorithms available within the scikit-learn library. The clusterer is applied to the whole image in one operation and therefore requires the whole image to be loaded into memory. However, if there is sufficient memory, all the clustering algorithms within scikit-learn can be applied without boundary artifacts.

Parameters
  • inputImg – input image file.

  • outputImg – output image file.

  • gdalformat – output image file format.

  • noDataVal – no data value associated with the input image.

  • clusterer – clusterer from scikit-learn which must have a predict function.

  • calcStats – calculate image pixel statistics, histogram and image pyramids. Note that if you are not using a KEA file, the output format needs to support raster attribute tables (RATs) for this option, as the histogram and colour table are written to the RAT.

  • useMeanShiftEstBandWidth – use the mean-shift algorithm as the clusterer (pass None as the clusterer) where the bandwidth is calculated from the data itself.
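
A rough estimate of the memory needed just to hold the pixel values can help when deciding whether whole-image clustering is feasible. The figures below are a back-of-the-envelope sketch: the image dimensions are hypothetical, 4 bytes per value assumes 32-bit storage, and the clusterer's own working memory is not included.

```python
def image_memory_bytes(x_size, y_size, n_bands, bytes_per_value=4):
    """Bytes needed to hold every pixel value of the image in memory."""
    return x_size * y_size * n_bands * bytes_per_value


# Hypothetical 10980 x 10980 pixel image with 10 bands, 32-bit values.
full_image = image_memory_bytes(10980, 10980, 10)
# A single 200 x 200 tile of the same image, for comparison.
single_tile = image_memory_bytes(200, 200, 10)

print(f"whole image: {full_image / 1024**3:.2f} GiB")
print(f"single tile: {single_tile / 1024**2:.2f} MiB")
```

Under these assumptions the whole image needs several gibibytes while a single tile needs under two mebibytes, which is the trade-off between this function and the tiled and sampled alternatives.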