RSGISLib Scikit-Learn Unsupervised Pixel Classification

To use the unsupervised classification functions, first create an instance of a scikit-learn clusterer (https://scikit-learn.org/stable/modules/clustering.html). For example, mini-batch K-Means:

from sklearn.cluster import MiniBatchKMeans
sklclusterer = MiniBatchKMeans(n_clusters=60, init='k-means++', max_iter=100, batch_size=100)

You can then run one of the module functions. First, import the module and define the input and output images:

import rsgislib.classification.clustersklearn

input_img = "S2_UVD_27sept_27700_sub.kea"
output_img = "S2_UVD_27sept_27700_sub_clusters.kea"

Clustering using all the image pixels. This can be time and memory intensive, so it is normally only used for smaller datasets:

rsgislib.classification.clustersklearn.img_pixel_cluster(input_img, output_img, gdalformat='KEA', no_data_val=0, clusterer=sklclusterer)

If you have a larger image, you might want to use one of the other two functions. The first performs the clustering in tiles, and therefore produces tile boundaries in the result. The second samples the input image, fits the clusterer on that sample and then applies it to the whole image.

Clustering using tiling:

rsgislib.classification.clustersklearn.img_pixel_tiled_cluster(input_img, output_img, gdalformat='KEA', no_data_val=0, clusterer=sklclusterer)

Clustering using samples. The n_img_smpl parameter specifies the sampling interval; in the example below every 100th pixel is used (i.e., a 1 percent sample):

rsgislib.classification.clustersklearn.img_pixel_sample_cluster(input_img, output_img, gdalformat='KEA', no_data_val=0, n_img_smpl=100, clusterer=sklclusterer)
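
The sampled approach is documented below as requiring a clusterer with a predict function, and not every scikit-learn clusterer provides one. The following is a minimal sketch of how that could be checked before choosing a function; has_predict is a hypothetical helper introduced here for illustration and is not part of RSGISLib.

```python
from sklearn.cluster import (
    DBSCAN,
    AgglomerativeClustering,
    MeanShift,
    MiniBatchKMeans,
)


def has_predict(clusterer) -> bool:
    """Hypothetical helper: True if the estimator exposes a predict method."""
    return callable(getattr(clusterer, "predict", None))


# A few scikit-learn clusterers, created with arbitrary example settings.
candidates = {
    "MiniBatchKMeans": MiniBatchKMeans(n_clusters=60, batch_size=100),
    "MeanShift": MeanShift(),
    "DBSCAN": DBSCAN(),
    "AgglomerativeClustering": AgglomerativeClustering(n_clusters=60),
}

# Map each clusterer name to whether it can label unseen pixels via predict.
support = {name: has_predict(est) for name, est in candidates.items()}

for name, ok in support.items():
    print(f"{name}: predict available = {ok}")
```

Clusterers without predict (such as DBSCAN) can only label the data they were fitted on, so they cannot be trained on a sample and then applied to the remaining pixels.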

Pixel Clustering

rsgislib.classification.clustersklearn.img_pixel_sample_cluster(input_img: str, output_img: str, gdalformat: str = 'KEA', no_data_val: float = 0, n_img_smpl: int = 100, clusterer: BaseEstimator = MiniBatchKMeans(batch_size=100, n_clusters=60), calc_stats: bool = True, use_mean_shift_est_band_width: bool = False)

A function which allows a clustering to be performed using the algorithms available within the scikit-learn library. The clusterer is trained on a sample of the input image and then applied using the predict function (therefore this function is only compatible with clusterers which have the predict function implemented) to the whole image.

Parameters:
  • input_img – input image file.

  • output_img – output image file.

  • gdalformat – output image file format.

  • no_data_val – no data value associated with the input image.

  • n_img_smpl – the input image sampling interval (e.g., 100 is every 100th pixel).

  • clusterer – clusterer from scikit-learn which must have a predict function.

  • calc_stats – calculate image pixel statistics, histogram and image pyramids - note if you are not using a KEA file then the format needs to support RATs for this option as histogram and colour table are written to RAT.

  • use_mean_shift_est_band_width – use the mean-shift algorithm as the clusterer (pass None as the clusterer) where the bandwidth is calculated from the data itself.
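
The fit-on-a-sample, predict-on-everything idea described above can be sketched with scikit-learn and NumPy alone. This is an illustration of the general technique and not RSGISLib's implementation: the synthetic array stands in for an image, and the (bands, rows, columns) layout and the every-Nth-pixel slicing are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for an image: 4 bands of 200 x 200 pixels (assumption).
rng = np.random.default_rng(42)
img = rng.random((4, 200, 200)).astype(np.float32)

# Flatten to (n_pixels, n_bands), the layout scikit-learn expects.
pixels = img.reshape(img.shape[0], -1).T

# Keep every 100th pixel, analogous in spirit to n_img_smpl=100.
n_img_smpl = 100
sample = pixels[::n_img_smpl]

# Fit on the sample only.
clusterer = MiniBatchKMeans(n_clusters=5, batch_size=100, n_init=3, random_state=0)
clusterer.fit(sample)

# Apply the fitted model to every pixel and restore the image shape.
labels = clusterer.predict(pixels).reshape(img.shape[1], img.shape[2])

print(sample.shape, labels.shape)
```

Only the sample is held by the fitting step, which is why this approach scales to images that are too large to cluster in one operation.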

rsgislib.classification.clustersklearn.img_pixel_tiled_cluster(input_img: str, output_img: str, gdalformat: str = 'KEA', no_data_val: float = 0, clusterer: BaseEstimator = MiniBatchKMeans(batch_size=100, n_clusters=60), calc_stats: bool = True, use_mean_shift_est_band_width: bool = False, tile_x_size: int = 200, tile_y_size: int = 200)

A function which allows a clustering to be performed using the algorithms available within the scikit-learn library. The clusterer is applied to a single tile at a time and therefore produces tile boundaries in the result. However, memory usage is kept under control, which may not be the case when processing a whole image in one operation.

Parameters:
  • input_img – input image file.

  • output_img – output image file.

  • gdalformat – output image file format.

  • no_data_val – no data value associated with the input image.

  • clusterer – clusterer from scikit-learn which must have a predict function.

  • calc_stats – calculate image pixel statistics, histogram and image pyramids - note if you are not using a KEA file then the format needs to support RATs for this option as histogram and colour table are written to RAT.

  • use_mean_shift_est_band_width – use the mean-shift algorithm as the clusterer (pass None as the clusterer) where the bandwidth is calculated from the data itself.

  • tile_x_size – tile size in the x-axis in pixels.

  • tile_y_size – tile size in the y-axis in pixels.
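
To show why per-tile clustering leaves tile boundaries, here is a minimal sketch that clusters each tile independently using NumPy and scikit-learn. It is not RSGISLib's implementation: the synthetic image, the use of a fresh clone of the clusterer for each tile, and the offset of 1 that keeps 0 free for no data are all assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for an image: 3 bands, 100 rows, 150 columns (assumption).
rng = np.random.default_rng(0)
img = rng.random((3, 100, 150)).astype(np.float32)

tile_x_size, tile_y_size = 50, 50
base = MiniBatchKMeans(n_clusters=4, batch_size=100, n_init=3, random_state=0)
out = np.zeros(img.shape[1:], dtype=np.int32)

for y0 in range(0, img.shape[1], tile_y_size):
    for x0 in range(0, img.shape[2], tile_x_size):
        tile = img[:, y0:y0 + tile_y_size, x0:x0 + tile_x_size]
        n_bands, ny, nx = tile.shape

        # Flatten the tile to (n_pixels, n_bands) and cluster it on its own.
        data = tile.reshape(n_bands, -1).T
        labels = clone(base).fit_predict(data)

        # Offset by 1 so that 0 remains available as a no data value (assumption).
        out[y0:y0 + ny, x0:x0 + nx] = labels.reshape(ny, nx) + 1

print(out.shape, out.min(), out.max())
```

Because each tile is fitted separately, cluster 1 in one tile need not correspond to cluster 1 in its neighbour, which is what appears as boundaries in the output.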

rsgislib.classification.clustersklearn.img_pixel_cluster(input_img: str, output_img: str, gdalformat: str = 'KEA', no_data_val: float = 0, clusterer: BaseEstimator = MiniBatchKMeans(batch_size=100, n_clusters=60), calc_stats: bool = True, use_mean_shift_est_band_width: bool = False)

A function which allows a clustering to be performed using the algorithms available within the scikit-learn library. The clusterer is applied to the whole image in one operation and therefore requires the whole image to be loaded into memory. However, if there is sufficient memory, all the clustering algorithms within scikit-learn can be applied without boundary artifacts.

Parameters:
  • input_img – input image file.

  • output_img – output image file.

  • gdalformat – output image file format.

  • no_data_val – no data value associated with the input image.

  • clusterer – clusterer from scikit-learn which must have a predict function.

  • calc_stats – calculate image pixel statistics, histogram and image pyramids - note if you are not using a KEA file then the format needs to support RATs for this option as histogram and colour table are written to RAT.

  • use_mean_shift_est_band_width – use the mean-shift algorithm as the clusterer (pass None as the clusterer) where the bandwidth is calculated from the data itself.
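
The use_mean_shift_est_band_width option refers to a mean-shift clusterer whose bandwidth is derived from the data. A minimal sketch of that idea using scikit-learn's estimate_bandwidth and MeanShift is given below. The synthetic data and the quantile and n_samples values are assumptions chosen for illustration; they are not taken from RSGISLib.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Synthetic pixel values: two well separated groups in 3 bands (assumption).
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 0.1, size=(150, 3))
group_b = rng.normal(5.0, 0.1, size=(150, 3))
data = np.vstack([group_a, group_b])

# Estimate the bandwidth from the data rather than fixing it by hand.
bandwidth = estimate_bandwidth(data, quantile=0.2, n_samples=200, random_state=0)

# Build the mean-shift clusterer with the estimated bandwidth and label the data.
clusterer = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = clusterer.fit_predict(data)

print(f"bandwidth={bandwidth:.3f}, clusters found={len(set(labels))}")
```

With mean-shift the number of clusters is not specified up front; it follows from the bandwidth, so a smaller bandwidth generally yields more clusters.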

RAT Clustering

rsgislib.classification.clustersklearn.cluster_sklearn_rat(clumps_img: str, variables: List[str], sk_clusterer: BaseEstimator = MiniBatchKMeans(batch_size=100, n_clusters=60), out_col: str = 'OutClass', roi_col: str = None, roi_val: int = 1, sub_sample: float = None)

A function which will apply a scikit-learn clustering (i.e., unsupervised classification) within a Raster Attribute Table (RAT).

Parameters:
  • clumps_img – is the clumps image on which the clustering is to be performed

  • variables – is an array of column names which are to be used for the clustering

  • sk_clusterer – an instance of a scikit-learn clustering algorithm

  • out_col – is the output column

  • roi_col – is a column name for a column which specifies the region to be clustered. If None, it is ignored (Default: None)

  • roi_val – is an int value used within the roi_col to select a region to be clustered (Default: 1)

  • sub_sample – Subsample the data for fitting the clusterer. Provide the proportion (0-1) to be used for the clustering.
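
One plausible reading of how roi_col, roi_val and sub_sample interact can be sketched with plain NumPy arrays standing in for RAT columns. This is an illustration only and not RSGISLib's implementation: the column names NDVIMean and NIRMean, the random ROI column, and the choice to write 0 for rows outside the region are all hypothetical.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(7)
n_clumps = 1000

# Hypothetical RAT columns, one value per clump.
columns = {"NDVIMean": rng.random(n_clumps), "NIRMean": rng.random(n_clumps)}
variables = ["NDVIMean", "NIRMean"]
roi = rng.integers(0, 2, n_clumps)  # hypothetical ROI column of 0s and 1s
roi_val = 1
sub_sample = 0.25

# Stack the chosen variables into (n_clumps, n_variables).
data = np.column_stack([columns[name] for name in variables])

# Restrict to the rows whose ROI column matches roi_val.
in_roi = roi == roi_val
roi_data = data[in_roi]

# Fit on a random proportion of the ROI rows.
n_fit = max(1, int(roi_data.shape[0] * sub_sample))
fit_idx = rng.choice(roi_data.shape[0], size=n_fit, replace=False)
clusterer = MiniBatchKMeans(n_clusters=6, batch_size=100, n_init=3, random_state=0)
clusterer.fit(roi_data[fit_idx])

# Label every ROI row; rows outside the ROI keep 0 (assumption).
out_class = np.zeros(n_clumps, dtype=np.int32)
out_class[in_roi] = clusterer.predict(roi_data) + 1

print(n_fit, np.unique(out_class))
```

Fitting on a proportion of the rows only helps with clusterers that provide predict, since the remaining rows still have to be labelled afterwards.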