RSGISLib LightGBM Pixel Classification Module

LightGBM (https://lightgbm.readthedocs.io) is an alternative library to scikit-learn that provides a specialised implementation of Gradient Boosted Decision Trees (GBDT). It also implements random forests, Dropouts meet Multiple Additive Regression Trees (DART), and Gradient-based One-Side Sampling (GOSS).

When considering ensemble learning, there are two primary methods: bagging and boosting. Bagging involves the training of many independent models and combines their predictions through some form of aggregation (averaging, voting etc.). An example of a bagging ensemble is a Random Forest.
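
As a minimal sketch of the aggregation step in bagging, the example below combines the outputs of several independently trained models for a single pixel by majority voting and by averaging. The model outputs are hypothetical values chosen for illustration; no real classifier is trained here.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the ensemble members' predictions."""
    return Counter(predictions).most_common(1)[0][0]


def average(scores):
    """Return the mean of the ensemble members' scores."""
    return sum(scores) / len(scores)


# Hypothetical outputs of five independently trained models for one pixel.
votes = ["mangrove", "other", "mangrove", "mangrove", "other"]
scores = [0.9, 0.4, 0.7, 0.8, 0.3]

pixel_class = majority_vote(votes)
pixel_score = average(scores)
print(pixel_class, round(pixel_score, 2))
```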

Boosting instead trains models sequentially, where each model learns from the errors of the previous models. Starting with a weak base model, models are trained iteratively, each adding to the prediction of the previous models to produce a strong overall prediction. In the case of gradient boosted decision trees, each successive tree is fitted to the negative gradient of the loss function with respect to the predictions of the current ensemble (for a squared error loss these are simply the residuals), so that adding the new tree is a step of gradient descent on the loss.
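
The sequential fitting can be sketched in a few lines of plain Python. This is an illustration only, assuming a squared error loss (where the negative gradient equals the residual), single-split trees (stumps) and one input variable; it is not how LightGBM is implemented, and the data values are made up.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # Only one distinct x value: fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Build an additive model: a constant base plus a sequence of stumps."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error loss the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each new stump adds a small correction to the running prediction.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


def mean_squared_error(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up one-dimensional data with a step between x=3 and x=4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

model = boost(xs, ys)
baseline_mse = mean_squared_error(lambda x: sum(ys) / len(ys), xs, ys)
boosted_mse = mean_squared_error(model, xs, ys)
print(baseline_mse, boosted_mse)
```

The learning rate shrinks each correction, which is why many rounds are normally needed and why the number of iterations and the learning rate are usually tuned together.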

See also

For an easy to follow and understandable background to LightGBM see this blog post

For an academic paper see: Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

LightGBM is a binary classifier (i.e., it separates two classes, e.g., mangroves and other) but it has a multi-class mode which applies a number of binary classifications to produce a multi-class classification result.
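
The general idea of building a multi-class result from binary classifications can be sketched as a one-vs-rest scheme: one binary target per class, with the final class taken from the binary classifier giving the highest score. This is an assumption made for illustration; how LightGBM combines the classes internally is not described here, and the scores below are hypothetical.

```python
def one_vs_rest_targets(labels, classes):
    """For each class, build a binary target list: 1 for that class, 0 for the rest."""
    return {cls: [1 if label == cls else 0 for label in labels] for cls in classes}


def combine_binary_scores(scores_by_class):
    """Pick the class whose binary classifier gave the highest score."""
    return max(scores_by_class, key=scores_by_class.get)


labels = ["Mangroves", "Other", "Water", "Mangroves"]
classes = ["Mangroves", "Other", "Water"]
targets = one_vs_rest_targets(labels, classes)

# Hypothetical scores from three binary classifiers for a single pixel.
pixel_scores = {"Mangroves": 0.81, "Other": 0.35, "Water": 0.10}
pixel_class = combine_binary_scores(pixel_scores)
print(targets["Mangroves"], pixel_class)
```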

Steps to apply a LightGBM classification:

  • Extract training samples

  • Split the samples: Training, Validation, Testing

  • Train Classifier and Optimise Hyperparameters

  • Apply Classifier

To define the training data you need either a raster with a unique value for each class or multiple binary rasters, one for each class. Commonly the training regions are defined using a vector layer, which requires rasterising:

import rsgislib.vectorutils

sen2_img = 'sen2_srefimg.kea'
mangroves_sample_vec_file = 'mangrove_cls_samples.geojson'
mangroves_sample_vec_lyr = 'mangrove_cls_samples'
mangroves_sample_img = 'mangrove_cls_samples.kea'
rsgislib.vectorutils.rasteriseVecLyr(mangroves_sample_vec_file, mangroves_sample_vec_lyr, sen2_img, mangroves_sample_img, gdalformat='KEA')

other_sample_vec_file = 'other_cls_samples.geojson'
other_sample_vec_lyr = 'other_cls_samples'
other_sample_img = 'other_cls_samples.kea'
rsgislib.vectorutils.rasteriseVecLyr(other_sample_vec_file, other_sample_vec_lyr, sen2_img, other_sample_img, gdalformat='KEA')

The image pixel values are extracted and stored within an HDF5 file (see https://portal.hdfgroup.org/display/HDF5/HDF5 for more information) using the following functions. To define the images and associated bands to be used for the classification, and therefore the values which need to be extracted, a list of rsgislib.imageutils.ImageBandInfo objects needs to be provided:

import rsgislib.imageutils

imgs_info = []
imgs_info.append(rsgislib.imageutils.ImageBandInfo(fileName='sen2_srefimg.kea', name='sen2', bands=[1,2,3,4,5,6,7,8,9,10]))
imgs_info.append(rsgislib.imageutils.ImageBandInfo(fileName='sen1_dBimg.kea', name='sen1', bands=[1,2]))

mangroves_sample_h5 = 'mangrove_cls_samples.h5'
rsgislib.imageutils.extractZoneImageBandValues2HDF(imgs_info, mangroves_sample_img, mangroves_sample_h5, 1)

other_sample_h5 = 'other_cls_samples.h5'
rsgislib.imageutils.extractZoneImageBandValues2HDF(imgs_info, other_sample_img, other_sample_h5, 1)
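
The image and band lists determine how many variables each sample has, and the same list has to be used again when the classifier is applied. The sketch below counts and names the variables for the two images above. It uses a namedtuple as a stand-in for rsgislib.imageutils.ImageBandInfo so that it runs without RSGISLib, and it assumes the variables are stacked in the order the images and bands are listed; the feature names are invented for illustration.

```python
from collections import namedtuple

# Stand-in for rsgislib.imageutils.ImageBandInfo (illustration only).
BandInfo = namedtuple("BandInfo", ["fileName", "name", "bands"])

imgs_info = [
    BandInfo(fileName="sen2_srefimg.kea", name="sen2", bands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
    BandInfo(fileName="sen1_dBimg.kea", name="sen1", bands=[1, 2]),
]


def feature_names(infos):
    """List one (hypothetical) feature name per band, in image then band order."""
    return [f"{info.name}_b{band}" for info in infos for band in info.bands]


features = feature_names(imgs_info)
print(len(features), features[0], features[-1])
```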

If training data is extracted from multiple input images then it will need to be merged using the following function:

rsgislib.imageutils.mergeExtractedHDF5Data(['mang_samples_1.h5', 'mang_samples_2.h5'], 'mangrove_cls_samples.h5')
rsgislib.imageutils.mergeExtractedHDF5Data(['other_samples_1.h5', 'other_samples_2.h5'], 'other_cls_samples.h5')

To split the extracted samples into training, validation and testing sets you can use the rsgislib.classification.split_sample_train_valid_test function:

import rsgislib.classification

mangroves_sample_h5_train = 'mangrove_cls_samples_train.h5'
mangroves_sample_h5_valid = 'mangrove_cls_samples_valid.h5'
mangroves_sample_h5_test = 'mangrove_cls_samples_test.h5'
rsgislib.classification.split_sample_train_valid_test(mangroves_sample_h5, mangroves_sample_h5_train, mangroves_sample_h5_valid, mangroves_sample_h5_test, test_sample=500, valid_sample=500, train_sample=2000)

other_sample_h5_train = 'other_cls_samples_train.h5'
other_sample_h5_valid = 'other_cls_samples_valid.h5'
other_sample_h5_test = 'other_cls_samples_test.h5'
rsgislib.classification.split_sample_train_valid_test(other_sample_h5, other_sample_h5_train, other_sample_h5_valid, other_sample_h5_test, test_sample=500, valid_sample=500, train_sample=2000)

Note

Training samples are used to train the classifier. Validation samples are used to test the accuracy of the classifier during the parameter optimisation process and are therefore part of the training process and not independent. Testing samples are completely independent of the training process and are used to test the overall accuracy of the classifier.
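
The key property of the three sets is that they do not share samples. The sketch below illustrates this with a random, non-overlapping split of a list of sample indices. It is a plain-Python illustration and not the RSGISLib implementation; the function name and seed are hypothetical.

```python
import random


def split_samples(samples, n_train, n_valid, n_test, seed=42):
    """Randomly split samples into disjoint training, validation and testing sets."""
    if n_train + n_valid + n_test > len(samples):
        raise ValueError("Not enough samples for the requested split")
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    # Take consecutive, non-overlapping slices of the shuffled list.
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:n_test + n_valid + n_train]
    return train, valid, test


# 3500 hypothetical samples, split using the same sizes as the example above.
samples = list(range(3500))
train, valid, test = split_samples(samples, n_train=2000, n_valid=500, n_test=500)
print(len(train), len(valid), len(test))
```

Any samples left over after the three sets have been filled are simply not used in this sketch.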

Apply a LightGBM Binary Classifier

To train a single binary classifier you need to use the following function:

import rsgislib.classification
import rsgislib.classification.classlightgbm

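out_mdl_file = 'model_file.txt'
# JSON file with the model parameters (the cls_params_file argument in the function signature); the file name is illustrative.
cls_params_file = 'model_params.json'
rsgislib.classification.classlightgbm.train_lightgbm_binary_classifer(out_mdl_file, cls_params_file, mangroves_sample_h5_train, mangroves_sample_h5_valid, mangroves_sample_h5_test, other_sample_h5_train, other_sample_h5_valid, other_sample_h5_test)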

To apply the binary classifier use the following function:

img_mask = 'mangrove_habitat_img.kea'
out_prob_img = 'mangrove_prob_img.kea'
out_cls_img = 'mangrove_cls_img.kea'
rsgislib.classification.classlightgbm.apply_lightgbm_binary_classifier(out_mdl_file, img_mask, 1, imgs_info, out_prob_img, 'KEA', out_cls_img, class_thres=5000)

Note

Class probability values are multiplied by 10,000 so a threshold of 5000 is really 0.5.
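
The scaling and thresholding can be sketched as follows. The scale factor of 10,000 and the default threshold of 5000 come from the text above; whether the comparison is strict (greater than) or inclusive is an assumption, and the score values are hypothetical.

```python
SCALE = 10000


def scale_score(score):
    """Convert a score in the range 0-1 to the scaled integer stored in the image."""
    return int(round(score * SCALE))


def hard_class(scaled_score, class_thres=5000):
    """Return 1 where the scaled score exceeds the threshold, otherwise 0.

    Assumption: a strict 'greater than' comparison is used.
    """
    return 1 if scaled_score > class_thres else 0


# Hypothetical scores for three pixels.
scores = [0.12, 0.5, 0.73]
scaled = [scale_score(s) for s in scores]
classes = [hard_class(v) for v in scaled]
print(scaled, classes)
```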

Apply a LightGBM Multi-Class Classifier

To train a multi-class classifier you need to use the following function:

import rsgislib.classification
import rsgislib.classification.classlightgbm

out_mdl_file = 'model_file.txt'
clsinfodict = dict()
clsinfodict['Mangroves'] = rsgislib.classification.ClassInfoObj(id=0, out_id=1, trainfileH5=mangroves_sample_h5_train, testfileH5=mangroves_sample_h5_test, validfileH5=mangroves_sample_h5_valid, red=0, green=255, blue=0)
clsinfodict['Other'] = rsgislib.classification.ClassInfoObj(id=1, out_id=2, trainfileH5=other_sample_h5_train, testfileH5=other_sample_h5_test, validfileH5=other_sample_h5_valid, red=100, green=100, blue=100)
# Note. Water samples not shown above but would be extracted and generated using the same functions.
clsinfodict['Water'] = rsgislib.classification.ClassInfoObj(id=2, out_id=3, trainfileH5=water_sample_h5_train, testfileH5=water_sample_h5_test, validfileH5=water_sample_h5_valid, red=0, green=0, blue=255)

rsgislib.classification.classlightgbm.train_lightgbm_multiclass_classifer(out_mdl_file, clsinfodict)

To apply the multi-class classifier use the following function:

img_mask = 'mangrove_habitat_img.kea'
out_cls_img = 'class_out_img.kea'

rsgislib.classification.classlightgbm.apply_lightgbm_multiclass_classifier(clsinfodict, out_mdl_file, img_mask, 1, imgs_info, out_cls_img, 'KEA')

Note

Within the rsgislib.classification.ClassInfoObj class you need to provide an id and out_id value. The id must start from zero and be consecutive while the out_id will be used as the pixel value for the output classification image and can be any integer value.
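
The relationship between id and out_id can be sketched with plain tuples standing in for ClassInfoObj objects. The check and lookup functions below are hypothetical helpers for illustration and are not part of RSGISLib.

```python
# (id, out_id) pairs standing in for ClassInfoObj objects, keyed by class name.
class_ids = {"Mangroves": (0, 1), "Other": (1, 2), "Water": (2, 3)}


def ids_are_valid(class_ids):
    """Check that the id values start at zero and are consecutive."""
    ids = sorted(cls_id for cls_id, _ in class_ids.values())
    return ids == list(range(len(ids)))


def build_out_id_lookup(class_ids):
    """Map the classifier's internal id to the pixel value written to the output image."""
    return {cls_id: out_id for cls_id, out_id in class_ids.values()}


lookup = build_out_id_lookup(class_ids)
print(ids_are_valid(class_ids), lookup)
```

Keeping the two values separate means the output pixel values can follow an existing legend while the classifier still sees labels numbered from zero.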

Training Functions

rsgislib.classification.classlightgbm.train_lightgbm_binary_classifer(out_mdl_file, cls_params_file, cls1_train_file, cls1_valid_file, cls1_test_file, cls2_train_file, cls2_valid_file, cls2_test_file, unbalanced=False, nthread=2, scale_pos_weight=None, early_stopping_rounds=100, num_iterations=5000, num_boost_round=100, learning_rate=0.05, mdl_cls_obj=None)

A function which performs a Bayesian optimisation of the hyper-parameters for a binary LightGBM classifier. Class 1 is the class which you are interested in and Class 2 is the ‘other class’.

This function requires the lightgbm and skopt modules to be installed.

Parameters
  • out_mdl_file – The output model which can be loaded to perform a classification.

  • cls_params_file – A JSON file with the model parameters

  • cls1_train_file – Training samples HDF5 file for the primary class (i.e., the one being classified)

  • cls1_valid_file – Validation samples HDF5 file for the primary class (i.e., the one being classified)

  • cls1_test_file – Testing samples HDF5 file for the primary class (i.e., the one being classified)

  • cls2_train_file – Training samples HDF5 file for the ‘other’ class

  • cls2_valid_file – Validation samples HDF5 file for the ‘other’ class

  • cls2_test_file – Testing samples HDF5 file for the ‘other’ class

  • unbalanced – Specify that the training data is unbalanced (i.e., a different number of samples per class) and LightGBM will try to take this into account during training.

  • nthread – The number of threads to use for the training.

  • scale_pos_weight – Optional, default is None. If None then a value will automatically be calculated. Parameter used to balance imbalanced training data.
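
How the value is calculated automatically is not stated here. A commonly used heuristic for this kind of weight is the ratio of negative to positive training samples, which the sketch below computes. Treat it as an assumption for illustration rather than a description of what RSGISLib does; the function name and sample counts are hypothetical.

```python
def estimate_scale_pos_weight(n_positive, n_negative):
    """Ratio of negative to positive samples (a common heuristic, assumed here)."""
    if n_positive <= 0:
        raise ValueError("At least one positive sample is required")
    return n_negative / n_positive


# Hypothetical counts: 500 samples of the class of interest, 2000 'other' samples.
weight = estimate_scale_pos_weight(n_positive=500, n_negative=2000)
print(weight)
```

With this heuristic a balanced training set gives a weight of 1, and the weight grows as the class of interest becomes rarer.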

rsgislib.classification.classlightgbm.train_lightgbm_multiclass_classifer(out_mdl_file, clsinfodict, out_info_file=None, unbalanced=False, nthread=2, early_stopping_rounds=100, num_iterations=5000, num_boost_round=100, learning_rate=0.05, mdl_cls_obj=None)

A function which performs a Bayesian optimisation of the hyper-parameters for a multi-class LightGBM classifier. A dict of class information, as ClassInfoObj objects, is defined with the training data.

This function requires the lightgbm and skopt modules to be installed.

Parameters
  • out_mdl_file – The output model which can be loaded to perform a classification.

  • clsinfodict – dict (key is string with class name) of ClassInfoObj objects defining the training data.

  • out_info_file – An optional output JSON file with information about the classifier which has been created.

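  • unbalanced – Specify that the training data is unbalanced (i.e., a different number of samples per class) and LightGBM will try to take this into account during training.

  • nthread – The number of threads to use for the training.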

  • scale_pos_weight

Classify Functions

rsgislib.classification.classlightgbm.apply_lightgbm_binary_classifier(model_file, imgMask, imgMaskVal, imgFileInfo, outScoreImg, gdalformat, outClassImg=None, class_thres=5000)

This function applies a trained binary (i.e., two classes) lightgbm model. The function train_lightgbm_binary_classifer can be used to train such a model. The output image will contain the softmax score for the class of interest. You will need to threshold this image to get a final hard classification. Alternatively, a hard class output image can be produced by providing outClassImg and a threshold (class_thres), which is applied to the score image. Note that the softmax score is not a probability.

Parameters
  • model_file – a trained lightgbm binary model which can be loaded with lgb.Booster(model_file=model_file).

  • imgMask – an image file providing a mask to specify where the classification should be applied. The simplest mask is all the valid data regions (rsgislib.imageutils.genValidMask).

  • imgMaskVal – the pixel value within the imgMask to limit the region to which the classification is applied. Can be used to create a hierarchical classification.

  • imgFileInfo – a list of rsgislib.imageutils.ImageBandInfo objects (also used within rsgislib.imageutils.extractZoneImageBandValues2HDF) to identify which images and bands are to be used for the classification so it adheres to the training data.

  • outScoreImg – output image file with the classification softmax score - this image is scaled by multiplying by 10000.

  • gdalformat – the output image format; any format supported by GDAL can be used.

  • outClassImg – Optional output image which will contain the hard classification, defined with a threshold on the score image (outScoreImg).

  • class_thres – The threshold used to define the hard classification. Default is 5000 (i.e., a score of 0.5).

rsgislib.classification.classlightgbm.apply_lightgbm_multiclass_classifier(classTrainInfo, model_file, imgMask, imgMaskVal, imgFileInfo, outClassImg, gdalformat, classClrNames=True)

This function applies a trained multi-class lightgbm model. The function train_lightgbm_multiclass_classifer can be used to train such a model. The output image will be a final hard classification using the class with the maximum softmax score.

Parameters
  • classTrainInfo – dict (where the key is the class name) of rsgislib.classification.ClassInfoObj objects which were used to train the classifier (i.e., with train_lightgbm_multiclass_classifer()). These provide the output pixel value and the RGB colour for each class.

  • model_file – a trained lightgbm multiclass model which can be loaded with lgb.Booster(model_file=model_file).

  • imgMask – an image file providing a mask to specify where the classification should be applied. The simplest mask is all the valid data regions (rsgislib.imageutils.genValidMask).

  • imgMaskVal – the pixel value within the imgMask to limit the region to which the classification is applied. Can be used to create a hierarchical classification.

  • imgFileInfo – a list of rsgislib.imageutils.ImageBandInfo objects (also used within rsgislib.imageutils.extractZoneImageBandValues2HDF) to identify which images and bands are to be used for the classification so it adheres to the training data.

  • outClassImg – Output image which will contain the hard classification, defined as the class with the maximum softmax score.

  • gdalformat – the output image format; any format supported by GDAL can be used.

  • classClrNames – default is True, in which case a colour table with the colours specified in ClassInfoObj and a ClassName column (from classTrainInfo) will be added to the output file.
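
The hard classification described above, where each pixel takes the class with the maximum softmax score, can be sketched for a single pixel as follows. The raw per-class values are hypothetical and the functions are plain-Python illustrations rather than RSGISLib or LightGBM calls.

```python
import math


def softmax(raw_values):
    """Convert raw per-class values to softmax scores which sum to one."""
    largest = max(raw_values)
    # Subtracting the maximum first keeps the exponentials numerically stable.
    exps = [math.exp(v - largest) for v in raw_values]
    total = sum(exps)
    return [e / total for e in exps]


def max_score_class(scores):
    """Return the index (class id) of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical raw model outputs for one pixel, ordered by class id (0, 1, 2).
raw = [2.0, 0.5, -1.0]
scores = softmax(raw)
best_id = max_score_class(scores)
print(best_id, [round(s, 3) for s in scores])
```

The resulting class id would then be translated to the out_id of the matching ClassInfoObj to give the pixel value in the output image.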