RSGISLib XGBoost Classification

Training Functions

rsgislib.classification.classxgboost.optimise_xgboost_binary_classifier(out_params_file, cls1_train_file, cls1_valid_file, cls2_train_file, cls2_valid_file, n_threads=1, scale_pos_weight=None, mdl_cls_obj=None)

A function which performs a Bayesian optimisation of the hyper-parameters for a binary xgboost classifier. Class 1 is the class which you are interested in and Class 2 is the ‘other class’.

This function requires the xgboost and skopt modules to be installed.

Parameters
  • out_params_file – The output model parameters which have been optimised.

  • cls1_train_file – Training samples HDF5 file for the primary class (i.e., the one being classified)

  • cls1_valid_file – Validation samples HDF5 file for the primary class (i.e., the one being classified)

  • cls2_train_file – Training samples HDF5 file for the ‘other’ class

  • cls2_valid_file – Validation samples HDF5 file for the ‘other’ class

  • n_threads – The number of threads to use for the training.

  • scale_pos_weight – Optional, default is None. If None then a value will automatically be calculated. Parameter used to balance imbalanced training data.

  • mdl_cls_obj – An existing XGBoost object, allowing training to continue with a new dataset.
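
How the automatic scale_pos_weight value is calculated is not stated here. As a minimal sketch, the heuristic suggested in the XGBoost documentation is the ratio of negative (‘other’ class) to positive (primary class) samples. The helper name and the sample counts below are hypothetical, and this is not necessarily the calculation RSGISLib performs.

```python
def estimate_scale_pos_weight(n_cls1_samples, n_cls2_samples):
    """Hypothetical helper: ratio of 'other' (class 2) to primary (class 1) samples."""
    if n_cls1_samples <= 0:
        raise ValueError("The primary class needs at least one training sample.")
    return n_cls2_samples / n_cls1_samples


# Illustrative counts only: 500 primary class samples and 4500 'other' samples.
weight = estimate_scale_pos_weight(500, 4500)
print(weight)
```

A value above 1 gives more weight to the primary class, which is the usual aim when it is the minority class.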

rsgislib.classification.classxgboost.train_xgboost_binary_classifier(out_mdl_file, cls_params_file, cls1_train_file, cls1_valid_file, cls1_test_file, cls2_train_file, cls2_valid_file, cls2_test_file, n_threads=1, mdl_cls_obj=None)

A function which trains a binary xgboost classifier using the model parameters provided in a JSON file (cls_params_file). Class 1 is the class which you are interested in and Class 2 is the ‘other class’.

This function requires the xgboost and skopt modules to be installed.

Parameters
  • out_mdl_file – The output model which can be loaded to perform a classification.

  • cls_params_file – A JSON file with the model parameters

  • cls1_train_file – Training samples HDF5 file for the primary class (i.e., the one being classified)

  • cls1_valid_file – Validation samples HDF5 file for the primary class (i.e., the one being classified)

  • cls1_test_file – Testing samples HDF5 file for the primary class (i.e., the one being classified)

  • cls2_train_file – Training samples HDF5 file for the ‘other’ class

  • cls2_valid_file – Validation samples HDF5 file for the ‘other’ class

  • cls2_test_file – Testing samples HDF5 file for the ‘other’ class

  • n_threads – The number of threads to use for the training.

  • mdl_cls_obj – An existing XGBoost object, allowing training to continue with a new dataset.
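
A two-step workflow (optimise, then train) could be sketched as follows, using only the signatures listed on this page. The helper names and HDF5 file names are hypothetical, and it is assumed that the file written to out_params_file is the JSON file expected as cls_params_file.

```python
import os


def binary_sample_files(base_dir):
    """Hypothetical helper: build the six HDF5 sample file paths (names are made up)."""
    files = {}
    for cls in ("cls1", "cls2"):
        for split in ("train", "valid", "test"):
            files[f"{cls}_{split}_file"] = os.path.join(base_dir, f"{cls}_{split}.h5")
    return files


def optimise_then_train(base_dir, out_mdl_file, params_file, n_threads=1):
    """Optimise the hyper-parameters, then train a model using them."""
    # Imported here so the helper above can be used without rsgislib installed.
    from rsgislib.classification import classxgboost

    files = binary_sample_files(base_dir)
    # Step 1: the optimisation takes only the training and validation samples.
    classxgboost.optimise_xgboost_binary_classifier(
        params_file,
        files["cls1_train_file"],
        files["cls1_valid_file"],
        files["cls2_train_file"],
        files["cls2_valid_file"],
        n_threads=n_threads,
    )
    # Step 2: train with the optimised parameters; the test samples are also passed.
    classxgboost.train_xgboost_binary_classifier(
        out_mdl_file, params_file, n_threads=n_threads, **files
    )
```

The dictionary keys deliberately match the keyword names in the train_xgboost_binary_classifier signature, so they can be unpacked directly.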

rsgislib.classification.classxgboost.train_opt_xgboost_binary_classifier(out_mdl_file, cls1_train_file, cls1_valid_file, cls1_test_file, cls2_train_file, cls2_valid_file, cls2_test_file, n_threads=1, scale_pos_weight=None, mdl_cls_obj=None, out_params_file=None)

A function which performs a Bayesian optimisation of the hyper-parameters for a binary xgboost classifier and then trains a full model, which is written to out_mdl_file. Class 1 is the class which you are interested in and Class 2 is the ‘other class’.

This function requires the xgboost and skopt modules to be installed.

Parameters
  • out_mdl_file – The output model which can be loaded to perform a classification.

  • cls1_train_file – Training samples HDF5 file for the primary class (i.e., the one being classified)

  • cls1_valid_file – Validation samples HDF5 file for the primary class (i.e., the one being classified)

  • cls1_test_file – Testing samples HDF5 file for the primary class (i.e., the one being classified)

  • cls2_train_file – Training samples HDF5 file for the ‘other’ class

  • cls2_valid_file – Validation samples HDF5 file for the ‘other’ class

  • cls2_test_file – Testing samples HDF5 file for the ‘other’ class

  • n_threads – The number of threads to use for the training.

  • scale_pos_weight – Optional, default is None. If None then a value will automatically be calculated. Parameter used to balance imbalanced training data.

  • mdl_cls_obj – An existing XGBoost object, allowing training to continue with a new dataset.

  • out_params_file – The output model parameters which have been optimised. If None then no file will be outputted.

rsgislib.classification.classxgboost.optimise_xgboost_multiclass_classifier(out_params_file, cls_info_dict, n_threads=1, mdl_cls_obj=None, sub_train_smpls=None, rnd_seed=42)

A function which performs a Bayesian optimisation of the hyper-parameters for a multiclass xgboost classifier. A dict of class information, as ClassInfoObj objects, is defined with the training and validation data. Note that the training data passed to this function may be a smaller subset of the whole training dataset to speed up processing.

This function requires the xgboost and skopt modules to be installed.

Parameters
  • out_params_file – The output model parameters which have been optimised.

  • cls_info_dict – dict (key is string with class name) of ClassInfoObj objects defining the training and validation data.

  • n_threads – The number of threads to use to train the classifier.

  • sub_train_smpls – Subset the training, if None or 0 then no sub-setting will occur. If between 0-1 then a ratio subset (e.g., 0.25 = 25 % subset) will be taken. If > 1 then that number of points will be taken per class.

  • rnd_seed – the seed for the random selection of the training data.
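
The sub_train_smpls rules can be illustrated with a small pure-Python sketch. The function name is hypothetical, the treatment of a value of exactly 1 (taken here as a 100 % ratio) and the cap at the number of available samples are assumptions, and the actual sampling inside RSGISLib may differ.

```python
import random


def subset_indices(n_samples, sub_train_smpls=None, rnd_seed=42):
    """Return the sample indices kept for one class under the rules described above."""
    if not sub_train_smpls:
        # None or 0: no sub-setting, keep every sample.
        return list(range(n_samples))
    if sub_train_smpls <= 1:
        # Ratio subset, e.g. 0.25 keeps 25 % of the samples.
        n_keep = int(n_samples * sub_train_smpls)
    else:
        # Absolute number of samples per class, capped at what is available.
        n_keep = min(int(sub_train_smpls), n_samples)
    rng = random.Random(rnd_seed)  # seeded so the selection is repeatable
    return sorted(rng.sample(range(n_samples), n_keep))


quarter = subset_indices(1000, sub_train_smpls=0.25)
fixed = subset_indices(1000, sub_train_smpls=200)
print(len(quarter), len(fixed))
```

Using a fixed rnd_seed means the same subset is drawn on each run, which keeps optimisation results comparable.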

rsgislib.classification.classxgboost.train_xgboost_multiclass_classifier(out_mdl_file, cls_params_file, cls_info_dict, n_threads=1, mdl_cls_obj=None)

A function which trains a multiclass xgboost classifier using the model parameters provided in a JSON file (cls_params_file), producing a full trained model at the end. A dict of class information, as ClassInfoObj objects, is defined with the training data.

This function requires the xgboost module to be installed.

Parameters
  • out_mdl_file – The output model which can be loaded to perform a classification.

  • cls_params_file – A JSON file with the model parameters

  • cls_info_dict – dict (key is string with class name) of ClassInfoObj objects defining the training data.

  • n_threads – The number of threads to use to train the classifier.

rsgislib.classification.classxgboost.train_opt_xgboost_multiclass_classifier(out_mdl_file, cls_info_dict, n_threads=1, mdl_cls_obj=None)

A function which performs a Bayesian optimisation of the hyper-parameters for a multiclass xgboost classifier, producing a full trained model at the end. A dict of class information, as ClassInfoObj objects, is defined with the training data.

This function requires the xgboost and skopt modules to be installed.

Parameters
  • out_mdl_file – The output model which can be loaded to perform a classification.

  • cls_info_dict – dict (key is string with class name) of ClassInfoObj objects defining the training data.

  • n_threads – The number of threads to use to train the classifier.

Classify Functions

rsgislib.classification.classxgboost.apply_xgboost_binary_classifier(model_file, in_msk_img, img_mask_val, img_file_info, out_prob_img, gdalformat, out_class_img=None, class_thres=5000, n_threads=1)

This function applies a trained binary (i.e., two class) xgboost model. The function train_xgboost_binary_classifier can be used to train such a model. The output image will contain the probability of membership to the class of interest. You will need to threshold this image to get a final hard classification. Alternatively, provide out_class_img and class_thres to have the threshold applied and a hard classification image written as well.

Parameters
  • model_file – a trained xgboost binary model which can be loaded with xgb.Booster(model_file=model_file).

  • in_msk_img – an image file providing a mask to specify where the classification should be applied. The simplest mask is all the valid data regions (rsgislib.imageutils.gen_valid_mask)

  • img_mask_val – the pixel value within in_msk_img which limits the region to which the classification is applied. Can be used to create a hierarchical classification.

  • img_file_info – a list of rsgislib.imageutils.ImageBandInfo objects (also used within rsgislib.zonalstats.extract_zone_img_band_values_to_hdf) to identify which images and bands are to be used for the classification so it adheres to the training data.

  • out_prob_img – output image file with the classification probabilities - this image is scaled by multiplying by 10000.

  • gdalformat – the output image format; any format supported by GDAL can be used.

  • out_class_img – Optional output image which will contain the hard classification, defined with a threshold on the probability image.

  • class_thres – The threshold used to define the hard classification. Default is 5000 (i.e., probability of 0.5).

  • n_threads – The number of threads to use for the classifier.
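
The relationship between the scaled probability image and class_thres can be sketched in plain Python. The function names are hypothetical, and the output coding (1 for the class of interest, 0 otherwise) and the strict ‘greater than’ comparison are assumptions rather than documented behaviour.

```python
def scale_probability(prob, scale=10000):
    """Scale a 0-1 probability to the integer range stored in out_prob_img."""
    return int(round(prob * scale))


def hard_class(scaled_prob, class_thres=5000):
    """Return 1 where the scaled probability exceeds the threshold, else 0 (assumed coding)."""
    return 1 if scaled_prob > class_thres else 0


# Illustrative probabilities for four pixels.
probs = [0.12, 0.49, 0.51, 0.97]
scaled = [scale_probability(p) for p in probs]
classes = [hard_class(s) for s in scaled]
print(scaled, classes)
```

Because the image is scaled by 10000, a threshold on a probability p corresponds to class_thres = p * 10000, so the default of 5000 corresponds to 0.5.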

rsgislib.classification.classxgboost.apply_xgboost_multiclass_classifier(class_train_info, model_file, in_mask_img, img_mask_val, img_file_info, out_class_img, gdalformat, class_clr_names=True, n_threads=1)

This function applies a trained multiclass xgboost model. The function train_xgboost_multiclass_classifier can be used to train such a model. The output image will contain a hard classification, where each pixel is assigned the class with the maximum probability.

Parameters
  • class_train_info – dict (where the key is the class name) of rsgislib.classification.ClassInfoObj objects which were used to train the classifier (i.e., with train_xgboost_multiclass_classifier()); they provide the pixel value id and RGB colour for each class.

  • model_file – a trained xgboost multiclass model which can be loaded with xgb.Booster(model_file=model_file).

  • in_mask_img – an image file providing a mask to specify where the classification should be applied. The simplest mask is all the valid data regions (rsgislib.imageutils.gen_valid_mask)

  • img_mask_val – the pixel value within in_mask_img which limits the region to which the classification is applied. Can be used to create a hierarchical classification.

  • img_file_info – a list of rsgislib.imageutils.ImageBandInfo objects (also used within rsgislib.zonalstats.extract_zone_img_band_values_to_hdf) to identify which images and bands are to be used for the classification so it adheres to the training data.

  • out_class_img – Output image which will contain the hard classification defined as the maximum probability.

  • gdalformat – the output image format; any format supported by GDAL can be used.

  • class_clr_names – default is True, in which case a colour table with the colours specified in ClassInfoObj and a class_names column (from class_train_info) will be added to the output file.

  • n_threads – The number of threads to use for the classifier.
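
The ‘maximum probability’ rule for the hard classification can be sketched as follows. The function name, the probabilities and the output ids are illustrative assumptions; in practice the per-class probabilities come from the model and the ids from the ClassInfoObj objects.

```python
def max_probability_class(class_probs, class_out_ids):
    """Return the output id of the class with the highest probability."""
    best_idx = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return class_out_ids[best_idx]


# Illustrative per-class probabilities for three pixels and three classes.
pixel_probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.25, 0.5, 0.25]]
out_ids = [1, 2, 3]
classified = [max_probability_class(p, out_ids) for p in pixel_probs]
print(classified)
```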

rsgislib.classification.classxgboost.apply_xgboost_multiclass_classifier_rat(clumps_img, variables, model_file, class_train_info, out_col_int='OutClass', out_col_str='OutClassName', roi_col=None, roi_val=1, class_colours=True, nthread=1)

A function which will apply an XGBoost model within a Raster Attribute Table (RAT).

Parameters
  • clumps_img – is the clumps image on which the classification is to be performed

  • variables – is an array of column names which are to be used for the classification

  • class_train_info – dict (where the key is the class name) of rsgislib.classification.ClassInfoObj objects which were used to train the classifier (i.e., with train_xgboost_multiclass_classifier()); they provide the pixel value id and RGB colour for each class.

  • model_file – a trained xgboost multiclass model which can be loaded with xgb.Booster(model_file=model_file).

  • out_col_int – is the output column name for the int class representation (Default: ‘OutClass’)

  • out_col_str – is the output column name for the class names column (Default: ‘OutClassName’)

  • roi_col – is a column name for a column which specifies the region to be classified. If None ignored (Default: None)

  • roi_val – is an int value used within the roi_col to select a region to be classified (Default: 1)

  • class_colours – is a boolean specifying whether the RAT colour table should be updated using the classification colours (default: True)

  • nthread – The number of threads to use for the classifier.