RSGISLib XGBoost Classification

XGBoost (https://xgboost.readthedocs.io) is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.

When considering ensemble learning, there are two primary methods: bagging and boosting. Bagging involves the training of many independent models and combines their predictions through some form of aggregation (averaging, voting etc.). An example of a bagging ensemble is a Random Forest.
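The aggregation step of bagging can be sketched in a few lines. This is only an illustration, not how a Random Forest is implemented: the helper names (bootstrap_sample, majority_vote) and the example predictions are made up for this sketch.

```python
import random
from collections import Counter


def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(samples) for _ in samples]


def majority_vote(predictions):
    """Aggregate the class labels predicted by independent models by voting."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)
training_ids = list(range(10))
# Each independent model would be trained on its own resampled set.
resampled = bootstrap_sample(training_ids, rng)

# Hypothetical predictions for one pixel from three independent models.
model_predictions = ["mangrove", "other", "mangrove"]
voted = majority_vote(model_predictions)
```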

Boosting instead trains models sequentially, where each model learns from the errors of the previous model. Starting with a weak base model, models are trained iteratively, each adding to the prediction of the previous model to produce a strong overall prediction. In the case of gradient boosted decision trees, successive models are found by applying gradient descent in the direction of the average gradient, calculated with respect to the error residuals of the loss function, of the leaf nodes of previous models.
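The sequential idea can be illustrated with a toy example. The sketch below is not XGBoost's algorithm (which also uses second-order gradient information and regularisation); it assumes a squared-error loss, for which the negative gradient is simply the residual, and uses one-split trees (stumps) on a single made-up variable. All function names and data are hypothetical.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_val, right_val = mean(left), mean(right)
        sse = sum((r - left_val) ** 2 for r in left)
        sse += sum((r - right_val) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_val, right_val)
    _, threshold, left_val, right_val = best
    return lambda x: left_val if x <= threshold else right_val


def boost(xs, ys, n_rounds=20, eta=0.3):
    """Sequentially add stumps, each fitted to the current residuals."""
    base = mean(ys)  # weak starting model: always predict the mean
    preds = [base] * len(xs)
    stumps = []
    mse_history = [mean([(y - p) ** 2 for y, p in zip(ys, preds)])]
    for _ in range(n_rounds):
        # For squared error the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each new model adds a small (eta-scaled) correction to the prediction.
        preds = [p + eta * stump(x) for x, p in zip(xs, preds)]
        mse_history.append(mean([(y - p) ** 2 for y, p in zip(ys, preds)]))
    return base, stumps, mse_history


# Made-up one-dimensional data with three rough levels.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1, 5.0, 5.1]
base, stumps, mse_history = boost(xs, ys)
```

The training error recorded in mse_history never increases from one round to the next, which is the sense in which each model "learns from the errors" of the previous ones.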

See also

For an easy to follow and understandable background to XGBoost see this blog post

For an academic paper see: Chen, T. & Guestrin, C., 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '16. New York, NY, USA: ACM, pp. 785–794. Available at: http://doi.acm.org/10.1145/2939672.2939785.

XGBoost is a binary classifier (i.e., it separates two classes, e.g., mangroves and other) but it has a multi-class mode which applies a number of binary classifications to produce a multi-class classification result.
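One way to picture the multi-class mode is as one raw score per class, normalised so the scores can be compared and the largest taken as the class. The sketch below assumes a softmax normalisation (as used by XGBoost's multi:softprob objective); the raw score values are invented for illustration and do not come from a trained model.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into values which sum to 1."""
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["Mangrove", "Other Terrestrial", "Water"]
raw_scores = [2.0, 0.5, -1.0]  # hypothetical per-class scores for one pixel

probs = softmax(raw_scores)
predicted = class_names[probs.index(max(probs))]
```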

Steps to apply an XGBoost classification:

  • Extract training

  • Split training: Training, Validation, Testing

  • Train Classifier and Optimise Hyperparameters

  • Apply Classifier

However, first we’ll create a couple of directories for our outputs and intermediary files:

import os

out_dir = "baseline_cls_xgb"
if not os.path.exists(out_dir):
    os.mkdir(out_dir)

tmp_dir = "tmp_xgb"
if not os.path.exists(tmp_dir):
    os.mkdir(tmp_dir)
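As an alternative, os.makedirs with exist_ok=True creates a directory only if it is missing, without the explicit existence check. The sketch below works inside a temporary directory (an assumption made purely so the example leaves nothing behind); in the tutorial you would pass out_dir and tmp_dir directly.

```python
import os
import tempfile

# Throwaway base directory, used only so this demo does not touch your project.
base_dir = tempfile.mkdtemp()

created_dirs = []
for dir_name in ("baseline_cls_xgb", "tmp_xgb"):
    dir_path = os.path.join(base_dir, dir_name)
    os.makedirs(dir_path, exist_ok=True)
    # Calling it a second time is harmless: no error if the directory exists.
    os.makedirs(dir_path, exist_ok=True)
    created_dirs.append(dir_path)
```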

We will also define the input file path and the list of ImageBandInfo objects, which specify which images and bands are used for the analysis:

import rsgislib.imageutils

input_img = "./LS5TM_19970716_vmsk_mclds_topshad_rad_srefdem_stdsref_subset.tif"

imgs_info = []
imgs_info.append(
    rsgislib.imageutils.ImageBandInfo(
        file_name=input_img, name="ls97", bands=[1, 2, 3, 4, 5, 6]
    )
)

When applying a classifier, a mask image needs to be provided in which a given pixel value specifies which pixels should be classified. While defining the input image we can also define that valid mask image using the rsgislib.imageutils.gen_valid_mask function, which simply creates a mask of pixels which are not ‘no data’:

vld_msk_img = os.path.join(out_dir, "LS5TM_19970716_vmsk.kea")
rsgislib.imageutils.gen_valid_mask(
    input_img, output_img=vld_msk_img, gdalformat="KEA", no_data_val=0.0
)

Training data are defined using either a single raster with a unique value for each class or multiple binary rasters, one for each class. Commonly the training regions are defined using a vector layer, which requires rasterising:

import rsgislib.vectorutils.createrasters

mangrove_vec_file = "./training/mangroves.geojson"
mangrove_vec_lyr = "mangroves"
mangrove_smpls_img = os.path.join(tmp_dir, "mangrove_smpls.kea")
rsgislib.vectorutils.createrasters.rasterise_vec_lyr(
    vec_file=mangrove_vec_file,
    vec_lyr=mangrove_vec_lyr,
    input_img=input_img,
    output_img=mangrove_smpls_img,
    gdalformat="KEA",
    burn_val=1,
)

other_terrestrial_vec_file = "./training/other_terrestrial.geojson"
other_terrestrial_vec_lyr = "other_terrestrial"
other_terrestrial_smpls_img = os.path.join(tmp_dir, "other_terrestrial_smpls.kea")
rsgislib.vectorutils.createrasters.rasterise_vec_lyr(
    vec_file=other_terrestrial_vec_file,
    vec_lyr=other_terrestrial_vec_lyr,
    input_img=input_img,
    output_img=other_terrestrial_smpls_img,
    gdalformat="KEA",
    burn_val=1,
)

water_vec_file = "./training/water.geojson"
water_vec_lyr = "water"
water_smpls_img = os.path.join(tmp_dir, "water_smpls.kea")
rsgislib.vectorutils.createrasters.rasterise_vec_lyr(
    vec_file=water_vec_file,
    vec_lyr=water_vec_lyr,
    input_img=input_img,
    output_img=water_smpls_img,
    gdalformat="KEA",
    burn_val=1,
)
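The three calls above differ only in the class name, so their arguments could be generated in a loop. The sketch below only builds the keyword arguments and makes no RSGISLib call; the classes mapping and the rasterise_args dictionary are names introduced for this illustration. Passing each dictionary to rasterise_vec_lyr with ** would be the assumed next step.

```python
import os

tmp_dir = "tmp_xgb"
input_img = "./LS5TM_19970716_vmsk_mclds_topshad_rad_srefdem_stdsref_subset.tif"

# Output file prefix -> vector layer name (the mangrove layer is plural).
classes = {
    "mangrove": "mangroves",
    "other_terrestrial": "other_terrestrial",
    "water": "water",
}

rasterise_args = {}
for out_prefix, lyr_name in classes.items():
    rasterise_args[out_prefix] = dict(
        vec_file=f"./training/{lyr_name}.geojson",
        vec_lyr=lyr_name,
        input_img=input_img,
        output_img=os.path.join(tmp_dir, f"{out_prefix}_smpls.kea"),
        gdalformat="KEA",
        burn_val=1,
    )

# Assumed usage, with RSGISLib installed:
# for kwargs in rasterise_args.values():
#     rsgislib.vectorutils.createrasters.rasterise_vec_lyr(**kwargs)
```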

The image pixel values are extracted using the following function and stored within an HDF5 file (see https://portal.hdfgroup.org/display/HDF5/HDF5 for more information). To define the images and associated bands used for the classification, and therefore the values which need to be extracted, a list of rsgislib.imageutils.ImageBandInfo objects needs to be provided:

import rsgislib.zonalstats

mangrove_all_smpls_h5_file = os.path.join(out_dir, "mangrove_all_smpls.h5")
rsgislib.zonalstats.extract_zone_img_band_values_to_hdf(
    imgs_info,
    in_msk_img=mangrove_smpls_img,
    out_h5_file=mangrove_all_smpls_h5_file,
    mask_val=1,
    datatype=rsgislib.TYPE_16UINT,
)

other_terrestrial_all_smpls_h5_file = os.path.join(
    out_dir, "other_terrestrial_all_smpls.h5"
)
rsgislib.zonalstats.extract_zone_img_band_values_to_hdf(
    imgs_info,
    in_msk_img=other_terrestrial_smpls_img,
    out_h5_file=other_terrestrial_all_smpls_h5_file,
    mask_val=1,
    datatype=rsgislib.TYPE_16UINT,
)

water_all_smpls_h5_file = os.path.join(out_dir, "water_all_smpls.h5")
rsgislib.zonalstats.extract_zone_img_band_values_to_hdf(
    imgs_info,
    in_msk_img=water_smpls_img,
    out_h5_file=water_all_smpls_h5_file,
    mask_val=1,
    datatype=rsgislib.TYPE_16UINT,
)

If training data is extracted from multiple input images then it will need to be merged using the following function. In this case we’ll merge the water and terrestrial samples and use the merged class to create a mangrove binary classifier:

other_all_smpls_h5_file = os.path.join(out_dir, "other_all_smpls.h5")
rsgislib.zonalstats.merge_extracted_hdf5_data(
    h5_files=[other_terrestrial_all_smpls_h5_file, water_all_smpls_h5_file],
    out_h5_file=other_all_smpls_h5_file,
    datatype=rsgislib.TYPE_16UINT,
)

To split the extracted samples into training, validation and testing sets you can use the rsgislib.classification.split_sample_train_valid_test function. Note, this function is also used to standardise the number of samples used to train the classifier so the training data are balanced:

import rsgislib.classification

mangrove_train_smpls_h5_file = os.path.join(out_dir, "mangrove_train_smpls.h5")
mangrove_valid_smpls_h5_file = os.path.join(out_dir, "mangrove_valid_smpls.h5")
mangrove_test_smpls_h5_file = os.path.join(out_dir, "mangrove_test_smpls.h5")
rsgislib.classification.split_sample_train_valid_test(
    in_h5_file=mangrove_all_smpls_h5_file,
    train_h5_file=mangrove_train_smpls_h5_file,
    valid_h5_file=mangrove_valid_smpls_h5_file,
    test_h5_file=mangrove_test_smpls_h5_file,
    test_sample=10000,
    valid_sample=10000,
    train_sample=35000,
    rnd_seed=42,
    datatype=rsgislib.TYPE_16UINT,
)


other_terrestrial_train_smpls_h5_file = os.path.join(
    out_dir, "other_terrestrial_train_smpls.h5"
)
other_terrestrial_valid_smpls_h5_file = os.path.join(
    out_dir, "other_terrestrial_valid_smpls.h5"
)
other_terrestrial_test_smpls_h5_file = os.path.join(
    out_dir, "other_terrestrial_test_smpls.h5"
)
rsgislib.classification.split_sample_train_valid_test(
    in_h5_file=other_terrestrial_all_smpls_h5_file,
    train_h5_file=other_terrestrial_train_smpls_h5_file,
    valid_h5_file=other_terrestrial_valid_smpls_h5_file,
    test_h5_file=other_terrestrial_test_smpls_h5_file,
    test_sample=10000,
    valid_sample=10000,
    train_sample=35000,
    rnd_seed=42,
    datatype=rsgislib.TYPE_16UINT,
)


water_train_smpls_h5_file = os.path.join(out_dir, "water_train_smpls.h5")
water_valid_smpls_h5_file = os.path.join(out_dir, "water_valid_smpls.h5")
water_test_smpls_h5_file = os.path.join(out_dir, "water_test_smpls.h5")
rsgislib.classification.split_sample_train_valid_test(
    in_h5_file=water_all_smpls_h5_file,
    train_h5_file=water_train_smpls_h5_file,
    valid_h5_file=water_valid_smpls_h5_file,
    test_h5_file=water_test_smpls_h5_file,
    test_sample=10000,
    valid_sample=10000,
    train_sample=35000,
    rnd_seed=42,
    datatype=rsgislib.TYPE_16UINT,
)


other_train_smpls_h5_file = os.path.join(out_dir, "other_train_smpls.h5")
other_valid_smpls_h5_file = os.path.join(out_dir, "other_valid_smpls.h5")
other_test_smpls_h5_file = os.path.join(out_dir, "other_test_smpls.h5")
rsgislib.classification.split_sample_train_valid_test(
    in_h5_file=other_all_smpls_h5_file,
    train_h5_file=other_train_smpls_h5_file,
    valid_h5_file=other_valid_smpls_h5_file,
    test_h5_file=other_test_smpls_h5_file,
    test_sample=10000,
    valid_sample=10000,
    train_sample=35000,
    rnd_seed=42,
    datatype=rsgislib.TYPE_16UINT,
)

Note

Training samples are used to train the classifier. Validation samples are used to test the accuracy of the classifier during the parameter optimisation process and are therefore part of the training process and not independent. Testing samples are completely independent of the training process and are used to test the overall accuracy of the classifier.
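The idea of drawing three non-overlapping random subsets with a fixed seed can be sketched as follows. This is not the RSGISLib implementation: the function split_indices is hypothetical, it works on sample indices rather than HDF5 files, and raising an error when too few samples are available is an assumption of this sketch.

```python
import random


def split_indices(n_samples, train_sample, valid_sample, test_sample, rnd_seed=42):
    """Return three disjoint lists of sample indices (train, valid, test)."""
    needed = train_sample + valid_sample + test_sample
    if needed > n_samples:
        raise ValueError(f"Need {needed} samples but only {n_samples} are available")
    indices = list(range(n_samples))
    # A seeded generator makes the split repeatable between runs.
    random.Random(rnd_seed).shuffle(indices)
    test = indices[:test_sample]
    valid = indices[test_sample:test_sample + valid_sample]
    train = indices[test_sample + valid_sample:needed]
    return train, valid, test


# Scaled-down numbers for illustration (the tutorial uses 35000/10000/10000).
train, valid, test = split_indices(
    n_samples=100, train_sample=35, valid_sample=10, test_sample=10
)
```

Because every class is reduced to the same three sample counts, the resulting training data are balanced regardless of how many pixels were originally extracted per class.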

Apply an XGBoost Binary Classifier

To train a single binary classifier you need to use the following function:

import rsgislib.classification.classxgboost

cls_bin_mdl_file = os.path.join(out_dir, "xgb_mng_bin_mdl.json")
rsgislib.classification.classxgboost.train_opt_xgboost_binary_classifier(
    out_mdl_file=cls_bin_mdl_file,
    cls1_train_file=mangrove_train_smpls_h5_file,
    cls1_valid_file=mangrove_valid_smpls_h5_file,
    cls1_test_file=mangrove_test_smpls_h5_file,
    cls2_train_file=other_train_smpls_h5_file,
    cls2_valid_file=other_valid_smpls_h5_file,
    cls2_test_file=other_test_smpls_h5_file,
    op_mthd=rsgislib.OPT_MTHD_BAYESOPT,
    n_opt_iters=100,
    rnd_seed=42,
    n_threads=1,
    mdl_cls_obj=None,
    out_params_file=None,
    use_gpu=False,
)

To apply the binary classifier use the following function:

cls_score_img = os.path.join(out_dir, "LS5TM_19970716_bin_cls_score_img.kea")
out_class_img = os.path.join(out_dir, "LS5TM_19970716_bin_cls_img.kea")
rsgislib.classification.classxgboost.apply_xgboost_binary_classifier(
    model_file=cls_bin_mdl_file,
    in_msk_img=vld_msk_img,
    img_msk_val=1,
    img_file_info=imgs_info,
    out_score_img=cls_score_img,
    gdalformat="KEA",
    out_class_img=out_class_img,
    class_thres=5000,
    n_threads=1,
)

Note

Class probability values are multiplied by 10,000 so a threshold of 5000 is really 0.5.
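The conversion between a probability and the scaled integer threshold is simple arithmetic, sketched below. The helper names and the example scores are made up, and the use of a strict greater-than comparison for the hard classification is an assumption of this sketch rather than documented behaviour.

```python
SCORE_SCALE = 10000  # score image values are probabilities multiplied by 10,000


def score_to_probability(score):
    """Convert a scaled integer score back to a 0-1 probability."""
    return score / SCORE_SCALE


def probability_to_threshold(probability):
    """Convert a 0-1 probability to the integer value expected by class_thres."""
    return int(round(probability * SCORE_SCALE))


class_thres = probability_to_threshold(0.5)

# Hypothetical score image values for four pixels.
scores = [1200, 4999, 5000, 9800]
# Assumption: a pixel is assigned to class 1 when its score exceeds the threshold.
hard_class = [1 if s > class_thres else 0 for s in scores]
```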

Apply an XGBoost Multi-Class Classifier

To train a multi-class classifier you first need to specify the reference samples as a dict of rsgislib.classification.ClassInfoObj objects:

import rsgislib.classification

cls_info_dict = dict()
cls_info_dict["Mangrove"] = rsgislib.classification.ClassInfoObj(
    id=0,
    out_id=1,
    train_file_h5=mangrove_train_smpls_h5_file,
    test_file_h5=mangrove_test_smpls_h5_file,
    valid_file_h5=mangrove_valid_smpls_h5_file,
    red=0,
    green=255,
    blue=0,
)
cls_info_dict["Other Terrestrial"] = rsgislib.classification.ClassInfoObj(
    id=1,
    out_id=2,
    train_file_h5=other_terrestrial_train_smpls_h5_file,
    test_file_h5=other_terrestrial_test_smpls_h5_file,
    valid_file_h5=other_terrestrial_valid_smpls_h5_file,
    red=100,
    green=100,
    blue=100,
)
cls_info_dict["Water"] = rsgislib.classification.ClassInfoObj(
    id=2,
    out_id=3,
    train_file_h5=water_train_smpls_h5_file,
    test_file_h5=water_test_smpls_h5_file,
    valid_file_h5=water_valid_smpls_h5_file,
    red=0,
    green=0,
    blue=255,
)

You can then train a multi-class xgboost classifier using the following function:

import rsgislib.classification.classxgboost

cls_mcls_mdl_file = os.path.join(out_dir, "xgb_mng_mcls_mdl.json")
rsgislib.classification.classxgboost.train_opt_xgboost_multiclass_classifier(
    out_mdl_file=cls_mcls_mdl_file,
    cls_info_dict=cls_info_dict,
    op_mthd=rsgislib.OPT_MTHD_BAYESOPT,
    n_opt_iters=100,
    rnd_seed=42,
    n_threads=1,
    mdl_cls_obj=None,
    use_gpu=False,
)

To apply the multi-class classifier use the following function:

out_class_img = os.path.join(out_dir, "LS5TM_19970716_mcls_img.kea")
rsgislib.classification.classxgboost.apply_xgboost_multiclass_classifier(
    model_file=cls_mcls_mdl_file,
    cls_info_dict=cls_info_dict,
    in_msk_img=vld_msk_img,
    img_msk_val=1,
    img_file_info=imgs_info,
    out_class_img=out_class_img,
    gdalformat="KEA",
    class_clr_names=True,
    n_threads=1,
)

Note

Within the rsgislib.classification.ClassInfoObj class you need to provide an id and out_id value. The id must start from zero and be consecutive while the out_id will be used as the pixel value for the output classification image and can be any integer value.
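A small check of the id rule can catch mistakes before training. The sketch below uses a hypothetical stand-in class (ClassIds) holding only id and out_id, so it runs without RSGISLib; the same check could be applied to the id attributes of real ClassInfoObj objects.

```python
from dataclasses import dataclass


@dataclass
class ClassIds:
    """Hypothetical stand-in for the id/out_id fields of ClassInfoObj."""
    id: int
    out_id: int


def check_class_ids(cls_info):
    """Raise ValueError unless the ids are 0, 1, 2, ... with no gaps."""
    ids = sorted(info.id for info in cls_info.values())
    if ids != list(range(len(ids))):
        raise ValueError(f"ids must start at zero and be consecutive, got {ids}")
    return True


cls_ids = {
    "Mangrove": ClassIds(id=0, out_id=1),
    "Other Terrestrial": ClassIds(id=1, out_id=2),
    "Water": ClassIds(id=2, out_id=3),
}
ids_ok = check_class_ids(cls_ids)
```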

Binary Classification Functions

rsgislib.classification.classxgboost.optimise_xgboost_binary_classifier(out_params_file: str, cls1_train_file: str, cls1_valid_file: str, cls2_train_file: str, cls2_valid_file: str, op_mthd: int = 1, n_opt_iters: int = 100, rnd_seed: int = None, n_threads: int = 1, mdl_cls_obj=None, use_gpu: bool = False)

A function which performs a hyper-parameter optimisation for a binary xgboost classifier. Class 1 is the class which you are interested in and Class 2 is the ‘other class’.

You have the option of using the bayes_opt (Default), optuna or skopt optimisation libraries. Before 5.1.0 skopt was the only option but this no longer appears to be maintained so the other options have been added.

Parameters:
  • out_params_file – The output JSON file with the identified parameters

  • cls1_train_file – File path to the HDF5 file with the training samples for class 1

  • cls1_valid_file – File path to the HDF5 file with the validation samples for class 1

  • cls2_train_file – File path to the HDF5 file with the training samples for class 2

  • cls2_valid_file – File path to the HDF5 file with the validation samples for class 2

  • op_mthd – The method used to optimise the parameters. Default: rsgislib.OPT_MTHD_BAYESOPT

  • n_opt_iters – The number of iterations (Default 100) used for the optimisation. This parameter is ignored for skopt. For bayes_opt there is a minimum of 10 iterations and n_opt_iters is added to that minimum, so the default total is 110. For optuna this is the number of iterations used.

  • rnd_seed – A random seed for the optimisation. Default None. If None then a different seed will be used each time the function is run.

  • n_threads – The number of threads used by xgboost

  • mdl_cls_obj – An optional (Default None) xgboost model which will be used as the basis model from which training will be continued (i.e., transfer learning).

  • use_gpu – A boolean to specify whether the GPU should be used for training. If you have a GPU available which supports CUDA and xgboost is installed with GPU support then this can significantly speed up the training of your model.
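The effect of n_opt_iters on each optimisation library, as described above, can be summarised in a few lines. The function and the string method names are hypothetical and only restate the parameter description; they are not part of RSGISLib, which selects the method using integer constants such as rsgislib.OPT_MTHD_BAYESOPT.

```python
BAYES_OPT_MIN_ITERS = 10  # minimum number of iterations stated for bayes_opt


def effective_opt_iters(method, n_opt_iters=100):
    """Return the total iterations run, or None if n_opt_iters is ignored."""
    if method == "bayes_opt":
        return BAYES_OPT_MIN_ITERS + n_opt_iters
    if method == "optuna":
        return n_opt_iters
    if method == "skopt":
        return None  # n_opt_iters is ignored for skopt
    raise ValueError(f"Unknown optimisation method: {method}")


default_bayes_iters = effective_opt_iters("bayes_opt")
default_optuna_iters = effective_opt_iters("optuna")
```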

rsgislib.classification.classxgboost.train_xgboost_binary_classifier(out_mdl_file: str, cls_params_file: str, cls1_train_file: str, cls1_valid_file: str, cls1_test_file: str, cls2_train_file: str, cls2_valid_file: str, cls2_test_file: str, n_threads: int = 1, mdl_cls_obj=None, use_gpu: bool = False)

A function which trains a binary xgboost model using the parameters provided within a JSON file. The JSON file must provide values for the following parameters:

  • eta

  • gamma

  • max_depth

  • min_child_weight

  • max_delta_step

  • subsample

  • bagging_fraction

  • eval_metric

  • objective

Parameters:
  • out_mdl_file – The file path for the output xgboost (*.json) model which can be loaded to perform a classification.

  • cls_params_file – The file path to the JSON file with the classifier parameters.

  • cls1_train_file – File path to the HDF5 file with the training samples for class 1

  • cls1_valid_file – File path to the HDF5 file with the validation samples for class 1

  • cls1_test_file – File path to the HDF5 file with the testing samples for class 1

  • cls2_train_file – File path to the HDF5 file with the training samples for class 2

  • cls2_valid_file – File path to the HDF5 file with the validation samples for class 2

  • cls2_test_file – File path to the HDF5 file with the testing samples for class 2

  • n_threads – The number of threads used by xgboost

  • mdl_cls_obj – An optional (Default None) xgboost model which will be used as the basis model from which training will be continued (i.e., transfer learning).

  • use_gpu – A boolean to specify whether the GPU should be used for training. If you have a GPU available which supports CUDA and xgboost is installed with GPU support then this can significantly speed up the training of your model.
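A parameters file with the keys listed above could be written and checked as sketched below. The key names come from the list in this section, but every value is purely illustrative and the exact types or any additional keys RSGISLib expects are not confirmed here. In practice the file would normally be produced by optimise_xgboost_binary_classifier rather than by hand.

```python
import json
import os
import tempfile

# Keys the JSON file must provide, as listed in the function description.
REQUIRED_KEYS = [
    "eta",
    "gamma",
    "max_depth",
    "min_child_weight",
    "max_delta_step",
    "subsample",
    "bagging_fraction",
    "eval_metric",
    "objective",
]

# Illustrative values only; these are not recommended settings.
example_params = {
    "eta": 0.1,
    "gamma": 0,
    "max_depth": 6,
    "min_child_weight": 1,
    "max_delta_step": 0,
    "subsample": 0.8,
    "bagging_fraction": 0.8,
    "eval_metric": "auc",
    "objective": "binary:logistic",
}

# Write to a temporary location so the demo leaves the project untouched.
params_path = os.path.join(tempfile.mkdtemp(), "xgb_params.json")
with open(params_path, "w") as f:
    json.dump(example_params, f, indent=4)

# Read the file back and confirm no required key is missing.
with open(params_path) as f:
    loaded_params = json.load(f)
missing_keys = [key for key in REQUIRED_KEYS if key not in loaded_params]
```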

rsgislib.classification.classxgboost.train_opt_xgboost_binary_classifier(out_mdl_file: str, cls1_train_file: str, cls1_valid_file: str, cls1_test_file: str, cls2_train_file: str, cls2_valid_file: str, cls2_test_file: str, op_mthd: int = 1, n_opt_iters: int = 100, rnd_seed: int = None, n_threads: int = 1, mdl_cls_obj=None, out_params_file: str = None, use_gpu: bool = False)

A function which performs a hyper-parameter optimisation for a binary xgboost classifier and then trains a model saving the model for future use. Class 1 is the class which you are interested in and Class 2 is the ‘other class’.

You have the option of using the bayes_opt (Default), optuna or skopt optimisation libraries. Before 5.1.0 skopt was the only option but this no longer appears to be maintained so the other options have been added.

Parameters:
  • out_mdl_file – The file path for the output xgboost (*.json) model which can be loaded to perform a classification.

  • cls1_train_file – File path to the HDF5 file with the training samples for class 1

  • cls1_valid_file – File path to the HDF5 file with the validation samples for class 1

  • cls1_test_file – File path to the HDF5 file with the testing samples for class 1

  • cls2_train_file – File path to the HDF5 file with the training samples for class 2

  • cls2_valid_file – File path to the HDF5 file with the validation samples for class 2

  • cls2_test_file – File path to the HDF5 file with the testing samples for class 2

  • op_mthd – The method used to optimise the parameters. Default: rsgislib.OPT_MTHD_BAYESOPT

  • n_opt_iters – The number of iterations (Default 100) used for the optimisation. This parameter is ignored for skopt. For bayes_opt there is a minimum of 10 iterations and n_opt_iters is added to that minimum, so the default total is 110. For optuna this is the number of iterations used.

  • rnd_seed – A random seed for the optimisation. Default None. If None then a different seed will be used each time the function is run.

  • n_threads – The number of threads used by xgboost

  • mdl_cls_obj – An optional (Default None) xgboost model which will be used as the basis model from which training will be continued (i.e., transfer learning).

  • out_params_file – The output JSON file with the identified parameters. If None (default) then no file is written.

  • use_gpu – A boolean to specify whether the GPU should be used for training. If you have a GPU available which supports CUDA and xgboost is installed with GPU support then this can significantly speed up the training of your model.

rsgislib.classification.classxgboost.apply_xgboost_binary_classifier(model_file: str, in_msk_img: str, img_msk_val: int, img_file_info: List[ImageBandInfo], out_score_img: str, gdalformat: str = 'KEA', out_class_img=None, class_thres: int = 5000, n_threads: int = 1)

A function for applying a trained binary xgboost model to an image or stack of image files.

Parameters:
  • model_file – a trained xgboost binary model which can be loaded with the xgb.Booster function load_model(model_file).

  • in_msk_img – an image file providing a mask to specify which pixels should be classified. The simplest mask is all the valid data regions (rsgislib.imageutils.gen_valid_mask)

  • img_msk_val – the pixel value within in_msk_img used to limit the region to which the classification is applied. Can be used to create a hierarchical classification.

  • img_file_info – a list of rsgislib.imageutils.ImageBandInfo objects to identify which images and bands are to be used for the classification so it adheres to the training data.

  • out_score_img – output image file with the classification softmax score. Note, this image is scaled by multiplying by 10000, therefore the range is 0-10000.

  • gdalformat – The output image format (Default: KEA).

  • out_class_img – Optional output image which will contain the hard classification, defined with a threshold on the softmax score image.

  • class_thres – The threshold used to define the hard classification. Default is 5000 (i.e., softmax score of 0.5).

  • n_threads – The number of threads used by xgboost

Multi-Class Classification Functions

rsgislib.classification.classxgboost.optimise_xgboost_multiclass_classifier(out_params_file: str, cls_info_dict: Dict[str, ClassInfoObj], sub_train_smpls: int | float = None, op_mthd: int = 1, n_opt_iters: int = 100, rnd_seed: int = None, n_threads: int = 1, mdl_cls_obj=None, use_gpu: bool = False)

A function which performs a hyper-parameter optimisation for a multi-class xgboost classifier.

You have the option of using the bayes_opt (Default), optuna or skopt optimisation libraries. Before 5.1.0 skopt was the only option but this no longer appears to be maintained so the other options have been added.

Parameters:
  • out_params_file – The output JSON file with the identified parameters

  • cls_info_dict – a dict of ClassInfoObj objects defining the training data, where each key is the class name (a string).

  • sub_train_smpls – Subset the training samples. If None or 0 then no sub-setting will occur. If between 0-1 then a ratio subset (e.g., 0.25 = 25 % subset) will be taken. If > 1 then that number of points will be taken per class.

  • op_mthd – The method used to optimise the parameters. Default: rsgislib.OPT_MTHD_BAYESOPT

  • n_opt_iters – The number of iterations (Default 100) used for the optimisation. This parameter is ignored for skopt. For bayes_opt there is a minimum of 10 iterations and n_opt_iters is added to that minimum, so the default total is 110. For optuna this is the number of iterations used.

  • rnd_seed – A random seed for the optimisation. Default None. If None then a different seed will be used each time the function is run.

  • n_threads – The number of threads used by xgboost

  • mdl_cls_obj – An optional (Default None) xgboost model which will be used as the basis model from which training will be continued (i.e., transfer learning).

  • use_gpu – A boolean to specify whether the GPU should be used for training. If you have a GPU available which supports CUDA and xgboost is installed with GPU support then this can significantly speed up the training of your model.
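The three cases described for sub_train_smpls can be expressed as a small function. This is a hypothetical restatement of the parameter description, not RSGISLib code: capping the count at the number of available samples and rejecting a value of exactly 1 (which the description leaves ambiguous) or a negative value are assumptions of this sketch.

```python
def resolve_subset_size(n_samples, sub_train_smpls=None):
    """Return how many training samples per class would be used."""
    if sub_train_smpls is None or sub_train_smpls == 0:
        return n_samples  # no sub-setting
    if 0 < sub_train_smpls < 1:
        return int(n_samples * sub_train_smpls)  # ratio subset
    if sub_train_smpls > 1:
        # Assumption: never request more samples than are available.
        return min(int(sub_train_smpls), n_samples)
    raise ValueError("sub_train_smpls should be None, 0, a ratio (0-1) or a count > 1")


all_samples = resolve_subset_size(35000)
quarter_subset = resolve_subset_size(35000, 0.25)
fixed_subset = resolve_subset_size(35000, 5000)
```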

rsgislib.classification.classxgboost.train_xgboost_multiclass_classifier(out_mdl_file: str, cls_params_file: str, cls_info_dict: Dict[str, ClassInfoObj], n_threads: int = 1, mdl_cls_obj=None, use_gpu: bool = False)

A function which trains a multiclass xgboost model using the parameters provided within a JSON file. The JSON file must provide values for the following parameters:

  • eta

  • gamma

  • max_depth

  • min_child_weight

  • max_delta_step

  • subsample

  • bagging_fraction

  • eval_metric

  • objective

Parameters:
  • cls_params_file – The file path to the JSON file with the classifier parameters.

  • out_mdl_file – The file path for the output xgboost (*.json) model which can be loaded to perform a classification.

  • cls_info_dict – a dict of ClassInfoObj objects defining the training data, where each key is the class name (a string).

  • n_threads – The number of threads used by xgboost

  • mdl_cls_obj – An optional (Default None) xgboost model which will be used as the basis model from which training will be continued (i.e., transfer learning).

  • use_gpu – A boolean to specify whether the GPU should be used for training. If you have a GPU available which supports CUDA and xgboost is installed with GPU support then this can significantly speed up the training of your model.

rsgislib.classification.classxgboost.train_opt_xgboost_multiclass_classifier(out_mdl_file: str, cls_info_dict: Dict[str, ClassInfoObj], op_mthd: int = 1, n_opt_iters: int = 100, rnd_seed: int = None, n_threads: int = 1, mdl_cls_obj=None, use_gpu: bool = False)

A function which performs a hyper-parameter optimisation for a multi-class xgboost classifier and then trains a model saving the model for future use.

You have the option of using the bayes_opt (Default), optuna or skopt optimisation libraries. Before 5.1.0 skopt was the only option but this no longer appears to be maintained so the other options have been added.

Parameters:
  • out_mdl_file – The file path for the output xgboost (*.json) model which can be loaded to perform a classification.

  • cls_info_dict – a dict of ClassInfoObj objects defining the training data, where each key is the class name (a string).

  • op_mthd – The method used to optimise the parameters. Default: rsgislib.OPT_MTHD_BAYESOPT

  • n_opt_iters – The number of iterations (Default 100) used for the optimisation. This parameter is ignored for skopt. For bayes_opt there is a minimum of 10 iterations and n_opt_iters is added to that minimum, so the default total is 110. For optuna this is the number of iterations used.

  • rnd_seed – A random seed for the optimisation. Default None. If None then a different seed will be used each time the function is run.

  • n_threads – The number of threads used by xgboost

  • mdl_cls_obj – An optional (Default None) xgboost model which will be used as the basis model from which training will be continued (i.e., transfer learning).

  • use_gpu – A boolean to specify whether the GPU should be used for training. If you have a GPU available which supports CUDA and xgboost is installed with GPU support then this can significantly speed up the training of your model.

rsgislib.classification.classxgboost.apply_xgboost_multiclass_classifier(model_file: str, cls_info_dict: Dict[str, ClassInfoObj], in_msk_img: str, img_msk_val: int, img_file_info: List[ImageBandInfo], out_class_img: str, gdalformat: str = 'KEA', class_clr_names: bool = True, n_threads: int = 1)

A function for applying a trained multiclass xgboost model to an image or stack of image files.

Parameters:
  • model_file – a trained xgboost multiclass model which can be loaded with the xgb.Booster function load_model(model_file).

  • cls_info_dict – a dict of ClassInfoObj objects defining the training data, where each key is the class name (a string). This is used to define the class names and colours if class_clr_names is True.

  • in_msk_img – an image file providing a mask to specify which pixels should be classified. The simplest mask is all the valid data regions (rsgislib.imageutils.gen_valid_mask)

  • img_msk_val – the pixel value within in_msk_img used to limit the region to which the classification is applied. Can be used to create a hierarchical classification.

  • img_file_info – a list of rsgislib.imageutils.ImageBandInfo objects to identify which images and bands are to be used for the classification so it adheres to the training data.

  • out_class_img – The file path for the output classification image

  • gdalformat – The output image format (Default: KEA).

  • class_clr_names – default is True, in which case a colour table with the colours specified in ClassInfoObj and a class_names column (from cls_info_dict) will be added to the output file. Note the output format needs to support a raster attribute table (i.e., KEA).

  • n_threads – The number of threads used by xgboost

rsgislib.classification.classxgboost.apply_xgboost_multiclass_classifier_rat(clumps_img: str, variables: List[str], model_file: str, cls_info_dict: Dict, out_col_int: str = 'OutClass', out_col_str: str = 'OutClassName', roi_col: str = None, roi_val: int = 1, class_colours: bool = True, n_threads: int = 1)

A function for applying a trained multiclass xgboost model to a raster attribute table.

Parameters:
  • clumps_img – the file path for the input image with associated raster attribute table (RAT) to which the classification will be applied.

  • variables – A list of column names within the RAT to be used for the classification.

  • model_file – a trained xgboost multiclass model which can be loaded with the xgb.Booster function load_model(model_file).

  • cls_info_dict – a dict of ClassInfoObj objects defining the training data, where each key is the class name (a string). Note, this is just used for the class names, int ID and classification colours.

  • out_col_int – is the output column name for the int class representation (Default: ‘OutClass’)

  • out_col_str – is the output column name for the class names column (Default: ‘OutClassName’)

  • roi_col – is a column name for a column which specifies the region to be classified. If None ignored (Default: None)

  • roi_val – is an int value used within the roi_col to select a region to be classified (Default: 1)

  • class_colours – is a boolean specifying whether the RAT colour table should be updated using the classification colours (default: True)

  • n_threads – The number of threads used by xgboost