RSGISLib Imbalanced Classification Utilities

rsgislib.classification.classimblearn.imblearn_h5_io_smplr(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], imblearn_obj, datatype: int = None)

A function which uses an over- or under-sampling instance from the imbalanced-learn module to balance the number of samples between all the classes. The function reads the input samples from a set of HDF5 files (one per class) and writes the resampled data to a corresponding set of HDF5 files, matching the input/output convention of the other functions in RSGISLib.

More information can be found here: https://imbalanced-learn.org/stable/over_sampling.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • imblearn_obj – an instance of an imblearn under- or over-sampling class.

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.
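
As a hedged usage sketch, the function might be called along the following lines. The class names, HDF5 file paths and the helper names build_class_info and run_custom_sampler are hypothetical. The ClassSimpleInfoObj keyword arguments (id, file_h5) are an assumption about the RSGISLib classification module and should be checked against the installed version. The third-party imports sit inside the functions so the snippet can be loaded without RSGISLib or imbalanced-learn installed.

```python
from typing import Dict, Tuple

# Hypothetical class names mapped to (class id, input HDF5 file, output HDF5 file).
CLASS_FILES: Dict[str, Tuple[int, str, str]] = {
    "Forest": (1, "forest_smpls.h5", "forest_smpls_bal.h5"),
    "Water": (2, "water_smpls.h5", "water_smpls_bal.h5"),
}


def build_class_info(class_files, use_output: bool = False):
    """Build a dict of ClassSimpleInfoObj keyed by class name."""
    # Imported lazily; the keyword names below are an assumption to verify.
    import rsgislib.classification

    cls_info = {}
    for name, (cls_id, in_h5, out_h5) in class_files.items():
        cls_info[name] = rsgislib.classification.ClassSimpleInfoObj(
            id=cls_id, file_h5=out_h5 if use_output else in_h5
        )
    return cls_info


def run_custom_sampler(class_files, rnd_seed: int = 42):
    """Balance the classes with a user-configured imbalanced-learn sampler."""
    from imblearn.under_sampling import RandomUnderSampler
    from rsgislib.classification import classimblearn

    cls_in_info = build_class_info(class_files, use_output=False)
    cls_out_info = build_class_info(class_files, use_output=True)
    # Any imbalanced-learn sampler instance could be configured here.
    sampler = RandomUnderSampler(random_state=rnd_seed)
    classimblearn.imblearn_h5_io_smplr(cls_in_info, cls_out_info, sampler)
```

Calling run_custom_sampler(CLASS_FILES) would then read the input files and write the balanced samples to the output files, provided the input files exist.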

Under Sampling

rsgislib.classification.classimblearn.random_undersample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], rnd_seed: int = 42, datatype: int = None)

A function which uses random undersampling from the imbalanced-learn module to balance samples between all the classes.

More information can be found here: https://imbalanced-learn.org/stable/under_sampling.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • rnd_seed – the random seed used for the analysis

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

rsgislib.classification.classimblearn.cluster_centroid_undersample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], rnd_seed: int = 42, datatype: int = None)

A function which uses ClusterCentroids undersampling from the imbalanced-learn module to balance samples between all the classes.

More information can be found here: https://imbalanced-learn.org/stable/under_sampling.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • rnd_seed – the random seed used for the analysis

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

rsgislib.classification.classimblearn.near_miss_undersample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], version: int = 1, datatype: int = None)

A function which uses NearMiss undersampling from the imbalanced-learn module to balance samples between all the classes. Note this function only works with continuous data variables.

More information can be found here: https://imbalanced-learn.org/stable/under_sampling.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

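  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • version – the version of the NearMiss algorithm to use (imbalanced-learn provides versions 1, 2 and 3). The default is 1.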

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.
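
A minimal sketch of the NearMiss version 1 selection rule, assuming two classes held in memory as lists of points: it keeps the majority samples whose mean distance to their nearest minority samples is smallest. The function name near_miss_1 and the demo data are hypothetical, and versions 2 and 3 use different selection rules that are not shown. Because the rule relies on distances between samples, it only makes sense for continuous variables.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def near_miss_1(
    majority: List[Point], minority: List[Point], n_neighbors: int = 3
) -> List[Point]:
    """Keep len(minority) majority samples that lie closest to the minority class."""
    k = min(n_neighbors, len(minority))

    def mean_dist_to_nearest_minority(p: Point) -> float:
        # Mean Euclidean distance to the k closest minority samples.
        dists = sorted(math.dist(p, q) for q in minority)
        return sum(dists[:k]) / k

    ranked = sorted(majority, key=mean_dist_to_nearest_minority)
    return ranked[: len(minority)]


minority = [[0.0, 0.0], [1.0, 0.0]]
majority = [[0.5, 0.1], [5.0, 5.0], [0.2, -0.1], [9.0, 9.0], [3.0, 3.0]]
kept = near_miss_1(majority, minority, n_neighbors=2)
print(kept)
```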

rsgislib.classification.classimblearn.repeat_edited_near_neigh_undersample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], datatype: int = None)

A function which uses RepeatedEditedNearestNeighbours undersampling from the imbalanced-learn module to balance samples between all the classes.

More information can be found here: https://imbalanced-learn.org/stable/under_sampling.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.
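
RepeatedEditedNearestNeighbours applies an edited-nearest-neighbours rule repeatedly. The sketch below is a simplified illustration of that idea: a sample is dropped when most of its nearest neighbours carry a different class label, and the pass is repeated until nothing changes. The function names and demo data are hypothetical, and imbalanced-learn's implementation has further options (for example, which classes are cleaned and how strictly the neighbours must agree) that are not modelled here.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]


def edited_nearest_neighbours(data: List[Sample], n_neighbors: int = 3) -> List[Sample]:
    """One editing pass: drop samples that disagree with most of their neighbours."""
    kept = []
    for i, (point, label) in enumerate(data):
        others = [s for j, s in enumerate(data) if j != i]
        nearest = sorted(others, key=lambda s: math.dist(point, s[0]))[:n_neighbors]
        if not nearest:
            # A lone sample has no neighbours to disagree with.
            kept.append((point, label))
            continue
        votes = Counter(lbl for _, lbl in nearest)
        if votes.most_common(1)[0][0] == label:
            kept.append((point, label))
    return kept


def repeated_enn(data: List[Sample], n_neighbors: int = 3, max_iter: int = 100) -> List[Sample]:
    """Repeat the editing pass until no sample is removed or max_iter is reached."""
    for _ in range(max_iter):
        edited = edited_nearest_neighbours(data, n_neighbors)
        if len(edited) == len(data):
            break
        data = edited
    return data


demo = [
    ([0.0, 0.0], "Water"), ([0.1, 0.0], "Water"),
    ([0.0, 0.1], "Water"), ([0.1, 0.1], "Water"),
    ([5.0, 5.0], "Forest"), ([5.1, 5.0], "Forest"),
    ([5.0, 5.1], "Forest"), ([5.1, 5.1], "Forest"),
    ([0.05, 0.05], "Forest"),  # a Forest sample sitting inside the Water cluster
]
cleaned = repeated_enn(demo)
print(len(demo), "->", len(cleaned))
```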

rsgislib.classification.classimblearn.all_knn_undersample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], datatype: int = None)

A function which uses AllKNN undersampling from the imbalanced-learn module to balance samples between all the classes.

More information can be found here: https://imbalanced-learn.org/stable/under_sampling.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

rsgislib.classification.classimblearn.condensed_near_neigh_undersample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], rnd_seed: int = 42, datatype: int = None)

A function which uses CondensedNearestNeighbour undersampling from the imbalanced-learn module to balance samples between all the classes.

More information can be found here: https://imbalanced-learn.org/stable/under_sampling.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • rnd_seed – the random seed used for the analysis

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

rsgislib.classification.classimblearn.one_sided_sel_undersample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], rnd_seed: int = 42, datatype: int = None)

A function which uses OneSidedSelection undersampling from the imbalanced-learn module to balance samples between all the classes.

More information can be found here: https://imbalanced-learn.org/stable/under_sampling.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • rnd_seed – the random seed used for the analysis

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

rsgislib.classification.classimblearn.neighbourhood_clean_undersample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], datatype: int = None)

A function which uses NeighbourhoodCleaningRule undersampling from the imbalanced-learn module to balance samples between all the classes.

More information can be found here: https://imbalanced-learn.org/stable/under_sampling.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

Over Sampling

rsgislib.classification.classimblearn.rand_oversample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], rnd_seed: int = 42, datatype: int = None)

A function which uses random oversampling from the imbalanced-learn module to balance samples between all the classes.

More information can be found here: https://imbalanced-learn.org/stable/over_sampling.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • rnd_seed – the random seed used for the analysis

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.
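
Random oversampling can be sketched in the same spirit: every class is topped up to the size of the largest class by drawing its own rows again with replacement. This is an illustration of the technique only. The function name random_oversample and the in-memory dict are hypothetical, and every class is assumed to contain at least one sample.

```python
import random
from typing import Dict, List, Sequence


def random_oversample(
    samples: Dict[str, Sequence[Sequence[float]]], rnd_seed: int = 42
) -> Dict[str, List[Sequence[float]]]:
    """Randomly duplicate rows so every class matches the largest class."""
    rng = random.Random(rnd_seed)
    n_max = max(len(rows) for rows in samples.values())
    balanced = {}
    for name, rows in samples.items():
        rows = list(rows)
        # Draw the shortfall with replacement, so rows may repeat.
        extra = [rng.choice(rows) for _ in range(n_max - len(rows))]
        balanced[name] = rows + extra
    return balanced


demo = {
    "Forest": [[0.1 * i, 0.2 * i] for i in range(10)],
    "Water": [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]],
}
result = random_oversample(demo)
print({name: len(rows) for name, rows in result.items()})
```

No new values are created by this approach; the smaller classes simply contain repeated rows.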

rsgislib.classification.classimblearn.smote_oversample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], datatype: int = None)

A function which uses SMOTE oversampling from the imbalanced-learn module to balance samples between all the classes. Note this function only works with continuous data variables.

More information can be found here: https://imbalanced-learn.org/stable/over_sampling.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.
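
The restriction to continuous variables follows from how SMOTE creates samples: a synthetic point is placed on the line between a minority sample and one of its nearest minority neighbours. A minimal sketch of that interpolation step is given below. The function name smote_like and the demo data are hypothetical, at least two minority samples are assumed, and this is not the imbalanced-learn implementation.

```python
import math
import random
from typing import List, Sequence


def smote_like(
    minority: List[Sequence[float]],
    n_new: int,
    k_neighbors: int = 5,
    rnd_seed: int = 42,
) -> List[List[float]]:
    """Create n_new synthetic samples by interpolating between minority neighbours."""
    rng = random.Random(rnd_seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the base sample, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k_neighbors]
        target = rng.choice(neighbours)
        # Random position on the segment between base and target.
        lam = rng.random()
        synthetic.append([b + lam * (t - b) for b, t in zip(base, target)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=6, k_neighbors=2)
print(new_points)
```

Interpolating a categorical code (for example, a land cover class stored as an integer) would produce values with no meaning, which is why such variables are unsuitable here.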

rsgislib.classification.classimblearn.adasyn_oversample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], datatype: int = None)

A function which uses ADASYN oversampling from the imbalanced-learn module to balance samples between all the classes. Note this function only works with continuous data variables.

More information can be found here: https://imbalanced-learn.org/stable/over_sampling.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

rsgislib.classification.classimblearn.borderline_smote_oversample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], datatype: int = None)

A function which uses BorderlineSMOTE oversampling from the imbalanced-learn module to balance samples between all the classes. Note this function only works with continuous data variables.

More information can be found here: https://imbalanced-learn.org/stable/over_sampling.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

Combined Under/Over Sampling

rsgislib.classification.classimblearn.smoteenn_combined_sample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], rnd_seed: int = 42, datatype: int = None)

A function which uses SMOTEENN combined under and over sampling from the imbalanced-learn module to balance samples between all the classes.

More information can be found here: https://imbalanced-learn.org/stable/combine.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • rnd_seed – the random seed used for the analysis

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.

rsgislib.classification.classimblearn.smotetomek_combined_sample_smpls(cls_in_info: Dict[str, ClassSimpleInfoObj], cls_out_info: Dict[str, ClassSimpleInfoObj], rnd_seed: int = 42, datatype: int = None)

A function which uses SMOTETomek combined under and over sampling from the imbalanced-learn module to balance samples between all the classes.

More information can be found here: https://imbalanced-learn.org/stable/combine.html

Parameters:
  • cls_in_info – input dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • cls_out_info – output dict of rsgislib.classification.ClassSimpleInfoObj specifying class names, ids and HDF5 file names.

  • rnd_seed – the random seed used for the analysis

  • datatype – is the data type used for the output HDF5 file (e.g., rsgislib.TYPE_32FLOAT). If None (default) then the output data type will be float32.
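
SMOTETomek follows SMOTE oversampling with a cleaning step based on Tomek links: pairs of samples from different classes that are each other's nearest neighbour. The sketch below illustrates only that cleaning step. The function names and demo data are hypothetical, at least two samples are assumed, and removing both members of each link is an assumption made for this sketch, since whether one or both are dropped depends on how the sampler is configured.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]


def nearest_index(data: List[Sample], i: int) -> int:
    """Index of the sample closest to data[i], excluding i itself."""
    point = data[i][0]
    candidates = [j for j in range(len(data)) if j != i]
    return min(candidates, key=lambda j: math.dist(point, data[j][0]))


def tomek_links(data: List[Sample]) -> List[Tuple[int, int]]:
    """Find mutually nearest pairs of samples that belong to different classes."""
    links = []
    for i in range(len(data)):
        j = nearest_index(data, i)
        # i < j avoids reporting the same pair twice.
        if i < j and nearest_index(data, j) == i and data[i][1] != data[j][1]:
            links.append((i, j))
    return links


def remove_tomek_links(data: List[Sample]) -> List[Sample]:
    """Drop both members of every Tomek link (an assumption of this sketch)."""
    linked = {idx for pair in tomek_links(data) for idx in pair}
    return [s for idx, s in enumerate(data) if idx not in linked]


demo = [
    ([0.0, 0.0], "Water"), ([0.0, 1.0], "Water"), ([2.0, 0.0], "Water"),
    ([2.1, 0.0], "Forest"), ([4.0, 0.0], "Forest"), ([4.0, 1.0], "Forest"),
]
print(tomek_links(demo))
print(len(remove_tomek_links(demo)))
```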