verde.BlockShuffleSplit
- class verde.BlockShuffleSplit(spacing=None, shape=None, n_splits=10, test_size=0.1, train_size=None, random_state=None, balancing=10)
Random permutation of spatial blocks cross-validator.
Yields indices to split data into training and test sets. Data are first grouped into rectangular blocks of size given by the spacing argument. Alternatively, blocks can be defined by the number of blocks in each dimension using the shape argument instead of spacing. The blocks are then split into testing and training sets randomly.
The proportion of blocks assigned to each set is controlled by test_size and/or train_size. However, the total amount of actual data points in each set could be different from these values since blocks can have a different number of data points inside them. To guarantee that the proportion of actual data is as close as possible to the proportion of blocks, this cross-validator generates an extra number of splits and selects the one with proportion of data points in each set closer to the desired amount [Valavi_etal2019]. The number of balancing splits per iteration is controlled by the balancing argument.
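The grouping-then-balancing procedure described above can be sketched in plain NumPy. This is a simplified, hypothetical stand-in for illustration only, not verde's actual implementation (the library builds blocks with verde.block_split and handles region edges and block sizing more carefully):

```python
import numpy as np


def block_shuffle_split(easting, northing, spacing, test_size=0.1,
                        balancing=10, rng=None):
    """One train/test split over spatial blocks, with balancing.

    Simplified sketch of the idea, not verde's internal code.
    """
    rng = np.random.default_rng(rng)
    # Group points into rectangular blocks of the given size.
    col = np.floor((easting - easting.min()) / spacing).astype(int)
    row = np.floor((northing - northing.min()) / spacing).astype(int)
    block_id = row * (col.max() + 1) + col
    blocks = np.unique(block_id)
    n_test_blocks = max(1, int(round(test_size * blocks.size)))
    best_mask, best_gap = None, np.inf
    # Generate `balancing` candidate splits and keep the one whose
    # proportion of *data points* is closest to the requested test_size.
    for _ in range(balancing):
        test_blocks = rng.choice(blocks, size=n_test_blocks, replace=False)
        test_mask = np.isin(block_id, test_blocks)
        gap = abs(test_mask.mean() - test_size)
        if gap < best_gap:
            best_mask, best_gap = test_mask, gap
    return np.where(~best_mask)[0], np.where(best_mask)[0]
```

Because whole blocks (not individual points) are shuffled, neighboring points end up on the same side of the split, which is what prevents spatial leakage between the sets.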
This cross-validator is preferred over sklearn.model_selection.ShuffleSplit for spatial data to avoid overestimating cross-validation scores. This can happen because of the inherent autocorrelation that is usually associated with this type of data (points that are close together are more likely to have similar values). See [Roberts_etal2017] for an overview of this topic.

Note

Like sklearn.model_selection.ShuffleSplit, this cross-validator cannot guarantee that all folds will be different, although this is still very likely for sizeable datasets.

- Parameters:
  - spacing : float, tuple = (s_north, s_east), or None
    The block size in the South-North and West-East directions, respectively. A single value means that the spacing is equal in both directions. If None, then shape must be provided.
  - shape : tuple = (n_north, n_east) or None
    The number of blocks in the South-North and West-East directions, respectively. If None, then spacing must be provided.
  - n_splits : int, default=10
    Number of re-shuffling and splitting iterations.
  - test_size : float, int, or None, default=None
    If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.1.
  - train_size : float, int, or None, default=None
    If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
  - random_state : int, RandomState instance, or None, optional (default=None)
    If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.
  - balancing : int
    The number of splits generated per iteration to try to balance the amount of data in each set so that test_size and train_size are respected. If 1, then no extra splits are generated (essentially disabling the balancing). Must be >= 1.
See also
- train_test_split : Split a dataset into a training and a testing set.
- cross_val_score : Score an estimator/gridder using cross-validation.
Examples
>>> from verde import grid_coordinates, BlockShuffleSplit
>>> import numpy as np
>>> # Make a regular grid of data points
>>> coords = grid_coordinates(region=(0, 3, -10, -7), spacing=1)
>>> # Need to convert the coordinates into a feature matrix
>>> X = np.transpose([i.ravel() for i in coords])
>>> shuffle = BlockShuffleSplit(spacing=1.5, n_splits=3, random_state=0)
>>> # These are the 1D indices of the points belonging to each set
>>> for train, test in shuffle.split(X):
...     print("Train: {} Test: {}".format(train, test))
Train: [ 0  1  2  3  4  5  6  7 10 11 14 15] Test: [ 8  9 12 13]
Train: [ 2  3  6  7  8  9 10 11 12 13 14 15] Test: [0 1 4 5]
Train: [ 0  1  4  5  8  9 10 11 12 13 14 15] Test: [2 3 6 7]
>>> # A better way to visualize this is to create a 2D array and put
>>> # "train" or "test" in the corresponding locations.
>>> shape = coords[0].shape
>>> mask = np.full(shape=shape, fill_value="     ")
>>> for iteration, (train, test) in enumerate(shuffle.split(X)):
...     # The index needs to be converted to 2D so we can index our matrix.
...     mask[np.unravel_index(train, shape)] = "train"
...     mask[np.unravel_index(test, shape)] = " test"
...     print("Iteration {}:".format(iteration))
...     print(mask)
Iteration 0:
[['train' 'train' 'train' 'train']
 ['train' 'train' 'train' 'train']
 [' test' ' test' 'train' 'train']
 [' test' ' test' 'train' 'train']]
Iteration 1:
[[' test' ' test' 'train' 'train']
 [' test' ' test' 'train' 'train']
 ['train' 'train' 'train' 'train']
 ['train' 'train' 'train' 'train']]
Iteration 2:
[['train' 'train' ' test' ' test']
 ['train' 'train' ' test' ' test']
 ['train' 'train' 'train' 'train']
 ['train' 'train' 'train' 'train']]
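The 2x2 pattern of test blocks in the output above can be re-derived by hand: with a region 3 units wide in each direction and spacing=1.5, each direction fits exactly two blocks, so every test set is one block of four points. The sketch below is a hypothetical re-derivation of that assignment (assuming points on the far region edge are clipped into the last block); verde's internal block-splitting code may differ:

```python
import numpy as np

# Same 4x4 grid as the example: easting 0..3, northing -10..-7.
easting, northing = np.meshgrid(np.arange(0, 4.0), np.arange(-10.0, -6.0))
spacing = 1.5

# Number of blocks that fit in each direction (2 x 2 here).
n_east = int(np.ceil((easting.max() - easting.min()) / spacing))
n_north = int(np.ceil((northing.max() - northing.min()) / spacing))

# Block index of each point, clipping edge points into the last block.
col = np.clip(np.floor((easting - easting.min()) / spacing).astype(int),
              0, n_east - 1)
row = np.clip(np.floor((northing - northing.min()) / spacing).astype(int),
              0, n_north - 1)
labels = row * n_east + col
```

The resulting labels form four quadrants of four points each, matching the four-point test sets printed by shuffle.split(X) above.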
Methods

- get_metadata_routing() : Get metadata routing of this object.
- get_n_splits([X, y, groups]) : Returns the number of splitting iterations in the cross-validator.
- split(X[, y, groups]) : Generate indices to split data into training and test set.
Methods
- BlockShuffleSplit.get_metadata_routing()
  Get metadata routing of this object.
  Please check User Guide on how the routing mechanism works.
  - Returns:
    - routing : MetadataRequest
      A MetadataRequest encapsulating routing information.
- BlockShuffleSplit.get_n_splits(X=None, y=None, groups=None)
  Returns the number of splitting iterations in the cross-validator.
- BlockShuffleSplit.split(X, y=None, groups=None)
  Generate indices to split data into training and test set.
  - Parameters:
    - X : array_like, shape (n_samples, 2)
      Columns should be the easting and northing coordinates of data points, respectively.
    - y : array_like, shape (n_samples,)
      The target variable for supervised learning problems. Always ignored.
    - groups : array_like, shape (n_samples,), optional
      Group labels for the samples used while splitting the dataset into train/test set. Always ignored.
  - Yields:
    - train : ndarray
      The training set indices for that split.
    - test : ndarray
      The testing set indices for that split.