Model Selection
In Evaluating Performance, we saw how to check the performance of an interpolator using cross-validation. We found that the default parameters for verde.Spline are not good for predicting our sample air temperature data. Now, let's see how we can tune the Spline to improve the cross-validation performance.
Once again, we’ll start by importing the required packages and loading our sample data.
import itertools

import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import pyproj

import verde as vd
data = vd.datasets.fetch_texas_wind()
# Use Mercator projection because Spline is a Cartesian gridder
projection = pyproj.Proj(proj="merc", lat_ts=data.latitude.mean())
proj_coords = projection(data.longitude.values, data.latitude.values)
region = vd.get_region((data.longitude, data.latitude))
# The desired grid spacing in degrees (converted to meters where needed using 1 degree ≈ 111 km)
spacing = 15 / 60
Before we begin tuning, let’s reiterate what the results were with the default parameters.
spline_default = vd.Spline()
score_default = np.mean(
    vd.cross_val_score(spline_default, proj_coords, data.air_temperature_c)
)
spline_default.fit(proj_coords, data.air_temperature_c)
print("R² with defaults:", score_default)
Out:
R² with defaults: 0.7960368854786868
Tuning
Spline has many parameters that can be set to modify the final result, chiefly the damping regularization parameter and the mindist "fudge factor", which smooths the solution. Would changing the default values give us a better score? We can answer this question by changing the values in our spline and re-evaluating the model score for each new set of parameters. Let's test the following combinations:
dampings = [None, 1e-4, 1e-3, 1e-2]
mindists = [5e3, 10e3, 50e3, 100e3]
# Use itertools to create a list with all combinations of parameters to test
parameter_sets = [
    dict(damping=damping, mindist=mindist)
    for damping, mindist in itertools.product(dampings, mindists)
]
print("Number of combinations:", len(parameter_sets))
print("Combinations:", parameter_sets)
Out:
Number of combinations: 16
Combinations: [{'damping': None, 'mindist': 5000.0}, {'damping': None, 'mindist': 10000.0}, {'damping': None, 'mindist': 50000.0}, {'damping': None, 'mindist': 100000.0}, {'damping': 0.0001, 'mindist': 5000.0}, {'damping': 0.0001, 'mindist': 10000.0}, {'damping': 0.0001, 'mindist': 50000.0}, {'damping': 0.0001, 'mindist': 100000.0}, {'damping': 0.001, 'mindist': 5000.0}, {'damping': 0.001, 'mindist': 10000.0}, {'damping': 0.001, 'mindist': 50000.0}, {'damping': 0.001, 'mindist': 100000.0}, {'damping': 0.01, 'mindist': 5000.0}, {'damping': 0.01, 'mindist': 10000.0}, {'damping': 0.01, 'mindist': 50000.0}, {'damping': 0.01, 'mindist': 100000.0}]
Now we can loop over the combinations and collect the scores for each parameter set.
spline = vd.Spline()
scores = []
for params in parameter_sets:
    spline.set_params(**params)
    score = np.mean(vd.cross_val_score(spline, proj_coords, data.air_temperature_c))
    scores.append(score)
print(scores)
Out:
[-6.672752081834777, 0.4845403172740642, 0.8383522700289185, 0.8371988991061056, 0.8351153077200753, 0.8316607509492719, 0.8492629930940236, 0.8418400888233621, 0.837179509183614, 0.8412200336664746, 0.8529555082484436, 0.852172766758575, 0.8401945161939217, 0.8330182923873511, 0.8441706458564976, 0.8491145591352283]
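The raw list is hard to scan, so as a quick sketch (not part of the original example) we can print each parameter set next to its score:

# Pair each parameter combination with its cross-validation score
for params, score in zip(parameter_sets, scores):
    print(params, "->", round(score, 3))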
The largest score identifies the best parameter combination.
best = np.argmax(scores)
print("Best score:", scores[best])
print("Score with defaults:", score_default)
print("Best parameters:", parameter_sets[best])
Out:
Best score: 0.8529555082484436
Score with defaults: 0.7960368854786868
Best parameters: {'damping': 0.001, 'mindist': 50000.0}
That is a nice improvement over our previous score!
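If we wanted to keep using the tuned parameters directly, we could pass them to a new Spline (a minimal sketch; the name spline_best is ours, not from the original example):

# Build a Spline with the winning damping and mindist values
spline_best = vd.Spline(**parameter_sets[best])
spline_best.fit(proj_coords, data.air_temperature_c)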
This type of tuning is important and should always be performed when using a new gridder or a new dataset. However, the above implementation requires a lot of coding. Fortunately, Verde provides convenience classes that perform the cross-validation and tuning automatically when fitting a dataset.
Cross-validated gridders
The verde.SplineCV class provides a cross-validated version of verde.Spline. It has almost the same interface but does all of the above automatically when fitting a dataset. The only difference is that you must provide a list of damping and mindist parameters to try instead of only a single value:
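spline = vd.SplineCV(
    dampings=dampings,
    mindists=mindists,
)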
Calling fit will run a grid search over all parameter combinations to find the one that maximizes the cross-validation score.
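spline.fit(proj_coords, data.air_temperature_c)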
Out:
SplineCV(dampings=[None, 0.0001, 0.001, 0.01],
         mindists=[5000.0, 10000.0, 50000.0, 100000.0])
The estimated best damping and mindist, as well as the cross-validation scores, are stored in class attributes:
print("Highest score:", spline.scores_.max())
print("Best damping:", spline.damping_)
print("Best mindist:", spline.mindist_)
Out:
Highest score: 0.8529555082484436
Best damping: 0.001
Best mindist: 50000.0
The cross-validated gridder can be used like any other gridder (including in verde.Chain and verde.Vector):
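grid = spline.grid(
    region=region,
    spacing=spacing,
    projection=projection,
    dims=["latitude", "longitude"],
    data_names="temperature",
)
print(grid)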
Out:
<xarray.Dataset>
Dimensions: (latitude: 43, longitude: 51)
Coordinates:
* longitude (longitude) float64 -106.4 -106.1 -105.9 ... -94.06 -93.8
* latitude (latitude) float64 25.91 26.16 26.41 ... 35.91 36.16 36.41
Data variables:
temperature (latitude, longitude) float64 19.42 19.42 19.43 ... 7.536 7.765
Attributes:
metadata: Generated by SplineCV(dampings=[None, 0.0001, 0.001, 0.01],\n ...
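As an illustration of the Chain usage mentioned above, here is a minimal sketch (ours, not from the original example) that decimates the data with a block mean before handing it to the cross-validated spline; the step names and the BlockReduce spacing are illustrative assumptions:

chain = vd.Chain(
    [
        # Decimate to roughly one point per grid cell before interpolating
        ("mean", vd.BlockReduce(np.mean, spacing=spacing * 111e3)),
        # The cross-validated spline slots in like any other gridder
        ("spline", vd.SplineCV(dampings=dampings, mindists=mindists)),
    ]
)
chain.fit(proj_coords, data.air_temperature_c)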
Like verde.cross_val_score, SplineCV can also run the grid search in parallel using Dask by specifying the delayed attribute:
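spline = vd.SplineCV(
    dampings=dampings,
    mindists=mindists,
    delayed=True,
)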
Unlike verde.cross_val_score, calling fit does not result in dask.delayed objects. The full grid search is executed and the optimal parameters are found immediately.
spline.fit(proj_coords, data.air_temperature_c)
print("Best damping:", spline.damping_)
print("Best mindist:", spline.mindist_)
Out:
Best damping: 0.001
Best mindist: 50000.0
The one caveat is that the scores_ attribute will be a list of dask.delayed objects instead, because the scores are only computed as intermediate values in the scheduled computations.
print("Delayed scores:", spline.scores_)
Out:
Delayed scores: [Delayed('mean-2c727950-8e88-4d2c-b4ae-97b89a066a44'), Delayed('mean-5564ce3b-02d1-457b-b404-3352205d749d'), Delayed('mean-49a8618e-2540-42ba-a353-ccd38040984e'), Delayed('mean-9f5cbd3d-6c45-4fd1-aa58-ebb961af4a7e'), Delayed('mean-511a8e71-9894-4782-8a5e-8a3087a0d937'), Delayed('mean-c9d5455a-3cb6-4378-ad14-753e15b971b1'), Delayed('mean-8b5f87fc-46c7-4f9a-8b7d-b698833ddf42'), Delayed('mean-9b1402db-f4ae-4c43-af19-910148035dde'), Delayed('mean-38a5894a-d7a5-4fdd-82be-54deed427cc6'), Delayed('mean-c0334135-c2d8-47be-a20d-bc80b3040dd3'), Delayed('mean-df4e9b8c-eb5e-4e98-a04b-4b52210d7bab'), Delayed('mean-e6de421e-6eea-4b02-a38e-98b650614855'), Delayed('mean-1732fb4b-1ab4-4fd7-9165-6bb090043580'), Delayed('mean-97399b79-18cf-40bc-a2b0-b20df0ad320b'), Delayed('mean-62b4997c-e542-43a5-a7b1-29e45695b670'), Delayed('mean-8443a268-372f-400d-84f6-29843d858ae3')]
Calling dask.compute on the scores will calculate their values but will unfortunately run the entire grid search again. So using delayed=True is not recommended if you need the scores of each parameter combination.
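For reference, materializing the delayed scores would look like the following sketch; note that it triggers the full grid search once more:

import dask

# Computing the delayed objects re-executes the entire grid search
computed_scores = dask.compute(*spline.scores_)
print(computed_scores)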
The importance of tuning
To see the difference that tuning has on the results, we can make a grid with the best configuration and see how it compares to the default result.
grid_default = spline_default.grid(
    region=region,
    spacing=spacing,
    projection=projection,
    dims=["latitude", "longitude"],
    data_names="temperature",
)
Let’s plot our grids side-by-side:
mask = vd.distance_mask(
    (data.longitude, data.latitude),
    maxdist=3 * spacing * 111e3,
    coordinates=vd.grid_coordinates(region, spacing=spacing),
    projection=projection,
)
grid = grid.where(mask)
grid_default = grid_default.where(mask)
plt.figure(figsize=(14, 8))
for i, title, grd in zip(range(2), ["Defaults", "Tuned"], [grid_default, grid]):
    ax = plt.subplot(1, 2, i + 1, projection=ccrs.Mercator())
    ax.set_title(title)
    pc = grd.temperature.plot.pcolormesh(
        ax=ax,
        cmap="plasma",
        transform=ccrs.PlateCarree(),
        vmin=data.air_temperature_c.min(),
        vmax=data.air_temperature_c.max(),
        add_colorbar=False,
        add_labels=False,
    )
    plt.colorbar(pc, orientation="horizontal", aspect=50, pad=0.05).set_label("C")
    ax.plot(
        data.longitude, data.latitude, ".k", markersize=1, transform=ccrs.PlateCarree()
    )
    vd.datasets.setup_texas_wind_map(ax)
plt.show()
Notice that, for sparse data like these, smoother models tend to be better predictors. This is a sign that you should probably not trust many of the short wavelength features that we get from the defaults.