Fetching files from a registry¶
If you need to manage the download of multiple files from one or more locations, then this section is for you!
Setup¶
In the following example we’ll assume that:
You have several data files served from the same base URL (for example, "https://www.somewebpage.org/science/data").
You know the file names and their hashes.
We will use pooch.create to set up our download manager:
import pooch

odie = pooch.create(
    # Use the default cache folder for the operating system
    path=pooch.os_cache("my-project"),
    base_url="https://www.somewebpage.org/science/data/",
    # The registry specifies the files that can be fetched
    registry={
        "temperature.csv": "sha256:19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc",
        "gravity-disturbance.nc": "sha256:1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w",
    },
)
The return value (odie) is an instance of pooch.Pooch.
It contains all of the information needed to fetch the data files in our
registry and store them in the specified cache folder.
Note
The Pooch registry is a mapping from file names to their associated hashes (and, optionally, download URLs).
Tip
If you don’t know the hash or are otherwise unable to obtain it, it is possible to bypass the check. This is not recommended for general use, only if it can’t be avoided. See Hashes: Calculating and bypassing.
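If you have a trusted local copy of a file, you can compute its hash yourself rather than bypassing the check. The following is a minimal sketch using only the standard library; the helper name registry_hash is hypothetical and not part of Pooch. Pooch also provides pooch.file_hash for this purpose, which is covered in Hashes: Calculating and bypassing.

```python
import hashlib


def registry_hash(path, chunk_size=65536):
    """Return the SHA256 of a file formatted as 'sha256:<hex digest>'.

    Hypothetical helper for illustration; it is not part of Pooch.
    """
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large data files don't need to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return f"sha256:{sha.hexdigest()}"
```

The returned string uses the same "sha256:" prefix as the registry entries in the example above, so it can be pasted in as the value for the corresponding file name.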
Attention
You can have data files in subdirectories of the remote data store (URL). These files will be saved to the same subdirectories in the local storage folder.
However, the names of these files in the registry must use Unix-style separators ('/') even on Windows.
Pooch will handle the appropriate conversions.
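If you build registry entries from paths collected on a Windows machine, the names may contain backslashes. As a small sketch (the helper registry_key and the "models" subdirectory are assumptions made for illustration), the standard library can normalize them to the Unix-style form expected in the registry:

```python
from pathlib import PureWindowsPath


def registry_key(local_name):
    """Convert a relative file name to a registry key with '/' separators.

    Hypothetical helper for illustration; it is not part of Pooch.
    PureWindowsPath accepts both '\\' and '/' as separators.
    """
    return PureWindowsPath(local_name).as_posix()


print(registry_key("models\\gravity-disturbance.nc"))  # models/gravity-disturbance.nc
print(registry_key("models/gravity-disturbance.nc"))   # models/gravity-disturbance.nc
```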
Downloading files¶
To download one of our data files and load it with xarray:
import xarray as xr
file_path = odie.fetch("gravity-disturbance.nc")
# Standard use of xarray to load a netCDF file (.nc)
data = xr.open_dataset(file_path)
The call to pooch.Pooch.fetch will check if the file already exists in the cache folder.
If the file doesn’t exist:
The file is downloaded and saved to the cache folder.
The hash of the downloaded file is compared against the one stored in the registry to make sure the file isn’t corrupted.
The function returns the absolute path to the file on your computer.
If the file exists:
Pooch checks if its hash matches the one in the registry.
If the hash matches, no download happens and the file path is returned.
If the hash doesn’t match, the file is downloaded once more to get an updated version on your computer.
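The decision logic described above can be sketched in plain Python. This illustrates the behaviour only and is not Pooch's actual implementation; fetch_sketch, sha256_of, and the download callable are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA256 hex digest of a file (read fully into memory)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def fetch_sketch(fname, known_hash, cache_dir, download):
    """Return the absolute path to fname, downloading only when needed.

    download is a hypothetical callable taking (fname, destination) that
    writes the remote file to destination.
    """
    target = Path(cache_dir) / fname
    if not target.exists() or sha256_of(target) != known_hash:
        # The file is missing or its hash differs from the registry: download it
        target.parent.mkdir(parents=True, exist_ok=True)
        download(fname, target)
        # Verify the fresh download to make sure it isn't corrupted
        if sha256_of(target) != known_hash:
            raise ValueError(f"Hash of downloaded file '{fname}' does not match.")
    # Otherwise the cached copy is up to date and nothing is downloaded
    return str(target.resolve())
```

Calling this function twice with the same arguments triggers at most one download, which is the property that makes it safe to call fetch at the top of every script or notebook.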
Why use this method?¶
With pooch.Pooch, you can centralize the information about the URLs, hashes, and files in a single place.
Once the instance is created, it can be used to fetch individual files without
repeating the URL and hash everywhere.
A good way to use this is to place the call to pooch.create in a Python module (a .py file).
Then you can import the module in .py scripts or Jupyter notebooks and use the instance to fetch your data.
This way, you don’t need to define the URLs or hashes in multiple
scripts/notebooks.
Customizing the download¶
The pooch.Pooch.fetch method supports all of Pooch’s downloaders and processors.
You can use HTTP, FTP, and SFTP (even with authentication), decompress files, unpack archives, show progress bars, and more with a bit of configuration.