Advanced tricks

These are more advanced techniques for specific use cases. Most projects will not need them.

Adjusting the logging level

Pooch will log events like downloading a new file, updating an existing one, or unpacking an archive by printing to the terminal. You can change how verbose these events are by getting the event logger from pooch and changing the logging level:

import pooch

logger = pooch.get_logger()
logger.setLevel("WARNING")

Most of the events from Pooch are logged at the info level; this code says that you only care about warnings or errors, such as a failure to create the data cache. The event logger is a logging.Logger object, so you can use that class’s methods to handle logging events in more sophisticated ways if you wish.
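
As one illustration of using the logging.Logger methods, the sketch below attaches an extra handler that keeps only warnings and errors in an in-memory buffer. The helper name attach_buffer_handler and the message format are hypothetical, and a stand-in logger is used so the snippet runs without Pooch installed; in real use you would pass the object returned by pooch.get_logger().

```python
import io
import logging


def attach_buffer_handler(logger, level=logging.WARNING):
    """Attach a handler that collects formatted messages at or above `level`."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setLevel(level)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return buffer


# Stand-in logger (an assumption for this sketch); in real use, pass the
# object returned by pooch.get_logger() instead.
logger = logging.Logger("demo")
captured = attach_buffer_handler(logger)

logger.info("Downloading file")          # below WARNING, so not captured
logger.warning("Cannot write to cache")  # captured in the buffer
```

The same pattern works with logging.FileHandler if you would rather keep the messages in a log file.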

Retry failed downloads

When downloading data repeatedly, like in continuous integration, failures can occur due to sporadic network outages or other factors outside of our control. In these cases, it can be frustrating to have entire jobs fail because a single download was not successful.

Pooch lets you set the number of times to retry a failed download through the retry_if_failed argument of pooch.create. The setting applies to every download attempted with pooch.Pooch.fetch. A download is retried when the file hash doesn’t match the known hash (because of a partial download, for example) or when the requests library raises a network error. Other errors (file system permission errors, etc.) will still result in a failed download.

Note

Requires Pooch >= 1.3.0.

Bypassing the hash check

Sometimes we might not know the hash of the file or it could change on the server periodically. In these cases, we need a way of bypassing the hash check. One way of doing that is with Python’s unittest.mock module. It defines the object unittest.mock.ANY which passes all equality tests made against it. To bypass the check, we can set the hash value to unittest.mock.ANY when specifying the registry argument for pooch.create.
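
The behavior of unittest.mock.ANY can be checked with the standard library alone. The short sketch below uses made-up strings in place of real file hashes.

```python
import unittest.mock

# Stand-in for the hash stored in the registry
known_hash = unittest.mock.ANY

# Made-up values standing in for hashes computed from downloaded files
first_match = "hash-of-yesterdays-file" == known_hash
second_match = known_hash == "hash-of-todays-file"

print(first_match, second_match)
```

Both comparisons succeed regardless of the string, which is why a hash check against ANY never fails.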

In this example, we want to use Pooch to download a list of weather stations around Australia. The file with the stations is on an FTP server and we want to store it locally in a separate folder for each day that the code is run. The problem is that the stations.zip file is updated in place on the server rather than published as a new file, so the hash check would fail. This is how you can solve this problem:

import datetime
import unittest.mock
import pooch

# Get the current date to store the files in separate folders
CURRENT_DATE = datetime.datetime.now().date()

GOODBOY = pooch.create(
    # Convert the date to a string (e.g. "2024-01-31") to build the folder name
    path=pooch.os_cache("bom_daily_stations") / str(CURRENT_DATE),
    base_url="ftp://ftp.bom.gov.au/anon2/home/ncc/metadata/sitelists/",
    # Use ANY for the hash value to ignore the checks
    registry={
        "stations.zip": unittest.mock.ANY,
    },
)

Because the hash check always passes, Pooch will only download the file once for a given cache folder. When the code runs again on a different date, the file will be downloaded again because the local cache folder has changed and the file is not yet present in it. If you omit CURRENT_DATE from the cache path, then Pooch will only fetch the file once, unless it is deleted from the cache.

Note

If this script is run over a period of time, your cache directory will increase in size, as the files are stored in daily subdirectories.

Create registry file from remote files

If you want to create a registry file for a large number of data files that are available for download but you don’t have their hashes or any local copies, you must download them first. Manually downloading each file can be tedious. However, we can automate the process using pooch.retrieve. Below, we’ll explore two different scenarios.

If the data files share the same base url, we can use pooch.retrieve to download them and then use pooch.make_registry to create the registry:

import os

import pooch

# Names of the data files
filenames = ["c137.csv", "cronen.csv", "citadel.csv"]

# Base URL from which the data files can be downloaded
base_url = "https://www.some-data-hosting-site.com/files/"

# Create a new directory where all files will be downloaded
directory = "data_files"
os.makedirs(directory, exist_ok=True)

# Download each data file to data_files
for fname in filenames:
    path = pooch.retrieve(
        url=base_url + fname, known_hash=None, fname=fname, path=directory
    )

# Create the registry file from the downloaded data files
pooch.make_registry(directory, "registry.txt")

If each data file has its own url, the registry file can be manually created after downloading each data file through pooch.retrieve:

import os

import pooch

# Names and urls of the data files. The file names are used for naming the
# downloaded files. These are the names that will be included in the registry.
fnames_and_urls = {
    "c137.csv": "https://www.some-data-hosting-site.com/c137/data.csv",
    "cronen.csv": "https://www.some-data-hosting-site.com/cronen/data.csv",
    "citadel.csv": "https://www.some-data-hosting-site.com/citadel/data.csv",
}

# Create a new directory where all files will be downloaded
directory = "data_files"
os.makedirs(directory, exist_ok=True)

# Create a new registry file
with open("registry.txt", "w") as registry:
    for fname, url in fnames_and_urls.items():
        # Download each data file to the specified directory
        path = pooch.retrieve(
            url=url, known_hash=None, fname=fname, path=directory
        )
        # Add the name, hash, and url of the file to the new registry file
        registry.write(
            f"{fname} {pooch.file_hash(path)} {url}\n"
        )

Warning

Notice that there are no checks for download integrity (since we don’t know the file hashes beforehand). Only do this for trusted data sources and over a secure connection. If you have access to file hashes/checksums, we highly recommend using them to set the known_hash argument.