Hashes: Calculating and bypassing#

Pooch uses hashes to check if files are up-to-date or possibly corrupted:

  • If a file exists in the local folder, Pooch will check that its hash matches the one in the registry. If it doesn’t, we’ll assume that it needs to be updated.

  • If a file needs to be updated or doesn’t exist, Pooch will download it from the remote source and check the hash. If the hash doesn’t match, an exception is raised to warn of possible file corruption.

  • Cryptographic hashes may be used where users wish to ensure the security of their download.

Calculating hashes#

You can generate hashes for your data files using openssl in the terminal:

$ openssl sha256 data/c137.csv
SHA256(data/c137.csv)= baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d

Or using the pooch.file_hash function (which is a convenient way of calling Python’s hashlib):

import pooch
print(pooch.file_hash("data/c137.csv"))

Specifying the hash algorithm#

By default, Pooch uses SHA256 hashes. Other hash methods that are available in hashlib can also be used:

import pooch
print(pooch.file_hash("data/c137.csv", alg="sha512"))

In this case, you can specify the hash algorithm in the registry by prepending it to the hash, for example "md5:0hljc7298ndo2" or "sha512:803o3uh2pecb2p3829d1bwouh9d". Pooch will understand this and use the appropriate method.

Bypassing the hash check#

Sometimes we might not know the hash of the file or it could change on the server periodically. To bypass the check, we can set the hash value to None when specifying the registry argument for pooch.create (or the known_hash in pooch.retrieve).

In this example, we want to use Pooch to download a list of weather stations around Australia:

  • The file with the stations is in an FTP server and we want to store it locally in separate folders for each day that the code is run.

  • The problem is that the stations.zip file is updated on the server instead of creating a new one, so the hash check would fail.

This is how you can solve this problem:

import datetime
import pooch

# Get the current data to store the files in separate folders
CURRENT_DATE = datetime.datetime.now().date()

GOODBOY = pooch.create(
    path=pooch.os_cache("bom_daily_stations") / CURRENT_DATE,
    base_url="ftp://ftp.bom.gov.au/anon2/home/ncc/metadata/sitelists/",
    registry={
        "stations.zip": None,
    },
)

When running this same code again at a different date, the file will be downloaded again because the local cache folder changed and the file is no longer present in it. If you omit CURRENT_DATE from the cache path, then Pooch will only fetch the files once, unless they are deleted from the cache.

Attention

If this script is run over a period of time, your cache directory will increase in size, as the files are stored in daily subdirectories.

Other supported hashes#

Beyond hashing algorithms supported by hashlib, Pooch supports algorithms provided by the xxhash package. If the xxhash package is available, users may specify to use one of the algorithms provided by the package.

$ xxh128sum data/store.zip
6a71973c93eac6c8839ce751ce10ae48  data/store.zip
$ # ^^^^^^^^^^^^^^^^^^^ The hash  ^^^^^^^^^^^^^^ The filename
import datetime
import pooch

# Get the current data to store the files in separate folders
CURRENT_DATE = datetime.datetime.now().date()

GOODBOY = pooch.create(
    [...],
    registry={
        "store.zip": "xxh128:6a71973c93eac6c8839ce751ce10ae48",
    },
)