Hashes: Calculating and bypassing¶
Pooch uses hashes to check if files are up-to-date or possibly corrupted:
If a file exists in the local folder, Pooch will check that its hash matches the one in the registry. If it doesn’t, we’ll assume that it needs to be updated.
If a file needs to be updated or doesn’t exist, Pooch will download it from the remote source and check the hash. If the hash doesn’t match, an exception is raised to warn of possible file corruption.
Cryptographic hashes may be used where users wish to ensure the security of their download.
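The decision described above can be sketched with the standard library alone. This is an illustration, not Pooch's actual code: the function name `decide_action` and its return values are hypothetical, and it assumes a SHA256 hex digest and a file small enough to read into memory.

```python
import hashlib
from pathlib import Path


def decide_action(path, known_hash):
    """Return what should happen to a file given its expected SHA256 hash.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    if not path.exists():
        # Nothing in the local folder yet, so the file must be downloaded.
        return "download"
    local_hash = hashlib.sha256(path.read_bytes()).hexdigest()
    if local_hash != known_hash.lower():
        # The local copy differs from the registry, so assume it is outdated.
        return "update"
    # The local copy matches the registry and can be used as is.
    return "use_cached"
```

After a "download" or "update", the same comparison would be run on the new file, and a mismatch would be reported as an error rather than triggering another download.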
Calculating hashes¶
You can generate hashes for your data files using openssl in the terminal:
$ openssl sha256 data/c137.csv
SHA256(data/c137.csv)= baee0894dba14b12085eacb204284b97e362f4f3e5a5807693cc90ef415c1b2d
Or using the pooch.file_hash function (which is a convenient way of calling Python’s hashlib):
import pooch
print(pooch.file_hash("data/c137.csv"))
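For comparison, a roughly equivalent calculation using hashlib directly might look like the sketch below. The helper name `sha256_of_file` and the chunk size are assumptions made for this example. Reading in chunks keeps memory use low for large files.

```python
import hashlib


def sha256_of_file(fname, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with open(fname, "rb") as fin:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```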
Specifying the hash algorithm¶
By default, Pooch uses SHA256 hashes. Other hash methods that are available in hashlib can also be used:
import pooch
print(pooch.file_hash("data/c137.csv", alg="sha512"))
In this case, you can specify the hash algorithm in the registry by prepending it to the hash, for example "md5:0hljc7298ndo2" or "sha512:803o3uh2pecb2p3829d1bwouh9d". Pooch will understand this and use the appropriate method.
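To make the prefix convention concrete, here is a minimal sketch of how an "algorithm:hash" entry could be split and checked with hashlib. The function names `split_registry_hash` and `matches` are hypothetical, and this is not necessarily how Pooch implements it internally.

```python
import hashlib


def split_registry_hash(entry, default_alg="sha256"):
    """Split 'alg:digest' into (alg, digest); fall back to the default algorithm."""
    if ":" in entry:
        alg, digest = entry.split(":", 1)
    else:
        alg, digest = default_alg, entry
    return alg.lower(), digest.lower()


def matches(data, entry):
    """Check whether the bytes in `data` hash to the digest given in `entry`."""
    alg, digest = split_registry_hash(entry)
    # hashlib.new() looks the algorithm up by name, e.g. "md5" or "sha512".
    return hashlib.new(alg, data).hexdigest() == digest
```

An entry without a prefix is treated as SHA256 here, which mirrors the default described above.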
Bypassing the hash check¶
Sometimes we might not know the hash of the file or it could change on the server periodically. To bypass the check, we can set the hash value to None when specifying the registry argument for pooch.create (or the known_hash in pooch.retrieve).
In this example, we want to use Pooch to download a list of weather stations around Australia:
The file with the stations is in an FTP server and we want to store it locally in separate folders for each day that the code is run.
The problem is that the stations.zip file is updated on the server instead of creating a new one, so the hash check would fail.
This is how you can solve this problem:
import datetime
import pooch
# Get the current date to store the files in separate folders
CURRENT_DATE = datetime.datetime.now().date()
GOODBOY = pooch.create(
    # Convert the date to a string so it can be joined to the cache path
    path=pooch.os_cache("bom_daily_stations") / str(CURRENT_DATE),
    base_url="ftp://ftp.bom.gov.au/anon2/home/ncc/metadata/sitelists/",
    registry={
        "stations.zip": None,
    },
)
When running this same code again at a different date, the file will be downloaded again because the local cache folder changed and the file is no longer present in it. If you omit CURRENT_DATE from the cache path, then Pooch will only fetch the files once, unless they are deleted from the cache.
Attention
If this script is run over a period of time, your cache directory will increase in size, as the files are stored in daily subdirectories.
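For a one-off download, the same bypass can be applied through pooch.retrieve by passing known_hash=None. The sketch below is an assumption-laden illustration: the wrapper name `fetch_stations` is hypothetical, the URL is simply the base URL above joined with stations.zip, and the file goes to Pooch's default cache folder because no path is given.

```python
def fetch_stations():
    """Download stations.zip without verifying its hash (hypothetical wrapper)."""
    # Imported inside the function so the sketch can be loaded without Pooch installed.
    import pooch

    return pooch.retrieve(
        url="ftp://ftp.bom.gov.au/anon2/home/ncc/metadata/sitelists/stations.zip",
        known_hash=None,  # None disables the hash check
    )
```

Calling fetch_stations() needs network access and returns the path of the downloaded file.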
Other supported hashes¶
Beyond the hashing algorithms supported by hashlib, Pooch supports algorithms provided by the xxhash package. If the xxhash package is installed, any of its algorithms can be used in the registry. For example, compute the hash with xxh128sum and prefix it with the algorithm name:
$ xxh128sum data/store.zip
6a71973c93eac6c8839ce751ce10ae48 data/store.zip
$ # ^^^^^^^^^^^^^^^^^^^ The hash ^^^^^^^^^^^^^^ The filename
import datetime
import pooch
# Get the current date to store the files in separate folders
CURRENT_DATE = datetime.datetime.now().date()
GOODBOY = pooch.create(
    [...],
    registry={
        "store.zip": "xxh128:6a71973c93eac6c8839ce751ce10ae48",
    },
)
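As a rough illustration of how a hasher might be chosen by algorithm name, with xxhash as an optional dependency, consider the sketch below. The function `new_hasher` is hypothetical and is not Pooch's API. It assumes a version of xxhash recent enough to expose the requested algorithm, such as xxh128.

```python
import hashlib


def new_hasher(alg):
    """Return a hasher object for `alg`, using xxhash for 'xxh*' names."""
    if alg.startswith("xxh"):
        try:
            import xxhash  # optional third-party package
        except ImportError as error:
            raise ValueError(f"'{alg}' requires the xxhash package") from error
        # xxhash exposes its algorithms as constructors, e.g. xxhash.xxh128().
        return getattr(xxhash, alg)()
    # Everything else is looked up by name in hashlib.
    return hashlib.new(alg)
```

Both kinds of hasher provide update() and hexdigest(), so the code that reads the file does not need to know which package supplied the algorithm.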