Registry files
Usage
If your project has a large number of data files, it can be tedious to list
them in a dictionary. In these cases, it’s better to store the file names and
hashes in a file and use pooch.Pooch.load_registry to read them.
import pkg_resources
import pooch

# `version` is assumed to be defined elsewhere (e.g. the package's version string)
POOCH = pooch.create(
    path=pooch.os_cache("plumbus"),
    base_url="https://github.com/rick/plumbus/raw/{version}/data/",
    version=version,
    version_dev="main",
    # We'll load it from a file later
    registry=None,
)
# Get registry file from package_data
registry_file = pkg_resources.resource_stream("plumbus", "registry.txt")
# Load this registry file
POOCH.load_registry(registry_file)
In this case, the registry.txt file is in the plumbus/ package directory and
should be shipped with the package (see below for instructions). We use
pkg_resources to access registry.txt, giving it the name of our Python package.
Registry file format
Registry files are light-weight text files that specify a file’s name and hash.
In our example, the contents of registry.txt are:
c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc
cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w
A specific hashing algorithm can be enforced by prefixing a file’s checksum
with alg: (the name of the algorithm followed by a colon):
c137.csv sha1:e32b18dab23935bc091c353b308f724f18edcb5e
cronen.csv md5:b53c08d3570b82665784cedde591a8b0
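To make the alg: prefix concrete, here is a minimal sketch that builds such an entry using only the standard library's hashlib. The helper name registry_entry is hypothetical and not part of Pooch. Pooch's own pooch.file_hash function also accepts an alg argument and can be used to compute the hash instead.

```python
import hashlib
import tempfile
from pathlib import Path


def registry_entry(path, alg="sha256"):
    """Return a registry line of the form 'name alg:hexdigest' for a file.

    Hypothetical helper for illustration; not part of Pooch.
    """
    hasher = hashlib.new(alg)
    # Read in chunks so large data files don't have to fit in memory
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(65536), b""):
            hasher.update(chunk)
    return f"{Path(path).name} {alg}:{hasher.hexdigest()}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Made-up sample file, only used to demonstrate the output format
        sample = Path(tmp) / "c137.csv"
        sample.write_text("dimension,rick\nC-137,1\n")
        print(registry_entry(sample, alg="sha1"))
        print(registry_entry(sample, alg="md5"))
```

Each printed line can be pasted into registry.txt as is. The algorithm name must be one that hashlib recognizes.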
Since Pooch v1.2.0, the registry file can also contain line comments, which
start with a #:
# C-137 sample data
c137.csv 19uheidhlkjdwhoiwuhc0uhcwljchw9ochwochw89dcgw9dcgwc
# Cronenberg sample data
cronen.csv 1upodh2ioduhw9celdjhlfvhksgdwikdgcowjhcwoduchowjg8w
Attention
Make sure you set the Pooch version in your setup.py to >=1.2.0 when using
comments, as earlier versions cannot handle them:
install_requires = [..., "pooch>=1.2.0", ...]
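The format described in this section can be summarized with a small parser. This is only an illustrative sketch, not Pooch's actual implementation: the function name parse_registry is hypothetical, and it assumes whitespace-separated fields, file names without spaces, and an optional third field holding a per-file URL.

```python
def parse_registry(text):
    """Parse registry text into two dicts: {name: hash} and {name: url}.

    Illustrative sketch only; assumes file names contain no spaces.
    """
    hashes, urls = {}, {}
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        # Skip blank lines and line comments
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) not in (2, 3):
            raise ValueError(f"Invalid entry on line {number}: {line!r}")
        name, checksum = fields[0], fields[1]
        hashes[name] = checksum
        # An optional third field is treated as the download URL
        if len(fields) == 3:
            urls[name] = fields[2]
    return hashes, urls


if __name__ == "__main__":
    sample = (
        "# C-137 sample data\n"
        "c137.csv sha1:e32b18dab23935bc091c353b308f724f18edcb5e\n"
        "# Cronenberg sample data\n"
        "cronen.csv md5:b53c08d3570b82665784cedde591a8b0\n"
    )
    hashes, urls = parse_registry(sample)
    print(hashes)
    print(urls)
```

The checksum is kept as written, including any alg: prefix, so the caller can decide which algorithm to use when verifying a file.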
Packaging registry files
To make sure the registry file is shipped with your package, include the
following in your MANIFEST.in file:
include plumbus/registry.txt
Then add the following entry to the setup function of your setup.py file:
setup(
    ...
    package_data={"plumbus": ["registry.txt"]},
    ...
)
Creating a registry file
If you have many data files, creating the registry and keeping it updated can
be a challenge. The function pooch.make_registry creates a registry file
listing all the contents of a directory. For example, we can generate the
registry file for our fictitious project from the command line:
$ python -c "import pooch; pooch.make_registry('data', 'plumbus/registry.txt')"
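To help keep a registry up to date, a check like the following could run in a test suite or before a release. It is a rough sketch under stated assumptions: the function name outdated_entries is hypothetical, registry entries are assumed to be plain SHA256 hashes without an alg: prefix, and file names are assumed to be paths relative to the data directory written with forward slashes.

```python
import hashlib
from pathlib import Path


def outdated_entries(directory, registry_path):
    """Return names of files that are missing from the registry or whose hash differs.

    Hypothetical helper; assumes unprefixed SHA256 hashes in the registry.
    """
    recorded = {}
    for line in Path(registry_path).read_text().splitlines():
        fields = line.split()
        # Ignore blank lines, comments, and anything without a name and a hash
        if len(fields) >= 2 and not fields[0].startswith("#"):
            recorded[fields[0]] = fields[1]
    root = Path(directory)
    outdated = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            # Reads the whole file at once; fine for small files in a sketch
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if recorded.get(name) != digest:
                outdated.append(name)
    return outdated
```

Calling outdated_entries("data", "plumbus/registry.txt") returns an empty list when every file in data matches its entry. A non-empty result suggests the registry should be regenerated.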
Create registry file from remote files
If you want to create a registry file for a large number of data files that are
available for download, but you don’t have their hashes or any local copies,
you must download them first. Downloading each file manually can be tedious,
but we can automate the process using pooch.retrieve. Below, we’ll explore two
different scenarios.
If the data files share the same base url, we can use pooch.retrieve to
download them and then use pooch.make_registry to create the registry:
import os
import pooch

# Names of the data files
filenames = ["c137.csv", "cronen.csv", "citadel.csv"]
# Base url from which the data files can be downloaded
base_url = "https://www.some-data-hosting-site.com/files/"
# Create a new directory where all files will be downloaded
directory = "data_files"
os.makedirs(directory, exist_ok=True)
# Download each data file to data_files
for fname in filenames:
    path = pooch.retrieve(
        url=base_url + fname, known_hash=None, fname=fname, path=directory
    )
# Create the registry file from the downloaded data files
pooch.make_registry(directory, "registry.txt")
If each data file has its own url, the registry file can be manually created
after downloading each data file through pooch.retrieve:
import os
import pooch

# Names and urls of the data files. The file names are used for naming the
# downloaded files. These are the names that will be included in the registry.
fnames_and_urls = {
    "c137.csv": "https://www.some-data-hosting-site.com/c137/data.csv",
    "cronen.csv": "https://www.some-data-hosting-site.com/cronen/data.csv",
    "citadel.csv": "https://www.some-data-hosting-site.com/citadel/data.csv",
}
# Create a new directory where all files will be downloaded
directory = "data_files"
os.makedirs(directory, exist_ok=True)
# Create a new registry file
with open("registry.txt", "w") as registry:
    for fname, url in fnames_and_urls.items():
        # Download each data file to the specified directory
        path = pooch.retrieve(
            url=url, known_hash=None, fname=fname, path=directory
        )
        # Add the name, hash, and url of the file to the new registry file
        registry.write(
            f"{fname} {pooch.file_hash(path)} {url}\n"
        )
Warning
Notice that there are no checks for download integrity (since we don’t
know the file hashes beforehand). Only do this for trusted data sources
and over a secure connection. If you have access to file hashes/checksums,
we highly recommend using them to set the known_hash argument.