pooch.Pooch

class pooch.Pooch(path, base_url, registry=None, urls=None)[source]

Manager for a local data storage that can fetch from a remote source.

Avoid creating Pooch instances directly. Use pooch.create instead.

Parameters
  • path (str) – The path to the local data storage folder. The path must exist in the file system.

  • base_url (str) – Base URL for the remote data source. All requests will be made relative to this URL.

  • registry (dict or None) – A record of the files that are managed by this good boy. Keys should be the file names and the values should be their hashes. Only files in the registry can be fetched from the local storage. Files in subdirectories of path must use Unix-style separators ('/') even on Windows.

  • urls (dict or None) – Custom URLs for downloading individual files in the registry. A dictionary with the file names as keys and the custom URLs as values. Not all files in registry need an entry in urls. If a file has an entry in urls, the base_url will be ignored when downloading it in favor of urls[fname].
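
The interaction between base_url and urls can be sketched in a few lines. This is only an illustration of the rule described above, not Pooch's actual implementation: the helper name resolve_url, the example URLs, the placeholder hashes, and the choice to raise ValueError for files outside the registry are all assumptions made for the example.

```python
def resolve_url(fname, base_url, registry, urls=None):
    """Return the download URL for ``fname`` (hypothetical helper)."""
    if fname not in registry:
        # Assumption: files that are not in the registry are rejected.
        raise ValueError(f"File '{fname}' is not in the registry.")
    urls = urls or {}
    # A custom URL, when present, replaces base_url + fname entirely.
    return urls.get(fname, base_url + fname)


# Hypothetical registry: the hashes are placeholders, not real digests.
registry = {
    "data.csv": "placeholder-hash-1",
    "grids/topo.nc": "placeholder-hash-2",  # Unix-style separator
}
urls = {"grids/topo.nc": "https://mirror.example.org/topo.nc"}
base_url = "https://example.org/data/"

print(resolve_url("data.csv", base_url, registry, urls))
print(resolve_url("grids/topo.nc", base_url, registry, urls))
```

Only "grids/topo.nc" has an entry in urls, so it is the only file that bypasses base_url.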

Methods Summary

Pooch.fetch(fname[, processor, downloader])

Get the absolute path to a file in the local storage.

Pooch.get_url(fname)

Get the full URL to download a file in the registry.

Pooch.is_available(fname)

Check availability of a remote file without downloading it.

Pooch.load_registry(fname)

Load entries from a file and add them to the registry.


Pooch.fetch(fname, processor=None, downloader=None)[source]

Get the absolute path to a file in the local storage.

If the file is not in the local storage, it will be downloaded. If the hash of the file in local storage doesn’t match the one in the registry, a new copy of the file will be downloaded, because a mismatch is taken as a sign that the file was updated in the remote storage. If the hash of the downloaded file still doesn’t match the one in the registry, an exception is raised to warn of possible file corruption.
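
The decision described above can be sketched with the standard library. This is a rough illustration rather than Pooch's own code: the helper names sha256_of and decide_action are hypothetical, SHA256 is assumed as the hash algorithm, and the returned strings mirror the action values passed to processor functions.

```python
import hashlib
import os


def sha256_of(path):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def decide_action(path, known_hash):
    """Return "download", "update", or "fetch" for a local file (hypothetical)."""
    if not os.path.exists(path):
        # Nothing in local storage yet: the file has to be downloaded.
        return "download"
    if sha256_of(path) != known_hash:
        # Local copy differs from the registry: download a new copy.
        return "update"
    # Local copy matches the registry: no download is needed.
    return "fetch"
```

After a "download" or "update", the same hash comparison would be repeated on the new file, and a second mismatch is the case that raises an exception.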

Post-processing actions sometimes need to be taken on downloaded files (unzipping, conversion to a more efficient format, etc). If these actions are time- or memory-consuming, it is best to perform them only once, when the file is actually downloaded. Use the processor argument to specify a function that is executed after the download (if required) to perform these actions. See below.

Custom file downloaders can be provided through the downloader argument. By default, files are downloaded over HTTP. If the server for a given file requires authentication (username and password) or if the file is served over FTP, use custom downloaders that support these features. Downloaders can also be used to print custom messages (like a progress bar), etc. See below for details.

Parameters
  • fname (str) – The file name (relative to the base_url of the remote data storage) to fetch from the local storage.

  • processor (None or callable) – If not None, then a function (or callable object) that will be called before returning the full path and after the file has been downloaded (if required). See below for details.

  • downloader (None or callable) – If not None, then a function (or callable object) that will be called to download a given URL to a provided local file name. By default, downloads are done through HTTP without authentication using pooch.HTTPDownloader. See below for details.

Returns

full_path (str) – The absolute path (including the file name) of the file in the local storage.

Notes

Processor functions should have the following format:

def myprocessor(fname, action, pooch):
    '''
    Processes the downloaded file and returns a new file name.

    The function **must** take as arguments (in order):

    fname : str
        The full path of the file in the local data storage
    action : str
        Either: "download" (file doesn't exist and will be
        downloaded), "update" (file is outdated and will be
        downloaded), or "fetch" (file exists and is up to date so no
        download is necessary).
    pooch : pooch.Pooch
        The instance of the Pooch class that is calling this
        function.

    The return value can be anything but is usually a full path to
    a file (or list of files). This is what will be returned by
    *fetch* in place of the original file path.
    '''
    ...
    return full_path
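
As an illustration of this format, here is a minimal sketch of a processor that unpacks a zip archive only when needed. The name unzip_once and the ".unzipped" folder suffix are assumptions made for the example; it uses only the standard library and ignores the pooch argument.

```python
import os
import zipfile


def unzip_once(fname, action, pooch):
    """Extract a zip archive next to itself and return the extracted paths."""
    extract_dir = fname + ".unzipped"
    # Unpack only when the archive was just downloaded or updated, or when
    # the extraction folder is missing (for example, deleted by hand).
    if action in ("download", "update") or not os.path.isdir(extract_dir):
        with zipfile.ZipFile(fname) as archive:
            archive.extractall(extract_dir)
    # Return every extracted file as an absolute path, in a stable order.
    paths = []
    for root, _dirs, files in os.walk(extract_dir):
        for name in files:
            paths.append(os.path.abspath(os.path.join(root, name)))
    return sorted(paths)
```

Passing processor=unzip_once to fetch would then make fetch return the list of extracted files instead of the path to the archive.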

Downloader functions should have the following format:

def mydownloader(url, output_file, pooch):
    '''
    Download a file from the given URL to the given local file.

    The function **must** take as arguments (in order):

    url : str
        The URL to the file you want to download.
    output_file : str or file-like object
        Path (and file name) to which the file will be downloaded.
    pooch : pooch.Pooch
        The instance of the Pooch class that is calling this
        function.

    No return value is required.
    '''
    ...
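
A minimal sketch of a downloader in this format, using only the standard library, could look like the following. The name urllib_downloader is hypothetical, and the sketch assumes that a file-like output_file is opened in binary mode; it does no authentication, retries, or progress reporting.

```python
import shutil
import urllib.request


def urllib_downloader(url, output_file, pooch):
    """Download ``url`` to ``output_file`` using only the standard library."""
    with urllib.request.urlopen(url) as response:
        if hasattr(output_file, "write"):
            # Already an open file-like object: stream the response into it.
            shutil.copyfileobj(response, output_file)
        else:
            # Otherwise treat it as a path and open it for binary writing.
            with open(output_file, "wb") as stream:
                shutil.copyfileobj(response, stream)
```

The pooch argument is unused here, but it must still be accepted so that the function matches the expected signature.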

Authentication through HTTP can be handled by pooch.HTTPDownloader:

from pooch import HTTPDownloader

authdownload = HTTPDownloader(auth=(username, password))
mypooch.fetch("some-data-file.txt", downloader=authdownload)

pooch.HTTPDownloader can print a progress bar for the download when passed the argument progressbar=True:

progress_download = HTTPDownloader(progressbar=True)
mypooch.fetch("some-data-file.txt", downloader=progress_download)
# Will print a progress bar to standard error like:
# 100%|█████████████████████████████████████████| 336/336 [...]

Pooch.get_url(fname)[source]

Get the full URL to download a file in the registry.

Parameters

fname (str) – The file name (relative to the base_url of the remote data storage) whose download URL is requested.

Pooch.is_available(fname)[source]

Check availability of a remote file without downloading it.

Use this method when working with large files to check if they are available for download.

Parameters

fname (str) – The file name (relative to the base_url of the remote data storage) to check for availability.

Returns

status (bool) – True if the file is available for download. False otherwise.

Pooch.load_registry(fname)[source]

Load entries from a file and add them to the registry.

Use this if you are managing many files.

Each line of the file should contain a file name and its hash, separated by a space. The hash can specify the checksum algorithm using the “alg:hash” format. If no algorithm is provided, SHA256 is used by default. Only one file per line is allowed. A custom download URL for an individual file can be specified as a third element on its line.

Parameters

fname (str | fileobj) – Path (or open file object) to the registry file.
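
To make the line format concrete, here is a small sketch that splits registry lines into their parts. It illustrates the format described above and is not the parser Pooch uses: the helper name parse_registry_line is hypothetical, the hashes are shortened placeholders, and the sketch assumes file names contain no spaces.

```python
def parse_registry_line(line):
    """Split one registry line into (file name, algorithm, digest, url)."""
    elements = line.split()
    if len(elements) not in (2, 3):
        raise ValueError(f"Expected 2 or 3 elements per line, got: {line!r}")
    fname, file_hash = elements[0], elements[1]
    # The optional third element is a custom download URL.
    url = elements[2] if len(elements) == 3 else None
    if ":" in file_hash:
        # "alg:hash" format: the algorithm is given explicitly.
        algorithm, digest = file_hash.split(":", 1)
    else:
        # No algorithm given: SHA256 is the default.
        algorithm, digest = "sha256", file_hash
    return fname, algorithm, digest, url


# Hypothetical registry content with placeholder hashes.
example_lines = [
    "data.csv 1a2b3c",
    "grids/topo.nc md5:4d5e6f",
    "extra.txt sha256:7a8b9c https://mirror.example.org/extra.txt",
]
for line in example_lines:
    print(parse_registry_line(line))
```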