Retrieving a single data file#
Basic usage#
If you only want to download one or two data files, use the pooch.retrieve function:
import pooch

file_path = pooch.retrieve(
    # URL to one of Pooch's test files
    url="https://github.com/fatiando/pooch/raw/v1.0.0/data/tiny-data.txt",
    known_hash="md5:70e2afd3fd7e336ae478b1e740a5f08e",
)
The code above will:
Check if the file from this URL already exists in Pooch’s default cache folder (see pooch.os_cache).
If it doesn’t, the file is downloaded and saved to the cache folder.
The MD5 hash is compared against the known_hash to make sure the file isn’t corrupted.
The function returns the absolute path to the file on your computer.
If the file already existed on your machine, Pooch will check if its MD5 hash matches the known_hash:
If it does, no download happens and the file path is returned.
If it doesn’t, the file is downloaded once more to get an updated version on your computer.
Since the download happens only once, you can place this function call at the start of your script or Jupyter notebook without having to worry about repeat downloads. Anyone getting a copy of your code should also get the correct data file the first time they run it.
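The check-then-download behaviour described above can be sketched with the standard library alone. This is a simplified illustration, not Pooch's actual implementation: retrieve_sketch, file_matches, and the download callable are hypothetical names, and naming the cached file after the last URL component is an assumption made for brevity.

```python
import hashlib
from pathlib import Path


def file_matches(path, known_hash):
    """Return True if the file's digest equals an 'alg:hexdigest' string."""
    alg, _, expected = known_hash.partition(":")
    digest = hashlib.new(alg, Path(path).read_bytes()).hexdigest()
    return digest == expected.lower()


def retrieve_sketch(url, known_hash, cache_dir, download):
    """Return the cached path for `url`, downloading only when needed.

    `download(url, target)` is a hypothetical callable that writes the
    remote content to `target`; it stands in for a real downloader.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: name the cached file after the last URL component.
    target = cache_dir / url.rsplit("/", 1)[-1]

    # Download if the file is missing or its hash no longer matches.
    if not target.exists() or not file_matches(target, known_hash):
        download(url, target)
        if not file_matches(target, known_hash):
            raise ValueError(f"Hash of {target} does not match {known_hash}")
    return str(target.resolve())
```

Calling retrieve_sketch twice with the same arguments triggers the download callable only once. The second call finds a cached file whose hash matches and returns its path directly.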
See also
Pooch can handle multiple download protocols like HTTP, FTP, SFTP, and even download from repositories like figshare and Zenodo by using the DOI instead of a URL. See Download protocols.
See also
You can use different hashes by specifying different algorithm names: sha256:XXXXXX, sha1:XXXXXX, etc. See Hashes: Calculating and bypassing.
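As an illustration of the algorithm:hexdigest format, the sketch below builds such a string with Python's hashlib. The hash_string helper is a hypothetical name used only for this example. Pooch also ships pooch.file_hash for computing the digest of a file.

```python
import hashlib


def hash_string(path, alg="sha256", chunk_size=65536):
    """Return 'alg:hexdigest' for the file at `path`, reading it in chunks."""
    hasher = hashlib.new(alg)
    with open(path, "rb") as fin:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            hasher.update(chunk)
    return f"{alg}:{hasher.hexdigest()}"
```

The returned string can be passed as known_hash, for example hash_string("tiny-data.txt", alg="sha1") for a SHA1 hash of a local copy of the file.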
Unknown file hash#
If you don’t know the hash of the file, you can set known_hash=None to bypass the check. pooch.retrieve will then print a log message with the SHA256 hash of the downloaded file. It’s highly recommended that you copy and paste this hash into your code and use it as the known_hash.
Tip
Setting the known_hash guarantees that the next time your code is run (by you or someone else) the exact same file is downloaded. This helps make the results of your code reproducible.
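One way to turn this advice into a routine is sketched below: download once without a hash, then compute the string to pin in your code. The suggest_known_hash helper is a hypothetical name. It relies on pooch.retrieve and pooch.file_hash, and needs network access when called.

```python
def suggest_known_hash(url):
    """Download `url` unchecked and return a 'sha256:...' string to pin."""
    import pooch  # imported here so defining the helper has no side effects

    # known_hash=None skips the integrity check for this first download.
    path = pooch.retrieve(url=url, known_hash=None)
    # Compute the SHA256 of the cached file so it can be copied into the code.
    return "sha256:" + pooch.file_hash(path, alg="sha256")
```

Paste the returned string into later pooch.retrieve calls as the known_hash so that subsequent runs verify the file.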
Customizing the download#
The pooch.retrieve function supports all of Pooch’s
downloaders and processors.
You can use HTTP, FTP, and SFTP
(even with authentication),
decompress files,
unpack archives,
show progress bars, and more with a bit of configuration.
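As a hedged sketch of such a configuration, the helper below passes a downloader and a processor to pooch.retrieve. The retrieve_compressed name is hypothetical, and the URL and hash are supplied by the caller. The sketch assumes the file is compressed in a format that pooch.Decompress can detect from the file extension (such as .gz), and that tqdm is installed for the progress bar.

```python
def retrieve_compressed(url, known_hash):
    """Download a compressed file and return the path to the decompressed copy."""
    import pooch  # imported here so defining the helper has no side effects

    return pooch.retrieve(
        url=url,
        known_hash=known_hash,
        # Show a progress bar during the HTTP download (requires tqdm).
        downloader=pooch.HTTPDownloader(progressbar=True),
        # Decompress after download; retrieve then returns the new file's path.
        processor=pooch.Decompress(),
    )
```

The known_hash refers to the downloaded (compressed) file, not to the decompressed output.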
When not to use retrieve#
If you need to manage the download and caching of several files from one or
more sources, then you should start using the full capabilities of the
pooch.Pooch class.
It can handle sandboxing data for different package versions, allow users to
set the download locations, and more.
The classic example is a Python package that contains several sample datasets for use in testing and documentation.
See Fetching files from a registry and Manage a package’s sample data to get started.
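For a rough idea of what that looks like, here is a minimal sketch using pooch.create with the same test file as the example at the top of this page. The cache folder name "mypackage", the environment variable MYPACKAGE_DATA_DIR, and the make_sample_data_manager helper are placeholders chosen for illustration.

```python
def make_sample_data_manager():
    """Build a pooch.Pooch instance for a hypothetical package's sample data."""
    import pooch  # imported here so defining the helper has no side effects

    return pooch.create(
        # Placeholder cache folder name for the hypothetical package.
        path=pooch.os_cache("mypackage"),
        # {version} is filled in from the version argument below.
        base_url="https://github.com/fatiando/pooch/raw/{version}/data/",
        version="v1.0.0",
        version_dev="master",
        # Registry of file names and their known hashes.
        registry={"tiny-data.txt": "md5:70e2afd3fd7e336ae478b1e740a5f08e"},
        # Placeholder environment variable that lets users override the path.
        env="MYPACKAGE_DATA_DIR",
    )
```

Calling make_sample_data_manager().fetch("tiny-data.txt") would then download the file if needed and return its local path, much like pooch.retrieve does for a single URL.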