Unpacking archives
Let’s say our data file is actually a zip (or tar) archive with a collection of
files.
We may want to store an unpacked version of the archive or extract just a
single file from it.
We can do both operations with the pooch.Unzip and pooch.Untar processors.
For example, to extract a single file from a zip archive:
import pandas
from pooch import Unzip


def fetch_zipped_file():
    """
    Load a large zipped sample data file as a pandas.DataFrame.
    """
    # Extract the file "actual-data-file.txt" from the archive
    unpack = Unzip(members=["actual-data-file.txt"])
    # Pass in the processor to unzip the data file
    # (GOODBOY is assumed to be a pooch.Pooch instance defined elsewhere)
    fnames = GOODBOY.fetch("zipped-data-file.zip", processor=unpack)
    # Returns the paths of all extracted members (in our case, only one)
    fname = fnames[0]
    # fname is now the path of the unzipped file ("actual-data-file.txt")
    # which can be loaded by pandas directly
    data = pandas.read_csv(fname)
    return data
By default, the Unzip processor (and similarly the Untar processor) will
create a new folder in the same location as the downloaded archive file, and
give it the same name as the archive file with the suffix .unzip (or .untar)
appended.
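The default folder naming described above can be mimicked with the standard library alone. This is only a rough sketch of the behavior, not Pooch's actual implementation: the helper name unpack_like_unzip and the throwaway archive it is tried on are made up for illustration.

```python
import os
import tempfile
import zipfile


def unpack_like_unzip(archive_path, members=None):
    """Extract a zip archive next to itself and return the extracted file paths.

    Hypothetical helper for illustration; it is not part of Pooch.
    """
    # Default folder: the archive's own path with ".unzip" appended
    extract_dir = archive_path + ".unzip"
    os.makedirs(extract_dir, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        # members=None makes zipfile extract every entry in the archive
        archive.extractall(path=extract_dir, members=members)
        names = members if members is not None else archive.namelist()
    # Directory entries in a zip end with "/", so keep only the files
    return [os.path.join(extract_dir, n) for n in names if not n.endswith("/")]


# Build a small throwaway archive to try the helper on
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "zipped-data-file.zip")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("actual-data-file.txt", "a,b\n1,2\n")
    archive.writestr("README.txt", "sample archive\n")

fnames = unpack_like_unzip(archive_path, members=["actual-data-file.txt"])
print(fnames[0])
```

The printed path ends in zipped-data-file.zip.unzip followed by actual-data-file.txt, and README.txt is not extracted because it was not listed in members.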
If you want to change the location of the unpacked files, you can provide a
parameter extract_dir to the processor to tell it where you want to unpack
the files:
from pooch import Untar


def fetch_and_unpack_tar_file():
    """
    Unpack a file from a tar archive to a custom subdirectory in the cache.
    """
    # Extract a single file from the archive, to a specific location
    unpack_to_custom_dir = Untar(
        members=["actual-data-file.txt"],
        extract_dir="custom_folder",
    )
    # Pass in the processor to untar the data file
    fnames = GOODBOY.fetch(
        "tarred-data-file.tar.gz", processor=unpack_to_custom_dir
    )
    # Returns the paths of all extracted members (in our case, only one)
    fname = fnames[0]
    return fname
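How the two cases (default folder versus a custom extract_dir) could map to an actual path is sketched below. The helper resolve_extract_dir is hypothetical, and it assumes that a custom extract_dir is interpreted relative to the folder holding the downloaded archive; check the Pooch API reference for the exact rule.

```python
import os


def resolve_extract_dir(archive_path, extract_dir=None, suffix=".untar"):
    """Return the folder an archive would be unpacked into.

    Hypothetical helper for illustration; it is not part of Pooch.
    """
    if extract_dir is None:
        # Default: archive path with the processor's suffix appended
        return archive_path + suffix
    # Assumption: a custom folder lives alongside the downloaded archive
    return os.path.join(os.path.dirname(archive_path), extract_dir)


archive = os.path.join("cache", "tarred-data-file.tar.gz")
print(resolve_extract_dir(archive))
print(resolve_extract_dir(archive, extract_dir="custom_folder"))
```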
To extract all files into a folder and return the path to each file, omit the
members parameter:
def fetch_zipped_archive():
    """
    Load all files from a zipped archive.
    """
    fnames = GOODBOY.fetch("zipped-archive.zip", processor=Unzip())
    return fnames
Use pooch.Untar to do the same for tar archives (with optional compression).
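To illustrate what "optional compression" means for tar archives, here is a small standard-library sketch (again, not Pooch's own code). It writes a gzip-compressed tar file and reads it back with mode "r:*", which lets tarfile detect the compression on its own. The file names and the .untar folder are chosen only to mirror the examples above.

```python
import io
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "tarred-data-file.tar.gz")

# Write a gzip-compressed tar archive holding one small text file
payload = b"a,b\n1,2\n"
with tarfile.open(archive_path, "w:gz") as archive:
    info = tarfile.TarInfo(name="actual-data-file.txt")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

# Unpack everything into "<archive>.untar"; "r:*" accepts plain tar
# as well as gzip, bz2, or xz compressed archives
extract_dir = archive_path + ".untar"
os.makedirs(extract_dir, exist_ok=True)
with tarfile.open(archive_path, "r:*") as archive:
    names = [member.name for member in archive.getmembers() if member.isfile()]
    archive.extractall(path=extract_dir)

# Paths of all extracted files, analogous to the list returned above
fnames = [os.path.join(extract_dir, name) for name in names]
print(fnames)
```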