ITP: kerchunk -- Cloud-friendly access to archival data
Package: wnpp
Severity: wishlist
X-Debbugs-Cc: debian-devel@lists.debian.org, debian-python@lists.debian.org
Owner: Antonio Valentino <antonio.valentino@tiscali.it>
* Package name : kerchunk
Version : 0.2.9
Upstream Author : Martin Durant <martin.durant@alumni.utoronto.ca>
* URL : https://github.com/fsspec/kerchunk
* License : Expat
Programming Lang: Python
Description : Cloud-friendly access to archival data
Binary package names: python3-kerchunk
Kerchunk is a library that provides a unified way to represent a
variety of chunked, compressed data formats (e.g. NetCDF, HDF5, GRIB),
allowing efficient access to the data from traditional file systems or
cloud object storage. It also provides a flexible way to create
virtual datasets from multiple files. It does this by extracting the
byte ranges, compression information and other information about the
data and storing this metadata in a new, separate object.
This means that you can create a virtual aggregate dataset over
potentially many source files, for efficient, parallel and
cloud-friendly *in-situ* access without having to copy or translate
the originals. It is a gateway to in-the-cloud massive data processing
while the data providers still insist on using legacy formats for
archival storage.
.
Features:
.
* completely serverless architecture
* metadata consolidation, so you can understand a many-file dataset
(metadata plus physical storage) in a single read
* read from all of the storage backends supported by fsspec,
including object storage (s3, gcs, abfs, alibaba), http, cloud user
storage (dropbox, gdrive) and network protocols (ftp, ssh, hdfs,
smb...)
* loading of various file types (currently netcdf4/HDF, grib2, tiff,
fits, zarr), potentially heterogeneous within a single dataset,
without a need to go via the specific driver (e.g., no need for
h5py)
* asynchronous concurrent fetch of many data chunks in one go,
amortizing the cost of latency
* parallel access with a library like zarr without any locks
* logical datasets viewing many (>~millions) data files, and direct
access/subselection to them via coordinate indexing across an
arbitrary number of dimensions
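The reference idea described above (byte ranges recorded in a separate
metadata object, with the originals left untouched) can be illustrated
with a minimal, standard-library-only sketch. This is not kerchunk's
API: the helper names build_refs and read_chunk, the chunk keys and the
toy file contents are all hypothetical, and the JSON layout is only
loosely modelled on a key -> [path, offset, length] mapping.

```python
import json
import os
import tempfile


def build_refs(path, chunks):
    """Map each chunk key to [path, offset, length]; no data is copied."""
    # Hypothetical layout, used here only for illustration.
    return {
        "version": 1,
        "refs": {key: [path, off, size] for key, (off, size) in chunks.items()},
    }


def read_chunk(refs, key):
    """Fetch one chunk by seeking to its byte range in the original file."""
    path, offset, length = refs["refs"][key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for an archival file: a 4-byte header followed by two 4-byte chunks.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"HEADaaaabbbb")
    source_path = tmp.name

refs = build_refs(source_path, {"var/0": (4, 4), "var/1": (8, 4)})

# The separate metadata object: it holds byte ranges only, never the data.
refs_json = json.dumps(refs)

# A reader needs just the metadata plus range access to the original file.
loaded = json.loads(refs_json)
first = read_chunk(loaded, "var/0")
second = read_chunk(loaded, "var/1")

os.remove(source_path)
```

In this sketch the reader never parses the source format itself; it
only needs the metadata object and the ability to read a byte range,
which is why the same approach can work over object storage or HTTP.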
The package is a dependency of the EOPF framework. It will be maintained
in Debian GIS.
Kind regards
--
Antonio Valentino