
ITP: kerchunk -- Cloud-friendly access to archival data



Package: wnpp
Severity: wishlist
X-Debbugs-Cc: debian-devel@lists.debian.org, debian-python@lists.debian.org
Owner: Antonio Valentino <antonio.valentino@tiscali.it>

* Package name    : kerchunk
  Version         : 0.2.9
  Upstream Author : Martin Durant <martin.durant@alumni.utoronto.ca>
* URL             : https://github.com/fsspec/kerchunk
* License         : Expat
  Programming Lang: Python
  Description     : Cloud-friendly access to archival data

Binary package names: python3-kerchunk

 Kerchunk is a library that provides a unified way to represent a
 variety of chunked, compressed data formats (e.g. NetCDF, HDF5,
 GRIB), allowing efficient access to the data from traditional file
 systems or cloud object storage.  It also provides a flexible way to
 create virtual datasets from multiple files.  It does this by
 extracting the byte ranges, compression settings and other metadata
 of the data and storing this information in a new, separate object.
 This means that you can create a virtual aggregate dataset over
 potentially many source files, for efficient, parallel and
 cloud-friendly *in-situ* access without having to copy or translate
 the originals (see the usage sketch after the feature list below).
 It is a gateway to massive in-the-cloud data processing even while
 data providers continue to rely on legacy formats for archival
 storage.
 .
 Features:
 .
  * completely serverless architecture
  * metadata consolidation, so you can understand a many-file dataset
    (metadata plus physical storage) in a single read
  * read from all of the storage backends supported by fsspec,
    including object storage (s3, gcs, abfs, alibaba), http, cloud user
    storage (dropbox, gdrive) and network protocols (ftp, ssh, hdfs,
    smb...)
  * loading of various file types (currently netcdf4/HDF, grib2, tiff,
    fits, zarr), potentially heterogeneous within a single dataset,
    without needing to go through the format-specific driver (e.g., no
    need for h5py)
  * asynchronous concurrent fetch of many data chunks in one go,
    amortizing the cost of latency
  * parallel access with a library like zarr without any locks
  * logical datasets spanning very many (potentially millions of) data
    files, with direct access and subselection via coordinate indexing
    across an arbitrary number of dimensions
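
For illustration, a minimal sketch of the "extract once, store
references" step (the file name "example.nc" and the output path are
placeholders, not part of the package):

    import json

    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    # Scan the HDF5/NetCDF4 file and collect byte ranges, compression
    # settings and metadata into an in-memory reference set.
    with fsspec.open("example.nc", "rb") as f:
        refs = SingleHdf5ToZarr(f, "example.nc").translate()

    # Persist the references as a small JSON sidecar; the original
    # file is neither copied nor modified.
    with open("example.json", "w") as out:
        json.dump(refs, out)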

The package is a dependency of the EOPF framework and will be
maintained within the Debian GIS team.
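
As a sketch of the reading side, per-file reference sets can be
combined into the virtual aggregate described above and opened through
fsspec's "reference" filesystem (the file names, dimension names and
the anonymous-S3 options below are hypothetical):

    import fsspec
    import xarray as xr
    from kerchunk.combine import MultiZarrToZarr

    # Merge per-file reference sets into one virtual dataset,
    # concatenating along a (hypothetical) "time" dimension.
    combined = MultiZarrToZarr(
        ["file1.json", "file2.json"],
        concat_dims=["time"],
        identical_dims=["lat", "lon"],
        remote_protocol="s3",
        remote_options={"anon": True},
    ).translate()

    # Chunks are fetched lazily and concurrently from the original
    # files, here assumed to live on anonymous S3.
    fs = fsspec.filesystem(
        "reference",
        fo=combined,
        remote_protocol="s3",
        remote_options={"anon": True},
    )
    ds = xr.open_dataset(
        fs.get_mapper(""),
        engine="zarr",
        backend_kwargs={"consolidated": False},
    )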

kind regards
--
Antonio Valentino

