
ITP: kerchunk -- Cloud-friendly access to archival data



Package: wnpp
Severity: wishlist
X-Debbugs-Cc: debian-devel@lists.debian.org, debian-python@lists.debian.org
Owner: Antonio Valentino <antonio.valentino@tiscali.it>

* Package name    : kerchunk
  Version         : 0.2.9
  Upstream Author : Martin Durant <martin.durant@alumni.utoronto.ca>
* URL             : https://github.com/fsspec/kerchunk
* License         : Expat
  Programming Lang: Python
  Description     : Cloud-friendly access to archival data

Binary package names: python3-kerchunk

 Kerchunk is a library that provides a unified way to represent a
 variety of chunked, compressed data formats (e.g. NetCDF, HDF5,
 GRIB), allowing efficient access to the data from traditional file
 systems or cloud object storage.  It also provides a flexible way to
 create virtual datasets from multiple files.  It does this by
 extracting the byte ranges, compression settings and other metadata
 of the data and storing this information in a new, separate object.
 This means that you can create a virtual aggregate dataset over
 potentially many source files, for efficient, parallel and
 cloud-friendly *in-situ* access without having to copy or translate
 the originals (see the usage sketch after the feature list below).
 It is a gateway to massive in-the-cloud data processing even while
 data providers continue to rely on legacy formats for archival
 storage.
 .
 Features:
 .
  * completely serverless architecture
  * metadata consolidation, so you can understand a many-file dataset
    (metadata plus physical storage) in a single read
  * read from all of the storage backends supported by fsspec,
    including object storage (s3, gcs, abfs, alibaba), http, cloud user
    storage (dropbox, gdrive) and network protocols (ftp, ssh, hdfs,
    smb...)
  * loading of various file types (currently netcdf4/HDF, grib2, tiff,
    fits, zarr), potentially heterogeneous within a single dataset,
    without needing to go through the format-specific driver (e.g., no
    need for h5py)
  * asynchronous concurrent fetch of many data chunks in one go,
    amortizing the cost of latency
  * parallel access with a library like zarr without any locks
  * logical datasets spanning very many (potentially millions of) data
    files, with direct access and subselection via coordinate indexing
    across an arbitrary number of dimensions
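
For illustration, a minimal sketch of the "extract once, store
references" step (the file name "example.nc" and the output path are
placeholders, not part of the package):

    import json

    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    # Scan the HDF5/NetCDF4 file and collect byte ranges, compression
    # settings and metadata into an in-memory reference set.
    with fsspec.open("example.nc", "rb") as f:
        refs = SingleHdf5ToZarr(f, "example.nc").translate()

    # Persist the references as a small JSON sidecar; the original
    # file is neither copied nor modified.
    with open("example.json", "w") as out:
        json.dump(refs, out)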

The package is a dependency of the EOPF framework and will be
maintained within the Debian GIS team.
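
As a sketch of the reading side, per-file reference sets can be
combined into the virtual aggregate described above and opened through
fsspec's "reference" filesystem (the file names, dimension names and
the anonymous-S3 options below are hypothetical):

    import fsspec
    import xarray as xr
    from kerchunk.combine import MultiZarrToZarr

    # Merge per-file reference sets into one virtual dataset,
    # concatenating along a (hypothetical) "time" dimension.
    combined = MultiZarrToZarr(
        ["file1.json", "file2.json"],
        concat_dims=["time"],
        identical_dims=["lat", "lon"],
        remote_protocol="s3",
        remote_options={"anon": True},
    ).translate()

    # Chunks are fetched lazily and concurrently from the original
    # files, here assumed to live on anonymous S3.
    fs = fsspec.filesystem(
        "reference",
        fo=combined,
        remote_protocol="s3",
        remote_options={"anon": True},
    )
    ds = xr.open_dataset(
        fs.get_mapper(""),
        engine="zarr",
        backend_kwargs={"consolidated": False},
    )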

kind regards
--
Antonio Valentino

