Bug#924040: ITP: archivebox -- open source self-hosted web archive

Package: wnpp
Severity: wishlist
Owner: Antoine Beaupre <anarcat@debian.org>

* Package name    : archivebox
  Version         : 0.2.4
  Upstream Author : Nick Sweeting
* URL             : https://archivebox.io/
* License         : MIT/Expat?
  Programming Lang: Python
  Description     : open source self-hosted web archive

ArchiveBox takes a list of website URLs you want to archive, and
creates a local, static, browsable HTML clone of the content from
those websites (it saves HTML, JS, media files, PDFs, images and
more).

You can use it to preserve access to websites you care about by
storing them locally offline. ArchiveBox works by rendering the pages
in a headless browser, then saving all the requests and fully loaded
pages in multiple redundant common formats (HTML, PDF, PNG, WARC) that
will last long after the original content disappears off the
internet. It also automatically extracts assets like git repositories,
audio, video, subtitles, images, and PDFs into separate files using
youtube-dl, pywb, and wget.

ArchiveBox doesn’t require a constantly running server or backend;
instead, you just run the ./archive command each time you want to
import new links and update the static output. It can import and
export JSON (among other formats), so it’s easy to script or hook up
to other APIs. If you run it on a schedule and import from browser
history or bookmarks regularly, you can sleep soundly knowing that the
slice of the internet you care about will be automatically preserved
in multiple, durable long-term formats that will be accessible for
decades (or longer).
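
As an illustration, a scheduled run could be as simple as a cron job
that re-feeds a browser bookmarks export into ./archive. This is only
a sketch: the paths, the schedule, and the exact invocation are
assumptions for the example, not the documented 0.2.4 interface.

    # crontab syntax: m h dom mon dow  command
    # Re-archive a bookmarks export every night at 03:00.
    # $HOME/ArchiveBox and the export file path are placeholders.
    0 3 * * *  cd $HOME/ArchiveBox && ./archive $HOME/bookmarks_export.html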

----

I'm not using this just yet because the upstream packaging is somewhat
weird right now.

https://github.com/pirate/ArchiveBox/issues/120#issuecomment-471027516

It's eventually going to end up on PyPI, at which point I'll look at
packaging this myself.

There is, as far as I know, no similar tool in Debian right
now. There are web crawlers and grabbers, but nothing as comprehensive
as this.

I'd be happy to co-maintain this or delegate to whoever is interested.
