Bug#712827: ITP: boilerpipe -- Boilerplate removal and fulltext extraction from HTML pages

To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: Bug#712827: ITP: boilerpipe -- Boilerplate removal and fulltext extraction from HTML pages
From: Emmanuel Bourg <ebourg@apache.org>
Date: Wed, 19 Jun 2013 23:40:08 +0200
Message-id: <20130619214008.14063.61113.reportbug@debiandev>
Reply-to: Emmanuel Bourg <ebourg@apache.org>, 712827@bugs.debian.org

Package: wnpp
Severity: wishlist
Owner: Emmanuel Bourg <ebourg@apache.org>

* Package name    : boilerpipe
  Version         : 1.2.0
  Upstream Author : Christian Kohlschütter <christian@kohlschutter.com>
* URL             : http://code.google.com/p/boilerpipe
* License         : Apache-2.0
  Programming Lang: Java
  Description     : Boilerplate removal and fulltext extraction from HTML pages

The boilerpipe library provides algorithms to detect and remove the surplus
"clutter" (boilerplate, templates) around the main textual content of a web
page.

The library already provides specific strategies for common tasks (for example:
news article extraction) and may also be easily extended for individual problem
settings.

Extracting content is very fast (milliseconds), needs only the input document
(no global or site-level information required), and is usually quite accurate.
