Bug#712827: ITP: boilerpipe -- Boilerplate removal and fulltext extraction from HTML pages
Package: wnpp
Severity: wishlist
Owner: Emmanuel Bourg <ebourg@apache.org>
* Package name    : boilerpipe
  Version         : 1.2.0
  Upstream Author : Christian Kohlschütter <christian@kohlschutter.com>
* URL             : http://code.google.com/p/boilerpipe
* License         : Apache-2.0
  Programming Lang: Java
  Description     : Boilerplate removal and fulltext extraction from HTML pages
The boilerpipe library provides algorithms to detect and remove the surplus
"clutter" (boilerplate, templates) around the main textual content of a web
page.
The library already provides specific strategies for common tasks (for example,
news article extraction) and can easily be extended for individual problem
settings.
Extraction is very fast (milliseconds), needs only the input document (no
global or site-level information is required), and is usually quite accurate.