
Bug#712827: ITP: boilerpipe -- Boilerplate removal and fulltext extraction from HTML pages



Package: wnpp
Severity: wishlist
Owner: Emmanuel Bourg <ebourg@apache.org>

* Package name    : boilerpipe
  Version         : 1.2.0
  Upstream Author : Christian Kohlschütter <christian@kohlschutter.com>
* URL             : http://code.google.com/p/boilerpipe
* License         : Apache-2.0
  Programming Lang: Java
  Description     : Boilerplate removal and fulltext extraction from HTML pages

The boilerpipe library provides algorithms to detect and remove the surplus
"clutter" (boilerplate, templates) around the main textual content of a web
page.

The library already provides specific strategies for common tasks (for example:
news article extraction) and may also be easily extended for individual problem
settings.

Extraction is very fast (on the order of milliseconds), needs only the input
document (no global or site-level information is required) and is usually
quite accurate.
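
To give a rough idea of what per-document boilerplate detection can look like,
here is a minimal, self-contained sketch. It is NOT boilerpipe's code or API:
the class name BoilerplateSketch, the thresholds and the regex-based block
splitting are all assumptions made for illustration. It keeps a text block
only if it has enough words and a low share of link text, which is one simple
heuristic that needs nothing but the input document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only; names and thresholds are hypothetical,
// not taken from boilerpipe.
class BoilerplateSketch {
    // Assumed cut-offs: blocks shorter than this, or dominated by
    // link text, are treated as boilerplate.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    // Tags that start or end a text block in this sketch.
    private static final Pattern BLOCK_BREAK = Pattern.compile(
        "(?i)</?(p|div|li|ul|ol|h[1-6]|td|tr|table|br)\\b[^>]*>");
    // Anchor elements, used to measure how much of a block is link text.
    private static final Pattern ANCHOR =
        Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static int countWords(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // Remove all remaining tags and collapse whitespace.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ")
                  .replaceAll("\\s+", " ").trim();
    }

    // Return the blocks judged to be main content, one per line.
    static String extract(String html) {
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(html)) {
            int linkWords = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkWords += countWords(stripTags(m.group(1)));
            }
            String text = stripTags(block);
            int words = countWords(text);
            if (words == 0) {
                continue; // nothing but markup or whitespace
            }
            double linkDensity = (double) linkWords / words;
            if (words >= MIN_WORDS && linkDensity <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"/\">Home</a> | <a href=\"/about\">About</a></div>"
            + "<p>The boilerpipe library provides algorithms to detect and remove "
            + "the surplus clutter around the main text.</p>"
            + "<div>Copyright 2013</div>";
        System.out.println(extract(html));
    }
}
```

In this sample the navigation block and the short copyright line are dropped
and only the paragraph survives. A real extractor would use a proper HTML
parser and better-tuned features; the regexes here are only good enough for a
demonstration.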

