Bug#1022911: ITP: python-html-text -- extract text from HTML
Package: wnpp
Severity: wishlist
Owner: Christian Marillat <marillat@debian.org>
X-Debbugs-Cc: debian-devel@lists.debian.org
* Package name : python-html-text
Version : 0.5.2
Upstream Author : Scrapinghub Inc
* URL : https://github.com/TeamHG-Memex/html-text
* License : MIT
Programming Lang: Python
Description : extract text from HTML
How is html_text different from .xpath('//text()') from lxml or
.get_text() from Beautiful Soup?
 * Text extracted with html_text does not contain inline styles,
   JavaScript, comments and other text that is not normally visible
   to users;
 * html_text normalizes whitespace, but in a smarter way than
   .xpath('normalize-space()'), adding spaces around inline elements
   (which are often used as block elements in HTML markup), and trying
   to avoid adding extra spaces for punctuation;
 * html_text can add newlines (e.g. after headers or paragraphs), so
   that the output text looks more like how it is rendered in
   browsers.
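To make the points above concrete, here is a rough, standard-library-only
sketch of the general idea: drop script/style content and comments,
collapse whitespace, and break lines at block-level tags. It is not
html_text's implementation (which builds on lxml and also handles spacing
around inline elements and punctuation); the names HIDDEN_TAGS,
NEWLINE_TAGS, VisibleTextParser and visible_text are made up for this
illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical tag sets chosen for this sketch; html_text uses its own rules.
HIDDEN_TAGS = {"script", "style"}
NEWLINE_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class VisibleTextParser(HTMLParser):
    """Collect text a user would see, skipping script/style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.hidden_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1
        elif tag in NEWLINE_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS:
            self.hidden_depth = max(0, self.hidden_depth - 1)
        elif tag in NEWLINE_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self.hidden_depth == 0:
            # Collapse whitespace runs (including source newlines) to one space.
            self.chunks.append(re.sub(r"\s+", " ", data))

    # Comments are dropped: the inherited handle_comment() does nothing.


def visible_text(html):
    """Return visible text with one line per block-level element."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.chunks).split("\n")
    cleaned = (re.sub(r" +", " ", line).strip() for line in lines)
    return "\n".join(line for line in cleaned if line)


sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Title</h1><!-- note --><p>Hello   <b>world</b>!</p>"
    "<script>var x = 1;</script></body></html>"
)
print(visible_text(sample))
```

With the package itself, the equivalent call would be
html_text.extract_text(sample).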
This package is a dependency of python-extruct.