Bug#1022911: ITP: python-html-text -- extract text from HTML

To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: Bug#1022911: ITP: python-html-text -- extract text from HTML
From: Christian Marillat <marillat@debian.org>
Date: Thu, 27 Oct 2022 17:07:26 +0200
Message-id: <166688324690.185267.3309207917401293262.reportbug@christian.marillat.net>
Reply-to: Christian Marillat <marillat@debian.org>, 1022911@bugs.debian.org

Package: wnpp
Severity: wishlist
Owner: Christian Marillat <marillat@debian.org>
X-Debbugs-Cc: debian-devel@lists.debian.org

* Package name    : python-html-text
  Version         : 0.5.2
  Upstream Author : Scrapinghub Inc
* URL             : https://github.com/TeamHG-Memex/html-text
* License         : MIT
  Programming Lang: Python
  Description     : extract text from HTML

  How is html_text different from .xpath('//text()') from lxml or
  .get_text() from Beautiful Soup?

  Text extracted with html_text does not contain inline styles,
  JavaScript, comments and other text that is not normally visible to
  users;

  html_text normalizes whitespace, but more intelligently than
  .xpath('normalize-space()'), adding spaces around inline elements
  (which are often used as block elements in HTML markup), and trying
  to avoid adding extra spaces for punctuation;

  html_text can add newlines (e.g. after headers or paragraphs), so
  that the output text looks more like how it is rendered in
  browsers.
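
  To illustrate the first two points, here is a minimal sketch using
  only the Python standard library. It is not html_text's
  implementation; the names VisibleTextParser and visible_text are
  hypothetical and exist only for this example. It skips script, style
  and comment content and collapses whitespace, but it still leaves a
  stray space before punctuation, which is one of the cases html_text
  is described as handling better.

```python
import re
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content.

    Hypothetical helper for illustration; not part of html_text.
    Comments are dropped because HTMLParser.handle_comment is a no-op
    unless overridden.
    """

    HIDDEN = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside hidden elements.
        if not self._hidden_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return visible text with runs of whitespace collapsed to one space."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Naive normalization: a space is inserted between every text node.
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


sample = (
    "<style>p {color: red}</style><h1>Title</h1><!-- note -->"
    "<p>Hello <b>world</b>!</p><script>var x = 1;</script>"
)
result = visible_text(sample)
print(result)  # Title Hello world !
```

  With the package itself, the corresponding call would be
  html_text.extract_text(html).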


  This package is a dependency of python-extruct.
