Package: wnpp
Severity: wishlist
Owner: Ben Finney <bignose@debian.org>
* Package name : wikiextractor
Version : 2.75
Upstream Author : Giuseppe Attardi <attardi@di.unipi.it>
* URL : http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
* License : GPL-3
Programming Lang: Python
Description : tool to extract plain text from a Wikipedia dump
The Wikipedia maintainers provide, each month, an XML dump of all
documents in the database: a single XML file containing the whole
encyclopedia, that can be used for various kinds of analysis, such as
statistics, service lists, etc.
This Wikipedia extractor tool generates plain text from a Wikipedia
database dump. It discards any other information or annotation
present in Wikipedia pages, such as images, tables, references and
lists.
Some works use Wikipedia data as part of their complete source. This
package will be useful for build chains that require processing that
data as source.
--
\ “I put instant coffee in a microwave oven and almost went back |
`\ in time.” —Steven Wright |
_o__) |
Ben Finney <bignose@debian.org>
Attachment:
signature.asc
Description: PGP signature