Bug#607964: ITP: ucto -- Unicode Tokenizer

Package: wnpp
Severity: wishlist
Owner: Joost van Baal <joostvb-debian-bugs-20101225-3@mdcc.cx>

* Package name    : ucto
  Upstream Author : ILK Research Group, Tilburg University, http://ilk.uvt.nl
* URL             : http://ilk.uvt.nl/mbt/
* License         : GPL-3
  Programming Lang: C++
  Description     : Unicode Tokenizer

 Ucto can tokenize UTF-8 encoded text files (i.e. separate words from
 punctuation, split sentences, generate n-grams), and offers several other
 basic preprocessing steps (change case, count words/characters and reverse
 lines) that make your text suited for further processing such as indexing,
 part-of-speech tagging, or machine translation.
 Ucto is a product of the ILK Research Group, Tilburg University (The
 Netherlands).
 If you are interested in machine parsing of UTF-8 encoded text files, e.g. to
 do scientific research in natural language processing, ucto will likely be of
 use to you.


Upstream has not yet officially released ucto; currently there's just an
obsolete prerelease snapshot and some promising code in SVN (not git).  See
also https://github.com/proycon , http://proylt.anaproy.nl/en/software/ and
http://proylt.anaproy.nl/media/software/ .

The frog package (see Bug#605905: ITP: frog -- tagger and parser for Dutch
language) will depend upon ucto.  Frog will be the new name and reincarnation
of tadpole; see http://ilk.uvt.nl/tadpole/ .



irc:joostvb@{OFTC,freenode} ∙ http://mdcc.cx/ ∙ http://ad1810.com/

