
Bug#159715: RFP: enca -- Enca is an Extremely Naive Charset Analyser. It detects encoding of text files and is also able to convert them to other encodings.



Package: wnpp
Version: N/A; reported 2002-09-05
Severity: wishlist

* Package name    : enca
  Version         : 0.10.1
  Upstream Author : David Necas (Yeti) <yeti@physics.muni.cz>
* URL             : http://physics.muni.cz/~yeti/software/enca.shtml
* License         : GPL
  Description     : Enca is an Extremely Naive Charset Analyser. It detects encoding of text files and is also able to convert them to other encodings.

Enca can currently determine the 8-bit charsets of Belarusian, Czech, Polish, Russian, Slovak and Ukrainian texts, as well as some multibyte encodings independently of language (provided it is a European language). The main features include:

    * recognises the following 8-bit charsets:
          o Belarusian: CP1251, IBM866, ISO-8859-5, KOI8-UNI, maccyr, IBM855
          o Czech: ISO-8859-2, KEYBCS2, IBM852, macce, KOI-8_CS_2, CP1250
          o Polish: ISO-8859-2, IBM852, macce, ISO-8859-13, ISO-8859-16, CP1250, baltic
          o Russian: KOI8-R, IBM866, CP1251, ISO-8859-5, maccyr
          o Slovak: CP1250, KEYBCS2, IBM852, macce, KOI-8_CS_2, ISO-8859-2
          o Ukrainian: CP1251, IBM855, ISO-8859-5, KOI8-U, maccyr, CP1125
    * recognises several multibyte encodings: UCS-2, UCS-4, UTF-8, UTF-7 and TeX accents
    * recognises all common EOL types and byte orders, as well as quoted-printable encoding
    * can report charset names according to various conventions (or programs) as well as human-readable descriptions; accepts all common charset aliases
    * works with multiple files and can act as an intelligent filter
    * converts files using a built-in converter, the GNU recode library, UNIX98 iconv functions or an external converter that can be specified on the command line (e.g. cstocs, GNU recode)
    * has a special ambiguous mode for very short texts
    * can filter out binary parts of a file and/or box-drawing characters before guessing, so it can determine the encoding of pretty messy files
    * uses various tricks to resolve hard-to-decide cases, such as distinguishing between ISO-8859-2 and CP1250
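
To illustrate why the ISO-8859-2 vs. CP1250 case is hard, here is a minimal Python sketch. It is not Enca's algorithm and the function name guess_latin2_variant is made up for this example. It relies on just one observation: the two charsets place š, ť, ž (and their capitals) at different byte values, and the CP1250 positions fall into the 0x80-0x9F range that ISO-8859-2 reserves for control codes. Text without those letters stays undecidable for this check.

```python
def guess_latin2_variant(data: bytes) -> str:
    """Naive guess between ISO-8859-2 and CP1250 (hypothetical helper, not Enca's method)."""
    # Bytes 0x80-0x9F are C1 control codes in ISO-8859-2, but CP1250
    # puts letters there, e.g. š (0x9A), ť (0x9D), ž (0x9E).
    c1_range = sum(1 for b in data if 0x80 <= b <= 0x9F)
    # Bytes where ISO-8859-2 places Š Ť Ž š ť ž; CP1250 assigns these
    # positions to other characters (for example, 0xB9 is ą there).
    iso_positions = sum(1 for b in data if b in b"\xa9\xab\xae\xb9\xbb\xbe")
    if c1_range > iso_positions:
        return "cp1250"
    if iso_positions > c1_range:
        return "iso8859-2"
    return "ambiguous"


sample = "žluťoučký kůň"  # Czech test phrase containing ž and ť
print(guess_latin2_variant(sample.encode("cp1250")))     # cp1250
print(guess_latin2_variant(sample.encode("iso8859_2")))  # iso8859-2
print(guess_latin2_variant(b"ahoj"))                     # ambiguous
```

A real detector needs far more than this (letter frequencies, other charsets, noise filtering), which is presumably what the "various tricks" above refer to.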

PS: It seems the source tarball even includes a ./debian directory.

-- System Information:
Debian Release: testing/unstable
Architecture: i386
Kernel: Linux dimail 2.4.18 #1 Вск Авг 4 01:32:32 EEST 2002 i686
Locale: LANG=ru_RU.KOI8-R, LC_CTYPE=ru_RU.KOI8-R

-- no debconf information



