
Re: ITP: cld2 -- Compact Language Detector 2



Hi Gianfranco,

this looks like a good fit for the Debian Science linguistics task, so
it would be great to CC the list.  I assume you will maintain the
package in the Debian Science team.

Kind regards

       Andreas.

On Tue, Feb 10, 2015 at 09:28:18AM +0000, Gianfranco Costamagna wrote:
> Package: wnpp
> Severity: wishlist
> Owner: Gianfranco Costamagna <costamagnagianfranco@yahoo.it>
> 
> * Package name    : cld2
>   Version         : 0.0.0~svn193
>   Upstream Author : Dick Sites <dsites@google.com>
> * URL             : https://code.google.com/p/cld2/
> * License         : Apache-2.0
>   Programming Lang: C++
>   Description     : Compact Language Detector 2
> 
> CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML.
> Legacy encodings must be converted to valid UTF-8 by the caller. For mixed-language input,
> CLD2 returns the top three languages found and their approximate percentages of the total
> text bytes (e.g. 80% English and 20% French out of 1000 bytes of text means about 800 bytes
> of English and 200 bytes of French). Optionally, it also returns a vector of text spans with
> the language of each identified. This may be useful for applying different spelling-correction
> dictionaries or different machine translation requests to each span. The design target is web
> pages of at least 200 characters (about two sentences); CLD2 is not designed to do well on very
> short text, lists of proper names, part numbers, etc.
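
The byte accounting in the example above can be sketched in a few lines of
Python. This is only an illustration of the arithmetic; the function name
is made up here and is not part of the CLD2 API.

```python
def bytes_per_language(total_bytes, percentages):
    """Turn approximate per-language percentages into approximate byte counts."""
    return {lang: total_bytes * pct // 100 for lang, pct in percentages.items()}


# The example from the description: 1000 bytes, 80% English, 20% French.
estimate = bytes_per_language(1000, {"English": 80, "French": 20})
```

Since the percentages are approximate, the byte counts are estimates too.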
> 
> CLD2 is a Naïve Bayesian classifier, using one of three different token algorithms. For Unicode
> scripts such as Greek and Thai that map one-to-one to detected languages, the script defines
> the result. For the 80,000+ character Han script and its CJK combination with Hiragana,
> Katakana, and Hangul scripts, single letters (unigrams) are scored. For all other scripts,
> sequences of four letters (quadgrams) are scored.
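
A minimal Python sketch of that three-way split, assuming the script can be
read off the first word of the Unicode character name. The names below are
hypothetical and the script sets are just the examples from the paragraph
above; CLD2 itself uses its own tables in C++.

```python
import unicodedata

# Scripts whose presence alone decides the language (examples from the text).
SCRIPT_ONLY = {"GREEK", "THAI"}
# Scripts scored one letter at a time.
UNIGRAM_SCRIPTS = {"CJK", "HIRAGANA", "KATAKANA", "HANGUL"}


def token_algorithm(letter):
    """Return which of the three token algorithms would handle this letter."""
    name = unicodedata.name(letter, "")
    script = name.split(" ", 1)[0]  # e.g. "GREEK SMALL LETTER ALPHA" -> "GREEK"
    if script in SCRIPT_ONLY:
        return "script"
    if script in UNIGRAM_SCRIPTS:
        return "unigram"
    return "quadgram"
```

For example, `token_algorithm("a")` falls through to the quadgram case, while
a Han ideograph is scored as a unigram.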
> 
> Scoring is done exclusively on lowercased Unicode letters and marks, after expanding HTML
> entities &xyz; and after deleting digits, punctuation, and <tags>. Quadgram word beginnings
> and endings (indicated here by underscore) are explicitly used, so the word _look_ scores
> differently from the word-beginning _look or the mid-word look. Quadgram single-letter
> "words" are completely ignored. For each letter sequence, the scoring uses the 3-6 most
> likely languages and their quantized log probabilities. The training corpus is manually
> constructed from chosen web pages for each language, then augmented by careful automated
> scraping of over 100M additional web pages.
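
To make the underscore notation concrete, here is a rough Python sketch of
the preprocessing and quadgram extraction described above. It is an
assumption-laden illustration, not CLD2's code: the function name is
invented, the regular expressions are a simplification (combining marks are
treated as separators here), and words of two to four letters are emitted
as a single token with both boundaries marked.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")          # crude <tag> matcher
NON_LETTER_RE = re.compile(r"[\W\d_]+")  # digits, punctuation, whitespace


def quadgrams(text):
    """Return boundary-marked four-letter sequences for each word in text."""
    # Drop tags, expand entities such as &eacute;, then lowercase.
    text = html.unescape(TAG_RE.sub(" ", text)).lower()
    grams = []
    for word in NON_LETTER_RE.split(text):
        if len(word) < 2:
            continue  # single-letter "words" are ignored
        if len(word) <= 4:
            grams.append("_" + word + "_")
            continue
        last = len(word) - 4
        for i in range(last + 1):
            gram = word[i:i + 4]
            if i == 0:
                gram = "_" + gram  # window starts at a word beginning
            if i == last:
                gram = gram + "_"  # window ends at a word ending
            grams.append(gram)
    return grams
```

With the input `"<b>Look</b>, looking 42 outlooks!"` this yields `_look_` for
the whole word, `_look` for the start of "looking", and a bare `look` from
the middle of "outlooks", which are the three cases the description
distinguishes.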
> 
> Several embellishments improve the basic algorithm: additional scoring of some sequences
> of two CJK letters or eight other letters; scoring some words and word pairs that are
> distinctive within sets of statistically-close languages such as {Malay, Indonesian}
> or {Spanish, Portuguese, Galician}; removing repetitive sequences/words that would
> otherwise skew the scoring, such as “jpg” in “foo.jpg bar.jpg baz.jpg”; removing
> web-specific words that convey almost no language information such as page, link,
> click, td, tr, copyright, wikipedia, http.
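
The last two embellishments can be approximated with a very simple filter.
The Python sketch below is a stand-in heuristic, not CLD2's actual
repetition detection: it drops the web-specific words listed above and
keeps each remaining word at most `max_repeats` times. The function and
parameter names are hypothetical.

```python
from collections import Counter

# Web-specific words listed in the description.
WEB_WORDS = {"page", "link", "click", "td", "tr", "copyright", "wikipedia", "http"}


def filter_words(words, max_repeats=1):
    """Drop web-specific words and cap how often any one word is kept."""
    seen = Counter()
    kept = []
    for word in words:
        if word in WEB_WORDS:
            continue
        seen[word] += 1
        if seen[word] > max_repeats:
            continue  # a repeat that would otherwise skew the scoring
        kept.append(word)
    return kept
```

Applied to the tokens of "foo.jpg bar.jpg baz.jpg", only the first "jpg"
survives.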
> 
> Several hints can be supplied. Because these can be inaccurate on web pages, they
> are just hints -- they add a bias but do not force a specific language to be the
> detection result. The hints include expected language, original document encoding,
> document URL top-level domain name, and embedded <…lang=xx …> language tags.
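
The difference between a bias and a forced result can be shown with a small
Python sketch. The score values, the size of the bias, and the function
name are all assumptions for illustration; higher scores mean more likely.

```python
def apply_hint(scores, hinted_lang, bias=1.0):
    """Add a bias to the hinted language and return the best-scoring language."""
    biased = dict(scores)
    biased[hinted_lang] = biased.get(hinted_lang, 0.0) + bias
    return max(biased, key=biased.get)


# A close call: the hint tips the result towards Portuguese.
close_call = apply_hint({"es": 10.0, "pt": 9.5}, "pt")
# Strong evidence for English: the same hint does not override it.
clear_case = apply_hint({"en": 20.0, "pt": 9.5}, "pt")
```

The hint only matters when the text evidence is close, which matches the
"just hints" behaviour described above.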
> 
> The table-driven extraction of letter sequences and table-driven scoring is highly optimized
> for both space and speed, running about 10x faster than other detectors and covering over 70
> languages in 1.8MB of x86 code and tables. The main quadgram lookup table consists of 256K
> four-byte entries, covering about 50 languages. Detection over the average web page of 30KB
> (half tags/digits/punctuation, half letters) takes roughly 1 msec on a current x86 processor.
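
A quick back-of-the-envelope check of those figures in Python, assuming
"K" and "KB" mean multiples of 1024 (the description does not say):

```python
# Main quadgram table: 256K entries of four bytes each.
ENTRIES = 256 * 1024
ENTRY_BYTES = 4
table_bytes = ENTRIES * ENTRY_BYTES      # size of the main table in bytes
table_mib = table_bytes / (1024 * 1024)  # the same size in MiB

# Average page: 30KB processed in roughly 1 msec.
PAGE_BYTES = 30 * 1024
throughput_bytes_per_s = PAGE_BYTES * 1000  # 1000 such pages per second
letter_bytes = PAGE_BYTES // 2              # about half of the page is letters
```

Under that assumption the main table alone is 1 MiB of the quoted 1.8MB,
and the quoted timing corresponds to roughly 30 MB of raw page data per
second.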
> 
> CLD2 is an update of the prior CLD, adding more languages, updating to Unicode 6.2 characters,
> improving scoring, and adding the optional output vector of labelled language spans.
> 
> These 83 languages are detected: Afrikaans Albanian Arabic Armenian Azerbaijani Basque Belarusian
> Bengali Bihari Bulgarian Catalan Cebuano Cherokee Croatian Czech Chinese Chinese_T Danish Dhivehi
> Dutch English Estonian Finnish French Galician Ganda Georgian German Greek Gujarati Haitian_Creole
> Hebrew Hindi Hmong Hungarian Icelandic Indonesian Inuktitut Irish Italian Javanese Japanese Kannada
> Khmer Kinyarwanda Korean Laothian Latvian Limbu Lithuanian Macedonian Malay Malayalam Maltese
> Marathi Nepali Norwegian Oriya Persian Polish Portuguese Punjabi Romanian Russian Scots_Gaelic
> Serbian Sinhalese Slovak Slovenian Spanish Swahili Swedish Syriac Tagalog Tamil Telugu Thai
> Turkish Ukrainian Urdu Vietnamese Welsh Yiddish.
> 
> 
> Useful for the upcoming poedit 1.8 release.
> 
> 
> --
> To UNSUBSCRIBE, email to debian-devel-REQUEST@lists.debian.org
> with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
> Archive: https://lists.debian.org/791546596.2444306.1423560498635.JavaMail.yahoo@mail.yahoo.com
> 
> 

-- 
http://fam-tille.de

