
Bug#777588: marked as done (ITP: cld2 -- Compact Language Detector 2)



Your message dated Sun, 24 May 2015 22:00:14 +0000
with message-id <E1YwdwM-0005VL-8Z@franck.debian.org>
and subject line Bug#777588: fixed in cld2 0.0.0~svn194-1
has caused the Debian Bug report #777588,
regarding ITP: cld2 -- Compact Language Detector 2
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact owner@bugs.debian.org
immediately.)


-- 
777588: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=777588
Debian Bug Tracking System
Contact owner@bugs.debian.org with problems
--- Begin Message ---
Package: wnpp
Severity: wishlist
Owner: Gianfranco Costamagna <costamagnagianfranco@yahoo.it>

* Package name    : cld2
  Version         : 0.0.0~svn193
  Upstream Author : Dick Sites <dsites@google.com>
* URL             : https://code.google.com/p/cld2/
* License         : Apache-2.0
  Programming Lang: C++
  Description     : Compact Language Detector 2

CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML.
Legacy encodings must be converted to valid UTF-8 by the caller. For mixed-language input,
CLD2 returns the top three languages found and their approximate percentages of the total
text bytes (e.g. 80% English and 20% French out of 1000 bytes of text means about 800 bytes
of English and 200 bytes of French). Optionally, it also returns a vector of text spans with
the language of each identified. This may be useful for applying different spelling-correction
dictionaries or different machine translation requests to each span. The design target is web
pages of at least 200 characters (about two sentences); CLD2 is not designed to do well on very
short text, lists of proper names, part numbers, etc.
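
The percentage arithmetic in the example above can be spelled out in a few lines of Python. This is only an illustration of the arithmetic, not part of CLD2; the function name bytes_per_language is made up for this sketch.

```python
def bytes_per_language(total_text_bytes, percents):
    """Turn reported percentages into approximate byte counts per language."""
    return {lang: total_text_bytes * pct // 100 for lang, pct in percents.items()}


# The example from the description: 1000 bytes, 80% English and 20% French.
estimate = bytes_per_language(1000, {"English": 80, "French": 20})
print(estimate)
```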

CLD2 is a Naïve Bayesian classifier, using one of three different token algorithms. For Unicode
scripts such as Greek and Thai that map one-to-one to detected languages, the script defines
the result. For the 80,000+ character Han script and its CJK combination with Hiragana,
Katakana, and Hangul scripts, single letters (unigrams) are scored. For all other scripts,
sequences of four letters (quadgrams) are scored.
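
As a rough sketch of how the three token algorithms could be selected per character, the Python below looks at the first word of the Unicode character name. This is an assumption made for illustration; the description does not say how CLD2 determines scripts, and the names SCRIPT_ONLY, CJK_PREFIXES and token_algorithm are hypothetical.

```python
import unicodedata

# Scripts that map one-to-one to a language (illustrative subset from the text).
SCRIPT_ONLY = {"GREEK": "Greek", "THAI": "Thai"}
# Han and the scripts it combines with are scored one letter at a time.
CJK_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def token_algorithm(ch):
    """Return 'script', 'unigram' or 'quadgram' for a single letter."""
    name = unicodedata.name(ch, "")
    first_word = name.split(" ")[0] if name else ""
    if first_word in SCRIPT_ONLY:
        return "script"
    if first_word in CJK_PREFIXES:
        return "unigram"
    return "quadgram"


for ch in "αก中あ한a":
    print(ch, token_algorithm(ch))
```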

Scoring is done exclusively on lowercased Unicode letters and marks, after expanding HTML
entities &xyz; and after deleting digits, punctuation, and <tags>. Quadgram word beginnings
and endings (indicated here by underscore) are explicitly used, so the word _look_ scores
differently from the word-beginning _look or the mid-word look. Quadgram single-letter
"words" are completely ignored. For each letter sequence, the scoring uses the 3-6 most
likely languages and their quantized log probabilities. The training corpus is manually
constructed from chosen web pages for each language, then augmented by careful automated
scraping of over 100M additional web pages.
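
The preprocessing and quadgram scoring described above might look roughly like the following Python sketch. It is a simplified assumption, not CLD2's code: it scores every overlapping four-letter window, treats words of two to four letters as a single token, strips tags before expanding entities, and uses a made-up TOY_TABLE of quantized log probabilities. All names here are hypothetical.

```python
import html
import re
import unicodedata


def letters_only(text):
    """Drop <tags>, expand entities, lowercase, keep only letters and marks."""
    text = re.sub(r"<[^>]*>", " ", text)
    text = html.unescape(text).lower()
    return "".join(ch if unicodedata.category(ch)[0] in "LM" else " " for ch in text)


def quadgrams(text):
    """Four-letter windows; '_' marks a word beginning or a word ending."""
    grams = []
    for word in letters_only(text).split():
        if len(word) < 2:  # single-letter "words" are ignored
            continue
        if len(word) <= 4:  # assumption: short words are one token
            grams.append("_" + word + "_")
            continue
        last = len(word) - 4
        for i in range(last + 1):
            grams.append(("_" if i == 0 else "") + word[i:i + 4] + ("_" if i == last else ""))
    return grams


# Made-up table: quadgram -> a few likely languages with quantized log probabilities.
TOY_TABLE = {
    "_look_": [("English", 9), ("Dutch", 3)],
    "king_": [("English", 7), ("Danish", 4)],
}


def score(text, table=TOY_TABLE):
    """Sum the quantized log probabilities per language (naive Bayes style)."""
    totals = {}
    for gram in quadgrams(text):
        for lang, qprob in table.get(gram, ()):
            totals[lang] = totals.get(lang, 0) + qprob
    return totals


print(quadgrams("<b>Look</b> at 2 looking!"))
print(score("<b>Look</b> at 2 looking!"))
```

Note how the whole word "look" yields _look_, while "looking" yields the word-beginning _look, which is a different table key.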

Several embellishments improve the basic algorithm: additional scoring of some sequences
of two CJK letters or eight other letters; scoring some words and word pairs that are
distinctive within sets of statistically-close languages such as {Malay, Indonesian}
or {Spanish, Portuguese, Galician}; removing repetitive sequences/words that would
otherwise skew the scoring, such as “jpg” in “foo.jpg bar.jpg baz.jpg”; removing
web-specific words that convey almost no language information such as page, link,
click, td, tr, copyright, wikipedia, http.
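
Two of these embellishments, dropping repeated words and dropping web-specific words, can be sketched in Python as below. The description does not say how CLD2 detects repetition, so the simple per-word counter here is a stand-in; filter_words and max_repeats are hypothetical names.

```python
from collections import Counter

# Web-specific words listed in the description as carrying almost no language information.
WEB_WORDS = {"page", "link", "click", "td", "tr", "copyright", "wikipedia", "http"}


def filter_words(words, max_repeats=1):
    """Keep each word at most max_repeats times and drop web-specific words."""
    seen = Counter()
    kept = []
    for word in words:
        if word in WEB_WORDS:
            continue
        seen[word] += 1
        if seen[word] > max_repeats:
            continue
        kept.append(word)
    return kept


print(filter_words("foo jpg bar jpg baz jpg click".split()))
```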

Several hints can be supplied. Because these can be inaccurate on web pages, they
are just hints -- they add a bias but do not force a specific language to be the
detection result. The hints include expected language, original document encoding,
document URL top-level domain name, and embedded <…lang=xx …> language tags.
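
The difference between a hint and a forced result can be illustrated with a small Python sketch. The scores and the boost value are invented for the example, and apply_hints and best are hypothetical helpers, not CLD2 functions.

```python
def apply_hints(scores, hints):
    """Add a per-language bias to the scores without overriding them."""
    biased = dict(scores)
    for lang, boost in hints.items():
        biased[lang] = biased.get(lang, 0) + boost
    return biased


def best(scores):
    """Language with the highest total score."""
    return max(scores, key=scores.get)


hint = {"Portuguese": 5}  # e.g. derived from a lang= tag or a top-level domain

close_call = {"Spanish": 120, "Portuguese": 118}
clear_case = {"Spanish": 200, "Portuguese": 118}

print(best(apply_hints(close_call, hint)))  # the hint tips a close call
print(best(apply_hints(clear_case, hint)))  # strong evidence still wins
```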

The table-driven extraction of letter sequences and table-driven scoring are highly optimized
for both space and speed, running about 10x faster than other detectors and covering over 70
languages in 1.8MB of x86 code and tables. The main quadgram lookup table consists of 256K
four-byte entries, covering about 50 languages. Detection over the average web page of 30KB
(half tags/digits/punctuation, half letters) takes roughly 1 msec on a current x86 processor.
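
For a sense of scale, the figures above work out as follows, assuming K means 1024 and taking the quoted 1 msec per page at face value.

```python
# Main quadgram table: 256K entries of four bytes each.
entries = 256 * 1024
table_bytes = entries * 4
print(table_bytes, "bytes =", table_bytes // (1024 * 1024), "MiB")

# Average page of 30KB, about half of it letters, at roughly 1 msec per page.
page_bytes = 30 * 1024
letter_bytes = page_bytes // 2
pages_per_second = 1000
bytes_per_second = page_bytes * pages_per_second
print(letter_bytes, "letter bytes per page")
print(bytes_per_second, "bytes scanned per second")
```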

CLD2 is an update of the prior CLD, adding more languages, updating to Unicode 6.2 characters,
improving scoring, and adding the optional output vector of labelled language spans.

These 83 languages are detected: Afrikaans Albanian Arabic Armenian Azerbaijani Basque Belarusian
Bengali Bihari Bulgarian Catalan Cebuano Cherokee Croatian Czech Chinese Chinese_T Danish Dhivehi
Dutch English Estonian Finnish French Galician Ganda Georgian German Greek Gujarati Haitian_Creole
Hebrew Hindi Hmong Hungarian Icelandic Indonesian Inuktitut Irish Italian Javanese Japanese Kannada
Khmer Kinyarwanda Korean Laothian Latvian Limbu Lithuanian Macedonian Malay Malayalam Maltese
Marathi Nepali Norwegian Oriya Persian Polish Portuguese Punjabi Romanian Russian Scots_Gaelic
Serbian Sinhalese Slovak Slovenian Spanish Swahili Swedish Syriac Tagalog Tamil Telugu Thai
Turkish Ukrainian Urdu Vietnamese Welsh Yiddish.


Useful for the upcoming poedit 1.8 release.

--- End Message ---
--- Begin Message ---
Source: cld2
Source-Version: 0.0.0~svn194-1

We believe that the bug you reported is fixed in the latest version of
cld2, which is due to be installed in the Debian FTP archive.

A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed.  If you
have further comments please address them to 777588@bugs.debian.org,
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Gianfranco Costamagna <costamagnagianfranco@yahoo.it> (supplier of updated cld2 package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing ftpmaster@ftp-master.debian.org)


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Format: 1.8
Date: Tue, 10 Feb 2015 10:12:04 +0100
Source: cld2
Binary: libcld2-0 libcld2-dev
Architecture: source amd64
Version: 0.0.0~svn194-1
Distribution: unstable
Urgency: low
Maintainer: Debian Science Team <debian-science-maintainers@lists.alioth.debian.org>
Changed-By: Gianfranco Costamagna <costamagnagianfranco@yahoo.it>
Description:
 libcld2-0  - Compact Language Detector 2, library package
 libcld2-dev - Compact Language Detector 2, development package
Closes: 777588
Changes:
 cld2 (0.0.0~svn194-1) unstable; urgency=low
 .
   * Initial release (Closes: #777588).
Checksums-Sha1:
 148aea0c3a666ebdc5e7a219d999f1794f1e409b 2013 cld2_0.0.0~svn194-1.dsc
 ac7bae3d485ea8ab3df7684d777cc54ccb639c44 56630112 cld2_0.0.0~svn194.orig.tar.xz
 43b3f65763bff719ec8da628bfe96480f48ca92a 3052 cld2_0.0.0~svn194-1.debian.tar.xz
 e6f84f74e4c98fe9a2d9673e2af6d6377c5b2d91 5206144 libcld2-0_0.0.0~svn194-1_amd64.deb
 6da6bfab8a79c0f2e597857dc834fb3fe581da4b 95952 libcld2-dev_0.0.0~svn194-1_amd64.deb
Checksums-Sha256:
 f9f723f953f1534b38dd47e14bd495ea1795f89d7dead2411f559c83d5a2000b 2013 cld2_0.0.0~svn194-1.dsc
 d265c51bc6264caa958cdef74624704f3755d7b830b51d35f04140c7edd5fbbe 56630112 cld2_0.0.0~svn194.orig.tar.xz
 67b7736f445f4b8a536685b83012d83cab7863e65458c5c32d963e7be8d3400d 3052 cld2_0.0.0~svn194-1.debian.tar.xz
 1af1361730635b15e802d95e5815f8b232b59003982bf17ec519035ac3d75e5f 5206144 libcld2-0_0.0.0~svn194-1_amd64.deb
 41e5d5b9db494f202efac4c5f92c5efff97a715b65f54ec41107cd8a72b11ccd 95952 libcld2-dev_0.0.0~svn194-1_amd64.deb
Files:
 5e3198e37d35b2b9ffdc73d8618e8a8e 2013 libs optional cld2_0.0.0~svn194-1.dsc
 360e871ba6b7ec9ec472778077aecb13 56630112 libs optional cld2_0.0.0~svn194.orig.tar.xz
 3aef0b643b29eb1c60f3648575e1365c 3052 libs optional cld2_0.0.0~svn194-1.debian.tar.xz
 b09e0ef0b7927fc05e45c19bf7bc368b 5206144 libs optional libcld2-0_0.0.0~svn194-1_amd64.deb
 8c1a3ffec5b19c269408abd9d1ec0bd8 95952 libdevel optional libcld2-dev_0.0.0~svn194-1_amd64.deb

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQIcBAEBCAAGBQJVSi04AAoJEJFk+h0XvV02bSMP/Rmm0QjFUAhGZXBr5+w9VicL
xtFp1mW6eUTBbupCtC2t759KG7E+bcUsSN1KHjbttde2iHGwMkVeqeEbrY4B/fdL
ekn5aZJ2CHb2hNUgyVm9NyRzNfP8mjmngVsgSpkVn2lAfZcA8HWYs2jaWPmeq0L3
YGVbpDXm19Utt8uysx6X+vCNJ+/7c7EeDaPRy+j0X1ul+53HpEmHTTNP+DmYypNV
hlecpI19LjJ/wpS/QlGnyLxBBOYuXrNiHkRb4sJVvW2OCxyMU9cssmCPftdmje9m
fazsUZkraENK0bVw4zPIgW3zodVkMxk4booJmf5Oc5KXlywzEuD0VSElSiHMx+4l
wt2Wfe990O8vcfHmRiRwHa2CJwO5puus2aknGeeL66xSaMmsFYurqSpkVWSG3VFY
h2bb1E2cLw29mn9ISz9/Cy1ca0t8aq3nanpyLK5rf/Atft5QtyZCBRJbYu8HpcKN
U4kPC3t40Va1uJQpuZMY6nURHc3xprmpB1AALd6wclruVHxCst+qcWCcJYBfEnV+
2psJqPGA/u7gy1/dVDwnLrcClpGEWJbXPwyIC/Qe0xECsnLTsfu2WXkESkk1Kbz0
y5hu8hgo6ZfEb3pvEBgXHJjRLs0DCG8pllQrLF0XfXWvAjslZc+yQQJtTIkKddyr
sii6uVX/X0gP54sy+aRf
=tVT3
-----END PGP SIGNATURE-----

--- End Message ---
