
Bug#1029393: debian-installer: Missing glyph detection



Source: debian-installer
Severity: minor
Tags: l10n

Hello maintainers of the Debian installer,

As a follow-up to #101435, I've updated the script to detect more cases where
glyphs are used in translations but are missing.

The steps required to run the script are mentioned in the header of the script.

If needed, I can provide the file 'collect' (the currently used translations of
all udebs in Bookworm).

Attached is also the output of the script, which lists the missing glyphs.

With kind regards,
Roland Clobus


-- System Information:
Debian Release: bookworm/sid
  APT prefers testing
  APT policy: (990, 'testing'), (500, 'testing-debug'), (50, 'unstable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 6.1.0-1-amd64 (SMP w/8 CPU threads; PREEMPT)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8), LANGUAGE=en_GB:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
E: Glyph: '­' 173 is used in translations for language(s): da,tg, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '·' 183 is used in translations for language(s): ar,el,kab,ku,lt, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'Ĩ' 296 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ĩ' 297 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ɛ' 603 is used in translations for language(s): kab, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '́' 769 is used in translations for language(s): el,vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '̆' 774 is used in translations for language(s): bg, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '̉' 777 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '־' 1470 is used in translations for language(s): he, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ו' 1493 is used in translations for language(s): he, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ע' 1506 is used in translations for language(s): he, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ף' 1507 is used in translations for language(s): he, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '׳' 1523 is used in translations for language(s): he, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '״' 1524 is used in translations for language(s): he, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '؛' 1563 is used in translations for language(s): ar,fa,ug, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ء' 1569 is used in translations for language(s): ar, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'آ' 1570 is used in translations for language(s): ar,fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'أ' 1571 is used in translations for language(s): ar,fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ؤ' 1572 is used in translations for language(s): ar,fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'إ' 1573 is used in translations for language(s): ar, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ة' 1577 is used in translations for language(s): ar, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ت' 1578 is used in translations for language(s): ar,fa,ug, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ث' 1579 is used in translations for language(s): ar,fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ح' 1581 is used in translations for language(s): ar,fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'خ' 1582 is used in translations for language(s): ar,fa,ug, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ذ' 1584 is used in translations for language(s): ar,fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ز' 1586 is used in translations for language(s): ar,fa,ug, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ص' 1589 is used in translations for language(s): ar,fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ض' 1590 is used in translations for language(s): ar,fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ط' 1591 is used in translations for language(s): ar,fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ظ' 1592 is used in translations for language(s): ar,fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ع' 1593 is used in translations for language(s): ar,fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'غ' 1594 is used in translations for language(s): ar,fa,ug, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ق' 1602 is used in translations for language(s): ar,fa,ug, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ن' 1606 is used in translations for language(s): ar,fa,ug, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ه' 1607 is used in translations for language(s): ar,fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ي' 1610 is used in translations for language(s): ar,fa,ug, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '٣' 1635 is used in translations for language(s): ar, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '٥' 1637 is used in translations for language(s): ar, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ٰ' 1648 is used in translations for language(s): fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ھ' 1726 is used in translations for language(s): ug, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ۆ' 1734 is used in translations for language(s): ug, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '۰' 1776 is used in translations for language(s): fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '۱' 1777 is used in translations for language(s): fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '۲' 1778 is used in translations for language(s): fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '۳' 1779 is used in translations for language(s): fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '۴' 1780 is used in translations for language(s): fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '۶' 1782 is used in translations for language(s): fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '۸' 1784 is used in translations for language(s): fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '۹' 1785 is used in translations for language(s): fa, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ჲ' 4338 is used in translations for language(s): ka, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'Ạ' 7840 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ạ' 7841 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'Ả' 7842 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ả' 7843 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ẹ' 7865 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ẻ' 7867 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'Ẽ' 7868 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ẽ' 7869 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ỉ' 7881 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'Ị' 7882 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ị' 7883 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'Ọ' 7884 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ọ' 7885 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ỏ' 7887 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ụ' 7909 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ủ' 7911 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ỳ' 7923 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ỵ' 7925 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: 'ỹ' 7929 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '‎' 8206 is used in translations for language(s): ar,he, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '‏' 8207 is used in translations for language(s): he, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '–' 8211 is used in translations for language(s): bg,fr,kk,lt,nn,se, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '—' 8212 is used in translations for language(s): be,nn,ru,ug,uk,vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '‘' 8216 is used in translations for language(s): is, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '’' 8217 is used in translations for language(s): fr,he,nn,oc,tg,uk, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '‚' 8218 is used in translations for language(s): he,is, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '‟' 8223 is used in translations for language(s): is, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '•' 8226 is used in translations for language(s): vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '…' 8230 is used in translations for language(s): ar,ast,bg,de,fa,fr,he,kab,nb,nn,ru,se,ug,uk,vi, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '‪' 8234 is used in translations for language(s): ar,fa,he, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '‫' 8235 is used in translations for language(s): ar,he, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '‬' 8236 is used in translations for language(s): ar,fa,he, but not mentioned in any build/needed-chars/*.utf
E: Glyph: ' ' 8239 is used in translations for language(s): oc, but not mentioned in any build/needed-chars/*.utf
E: Glyph: '↵' 8629 is used in translations for language(s): ka,oc, but not mentioned in any build/needed-chars/*.utf
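
Several of the glyphs reported above are invisible or hard to tell apart when printed (for example codes 173, 8206 and 8239). The following is a minimal sketch, not part of the attached script, that turns one report line into a U+XXXX label, a Unicode name, a general category and the list of languages. The names `REPORT_LINE` and `describe` are hypothetical, and the pattern assumes the exact message format shown above.

```python
import re
import unicodedata

# Assumed format of one report line (taken from the output above):
# E: Glyph: '<char>' <decimal code> is used in translations for language(s): <a,b,c>, but not ...
REPORT_LINE = re.compile(
    r"^E: Glyph: '.*' (\d+) is used in translations for language\(s\): ([^ ]+), but"
)

def describe(line):
    """Return (label, name, category, languages) for a report line, or None."""
    m = REPORT_LINE.match(line)
    if m is None:
        return None  # Not a report line (e.g. part of the system information)
    char = chr(int(m.group(1)))
    label = f"U+{ord(char):04X}"
    name = unicodedata.name(char, "<unnamed>")
    category = unicodedata.category(char)  # e.g. Mn = combining mark, Cf = format
    return (label, name, category, m.group(2).split(","))

# Usage: the character between the quotes is written as an escape here,
# because it is invisible in most terminals.
sample = ("E: Glyph: '\u200e' 8206 is used in translations for language(s): "
          "ar,he, but not mentioned in any build/needed-chars/*.utf")
print(describe(sample))
```

The category makes it easier to separate combining marks and bidirectional control characters from ordinary letters when deciding which glyphs a font really has to provide.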
import re

# Generate a list of all characters that are used in translations in udeb files
#
# How to run this script:
# 1) cd path_of_git_workdirectory_of_debian-installer
# 2) find mount_point_of_installer_image -name "*.udeb" | awk -e '{ print "dpkg-deb --control ", $1; print "if [ -e DEBIAN/templates ]; then cat DEBIAN/templates >> collect; fi"; print "rm -fr DEBIAN" }' | sh
# 3) cat build/needed-characters/*.utf > all.utf
# 4) python3 this_script.py
# Carefully evaluate the proposed modifications in build/needed-characters

write_to_file = False
dump_to_console = True
report_missing_glyphs = True
report_missing_glyphs_for_languages_with_many_glyphs = False

chars = {}
chars["all"] = set()

# Characters that are already listed in the *.utf files (see step 3)
with open("all.utf", "r", encoding="utf-8") as file:
	content = file.read()
chars["all_x_utf"] = set()
for char in content:
	if ord(char) >= 128: # Add only non-ASCII characters
		chars["all_x_utf"].add(char)

with open("collect", "r", encoding="utf-8") as file:
	content = file.read()

lines = content.split("\n")

language = "C"
for line in lines:
	# Sample:
	# Description-am.UTF-8: የሚጫኑ የተካይ አካሎች፦
	match = re.split(r"\w+-([a-zA-Z@_]+)\.UTF-8: (.*)", line)
	if len(match) > 2: # A translated text
		language = match[1]
		translation = match[2]
	elif line.startswith(" "): # Extended description
		translation = line[1:]
	else: # Not for translation -> reset
		language = "C"
		translation = ""
	for char in translation:
		# Debug part to find which translated text contains a specific character
		#if language == "ka" and char == '“':
		#	print(line)
		if ord(char) >= 128: # Add only non-ASCII characters
			# This is the (manually maintained) list of all *.utf files in build/needed-characters
			# Currently: No translations are provided for: graphic, ky, os
			if report_missing_glyphs_for_languages_with_many_glyphs or language in ['ar', 'ast', 'be', 'bg', 'cs', 'cy', 'da', 'de', 'el', 'eo', 'fa', 'fi', 'fr', 'gl', 'graphic', 'he', 'hr', 'is', 'kab', 'ka', 'kk', 'ku', 'ky', 'lt', 'nb', 'nl', 'nn', 'oc', 'os', 'pl', 'pt', 'ro', 'ru', 'se', 'sr@latin', 'sr', 'sv', 'tg', 'th', 'tl', 'tr', 'ug', 'uk', 'vi', 'wo']:
				if language not in chars:
					chars[language] = set()
				chars[language].add(char)
				chars["all"].add(char)


if write_to_file:
	for language in sorted(chars):
		if language not in ['all', 'all_x_utf']:
			with open("build/needed-characters/" + language + ".utf", "w", encoding="utf-8") as file:
				file.write(''.join(sorted(chars[language])))

if dump_to_console:
	for language in sorted(chars):
		print(f"Language: {language}")
		print(f"Characters: {''.join(sorted(chars[language]))}")

if report_missing_glyphs:
	for char in sorted(chars['all']):
		if char not in chars['all_x_utf']:
			languages_with_missing_glyphs = set()
			for language in sorted(chars):
				if char in chars[language] and language != 'all':
					languages_with_missing_glyphs.add(language)
			print(f"E: Glyph: '{char}' {ord(char)} is used in translations for language(s): {','.join(sorted(languages_with_missing_glyphs))}, but not mentioned in any build/needed-chars/*.utf")
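
The template parsing in the script above can also be isolated into a small function, which makes it easier to try on a few lines without the 'collect' file. This is only a sketch under assumptions: the names `TEMPLATE_FIELD` and `translated_chars` are hypothetical, the pattern is anchored at the start of the line (the script uses re.split, which matches anywhere in the line), and continuation lines of untranslated fields are skipped instead of being assigned to "C".

```python
import re

# A translated field looks like "Description-am.UTF-8: <text>"
TEMPLATE_FIELD = re.compile(r"\w+-([a-zA-Z@_]+)\.UTF-8: (.*)")

def translated_chars(lines):
    """Yield (language, char) for each non-ASCII character in translated fields."""
    language = None
    for line in lines:
        m = TEMPLATE_FIELD.match(line)
        if m:  # Start of a translated field
            language, text = m.group(1), m.group(2)
        elif line.startswith(" ") and language is not None:
            text = line[1:]  # Extended description of the current translation
        else:  # Not for translation -> reset
            language, text = None, ""
        for char in text:
            if ord(char) >= 128:  # Only non-ASCII characters
                yield language, char

# Usage with a few hand-written lines
example = [
    "Description: Size",
    "Description-de.UTF-8: Gr\u00f6\u00dfe",
    " f\u00fcr",
    "",
]
for language, char in translated_chars(example):
    print(language, f"U+{ord(char):04X}")
```

Collecting the yielded pairs into one set per language gives the same kind of data that the script stores in its chars dictionary.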
