Re: Endianness of data files in MultiArch (was: Please test gzip -9n - related to dpkg with multiarch support)
Sorry, the thread was broken and I saw your reply just now.
On Thu, Feb 9, 2012 at 16:23, Jan Hauke Rahm <email@example.com> wrote:
> On Thu, Feb 09, 2012 at 01:58:28AM +0800, Aron Xu wrote:
>> This is valid for widely used applications/formats like gettext and
>> images that are designed to behave this way, but on the other hand
>> there are upstreams that don't want such a change, especially because
>> of the added complexity and the performance impact.
>> Currently I am using arch:any for data files which aren't affected by
>> multiarch, i.e. neither "same" nor "foreign". For endianness-critical
>> data that is required to make a library work, I have to force it to be
>> installed into /usr/lib/<triplet>/$package/data/ and mark the package
>> as "Multi-Arch: same". This is sufficient to avoid breakage, but again
>> it consumes a lot of space on the mirrors.
> Actually, what is "a lot" here? I mean, how many libraries are there
> containing endianness-critical data and how big are the actual files?
> Not that I'm any kind of expert, but this solution sounds reasonable to
> me.
As far as I know, there aren't many libraries known to ship
endianness-critical data, but there may be landmines simply because
the maintainers aren't aware of the issue.
I happened to notice this problem because my team maintains several
input method stacks, which usually need to deal with linguistic data.
Right now I have a library named libpinyin at hand to package, which
ships data files of ~7.5MiB after gzip -9 (the whole library is no
more than 9MiB after gzip -9). We have 14 architectures on ftp-master,
so the data files alone eat up 7.5MiB x 14 = 105MiB, while if we found
some way to keep only one copy per endianness, the two copies (be/le)
would use just 15MiB. And think about what happens once it is released
as stable: yet another copy of that data makes its way into the archive
every time a new version is uploaded.
The same concern applies to other endianness-critical data that isn't
touched by Multi-Arch at present: we have to make it arch:any, and in
the end it eats up more and more space.
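For reference, the workaround I described above looks roughly like
this in packaging terms (a minimal sketch; the package name and file
paths are invented for illustration):

  # debian/control (relevant fields of the hypothetical data package)
  Package: libpinyin-data
  Architecture: any
  Multi-Arch: same

  # debian/libpinyin-data.install -- the triplet shown is only an
  # example; in practice it has to be substituted at build time,
  # e.g. via dh-exec or $(shell dpkg-architecture -qDEB_HOST_MULTIARCH)
  data/pinyin.bin usr/lib/x86_64-linux-gnu/libpinyin/data/

With Architecture: any plus Multi-Arch: same, dpkg will happily
co-install one copy per architecture, which is exactly where the
14-fold duplication comes from.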
Performance is critical for these applications. That doesn't mean
they consume a lot of CPU overall, but they must respond very quickly
to the user's input: splitting a sentence into words and finding a
list of the most relevant suggestions involves some complex
calculations, and a single such action means querying 10^5 ~ 10^6
lines of data several times. There were projects that tried to use
something like SQLite3, but the performance was rather frustrating,
so they have now decided to stop worrying about that and simply
design a data format that fits their requirements.
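To make the endianness problem concrete: such custom formats are
typically fixed-layout binary files that are mmap'ed and searched in
place, so integers are read in whatever byte order they were written.
Here is a minimal sketch in C of a loader that detects the file's
byte order from a magic number and swaps on mismatch, so a single
arch-independent copy could serve both be and le hosts (the header
layout and magic value are invented for illustration):

  #include <stdint.h>
  #include <stdio.h>

  /* Invented example header: the file declares its own byte order
   * via the magic number, so one copy can serve both endiannesses. */
  #define MAGIC 0x50594442u  /* "PYDB" in the producer's byte order */

  struct header {
      uint32_t magic;
      uint32_t n_records;
  };

  static uint32_t bswap32(uint32_t v)
  {
      return (v >> 24) | ((v >> 8) & 0x0000ff00u)
           | ((v << 8) & 0x00ff0000u) | (v << 24);
  }

  int main(int argc, char **argv)
  {
      if (argc != 2)
          return 1;
      FILE *f = fopen(argv[1], "rb");
      if (!f)
          return 1;

      struct header h;
      if (fread(&h, sizeof h, 1, f) != 1) {
          fclose(f);
          return 1;
      }

      int swapped;
      if (h.magic == MAGIC)
          swapped = 0;            /* file matches host byte order */
      else if (bswap32(h.magic) == MAGIC)
          swapped = 1;            /* other endianness: swap each field */
      else {
          fclose(f);
          return 1;               /* not our format */
      }

      uint32_t n = swapped ? bswap32(h.n_records) : h.n_records;
      printf("%u records (byte-swapped: %s)\n",
             n, swapped ? "yes" : "no");
      fclose(f);
      return 0;
  }

And this is precisely the trade-off upstreams object to: either every
integer access pays a potential swap, or the data is converted once at
load time, which costs startup time and memory for files this large.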