Re: Endianness of data files in MultiArch
On Fri, Feb 10, 2012 at 19:59, Goswin von Brederlow <email@example.com> wrote:
> Aron Xu <firstname.lastname@example.org> writes:
>> Sorry, the thread was broken and I saw your reply just now.
>> On Thu, Feb 9, 2012 at 16:23, Jan Hauke Rahm <email@example.com> wrote:
>>> On Thu, Feb 09, 2012 at 01:58:28AM +0800, Aron Xu wrote:
>>>> This is valid for most-used applications/formats like gettext, images
>>>> that are designed to behave in this way, but on the contrary there are
>>>> upstreams that don't like to see such impact, especially due to the
>>>> complexity and performance impact.
>>>> Currently I am using arch:any for data files which aren't affected
>>>> by multiarch, i.e. not "same" or "foreign". For endianness-critical
>>>> data that is required to make a library work, I have to force it
>>>> to be installed into /usr/lib/<triplet>/$package/data/ and mark it
>>>> as "Multi-Arch: same". This is sufficient to avoid breakage, but
>>>> again it consumes a lot of space on the mirrors.
>>> Actually, what is "a lot" here? I mean, how many libraries are there
>>> containing endianness-critical data and how big are the actual files?
>>> Not that I'm any kind of expert, but this solution sounds reasonable to
>> As far as I know, there aren't too many libraries known to have
>> endianness-critical data, but there might be landmines because the
>> maintainers just aren't aware of it.
>> I have the chance to notice this problem because my team maintains
>> several stacks of input methods, which usually need to deal with
>> linguistic data.
>> For example, I have a library named libpinyin at hand to package,
>> which has data files of ~7.5MiB after gzip -9 (the total size of this
>> library is no more than 9MiB after gzip -9). We have 14 architectures
>> on ftp-master, so the data files eat up 105MiB, while if we find some
>> way to have only one copy each for be/le, they'll only use 15MiB. And
>> think about when it gets released in a stable: a new copy of that
>> data makes its way to the archive when a new version gets uploaded to
>> The same concern applies to other endianness-critical data that is
>> not affected by Multi-Arch at present: we need to make it arch:any
>> and in the end it eats more and more space.
>> Performance is critical for these applications. This doesn't mean
>> they consume a large share of CPU, but they must respond very quickly
>> to the user's input: do some complex calculations to split a sentence
>> into words and find a list of the most related suggestions, which
>> needs to query 10^5 ~ 10^6 lines of data several times to
>> complete such an action. There was a project that tried to use
>> something like SQLite3 but the performance was a bit frustrating, so
>> they have now decided not to care about that but just design a data
>> format that fits their requirements.
>> Aron Xu
> It doesn't sound like the data is too big to fit into RAM and it sounds
> like the overhead to fetch data from disk on demand would slow you
> down. So there seems to be no reason not to have architecture
> independent data on disk and convert it to the right endianness on
> startup. Sure
> startup time would increase a bit but running time would remain
Well, bear in mind that the size is for compressed data. Decompressed
data is usually much larger. It compresses about as well as plain
text, so decompressing the 7.5MiB of data gives you 22MiB on hard disk.
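To put the sizes mentioned in this thread side by side, here is a
back-of-the-envelope calculation. It only restates the figures already
quoted (7.5MiB compressed, 22MiB decompressed, 14 architectures); the
variable names are made up for illustration and nothing here is
measured.

```python
# Figures quoted in this thread; none of them are measured here.
DATA_GZ_MIB = 7.5     # libpinyin data files after gzip -9
DATA_RAW_MIB = 22     # the same data decompressed on disk
ARCHITECTURES = 14    # architectures on ftp-master
BYTE_ORDERS = 2       # one big-endian copy plus one little-endian copy

# Mirror cost with one copy per architecture (arch:any).
per_arch_total = ARCHITECTURES * DATA_GZ_MIB

# Mirror cost if a single copy per byte order were possible.
per_endian_total = BYTE_ORDERS * DATA_GZ_MIB

saved = per_arch_total - per_endian_total
expansion = DATA_RAW_MIB / DATA_GZ_MIB  # decompressed / compressed

print(f"one copy per architecture: {per_arch_total} MiB")
print(f"one copy per byte order:   {per_endian_total} MiB")
print(f"difference:                {saved} MiB")
print(f"expansion on disk:         {expansion:.2f}x")
```

The mirror figures count a single version only; every additional
version kept in the archive multiplies them again.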
22MiB by itself does not look too large to fit into RAM, but I'll
explain why it won't fit in practice. Usually an input method framework
carries many different input methods (it's easier to think of them as
different algorithms), and users are able to switch between them on the
fly with a mouse click or keyboard shortcut. Different input methods
have different data, so with three installed (this number is below the
average), it usually needs more than 50MiB of data.
Hmm, 50MiB still does not seem large enough. Linguistic data
distributed under a free license is rare compared to data provided
under non-free licenses, and usually its quality and amount are
lower/smaller than the non-free ones. Users can download such data
(free to download and use, but not distributable), and use tools
provided by the input method to convert the format. This results in
10^6 lines of data, nearly 100MiB in size. At this point it looks
rational not to put it all into RAM.
Apart from the above reasons, switching among input methods also
requires a very quick response. It's hard to imagine clicking to switch
to another input method and then having to wait for a couple of seconds
(or even minutes). The operation must complete in a reasonably short
time (<1s) and not cost many resources (users don't want to see their
CPU usage jump to 200% simply from switching between input methods).
> So unless the program is restarted for every input (which would be the
> first thing to eliminate to improve responsiveness) there shouldn't be a
> problem with "fixing" this. It just means extra work you might not be
> willing (or have time) to invest.
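Just to make the proposal concrete, a minimal sketch of "one
architecture independent file, converted on load" could look like the
following. The file layout (a flat run of 32-bit unsigned integers in
one fixed byte order) and the name load_u32_table are assumptions for
illustration only; this is not libpinyin's real format or API.

```python
import sys
from array import array


def load_u32_table(path, stored_order="little"):
    """Load a flat file of 32-bit unsigned ints into host byte order.

    Hypothetical layout: the file is nothing but consecutive 4-byte
    integers, all written in `stored_order` ("little" or "big").
    """
    table = array("I")
    if table.itemsize != 4:
        raise RuntimeError("this sketch assumes a 4-byte unsigned int")
    with open(path, "rb") as f:
        table.frombytes(f.read())  # raw bytes, still in on-disk order
    if sys.byteorder != stored_order:
        table.byteswap()           # one in-place pass over the whole table
    return table
```

On a host whose byte order matches the file, the swap is skipped; on
the other hosts the whole table is read and rewritten once per load.
Whether that single pass fits the <1s budget for switching input
methods with ~100MiB of data is not something this sketch measures.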
> PS: ia32-libs is about 1GB and is going away. So there should be space
> now for 10 more sources like yours. :)
I am sure they will be eaten up once Wheezy is released. ;-)