Re: Endianness of data files in MultiArch

To: Aron Xu <happyaron.xu@gmail.com>
Cc: Jan Hauke Rahm <jhr@debian.org>, debian-devel@lists.debian.org
Subject: Re: Endianness of data files in MultiArch
From: Goswin von Brederlow <goswin-v-b@web.de>
Date: Fri, 10 Feb 2012 12:59:15 +0100
Message-id: <87fweihpbg.fsf@frosties.localnet>
In-reply-to: <CAMr=8w6qiM6VB_2iegzKMFx=TV+ert6LqELY6NAoqfpAco-6oQ@mail.gmail.com> (Aron Xu's message of "Fri, 10 Feb 2012 12:29:47 +0800")
References: <CAMr=8w494XG1bWJ3LR5rQnjrgRcUNG-E6igQB+xT6BDygPrNaA@mail.gmail.com> <4F32B26F.8050200@debian.org> <CAMr=8w6s+itAP8Usgjaqf86MFFypAOp+QJODeTjhdYuMb7AYmw@mail.gmail.com> <20120209082314.GA3215@ca.home.jhr-online.de> <CAMr=8w6qiM6VB_2iegzKMFx=TV+ert6LqELY6NAoqfpAco-6oQ@mail.gmail.com>

Aron Xu <happyaron.xu@gmail.com> writes:

> Sorry, the thread was broken and I saw your reply just now.
>
> On Thu, Feb 9, 2012 at 16:23, Jan Hauke Rahm <jhr@debian.org> wrote:
>> On Thu, Feb 09, 2012 at 01:58:28AM +0800, Aron Xu wrote:
>>>
>>> This is valid for widely used applications and formats such as gettext
>>> and images, which are designed to behave this way. On the other hand,
>>> there are upstreams that don't want to do this, mainly because of the
>>> added complexity and the performance impact.
>>>
>>> Currently I am using arch:any for data files which aren't affected
>>> by multiarch, i.e. not "same" or "foreign". For endianness-critical
>>> data that is required to make a library work, I have to force it
>>> to be installed into /usr/lib/<triplet>/$package/data/ and mark it
>>> as "Multi-Arch: same". This is sufficient to avoid breakage, but again
>>> it consumes a lot of space on the mirrors.
>>
>> Actually, what is "a lot" here? I mean, how many libraries are there
>> containing endianness-critical data and how big are the actual files?
>> Not that I'm any kind of expert, but this solution sounds reasonable to
>> me.
>>
>> Hauke
>>
>
> As far as I know, there aren't many libraries known to have
> endianness-critical data, but there might be landmines because the
> maintainers just aren't aware of it.
>
> I had the chance to notice this problem because my team maintains
> several stacks of input methods, which usually need to deal with
> linguistic data. [1]
>
> I have a library named libpinyin at hand to package, which has
> about 7.5MiB of data files after gzip -9 (the total size of this
> library is no more than 9MiB after gzip -9). We have 14 architectures
> on ftp-master, so the data files eat up 105MiB, while if we find some
> way to have only one copy each for be/le, it'll only use 15MiB. And
> consider that once it is released in stable, another copy of the data
> makes its way into the archive whenever a new version gets uploaded to
> unstable.
>
> The same concern applies to other endianness-critical data that is
> not affected by Multi-Arch at present: we need to make it arch:any,
> and in the end it eats more and more space.
>
> [1] Performance is critical for these applications. This doesn't mean
> they consume a lot of CPU time, but they must respond very quickly
> to the user's input: they do some complex calculations to split a
> sentence into words and find a list of the most relevant suggestions,
> which requires querying 10^5 ~ 10^6 lines of data several times to
> complete one such action. There was a project that tried to use
> something like SQLite3, but the performance was a bit frustrating, so
> they have now decided not to care about that and just design a data
> format that fits their requirements.
> -- 
> Regards,
> Aron Xu

It doesn't sound like the data is too big to fit into RAM, and it sounds
like the overhead of fetching data from disk on demand would slow you
down. So there seems to be no reason not to keep architecture-independent
data on disk and convert it to the right endianness on startup. Sure,
startup time would increase a bit, but running time would remain
unaffected.

So unless the program is restarted for every input (which would be the
first thing to eliminate to improve responsiveness) there shouldn't be a
problem with "fixing" this. It just means extra work you might not be
willing (or have the time) to invest.
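
To illustrate what I mean by converting on startup, here is a minimal
sketch in Python. It is not libpinyin's actual format or code. It assumes
a hypothetical table of unsigned 32-bit values that is always written to
disk in little-endian order, and the names dump_table and load_table are
made up for this example.

```python
import struct

# Assumption for this sketch: the shared data file always stores
# unsigned 32-bit integers in little-endian order, whatever the host is.
ITEM_SIZE = 4


def dump_table(values):
    """Serialize the values in the fixed on-disk (little-endian) order."""
    return struct.pack("<%dI" % len(values), *values)


def load_table(raw):
    """Read the whole table once at startup, in host-usable form.

    The "<" prefix makes struct decode little-endian data correctly on
    both little-endian and big-endian hosts, so one file serves all
    architectures.
    """
    count, remainder = divmod(len(raw), ITEM_SIZE)
    if remainder:
        raise ValueError("table size is not a multiple of 4 bytes")
    return list(struct.unpack("<%dI" % count, raw))


if __name__ == "__main__":
    on_disk = dump_table([1, 0x01020304, 0xFFFFFFFF])
    print(load_table(on_disk))
```

The conversion cost is paid once when the table is loaded; every lookup
after that works on the in-memory copy.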

MfG
        Goswin

PS: ia32-libs is about 1GB and is going away. So there should be space
now for 10 more sources like yours. :)
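
For what it's worth, the numbers in this thread can be checked quickly.
The sketch below assumes that "about 1GB" means 1024 MiB and that a
shared layout would need exactly two copies (one big-endian, one
little-endian); the variable names are made up for illustration.

```python
DATA_MIB = 7.5        # compressed data files per architecture (from the thread)
ARCHITECTURES = 14    # architectures on ftp-master (from the thread)
IA32_LIBS_MIB = 1024  # assumption: "about 1GB" taken as 1024 MiB

per_arch_total = DATA_MIB * ARCHITECTURES  # one copy per architecture
per_endianness_total = DATA_MIB * 2        # one big-endian, one little-endian copy
packages_that_fit = IA32_LIBS_MIB / per_arch_total

print(per_arch_total, per_endianness_total, round(packages_that_fit, 2))
```

So the freed space covers a little under ten sources of that size.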
