
Re: Endianness of data files in MultiArch



Hi,

On Fri, Feb 10, 2012 at 12:59:15PM +0100, Goswin von Brederlow wrote:
> Aron Xu <happyaron.xu@gmail.com> writes:
> 
> > Sorry, the thread was broken and I saw your reply just now.
> >
> > On Thu, Feb 9, 2012 at 16:23, Jan Hauke Rahm <jhr@debian.org> wrote:
> >> On Thu, Feb 09, 2012 at 01:58:28AM +0800, Aron Xu wrote:
> >>>
> >>> This is valid for widely used applications/formats like gettext
> >>> and images that are designed to behave this way, but on the other
> >>> hand there are upstreams that don't want such changes, especially
> >>> given the complexity and performance impact.
> >>>
> >>> Currently I am using arch:any for data files which aren't affected
> >>> by multiarch, i.e. not "same" or "foreign". Endianness-critical
> >>> data that is required to make a library work I have to force to be
> >>> installed into /usr/lib/<triplet>/$package/data/ and mark as
> >>> "Multi-Arch: same"; this is sufficient to avoid breakage, but
> >>> again it consumes a lot of space on the mirrors.
> >>
> >> Actually, what is "a lot" here? I mean, how many libraries are there
> >> containing endianness-critical data and how big are the actual files?
> >> Not that I'm any kind of expert, but this solution sounds reasonable to
> >> me.
> >>
> >> Hauke
> >>
> >
> > As far as I know, there aren't too many libraries known to have
> > endianness-critical data, but there may be landmines because the
> > maintainers just aren't aware of the issue.
> >
> > I had the chance to notice this problem because my team maintains
> > several input method stacks, which usually need to deal with
> > linguistic data. [1]
> >
> > For me, there is a library named libpinyin at hand to package,
> > which has data files of ~7.5 MiB after gzip -9 (the total size of
> > this library is no more than 9 MiB after gzip -9). We have 14
> > architectures on ftp-master, so the data files eat up 105 MiB,
> > whereas if we found some way to keep only one copy each for be/le,
> > they would use only 15 MiB. And think about when it gets released
> > as stable: a new copy of all that data makes its way into the
> > archive whenever a new version is uploaded to unstable.

Just think of any phrase data that stores its content sizes as 16-bit
integers.
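
To illustrate, here is a minimal sketch in C of such a reader, with a
hypothetical record layout (a 16-bit length followed by the phrase
bytes - not any real package's format).  Read on a host with the
opposite byte order, the length field alone breaks the parse:

    #include <stdint.h>
    #include <stdio.h>

    static uint16_t swap16(uint16_t v)
    {
        return (uint16_t)((v >> 8) | (v << 8));
    }

    /* Read one record: a 16-bit length, then that many bytes of text.
     * "swapped" says whether the file's byte order differs from the
     * host's; without the swap, a length of 7 reads back as 1792. */
    int read_phrase(FILE *fp, char *buf, size_t bufsize, int swapped)
    {
        uint16_t len;
        if (fread(&len, sizeof len, 1, fp) != 1)
            return -1;
        if (swapped)
            len = swap16(len);
        if ((size_t)len + 1 > bufsize || fread(buf, 1, len, fp) != len)
            return -1;
        buf[len] = '\0';
        return len;
    }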

I have a bigger example :-)

ipadic: Uncompressed size: 44.5 M

For this one, I made it arch:any, building many binary packages.
Similar packages use an install-time conversion trick to keep
themselves arch:all (sketched below), but that conversion makes
installation take time.

naist-jdic: Uncompressed size: 28.5 M (based on my vague memory)
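
For reference, the install-time trick amounts to shipping one
canonical byte order in an arch:all package and rewriting the data in
host order from the maintainer scripts.  A rough sketch of such a
converter, assuming (purely for illustration) a file that is nothing
but big-endian 16-bit values:

    #include <arpa/inet.h>   /* ntohs() */
    #include <stdint.h>
    #include <stdio.h>

    /* Rewrite a file of big-endian 16-bit values in host byte order.
     * Illustrative only; real dictionaries have richer layouts. */
    int main(int argc, char **argv)
    {
        if (argc != 3)
            return 1;
        FILE *in = fopen(argv[1], "rb");
        FILE *out = fopen(argv[2], "wb");
        if (!in || !out)
            return 1;
        uint16_t v;
        while (fread(&v, sizeof v, 1, in) == 1) {
            v = ntohs(v);   /* no-op on big-endian hosts */
            if (fwrite(&v, sizeof v, 1, out) != 1)
                return 1;
        }
        fclose(in);
        fclose(out);
        return 0;
    }

Run once at install time, every architecture shares the same .deb;
the price is exactly this extra pass, which is what makes the install
slow on large data.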

> > The same concern also applies to other endianness-critical data
> > that is not affected by Multi-Arch at present; we have to make it
> > arch:any, and in the end it eats more and more space.
> >
> > [1] Performance is critical for these applications. This doesn't
> > mean they consume a large share of CPU, but they must respond very
> > quickly to the user's input - doing complex calculations to split
> > a sentence into words and find a list of the most relevant
> > suggestions, which requires querying 10^5 ~ 10^6 lines of data
> > several times to complete a single action. One project tried to
> > use something like SQLite3, but the performance was a bit
> > frustrating, so they have now decided not to bother with that and
> > instead design a data format that fits their requirements.
> > -- 
> > Regards,
> > Aron Xu
> 
> It doesn't sound like the data is too big to fit into RAM, and it
> sounds like the overhead of fetching data from disk on demand would
> slow you down. So there seems to be no reason not to keep
> architecture-independent data on disk and convert it to the right
> endianness on startup. Sure, startup time would increase a bit, but
> running time would remain unaffected.
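
For what it's worth, the conversion Goswin suggests is a single pass
over the data at startup.  A rough sketch, assuming a flat table of
16-bit values kept big-endian on disk (not any particular package's
layout):

    #include <arpa/inet.h>   /* ntohs() */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Load the whole table and swap once at startup; after this one
     * O(n) pass, every lookup runs at native speed. */
    uint16_t *load_table(const char *path, size_t count)
    {
        uint16_t *tbl = malloc(count * sizeof *tbl);
        FILE *fp = fopen(path, "rb");
        if (!tbl || !fp || fread(tbl, sizeof *tbl, count, fp) != count) {
            if (fp)
                fclose(fp);
            free(tbl);
            return NULL;
        }
        fclose(fp);
        for (size_t i = 0; i < count; i++)
            tbl[i] = ntohs(tbl[i]);   /* no-op on big-endian hosts */
        return tbl;
    }

This is why only startup time is affected: once the data is swapped
in memory, the running-time cost is zero.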

I think the PO file case is manageable.  They can use one endianness
for all platforms (the compiled MO format's magic number lets gettext
detect and read either byte order).

But for other generic special-purpose natural language processing
code, it is impossible to force upstream to complicate their code to
use a particular endianness.
 
> So unless the program is restarted for every input (which would be the
> first thing to eliminate to improve responsiveness) there shouldn't be a
> problem with "fixing" this. It just means extra work you might not be
> willing (or have time) to invest.

If we are ready to rewrite the core of such code, you are right.  But
if we simply accept the upstream code design, we will end up putting
multiple copies of such semi-arch-dependent data into the archive as
arch:any.

Osamu

