[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: These new diffs are great, but...

Florian Weimer <fw@deneb.enyo.de> writes:

> * Goswin von Brederlow:
>>> However, patching rred to apply patches in a single run would be a
>>> good start because all further optimizations will need it.
>> Why should the number of chunks matter?
> If you use the naïve algorithm, it does.  But rred implements
> something more involved, leading to a O(patch-files * lines + changes)
> complexity (or something very similar).
>> What matters is reading, parsing and writing the file O(lines) and
>> then the number of changes (lines of changes) O(changes). Combined
>> this gives O(lines + changes) if the file is read once at the start
>> and then all patches are applied.
> I guess you'll find it very hard to get down to O(lines + changes). 8-)
> Either you need to make multiple passes over the original file (like
> the current code does), or you need to combine patch files before
> applying them.  The latter involves some kind of sorting, and unless
> you postulate an upper bound on the number of lines in a file, you'll
> end up with super-linear complexity.

Each change in the patches is an insert line, remove line or replace
line. Thinking about it I see that insert/remove is a problem if you
have an array of line pointers. But with a tree you can still
insert/remove/replace each line in O(log lines) or O(changes * log
lines) in total, which is better than O(lines * chunks).

Or you do keep an array of line pointers and copy that for every chunk
you process. Copying 500000 pointers on every chunk doesn't sound too
slow. The theoretical complexity would be O(lines * chunks) but the
const factor should be way low.

>> You can do that by combining the individual patch files or by teaching
>> rred to do a single run with multiple patch files. Same result. Both
>> solve the problem of O(lines * chunks + changes) complexity.
> The real problem is that APT's architecture doesn't allow rred to look
> at all patches at once.  At least that's what I've been told.

True. The apt methods aren't designed to process multiple files at
once. You can't use something designed for gunziping a single file to
combine multiple patches into a new Packages file. You have to design
something new.

>> As for using pointers to lines and shuffeling them that seems to be
>> the only sane thing to do. All patch operations are line based so it
>> is essential that a line can be found and replaced in O(1). A simple
>> array of pointers to lines solves that.
> Inserting in the middle of an array is not a constant-time operation.
> That's why the naïve algorithm is slow.

Yeah. I noticed that too now. As said above use a tree [O(log n) per
operation] or copy the array [O(n) total per pass]. Copying a pointer
to a line is better than copying each line since the hidden constant
is way lower.


Reply to: