
Re: Tool to show maximal repeating patterns / structure in (text?) data

On Sun, Jul 13, 2008 at 10:05:23AM +0100, j t wrote:
> I might be butting up against the edge of what's theoretically
> possible ("computer science"-wise) but I think that my requirements
> have something to do with lossless compression algorithms. Perhaps I
> should start reading the source code for gzip/bzip2...?

You're on the right track here, at least as far as detecting
maximal-length identical strings goes.  One nit, though: Huffman coding
only deals with symbol frequencies.  The part of gzip that actually
hunts for repeated strings is its LZ77 dictionary stage, and bzip2 gets
a similar effect from the Burrows-Wheeler transform, so those are the
pieces of the source worth reading.
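
As a rough sanity check (just a sketch, and "mydata.txt" is only a
placeholder file name), you can compare raw size against gzipped size
to see how much repetition that LZ77 stage is finding:

    wc -c < mydata.txt            # raw size in bytes
    gzip -c mydata.txt | wc -c    # compressed size in bytes

If the two are close, there's little repeated structure; a much smaller
compressed size suggests long repeats worth hunting for.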

Another place to look would be search indexing algorithms.  I used to
know a guy who'd done graduate work in that area and, from talking to
him about it, it sounded like finding repeated substrings was one of
their key techniques.

Although, if you're just looking for identical log entries (rather than
arbitrary repeated segments in freeform text), using awk/sed to strip
out the timestamps and then feeding the result through `sort | uniq -cd`
should handle that case.  (There are already standard log analysis
packages which do essentially this, but I can't think of any names at
the moment.)
News aggregation meets world domination.  Can you see the fnews?
