Re: Tool to show maximal repeating patterns / structure in (text?) data
On Sun, Jul 13, 2008 at 10:05:23AM +0100, j t wrote:
> I might be butting up against the edge of what's theoretically
> possible ("computer science"-wise) but I think that my requirements
> have something to do with lossless compression algorithms. Perhaps I
> should start reading the source code for gzip/bzip2...?
You're on the right track here, at least for getting as far as detecting
maximal-length identical strings. As I recall, Huffman encoding should
be what you're looking for.
Another place to look would be search indexing algorithms. I used to
know a guy who'd done graduate work in that area and, from talking to
him about it, it sounded like this is one of their key techniques.
Although, if you're just looking for identical log entries (rather than
arbitrary repeated segments in freeform text), using awk/sed to strip
out timestamps, then feeding the result through `sort | uniq -cd` should
handle that case. (There are already standard log analysis packages
which do essentially this, but I can't think of any names at the
News aggregation meets world domination. Can you see the fnews?