Re: Tool to show maximal repeating patterns / structure in (text?) data

To: debian-user@lists.debian.org
Subject: Re: Tool to show maximal repeating patterns / structure in (text?) data
From: Dave Sherohman <dave@sherohman.org>
Date: Sun, 13 Jul 2008 09:26:12 -0500
Message-id: <20080713142612.GA1114@sherohman.org>
Mail-followup-to: Dave Sherohman <dave@liszt.debian.org>, debian-user@lists.debian.org
In-reply-to: <b88f52540807130205h6e89bfeam714aac9d8b00e465@mail.gmail.com>
References: <b88f52540807130205h6e89bfeam714aac9d8b00e465@mail.gmail.com>

On Sun, Jul 13, 2008 at 10:05:23AM +0100, j t wrote:
> I might be butting up against the edge of what's theoretically
> possible ("computer science"-wise) but I think that my requirements
> have something to do with lossless compression algorithms. Perhaps I
> should start reading the source code for gzip/bzip2...?

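You're on the right track here, at least for getting as far as detecting
maximal-length identical strings.  As I recall, the part you want is the
LZ77-style string matching that gzip does before its Huffman coding
stage; Huffman coding itself only assigns shorter codes to more frequent
symbols and doesn't find repeated strings.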
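
To make "maximal-length identical strings" concrete, here is a rough
Python sketch that finds the longest substring occurring at least twice
in a piece of text.  It is a brute-force illustration of the idea, not
how gzip or bzip2 actually do it, and the function name is just
something I made up for the example.

```python
def longest_repeated_substring(text):
    """Return the longest substring that occurs at least twice in text.

    Occurrences are allowed to overlap. Returns "" if nothing repeats.
    """
    # Sort all suffixes; any repeated substring is a common prefix of
    # two suffixes, and the longest one shows up between neighbours
    # in sorted order.
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # Measure the common prefix of two neighbouring suffixes.
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        if n > len(best):
            best = a[:n]
    return best
```

This copies every suffix, so memory use grows with the square of the
input length; it is fine for experimenting on small samples but not for
large files.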

Another place to look would be search indexing algorithms.  I used to
know a guy who'd done graduate work in that area and, from talking to
him about it, it sounded like finding repeated strings is one of the key
techniques there.

Although, if you're just looking for identical log entries (rather than
arbitrary repeated segments in freeform text), using awk/sed to strip
out timestamps, then feeding the result through `sort | uniq -cd` should
handle that case.  (There are already standard log analysis packages
which do essentially this, but I can't think of any names at the
moment.)
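
If a script is easier to adapt than a pipeline, the same idea (strip
the timestamp, then count identical lines and keep only the duplicates)
can be sketched in Python.  This assumes syslog-style timestamps like
"Jul 13 09:26:12 " at the start of each line; the pattern and the
function name are my own assumptions, so adjust them for your logs.

```python
import re
from collections import Counter

# Assumed format: a syslog-style timestamp at the start of each line,
# e.g. "Jul 13 09:26:12 " or "Jul  3 09:26:12 ". Change to suit.
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")

def repeated_entries(lines):
    """Return (count, entry) pairs for entries seen more than once.

    Timestamps are stripped before comparing, so entries that differ
    only in their timestamp are counted together. Most frequent first.
    """
    counts = Counter(
        TIMESTAMP.sub("", line.rstrip("\n")) for line in lines
    )
    return [(n, entry) for entry, n in counts.most_common() if n > 1]
```

Passing it an open log file (or any iterable of lines) gives roughly
what `sort | uniq -cd` would print, with the most frequent entries
first.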

-- 
News aggregation meets world domination.  Can you see the fnews?
http://seethefnews.com/
