
Tool to show maximal repeating patterns / structure in (text?) data



Hi all,

Does anyone know of a tool which will analyse a block of data and find
structure / repeating patterns in it, and then somehow show that
structure to the user?

As an example, pretend I give it the following paragraph of text (but
I don't tell it that the following paragraph contains a string
repeated 4 times):

<snip>
Support for Debian users who Support for Debian users who Support for
Debian users who Support for Debian users who
</snip>

I'd like this tool to tell me that the previous paragraph contains the
string "Support for Debian users who " 4 times (and I'd like the tool
to have worked that out on its own).
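
For the trivial case above, a minimal sketch in Python (not an existing
tool; the function name is made up for illustration) could look like
the following. It assumes the whole input is one unit repeated back to
back, and it compares whitespace-separated words rather than
characters, so the line break in the middle of the sample does not
matter. As a side effect the reported unit has no trailing space.

```python
def smallest_repeating_unit(text):
    """Return (unit, count) where unit is the shortest word sequence
    that, repeated count times, reproduces the whole text.

    Assumption: the input consists only of the repeated unit. Text with
    unrelated material around the repeats is returned with count 1.
    """
    words = text.split()  # collapses spaces and newlines
    n = len(words)
    for period in range(1, n + 1):
        # A candidate period must divide the word count evenly and
        # rebuild the full word list when repeated.
        if n % period == 0 and words[:period] * (n // period) == words:
            return " ".join(words[:period]), n // period
    return "", 0  # empty input


sample = (
    "Support for Debian users who Support for Debian users who Support for\n"
    "Debian users who Support for Debian users who\n"
)
unit, count = smallest_repeating_unit(sample)
print(repr(unit), count)  # 'Support for Debian users who' 4
```

This only covers the exact back-to-back case. Finding repeats buried in
other text needs something stronger, such as a suffix array or suffix
tree.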

I realize that this example is trivial. I'd also like this tool to do
things which are more complicated, but since I can't find anything
that even helps me with my previous example, that will do for the time
being.

To preemptively answer the question "why do you want it / what is it
you're trying to achieve": I have a log of a DHCP conversation which
contains what I think is a repeated DHCPDISCOVER stanza. Rather than
the manual copy/paste/diff cycle, I'd like this tool to look at the
log and tell me: "Yup, you've got a stanza/paragraph repeated 4
times".
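
For the log case, a rough sketch of the counting step in Python might
be as follows. It rests on assumptions not stated above: that stanzas
are separated by blank lines, and that repeated stanzas are
byte-for-byte identical (no differing timestamps or transaction IDs).
The function name is hypothetical.

```python
import re
from collections import Counter


def repeated_stanzas(log_text):
    """Return [(stanza, count), ...] for every stanza seen more than once.

    Assumption: stanzas are blocks of lines separated by blank lines.
    """
    stanzas = [s.strip() for s in re.split(r"\n\s*\n", log_text) if s.strip()]
    counts = Counter(stanzas)
    # most_common() sorts by count, highest first
    return [(stanza, n) for stanza, n in counts.most_common() if n > 1]


if __name__ == "__main__":
    import sys

    # Usage: python repeated_stanzas.py < logfile
    for stanza, n in repeated_stanzas(sys.stdin.read()):
        print(f"--- repeated {n} times ---")
        print(stanza)
```

If the stanzas differ only in fields such as timestamps, those fields
would have to be stripped or masked before counting.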

I might be butting up against the edge of what's theoretically
possible ("computer science"-wise) but I think that my requirements
have something to do with lossless compression algorithms. Perhaps I
should start reading the source code for gzip/bzip2...?
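
The link to compression can at least be observed without reading any
source code. The sketch below uses Python's standard zlib module (the
same DEFLATE family as gzip) to compare the compressed size of one copy
of the sample string against four copies. It only shows that
repetition is present; it does not report what is repeated. The helper
name is made up for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


unit = b"Support for Debian users who "
one = compressed_size(unit)
four = compressed_size(unit * 4)

print("raw sizes:       ", len(unit), len(unit * 4))
print("compressed sizes:", one, four)
# The three extra copies cost far less than three more units would,
# because the compressor encodes them as a back-reference to the first.
```

A compressor finds such repeats internally while encoding, but the
usual command-line tools do not expose them to the user.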

Thanks for your help, Jaime :-)

