
Re: hard drive indexing and realtime searching



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Brockway wrote:
> Ah ok.  "grep -r" and "strings" to the rescue :)  It would probably not
> be hard to put together a tool which did the equivalent of a "file" [1]
> on each file it found in the regex search and applied the right
> grep/search tool.
> 
> [1] Done internally of course.  forking() a new copy of file for each
> file would be evil.  Yes people do this with find.
> 
> When I need to do this I use grep -r but I admit this misses pdfs which
> I have an increasing amount of information in.

Well, there are techniques (balanced trees, hash tables, etc.) that let
you pre-index all the files on your system, so that searching for text
in files becomes as fast as searching the URL history in your browser,
or as fast as Google (they certainly use very good indexing techniques,
with clustering as their specialty): you type your search text into an
edit box and get the results almost without delay (I suppose < 5 sec,
depending on the hardware and on how good the hash/balanced-tree
implementation is). It's also not very difficult to integrate
fault-tolerant searches. I don't know of any products that do this
(though I'm sure there are some!), because I have only come across this
stuff (and had to learn it) in my computer science classes.
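
To make the idea concrete, here is a minimal sketch of such a pre-built
index, assuming a word-based inverted index kept in a Python dict (which
is a hash table). The names build_index and search are made up for this
example, and the fault-tolerant part simply borrows difflib from the
standard library; a real product would likely work quite differently.

```python
import re
from collections import defaultdict
from difflib import get_close_matches

WORD = re.compile(r"[a-z0-9]+")

def build_index(docs):
    """Map each distinct word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in WORD.findall(text.lower()):
            index[word].add(name)
    return index

def search(index, query, fuzzy=False):
    """Return the names of documents that contain every word of the query."""
    result = None
    for word in WORD.findall(query.lower()):
        if fuzzy and word not in index:
            # Fault tolerance (an assumption of this sketch): fall back
            # to the most similar indexed word, if one is close enough.
            close = get_close_matches(word, list(index), n=1, cutoff=0.8)
            word = close[0] if close else word
        hits = index.get(word, set())
        result = hits if result is None else result & hits
    return set(result) if result else set()

# Hypothetical sample data, standing in for files on disk.
docs = {
    "a.txt": "Balanced trees and hash tables",
    "b.txt": "Hash values for indexing",
}
index = build_index(docs)
```

With this, search(index, "hash trees") only has to look up two dict
entries and intersect them; no file is read at search time, which is
where the speed comes from.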

Disadvantages:
- You need a certain amount of space for the index (for a few
gigabytes of files the index can grow to around 100 MB), and you have
to keep the index up to date. Its size also depends on the number of
distinct words, if you're indexing by words.
- You should choose which files and file types get indexed. If you
don't, you'll get very messy results or a much too large index with no
benefit (binary files like videos normally don't contain ordinary
text).
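
Both points can be handled when walking the disk. The following is only
a rough sketch under my own assumptions: "text" is guessed by checking
for NUL bytes at the start of a file (which wrongly rejects UTF-16
text), and "up to date" is tracked by comparing modification times. The
function names and the seen dict are hypothetical.

```python
import os

def looks_like_text(path, probe=4096):
    """Heuristic: treat a file as text if its first bytes contain no NUL."""
    with open(path, "rb") as f:
        return b"\0" not in f.read(probe)

def files_to_index(root, seen):
    """Yield text files under root that are new or changed.

    seen maps path -> modification time recorded on the previous run
    and is updated in place, so unchanged files are skipped next time.
    """
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            if seen.get(path) == mtime:
                continue  # unchanged since the last indexing run
            seen[path] = mtime  # remember binaries too, to avoid re-probing
            if looks_like_text(path):
                yield path
```

Only the files this yields would need to be (re)fed into the index, so
keeping it current costs one directory walk instead of a full rebuild.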

Bye,
Simon

BTW and OT: Is there any way to get Mozilla/Thunderbird to cite
correctly (with date/time), the way your MUA does? See:
===snip===
Robert Brockway wrote:
> On Mon, 21 Feb 2005, Simon Frettloeh wrote:
===snap===

- --
  [bysf]
  simon frettloeh # mailto:simonfr@gmx.net
  pgp keyid:0x372A2577 # available on all public key servers
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (MingW32)

iD8DBQFCGhbAeRT02zcqJXcRArcgAKDf5VeYw3MCVswIAOBQrRo0Eez6ZgCgr42N
5xfFXR8TjuMZmd5RHK3MVdo=
=qSxx
-----END PGP SIGNATURE-----


