
Re: finding files by content from the command-line



On Fri, 04 Nov 2005, Matt Price wrote:

> Having checked out beagle and quite liked it, I see there are also
> various graphical file-finding tools out there, e.g. the gnome "search
> for files" program, that allow content searches (e.g., "contains the
> text"-type searching).  In many cases similar effects can be achieved
> using find and/or grep, but when searching for an mp3 or for text in an
> openoffice document this strikes me as inefficient.  Does anyone know
> of a command-line tool that can deploy backends like pdf2text & other
[snip]

I use 'glimpse' to do this.  Periodically run the glimpse indexer over the
directories of interest and you have a fast content-based search engine for
your local files.  glimpse has a mechanism to filter/convert files using
other software prior to indexing so you can index useful text from most any
file type (see the .glimpse_filters file below).

(Note that the potential for security problems multiplies rapidly with
the number of extra parsers you invoke on arbitrary data. YMMV.)

Here's my 'glimpse.notes' file that summarizes how I set this up.  For the
most part it is a concatenation of various bash scripts and glimpse
dot-files.  The '#*' header lines delimit the different files (or
file-fragments) and give a sample pathname for the file if applicable.

I hope this is clear enough :-)

-- Brad

###################
## Crontab entries
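# Sundays at 01:00: rebuild the whole index.
0  1 * * 0 nice /home/joe-user/bin/glimpseindex.sh    <directories to index>
# Every day at 05:00: incremental update (-f) for new or changed files.
0  5 * * * nice /home/joe-user/bin/glimpseindex.sh -f <directories to index>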

###################
## /home/joe-user/bin/glimpseindex.sh
#----------------------
#! /bin/bash
glimpseindex -o -t -B -M 16 -z "$@" >& ~/.glimpse_index.log
#----------------------

###################
## Convenience commands (bash shell functions)
#----------------------
function glwin() {
  # This just kicks out filenames with content matching the search string.
  # I find myself using this command most frequently.
  glimpse -N -j -y -z -w -i "$@" |\
    perl -n -e 'chomp; s/\Q$ENV{HOME}\E/~/; printf("\t\x1b[0;31m%s\x1b[0m\n",$_);' |\
    $PAGER;
}
function glwi() {
  # This command also returns the match along with a few lines of context
  # from the files.  Because of the way glimpse functions (at least the way
  # I have it set up) this can take /much/ longer to return.
  glimpse -j -y -z -w -i "$@" |\
    gawk -F : '{print gensub($1 ":","\x1b[0;31m&\x1b[0m\n",1,$0)}' |\
    $PAGER;
}
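
To make the perl stage in glwin easier to follow, here is a rough
pure-bash sketch of the same idea: shorten $HOME to "~" and print each
filename indented and in red.  The function name highlight_paths is made up
for illustration and is not part of my setup.  Unlike the perl one-liner it
only rewrites $HOME at the start of the line, which is an assumption about
what is wanted.

```bash
#!/bin/bash
# Illustrative stand-in for the perl stage in glwin (hypothetical name).
# Reads one filename per line on stdin and writes it back indented by a
# tab and wrapped in ANSI red, with a leading $HOME shortened to "~".
highlight_paths() {
  local line
  while IFS= read -r line; do
    # Literal prefix match; the quotes keep $HOME from acting as a pattern.
    if [[ "$line" == "$HOME"* ]]; then
      line="~${line#"$HOME"}"
    fi
    printf '\t\033[0;31m%s\033[0m\n' "$line"
  done
}
```

It could be dropped into the pipeline in place of the perl command, e.g.
glimpse -N -j -y -z -w -i "$@" | highlight_paths | $PAGER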

###################
## /home/joe-user/.glimpse_exclude
#----------------------
.glimpse_
music/
.mp3$
.mpg$
.mpeg$
.png$
.gif$
.jpg$
.jpeg$
.eps$
.eps.gz$
.eps.bz2$
/tmp/
Cache/
cache/
.iso.
.img.
.log$
.dwb
#----------------------

###################
## /home/joe-user/.glimpse_filters
#----------------------
*.Z$	gzip -dc
*.z$	gzip -dc
*.gz$	gzip -dc
*.bz2$	bzip2 -dc
*.zip$	unzip -l
*.tar$	tar tf 
*.tgz$	tar tzf 
*.pdf$	glimpsepdftotext
*.ps$	pstotext 
*.html$	w3m -dump 
*.htm$	w3m -dump 
*.ps.gz$	pstotext 
*.ps.bz2$	pstotext 
*.tar.gz$	tar tf 
*.tar.bz2$	tar tf 
/mail/	mbox2txt
#----------------------

###################
## /home/joe-user/bin/glimpsepdftotext
#----------------------
#!/bin/bash
# Wrapper to rearrange arguments passed by glimpseindex for use with pdftotext
exec /usr/bin/pdftotext -q "$1" -
#----------------------
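
The original question mentioned OpenOffice documents, which the filters
above do not cover.  A wrapper along the same lines as glimpsepdftotext
might work; the following is only a sketch, not something I run.  The
script name glimpseootext and the helper names are invented here.  It
assumes glimpseindex hands the file name to the filter as the first
argument, as the pdftotext wrapper does, and it relies on OpenOffice
documents being zip archives whose text lives in content.xml.

```bash
#!/bin/bash
# Hypothetical wrapper, e.g. /home/joe-user/bin/glimpseootext (name and
# location are assumptions).  Crude text extraction: good enough for
# indexing words, not for faithful rendering.

# Replace every XML tag with a space, then squeeze runs of whitespace.
strip_xml_tags() {
  sed -e 's/<[^>]*>/ /g' -e 's/[[:space:]]\{1,\}/ /g'
}

# Dump content.xml from the zip container and strip the markup.
ootext() {
  unzip -p "$1" content.xml | strip_xml_tags
}

# Only run when a file name was actually supplied.
if [[ $# -gt 0 ]]; then
  ootext "$1"
fi
```

If that behaves, matching lines in .glimpse_filters would presumably look
like the existing entries:

    *.sxw$	glimpseootext
    *.odt$	glimpseootext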


