
Package that quickly reports the closest line in a text file?



O big brains o' bioinformatics!

If I ask you nicely, may I please have the benefit
of your huge science brain thoughts?

Do you happen to know of a Debian package with a
nice, quick command line tool that says which line
of a text file most closely matches a string?

I imagine it having a syntax like grep:

    $ stupendous-tool "hello world" file

But instead of reporting every line in file that
contains the string "hello world", it would return
the line closest to "hello world". For example, it
might return the line "hello word".
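
To make the idea concrete, here is a minimal Python
sketch of the behaviour I have in mind. It is not an
existing Debian tool: the function name closest_line
is made up for illustration, and it scores lines with
the standard library's difflib.SequenceMatcher, which
is a similarity ratio rather than an edit distance and
is probably not fast enough for large files.

```python
import difflib


def closest_line(query, lines):
    """Return the line most similar to query, or None if lines is empty.

    Hypothetical helper: similarity is difflib's ratio (0.0 to 1.0),
    not the Levenshtein distance.
    """
    best, best_score = None, -1.0
    for line in lines:
        text = line.rstrip("\n")
        score = difflib.SequenceMatcher(None, query, text).ratio()
        # Keep the first line that reaches the highest score so far.
        if score > best_score:
            best, best_score = text, score
    return best
```

Since a file object iterates over its lines, something
like closest_line("hello world", open("file")) would
give the single best match.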

Debian already has an approximate grep package
(tre-agrep), but, But, BUT!

    a.) it uses a slow algorithm, based on the
        Levenshtein distance, and

    b.) it is limited to differences of 9 or fewer
        characters.
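
For reference, the Levenshtein distance in (a) can be
sketched in Python as below. This is the textbook
dynamic-programming version, not tre-agrep's actual
code, and the function name levenshtein is my own. It
has no cap on the number of differences, but it does
work proportional to len(a) * len(b) for every pair of
strings, which is why it gets slow on long lines.

```python
def levenshtein(a, b):
    """Return the number of single-character edits that turn a into b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]
```

With this measure, "hello word" is at distance 1 from
"hello world".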

I believe comparing long strings of DNA is a
well-known chore in bioinformatics.

I read at

    https://stackoverflow.com/questions/5859561/getting-the-closest-string-match/5859823

and

    https://stackoverflow.com/questions/49263/approximate-string-matching-algorithms

that better algorithms are available.

Debian's packages named "ncbi-blast+" and "neobio"
look close, but I have no personal experience with
either.

My question:

Can you recommend a computationally efficient
Debian package that reports which line of a text
file most closely matches a string?

Thanks,
Kingsley

-- 
Time is the fire in which we all burn.

