[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Slickest way to do "joins"/lookups in text files



On Fri, Dec 17, 2004 at 08:39:37PM -0500, William Ballard wrote:
> In SQL you join tables.  If it's indexed it's fast.
> Can you do this using text files from the prompt?
>
> Example:
> I'd like to investigate the most popular RSS readers Debian offers.
> 
> $apt-cache search rss | cut -d' ' -f1 > rss.txt
> $wget http://popcon.debian.org/main/by_vote.gz

Yes.  There is a standard unix tool called "join" that does it.

The syntax is a bit strange, but for example:

join -t $'\t' -j1 1 -j2 1 file1 file2

This will join two TSV files on their first column.

For example, to solve your problem, you might do this:
  zgrep -v '^#' by_vote.gz | perl -pe '/^-/ && exit' |
  perl -pe 's/ +/\t/g; 1 while s/(\(.*?)\t/$1 /' >by_vote.1

to clean it up into TSV, then:
  <by_vote.1 sort -t $'\t' -k2,2 >by_vote.2

to sort it by the second column (package name), then:
  join -t $'\t' -j1 1 -j2 2 rss.txt by_vote.2 > rss_votes.tsv

and to make it human-readable again:
  < rss_votes.tsv tsv2txt > rss_votes

tsv2txt is a script I wrote, you can get it at:
  http://nipl.net/hacks/tsv2txt


There is a program called "nosql" which provides many relational functions
for TSV files, it's quite good.  Also, I have written several other scripts
that do useful database-like things with TSV files.


Sam



Reply to: