Re: Slickest way to do "joins"/lookups in text files

To: debian-user@lists.debian.org
Subject: Re: Slickest way to do "joins"/lookups in text files
From: Sam Watkins <swatkins@fastmail.fm>
Date: Sun, 19 Dec 2004 02:23:44 +1100
Message-id: <20041218152344.GH2503@jam.samwatkins.homeunix.net>
Mail-followup-to: debian-user@lists.debian.org
In-reply-to: <20041218013937.GA24920@alltel.net>
References: <20041218013937.GA24920@alltel.net>

On Fri, Dec 17, 2004 at 08:39:37PM -0500, William Ballard wrote:
> In SQL you join tables.  If it's indexed it's fast.
> Can you do this using text files from the prompt?
>
> Example:
> I'd like to investigate the most popular RSS readers Debian offers.
> 
> $apt-cache search rss | cut -d' ' -f1 > rss.txt
> $wget http://popcon.debian.org/main/by_vote.gz

Yes.  There is a standard unix tool called "join" that does it.

The syntax is a bit strange, but for example:

join -t $'\t' -1 1 -2 1 file1 file2

This will join two TSV files on their first column.  Both files must
already be sorted on that column.
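
A minimal sketch of that, using two small made-up files (the names
left.tsv and right.tsv and their contents are only for illustration).
It is written with -1 and -2, the POSIX spelling of the field options,
and builds the tab with printf so it also works in a plain sh:

```shell
tab=$(printf '\t')   # same character as $'\t' in bash
dir=$(mktemp -d)

# Made-up sample data: two TSV files keyed on column 1, already sorted.
printf 'apple\tred\nbanana\tyellow\ncherry\tdark red\n' > "$dir/left.tsv"
printf 'apple\t3\ncherry\t7\ndate\t1\n' > "$dir/right.tsv"

# Join on column 1 of each file, with tab as the field separator.
# Only keys present in both files appear in the output.
joined=$(join -t "$tab" -1 1 -2 1 "$dir/left.tsv" "$dir/right.tsv")
printf '%s\n' "$joined"
rm -r "$dir"
```

"banana" and "date" drop out because each appears in only one file; the
-a option of join keeps such unpaired lines if you want them.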

To solve your problem, you might do this:
  zgrep -v '^#' by_vote.gz | perl -pe '/^-/ && exit' |
  perl -pe 's/ +/\t/g; 1 while s/(\(.*?)\t/$1 /' >by_vote.1

to clean it up into TSV, then:
  <by_vote.1 sort -t $'\t' -k2,2 >by_vote.2

to sort it by the second column (package name).  rss.txt has to be sorted
too (sort -o rss.txt rss.txt), then:
  join -t $'\t' -1 1 -2 2 rss.txt by_vote.2 > rss_votes.tsv

and to make it human-readable again:
  < rss_votes.tsv tsv2txt > rss_votes

tsv2txt is a script I wrote; you can get it at:
  http://nipl.net/hacks/tsv2txt
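
If you would rather not fetch a script, and assuming all you need is
columns padded to a common width, something like this awk sketch gives a
similar effect.  It is not tsv2txt itself, just a rough substitute, and
the sample data is invented:

```shell
# Invented sample in the shape of rss_votes.tsv.
sample=$(printf 'akregator\t3\t2500\nliferea\t2\t3000')

# While reading, remember every cell and the widest value per column;
# at the end, print each row with the columns padded to those widths.
aligned=$(printf '%s\n' "$sample" | awk -F '\t' '
    {
        for (i = 1; i <= NF; i++) {
            cell[NR, i] = $i
            if (length($i) > w[i]) w[i] = length($i)
        }
        if (NF > nf) nf = NF
    }
    END {
        for (r = 1; r <= NR; r++) {
            line = ""
            for (i = 1; i <= nf; i++)
                line = line sprintf("%-" w[i] "s", cell[r, i]) (i < nf ? "  " : "")
            sub(/ +$/, "", line)   # no trailing padding on the last column
            print line
        }
    }')
printf '%s\n' "$aligned"
```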


There is also a program called "nosql" which provides many relational
functions for TSV files; it's quite good.  I have also written several
other scripts that do useful database-like things with TSV files.


Sam
