
Re: Using wget to fill in a form



In the end I did pretty much as suggested, using wget and re-using session IDs.
I created a bash script that gets a session ID, reads the list of ISBNs, and
then tries to retrieve the record for each one. If a retrieval reports that the
session has expired, the script gets a new session ID and retries. It also does
a decent job of writing the retrieved records in CSV format for easy import
into a database or conversion to XML.
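
The retry-on-expiry idea can be tried without touching the network. What
follows is only a minimal sketch: fetch_record is a hypothetical stand-in for
the wget call (it is not part of the script below), rigged to report an expired
session on its first two calls, and MAX_RETRIES plays the role of NUM_EXPIRED.

```shell
#!/bin/bash
# Sketch only: fetch_record is a hypothetical stub replacing the wget call.
# It pretends the session has expired on the first two calls.
MAX_RETRIES=10
calls=0
retries=0

fetch_record ()
{
   calls=$((calls + 1))
   if [ "$calls" -le 2 ]
   then
      result="Your session has expired"
   else
      result="<pre>record for $1</pre>"
   fi
}

fetch_record "0805375651"
while [[ $result == *"Your session has expired"* ]] &&
      [ "$retries" -lt "$MAX_RETRIES" ]
do
   retries=$((retries + 1))
   # The real script would request a fresh session ID at this point.
   fetch_record "0805375651"
done

echo "calls=$calls retries=$retries result=$result"
```

The loop stops as soon as a fetch succeeds or the retry budget runs out, so a
dead server cannot keep the script spinning forever.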

The script and my list of 25 test ISBNs are included below. Interestingly,
about five of them (20%) come up with no record found.

If I try to do anything fancier, I will learn how to query the MARC system
directly. The LOC site has a lot of information available.

I appreciate all of the help and suggestions I received.



#!/bin/bash

#*******************************************#
#               getLOCinfo.sh               #
#                                           #
# A script to read a list of ISBN numbers   #
# from an input file, and to retrieve the   #
# LOC info for that item from the LOC web   #
# search form.                              #
#                                           #
# The input file is expected to contain     #
# a single line of ISBN numbers separated   #
# by whitespace. Alternatively, the file    #
# can contain one ISBN per line as long as  #
# all but the final line ends with white-   #
# space followed by a backslash (actually   #
# I think all lines can end that way).      #
#*******************************************#

# Script Constants:
BASE_URL="http://www.loc.gov/cgi-bin/zgate";
E_BAD_ARGS=65
E_BAD_FILE=66
E_NO_SESSION_ID=67
NUM_ARGS=2
NUM_EXPIRED=10
SUCCESS=0

# Script variables:
expired_count=0
result="Your session has expired"
result_url=$BASE_URL
session_url=$BASE_URL

# A function to get a new sessionid:
GetSessionID ()
{
   session_url=$BASE_URL"?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/"
   session_url=$session_url"locils2.html,z3950.loc.gov,7090"
   sessionid=`wget "$session_url" -o /dev/null -O - | \
                 grep SESSION_ID | \
                 cut -d "\"" -f4`
   if [ -z "$sessionid" ]
   then
      echo "Unable to get session ID. Exiting"
      exit $E_NO_SESSION_ID
   fi
}

# A function to "build" the request URL:
BuildURL ()
{
   url=$BASE_URL"?ACTION=SEARCH&DBNAME=VOYAGER&ESNAME=B&MAXRECORDS=20&"
   url=$url"RECSYNTAX=1.2.840.10003.5.10&REINIT=/cgi-bin/zgate?ACTION=INIT&"
   url=$url"FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,"
   url=$url"7090&srchtype=1,1016,2,102,3,3,4,2,5,100,6,1&SESSION_ID=$1&"
   url=$url"TERM_1=$2"
}

# Make sure file names were supplied when the script was called:
if [ $# -ne $NUM_ARGS ]
then
   echo "ERROR: Incorrect number of parameters supplied. Exiting..."
   exit $E_BAD_ARGS
fi

# Make sure the input file exists and is not empty:
if [ ! -f "$1" ] || [ ! -s "$1" ]
then
   echo "ERROR: $1 not found or is an empty file. Exiting..."
   exit $E_BAD_FILE
fi

# Truncate the output file if necessary:
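if [ -s "$2" ]
then
   echo -n "Warning: $2 exists and is not empty. Continue [y/N]? "
   read input
   # Quote the substitution so that an empty answer (just Enter) counts as "no"
   # instead of making the test fail and falling through to the truncation:
   if [ "$(echo "$input" | tr A-Z a-z)" != "y" ]
   then
      echo "Please provide a valid output file name"
      exit $E_BAD_FILE
   fi
   cat /dev/null > "$2"
fi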

# Get a session ID:
GetSessionID

# Read the file contents:
read isbn_list < "$1"

for isbn in $isbn_list
do
   BuildURL "$sessionid" "$isbn"
   result=`wget "$url" -o /dev/null -O - | tr "\n" " "`
   while [ -n "`echo $result | sed -n -e '/Your session has expired/Ip'`" ] &&
         [ $expired_count -lt $NUM_EXPIRED ]
   do
      let "expired_count+=1"
      GetSessionID
      BuildURL "$sessionid" "$isbn"
      result=`wget "$url" -o /dev/null -O - | tr "\n" " "`
   done

   # Give up only if the session is still expired after the last retry:
   if [ -n "`echo $result | sed -n -e '/Your session has expired/Ip'`" ]
   then
      echo "Session still expired after $NUM_EXPIRED retries. Exiting"
      exit $E_NO_SESSION_ID
   else
      expired_count=0
   fi

   if [ -n "`echo $result | sed -n -e '/No records matched your query/Ip'`" ]
   then
      # Print the not found message to stderr:
      echo "$isbn: No record found" >&2
   else
      echo -n "\"$isbn\"," >> "$2"
      echo $result | sed -n -e 's/.*<pre>\(.*\)<\/pre>.*/\1/Ip' | \
         sed -e 's/  \+/ /g' | \
         sed -e 's/^Author: /"/' | \
         sed -e 's/\., [0-9]\{4\}-[0-9]\{0,4\} \(Title: \)/. \1/' | \
         sed -e 's/\. Title: /","/' | \
         sed -e 's/\. Published: /","/' | \
         sed -e 's/, c\([0-9]\{4\}\)\. LC Call No.: /","\1","/' | \
         sed -e 's/ *$/"/' \
         >> "$2"
   fi
done

exit $SUCCESS
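
The sed chain at the end of the script can be tried on its own. The sketch
below wraps it in a function, record_to_csv (a name made up for this example),
and feeds it a made-up record. The sample text and the ISBN 0000000000 are
assumptions shaped to fit the sed patterns, not real LOC output. The "\+" in
the first expression needs GNU sed.

```shell
#!/bin/bash
# Hypothetical record text, shaped to fit the sed patterns; not real LOC output.
sample='Author:      Doe, John Q., 1950-   Title:   An example book.   Published:   New York : Example Press, c2001.   LC Call No.:   QA76.E93 2001   '

# record_to_csv ISBN RECORD_TEXT
# Prints one CSV line: isbn, author, title, publisher, year, call number.
record_to_csv ()
{
   printf '"%s",' "$1"
   echo "$2" | \
      sed -e 's/  \+/ /g' | \
      sed -e 's/^Author: /"/' | \
      sed -e 's/\., [0-9]\{4\}-[0-9]\{0,4\} \(Title: \)/. \1/' | \
      sed -e 's/\. Title: /","/' | \
      sed -e 's/\. Published: /","/' | \
      sed -e 's/, c\([0-9]\{4\}\)\. LC Call No.: /","\1","/' | \
      sed -e 's/ *$/"/'
}

record_to_csv "0000000000" "$sample"
```

Each sed step turns one field label into a "," separator, so a record whose
labels differ from these patterns will come out only partly converted.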

##### ISBN List: ###############################################################

0805375651 \
0314027157 \
0201087987 \
9780980232714 \
0131774115 \
0789731274 \
1874416656 \
1886411484 \
9780425238981 \
0070726922 \
0495011622 \
1565927699 \
0673524841 \
0721659659 \
9781847991683 \
0596100795 \
0596001584 \
9780980455205 \
0835930513 \
9780954452971 \
0619121475 \
9780321553577 \
0130424110 \
0201612445 \
9780123705488
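
The backslash-terminated layout above works because "read" without the -r
option treats a trailing backslash as a line continuation, so the whole list
arrives as a single line. A minimal sketch with three of the ISBNs and a
temporary file (created here with mktemp purely for illustration):

```shell
#!/bin/bash
# Write a small backslash-continued list to a temporary file.
tmpfile=$(mktemp)
printf '%s\n' '0805375651 \' '0314027157 \' '0201087987' > "$tmpfile"

# Without -r, read joins the continued lines into one logical line.
read isbn_list < "$tmpfile"
rm -f "$tmpfile"

count=0
for isbn in $isbn_list
do
   count=$((count + 1))
   echo "$count: $isbn"
done
```

With "read -r" the backslashes would be kept literally and only the first line
would be read, so the script depends on the default behaviour.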

