Re: Using wget to fill in a form
In the end I did pretty much as suggested, using wget and re-using session IDs.
I created a bash script that gets a session ID, reads the list of ISBNs,
and then tries to retrieve the info for each one. If a retrieval comes back with
"session expired", the script gets a new session ID and tries again. It also does
a decent job of writing the retrieved records in CSV format for easy import into
a database or XML.
The script and my list of 25 test ISBNs are included below. Interestingly,
about five of them (roughly 20%) come up with no record found.
If I try to do anything fancier, I will learn how to query the MARC
system directly. The LOC site has a lot of information available.
I appreciate all of the help and suggestions I received.
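The retry-on-expiry idea is easier to follow on its own, so here is a minimal sketch with the network calls stubbed out. The names `get_session_id`, `fetch_record`, `is_expired` and `fetch_with_retry` are hypothetical and only for illustration, and the stub simply pretends that the first two sessions have expired. The real script talks to the LOC gateway with wget instead.

```shell
MAX_RETRIES=10
session_id=0

# Stub: pretend each call hands out a fresh session number.
get_session_id () {
    session_id=$((session_id + 1))
}

# Stub: sessions 1 and 2 count as expired, later ones return a record.
# $1 = session ID, $2 = ISBN
fetch_record () {
    if [ "$1" -lt 3 ]; then
        echo "Your session has expired"
    else
        echo "record for $2"
    fi
}

# True if the response text contains the expiry message.
is_expired () {
    case "$1" in
        *"Your session has expired"*) return 0 ;;
        *) return 1 ;;
    esac
}

# $1 = ISBN. Leaves the last response in the global variable "result"
# and returns success only if that response is not an expiry message.
fetch_with_retry () {
    tries=0
    result=$(fetch_record "$session_id" "$1")
    while is_expired "$result" && [ "$tries" -lt "$MAX_RETRIES" ]; do
        tries=$((tries + 1))
        get_session_id
        result=$(fetch_record "$session_id" "$1")
    done
    ! is_expired "$result"
}

get_session_id
if fetch_with_retry 0596001584; then
    echo "$result"        # prints: record for 0596001584
else
    echo "gave up after $MAX_RETRIES retries" >&2
fi
```

Checking the final response, rather than the retry counter, means a fetch that succeeds on the very last retry still counts as a success.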
#!/bin/bash
#*******************************************#
# getLOCinfo.sh #
# #
# A script to read a list of ISBN numbers #
# from an input file, and to retrieve the #
# LOC info for that item from the LOC web #
# search form. #
# #
# The input file is expected to contain #
# a single line of ISBN numbers separated #
# by whitespace. Alternatively, the file #
# can contain one ISBN per line as long as #
# every line but the last ends with #
# whitespace followed by a backslash (I #
# think all lines can end that way). #
#*******************************************#
# Script Constants:
BASE_URL="http://www.loc.gov/cgi-bin/zgate"
E_BAD_ARGS=65
E_BAD_FILE=66
E_NO_SESSION_ID=67
NUM_ARGS=2
NUM_EXPIRED=10
SUCCESS=0
# Script variables:
expired_count=0
result="Your session has expired"
result_url=$BASE_URL
session_url=$BASE_URL
# A function to get a new sessionid:
GetSessionID ()
{
session_url=$BASE_URL"?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/"
session_url=$session_url"locils2.html,z3950.loc.gov,7090"
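# Quote the URL so the "?" in it is never treated as a glob pattern:
sessionid=`wget "$session_url" -o /dev/null -O - | \
grep SESSION_ID | \
cut -d "\"" -f4`
# Quote the variable so the test also works for multi-word values:
if [ -z "$sessionid" ]
then
echo "Unable to get session ID. Exiting" >&2
exit $E_NO_SESSION_ID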
fi
}
# A function to "build" the request URL:
BuildURL ()
{
url=$BASE_URL"?ACTION=SEARCH&DBNAME=VOYAGER&ESNAME=B&MAXRECORDS=20&"
url=$url"RECSYNTAX=1.2.840.10003.5.10&REINIT=/cgi-bin/zgate?ACTION=INIT&"
url=$url"FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,"
url=$url"7090&srchtype=1,1016,2,102,3,3,4,2,5,100,6,1&SESSION_ID=$1&"
url=$url"TERM_1=$2"
}
# Make sure file names were supplied when the script was called:
if [ $# -ne $NUM_ARGS ]
then
echo "ERROR: Incorrect number of parameters supplied. Exiting..." >&2
echo "Usage: $0 isbn_file output_file" >&2
exit $E_BAD_ARGS
fi
# Make sure the input file exists and is not empty:
if [ ! -f "$1" ] || [ ! -s "$1" ]
then
echo "ERROR: $1 not found or is an empty file. Exiting..." >&2
exit $E_BAD_FILE
fi
# Truncate the output file if necessary:
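if [ -s "$2" ]
then
echo -n "Warning: $2 exists and is not empty. Continue [y/N]? "
read input
# The quotes matter: with an empty answer the unquoted test is malformed,
# fails, and the file gets truncated even though the default is "N".
if [ "$(echo "$input" | tr A-Z a-z)" != "y" ]
then
echo "Please provide a valid output file name"
exit $E_BAD_FILE
fi
cat /dev/null > "$2"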
fi
# Get a session ID:
GetSessionID
# Read the file contents:
# (read without -r joins lines that end in a backslash)
read isbn_list < "$1"
for isbn in $isbn_list
do
BuildURL "$sessionid" "$isbn"
result=`wget "$url" -o /dev/null -O - | tr "\n" " "`
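while echo "$result" | grep -qi 'Your session has expired' &&
[ $expired_count -lt $NUM_EXPIRED ]
do
let "expired_count+=1"
GetSessionID
BuildURL "$sessionid" "$isbn"
result=`wget "$url" -o /dev/null -O - | tr "\n" " "`
done
# Give up only if the last attempt still reports an expired session
# (checking the counter alone would also reject a success on the final retry):
if echo "$result" | grep -qi 'Your session has expired'
then
echo "Session still expired after $NUM_EXPIRED retries. Exiting" >&2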
exit $E_NO_SESSION_ID
else
expired_count=0
fi
if echo "$result" | grep -qi 'No records matched your query'
then
# Print the not found message to stderr:
echo "$isbn: No record found" >&2
else
printf '"%s",' "$isbn" >> "$2"
# $result is left unquoted on purpose so the shell collapses tabs and
# runs of whitespace before sed sees the text:
echo $result | sed -n -e 's/.*<pre>\(.*\)<\/pre>.*/\1/Ip' | \
sed -e 's/ \+/ /g' | \
sed -e 's/^Author: /"/' | \
sed -e 's/\., [0-9]\{4\}-[0-9]\{0,4\} \(Title: \)/. \1/' | \
sed -e 's/\. Title: /","/' | \
sed -e 's/\. Published: /","/' | \
sed -e 's/, c\([0-9]\{4\}\)\. LC Call No\.: /","\1","/' | \
sed -e 's/ *$/"/' \
>> "$2"
fi
done
exit $SUCCESS
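The chain of sed substitutions near the end of the script can be tried on its own. This is a minimal sketch, assuming a flattened record shaped the way those expressions expect. The sample record and the function name `record_to_csv` are made up for illustration, and real LOC output may well differ.

```shell
# Turn one flattened record into the CSV fields
# author, title, publisher, year, call number.
record_to_csv () {
    printf '%s\n' "$1" |
    sed -e 's/  */ /g' \
        -e 's/^Author: /"/' \
        -e 's/\., [0-9]\{4\}-[0-9]\{0,4\} \(Title: \)/. \1/' \
        -e 's/\. Title: /","/' \
        -e 's/\. Published: /","/' \
        -e 's/, c\([0-9]\{4\}\)\. LC Call No\.: /","\1","/' \
        -e 's/ *$/"/'
}

# Hypothetical record, not real LOC data.
sample='Author: Doe, Jane Q., 1950- Title: A sample book. Published: Example City :  Example Press, c1999. LC Call No.: QA76.D64 1999'

record_to_csv "$sample"
# prints: "Doe, Jane Q","A sample book","Example City : Example Press","1999","QA76.D64 1999"
```

Note that the author's birth and death years are dropped, and that the period after the author's name is consumed by the `. Title: ` substitution.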
##### ISBN List: ###############################################################
0805375651 \
0314027157 \
0201087987 \
9780980232714 \
0131774115 \
0789731274 \
1874416656 \
1886411484 \
9780425238981 \
0070726922 \
0495011622 \
1565927699 \
0673524841 \
0721659659 \
9781847991683 \
0596100795 \
0596001584 \
9780980455205 \
0835930513 \
9780954452971 \
0619121475 \
9780321553577 \
0130424110 \
0201612445 \
9780123705488
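The list above relies on `read` (without `-r`) treating a backslash at the end of a line as a continuation. That is why the whitespace before each backslash matters: without it, neighbouring ISBNs would be glued together. Here is a minimal sketch of that behaviour using a throwaway temp file. The helper `load_isbns` is hypothetical and not part of the script; it is one possible way to also accept a plain one-ISBN-per-line file.

```shell
# Hypothetical helper: delete any backslashes, then let the shell split
# the remaining text on whitespace and print it as one line.
load_isbns () {
    set -- $(tr -d '\\' < "$1")
    echo "$*"
}

list_file=$(mktemp)

# Same layout as the list above: whitespace plus backslash on all but the last line.
printf '0596100795 \\\n0596001584 \\\n0201612445\n' > "$list_file"

# What the script does: a single read joins the continued lines.
read isbn_list < "$list_file"
echo "$isbn_list"

# The helper gives the same result for this layout...
load_isbns "$list_file"

# ...and for a plain one-per-line file with no backslashes.
printf '0596100795\n0596001584\n0201612445\n' > "$list_file"
load_isbns "$list_file"

rm -f "$list_file"
```

Each of the three commands prints the three ISBNs on one line, separated by single spaces.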