Re: help needed for converting strings in a file



On 11/10/05, Sourabh Bora <sourabhbora@gmail.com> wrote:
> hi,,
>          i am making a small tool for offline web browsing.. For this I need
> to change the source of html files.
>     let me explain::
>
>          In a web page the hyper links are written as
>
>  href="http://www.micronux.com/catalog/"
>
>  i want this particular string to convert to
>
>
>
> href="./micronux.com_catalog"
>
>  The logic is --1)delete
> http://www.
>  2) replace '/' '?' etc with '_'
>
>  I want to write a script using sed or awk which will do all the conversion
> in a file..

Since the responses so far have suggested alternatives, rather than
how to do what you're asking, and knowing how to do this sort of thing
is valuable in and of itself, here are some examples, though not a
complete sed script.  In fact, I'm going to use perl's regular
expressions, since those are the ones with which I'm most familiar.

s|http://www\.||i

This will do a case-insensitive replacement of "http://www." with "".
The "|" is a delimiter around the search target and the replacement.
The standard delimiter is a slash, but then you'd have to write

s/http:\/\/www\.//i

which is a bit harder to read.
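
For anyone who wants to try this outside perl, here is a rough Python
translation of that first substitution.  It is only a sketch: the
function name strip_prefix is made up for illustration, and Python's
re.sub takes the pattern as an ordinary string, so the delimiter
question doesn't come up there at all.

```python
import re

def strip_prefix(url):
    # Hypothetical helper: same pattern as s|http://www\.||i above.
    # re.IGNORECASE plays the role of the trailing "i"; count=1 mirrors
    # the fact that a substitution without "g" replaces only the first match.
    return re.sub(r"http://www\.", "", url, count=1, flags=re.IGNORECASE)

print(strip_prefix("http://www.micronux.com/catalog/"))  # micronux.com/catalog/
print(strip_prefix("HTTP://WWW.micronux.com/catalog/"))  # micronux.com/catalog/
```

The backslash before the period matters in both languages: without it,
"." would match any character rather than a literal dot.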

For the other replacements, \w will match "word" characters, where
word characters are a-z, A-Z, 0-9, and _.  You can do the replacement
as

s|[^\w.\s]|_|g

I've used "|" again for consistency.  The "g" at the end tells perl to
do this replacement as many times as it can on the current line.  The
expression in brackets means not (^) word characters (\w), periods
(just . here), or whitespace (\s).
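
The same character class can be tried in Python, again only as a sketch
with a made-up function name (underscore_specials).  One assumption to
flag: Python's \w matches non-ASCII letters by default, so re.ASCII is
passed to keep it to the a-z, A-Z, 0-9, and _ set described above.

```python
import re

def underscore_specials(text):
    # Hypothetical helper: same character class as s|[^\w.\s]|_|g above.
    # re.sub replaces every match by default, which corresponds to "g".
    # re.ASCII restricts \w to a-z, A-Z, 0-9 and _.
    return re.sub(r"[^\w.\s]", "_", text, flags=re.ASCII)

print(underscore_specials("micronux.com/catalog/"))     # micronux.com_catalog_
print(underscore_specials("search.cgi?q=sed&lang=en"))  # search.cgi_q_sed_lang_en
```

Periods and whitespace survive because they are listed inside the
negated class; everything else that isn't a word character becomes "_".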

To prepend the "./", you can do

s|^|\./|

where the caret, now outside of brackets, matches the beginning of the
line rather than meaning "not".

If you have a URL saved as $url, you can then do the following in perl:

$url =~ s|http://www\.||i;
$url =~ s|[^\w.\s]|_|g;
$url =~ s|^|\./|;

Note that this doesn't quite do what you want, since it produces a
trailing "_" (./micronux.com_catalog_ for your example).  I'll leave
getting rid of that as an exercise, with the added note that "$"
matches the end of a line.
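
To show how the three steps might be applied to a whole line of HTML
rather than to a bare URL, here is a hedged Python sketch.  The names
localize and rewrite_hrefs are invented for this example, and it
assumes links are written as href="..." with double quotes.  It does
not solve the trailing "_" exercise; the output still ends in "_".

```python
import re

def localize(url):
    # The three substitutions from above, in the same order.
    url = re.sub(r"http://www\.", "", url, count=1, flags=re.IGNORECASE)
    url = re.sub(r"[^\w.\s]", "_", url, flags=re.ASCII)
    return "./" + url  # same effect as s|^|\./|

def rewrite_hrefs(html):
    # Assumption: links look like href="..." with double quotes.
    # Only the text between the quotes is passed to localize(), so the
    # rest of the line (including = and ") is left untouched.
    return re.sub(r'href="([^"]*)"',
                  lambda m: 'href="' + localize(m.group(1)) + '"',
                  html)

line = '<a href="http://www.micronux.com/catalog/">catalog</a>'
print(rewrite_hrefs(line))  # <a href="./micronux.com_catalog_">catalog</a>
```

Restricting the match to the quoted value matters because the second
substitution, run over the whole line, would also turn "=", "<", ">"
and the quotes themselves into underscores.  Note that this sketch
rewrites every href it finds, including relative links and https URLs,
which a real tool would probably want to handle separately.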

--
Michael A. Marsh
http://www.umiacs.umd.edu/~mmarsh
http://mamarsh.blogspot.com


