
Re: another script query (perl?)



Richard Lyons wrote:
> On Fri, Sep 07, 2007 at 07:19:17AM -0700, tabris wrote:
>
>   
>> Richard Lyons wrote:
>>     
>>> Hi, all you script wizards.
>>>
>>> I thought this would be easy, but I haven't found anything to crib
>>> from...
>>>
>>> I need a script to read a text file (actually tex) and parse lines of a
>>> table that may or may not span newline characters in the file.
>>> Basically, there are lines of the form
>>>
>>>    {some text} & {some more text} & {text c} & {text d} \\
>>>
>>> where the braces are only for clarity and do not occur in the files, and
>>> where the bits of text may include whitespace which may include newline
>>> characters. There may also be escaped ampersands in the text ('\&'), and
>>> the text fragments may be empty.
>>>
>>> I suspect perl may be the way forward.  I need to be able to read each
>>> file, parse each set of three ampersands with a double backslash
>>> breaking it into four substrings, manipulate the substrings and write
>>> the file anew.  A typical manipulation will be to take text c and copy
>>> it to text d. I shall also try to strip leading and trailing whitespace
>>> to tidy up the file.
>>>
>>> Any and all pointers will be gratefully received!
>>>
>>>   
>>>       
>> Please give real examples of the text you have, as well as more info
>> about what processing you will do with it.
>> There are multiple ways to approach this; we need more information
>> first.
>>
>>     
> I'm not sure it helps a lot, as they vary quite a lot, but here is one:
>
>     \mbox{Walls} &Plain plastered and painted white. &GC but to soiled
> around switch, RHS as entering, HL marks. OW nail near centre, some
>  blue-tac remnants. LHW hairline cracking at HL. pipe boxing far RH
>    corner, white painted, cracks at junctions.  & \\
>
> and here is another:
>
>    &catch, diecast \& epoxy coated with security lock & GC &\\
>
> If it is unclear to any non-latex-user, the ampersands are table column
> separators in latex.
>
> After the manipulation I gave as an example (copy text c to text d), I
> would hope they would look like this:
>
> \mbox{Walls} & Plain plastered and painted white. & GC but to soiled
> around switch, RHS as entering, HL marks. OW nail near centre, some
> blue-tac remnants. LHW hairline cracking at HL. pipe boxing far RH
> corner, white painted, cracks at junctions. & GC but to soiled around
> switch, RHS as entering, HL marks. OW nail near centre, some blue-tac
> remnants. LHW hairline cracking at HL. pipe boxing far RH corner, white
> painted, cracks at junctions. \\
>
> and:
>
>   & catch, diecast \& epoxy coated with security lock & GC & GC \\
>
> The first example shows the problem of included newlines, which might
> occur as here or anywhere else in the text. Note that the whole text
> fragment has been copied to the previously void fourth field.
>
> The second example shows the need not to be confused by '\&'.  
>
> If that is any clearer...
>
>   
Well, I'd try something along these lines, assuming that $l already
holds the entire row you want to process.
Note that this avoids regexps where possible, as they tend to be slow
and hard to read. Not that I dislike regexps, but I don't think they're
necessary here. Also note that none of this code has been tested; it's
the product of about 5 minutes of hacking.

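# split on every '&', then rejoin the pieces that were cut at an escaped '\&'
my @phrases = split(/&/, $l, -1);    # -1 keeps trailing empty fields
{
    my @tmp;
    while (@phrases) {               # test the array, so empty fields survive
        my $phrase = shift @phrases;
        # a piece ending in a backslash was cut at '\&': glue the next one on
        while (@phrases && substr($phrase, -1) eq '\\') {
            $phrase .= '&' . shift @phrases;
        }
        push @tmp, $phrase;
    }
    @phrases = @tmp;
}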

# remove leading and trailing whitespace, and fold newlines into spaces
foreach my $phrase (@phrases) {
    $phrase =~ s/^\s+//;   # remove all leading whitespace
    $phrase =~ s/\s+$//;   # remove all trailing whitespace
    $phrase =~ s/\n/ /g;   # change any remaining newline chars to spaces
}

# now reconstruct your text however you want it.
# I have a good (free, public-domain) line splitter if you need one.
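
For the reconstruction step, here is a rough sketch of how one row might be
put back together after copying text c to text d. It is only an
illustration, not tested against your real files. The sub names (split_row,
tidy, copy_c_to_d) are made up for this example, and it assumes the row has
already been cut out of the file with its trailing '\\' removed. It also
squeezes every run of whitespace to a single space, which goes a little
further than the trimming above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one table row into fields on '&', gluing pieces back together
# wherever the '&' was escaped as '\&'.
sub split_row {
    my ($row) = @_;
    my @pieces = split /&/, $row, -1;    # -1 keeps trailing empty fields
    my @fields;
    while (@pieces) {
        my $field = shift @pieces;
        # a piece ending in a backslash was cut at an escaped ampersand
        while (@pieces && substr($field, -1) eq '\\') {
            $field .= '&' . shift @pieces;
        }
        push @fields, $field;
    }
    return @fields;
}

# Fold newlines and runs of whitespace into single spaces, then trim.
sub tidy {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    $text =~ s/^ //;
    $text =~ s/ $//;
    return $text;
}

# Copy field c (index 2) into field d (index 3) and rebuild the row.
# Returns nothing (undef in scalar context) for rows without four fields.
sub copy_c_to_d {
    my ($row) = @_;
    my @fields = map { tidy($_) } split_row($row);
    return unless @fields == 4;
    $fields[3] = $fields[2];
    return join(' & ', @fields) . ' \\\\';    # append the row terminator
}

# The second sample row from the thread, without its trailing '\\'.
my $row = '   &catch, diecast \& epoxy coated with security lock & GC &';
print copy_c_to_d($row), "\n";
```

Rows that do not split into exactly four fields are left for the caller to
pass through unchanged, which seemed the safest default for lines such as
\hline or the tabular header.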
