
Re: remove/replace non-ascii characters from file



Johannes Wiedersich wrote:
I have a silly Window$ application that is supposed to export ASCII data. In fact the file is 99% ASCII (after dos2unix), but contains a line starting with "Comment: " that contains non-ASCII garbage (represented as ^@^@^@^@^@^@ etc.).

I suppose you mean "non-graphic ASCII". Those are NUL characters,
which the ASCII *definition* states can be inserted or removed
from *any* stream without changing its meaning. This means that
your application is not ASCII compliant. Sorry, but in this case
(unusual, I know) Windows is right and your app is wrong.

I tried
$ grep -v Comment darkaa2.dat
but that just returns
Binary file darkaa2.dat matches

Yah. Unfortunately, grep isn't very smart in this way: once it sees
NUL bytes it treats the whole file as binary.

Is there a simple way to remove this line?
Before I start looking at sed or gawk, I would just like to know whether they would work with these silly 'binary files'.

NB: I can open the file with nano and manually delete the line, but it's not just one file to process.

You might try tr; tr -d '\000' deletes NULs. On another note, here's
a C program which will do what you want. It's written as a filter, so
no file names on the command line... this is strictly no-frills
programming. Placed into the public domain by me, the original author,
today, Thursday 3 August 2006. If you *need* file names on the command
line (like for use with find and xargs) then I can add that, but I
thought something quick'n'nasty might be more what you need.

---- nonul.c ----
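#include <stdio.h>
#include <stdlib.h>

#define NUL 0x00        /* ASCII NUL */

#define OMIT NUL        /* the character to strip from the stream */

int main(void) {
    int Chr;            /* int, not char, so EOF stays distinguishable */

    /* Copy stdin to stdout, dropping every OMIT character. */
    while ((Chr = getchar()) != EOF) {
        if (Chr != OMIT)
            putchar(Chr);
    }
    return EXIT_SUCCESS;
}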
---- end nonul.c ----

$ gcc -o nonul nonul.c
$ hexdump -C junk.txt
00000000  43 6f 6d 6d 65 6e 74 3a  20 00 00 00 00 00 00 00  |Comment: .......|
00000010  00 00 00 00 00 00 0a 0a                           |........|
00000018
$ ./nonul < junk.txt >junk1.txt
$ hexdump -C junk1.txt
00000000  43 6f 6d 6d 65 6e 74 3a  20 0a 0a                 |Comment: ..|
0000000b

If you find that the characters are something other than NUL
(ASCII code 0x00), then just substitute that for NUL. For example,
if it is backspace (BS), then add this line...

#define BS 0x08

and change OMIT to BS

#define OMIT BS


Thanks,

HTH. If not, then I can ship you a program.

Mike
--
p="p=%c%s%c;main(){printf(p,34,p,34);}";main(){printf(p,34,p,34);}
This message made from 100% recycled bits.
You have found the bank of Larn.
I can explain it for you, but I can't understand it for you.
I speak only for myself, and I am unanimous in that!


