Re: remove/replace non-ascii characters from file

Johannes Wiedersich wrote:
I have a silly Window$ application that is supposed to export ascii data. In fact the file is 99% ascii (after dos2unix), but contains a line starting with "Comment: " that contains non-ascii garbage (represented as ^@^@^@^@^@^@ etc.)

I suppose you mean "non-graphic ASCII". Those are NUL characters,
which the ASCII *definition* states can be inserted or removed
from *any* stream without changing its meaning. This means that
your application is not ASCII compliant. Sorry, but in this case
(unusual, I know) Windows is right and your app is wrong.

I tried
$ grep -v Comment
but that just returns
Binary file darkaa2.dat matches

Yah. Unfortunately, grep isn't very smart in this way: it sees the
NUL bytes, decides the file is binary, and refuses to print the lines.

Is there a simple way to remove this line?
Before I start looking at sed or gawk, I would just like to know if they would work with these silly 'binary files'.

NB: I can open the file with nano and manually delete the line, but it's not just one file to process.

You might try tr. On another note, here's a C program which will do what
you want. It's written as a filter, so no file names on the command
line... this is strictly no-frills programming. Placed into the public
domain by me, the original author, today, Thursday 3 August 2006. If you
*need* file names on the command line (like for use with find and xargs)
then I can add that, but I thought something quick'n'nasty might
be more what you need.

---- nonul.c ----
#include <stdio.h>
#include <stdlib.h>

#define NUL 0x00

#define OMIT NUL

int     main(void) {
    int     Chr;

    /* Copy stdin to stdout, dropping every OMIT character. */
    while ((Chr = getchar()) != EOF) {
        if (Chr != OMIT)
            putchar(Chr);
    }
    return EXIT_SUCCESS;
}
---- end nonul.c ----

$ gcc -o nonul nonul.c
$ hexdump -C junk.txt
00000000  43 6f 6d 6d 65 6e 74 3a  20 00 00 00 00 00 00 00  |Comment: .......|
00000010  00 00 00 00 00 00 0a 0a                           |........|
$ ./nonul < junk.txt >junk1.txt
$ hexdump -C junk1.txt
00000000  43 6f 6d 6d 65 6e 74 3a  20 0a 0a                 |Comment: ..|

If you find that the characters are something other than NUL
(ASCII code 0x00), then just substitute that for NUL. For example
if it is backspace (BS) then add this line...

#define BS 0x08

and change OMIT to BS

#define OMIT BS


HTH. If not, then I can ship you a program.

