
Re: remove/replace non-ascii characters from file



Johannes Wiedersich wrote:
Mike McCarty wrote:

garbage (represented as ^@^@^@^@^@^@ etc.)


I suppose you mean "non-graphic ASCII". Those are NUL characters,
which the ASCII *definition* states can be inserted or removed
from *any* stream without changing its meaning. This means that
your application is not ASCII compliant. Sorry, but in this case
(unusual, I know) Windows is right and your app is wrong.


Well, I don't know that much about the ASCII *definition*, but if I open the file in Window$ notepad (I never use that for any purpose, I just did it out of curiosity), these characters appear as additional spaces. When the file is saved from there, they are written out as real spaces (i.e. linux-compliant spaces).

So, if you are right, that means that M$ notepad converts these NUL characters to spaces, which is a bad thing, if these are indeed different characters and useful for anything.

Yes, it is doing a Bad Thing. ASCII was originally intended for use
as an Information Interchange, including use over serial lines,
and to slow (mechanical) printers connected on the other end.
The purpose of NUL was to allow the sender to pad the transmission
after sending characters which might take the receiver a "long"
time to process, like CR (carriage return). They are like NOPs in
computer programming. They eat time, but otherwise do nothing
else. One is supposed to be able to insert or delete them from
any ASCII stream without changing the meaning of the stream.
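
To make the padding idea concrete, here is a minimal sketch in C of a
sender that inserts NULs after each CR. The function name send_padded
and the pad count are assumptions made up for illustration; they do not
come from any real terminal program.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copy `len` bytes of `text` to `out`, adding `pad`
 * NUL characters after every carriage return so that a slow receiver has
 * time to finish moving the carriage. Returns the number of bytes written. */
size_t send_padded(FILE *out, const char *text, size_t len, size_t pad)
{
    size_t written = 0;

    for (size_t i = 0; i < len; i++) {
        fputc((unsigned char)text[i], out);
        written++;
        if (text[i] == '\r') {
            for (size_t k = 0; k < pad; k++) {
                fputc('\0', out);   /* time filler: carries no meaning */
                written++;
            }
        }
    }
    return written;
}
```

A receiver that follows the ASCII definition would simply discard the
NULs, so the padded and unpadded streams mean the same thing.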

The ASCII code for SP (graphic space) is 0x20. The ASCII code for NUL
(null character) is 0x00. They are indeed not the same thing. SP is
supposed to be *meaningful* in an ASCII stream. NUL is not.
Deleting or inserting an ASCII space is supposed to change the meaning of the stream.
For example, "therapist" and "the rapist" do not mean the same thing
(usually).

Anyway, I don't think it is a useful feature of a program to include NUL characters in the header of data files like the present one, which just consists of a short header and two columns of x and y data. I'd be curious about the programmer's reason for putting about 50 of these at the end of the comment.

I have no idea why they were inserted there[*]. They are not very useful
when used to *store* as opposed to *move* data. If one had a very dumb
terminal program, and needed to communicate with some possibly slow
"other" device (like a uController programming EEPROM or the like)
it might be useful to insert NUL characters into the file itself
at strategic points to allow programming time.

[*]A possible guess why they were put there: This is a fixed-length
field, and it makes a C programmer's job a little easier if he reads
a NUL terminated string into a fixed array.
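
A minimal sketch of that guess, in C. The field width COMMENT_LEN and
the function names are assumptions for illustration only; nothing here
is taken from the real file format or from the program that wrote it.

```c
#include <stdio.h>
#include <string.h>

#define COMMENT_LEN 64  /* assumed field width, not from the real format */

/* Write `comment` as a fixed-width field, padding the rest with NULs. */
int write_comment(FILE *out, const char *comment)
{
    char field[COMMENT_LEN];

    memset(field, 0, sizeof field);
    strncpy(field, comment, sizeof field - 1);  /* last byte stays NUL */
    return fwrite(field, 1, sizeof field, out) == sizeof field ? 0 : -1;
}

/* Read the field back; `field` is immediately usable as a C string. */
int read_comment(FILE *in, char field[COMMENT_LEN])
{
    if (fread(field, 1, COMMENT_LEN, in) != COMMENT_LEN)
        return -1;
    field[COMMENT_LEN - 1] = '\0';  /* guard against an unterminated field */
    return 0;
}
```

With a layout like this, a short comment leaves a long run of NULs
behind it in the file, which is the kind of thing a text editor would
show as ^@^@^@.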

You might try tr; for example, tr -d '\000' < infile > outfile deletes
every NUL byte. On another note, here's a C program which will do what
you want. It's written as a filter, so no file names on the line... this
is strictly no-frills programming. Placed into the public domain by
me, the original author, today, Thursday 3 August 2006. If you *need*
file names on the command line (like for use with find and xargs)
then I can add that, but I thought something quick'n'nasty might
be more what you need.
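
The program itself is not reproduced in this message. Purely as an
illustration of what a NUL-stripping filter could look like, and not
the code that was originally posted, here is a minimal sketch in C. The
function name strip_nul is hypothetical.

```c
#include <stdio.h>

/* Copy `in` to `out`, dropping every NUL byte. Returns the number of
 * NULs removed, or -1 on a write error. Hypothetical helper name. */
long strip_nul(FILE *in, FILE *out)
{
    long removed = 0;
    int c;

    while ((c = getc(in)) != EOF) {
        if (c == '\0') {
            removed++;      /* skip the NUL instead of copying it */
            continue;
        }
        if (putc(c, out) == EOF)
            return -1;
    }
    return removed;
}
```

Calling strip_nul(stdin, stdout) from main turns this into a filter
that reads standard input and writes standard output. To replace the
NULs with spaces instead of deleting them, write ' ' to `out` where the
sketch skips the byte.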


I appreciate your effort! I was anyway writing a script to postprocess the data, so the most convenient way was to remove the junk via another command line.

You're welcome, and no problem if you don't use it. It was a 15 minute
effort anyway. I did test it, as you saw, though.

Mike
--
p="p=%c%s%c;main(){printf(p,34,p,34);}";main(){printf(p,34,p,34);}
This message made from 100% recycled bits.
You have found the bank of Larn.
I can explain it for you, but I can't understand it for you.
I speak only for myself, and I am unanimous in that!


