
Re: piping find to zip -- with spaces in path



Bob Proulx wrote:
> Real Unix(TM) users never put [^[:ascii:]] characters in file names.

That is what I get for attempting humor on a technical list!  Sure,
spaces and other whitespace are ASCII, and so the attempt inevitably
draws a technical correction of my blown punch line.  Oh the
humanity!  :-)

Robert Blair Mason Jr. wrote:
> True.  Underscores are _wonderful_ things.  But remember, Linux is Not Unix!

Right.  Linux is a kernel.  Or are you trying to start a new LNU
acronym? :-)

Doug wrote:
> The comment about real Unixers not using ascii characters: what
> about urls?  They come from the Unix world, and are full of
> underscores and question marks and equal signs.  Then there are
> emails, all of which require the @ sign.  Not complaining, just
> asking.

URLs use only a subset of ASCII.  Space, at, underscore, and question
mark are all ASCII characters.  See RFC 1738 / 3986 for details, but
here are some snippets from RFC 1738:

   Octets must be encoded if they have no corresponding graphic
   character within the US-ASCII coded character set, if the use of
   the corresponding character is unsafe, or if the corresponding
   character is reserved for some other interpretation within the
   particular URL scheme.

   URLs are written only with the graphic printable characters of the
   US-ASCII coded character set. The octets 80-FF hexadecimal are not
   used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
   control characters; these must be encoded.

   Characters can be unsafe for a number of reasons.  The space
   character is unsafe because significant spaces may disappear and
   insignificant spaces may be introduced when URLs are transcribed or
   typeset or subjected to the treatment of word-processing programs.

So as you can see, whitespace isn't safe to use in URLs.  This is
basically the same as for Unix filenames.  In URLs a space is
typically encoded as '%20', or as '+' in form-encoded query strings.
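
To make the encoding concrete, here is a minimal sketch using Python's
standard urllib.parse module (my choice for illustration; the file
name is made up).  quote() produces the '%20' percent-encoding used in
URL paths, while quote_plus() produces the '+' form used in
form-encoded query strings.

```python
from urllib.parse import quote, quote_plus

name = "my file.txt"  # hypothetical file name containing a space

path_form = quote(name)        # percent-encoding, as used in URL paths
query_form = quote_plus(name)  # form encoding, as used in query strings

print(path_form)   # my%20file.txt
print(query_form)  # my+file.txt
```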

Robert Blair Mason Jr. wrote:
> Well, to be technical, almost all characters *ARE* ascii.  Just not
> the alphanumeric subset.

Single byte values run from 0 to 255 and ASCII defines 0-127, meaning
that exactly half are ASCII and half are not.  Of the half that are
defined, 00-1F plus 7F (DEL) are control characters, non-printable,
and not valid in URLs.  All of those control characters except 0 (NUL)
can be used in filenames, though.
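
As a rough sketch of that arithmetic (the function name and the three
category labels are mine, purely for illustration), this walks all 256
single byte values and counts them.  DEL is hex 7F, and "printable"
here includes the space character.

```python
def classify(octet):
    """Classify one byte value (0-255) into an illustrative category."""
    if octet > 0x7F:
        return "non-ascii"            # 80-FF: not defined by ASCII
    if octet <= 0x1F or octet == 0x7F:
        return "control"              # 00-1F plus 7F (DEL)
    return "printable"                # 20-7E, space included

counts = {}
for octet in range(256):
    kind = classify(octet)
    counts[kind] = counts.get(kind, 0) + 1

print(counts)  # {'control': 33, 'printable': 95, 'non-ascii': 128}
```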

For multibyte character sets such as Unicode there are definitely far
more than 128 characters defined, meaning that almost all multibyte
characters are not ASCII. :-)

> The underscore is often included in the set of 'ascii characters'.

Underscore is a valid ASCII character.  It follows right after the ^.

The underscore is not valid in a hostname, however.  Only ASCII
letters A through Z plus digits 0 through 9 plus the hyphen are valid.
See RFC 952 / 1123 for details and for position requirements.  The
underscore is a common stumbling block when naming hosts.
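
A minimal sketch of such a check, assuming the usual reading of RFC
952 / 1123: each dot-separated label is 1 to 63 characters of letters,
digits, and hyphens, with no hyphen at either end, and the whole name
is at most 253 characters.  The function name is hypothetical, and a
trailing dot is simply rejected to keep the sketch short.

```python
import re

# One label: starts and ends with a letter or digit, hyphens only inside.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_valid_hostname(name):
    """Return True if every dot-separated label passes the label rule."""
    if not name or len(name) > 253:
        return False
    return all(LABEL.fullmatch(label) for label in name.split("."))

print(is_valid_hostname("my-host.example.com"))  # True
print(is_valid_hostname("my_host.example.com"))  # False
```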

> Generally, what is meant by non-ascii-character is any character
> which might have a special meaning to the users shell.

I was sloppy in my joke.  Sorry.

Bob
