
Re: piping find to zip -- with spaces in path



Bob Proulx wrote:
Dan B. wrote:
Bob Proulx wrote:
So as you can see whitespace isn't safe to use in URLs.  This is
basically the same as for Unix filenames.

They're not quite the same:

Not quite the same is basically the same here.  :-)

Okay.  They're not the same.  So they're not basically the same.

The question of the topic was:

   ... what about urls?  They come from the Unix world, and are full of
   underscores and question marks and equal signs.  Then there are
   emails, all of which require the @ sign.  Not complaining, just
   asking.

I think "basically the same" describes things adequately.

Having a clear specification and having things build on it
consistently (at least mostly) is not basically the same as not
having any clear specification and having inconsistent support for
spaces.

There's a major difference between the two--in the first case you
can easily, definitely know what must be supported (at least to be
compliant, correct, complete, or whatever) but in the second case
you can't.


People with
a Unix background wouldn't normally include spaces in either of file
names or URLs, or other related "handles" to data.  If you do then
they are much more of a pain to manipulate in shell scripts.  And so
you just don't do it...

No argument there.

and don't think about whether it is technically possible or not.

Except when you're trying to write scripts (etc.) that don't break
the instant someone _else_ or some _other_ tool uses characters you
didn't happen to use or think of.

Remember that it's not just space characters.

For example, when a kernel driver file used a comma in its name,
Emacs' etags feature broke because it used commas to delimit
filenames (with no encoding/escaping mechanism).
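
A minimal sketch of that failure mode, assuming nothing about the real
etags file format: the file names below are made up, and the only point
is that a comma-joined list with no escaping cannot be split back into
the names that went in.

```shell
#!/bin/sh
# Hypothetical file names; the second one contains a comma.
set -- "main.c" "serial,8250.c"
real_count=$#

# Join with commas and no escaping, as a naive comma-delimited index would.
joined=$(printf '%s,' "$@")
joined=${joined%,}

# Split the list back apart on commas.
old_ifs=$IFS
IFS=,
set -f            # no globbing while the unquoted expansion is split
set -- $joined
set +f
IFS=$old_ifs
parsed_count=$#

echo "wrote $real_count names, read back $parsed_count"
```

Two names go in and three come out, and nothing in the joined string
says which comma was data and which was a delimiter.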


> ...


In URIs, it's not that whitespace "isn't safe to use"; it's simply
that whitespace is not allowed, period.  ...

No.  Actually it was exactly that, "unsafe".  *Exactly* as I said.

   RFC 1738
   "The space character is unsafe because ..."

Literally they are documented as being "unsafe".

That is not the specification; that is the rationale for what
was specified.


The reason we avoid putting spaces in (real) URIs isn't just that
spaces are unsafe (e.g., something might break).

The reason we avoid putting spaces in (real) URIs is because they
are not allowed by the specification.





Later RFCs have
clarified this somewhat.  But regardless of being unsafe most software
does actually allow them.  (I sometimes see them inappropriately used
in slug lines.)

   wget -O- "http://www.example.com/one two three.html"

Fine.  Wget's argument is not a URI but some other kind of string.

(So does that wget argument generate the URI
"http://www.example.com/onetwothree.html" or the URI
"http://www.example.com/one%20two%20three.html"?)

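To make the two candidates concrete, here is a rough sketch of both
mappings in shell.  The helper names strip_spaces and encode_spaces are
hypothetical, they handle only the space character, and nothing here
claims to show which mapping wget itself performs.

```shell
#!/bin/sh
# Hypothetical helper: drop every space from the string.
strip_spaces() {
    printf '%s\n' "$1" | tr -d ' '
}

# Hypothetical helper: replace every space with %20.
# Only spaces are handled; a real encoder would have to cover every
# character outside the set the URI syntax allows.
encode_spaces() {
    printf '%s\n' "$1" | sed 's/ /%20/g'
}

arg='http://www.example.com/one two three.html'
stripped=$(strip_spaces "$arg")
encoded=$(encode_spaces "$arg")

printf '%s\n' "$stripped" "$encoded"
```

The two results name different resources, which is why it matters which
mapping a tool applies to a string that is not itself a URI.
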


And even though the space hasn't been included in the possible
characters RFC 3986 includes this statement:

    Using <> angle brackets around each URI is especially recommended
    as a delimiting style for a reference that contains embedded
    whitespace.

Pay attention to the words "reference" and "URI" there.  They don't
mean the same thing:

That's so that a _reference_ in text to a URI can have whitespace
inserted to allow line wrapping.  The thing that contains that
space is NOT a URI, but some kind of reference to a URI--and the
W3C says that such whitespace is removed when mapping that reference
back to the URI it represents (that is, the whitespace characters are
not to be encoded as %20, etc.).


...

Unfortunately, on the other hand, Unix filenames have no
corresponding specification, at least one that is followed
consistently.  The kernel and file systems allow spaces, and
some utilities/commands/scripts/etc. do, but many don't.

The Unix filesystem allows all characters except for the zero
character.  Because the zero character delimits the end of the string
it cannot be used in the string.  And of course the '/' is used as a
directory separator.  If an application doesn't allow other characters
then it is arguably a bug in that application.  (However the
application may document its limitations and stop there.)  Core
utilities will of course be okay but I am sure that fringe
applications have bugs in them.

Right, except that without a clear specification of what's supposed
to be supported, it's harder to say whether an application is buggy
or merely limited.
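
Since the zero byte is the one character that can never appear in a
name, it is also the one delimiter that is always safe.  A minimal
sketch, assuming a find that supports -print0 (GNU and BSD find do) and
a mktemp that supports -d; the file names are invented for the
demonstration.

```shell
#!/bin/sh
# Scratch directory holding three hypothetical, awkward file names.
dir=$(mktemp -d)
: > "$dir/one two three.html"
: > "$dir/serial,8250.c"
: > "$dir/line
break.txt"

# Newline-delimited listing: the embedded newline splits one name in two.
by_newline=$(find "$dir" -type f | wc -l | tr -d ' ')

# NUL-delimited listing: keep only the NUL terminators and count them.
by_nul=$(find "$dir" -type f -print0 | tr -cd '\0' | wc -c | tr -d ' ')

echo "newline-delimited: $by_newline, NUL-delimited: $by_nul"
rm -rf "$dir"
```

The newline-delimited count comes out as 4 for 3 files, while the
NUL-delimited count matches.  That only helps, of course, when the tool
on the reading end of the pipe accepts NUL-delimited input.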


Partially, I wish Unix/Linux were like VMS in how VMS defined
which characters were legal or not in filenames.  Because of that
definition, it was clear which characters would never, ever appear
in a file name and therefore could be used in, say, command line
interpreter syntax without causing things to break on somebody
else's chosen file names.



Daniel

