
Re: sha256sum --text generating blank spaces and hyphens?



On 4/26/23, David Wright <deblis@lionunicorn.co.uk> wrote:
> I guess you need the expense of sha256 rather than md5 as you're
> downloading the entire web?

 I am not downloading the entire web. I have no way of knowing how
they came up with the figure, but I think we could use their
estimate when they said that approximately one and a half million books
have ever been published. Think of it! It is not that much data. It
would all fit nicely on one hard drive; include some searching
capability and "bye bye google" will be the name of your movie.

On 4/26/23, Dan Ritter <dsr@randomstring.org> wrote:
> The only characters used in the sha256 hash itself are [a-f] and
> [0-9]

 Yes, I knew that; that is why I could not understand why sha256sum
was being "courteous" to me.

On 4/26/23, Nicolas George <george@nsup.org> wrote:
> shaXsum always writes X/4 hexadecimal nibbles then two spaces then the
> file name. If the input is from stdin, then the convention is the file
> name is ‘-’.
>
> (Well, not always always: if the file name contains very special
> characters, it will use an escaped output format. And there is the -z
> option.)

On 4/26/23, Thomas Schmitt <scdbackup@gmx.net> wrote:
> "FILE" is the minus-sign for standard input. The second blank is there
> to indicate the text mode of sha256sum.
> Only the first blank is somewhat puzzling. But it's always there.
>
>
> https://www.gnu.org/software/coreutils/manual/html_node/sha2-utilities#sha2-utilities
> points to
>
> https://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html
> which says
>   For each file, ‘md5sum’ outputs by default, the MD5 checksum, a space,
>   a flag indicating binary or text input mode, and the file name. Binary
>   mode is indicated with ‘*’, text mode with ‘ ’ (space). Binary mode is
>   the default on systems where it’s significant, otherwise text mode is
>   the default. The cksum command always uses binary mode and a ‘ ’
>   (space) flag.
>
> So the first blank can be relied on and thus the proposal by Andy Smith
> to use "awk '{print $1}'" is valid.

 OK, now I see why cutting the string off at the first space is safe.
I never saw such cases because I always used sha*sum on files. I
expected that if a user feeds in a string via printf, the hash would
be all there was to it. Of course, sha*sum can tell a file apart from
a plain-text string on stdin.
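
 A minimal sketch of the line layout described above, assuming the
default (unescaped) output format: hex digest, one space, a mode flag
(' ' for text, '*' for binary), then the file name. The helper name
parse_checksum_line is hypothetical, and Python's hashlib stands in
for the sha256sum binary.

```python
import hashlib


def parse_checksum_line(line: str) -> tuple[str, str, str]:
    """Split one default-format sha256sum line into (digest, mode, name)."""
    line = line.rstrip("\n")
    # The digest is pure [0-9a-f], so the first space always ends it.
    digest, rest = line.split(" ", 1)
    # The next character is the mode flag: ' ' = text, '*' = binary.
    mode, name = rest[0], rest[1:]
    return digest, mode, name


# Layout of what `printf 'abc' | sha256sum` prints:
# digest, two spaces, then '-' as the name for standard input.
digest = hashlib.sha256(b"abc").hexdigest()
line = f"{digest}  -\n"
print(parse_checksum_line(line))
```

 Taking only the first field, as "awk '{print $1}'" does, corresponds
to keeping just the first element of the returned tuple.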

On 4/26/23, Jeffrey Walton <noloader@gmail.com> wrote:
> There's no guarantee a URL will map onto a filesystem.

> I seem to
> recall Stunnel tried to do that in a caching mode, but it had weird
> corner cases. (In addition to problems with filesystems that had
> character set and path limitations).

 Well, no; and I am fine with:
 a) trying to match both, keeping the URL path as close as possible
 b) the extra juggling of base64-ing and hashing the name of the file ...
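
 Item b) could look roughly like the sketch below. It assumes SHA-256
as the hash (any stable digest would do) and URL-safe Base64 for the
encoding; the function name url_to_filename and the example URL are
hypothetical.

```python
import base64
import hashlib


def url_to_filename(url: str) -> str:
    """Map a URL to a fixed-length name made of filesystem-safe characters."""
    # 32 raw bytes, independent of how long or odd the URL is.
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # URL-safe Base64 uses only A-Z, a-z, 0-9, '-' and '_' (no '/' or '+').
    name = base64.urlsafe_b64encode(digest).decode("ascii")
    # Drop the '=' padding; 43 characters remain for a 32-byte digest.
    return name.rstrip("=")


print(url_to_filename("https://example.org/some/path?x=1"))
```

 A generated name can begin with '-', so on the command line it is
safer to refer to it as ./NAME. Base64 is also case-sensitive, so on a
case-insensitive filesystem the plain hex digest may be the better
choice.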

 Something I have learned as a corpora research kind of guy is never
to try to "educate" people. I just take their sh!t as they
dump it, cleanse it and deal with it!

 You would not hear the end of it if I started telling stories about
the kind of cr@p you find out there when you look at the web from that
point of view: from folks at archive.org who list "Henry
Valentine Miller", "Henry V. Miller", "Henry Miller", "henry miller",
"Miller, Henry", "Miller, Henry 12-1891 06-1980" apparently as
different authors/"creators", to the gutenberg.org large text bank
including some protagonistic bs in the actual texts, to developers of
libreoffice watermarking text with some cr@p which of course is being
used for "monitoring" purposes by the kinds of folks who put
"intelligence" in the names of the organizations they work for and,
to make sure they are making sense, put flags around themselves when
they fart through their mouths whatever nonsense they think of.

 I used to have recurring daydreams about becoming a dictator of the
world ;-) and making people do "the right thing" (tm) ... until I
once had an epiphany while watching Trump talk to a media prestitude who
characteristically wasn't making much sense. After he asked a few
questions trying to make sense of what she was saying, prestitude said
"let me formulate it better". Trump quietly sat back saying: "OK, take
your time"!!!

 I was amazed! There you have someone towards whom the U.S. media, as
a mouthpiece of the status quo, were being viscerally offensive in
anything relating to him, including posting on the front page of
mainstream US newspapers naked pictures of his wife and mother of his
child one month before she became "the first lady", and he took it
easy on her, respectfully! That was the best case I have noticed so
far of "separating the message from the messenger". I mean, people who
erect all those pay walls and somehow see themselves as authoring or
guarding content are not even the messengers, and we all have to put up
with their bs.

> I think your best bet is to digest the URL into a representation. I
> suggest using SipHash+Base64 or Base64URL. SipHash provides collision
> resistance, a uniform distribution, and it's fast. SipHash has a very
> good pedigree since it was designed by Jean-Philippe Aumasson and
> Daniel J. Bernstein. The final Base64 or Base64URL encoding ensures
> you stay within printable character range without reserved file system
> characters.

 Thank you, I will look into what they did when I get a chance,

 lbrtchx

