Re: sha256sum --text generating blank spaces and hyphens?
On Wed, Apr 26, 2023 at 3:42 AM Albretch Mueller <lbrtchx@gmail.com> wrote:
>
> This is not a debian question per se (more like a Linux bash one),
> but I wasn't able to find an answer on the Internet.
>
> Here is first the problem I am having before you start reading a
> conspiracy theory into it ;-)
>
> I need to somehow map URLs on the web to local files, but you can't
> do that directly for two main reasons:
>
> 1) URLs are free text
> 2) which people take to their heart's content.
>
> Take for example:
>
> https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html
>
> that file and the pdf you would download I need to map to a local
> directory looking like: ... /pub/dokumen/qdownload/ ...
>
> but the file name (excluding the extension) is 306 characters long,
> which Windows NTFS would not swallow. There may also be funky rules
> regarding character sets and where in a string certain chars may be
> used; so, as a way to work around those kinds of problems I:
>
> a) encode the string name as base64
> b) calculate the sha256sum of §a
> c) use §b as file name (of course, leaving the original extension as it is)
> d) include a "§b_file_name.txt" plain text file descriptor whose only
> content is the actual prehash name of that file.
>
>
> https://dokumen.pub/qdownload/nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860.html
> _TXT="nietzsche-und-der-deutsche-geist-band-4-ausbreitung-und-wirkung-des-nietzscheschen-werkes-im-deutschen-sprachraum-bis-zum-ende-des-zweiten-weltkrieges-ein-schrifttumsverzeichnis-der-jahre-1867-1945-ergnzungen-berichtigungen-und-gesamtverzeichnisse-zu-den-bnden-i-iii-9783110202861-9783110189865-3110189860"
> _B64TXTENC=$(printf '%s' "${_TXT}" | base64 )
> echo "// __ \$_B64TXTENC: |${_B64TXTENC}|"
> _B64TXTDEC=$(printf '%s' "${_B64TXTENC}" | base64 --decode)
> echo "// __ \$_B64TXTDEC: |${_B64TXTDEC}|"
> if [[ "${_TXT}" == "${_B64TXTDEC}" ]]; then
> echo "// __ [[ \${_TXT} == \${_B64TXTDEC} ]]: |${_TXT}|"
> _SHA256=$(printf '%s' "${_TXT}" | sha256sum --text )
> echo "// __ \$_SHA256: |${_SHA256}|"
> fi
>
> // __ $_SHA256:
> |7d5895cb24ab49692a8ad495e036074fec8e61b22040544f02a9b69c926dbdeb -|
>
> I am trying to avoid funky characters and sha256sum --text still
> generates them!?!
>
> I work like this because I need to replicate the original URL as a local
> path in a way that would be compatible with any file system.
>
> Do you know of a better way to deal with such issues?
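On the blank spaces and hyphen themselves: sha256sum always prints a
file-name field after the digest, and "-" is the name it reports when
it reads standard input. On GNU systems --text does not change the
digest. A minimal sketch of keeping only the hex digest follows;
hex_digest is a hypothetical helper name, not something coreutils
provides.

```shell
#!/usr/bin/env bash

# hex_digest is a hypothetical helper name used only for illustration.
# It hashes its first argument and prints just the 64 hex characters.
hex_digest() {
    # sha256sum prints "<hex>  -" when reading stdin;
    # cut keeps only the first space-separated field.
    printf '%s' "$1" | sha256sum | cut -d ' ' -f 1
}

hex_digest "abc"
```

The same effect can be had with bash parameter expansion, for example
"${_SHA256%% *}", which strips everything from the first space onward.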
There's no guarantee a URL will map onto a filesystem. I seem to
recall Stunnel tried to do that in a caching mode, but it had weird
corner cases. (In addition to problems with filesystems that had
character set and path limitations).
I think your best bet is to digest the URL into a fixed-length
representation. I suggest using SipHash+Base64 or Base64URL. SipHash
provides collision resistance, a uniform distribution, and it's fast.
SipHash has a very good pedigree since it was designed by Jean-Philippe
Aumasson and Daniel J. Bernstein. The final Base64 or Base64URL encoding
ensures you stay within the printable character range. Base64URL is the
safer of the two for file names, since plain Base64 can emit "/".
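As a rough sketch of the digest-then-encode idea, the snippet below
turns a URL into a short, file-system-safe name. Two assumptions to
flag: coreutils ships no SipHash tool, so SHA-256 is swapped in as the
digest here, and url_to_name is a made-up helper name. The hex digest
is converted back to raw bytes with bash's printf '%b' before encoding,
and the Base64 alphabet is mapped to Base64URL with tr.

```shell
#!/usr/bin/env bash

# url_to_name is a hypothetical helper, not an existing command.
# Assumption: SHA-256 stands in for SipHash, which coreutils lacks.
url_to_name() {
    local hex escaped
    # Hex digest only; drop the trailing "  -" file-name field.
    hex=$(printf '%s' "$1" | sha256sum | cut -d ' ' -f 1)
    # Turn "ab12..." into "\xab\x12..." so printf '%b' emits raw bytes.
    escaped=$(printf '%s' "$hex" | sed 's/../\\x&/g')
    # Base64-encode the 32 raw bytes, switch to the Base64URL alphabet,
    # and strip the "=" padding and the trailing newline.
    printf '%b' "$escaped" | base64 | tr '+/' '-_' | tr -d '=\n'
}

# Example: keep the original extension, as in step (c) of the question.
url='https://example.org/some/very-long-name.html'
printf '%s.%s\n' "$(url_to_name "$url")" "${url##*.}"
```

The resulting name is 43 characters drawn from A-Z, a-z, 0-9, "-" and
"_", so it should fit within NTFS limits. Bear in mind that NTFS is
usually case-insensitive while Base64URL is case-sensitive, so the
plain hex digest may be the more portable choice there.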
Jeff