
Re: sha256sum --text generating blank spaces and hyphens?

On 4/27/23, David Christensen <dpchrist@holgerdanske.com> wrote:
> Please see the OP, step (d).

>On 4/26/23, Albretch Mueller <lbrtchx@gmail.com> wrote:
>>  a) encode the string name as base64
>> b) calculate the sha256sum of §a
>>  c) use §b as file name (of course, leaving the original extension as it
>> is)
>>  d) include a "§b_file_name.txt" plain text file descriptor which only
>> content is the actual prehash name of that file.
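
 For reference, steps (a) to (d) could be sketched in shell roughly as
follows. This is only a minimal sketch under some assumptions: the
function name rename_by_name_hash is made up for illustration, the
whole original name (extension included) is what gets base64-encoded,
and GNU coreutils (base64, sha256sum) are available.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): rename_by_name_hash FILE
# Renames FILE to <sha256 of base64(name)><original extension> and writes
# a "<hash>_file_name.txt" descriptor next to it holding the original name.
rename_by_name_hash() {
    src=$1
    dir=$(dirname -- "$src")
    base=$(basename -- "$src")

    # Keep the original extension as it is (empty if the name has none).
    case $base in
        ?*.*) ext=".${base##*.}" ;;
        *)    ext="" ;;
    esac

    # a) encode the name as base64 (tr removes any line wrapping)
    b64=$(printf '%s' "$base" | base64 | tr -d '\n')

    # b) sha256sum of (a); when reading stdin, sha256sum prints
    #    "<hash>  -", so cut keeps only the hash field
    hash=$(printf '%s' "$b64" | sha256sum | cut -d ' ' -f 1)

    # c) use (b) as the file name
    mv -- "$src" "$dir/$hash$ext"

    # d) plain text descriptor whose only content is the pre-hash name
    printf '%s\n' "$base" > "$dir/${hash}_file_name.txt"

    # Report the new file name to the caller.
    printf '%s\n' "$hash$ext"
}
```

 Calling it as rename_by_name_hash "./some file.pdf" would leave a
64-hex-digit name ending in .pdf plus the matching descriptor file in
the same directory.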

 I do that because base64 should (must?) work on any OS, and converting
to and from any other encoding is straightforward. As you suggested, I
am now more open to the idea of also including hashes of the data
payload, even though I think it is not that important: the really big
problem that corpora research people have is files with exactly the
same look and feel and the same content which nevertheless have
different hashes (for example, pdf files). I have been thinking about
a way to compute hashes that more faithfully reflect both structural
and content similarity among files. Do you know of any way to do such
a thing? The structural aspect should be "easy": it could be handled
as DAGs of some sort of XPaths.

  I was actually going to show you what I meant, but I was happy to
see that "I was wrong". I even waited to try it from some other access
point. I have used this one-liner before to show how
google/youtube/NSA/"Vladimir Putin"/... was watermarking files for
whatever reason, but it worked fine when I was trying to show it to
you ;-)

_YT_URI=EngW7tLk6R8; _OFL="${_YT_URI}_"$(date +%Y%m%d%H%M%S)".mp4";
./yt-dlp --verbose --format "mp4" --output "${_OFL}" -- "${_YT_URI}";
ls -l "${_OFL}"; file --brief "${_OFL}"; time sha256sum "${_OFL}"

-rwxrwxrwx 1 user user 828540 Aug 15  2022 EngW7tLk6R8_20230501185618.mp4
ISO Media, MP4 v2 [ISO 14496-14]

-rwxrwxrwx 1 user user 828540 Aug 15  2022 EngW7tLk6R8_20230501185657.mp4
ISO Media, MP4 v2 [ISO 14496-14]

 Max Nikulin (12023-04-28):
> And you will quickly face servers that sends incorrectly Content-Type or
> intentionally put application/octet-stream with no sniff header to force
> browser to save the file instead of opening it e.g. in built-in PDF
> reader.

 Even if it is not a purely syntactic problem (so you can't fully
solve it with code alone), it is relatively manageable. You would:

 a) take note of the sites that do such things;
 b) look not only at the HTTP headers, but also at the extension of
the file; and
 c) save the file to a temp repository for the Linux util "file" to be
run on it ...

 From those heuristics you should be able to work out a strategy around such problems.
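
 Heuristics (b) and (c) could be combined in a small shell function
along these lines. Again, this is just a sketch: check_download is a
hypothetical name, the declared Content-Type is assumed to have been
extracted from the HTTP headers already, and the "file" util is
assumed to support --mime-type.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post):
#   check_download FILE DECLARED_TYPE
# Prints the declared type, the type detected by file(1) and the file
# extension; returns 0 only if declared and detected types agree.
check_download() {
    f=$1
    declared=$2

    # b) notice the extension of the file (empty if there is none)
    name=$(basename -- "$f")
    case $name in
        ?*.*) ext=${name##*.} ;;
        *)    ext="" ;;
    esac

    # c) let file(1) look at the actual content of the saved file
    detected=$(file -b --mime-type -- "$f")

    printf 'declared=%s detected=%s extension=%s\n' \
        "$declared" "$detected" "$ext"

    # A mismatch is a hint that the server is lying about the type.
    [ "$declared" = "$detected" ]
}
```

 Something like check_download "$tmpfile" application/octet-stream ||
echo "suspicious" would then flag servers that force a download by
declaring a generic type for, say, a pdf.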

