
Re: sha256sum --text generating blank spaces and hyphens?



On 4/26/23 16:21, Albretch Mueller wrote:
On 4/26/23, David Christensen <dpchrist@holgerdanske.com> wrote:
I suggest hashing the document content rather than the URL.  This would
work nicely for static documents.

  What do you mean by "hashing the document content"?


2023-04-26 21:03:08 dpchrist@taz ~
$ touch foo

2023-04-26 21:03:12 dpchrist@taz ~
$ sha256sum foo
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  foo


In this case, the content is an empty string, and the hexadecimal encoding of its SHA256 hash is "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855".
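
To show that only the bytes matter and not the file name, here is a minimal sketch. The helper name content_hash and the file names are made up for illustration. Reading from stdin makes sha256sum print "-" in place of a file name, and cut keeps only the digest.

```shell
#!/bin/sh
# content_hash FILE: print only the hex SHA256 digest of FILE's content.
# (Hypothetical helper name, for illustration only.)
content_hash() {
    # Feeding the file on stdin makes sha256sum print "<digest>  -";
    # cut keeps the first space-separated field, i.e. the digest.
    sha256sum < "$1" | cut -d ' ' -f 1
}

dir=$(mktemp -d)
printf 'same text\n' > "$dir/a.txt"
cp "$dir/a.txt" "$dir/b.txt"

# Two different names, identical content: both lines show the same digest.
content_hash "$dir/a.txt"
content_hash "$dir/b.txt"

rm -r "$dir"
```

Two files with different names but the same content get the same digest, which is what makes the digest usable as a name for the content.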


  How would that help when what you are trying to do is cleanse and
canonize texts as best as you could to find relationships among their
text segments?

  lbrtchx


* Each unique text would be stored once regardless of how many URLs link to it.

* If the content at a URL changes, the new content will have a new hash. So, the new content will be saved and the old content will be preserved (instead of the new content overwriting the old content).

* With regard to my response to the post by Nicolas George, a database of metadata could benefit analysis regardless of the scheme used to name content files.
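
A rough sketch of the first two points, assuming each downloaded text has already been saved to a local file. The function name store_text, the store directory, and the sample file are all hypothetical; this is one possible layout, not a finished tool.

```shell
#!/bin/sh
# store_text FILE STORE_DIR: save FILE under its SHA256 digest in STORE_DIR
# and print the digest. (Hypothetical helper, for illustration only.)
store_text() {
    hash=$(sha256sum < "$1" | cut -d ' ' -f 1) || return 1
    mkdir -p "$2" || return 1
    # Copy only if this exact content has not been stored before.
    [ -e "$2/$hash" ] || cp "$1" "$2/$hash" || return 1
    printf '%s\n' "$hash"
}

work=$(mktemp -d)

printf 'version 1\n' > "$work/page"
store_text "$work/page" "$work/store"   # first copy is stored
store_text "$work/page" "$work/store"   # same digest, nothing new is stored

printf 'version 2\n' > "$work/page"
store_text "$work/page" "$work/store"   # new digest; version 1 is kept

ls "$work/store"                        # two files, one per unique content
```

Which URLs pointed at which digest, and when, would have to be recorded separately, for example in the metadata database mentioned in the last point.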


David


