
Re: sha256sum --text generating blank spaces and hyphens?



On 4/26/23 15:48, Nicolas George wrote:
> David Christensen (12023-04-26):
>> I suggest hashing the document content rather than the URL.  This would work
>> nicely for static documents.
>
> That will be very convenient to retrieve the document content from the
> URL.


My suggestion assumes that the URL => hash => content mapping is saved somehow. For example, save the content in a file named after the hash and save the URL in a file whose name is the hash plus a suffix. Finding a document by URL then becomes a grep(1) invocation.
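
A minimal sketch of that layout in Python, in case it helps. The ".url" suffix, the function names, and the store directory are my own placeholders, not anything standard; the hash is SHA-256 to match sha256sum.

```python
import hashlib
from pathlib import Path

URL_SUFFIX = ".url"  # hypothetical suffix for the file that holds the URL


def save_document(store: Path, url: str, content: bytes) -> str:
    """Save content under its SHA-256 hex digest, plus a sidecar file with the URL."""
    store.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(content).hexdigest()
    (store / digest).write_bytes(content)
    (store / (digest + URL_SUFFIX)).write_text(url + "\n", encoding="utf-8")
    return digest


def find_by_url(store: Path, url: str) -> list:
    """Return the hashes whose sidecar file contains exactly this URL."""
    hits = []
    for sidecar in sorted(store.glob("*" + URL_SUFFIX)):
        if sidecar.read_text(encoding="utf-8").strip() == url:
            # Strip the suffix to recover the name of the content file.
            hits.append(sidecar.name[: -len(URL_SUFFIX)])
    return hits
```

With this layout, find_by_url() does roughly what `grep -l -x -F "$url" store/*.url` would do from the shell.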


Things get more interesting when you approach the problem as a database. Save the content wherever and put the metadata into a table -- content hash (primary key), URL, download timestamp, author, subject, title, keywords, etc. Create fully inverted indexes. Create a search engine. Create a spider. Implementation could range from a CSV/TSV flat file and shell/P* scripts, to a desktop database/UI, to a LAMP stack, and beyond (NoSQL, N-tier). There are distributed file sharing systems based on such ideas.
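
As a rough illustration of the metadata table, here is a sketch using Python's sqlite3 module. SQLite, the table name, the column names, and the helper functions are all my assumptions; any of the implementations above would do. Only a plain index on the URL is shown, not a full inverted index.

```python
import sqlite3

# Hypothetical schema: one row per document, keyed by content hash.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    content_hash TEXT PRIMARY KEY,
    url          TEXT NOT NULL,
    downloaded   TEXT NOT NULL,  -- download timestamp, e.g. ISO 8601
    author       TEXT,
    subject      TEXT,
    title        TEXT,
    keywords     TEXT
);
CREATE INDEX IF NOT EXISTS documents_url ON documents (url);
"""


def open_catalog(path=":memory:"):
    """Open (or create) the catalog database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_document(conn, content_hash, url, downloaded, **meta):
    """Insert or overwrite the metadata row for one content hash."""
    conn.execute(
        "INSERT OR REPLACE INTO documents "
        "(content_hash, url, downloaded, author, subject, title, keywords) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (content_hash, url, downloaded, meta.get("author"),
         meta.get("subject"), meta.get("title"), meta.get("keywords")),
    )
    conn.commit()


def hashes_for_url(conn, url):
    """Return the content hashes recorded for a URL."""
    rows = conn.execute(
        "SELECT content_hash FROM documents WHERE url = ? ORDER BY content_hash",
        (url,),
    )
    return [row[0] for row in rows]
```

One consequence of using the content hash as the primary key: identical content fetched from two URLs ends up as a single row, so a real design might want a separate URL table.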


David

