Re: How could you load only once a Linux utility without a batch --input-files kind of option and repeatedly use it on many files? . . .



On 2020-05-14 03:47, Albretch Mueller wrote:
  The thing is that I have to call, say, sha256sum on millions of files.

  Probably Debian admin people dealing with packaging have to deal with
the same kinds of issues.

  lbrtchx


The need to checksum files is common; it is a good test case for trying out different computing paradigms and/or programming languages.


As other people have mentioned, using find(1) or xargs(1) from the command line to invoke sha256sum(1) is one possibility. All of these tools are mature and should produce predictable results. When used correctly, their performance is good. For ad-hoc tasks, this is how it's done.
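

For example, with /some/dir standing in for your input tree, either of these one-liners checksums every regular file under it:

    find /some/dir -type f -exec sha256sum {} \;

    find /some/dir -type f -print0 | xargs -0 sha256sum

The -print0/-0 pair keeps filenames containing spaces or newlines intact; note it is a GNU/BSD extension rather than strict POSIX.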


If you find that you need to parameterize the invocation -- say, one set of arguments and/or options for one set of files and another for other files -- you can cut and paste an example invocation into a text file, replace the varying parts with variables, and add the code that turns the file into a script. I would start with a Bourne shell script. Of course, there are many other scripting languages to choose from; pick your favorite.
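

A minimal sketch, assuming a hypothetical script name checksum.sh and that the input directory and output file are the parts worth parameterizing:

    #!/bin/sh
    # checksum.sh -- hypothetical example: checksum every regular
    # file under DIR and collect the results in OUTFILE.
    # Usage: ./checksum.sh DIR OUTFILE

    dir=${1:?usage: $0 DIR OUTFILE}
    out=${2:?usage: $0 DIR OUTFILE}

    find "$dir" -type f -exec sha256sum {} + > "$out"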


Even if you do not need parameterization, typing './myscript' requires fewer keystrokes and less mental effort than recalling a find(1) incantation over and over again. And it provides consistency. These considerations are important when you are brain-fried and heading for logoff, or crawling through the files months later.


As you plan to perform the SHA256 computation a great many times, you should consider the cost of Unix process creation and tear-down -- e.g. CPU cycles (time) and memory usage. If you write a program that computes many checksums per process, it will have less overhead and should finish in less time than a program that creates one process per input file. Benchmarking will tell.
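

A quick, informal way to see the difference on your own tree (again, /some/dir is a placeholder):

    # one sha256sum process per file -- pays process setup and
    # teardown once per input:
    time find /some/dir -type f -exec sha256sum {} \; > /dev/null

    # as many files per sha256sum process as the argument list
    # allows -- pays that cost once per batch:
    time find /some/dir -type f -exec sha256sum {} + > /dev/null

On a large tree, the second form is usually noticeably faster, but measure on your own data.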


Process overhead is related to the desired output format. Obvious choices include one checksum file for all input files vs. one checksum file per input file. The plus sign in the '-exec command {} +' option to find(1) facilitates the former, and should be efficient.
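

For example (SHA256SUMS is just a conventional name, nothing mandated):

    find /some/dir -type f -exec sha256sum {} + > SHA256SUMS

    # later, verify everything in one pass:
    sha256sum -c SHA256SUMS

One caveat: if SHA256SUMS lives under /some/dir itself, exclude it from the find, or it will end up in its own checksum list.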


Also, where to put the output file(s) -- in the current working directory, within the input tree, within a parallel tree, or someplace else? One output file for everything is easiest, but my archive and image scripts checksum the input files individually and touch(1) the checksum file modification times to match.
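

In simplified sketch form (not my scripts verbatim), that per-file approach looks something like:

    # one FILE.sha256 beside each input, with its mtime copied from
    # the input via touch -r.  A newline in a file name would break
    # the read loop; -print0 handling is left out for brevity.
    find /some/dir -type f ! -name '*.sha256' |
    while IFS= read -r f; do
        sha256sum "$f" > "$f.sha256"
        touch -r "$f" "$f.sha256"
    done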


Another consideration is concurrency. If you have a multi-core processor, a solution that puts two or more cores to work at the same time should finish sooner than a sequential program. Again, benchmarking.
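

GNU xargs(1) gives you that without writing any real code via -P; nproc(1) reports the core count (both are GNU extensions):

    # one sha256sum worker per core, 64 files per invocation.  Note
    # that output order becomes nondeterministic, and lines from
    # concurrent workers can in principle interleave.
    find /some/dir -type f -print0 |
        xargs -0 -P "$(nproc)" -n 64 sha256sum > SHA256SUMS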


I find that Bourne shell scripts are comfortable only up to a certain level of complexity. Above that, I use Perl. That said, Go would be well suited to this task and should be faster. Then there are C, assembly, and/or hardware acceleration.


David

