>> A program that is CPU-bound *and* can be encoded more efficiently will
>> benefit from compiler optimizations. Some CPU-bound things just aren't
>> going to be helped much by vectorization, instruction reordering, etc. I
>> mean, integer multiply is integer multiply.

> But if the target CPU supports pipelining and has multiple multiplication
> units (which means it can do them in parallel), or can do a 128-bit multiply,
> or 1 64-bit multiply, at once, then it's more efficient to do a partial
> loop unroll, and thereby have faster code, because of more efficient
> parallelization.
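
To make the partial-unroll point concrete, here is a minimal sketch in C. The function names (dot_simple, dot_unrolled4) and the unroll factor of 4 are my own choices for illustration, not anything from the posts above. The idea is that separate accumulators break the dependency chain, so a pipelined or multi-unit CPU can overlap the multiplies. Whether it actually wins depends on the target, and a good compiler may already do this at higher optimization levels.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: every multiply-add feeds the same accumulator,
 * so each iteration depends on the previous one. */
uint64_t dot_simple(const uint32_t *a, const uint32_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (uint64_t)a[i] * b[i];
    return sum;
}

/* Partially unrolled by 4 (an arbitrary factor chosen for illustration).
 * The four accumulators are independent of each other, which gives the
 * hardware a chance to run the multiplies in parallel. */
uint64_t dot_unrolled4(const uint32_t *a, const uint32_t *b, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += (uint64_t)a[i]     * b[i];
        s1 += (uint64_t)a[i + 1] * b[i + 1];
        s2 += (uint64_t)a[i + 2] * b[i + 2];
        s3 += (uint64_t)a[i + 3] * b[i + 3];
    }

    /* Handle the leftover 0..3 elements. */
    for (; i < n; i++)
        s0 += (uint64_t)a[i] * b[i];

    return s0 + s1 + s2 + s3;
}
```

Both functions return the same value for any input, since unsigned addition wraps and can be reordered freely. Only the instruction schedule differs.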

Converting some multiplies to shifts (or a shift plus some other arithmetic),
or arranging that one of the source registers normally contains the smaller
value, can also help. (At least on ARM, where some cores finish a multiply
early when one operand is small...)
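
A rough sketch of the shift trick in C, for multiplies by a known constant. The helper names are made up for illustration. Most compilers will do this rewrite themselves when they think it pays off, so treat this as a picture of what the generated code amounts to rather than something to hand-write everywhere.

```c
#include <stdint.h>

/* x * 10 == x * 8 + x * 2 */
static inline uint32_t mul10_shift(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* x * 5 == x + x * 4. On ARM this shape can map to a single ADD with a
 * shifted second operand. */
static inline uint32_t mul5_shift(uint32_t x)
{
    return x + (x << 2);
}

/* x * 7 == x * 8 - x. On ARM this shape can map to a single
 * reverse-subtract (RSB) with a shifted operand. */
static inline uint32_t mul7_shift(uint32_t x)
{
    return (x << 3) - x;
}
```

Because the arithmetic is unsigned, the shifted forms wrap exactly like the plain multiply does, so they are drop-in equivalents even on overflow.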

