
[Solved]: Detecting overflow in summation

Problem Detail: 

Suppose I am given an array of $n$ fixed-width integers (i.e. they fit in a register of width $w$), $a_1, a_2, \dots, a_n$. I want to compute the sum $S = a_1 + \ldots + a_n$ on a machine with 2's complement arithmetic, which performs additions modulo $2^w$ with wraparound semantics. That's easy, but the sum may overflow the register size, and if it does, the result will be wrong.

If the sum doesn't overflow, I want to compute it, and to verify that there is no overflow, as fast as possible. If the sum overflows, I only want to know that it does, I don't care about any value.

Naively adding numbers in order doesn't work, because a partial sum may overflow. For example, with 8-bit registers, $(120, 120, -115)$ is valid and has a sum of $125$, even though the partial sum $120+120$ overflows the register range $[-128,127]$.

Obviously I could use a bigger register as an accumulator, but let's assume the interesting case where I'm already using the biggest possible register size.

There is a well-known technique: keep the numbers partitioned by sign and always add a number whose sign is opposite to that of the current partial sum. This avoids overflow at every step, at the cost of a data-dependent access pattern that is not cache-friendly and takes little advantage of branch prediction and speculative execution.
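For reference, a minimal C sketch of that technique, assuming the input has already been partitioned into non-negative values and negative values, and assuming the true sum is representable (detecting an unrepresentable sum is omitted; the function and parameter names are made up for the example):

    #include <stdint.h>
    #include <stddef.h>

    /* Always consume an element whose sign opposes the running sum, so
       every partial sum stays in range whenever the total does. */
    int64_t opposite_sign_sum(const int64_t *pos, size_t npos,
                              const int64_t *neg, size_t nneg)
    {
        int64_t s = 0;
        size_t i = 0, j = 0;
        while (i < npos && j < nneg)
            s += (s >= 0) ? neg[j++] : pos[i++];  /* data-dependent branch */
        while (i < npos) s += pos[i++];  /* only one sign left: partial sums */
        while (j < nneg) s += neg[j++];  /* move monotonically toward the total */
        return s;
    }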

Is there a technique that takes advantage of the permission to let partial sums overflow, and that is faster on a typical machine with an overflow flag, a cache, a branch predictor, and speculative execution and loads?

(This is a follow-up to Overflow safe summation)

Asked By : Gilles

Answered By : AProgrammer

You can add $n$ numbers of size $w$ without any overflow if you are using $\lceil \log_2 n\rceil + w$ bits of arithmetic. My suggestion is to do just that and then check whether the result is in range. Algorithms for multiprecision arithmetic are well known (see TAOCP section 4.3 if you need a reference), and there is often hardware support for addition (a carry flag and an add-with-carry instruction). Even without such support you can implement it without data-dependent jumps (which is good for branch predictors), you need just one pass over the data, and you may visit the data in the most convenient order (which is good for the cache).
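As a concrete sketch of this suggestion for $w = 64$, assuming a compiler that provides the non-standard __int128 type (GCC and Clang do on 64-bit targets); the name checked_sum is made up for the example:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    /* Accumulate into 128 bits (plenty of headroom, since n < 2^64),
       then check that the total fits back into 64 bits. */
    bool checked_sum(const int64_t *a, size_t n, int64_t *out)
    {
        __int128 s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];              /* one pass, no data-dependent jumps */
        if (s < INT64_MIN || s > INT64_MAX)
            return false;           /* the true sum overflows w bits */
        *out = (int64_t)s;
        return true;
    }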

If the data doesn't fit in memory, the limiting factor will be I/O and how well you succeed in overlapping it with the computation.

If the data fit in memory, you'll probably have $\lceil \log_2 n\rceil \leq w$ (the only exception I can think of is an 8-bit microprocessor, which usually has 64K of memory), which means you are doing double-precision arithmetic. The overhead over a loop doing $w$-bit arithmetic can be just two instructions (one to sign-extend, the other to add with carry) and a slight increase in register pressure (but if I'm right, even the register-starved x86 has enough registers that the only memory access in the inner loop can be the data fetch). I think it is probable that an out-of-order processor will be able to schedule the additional operations during the memory load latency, so the inner loop will execute at memory speed, and the exercise becomes one of maximising use of the available bandwidth (prefetching or interleaving techniques could help, depending on the memory architecture).
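Without a built-in double-width type, the same loop can be written with an explicit (hi, lo) accumulator pair; the sign extension and the carry propagation below correspond to the two extra instructions mentioned above. A sketch, again with a made-up name and using unsigned arithmetic so the wraparound is well defined in C:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    bool checked_sum2(const int64_t *a, size_t n, int64_t *out)
    {
        uint64_t lo = 0, hi = 0;    /* 128-bit accumulator as two words */
        for (size_t i = 0; i < n; i++) {
            uint64_t x   = (uint64_t)a[i];
            uint64_t ext = a[i] < 0 ? UINT64_MAX : 0;  /* high word of a[i] */
            lo += x;
            hi += ext + (lo < x);   /* add with carry out of the low word */
        }
        /* (hi:lo) fits in 64 bits iff hi is the sign extension of lo. */
        if (hi != (lo >> 63 ? UINT64_MAX : 0))
            return false;
        *out = (int64_t)lo;         /* two's complement reinterpretation */
        return true;
    }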

Given that last point, it is difficult to think of other algorithms with better performance. Data-dependent (and thus unpredictable) jumps are out of the question, as are multiple passes over the data. Even trying to use the several cores of today's processors would be difficult, as the memory bandwidth will probably already be saturated, though multiple cores could be an easy way to implement interleaved access.


Question Source : http://cs.stackexchange.com/questions/1426
