Channel: Compare 2 byte arrays - Stack Overflow

Answer by PTwr for Compare 2 byte arrays


"Is there a way to do data1-data2 = data3 without enumerating through a loop?" No. It is technically impossible.

At best, or rather worst, you can call a function that will do the enumeration for you. But it will be slow. In the case of LINQ, ungodly slow.

On the machine I am currently working on, the results for the other answers are as follows for a 4 KB table (1024 integers).

  • 23560 ticks - Giannis Paraskevopoulos. Array-to-Enumerable-to-Array conversions are not too fast; copying an array via a ToList().ToArray() chain is roughly 25 times slower than Array.Copy().
  • 10198 ticks - Selman22. Twice as fast, but still slow. Lambdas are eye candy to make creating events prettier, not faster. You end up with some anonymous method lying around, whose call-return overhead can eat more CPU time than its actual operation (remember that the math we do here takes the CPU just a few cycles).
  • 566 ticks - Tim Schmelter's GetDifference() function. (The main culprit here is the JIT; in native code and/or with more frequent use the difference would be negligible.)
  • 27 ticks - Just a loop. 400 times faster than Zip, over 800 times faster than converting the array to a list and back.
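For reference, a minimal sketch of how numbers like these can be gathered with System.Diagnostics.Stopwatch (the random test data here is my assumption; absolute tick counts will differ per machine, and a serious measurement would warm up the JIT and average many runs):

```csharp
using System;
using System.Diagnostics;
using System.Linq;

class AbsDiffBenchmark
{
    static void Main()
    {
        var rnd = new Random(42);
        int[] data1 = Enumerable.Range(0, 1024).Select(_ => rnd.Next(256)).ToArray();
        int[] data2 = Enumerable.Range(0, 1024).Select(_ => rnd.Next(256)).ToArray();
        int[] data3 = new int[1024];

        // LINQ version: allocates an enumerator, delegate calls per element, new array.
        var sw = Stopwatch.StartNew();
        int[] viaLinq = data1.Zip(data2, (a, b) => Math.Abs(a - b)).ToArray();
        sw.Stop();
        Console.WriteLine($"Zip:  {sw.ElapsedTicks} ticks");

        // Plain loop: no allocations, no delegate calls.
        sw.Restart();
        for (int i = 0; i < data3.Length; i++)
            data3[i] = Math.Abs(data1[i] - data2[i]);
        sw.Stop();
        Console.WriteLine($"Loop: {sw.ElapsedTicks} ticks");
    }
}
```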

Loop code:

for (int i = 0; i < data3.Length; i++)
{
    data3[i] = Math.Abs(data1[i] - data2[i]);
}

Such basic memory operations can be translated directly to machine code, without the horrible performance and humongous memory footprint of LINQ.

Moral of the story: LINQ is for readability (which in this case is arguable), not for performance (which in this case is noticeable).


Optimization time! Let's abuse our CPU a tiny bit.

  1. Unroll the loop. Or do not; your experience may vary. Even in assembler itself, the performance gain or loss from loop unrolling varies greatly within the same family of processors. New CPUs and compilers are aware of the old tricks and simply implement them on their own. For the i3-3220 I tested the code on, unrolling to 4 lines resulted in faster execution in 32-bit code, but on 64-bit it was a bit slower, while unrolling to 8 was the opposite.
  2. Compile for x64. As we are working on 32-bit data here, we won't make use of 64-bit registers... or will we? On x86 less than half of the registers are truly available to generated code (in assembly written by hand you can always squeeze out more); on x64, however, you get eight bonus registers which are free to use. The more you can do without accessing memory, the faster your code. In this case the speed gain is about 20%.
  3. Close Visual Studio. Do not speed-test 64-bit code in a 32-bit IDE (there is no 64-bit version as of now, and probably won't be for a long time). It will make x64 code roughly two times slower due to the architecture mismatch. (Well... you should never speed-test code under a debugger anyway...)
  4. Do not use built-in functions too much. In this case Math.Abs has overhead hidden inside. For some reason (finding out why would require analyzing the IL), checking for negative values was faster with ?: than with If-Else. Such a check saved a lot of time.

UPDATE: ?: is faster than If-Else due to differences in the resulting machine code... at least for just comparing two values. Its machine code is much less weird than If-Else's (which does not look like what you would write by hand). Apparently it is not just a different way of writing an If-Else statement, but a fully separate instruction optimized for simple conditional assignment.

The resulting code was roughly 8 times faster than a simple loop with Math.Abs(). Remember you can only unroll the loop to divisors of your dataset size. You wrote that your dataset size is 25920, so 8 is fine (the max is 64, but I doubt it would make any sense to go that high). I suggest hiding this code in some function, as it is fugly.

int[] data3 = new int[data1.Length];
for (int i = 0; i < data1.Length; i += 8)
{
    int b;
    b = data1[i + 0] - data2[i + 0];
    data3[i + 0] = b < 0 ? -b : b;
    b = data1[i + 1] - data2[i + 1];
    data3[i + 1] = b < 0 ? -b : b;
    b = data1[i + 2] - data2[i + 2];
    data3[i + 2] = b < 0 ? -b : b;
    b = data1[i + 3] - data2[i + 3];
    data3[i + 3] = b < 0 ? -b : b;
    b = data1[i + 4] - data2[i + 4];
    data3[i + 4] = b < 0 ? -b : b;
    b = data1[i + 5] - data2[i + 5];
    data3[i + 5] = b < 0 ? -b : b;
    b = data1[i + 6] - data2[i + 6];
    data3[i + 6] = b < 0 ? -b : b;
    b = data1[i + 7] - data2[i + 7];
    data3[i + 7] = b < 0 ? -b : b;
}

This is not even its final form. I will try some more heretical tricks on it.

BitHacks, low-level cheats!

As I mentioned, there was still room for improvement.

After cutting out LINQ, the main tick-munchkin was Abs(). Once it was removed from the code, we were left with a contest between If-Else and the shorthand ?: operator. Both are branching operators, which once upon a time were widely known to be slower than linear code. Currently, ease of use/writing tends to be picked over performance (sometimes correctly, sometimes incorrectly).

So let's make our branching condition linear. It is possible by abusing the fact that the branching in this code contains math operating on just a single variable. So let's build a branch-free equivalent.

Now, do you remember how to negate a two's complement number? Negate all bits and add one. Let's do it in one line without conditions, then!
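A quick sanity check of that rule (a minimal sketch; `~` is C#'s bitwise NOT):

```csharp
using System;

class TwosComplementDemo
{
    static void Main()
    {
        int x = 42;
        // Two's complement negation: flip all bits, then add one.
        int negated = ~x + 1;
        Console.WriteLine(negated);        // -42
        Console.WriteLine(negated == -x);  // True
    }
}
```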

It is time for the bitwise operators to shine. OR and AND are boring; real men use XOR. What's so cool about XOR? Aside from its usual behavior, you can also turn it into NOT (negation) and NOP (no operation).

1 XOR 1 = 0
0 XOR 1 = 1

so XOR'ing by a value filled with only 1's gives you a NOT operation.

1 XOR 0 = 1
0 XOR 0 = 0

so XOR'ing by a value filled with only 0's does nothing at all.

We can obtain the sign of our number. For a 32-bit integer it is as simple as x>>31, which moves the sign bit to the lowest bit. As even the wiki will tell you, bits inserted from the left will be zeros, so the result of x>>31 will be 1 for a negative number (x<0) and 0 for a non-negative one (x>=0), right?

Nope. For signed values an arithmetic shift is used instead of a plain bit shift, so we get -1 or 0 depending on the sign... which means that x>>31 gives 111...111 for a negative number and 000...000 for a non-negative one. If you XOR the original x by the result of such a shift, you perform a NOT or a NOP depending on the value's sign. Another useful thing is that 0 results in a NOP for addition/subtraction, so we can subtract -1 or 0 depending on the value's sign.
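You can verify the arithmetic-shift behavior directly (a small sketch; in C#, `>>` on a signed int is indeed an arithmetic shift):

```csharp
using System;

class ShiftSignDemo
{
    static void Main()
    {
        int neg = -123;
        int pos = 123;
        // The arithmetic shift copies the sign bit in from the left:
        Console.WriteLine(neg >> 31); // -1 (all bits set, 111...111)
        Console.WriteLine(pos >> 31); // 0  (all bits clear, 000...000)
    }
}
```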

So 'x^(x>>31)' will flip the bits of a negative number while making no change to a non-negative one, and 'x-(x>>31)' will add 1 to a negative x (a negated negative value gives a positive one) and make no change to a non-negative value.

When combined you get '(x ^ (x >> 31)) - (x >> 31)'... which can be translated to:

IF X<0  X=!X+1

and it is just

IF X<0  X=-X
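Putting the pieces together into a helper (a minimal sketch of the trick described above):

```csharp
using System;

class XorAbsDemo
{
    // Branch-free absolute value: mask is 111...111 for negative x and 000...000
    // otherwise, so (x ^ mask) - mask performs NOT-plus-one (negation) or nothing.
    static int XorAbs(int x)
    {
        int mask = x >> 31;
        return (x ^ mask) - mask;
    }

    static void Main()
    {
        Console.WriteLine(XorAbs(-5)); // 5
        Console.WriteLine(XorAbs(7));  // 7
        Console.WriteLine(XorAbs(0));  // 0
    }
}
```

(Like Math.Abs, it cannot produce a positive result for int.MinValue, which has no positive counterpart in 32 bits.)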

How does it affect performance? Our XorAbs() requires just four basic integer operations, with one load and one store. A branching operator by itself takes about as many CPU ticks. And while modern CPUs are great at branch prediction, they are still faster by simply not branching at all when fed sequential code.

And what's the score?

  1. Roughly four times faster than the built-in Abs().
  2. About twice as fast as the previous code (versions without unrolling).
  3. Depending on the CPU it can get better results without loop unrolling. Due to the elimination of branching, the CPU can "unroll" the loop on its own. (Haswells are weird with unrolling.)

Resulting code:

for (int i = 0; i < data1.Length; i++)
{
    int x = data1[i] - data2[i];
    data3[i] = (x ^ (x >> 31)) - (x >> 31);
}

Parallelism and Cache usage

CPUs have super fast cache memory; when processing an array sequentially, whole chunks of it get copied into the cache. But if you write crappy code you will get cache misses. You can easily fall into this trap by screwing up the order of nested loops.
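A classic way to fall into that trap is iterating a 2D array column-first instead of row-first (a sketch; .NET stores rectangular arrays row-major, so the inner loop should walk the last index):

```csharp
using System;

class LoopOrderDemo
{
    const int N = 2048;

    static long SumRowMajor(int[,] m)
    {
        long sum = 0;
        // Inner loop walks consecutive addresses: cache-friendly.
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                sum += m[row, col];
        return sum;
    }

    static long SumColumnMajor(int[,] m)
    {
        long sum = 0;
        // Inner loop jumps N elements (8 KB) each step: cache misses galore.
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                sum += m[row, col];
        return sum;
    }

    static void Main()
    {
        var m = new int[N, N];
        m[1, 2] = 5;
        // Same result, very different memory access pattern (and speed).
        Console.WriteLine(SumRowMajor(m) == SumColumnMajor(m)); // True
    }
}
```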

Parallelism (multiple threads, same data) must work on sequential chunks in order to make good use of the CPU cache.

Writing threads by hand would allow you to pick the chunks for each thread manually, but that is a bothersome way to go. Since 4.0, .NET comes with helpers for that; however, the default Parallel.For makes a mess of the cache, so this code is actually slower than its single-threaded version due to cache misses.

Parallel.For(0, data1.Length, fn =>
{
    int x = data1[fn] - data2[fn];
    data3[fn] = (x ^ (x >> 31)) - (x >> 31);
});

It is possible to make manual use of cached data by performing sequential operations within each chunk. For example you can unroll the loop, but it is a dirty hack, and unrolling has its own performance issues (it depends on the CPU model).

Parallel.For(0, data1.Length >> 3, i =>
{
    int j = i << 3; // each iteration handles its own sequential chunk of 8 elements
    int b;
    b = data1[j + 0] - data2[j + 0];
    data3[j + 0] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 1] - data2[j + 1];
    data3[j + 1] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 2] - data2[j + 2];
    data3[j + 2] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 3] - data2[j + 3];
    data3[j + 3] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 4] - data2[j + 4];
    data3[j + 4] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 5] - data2[j + 5];
    data3[j + 5] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 6] - data2[j + 6];
    data3[j + 6] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 7] - data2[j + 7];
    data3[j + 7] = b < 0 ? (b ^ -1) + 1 : b;
});

However, .NET also has Parallel.ForEach and load-balancing partitioners. By using both of them you get the best of all worlds:

  • dataset-size-independent code
  • short, neat code
  • multithreading
  • good cache usage

So final code would be:

// requires: using System.Collections.Concurrent; using System.Threading.Tasks;
var rangePartitioner = Partitioner.Create(0, data1.Length);
Parallel.ForEach(rangePartitioner, (range, loopState) =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        int x = data1[i] - data2[i];
        data3[i] = (x ^ (x >> 31)) - (x >> 31);
    }
});

It is far from maximum CPU usage (which is more complicated than just maxing its clock; there are multiple cache levels, several pipelines and much more), but it is readable, fast and platform independent (except for integer size, but C#'s int is an alias for System.Int32, so we are safe here).

I think we will stop with the optimization here. This came out as an article rather than an answer; I hope no one will purge me for it.

