Sunday, January 27, 2008

SSE Tutorial : Part II

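A note before I start: not all the features mentioned below appeared in the same revision of the SSE instruction set. (IOW, the instruction set keeps evolving through SSE, SSE2, SSE3 and SSE4, so your mileage may vary depending on when you bought your processor. The integer and double precision views of the registers, for instance, only arrived with SSE2.)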

The key point of SSE is that AMD and Intel have put 8 new 128-bit registers (XMM0 to XMM7) on their processors; 64-bit mode raises that to 16. You can view each register as:
  • 16 8-bit integers
  • 8 16-bit integers
  • 4 32-bit integers
  • 2 64-bit integers
  • 4 32-bit single precision floating point numbers (float)
  • 2 64-bit double precision floating point numbers (double)
And you can do math on all the 16/8/4/2 elements "in parallel". The obvious benefit is that if you're doing the same operation on every element of an array, doing it 16 or 4 at a time is faster than doing it one at a time.

There is another benefit too: on modern processors and buses, reading memory that is aligned on a 128-bit (16-byte) boundary is much faster. So SSE provides instructions for 128-bit aligned reads and writes, such as MOVAPS and MOVDQA. (Of course, you can choose not to use them, because there are unaligned equivalents such as MOVUPS and MOVDQU. The aligned forms fault if the address is not 16-byte aligned.) There are also instructions that provide closer control over caching behaviour. For instance, MOVNTDQ writes a 128-bit value (a double quadword) to memory whilst hinting to the processor not to keep that value in the cache, and the PREFETCH family (PREFETCHT0, PREFETCHNTA and friends) pulls data into the cache ahead of use.
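To make the "4 at a time" idea concrete, here is a minimal sketch using the compiler intrinsics from xmmintrin.h. The function name add_floats is made up for illustration, and I've assumed nothing about the alignment of the arrays, so it uses the unaligned load/store intrinsics. If you know your data is 16-byte aligned, _mm_load_ps and _mm_store_ps are the aligned equivalents.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats.
   Unaligned loads/stores are used so that any pointers are accepted. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Main loop: one ADDPS handles four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }

    /* Tail: fewer than four elements left, so finish one at a time. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Calling add_floats(d, a, b, 6) does one vector step for the first four elements and two scalar steps for the rest.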

One big feature (especially for C/C++ programmers) that came with SSE was faster double/float to integer conversion. When coding with the x87 floating point stack, the code to convert a floating point number to an integer in C works roughly like this: (a) save the FPU rounding mode, (b) set the rounding mode to truncate [note that truncation is required by the C standard], (c) convert the double to an integer, (d) restore the old FPU rounding mode. The whole circus is rather expensive. SSE gives 2 sets of fast instructions to convert doubles/floats (and packed groups of 4 floats or 2 doubles) to integers. One set (CVTTSS2SI, CVTTSD2SI and their packed cousins) always truncates. The other (CVTSS2SI, CVTSD2SI and their packed cousins) rounds according to the mode held in the MXCSR register, which is round-to-nearest by default. Either way, no messing with the x87 FPU state is required. In fact, the so-called SSE acceleration that the VC++ 8 compiler pretends to do is mainly limited to using the faster convert instructions!
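Here is a small sketch of the two flavours of conversion, written with the SSE2 intrinsics from emmintrin.h. The three function names are mine, purely for illustration, and the rounding variant assumes nobody has changed MXCSR from its default round-to-nearest mode.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

/* Hypothetical helper: what a C cast (int)x does, via CVTTSD2SI. */
int truncate_double(double x)
{
    return _mm_cvttsd_si32(_mm_set_sd(x));
}

/* Hypothetical helper: CVTSD2SI, which rounds using the current MXCSR
   mode (round-to-nearest-even unless the program changed it). */
int round_double(double x)
{
    return _mm_cvtsd_si32(_mm_set_sd(x));
}

/* Hypothetical helper: truncate 4 floats to 4 ints in one CVTTPS2DQ. */
void truncate_4_floats(const float *src, int *dst)
{
    __m128i v = _mm_cvttps_epi32(_mm_loadu_ps(src));
    _mm_storeu_si128((__m128i *)dst, v);
}
```

With the default rounding mode, truncate_double(2.7) gives 2 while round_double(2.7) gives 3, and neither touches the x87 control word.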

Besides the fast math there are some other benefits. SSE finally got rid of the floating point register stack: the XMM registers are addressed directly by name, so the new floating point code is easier to write and a little faster at times. There are also some pretty cool instructions aimed at specific problems. An example is the CRC32 instruction that arrived with SSE4.2, which accelerates CRC-32C (Castagnoli polynomial) checksums!
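For the curious, this is roughly what using the CRC instruction looks like from C. It is a sketch under a few assumptions: the function name crc32c and the USE_SSE42 macro are mine, the target attribute is a GCC/Clang feature (with other compilers, or older GCC, you would switch on SSE4.2 with a compiler flag instead), and the code will only run on a processor that actually has SSE4.2. The all-ones initial value and final inversion are the usual CRC-32C convention, not something the instruction does for you.

```c
#include <stddef.h>
#include <stdint.h>
#include <nmmintrin.h>  /* SSE4.2 intrinsics: _mm_crc32_u8 */

/* Assumption: GCC/Clang-style per-function target selection. */
#if defined(__GNUC__) || defined(__clang__)
#define USE_SSE42 __attribute__((target("sse4.2")))
#else
#define USE_SSE42
#endif

/* Hypothetical helper: CRC-32C of a buffer, one byte per CRC32 instruction. */
USE_SSE42
uint32_t crc32c(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint32_t crc = 0xFFFFFFFFu;          /* conventional initial value */

    while (len--)
        crc = _mm_crc32_u8(crc, *p++);   /* fold in the next byte */

    return crc ^ 0xFFFFFFFFu;            /* conventional final inversion */
}
```

The standard check string "123456789" should come out as 0xE3069283. Note that this is the Castagnoli polynomial, so the results will not match the CRC-32 used by zip or PNG.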

Anyway, that's it for now. Next time, I promise more code and less gyan. Let me know if you have questions/comments.
