This time we would like to share some details about realization of one of our projects in medical imaging.

Currently one of the bottlenecks in MR image reconstruction is speed improvement. Improving the speed of image reconstruction is difficult from algorithmic point of view. But it’s becoming more popular to improve algorithm performance using GPU.

### Introduction

In magnetic resonance (MR) image reconstruction raw data measured from the scanner correspond to the Fourier coefficients of the target image, so the fast Fourier transform (FFT) is used for reconstruction. An important procedure called the density compensation is necessary in the very beginning to account for the non-uniform sampling. GeneRalized Autocalibrating Partially Parallel Acquisitions (GRAPPA) is a partially parallel acquisition (PPA) method which is used to reduce scan times. In this case only partial data is acquired and missing data is mathematically calculated from available data.

### Subject for optimization

Original implementation was based on FFTW library for Fast Fourier transforms and adapted Singular Value Decomposition (SVD) algorithm from ACM Collected Algorithms for GRAPPA preprocessing. These two algorithms are the most computationally intensive parts of the whole image reconstruction. FFTW library is claimed to be the fastest CPU implementation using all possible optimizations like SSE2/3 and hyper threading, however it does not leverages the power of modern GPU cards. SVD algorithm was done on CPU as well. It is known to be badly parallelizable for small matrices, but in case of GRAPPA algorithm we have many image frames with same size which can be processed in parallel. Besides there are many intermediate steps which consume a lot of CPU and they can be easily parallelized on GPU.

### Technical analysis

FFTW library performance is comparable with Intel MKL implementation. NVidia provides comparison for their CUDA based cuFFT library with MKL implementation (Figure 1):

**Figure 1** *Comparison of CUDA based cuFFT library with MKL implementation*

According to this we should achieve up to 8-10x faster FFT processing when using GPU accelerated cuFFT library from Nvidia. GPU accelerated SVD algorithm is also available, for example CULA library by EM Photonics. However, current CULA library implementation does not support batch mode, so we will need to process all image frames as a sequence. Brief testing showed that 64 image frames (256*256) are processed even slower than CPU based version. Since we haven’t found any good alternative to CULA library we decided to implement our own GPU accelerated version of SVD algorithm.

### Implementation

FFT part of image reconstruction when using cuFFT library was straightforward, however we had to deal with image frames which does not fit into available GPU memory. We had to write algorithm to run FFT over portions of the large data frame with subsequent aggregation. Figure 2.1 below shows case when all data fits into GPU memory.

**Figure 2.1**

Figure 2.2 illustrates the case when huge data is processed. Solid lines in figure below show measured performance, dashed lines show estimated time in case data fits into GPU memory.

**Figure 2.2**

Much more interesting was to implement GPU accelerated SVD algorithm with batch mode. All implementations we had found are focusing on maximum parallelization of a single SVD run, hence we had to change approach. Basically SVD algorithm consists of HouseHolder Reduction, QR Diagonalization and Back Transformation steps. All are iterative processes when next step depends on results from previous step. In case of small matrices each CUDA kernel can’t effectively utilize all parallel processing units of modern GPU. So we had to write kernels in a way when every iteration for all matrices is processed by a single kernel run. This way in case of 64 matrices with 128×128 size each we can process 64*128 elements at a time instead of 128. Figures 3.1 and 3.2 show performance comparison for CULA Library and our implementation.

**Figure 3.2**

With more than 8 frames per batch our implementation shows much better performance comparing to sequential CULA calls, although it is not so efficient for a single frame.

### Results

As a result we have developed a pure C++ library with a set of specialized functions which perform various stages of image reconstruction. It requires only CUDA runtime libraries and free cuFFT library provided by NVidia. In addition we have implemented lightweight C# wrapper for convenient usage. Also we have run a lot of benchmarks with various GPU cards and on different platforms. On test cases provided by customer we received up to 150x speedup comparing to original single-threaded CPU implementation. However significant part of received speedup was due to poorly optimized original code which was completely rewritten and ported to CUDA whenever possible.

While it is usually understood what FFT stage does in image reconstruction, GRAPPA stage is not so obvious. Due to parallel acquisition of different frames arises distortion of acquired data which is effectively eliminated. Figure 4 shows visual representation of images before and after reconstruction.

**Figure 4** *The image before the reconstruction (left), image after reconstruction (right)*

What processor did you use with ASUS P9X79 WS?

“…and change settings there in order to enable Xeon Phi support”

what settings?

Core i7. Don’t remember exactly, but I think it was Sandy Bridge once.

There is a setting in BIOS with name like “Xeon Phi Support – Enabled/Disabled”.

This comment has been removed by the author.

This comment has been removed by the author.

Thanks for the useful posting, Victor! After having tried our 5110P in several desktop’ish motherboards, unsuccessfully, we’re now thinking of purchasing a new machine to run it in. It appears that motherboards supporting Xeon Phi generally use Intel’s C600 or X79 chipset (I guess this is almost(?) the same chipset, marketed to different segments under these names), but there may also be issues with BIOS and with other peripherals. We have a few questions:

How are you cooling your 5110P? What temperatures are you getting (per micinfo)? Here’s our improvised cooling setup for ours, which we didn’t get a chance to test yet (because of the motherboard compatibility issues): http://openwall.info/wiki/internal/xeon_phi#An-attempt-at-cooling-it

What case did you put the ASUS P9X79 WS motherboard in, given that it’s SSI CEB form factor rather than ATX (although it’s very similar)?

There’s the ASUS ESC700 G2 platform from ASUS, which uses this motherboard, but we’re unhappy with its bundled 500W PSU (not letting us install more accelerator cards in the future – whether MICs or GPUs – without upgrading the PSU) and its mediocre cooling. On the other hand, it appears to be the cheapest suitable platform out there (although we do have to invent our own cooling for the MIC card, unless we had a 3120A).

Some other suitable platforms (according to our “research” so far), for your blog readers:

SuperMicro:

http://www.supermicro.com/products/nfo/Xeon_Phi.cfm

http://www.supermicro.com/products/nfo/GPU_MIC.cfm

ASUS:

http://www.asus.com/Commercial_Servers_Workstations/ESC4000_G2/

Intel:

Xeon E5-2400 series: P4000SC, R2000BB

Xeon E5-2600 series: R1000JP, R2000GZ, P4000CO, P4000CR, P4000IP