Unified Memory, introduced in CUDA 6.0, is one of the most significant updates in the history of CUDA. Before CUDA 6.0, the programming model for GPU computing relied on programmers to explicitly manage data transfers between the CPU and GPU and to maintain memory coherence. Unified Memory instead provides a new CUDA programming model that defines a managed memory space in which the CPU and GPU see a single coherent memory image with a common address space. The underlying system manages data access and locality without the need for explicit memory copy calls. This paper studies how Unified Memory affects application performance and analyzes its underlying implementation. We studied the Diffusion 3D Benchmark and the Parboil Benchmark Suite, supplemented by the Matrix Multiplication sample from the CUDA SDK, and ported these benchmarks to Unified Memory versions. The evaluation compares the performance of the Unified Memory versions against the original versions on an NVIDIA Kepler K40 and a Jetson TK1. The K40 is the latest and fastest GPU based on the Kepler architecture, and the TK1 is the first mobile processor built on the same Kepler architecture, with 2 GB of main memory shared between its CPU and GPU. This paper shows that Unified Memory causes at most a 10% performance loss on both the K40 and the TK1. Furthermore, we use the NVIDIA Visual Profiler to dig into the underlying mechanism of Unified Memory. Finally, we explain the reasons for the performance loss.
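To make the contrast between the two programming models concrete, here is a minimal sketch. It is not taken from the paper's benchmarks. The same trivial kernel runs once with separate host/device buffers and explicit `cudaMemcpy` calls, and once with a single `cudaMallocManaged` pointer that both the CPU and GPU dereference. The kernel `scale` and the helpers `check`, `sum_explicit`, and `sum_managed` are hypothetical names chosen for illustration. The sketch assumes a CUDA 6.0 or later toolkit and a device that supports managed memory.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical kernel: multiply every element by `factor` in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Pre-CUDA 6.0 style: separate host and device buffers plus explicit copies.
float sum_explicit(int n, float factor) {
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    float *d = nullptr;
    check(cudaMalloc(&d, bytes), "cudaMalloc");
    check(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice), "cudaMemcpy H2D");
    scale<<<(n + 255) / 256, 256>>>(d, n, factor);
    check(cudaGetLastError(), "kernel launch");
    // cudaMemcpy waits for the preceding kernel in the default stream.
    check(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy D2H");
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += h[i];
    cudaFree(d);
    free(h);
    return s;
}

// Unified Memory style: one managed pointer valid on both CPU and GPU.
float sum_managed(int n, float factor) {
    assert(n > 0);
    float *m = nullptr;
    check(cudaMallocManaged(&m, n * sizeof(float)), "cudaMallocManaged");
    for (int i = 0; i < n; ++i) m[i] = 1.0f;  // CPU writes directly
    scale<<<(n + 255) / 256, 256>>>(m, n, factor);
    check(cudaGetLastError(), "kernel launch");
    // On Kepler-class GPUs the CPU must not touch managed memory while
    // a kernel may still be running, so synchronize before reading.
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += m[i];  // CPU reads directly
    cudaFree(m);
    return s;
}
```

For example, `sum_explicit(1024, 2.0f)` and `sum_managed(1024, 2.0f)` should both return 2048. The managed version drops both copy calls and the second buffer. The data movement does not disappear, though. It is performed by the runtime and driver instead, which is the kind of hidden cost a profiler such as the NVIDIA Visual Profiler can expose.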