This paper studies techniques for accelerating the parallel method of moments (MoM) on CPU/GPU heterogeneous clusters. It designs an MPI/CUDA programming architecture that solves the problem of running parallel LU decomposition across the nodes of such a cluster. The architecture uses a two-dimensional block-cyclic distribution of the matrix as its data allocation strategy, uses MPI for communication between compute nodes, and uses the GPU to accelerate the matrix update step. To overcome the limit that GPU memory places on the size of the matrix that can be LU-decomposed, the paper further studies an out-of-core algorithm that stages data between GPU memory and host memory. To optimize performance, it proposes a design based on CUDA streams and asynchronous communication that overlaps GPU communication with computation, effectively hiding the GPU communication time and yielding a clear speedup.
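The two-dimensional cyclic distribution of matrix blocks mentioned above can be illustrated with a small sketch. It is not taken from the paper: the function names `block_owner`, `local_blocks`, and `rank_of`, the process grid shape, and the row-major rank numbering are all assumptions made for illustration. The idea shown is only the standard mapping in which block (i, j) is assigned to process (i mod P_r, j mod P_c).

```python
def block_owner(i, j, p_rows, p_cols):
    """Return the (process row, process column) that owns global block (i, j).

    Standard 2D block-cyclic rule: block indices wrap around the process grid.
    """
    return (i % p_rows, j % p_cols)


def rank_of(pr, pc, p_cols):
    """Map grid coordinates to a linear rank (row-major order is an assumption)."""
    return pr * p_cols + pc


def local_blocks(pr, pc, n_blocks, p_rows, p_cols):
    """List the global block indices stored by process (pr, pc)
    for an n_blocks x n_blocks block matrix."""
    return [(i, j)
            for i in range(pr, n_blocks, p_rows)
            for j in range(pc, n_blocks, p_cols)]


def trailing_load(k, n_blocks, p_rows, p_cols):
    """Count, per process, the blocks left in the trailing submatrix
    (block rows and columns >= k) at step k of a blocked LU decomposition."""
    counts = {}
    for i in range(k, n_blocks):
        for j in range(k, n_blocks):
            owner = block_owner(i, j, p_rows, p_cols)
            counts[owner] = counts.get(owner, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical 2 x 3 process grid and an 8 x 8 block matrix.
    print(block_owner(5, 7, 2, 3))
    print(local_blocks(0, 1, 4, 2, 3))
    print(trailing_load(4, 8, 2, 3))
```

Because the block indices wrap around the grid, every process still holds part of the trailing submatrix late in the factorization, which is the usual reason this layout is chosen for distributed LU: the update work stays spread over all nodes instead of collapsing onto a few.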