We have successfully ported an arbitrary high-order discontinuous Galerkin method for solving the three-dimensional isotropic elastic wave equation on unstructured tetrahedral meshes to multiple Graphics Processing Units (GPUs) using NVIDIA's Compute Unified Device Architecture (CUDA) and the Message Passing Interface (MPI), and obtained a speedup factor of about 28.3 for the single-precision version of our codes and about 14.9 for the double-precision version. The GPU used in the comparisons is an NVIDIA Tesla C2070 (Fermi), and the CPU is an Intel Xeon W5660. To effectively overlap inter-process communication with computation, we separate the elements on each subdomain into inner and outer elements, complete the computation on the outer elements first, and fill the MPI buffers. While the MPI messages travel across the network, the GPU performs the computation on the inner elements and all other calculations that do not use information from the outer elements of neighboring subdomains. A significant portion of the speedup also comes from a customized matrix-matrix multiplication kernel, which is used extensively throughout our program. Preliminary performance analysis of our parallel GPU codes shows favorable strong and weak scalabilities.
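The following is a minimal sketch of the overlap scheme described in the abstract: outer elements are updated first so their halo data can travel over MPI while the GPU updates the inner elements. The kernel name, flat DOF array, index lists, and single-neighbor exchange are illustrative assumptions, not the authors' actual code.

```cuda
// Sketch (assumed layout, not the authors' code): one overlapped time step.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void update_elements(float *dof, const int *elem_ids, int n_elem) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_elem) {
        // Placeholder for the per-element DG update (volume and flux terms).
        dof[elem_ids[i]] += 1.0f;
    }
}

void timestep_overlapped(float *d_dof,
                         const int *d_outer_ids, int n_outer,
                         const int *d_inner_ids, int n_inner,
                         float *h_send, float *h_recv, int halo_size,
                         int neighbor_rank, cudaStream_t stream)
{
    const int block = 256;

    // 1. Update the outer (subdomain-boundary) elements first.
    update_elements<<<(n_outer + block - 1) / block, block, 0, stream>>>(
        d_dof, d_outer_ids, n_outer);

    // 2. Copy the outer-element halo data into the host MPI send buffer.
    //    (A real code would run a pack kernel; here a contiguous block is copied.)
    cudaMemcpyAsync(h_send, d_dof, halo_size * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    // 3. Start the non-blocking halo exchange with a neighboring subdomain.
    MPI_Request reqs[2];
    MPI_Irecv(h_recv, halo_size, MPI_FLOAT, neighbor_rank, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(h_send, halo_size, MPI_FLOAT, neighbor_rank, 0, MPI_COMM_WORLD, &reqs[1]);

    // 4. While the messages travel across the network, update the inner
    //    elements, which need no data from neighboring subdomains.
    update_elements<<<(n_inner + block - 1) / block, block, 0, stream>>>(
        d_dof, d_inner_ids, n_inner);

    // 5. Wait for communication and computation before any flux terms that
    //    depend on the received halo data.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    cudaStreamSynchronize(stream);
}
```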
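The customized matrix-matrix multiplication mentioned above typically applies a small, fixed-size operator matrix (e.g. a stiffness or flux matrix) to the per-element DOF matrix of every element. The sketch below shows one plausible batched formulation; the dimensions, the one-block-per-element layout, and the row-major storage are assumptions for illustration only.

```cuda
// Hedged sketch of a small batched matrix-matrix multiplication kernel.
#include <cuda_runtime.h>

#define NDOF 20   // assumed number of basis functions per tetrahedral element
#define NVAR 9    // velocity and stress components of the elastic system

// Each block computes C_e = A * B_e for one element e. The operator matrix A
// is shared by all elements and staged into shared memory once per block.
__global__ void small_matmul_batched(const float *A,   // NDOF x NDOF operator
                                     const float *B,   // NDOF x NVAR per element
                                     float       *C,   // NDOF x NVAR per element
                                     int n_elem)
{
    __shared__ float As[NDOF * NDOF];

    // Cooperatively load the operator matrix into shared memory.
    for (int idx = threadIdx.y * blockDim.x + threadIdx.x;
         idx < NDOF * NDOF;
         idx += blockDim.x * blockDim.y)
        As[idx] = A[idx];
    __syncthreads();

    int e = blockIdx.x;
    if (e >= n_elem) return;

    const float *Be = B + e * NDOF * NVAR;
    float       *Ce = C + e * NDOF * NVAR;

    // One thread per output entry: row from threadIdx.y, column from threadIdx.x.
    int row = threadIdx.y;   // 0 .. NDOF-1
    int col = threadIdx.x;   // 0 .. NVAR-1
    if (row < NDOF && col < NVAR) {
        float acc = 0.0f;
        for (int k = 0; k < NDOF; ++k)
            acc += As[row * NDOF + k] * Be[k * NVAR + col];
        Ce[row * NVAR + col] = acc;
    }
}

// Example launch: one block per element, NVAR x NDOF threads per block.
// small_matmul_batched<<<n_elem, dim3(NVAR, NDOF)>>>(d_A, d_B, d_C, n_elem);
```

Keeping the operator matrix resident in shared memory amortizes its global-memory reads over all output entries of an element, which is the usual motivation for hand-writing such a kernel instead of calling a general-purpose dense BLAS routine on many tiny matrices.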