As high-performance computing (HPC) technology has matured, a growing number of industries have come to rely on it as a key research and development tool. Operations and maintenance (O&M) that depends on system administrators working by hand faces mounting challenges, and its shortcomings have become increasingly apparent: low efficiency, a tendency toward errors, and poor transfer of knowledge. Based on the characteristics of O&M work on HPC clusters, this paper identifies three main areas of O&M automation: automated system monitoring, automated data analysis, and automated problem handling. It examines a three-step path to implementing O&M automation: turning operations into processes, standardizing those processes, and automating the standards. Using the design of an O&M automation system and the visual configuration of inspection scripts as examples, it then presents the key points of the technical implementation.