As computer technology advances and system scale grows, managing and maintaining high-availability cluster systems becomes increasingly complex. To provide a stable computing environment and to detect and locate latent faults in the system promptly, this paper proposes a proactive fault management method. It first reviews the concepts and techniques of autonomic computing. Then, based on an analysis of the management requirements of cluster computing environments, it proposes a rule-based autonomic fault management software architecture. Given the characteristics of cluster systems, a hierarchical management scheme is adopted: a local fault management module (LFM) and a global fault management module (GFM) are designed, and the internal functional structure of each is described in detail.
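To make the hierarchical, rule-based idea concrete, the following is a minimal Python sketch and not the paper's implementation. The abstract does not give the internal structure of the LFM or GFM, so everything here is an assumption made for illustration: that an LFM runs per node and evaluates rules against local metrics, that the GFM aggregates LFM reports, and that a rule firing on several nodes counts as a cluster-level fault. The class names, rule names, metric keys, and thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Metrics = Dict[str, float]


@dataclass
class Rule:
    """A hypothetical rule: a symptom test plus the name of a recovery action."""
    name: str
    condition: Callable[[Metrics], bool]  # True when the fault symptom is present
    action: str


@dataclass
class Event:
    node: str
    rule: str
    action: str


class LocalFaultManager:
    """Assumed per-node manager (LFM): evaluates rules against local metrics."""

    def __init__(self, node: str, rules: List[Rule]):
        self.node = node
        self.rules = rules

    def check(self, metrics: Metrics) -> List[Event]:
        # Emit one event for every rule whose condition holds on this node.
        return [Event(self.node, r.name, r.action)
                for r in self.rules if r.condition(metrics)]


class GlobalFaultManager:
    """Assumed cluster-wide manager (GFM): aggregates events reported by LFMs."""

    def __init__(self, escalate_at: int = 2):
        self.escalate_at = escalate_at  # distinct nodes needed to escalate
        self.events: List[Event] = []

    def report(self, events: List[Event]) -> None:
        self.events.extend(events)

    def cluster_faults(self) -> List[str]:
        # A rule that fired on enough distinct nodes is treated as cluster-level.
        nodes_by_rule: Dict[str, Set[str]] = {}
        for e in self.events:
            nodes_by_rule.setdefault(e.rule, set()).add(e.node)
        return sorted(rule for rule, nodes in nodes_by_rule.items()
                      if len(nodes) >= self.escalate_at)


# Illustrative rules and metric samples (made-up values, not measured data).
rules = [
    Rule("disk_nearly_full", lambda m: m.get("disk_used_pct", 0.0) > 90.0, "clean_tmp"),
    Rule("high_load", lambda m: m.get("load_avg", 0.0) > 8.0, "migrate_jobs"),
]
samples = {
    "node1": {"disk_used_pct": 95.0, "load_avg": 1.0},
    "node2": {"disk_used_pct": 93.0, "load_avg": 9.5},
    "node3": {"disk_used_pct": 40.0, "load_avg": 0.5},
}

gfm = GlobalFaultManager(escalate_at=2)
for node, metrics in samples.items():
    gfm.report(LocalFaultManager(node, rules).check(metrics))

print(gfm.cluster_faults())  # ['disk_nearly_full']
```

In this sketch, node-specific symptoms stay with the local manager, and only patterns that span nodes are raised at the global level. This is one plausible reading of the hierarchical scheme; a real system would also need rule storage, action execution, and communication between the modules, none of which is modeled here.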