The core of any clustering method is how it measures the proximity between objects. This paper introduces a vector representation of e-mail features, constructs an e-mail feature matrix, and fits the communication feature information between e-mails with a transformed extreme value distribution function model. On this basis, it proposes a new proximity measure, extreme value distribution similarity (EVS), to guide e-mail community partitioning. The effectiveness of the measure is verified with a micro-clustering-macro-clustering e-mail community partitioning algorithm. Experiments show that, on the test dataset, a partitioning algorithm that uses EVS as its partitioning criterion discovers high-quality e-mail communities more effectively than one based on classical proximity measures such as cosine similarity and the Pearson correlation coefficient (PCC).
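For reference, the two classical baselines named in the abstract, cosine similarity and PCC, can be sketched as follows. This is a minimal illustration only. The feature vectors are hypothetical (each row is assumed to hold the number of messages one mailbox exchanged with each of four contacts), since the abstract does not specify the actual e-mail features. EVS itself is not shown because the abstract does not give its formula.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def pearson_similarity(u, v):
    """Pearson correlation coefficient: cosine of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    return cosine_similarity(centered_u, centered_v)


# Hypothetical e-mail feature matrix (illustrative values, not from the paper):
# one row per mailbox, one column per contact, entries are message counts.
features = {
    "alice": [10, 0, 3, 1],
    "bob": [20, 0, 6, 2],    # same communication pattern as alice, twice the volume
    "carol": [0, 8, 1, 5],   # mostly talks to different contacts
}

for a, b in [("alice", "bob"), ("alice", "carol")]:
    cos = cosine_similarity(features[a], features[b])
    pcc = pearson_similarity(features[a], features[b])
    print(f"{a}-{b}: cosine={cos:.3f}, PCC={pcc:.3f}")
```

Both measures treat alice and bob as near-identical because their vectors are proportional, while alice and carol score low under cosine and negative under PCC. Either score could serve as the proximity input to a community partitioning step, which is the role the abstract assigns to EVS.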