
Deep Learning: Evaluation Metrics for Discrete Disentanglement (Based on ClusterGAN)

Following the ClusterGAN paper, this post walks through the metrics used to evaluate discrete disentanglement, explaining each one from its mathematical definition and the corresponding sklearn source code. The three metrics are normalized mutual information (NMI), the adjusted Rand index (ARI), and clustering purity (ACC).

ACC(Clustering Purity)

Purity measures the fraction of samples that are assigned to the correct cluster.

It ranges from 0 to 1, with 0 for a completely wrong clustering and 1 for a completely correct one.

Formula

\operatorname{purity}(\Omega, C)=\frac{1}{N} \sum_{k} \max _{j}\left|\omega_{k} \cap c_{j}\right|

where \Omega=\left\{\omega_{1}, \omega_{2}, \ldots, \omega_{k}\right\} is the set of clusters, \omega_{k} is the set of samples in the k-th cluster, and C=\left\{c_{1}, c_{2}, \ldots, c_{j}\right\} is the ground-truth class assignment.
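
Below is a minimal sketch of this computation in Python, using sklearn's contingency_matrix helper; the function name purity_score is my own, not a sklearn API.

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def purity_score(labels_true, labels_pred):
    # Rows of the contingency matrix are true classes c_j, columns are
    # predicted clusters omega_k; entry (j, k) is |omega_k ∩ c_j|.
    cm = contingency_matrix(labels_true, labels_pred)
    # For each cluster take its best-matching class, then divide by N.
    return cm.max(axis=0).sum() / cm.sum()

# A clustering that matches the classes up to a relabeling has purity 1.0
print(purity_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0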

NMI(Normalized Mutual Info)

Normalized mutual information (NMI) is a normalization of the mutual information (MI) score that scales the result to lie between 0 (no mutual information) and 1 (perfect correlation).

Permuting the class or cluster label values does not change the score in any way.

The metric is symmetric: swapping labels_true and labels_pred returns the same score.

Formula

Assume two label assignments U and V over the same N objects. The entropy of an assignment measures its uncertainty:

H(U) = - \sum_{i=1}^{|U|}P(i)\log(P(i))

where P(i) = |U_i| / N is the probability that an object picked at random from U falls into class U_i, and likewise P'(j) = |V_j| / N, so that

H(V) = - \sum_{j=1}^{|V|}P'(j)\log(P'(j))

The mutual information between U and V is:

\text{MI}(U, V) = \sum_{i=1}^{|U|}\sum_{j=1}^{|V|}P(i, j)\log\left(\frac{P(i,j)}{P(i)P'(j)}\right)

where P(i, j) = |U_i \cap V_j| / N is the probability that a randomly chosen object belongs to both U_i and V_j. This is equivalent to:

\text{MI}(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i \cap V_j|}{N}\log\left(\frac{N|U_i \cap V_j|}{|U_i||V_j|}\right)

The normalized mutual information is then:

\text{NMI}(U, V) = \frac{\text{MI}(U, V)}{\text{mean}(H(U), H(V))}
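
As a cross-check of the definitions above, here is a from-scratch numpy sketch that computes H(U), H(V), MI, and NMI directly from the formulas, using the arithmetic mean as the normalizer (the default of newer sklearn versions); the function name nmi_from_scratch is my own.

import numpy as np

def nmi_from_scratch(labels_true, labels_pred):
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    u_vals, v_vals = np.unique(labels_true), np.unique(labels_pred)
    # Marginals P(i) = |U_i| / N and P'(j) = |V_j| / N
    p_u = np.array([np.mean(labels_true == u) for u in u_vals])
    p_v = np.array([np.mean(labels_pred == v) for v in v_vals])
    h_u = -np.sum(p_u * np.log(p_u))
    h_v = -np.sum(p_v * np.log(p_v))
    # Joint P(i, j) = |U_i ∩ V_j| / N, summed only over nonzero cells
    mi = 0.0
    for pu, u in zip(p_u, u_vals):
        for pv, v in zip(p_v, v_vals):
            p_uv = np.mean((labels_true == u) & (labels_pred == v))
            if p_uv > 0:
                mi += p_uv * np.log(p_uv / (pu * pv))
    # Arithmetic-mean normalizer; guard against 0/0 when both entropies vanish
    return mi / max((h_u + h_v) / 2.0, np.finfo(float).eps)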

Advantages

  • Values close to zero indicate two label assignments that are largely independent, while values close to 1 indicate significant agreement.

Disadvantages

  • Requires knowledge of the ground-truth classes;
  • Is not adjusted for chance (i.e. the score of a random clustering is not guaranteed to be close to 0).

sklearn code

def normalized_mutual_info_score(labels_true, labels_pred,
                                 average_method='warn'):
    """Normalized Mutual Information between two clusterings.

    Parameters
    ----------
    labels_true : int array, shape = [n_samples]
        A clustering of the data into disjoint subsets.

    labels_pred : array, shape = [n_samples]
        A clustering of the data into disjoint subsets.

    average_method : string, optional (default: 'warn')
        How to compute the normalizer in the denominator. Possible options
        are 'min', 'geometric', 'arithmetic', and 'max'.
        If 'warn', 'geometric' will be used. The default will change to
        'arithmetic' in version 0.22.

    Returns
    -------
    nmi : float
        score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling
    """
    if average_method == 'warn':
        warnings.warn("The behavior of NMI will change in version 0.22. "
                      "To match the behavior of 'v_measure_score', NMI will "
                      "use average_method='arithmetic' by default.",
                      FutureWarning)
        average_method = 'geometric'
    labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    # Special limit cases: no clustering since the data is not split.
    # This is a perfect match hence return 1.0.
    if (classes.shape[0] == clusters.shape[0] == 1 or
            classes.shape[0] == clusters.shape[0] == 0):
        return 1.0
    contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
    contingency = contingency.astype(np.float64)
    # Calculate the MI for the two clusterings
    mi = mutual_info_score(labels_true, labels_pred,
                           contingency=contingency)
    # Calculate the expected value for the mutual information
    # Calculate entropy for each labeling
    h_true, h_pred = entropy(labels_true), entropy(labels_pred)
    normalizer = _generalized_average(h_true, h_pred, average_method)
    # Avoid 0.0 / 0.0 when either entropy is zero.
    normalizer = max(normalizer, np.finfo('float64').eps)
    nmi = mi / normalizer
    return nmi
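
For quick reference, a short usage example (the exact value for imperfect matches depends on the average_method in use):

from sklearn.metrics import normalized_mutual_info_score

# Permuting label values leaves the score unchanged: still a perfect match
print(normalized_mutual_info_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# Symmetric: swapping the two label arrays gives the same score
a, b = [0, 0, 1, 1], [0, 0, 1, 2]
print(normalized_mutual_info_score(a, b) ==
      normalized_mutual_info_score(b, a))  # True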

ARI(Adjusted Rand Index)

The Rand index measures the similarity between two clusterings by considering all pairs of samples and counting the pairs that are grouped consistently (placed together, or placed apart) in both the predicted clustering and the ground-truth labels.

Permuting the class or cluster label values does not change the score in any way.

The metric is symmetric: swapping labels_true and labels_pred returns the same score.

Formula

\mathrm{RI}=\frac{a+b}{C_{2}^{n_{\text {samples}}}}

For the Rand index, C is the ground-truth class assignment and K is the predicted clustering; a is the number of pairs of elements placed in the same group in both C and K, b is the number of pairs placed in different groups in both C and K, and C_2^{n_{samples}} is the total number of possible pairs in the dataset.

However, the RI does not guarantee a score close to 0 for random labelings, so the adjusted Rand index is used instead. With E[RI] denoting the expected RI of random labelings:

\mathrm{ARI}=\frac{\mathrm{RI}-E[\mathrm{RI}]}{\max (\mathrm{RI})-E[\mathrm{RI}]}
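
Equivalently, writing n_{ij} = |c_i \cap k_j| for the contingency table, a_i for its row sums and b_j for its column sums, the ARI has the following closed form (this is exactly the quantity the sklearn code below computes):

\mathrm{ARI}=\frac{\sum_{ij}\binom{n_{ij}}{2}-\left[\sum_{i}\binom{a_{i}}{2} \sum_{j}\binom{b_{j}}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_{i}}{2}+\sum_{j}\binom{b_{j}}{2}\right]-\left[\sum_{i}\binom{a_{i}}{2} \sum_{j}\binom{b_{j}}{2}\right] / \binom{n}{2}}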

Advantages

  • Random labelings receive ARI scores close to 0;
  • The range is [-1, 1], with larger values indicating greater similarity;
  • Makes no assumptions about cluster structure;

Disadvantages

  • Requires knowledge of the ground-truth classes.

sklearn code

def adjusted_rand_score(labels_true, labels_pred):
    """Rand index adjusted for chance.

    Parameters
    ----------
    labels_true : int array, shape = [n_samples]
        Ground truth class labels to be used as a reference

    labels_pred : array, shape = [n_samples]
        Cluster labels to evaluate

    Returns
    -------
    ari : float
        Similarity score between -1.0 and 1.0. Random labelings have an ARI
        close to 0.0. 1.0 stands for perfect match.
    """
    labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
    n_samples = labels_true.shape[0]
    n_classes = np.unique(labels_true).shape[0]
    n_clusters = np.unique(labels_pred).shape[0]

    # Special limit cases: no clustering since the data is not split;
    # or trivial clustering where each document is assigned a unique cluster.
    # These are perfect matches hence return 1.0.
    if (n_classes == n_clusters == 1 or
            n_classes == n_clusters == 0 or
            n_classes == n_clusters == n_samples):
        return 1.0

    # Compute the ARI using the contingency data.
    # Contingency matrix: rows are true classes, columns are predicted
    # clusters; entry (i, j) counts the elements of true class i that were
    # assigned to cluster j.
    contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
    sum_comb_c = sum(_comb2(n_c) for n_c in np.ravel(contingency.sum(axis=1)))
    sum_comb_k = sum(_comb2(n_k) for n_k in np.ravel(contingency.sum(axis=0)))
    sum_comb = sum(_comb2(n_ij) for n_ij in contingency.data)

    prod_comb = (sum_comb_c * sum_comb_k) / _comb2(n_samples)
    mean_comb = (sum_comb_k + sum_comb_c) / 2.
    return (sum_comb - prod_comb) / (mean_comb - prod_comb)
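
A short usage example illustrating the points above:

import numpy as np
from sklearn.metrics import adjusted_rand_score

# Permutation-invariant and symmetric, like NMI
print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# Random labels score near 0 (the score can also be slightly negative)
rng = np.random.RandomState(0)
print(adjusted_rand_score(rng.randint(3, size=1000),
                          rng.randint(3, size=1000)))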

Reference: 2.3. Clustering (scikit-learn user guide)