いくつかのクラスタリング実装サンプルを紹介

クラスタリングアルゴリズムを探していた時に一通り実装されているサンプルを発見したので試してみます。 https://github.com/haifengl/smile BIRCH(balanced iterative reducing and clustering using hierarchies) BIRCHは階層的クラスタリングの一つ。巨大化つ階層的にクラスタリングする場合効果を発揮するようです。 以下がサンプルソースを実行した結果です。

Training rand index = 76.63%	adjusted rand index = 28.16%
Testing rand index = 76.51%	adjusted rand index = 27.70%

CLARANS(Clustering for Large Applications) CLARANSは巨大な地形データのクラスタリングに向いているようです。

Training rand index = 87.15%	adjusted rand index = 32.92%
Testing rand index = 86.48%	adjusted rand index = 28.84%

DBSCAN DBSCANは密度準拠クラスタリングの一つです。

11:51:15.728 [main] INFO smile.math.matrix.Factory - smile-netlib module is not available in the classpath. Pure Java matrix library will be employed.
DBSCAN clusters of 10000 data points:
  0	  775 ( 7.8%)
  1	 2767 (27.7%)
  2	 2644 (26.4%)
  3	 1273 (12.7%)
Noise	 2541 (25.4%)

The number of clusters: 4
Training rand index = 55.60%	adjusted rand index = 18.53%

DENCLUE(DENsity-based CLUstEring) DENCLUEはカーネル密度推定の極値を元にクラスタリングするそうです。

12:18:45.184 [main] INFO smile.math.matrix.Factory - smile-netlib module is not available in the classpath. Pure Java matrix library will be employed.
The number of clusters: 4
Training rand index = 60.62%	adjusted rand index = 24.14%

Deterministic Annealing(確定的アニーリング方) ファジィクラスタリングのためのアルゴリズム

12:37:09.091 [main] INFO smile.math.matrix.Factory - smile-netlib module is not available in the classpath. Pure Java matrix library will be employed.
12:37:09.107 [main] INFO smile.math.matrix.PowerIteration - Largest eigenvalue after  13 power iterations: 21.62133

12:37:10.484 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after 100 iterations at temperature 43.2527 and k = 1: 5053.67572 (soft distortion = 881317.48253 )

12:37:10.843 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after  27 iterations at temperature 34.6021 and k = 1: 4218.41060 (soft distortion = 849034.97407 )

12:37:10.907 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after   4 iterations at temperature 27.6817 and k = 2: 8016.28311 (soft distortion = 819244.69569 )

12:37:10.937 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after   2 iterations at temperature 22.1454 and k = 2: 7276.76608 (soft distortion = 802115.91055 )

12:37:10.972 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after   2 iterations at temperature 17.7163 and k = 2: 6651.26079 (soft distortion = 790674.79037 )

12:37:11.473 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after  18 iterations at temperature 14.1730 and k = 3: 3752.63561 (soft distortion = 647773.11009 )

12:37:11.691 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after   3 iterations at temperature 17.7163 and k = 6: 9890.74112 (soft distortion = 669434.46936 )

12:37:11.819 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after   2 iterations at temperature 15.9447 and k = 6: 9281.17267 (soft distortion = 659590.12137 )

12:37:11.995 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after   3 iterations at temperature 14.3502 and k = 6: 8564.13827 (soft distortion = 648875.25076 )

12:37:14.611 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after  30 iterations at temperature 15.9447 and k = 8: 9261.27505 (soft distortion = 606799.53657 )

12:37:15.407 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after   7 iterations at temperature 16.7839 and k = 10: 11720.41166 (soft distortion = 635656.61070 )

12:37:15.673 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after   2 iterations at temperature 17.2142 and k = 10: 11863.93615 (soft distortion = 638788.02732 )

Training rand index = 80.36%	adjusted rand index = 29.27%
Testing rand index = 82.43%	adjusted rand index = 35.40%

G-means k-meansのアルゴリズムクラスタリング数を動的に決定するアルゴリズムです。 ガウス分布に基づいて仮説検証します。

12:39:19.983 [main] INFO smile.clustering.GMeans - G-Means distortion with 1 clusters: 881320.09122
12:39:20.973 [main] INFO smile.clustering.GMeans - Cluster   0	Anderson-Darling adjusted test statistic: 89.5362
12:39:20.974 [main] INFO smile.clustering.GMeans - Split cluster 0
12:39:21.000 [main] INFO smile.clustering.GMeans - G-Means distortion with 2 clusters: 772688.31108

12:39:21.175 [main] INFO smile.clustering.GMeans - Cluster   0	Anderson-Darling adjusted test statistic: 85.3636
12:39:21.604 [main] INFO smile.clustering.GMeans - Cluster   1	Anderson-Darling adjusted test statistic: 187.8591
12:39:21.604 [main] INFO smile.clustering.GMeans - Split cluster 1
12:39:21.605 [main] INFO smile.clustering.GMeans - Split cluster 0
12:39:23.063 [main] INFO smile.clustering.GMeans - G-Means distortion with 4 clusters: 669317.79045

12:39:23.114 [main] INFO smile.clustering.GMeans - Cluster   0	Anderson-Darling adjusted test statistic: 20.4162
12:39:23.355 [main] INFO smile.clustering.GMeans - Cluster   1	Anderson-Darling adjusted test statistic: 22.8995
12:39:23.596 [main] INFO smile.clustering.GMeans - Cluster   2	Anderson-Darling adjusted test statistic: 27.0563
12:39:23.644 [main] INFO smile.clustering.GMeans - Cluster   3	Anderson-Darling adjusted test statistic: 6.5705
12:39:23.644 [main] INFO smile.clustering.GMeans - Split cluster 2
12:39:23.644 [main] INFO smile.clustering.GMeans - Split cluster 1
12:39:23.644 [main] INFO smile.clustering.GMeans - Split cluster 0
12:39:23.644 [main] INFO smile.clustering.GMeans - Split cluster 3
12:39:24.849 [main] INFO smile.clustering.GMeans - G-Means distortion with 8 clusters: 593842.16381

12:39:24.944 [main] INFO smile.clustering.GMeans - Cluster   0	Anderson-Darling adjusted test statistic: 9.5921
12:39:25.006 [main] INFO smile.clustering.GMeans - Cluster   1	Anderson-Darling adjusted test statistic: 34.6213
12:39:25.081 [main] INFO smile.clustering.GMeans - Cluster   2	Anderson-Darling adjusted test statistic: 77.9564
12:39:25.112 [main] INFO smile.clustering.GMeans - Cluster   3	Anderson-Darling adjusted test statistic: 15.2918
12:39:25.121 [main] INFO smile.clustering.GMeans - Cluster   4	Anderson-Darling adjusted test statistic: 12.7549
12:39:25.162 [main] INFO smile.clustering.GMeans - Cluster   5	Anderson-Darling adjusted test statistic: 19.4333
12:39:25.201 [main] INFO smile.clustering.GMeans - Cluster   6	Anderson-Darling adjusted test statistic: 4.0901
12:39:25.240 [main] INFO smile.clustering.GMeans - Cluster   7	Anderson-Darling adjusted test statistic: 1.8631
12:39:25.241 [main] INFO smile.clustering.GMeans - Split cluster 2
12:39:25.241 [main] INFO smile.clustering.GMeans - Split cluster 1
12:39:26.119 [main] INFO smile.clustering.GMeans - G-Means distortion with 10 clusters: 557633.81645

Training rand index = 89.68%	adjusted rand index = 45.78%
Testing rand index = 88.69%	adjusted rand index = 40.63%

MEC(Maximum Entropy Clustering)

Training rand index = 91.98%	adjusted rand index = 57.89%
Testing rand index = 87.66%	adjusted rand index = 40.08%

SIB(Sequential Information Bottleneck algorithm) 初期値にクラスタ数の指定が必要。 また計算量としては多め

Sequential Information Bottleneck distortion: 531269.92948
Clusters of 15935 data points of dimension 62061:
  0	  269 ( 1.7%)
  1	 1979 (12.4%)
  2	  138 ( 0.9%)
  3	  160 ( 1.0%)
  4	  188 ( 1.2%)
  5	  675 ( 4.2%)
  6	  150 ( 0.9%)
  7	  895 ( 5.6%)
  8	 3038 (19.1%)
  9	  928 ( 5.8%)
 10	 1439 ( 9.0%)
 11	  207 ( 1.3%)
 12	  791 ( 5.0%)
 13	  290 ( 1.8%)
 14	 2235 (14.0%)
 15	  800 ( 5.0%)
 16	 1135 ( 7.1%)
 17	  270 ( 1.7%)
 18	   41 ( 0.3%)
 19	  307 ( 1.9%)

Training rand index = 89.71%	adjusted rand index = 26.36%
Testing rand index = 88.67%	adjusted rand index = 24.60%

Spectral Clustering 連結性を元にクラスタ分類するアルゴリズムです。 相互距離を元にしてクラスタリングに対して優位性があります。

17:17:54.757 [main] INFO smile.math.matrix.Factory - smile-netlib module is not available in the classpath. Pure Java matrix library will be employed.
17:18:06.693 [main] INFO smile.math.matrix.Lanczos - Lancozs method found 2 converged eigenvalues of the 20-by-20 matrix
17:18:06.696 [main] INFO smile.math.matrix.Lanczos - ritz[0] = 0.9999999999999999
17:18:06.697 [main] INFO smile.math.matrix.Lanczos - ritz[1] = 0.3440960370197921
17:18:08.817 [main] INFO smile.math.matrix.Lanczos - Lancozs method found 20 converged eigenvalues of the 71-by-71 matrix
17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[0] = 0.9999999999999999
17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[1] = 0.3440960370197921
17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[2] = 0.2176537564044241
17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[3] = 0.16587857681070378
17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[4] = 0.11408139263353684
17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[5] = 0.10146172269681576
17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[6] = 0.0943170333357171
17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[7] = 0.08400746550385749
17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[8] = 0.07516964128540907
17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[9] = 0.07260401818350089
17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[10] = 0.060432605019956225
17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[11] = 0.05522169184561691
17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[12] = 0.046155079827079254
17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[13] = 0.04513001653482768
17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[14] = 0.04210242465987843
17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[15] = 0.03973546953827896
17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[16] = 0.03767489100164228
17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[17] = 0.0354510341585054
17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[18] = 0.03295587024266742
17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[19] = 0.03131428751396554
17:18:08.825 [main] INFO smile.math.matrix.Lanczos - Lanczos: 2 iterations for Matrix of size 7291
Training rand index = 91.28%	adjusted rand index = 53.73%
USPS Nystrom approximation
17:18:30.968 [main] INFO smile.math.matrix.Lanczos - Lancozs method found 1 converged eigenvalues of the 20-by-20 matrix
17:18:30.968 [main] INFO smile.math.matrix.Lanczos - ritz[0] = 0.01315025103472493
17:18:30.970 [main] INFO smile.math.matrix.Lanczos - Lancozs method found 26 converged eigenvalues of the 78-by-78 matrix
17:18:30.970 [main] INFO smile.math.matrix.Lanczos - ritz[0] = 0.01315025103472493
17:18:30.970 [main] INFO smile.math.matrix.Lanczos - ritz[1] = 0.004037218845535946
17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[2] = 0.0022807011604413186
17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[3] = 0.0018962506615791516
17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[4] = 0.0016009006166485616
17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[5] = 0.0010741951276320594
17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[6] = 8.685053008593615E-4
17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[7] = 8.037386050440827E-4
17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[8] = 6.710115520161615E-4
17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[9] = 5.624926523943083E-4
17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[10] = 4.984899904038704E-4
17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[11] = 3.9460623179314336E-4
17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[12] = 3.492662089671578E-4
17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[13] = 2.8386756860679835E-4
17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[14] = 2.4362282612684847E-4
17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[15] = 1.447801452018425E-4
17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[16] = 1.243339681976431E-4
17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[17] = 9.267063565285238E-5
17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[18] = 7.308529313499025E-5
17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[19] = 2.548402744660674E-5
17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[20] = 8.555675561265639E-6
17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[73] = -5.929726315622524E-4
17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[74] = -6.309493804730305E-4
17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[75] = -7.564578450710687E-4
17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[76] = -7.681575235064698E-4
17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[77] = -8.035419554610577E-4
17:18:30.972 [main] INFO smile.math.matrix.Lanczos - Lanczos: 2 iterations for Matrix of size 100
Training rand index = 91.01%	adjusted rand index = 51.99%

X-means k-meansに対してベイズ情報量規準を元にクラスタ数を動的に決定します。

17:24:02.300 [main] INFO smile.clustering.XMeans - X-Means distortion with 1 clusters: 881320.09122
17:24:03.321 [main] INFO smile.clustering.XMeans - Cluster   0	BIC: -4486334.92946	BIC after split: -4369766.08592	improvement: 116568.84354
17:24:03.322 [main] INFO smile.clustering.XMeans - Split cluster 0
17:24:03.352 [main] INFO smile.clustering.XMeans - X-Means distortion with 2 clusters: 772688.29348
17:24:03.601 [main] INFO smile.clustering.XMeans - Cluster   0	BIC: -1834253.40162	BIC after split: -1793773.89342	improvement: 40479.50820
17:24:03.946 [main] INFO smile.clustering.XMeans - Cluster   1	BIC: -2521550.54473	BIC after split: -2442361.91371	improvement: 79188.63102
17:24:03.946 [main] INFO smile.clustering.XMeans - Split cluster 1
17:24:03.946 [main] INFO smile.clustering.XMeans - Split cluster 0
17:24:05.428 [main] INFO smile.clustering.XMeans - X-Means distortion with 4 clusters: 669317.79045
17:24:05.555 [main] INFO smile.clustering.XMeans - Cluster   0	BIC: -1184617.84871	BIC after split: -1160937.87376	improvement: 23679.97495
17:24:05.594 [main] INFO smile.clustering.XMeans - Cluster   1	BIC: -492999.62807	BIC after split: -460716.00608	improvement: 32283.62199
17:24:05.644 [main] INFO smile.clustering.XMeans - Cluster   2	BIC: -664476.43388	BIC after split: -645134.78533	improvement: 19341.64855
17:24:05.821 [main] INFO smile.clustering.XMeans - Cluster   3	BIC: -1801481.83274	BIC after split: -1773026.39522	improvement: 28455.43752
17:24:05.821 [main] INFO smile.clustering.XMeans - Split cluster 1
17:24:05.821 [main] INFO smile.clustering.XMeans - Split cluster 3
17:24:05.821 [main] INFO smile.clustering.XMeans - Split cluster 0
17:24:05.821 [main] INFO smile.clustering.XMeans - Split cluster 2
17:24:06.790 [main] INFO smile.clustering.XMeans - X-Means distortion with 8 clusters: 593846.45745
17:24:06.836 [main] INFO smile.clustering.XMeans - Cluster   0	BIC: -242164.41273	BIC after split: -230406.33072	improvement: 11758.08201
17:24:06.844 [main] INFO smile.clustering.XMeans - Cluster   1	BIC: -158053.02056	BIC after split: -152753.97471	improvement: 5299.04585
17:24:06.940 [main] INFO smile.clustering.XMeans - Cluster   2	BIC: -719569.22828	BIC after split: -710216.48830	improvement: 9352.73998
17:24:07.000 [main] INFO smile.clustering.XMeans - Cluster   3	BIC: -867575.01346	BIC after split: -846049.73767	improvement: 21525.27579
17:24:07.068 [main] INFO smile.clustering.XMeans - Cluster   4	BIC: -733854.66538	BIC after split: -716871.96741	improvement: 16982.69798
17:24:07.097 [main] INFO smile.clustering.XMeans - Cluster   5	BIC: -540561.96749	BIC after split: -524390.29500	improvement: 16171.67249
17:24:07.126 [main] INFO smile.clustering.XMeans - Cluster   6	BIC: -371209.94351	BIC after split: -362998.07825	improvement: 8211.86526
17:24:07.157 [main] INFO smile.clustering.XMeans - Cluster   7	BIC: -362518.26790	BIC after split: -354405.89716	improvement: 8112.37074
17:24:07.159 [main] INFO smile.clustering.XMeans - Split cluster 3
17:24:07.159 [main] INFO smile.clustering.XMeans - Split cluster 4
17:24:08.059 [main] INFO smile.clustering.XMeans - X-Means distortion with 10 clusters: 557633.55991
Training rand index = 89.68%	adjusted rand index = 45.76%
Testing rand index = 88.71%	adjusted rand index = 40.75%

うまく説明できていない部分もありますが、Javaでかつライブラリとして提供されている上ソースも綺麗にまとまっているため学習材料としてはとてもよいプロジェクトだと思います。