クラスタリングのアルゴリズムを探していた時に一通り実装されているサンプルを発見したので試してみます。 https://github.com/haifengl/smile BIRCH(balanced iterative reducing and clustering using hierarchies) BIRCHは階層的クラスタリングの一つ。巨大化つ階層的にクラスタリングする場合効果を発揮するようです。 以下がサンプルソースを実行した結果です。
Training rand index = 76.63% adjusted rand index = 28.16% Testing rand index = 76.51% adjusted rand index = 27.70%
CLARANS(Clustering for Large Applications) CLARANSは巨大な地形データのクラスタリングに向いているようです。
Training rand index = 87.15% adjusted rand index = 32.92% Testing rand index = 86.48% adjusted rand index = 28.84%
DBSCAN DBSCANは密度準拠クラスタリングの一つです。
11:51:15.728 [main] INFO smile.math.matrix.Factory - smile-netlib module is not available in the classpath. Pure Java matrix library will be employed. DBSCAN clusters of 10000 data points: 0 775 ( 7.8%) 1 2767 (27.7%) 2 2644 (26.4%) 3 1273 (12.7%) Noise 2541 (25.4%) The number of clusters: 4 Training rand index = 55.60% adjusted rand index = 18.53%
DENCLUE(DENsity-based CLUstEring) DENCLUEはカーネル密度推定の極値を元にクラスタリングするそうです。
12:18:45.184 [main] INFO smile.math.matrix.Factory - smile-netlib module is not available in the classpath. Pure Java matrix library will be employed. The number of clusters: 4 Training rand index = 60.62% adjusted rand index = 24.14%
Deterministic Annealing(確定的アニーリング方) ファジィクラスタリングのためのアルゴリズム
12:37:09.091 [main] INFO smile.math.matrix.Factory - smile-netlib module is not available in the classpath. Pure Java matrix library will be employed. 12:37:09.107 [main] INFO smile.math.matrix.PowerIteration - Largest eigenvalue after 13 power iterations: 21.62133 12:37:10.484 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after 100 iterations at temperature 43.2527 and k = 1: 5053.67572 (soft distortion = 881317.48253 ) 12:37:10.843 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after 27 iterations at temperature 34.6021 and k = 1: 4218.41060 (soft distortion = 849034.97407 ) 12:37:10.907 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after 4 iterations at temperature 27.6817 and k = 2: 8016.28311 (soft distortion = 819244.69569 ) 12:37:10.937 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after 2 iterations at temperature 22.1454 and k = 2: 7276.76608 (soft distortion = 802115.91055 ) 12:37:10.972 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after 2 iterations at temperature 17.7163 and k = 2: 6651.26079 (soft distortion = 790674.79037 ) 12:37:11.473 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after 18 iterations at temperature 14.1730 and k = 3: 3752.63561 (soft distortion = 647773.11009 ) 12:37:11.691 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after 3 iterations at temperature 17.7163 and k = 6: 9890.74112 (soft distortion = 669434.46936 ) 12:37:11.819 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after 2 iterations at temperature 15.9447 and k = 6: 9281.17267 (soft distortion = 659590.12137 ) 12:37:11.995 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after 3 iterations at temperature 14.3502 and k = 6: 8564.13827 (soft distortion = 648875.25076 ) 12:37:14.611 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after 30 iterations at temperature 15.9447 and k = 8: 9261.27505 (soft distortion = 606799.53657 ) 12:37:15.407 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after 7 iterations at temperature 16.7839 and k = 10: 11720.41166 (soft distortion = 635656.61070 ) 12:37:15.673 [main] INFO smile.clustering.DeterministicAnnealing - Deterministic Annealing clustering entropy after 2 iterations at temperature 17.2142 and k = 10: 11863.93615 (soft distortion = 638788.02732 ) Training rand index = 80.36% adjusted rand index = 29.27% Testing rand index = 82.43% adjusted rand index = 35.40%
G-means k-meansのアルゴリズムのクラスタリング数を動的に決定するアルゴリズムです。 ガウス分布に基づいて仮説検証します。
12:39:19.983 [main] INFO smile.clustering.GMeans - G-Means distortion with 1 clusters: 881320.09122 12:39:20.973 [main] INFO smile.clustering.GMeans - Cluster 0 Anderson-Darling adjusted test statistic: 89.5362 12:39:20.974 [main] INFO smile.clustering.GMeans - Split cluster 0 12:39:21.000 [main] INFO smile.clustering.GMeans - G-Means distortion with 2 clusters: 772688.31108 12:39:21.175 [main] INFO smile.clustering.GMeans - Cluster 0 Anderson-Darling adjusted test statistic: 85.3636 12:39:21.604 [main] INFO smile.clustering.GMeans - Cluster 1 Anderson-Darling adjusted test statistic: 187.8591 12:39:21.604 [main] INFO smile.clustering.GMeans - Split cluster 1 12:39:21.605 [main] INFO smile.clustering.GMeans - Split cluster 0 12:39:23.063 [main] INFO smile.clustering.GMeans - G-Means distortion with 4 clusters: 669317.79045 12:39:23.114 [main] INFO smile.clustering.GMeans - Cluster 0 Anderson-Darling adjusted test statistic: 20.4162 12:39:23.355 [main] INFO smile.clustering.GMeans - Cluster 1 Anderson-Darling adjusted test statistic: 22.8995 12:39:23.596 [main] INFO smile.clustering.GMeans - Cluster 2 Anderson-Darling adjusted test statistic: 27.0563 12:39:23.644 [main] INFO smile.clustering.GMeans - Cluster 3 Anderson-Darling adjusted test statistic: 6.5705 12:39:23.644 [main] INFO smile.clustering.GMeans - Split cluster 2 12:39:23.644 [main] INFO smile.clustering.GMeans - Split cluster 1 12:39:23.644 [main] INFO smile.clustering.GMeans - Split cluster 0 12:39:23.644 [main] INFO smile.clustering.GMeans - Split cluster 3 12:39:24.849 [main] INFO smile.clustering.GMeans - G-Means distortion with 8 clusters: 593842.16381 12:39:24.944 [main] INFO smile.clustering.GMeans - Cluster 0 Anderson-Darling adjusted test statistic: 9.5921 12:39:25.006 [main] INFO smile.clustering.GMeans - Cluster 1 Anderson-Darling adjusted test statistic: 34.6213 12:39:25.081 [main] INFO smile.clustering.GMeans - Cluster 2 Anderson-Darling adjusted test statistic: 77.9564 12:39:25.112 [main] INFO smile.clustering.GMeans - Cluster 3 Anderson-Darling adjusted test statistic: 15.2918 12:39:25.121 [main] INFO smile.clustering.GMeans - Cluster 4 Anderson-Darling adjusted test statistic: 12.7549 12:39:25.162 [main] INFO smile.clustering.GMeans - Cluster 5 Anderson-Darling adjusted test statistic: 19.4333 12:39:25.201 [main] INFO smile.clustering.GMeans - Cluster 6 Anderson-Darling adjusted test statistic: 4.0901 12:39:25.240 [main] INFO smile.clustering.GMeans - Cluster 7 Anderson-Darling adjusted test statistic: 1.8631 12:39:25.241 [main] INFO smile.clustering.GMeans - Split cluster 2 12:39:25.241 [main] INFO smile.clustering.GMeans - Split cluster 1 12:39:26.119 [main] INFO smile.clustering.GMeans - G-Means distortion with 10 clusters: 557633.81645 Training rand index = 89.68% adjusted rand index = 45.78% Testing rand index = 88.69% adjusted rand index = 40.63%
MEC(Maximum Entropy Clustering)
Training rand index = 91.98% adjusted rand index = 57.89% Testing rand index = 87.66% adjusted rand index = 40.08%
SIB(Sequential Information Bottleneck algorithm) 初期値にクラスタ数の指定が必要。 また計算量としては多め
Sequential Information Bottleneck distortion: 531269.92948 Clusters of 15935 data points of dimension 62061: 0 269 ( 1.7%) 1 1979 (12.4%) 2 138 ( 0.9%) 3 160 ( 1.0%) 4 188 ( 1.2%) 5 675 ( 4.2%) 6 150 ( 0.9%) 7 895 ( 5.6%) 8 3038 (19.1%) 9 928 ( 5.8%) 10 1439 ( 9.0%) 11 207 ( 1.3%) 12 791 ( 5.0%) 13 290 ( 1.8%) 14 2235 (14.0%) 15 800 ( 5.0%) 16 1135 ( 7.1%) 17 270 ( 1.7%) 18 41 ( 0.3%) 19 307 ( 1.9%) Training rand index = 89.71% adjusted rand index = 26.36% Testing rand index = 88.67% adjusted rand index = 24.60%
Spectral Clustering 連結性を元にクラスタ分類するアルゴリズムです。 相互距離を元にしてクラスタリングに対して優位性があります。
17:17:54.757 [main] INFO smile.math.matrix.Factory - smile-netlib module is not available in the classpath. Pure Java matrix library will be employed. 17:18:06.693 [main] INFO smile.math.matrix.Lanczos - Lancozs method found 2 converged eigenvalues of the 20-by-20 matrix 17:18:06.696 [main] INFO smile.math.matrix.Lanczos - ritz[0] = 0.9999999999999999 17:18:06.697 [main] INFO smile.math.matrix.Lanczos - ritz[1] = 0.3440960370197921 17:18:08.817 [main] INFO smile.math.matrix.Lanczos - Lancozs method found 20 converged eigenvalues of the 71-by-71 matrix 17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[0] = 0.9999999999999999 17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[1] = 0.3440960370197921 17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[2] = 0.2176537564044241 17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[3] = 0.16587857681070378 17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[4] = 0.11408139263353684 17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[5] = 0.10146172269681576 17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[6] = 0.0943170333357171 17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[7] = 0.08400746550385749 17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[8] = 0.07516964128540907 17:18:08.817 [main] INFO smile.math.matrix.Lanczos - ritz[9] = 0.07260401818350089 17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[10] = 0.060432605019956225 17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[11] = 0.05522169184561691 17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[12] = 0.046155079827079254 17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[13] = 0.04513001653482768 17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[14] = 0.04210242465987843 17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[15] = 0.03973546953827896 17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[16] = 0.03767489100164228 17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[17] = 0.0354510341585054 17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[18] = 0.03295587024266742 17:18:08.818 [main] INFO smile.math.matrix.Lanczos - ritz[19] = 0.03131428751396554 17:18:08.825 [main] INFO smile.math.matrix.Lanczos - Lanczos: 2 iterations for Matrix of size 7291 Training rand index = 91.28% adjusted rand index = 53.73% USPS Nystrom approximation 17:18:30.968 [main] INFO smile.math.matrix.Lanczos - Lancozs method found 1 converged eigenvalues of the 20-by-20 matrix 17:18:30.968 [main] INFO smile.math.matrix.Lanczos - ritz[0] = 0.01315025103472493 17:18:30.970 [main] INFO smile.math.matrix.Lanczos - Lancozs method found 26 converged eigenvalues of the 78-by-78 matrix 17:18:30.970 [main] INFO smile.math.matrix.Lanczos - ritz[0] = 0.01315025103472493 17:18:30.970 [main] INFO smile.math.matrix.Lanczos - ritz[1] = 0.004037218845535946 17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[2] = 0.0022807011604413186 17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[3] = 0.0018962506615791516 17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[4] = 0.0016009006166485616 17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[5] = 0.0010741951276320594 17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[6] = 8.685053008593615E-4 17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[7] = 8.037386050440827E-4 17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[8] = 6.710115520161615E-4 17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[9] = 5.624926523943083E-4 17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[10] = 4.984899904038704E-4 17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[11] = 3.9460623179314336E-4 17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[12] = 3.492662089671578E-4 17:18:30.971 [main] INFO smile.math.matrix.Lanczos - ritz[13] = 2.8386756860679835E-4 17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[14] = 2.4362282612684847E-4 17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[15] = 1.447801452018425E-4 17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[16] = 1.243339681976431E-4 17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[17] = 9.267063565285238E-5 17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[18] = 7.308529313499025E-5 17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[19] = 2.548402744660674E-5 17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[20] = 8.555675561265639E-6 17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[73] = -5.929726315622524E-4 17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[74] = -6.309493804730305E-4 17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[75] = -7.564578450710687E-4 17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[76] = -7.681575235064698E-4 17:18:30.972 [main] INFO smile.math.matrix.Lanczos - ritz[77] = -8.035419554610577E-4 17:18:30.972 [main] INFO smile.math.matrix.Lanczos - Lanczos: 2 iterations for Matrix of size 100 Training rand index = 91.01% adjusted rand index = 51.99%
X-means k-meansに対してベイズ情報量規準を元にクラスタ数を動的に決定します。
17:24:02.300 [main] INFO smile.clustering.XMeans - X-Means distortion with 1 clusters: 881320.09122 17:24:03.321 [main] INFO smile.clustering.XMeans - Cluster 0 BIC: -4486334.92946 BIC after split: -4369766.08592 improvement: 116568.84354 17:24:03.322 [main] INFO smile.clustering.XMeans - Split cluster 0 17:24:03.352 [main] INFO smile.clustering.XMeans - X-Means distortion with 2 clusters: 772688.29348 17:24:03.601 [main] INFO smile.clustering.XMeans - Cluster 0 BIC: -1834253.40162 BIC after split: -1793773.89342 improvement: 40479.50820 17:24:03.946 [main] INFO smile.clustering.XMeans - Cluster 1 BIC: -2521550.54473 BIC after split: -2442361.91371 improvement: 79188.63102 17:24:03.946 [main] INFO smile.clustering.XMeans - Split cluster 1 17:24:03.946 [main] INFO smile.clustering.XMeans - Split cluster 0 17:24:05.428 [main] INFO smile.clustering.XMeans - X-Means distortion with 4 clusters: 669317.79045 17:24:05.555 [main] INFO smile.clustering.XMeans - Cluster 0 BIC: -1184617.84871 BIC after split: -1160937.87376 improvement: 23679.97495 17:24:05.594 [main] INFO smile.clustering.XMeans - Cluster 1 BIC: -492999.62807 BIC after split: -460716.00608 improvement: 32283.62199 17:24:05.644 [main] INFO smile.clustering.XMeans - Cluster 2 BIC: -664476.43388 BIC after split: -645134.78533 improvement: 19341.64855 17:24:05.821 [main] INFO smile.clustering.XMeans - Cluster 3 BIC: -1801481.83274 BIC after split: -1773026.39522 improvement: 28455.43752 17:24:05.821 [main] INFO smile.clustering.XMeans - Split cluster 1 17:24:05.821 [main] INFO smile.clustering.XMeans - Split cluster 3 17:24:05.821 [main] INFO smile.clustering.XMeans - Split cluster 0 17:24:05.821 [main] INFO smile.clustering.XMeans - Split cluster 2 17:24:06.790 [main] INFO smile.clustering.XMeans - X-Means distortion with 8 clusters: 593846.45745 17:24:06.836 [main] INFO smile.clustering.XMeans - Cluster 0 BIC: -242164.41273 BIC after split: -230406.33072 improvement: 11758.08201 17:24:06.844 [main] INFO smile.clustering.XMeans - Cluster 1 BIC: -158053.02056 BIC after split: -152753.97471 improvement: 5299.04585 17:24:06.940 [main] INFO smile.clustering.XMeans - Cluster 2 BIC: -719569.22828 BIC after split: -710216.48830 improvement: 9352.73998 17:24:07.000 [main] INFO smile.clustering.XMeans - Cluster 3 BIC: -867575.01346 BIC after split: -846049.73767 improvement: 21525.27579 17:24:07.068 [main] INFO smile.clustering.XMeans - Cluster 4 BIC: -733854.66538 BIC after split: -716871.96741 improvement: 16982.69798 17:24:07.097 [main] INFO smile.clustering.XMeans - Cluster 5 BIC: -540561.96749 BIC after split: -524390.29500 improvement: 16171.67249 17:24:07.126 [main] INFO smile.clustering.XMeans - Cluster 6 BIC: -371209.94351 BIC after split: -362998.07825 improvement: 8211.86526 17:24:07.157 [main] INFO smile.clustering.XMeans - Cluster 7 BIC: -362518.26790 BIC after split: -354405.89716 improvement: 8112.37074 17:24:07.159 [main] INFO smile.clustering.XMeans - Split cluster 3 17:24:07.159 [main] INFO smile.clustering.XMeans - Split cluster 4 17:24:08.059 [main] INFO smile.clustering.XMeans - X-Means distortion with 10 clusters: 557633.55991 Training rand index = 89.68% adjusted rand index = 45.76% Testing rand index = 88.71% adjusted rand index = 40.75%
うまく説明できていない部分もありますが、Javaでかつライブラリとして提供されている上ソースも綺麗にまとまっているため学習材料としてはとてもよいプロジェクトだと思います。