dask_ml.cluster.KMeans

`dask_ml.cluster`.KMeans

class dask_ml.cluster.KMeans(n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, n_jobs=1, algorithm='full', init_max_iter=None, n_init='auto')

Scalable KMeans for clustering

Parameters

n_clusters : int, default 8

Number of clusters to end up with.

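init : {'k-means||', 'k-means++' or ndarray}

Method for center initialization, defaults to 'k-means||'.

'k-means||' : selects initial cluster centers with k-means||, a parallel variant of k-means++ (Scalable K-Means++, 2012).

'k-means++' : selects initial cluster centers in a smart way to speed up convergence. Uses scikit-learn's implementation.

Warning

If using 'k-means++', the entire dataset will be read into memory at once.

An array of shape (n_clusters, n_features) can be used to give an explicit starting point.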

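oversampling_factor : int, default 2

Oversampling factor for use in the k-means|| algorithm.

max_iter : int

Maximum number of EM iterations to attempt.

init_max_iter : int

Number of iterations for the initialization step.

tol : float

Relative tolerance with regard to inertia to declare convergence.

algorithm : 'full'

The algorithm to use for the EM step. Only "full" (Lloyd's algorithm) is allowed.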

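random_state : int, RandomState instance or None, optional, default: None

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.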

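n_init : 'auto' or int, default=10

Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of n_init consecutive runs in terms of inertia. When n_init='auto', the number of runs is 10 if using init='random', and 1 if using init='k-means++'.

New in version 1.2: Added the 'auto' option for n_init.

Changed in version 1.4: The default value of n_init will change from 10 to 'auto' in version 1.4.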

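Attributes

cluster_centers_ : np.ndarray [n_clusters, n_features]: A NumPy array with the cluster centers.
labels_ : da.array [n_samples,]: A dask array with the index position in cluster_centers_ that each sample belongs to.
inertia_ : float: Sum of distances of samples to their closest cluster center.
n_iter_ : int: Number of EM steps needed to reach convergence.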

See also

sklearn.cluster.MiniBatchKMeans
sklearn.cluster.KMeans

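Notes

This class implements a parallel and distributed version of k-Means.

Initialization with k-means||

The default initializer for KMeans is k-means||, rather than scikit-learn's k-means++. This is the algorithm described in Scalable K-Means++ (2012).

k-means|| is designed to work well in a distributed environment. It is a variant of k-means++ that is designed to work in parallel (k-means++ is inherently sequential). Currently, the k-means|| implementation here is slower than scikit-learn's k-means++ if your entire dataset fits in memory on a single machine. If that is the case, consider using init='k-means++'.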
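To make the idea concrete, here is a minimal NumPy sketch of the oversampling rounds described in the Scalable K-Means++ paper. It is not dask-ml's implementation: the helper names `min_sq_dist` and `kmeans_parallel_candidates` are hypothetical, and treating the per-round sample size as `oversampling_factor * n_clusters` is an assumption made for illustration. The final step of the paper (reclustering the weighted candidates down to `n_clusters` centers) is left out.

```python
import numpy as np


def min_sq_dist(X, centers):
    """Squared Euclidean distance from each row of X to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1)


def kmeans_parallel_candidates(X, n_clusters, oversampling_factor=2, seed=0):
    """Oversampling phase of k-means|| (hypothetical helper, single machine).

    Returns candidate centers and, for each candidate, the number of points
    that are closest to it.
    """
    rng = np.random.default_rng(seed)
    # Assumption: expected number of points sampled per round.
    ell = oversampling_factor * n_clusters

    # Start from one point chosen uniformly at random.
    centers = X[rng.integers(len(X))][None, :]
    cost = min_sq_dist(X, centers).sum()
    # The paper uses on the order of log(initial cost) rounds.
    n_rounds = max(1, int(round(np.log(cost)))) if cost > 0 else 0

    for _ in range(n_rounds):
        d2 = min_sq_dist(X, centers)
        cost = d2.sum()
        if cost == 0:
            break
        # Each point is sampled independently, so this step needs no
        # coordination between chunks of X and can run in parallel.
        prob = np.minimum(1.0, ell * d2 / cost)
        picked = rng.random(len(X)) < prob
        centers = np.vstack([centers, X[picked]])

    # Weight each candidate by the number of points nearest to it.
    d2_all = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    weights = np.bincount(d2_all.argmin(axis=1), minlength=len(centers))
    return centers, weights


X = np.random.default_rng(1).normal(size=(500, 2))
candidates, weights = kmeans_parallel_candidates(X, n_clusters=3)
```

Unlike k-means++, which adds one center per pass, each round here adds many candidates at once, which is why only a small number of passes over the data is needed.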

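Parallel Lloyd's Algorithm

Lloyd's algorithm (the default Expectation Maximization algorithm used in scikit-learn) is naturally parallelizable. In naive benchmarks, the implementation here is 2-3x faster than scikit-learn.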
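The reason one Lloyd iteration parallelizes well is that each block of rows can be assigned to centers independently, and only small per-cluster sums and counts need to be combined. The following NumPy sketch illustrates this on a single machine; `partial_stats` and `lloyd_step` are hypothetical names, and this is not how dask-ml structures its code.

```python
import numpy as np


def partial_stats(chunk, centers):
    """Per-cluster coordinate sums and counts for one chunk of rows."""
    # Squared Euclidean distance from every row to every center.
    d2 = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sums = np.zeros_like(centers, dtype=float)
    np.add.at(sums, labels, chunk)  # accumulate rows into their cluster
    counts = np.bincount(labels, minlength=centers.shape[0])
    return sums, counts


def lloyd_step(chunks, centers):
    """One Lloyd iteration over a list of row chunks."""
    # The per-chunk calls are independent and could run on separate workers.
    stats = [partial_stats(chunk, centers) for chunk in chunks]
    sums = sum(s for s, _ in stats)
    counts = sum(n for _, n in stats)
    new_centers = centers.astype(float).copy()
    nonempty = counts > 0  # keep the old center for empty clusters
    new_centers[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centers


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
centers = X[:3].copy()

chunked = lloyd_step(np.array_split(X, 4), centers)  # four chunks
single = lloyd_step([X], centers)                    # one chunk
```

Processing the data in four chunks gives the same updated centers as processing it in one piece, up to floating-point rounding.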

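Both the initialization step and the EM steps make multiple passes over the data. If possible, persist your dask collections in (distributed) memory before running .fit.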

References

Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii. Scalable K-Means++, 2012. https://arxiv.org/abs/1203.6402

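Methods

`fit_transform`(X[, y])	Fit to data, then transform it.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`predict`(X)	Predict the closest cluster each sample in X belongs to.
`set_output`(*[, transform])	Set output container.
`set_params`(**params)	Set the parameters of this estimator.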

fit
transform

__init__(n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, n_jobs=1, algorithm='full', init_max_iter=None, n_init='auto')
