clustering
- class datacheese.clustering.KMeans(seed=None)
Bases: object
K-means clustering model.
- Parameters:
seed (int or None, default None) – Random seed used to shuffle the data.
Examples
>>> import numpy as np
>>> from datacheese.clustering import KMeans
Generate input data:
>>> X = np.array(
...     [
...         [1, 2],
...         [1, 4],
...         [1, 0],
...         [10, 2],
...         [10, 4],
...         [10, 0],
...     ],
...     dtype=np.float64,
... )
>>> X
array([[ 1.,  2.],
       [ 1.,  4.],
       [ 1.,  0.],
       [10.,  2.],
       [10.,  4.],
       [10.,  0.]])
Fit model using data:
>>> model = KMeans()
>>> labels, centroids = model.fit(X, k=2)
>>> labels
array([1, 1, 1, 0, 0, 0], dtype=int64)
>>> centroids
array([[10.,  2.],
       [ 1.,  2.]])
Use model to make predictions:
>>> X_test = np.array([[2, 1], [11, 2]], dtype=np.float64)
>>> X_test
array([[ 2.,  1.],
       [11.,  2.]])
>>> model.predict(X_test)
array([1, 0], dtype=int64)
Compute within-clusters sum of squares:
>>> model.score(X_test, metric='wcss')
3.0000000000000004
Compute between-clusters sum of squares:
>>> model.score(X_test, metric='bcss')
40.5
- fit(X, k, max_iters=1000)
Fit model by clustering given data into k clusters.
- Parameters:
X (numpy.ndarray) – 2D features array, of shape n x d, where n is the number of data points and d is the number of dimensions.
k (int) – Number of clusters.
max_iters (int, default 1000) – Maximum number of iterations.
- Returns:
labels (numpy.ndarray) – 1D array, of shape n, containing labels for each data point.
centroids (numpy.ndarray) – 2D array, of shape k x d, containing centroid coordinates.
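The reference above does not say which algorithm fit uses. As a rough illustration only, the following sketch implements standard Lloyd's iteration with the same inputs and return values. The function name lloyd_kmeans, the initialisation (shuffle the rows and take the first k as starting centroids), and the convergence check are all assumptions, not datacheese's actual implementation.

```python
import numpy as np


def lloyd_kmeans(X, k, max_iters=1000, seed=None):
    """Hypothetical helper: plain Lloyd's algorithm, not datacheese's code."""
    X = np.asarray(X, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Assumed initialisation: shuffle the rows and use the first k as centroids.
    centroids = X[rng.permutation(len(X))[:k]].copy()
    labels = np.zeros(len(X), dtype=np.int64)
    for _ in range(max_iters):
        # Assignment step: (n, k) matrix of squared Euclidean distances.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    return labels, centroids


X = np.array(
    [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=np.float64
)
labels, centroids = lloyd_kmeans(X, k=2, seed=0)
```

Which label each cluster receives, and even which partition is found, can depend on the starting centroids, so the output of this sketch is not guaranteed to match the example output shown for KMeans.fit.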
- predict(X)
Use stored cluster centroids to predict labels for given data.
- Parameters:
X (numpy.ndarray) – 2D features array, of shape m x d, where m is the number of data points and d is the number of dimensions.
- Returns:
labels – 1D array, of shape m, containing labels for each data point.
- Return type:
numpy.ndarray
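Predicting from stored centroids is usually a nearest-centroid lookup. The sketch below shows that idea with NumPy broadcasting, assuming Euclidean distance. The helper name nearest_centroid_labels is hypothetical and not part of datacheese. The centroids and test points are taken from the class example.

```python
import numpy as np


def nearest_centroid_labels(X, centroids):
    """Hypothetical helper: label each row of X with its nearest centroid."""
    # Broadcasting gives an (m, k) matrix of squared Euclidean distances.
    diffs = X[:, None, :] - centroids[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # The label is the index of the closest centroid.
    return sq_dists.argmin(axis=1)


centroids = np.array([[10.0, 2.0], [1.0, 2.0]])
X_test = np.array([[2.0, 1.0], [11.0, 2.0]])
labels = nearest_centroid_labels(X_test, centroids)
print(labels)  # [1 0]
```

For these inputs the sketch assigns the same labels as the model.predict output in the class example.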
- score(X, metric='wcss')
Use stored centroids to predict labels for given data and compute a clustering score. The score is either the within-clusters sum of squares or the between-clusters sum of squares, depending on the chosen metric.
- Parameters:
X (numpy.ndarray) – 2D features array, of shape m x d, where m is the number of data points and d is the number of dimensions.
metric (str, default 'wcss') – Chosen metric. Must be one of 'wcss' or 'bcss', corresponding to the within-clusters sum of squares or the between-clusters sum of squares, respectively.
- Returns:
score – Clustering score.
- Return type:
float
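The two metrics can be sketched as follows. The reference does not give the exact formulas, so this is an assumption-laden illustration: WCSS is taken as the sum of squared distances from each point to its assigned centroid, and BCSS as the sum over clusters of the cluster size times the squared distance from its centroid to a reference point. Here the reference point is assumed to be the mean of the centroids; the library may use a different one. The helper name clustering_score is hypothetical.

```python
import numpy as np


def clustering_score(X, centroids, metric="wcss"):
    """Hypothetical helper sketching the two metrics; not datacheese's code."""
    # Assign each point to its nearest centroid (squared Euclidean distance).
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    if metric == "wcss":
        # Sum of squared distances from each point to its assigned centroid.
        return float(sq_dists[np.arange(len(X)), labels].sum())
    if metric == "bcss":
        # Assumption: the reference point is the mean of the centroids.
        reference = centroids.mean(axis=0)
        counts = np.bincount(labels, minlength=len(centroids))
        sq_offsets = ((centroids - reference) ** 2).sum(axis=1)
        # Weight each centroid's squared offset by its cluster size.
        return float((counts * sq_offsets).sum())
    raise ValueError("metric must be 'wcss' or 'bcss'")


centroids = np.array([[10.0, 2.0], [1.0, 2.0]])
X_test = np.array([[2.0, 1.0], [11.0, 2.0]])
wcss = clustering_score(X_test, centroids, metric="wcss")
bcss = clustering_score(X_test, centroids, metric="bcss")
```

With the centroids and test points from the class example, this sketch gives values of about 3.0 and 40.5, in line with the documented outputs up to floating-point rounding. That agreement on one example does not confirm how the library computes either score.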