clustering

class datacheese.clustering.KMeans(seed=None)

Bases: object

K-means clustering model.

Parameters:

seed (int or None, default None) – Random seed used to shuffle the data.

Examples

>>> import numpy as np
>>> from datacheese.clustering import KMeans

Generate input data:

>>> X = np.array(
...     [
...         [1, 2],
...         [1, 4],
...         [1, 0],
...         [10, 2],
...         [10, 4],
...         [10, 0],
...     ],
...     dtype=np.float64,
... )
>>> X
array([[ 1.,  2.],
       [ 1.,  4.],
       [ 1.,  0.],
       [10.,  2.],
       [10.,  4.],
       [10.,  0.]])

Fit model using data:

>>> model = KMeans()
>>> labels, centroids = model.fit(X, k=2)
>>> labels
array([1, 1, 1, 0, 0, 0], dtype=int64)
>>> centroids
array([[10.,  2.],
       [ 1.,  2.]])

Use model to make predictions:

>>> X_test = np.array([[2, 1], [11, 2]], dtype=np.float64)
>>> X_test
array([[ 2.,  1.],
       [11.,  2.]])
>>> model.predict(X_test)
array([1, 0], dtype=int64)

Compute within-clusters sum of squares:

>>> model.score(X_test, metric='wcss')
3.0000000000000004

Compute between-clusters sum of squares:

>>> model.score(X_test, metric='bcss')
40.5

fit(X, k, max_iters=1000)

Fit model by clustering given data into k clusters.

Parameters:
  • X (numpy.ndarray) – 2D features array, of shape n x d, where n is the number of data points and d is the number of dimensions.

  • k (int) – Number of clusters.

  • max_iters (int, default 1000) – Maximum number of iterations.

Returns:

  • labels (numpy.ndarray) – 1D array, of shape n, containing labels for each data point.

  • centroids (numpy.ndarray) – 2D array, of shape k x d, containing centroid coordinates.
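
The fitting procedure is not spelled out here. As a rough illustration only, the sketch below shows the standard Lloyd's algorithm with the same inputs and outputs as fit. The function name kmeans_sketch, the random choice of initial centroids, and the convergence test are assumptions, not the datacheese implementation.

```python
import numpy as np


def kmeans_sketch(X, k, max_iters=1000, seed=None):
    """Illustrative Lloyd's algorithm (hypothetical helper, not datacheese code)."""
    rng = np.random.default_rng(seed)

    def assign(points, centroids):
        # Squared Euclidean distance from every point to every centroid, shape (n, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    # Assumption: start from k distinct data points picked at random.
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)]

    for _ in range(max_iters):
        labels = assign(X, centroids)
        # Move each centroid to the mean of its points; keep it if its cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break

    return assign(X, centroids), centroids
```

Lloyd's algorithm only finds a local optimum, so the resulting clusters can depend on the initial centroids and therefore on the seed.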

predict(X)

Use stored cluster centroids to predict labels for given data.

Parameters:

X (numpy.ndarray) – 2D features array, of shape m x d, where m is the number of data points and d is the number of dimensions.

Returns:

labels – 1D array, of shape m, containing labels for each data point.

Return type:

numpy.ndarray
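
Label prediction from stored centroids amounts to a nearest-centroid lookup. The sketch below assumes squared Euclidean distance and uses a hypothetical helper name, nearest_centroid. It reuses the centroids and test points from the class example.

```python
import numpy as np


def nearest_centroid(X, centroids):
    """Return the index of the closest centroid for each row of X (hypothetical helper)."""
    # Broadcasting gives pairwise squared Euclidean distances, shape (m, k).
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


centroids = np.array([[10.0, 2.0], [1.0, 2.0]])
X_test = np.array([[2.0, 1.0], [11.0, 2.0]])
labels = nearest_centroid(X_test, centroids)
```

Under that assumption, labels holds 1 for the first point and 0 for the second, the same assignment as in the class example.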

score(X, metric='wcss')

Use stored centroids to predict labels for given data and compute clustering score. This can be the within-clusters sum of squares, or the between-clusters sum of squares, depending on the chosen metric.

Parameters:
  • X (numpy.ndarray) – 2D features array, of shape m x d, where m is the number of data points and d is the number of dimensions.

  • metric (str, default 'wcss') – Metric to compute. Must be either 'wcss' (within-clusters sum of squares) or 'bcss' (between-clusters sum of squares).

Returns:

score – Clustering score.

Return type:

float
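
The exact formulas behind the two metrics are not given here. The sketch below uses the common textbook definitions, with hypothetical helper names wcss and bcss. The reference point for BCSS (the mean of the scored data) is an assumption.

```python
import numpy as np


def wcss(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())


def bcss(X, labels, centroids):
    """Size-weighted sum of squared distances from each centroid to the data mean."""
    # Assumption: the reference point is the mean of X; other conventions exist.
    overall_mean = X.mean(axis=0)
    counts = np.bincount(labels, minlength=centroids.shape[0])
    sq_dists = ((centroids - overall_mean) ** 2).sum(axis=1)
    return float((counts * sq_dists).sum())


centroids = np.array([[10.0, 2.0], [1.0, 2.0]])
X_test = np.array([[2.0, 1.0], [11.0, 2.0]])
labels = np.array([1, 0])

wcss_value = wcss(X_test, labels, centroids)
bcss_value = bcss(X_test, labels, centroids)
```

With these definitions wcss_value is 3.0, which agrees with the class example up to floating-point rounding. bcss_value is 43.0 rather than the 40.5 shown in the class example, which suggests that score uses a different reference point or weighting for BCSS.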