K-Means and DBSCAN Clustering: Detailed Procedure and Silhouette Score

Here’s a streamlined version of the documentation, focusing only on the steps you took as a developer. I also added an example for clarity:


Overview

This section documents the process we followed to apply K-Means and DBSCAN clustering algorithms to our dataset, and how we evaluated their performance using the Silhouette Score. Despite experimenting with both algorithms, the clustering results were not ideal, with the best Silhouette Score being 0.2 from K-Means.


K-Means Clustering

Step-by-Step Procedure

  1. Preprocessing: Performed feature extraction using TF-IDF, stop-word removal, and data normalization.
  2. Choosing K: Experimented with values of k from 20 to 200, ultimately using k=107 for further evaluation.
  3. Fitting the Model: Applied K-Means with k=10.
  4. Evaluation: Calculated the Silhouette Score for K-Means, which was 0.2, indicating weak cluster separation.

Example:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Example of fitting K-Means and calculating the Silhouette Score
kmeans = KMeans(n_clusters=107, random_state=42)
kmeans.fit(data)
labels = kmeans.labels_
score = silhouette_score(data, labels)

print(f"Silhouette Score: {score}")

DBSCAN Clustering

Step-by-Step Procedure

  1. Preprocessing: Applied the same preprocessing steps as in K-Means.
  2. Choosing Parameters: Started with eps=0.5 and min_samples=5.
  3. Fitting the Model: Ran DBSCAN on the preprocessed data.
  4. Evaluation: The Silhouette Score for DBSCAN was lower than 0.2, with the model either detecting too few clusters or classifying too many points as noise.

Example:

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Example of fitting DBSCAN and calculating the Silhouette Score
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(data)
labels = dbscan.labels_
score = silhouette_score(data, labels)

print(f"Silhouette Score: {score}")