K-Means with FAISS (Facebook AI Similarity Search): Procedure and Results

Overview

After trying traditional K-Means and DBSCAN, we implemented K-Means with FAISS (Facebook AI Similarity Search) to improve clustering performance on large datasets. FAISS K-Means showed a slight improvement in the Silhouette Score, but cluster quality remained unsatisfactory.


FAISS Overview

FAISS is a library by Facebook for fast similarity search and clustering, optimized for high-dimensional datasets. It offers:


K-Means in FAISS: Implementation Steps

  1. Data Preparation:
    • We used TF-IDF to convert documents into high-dimensional vectors.
    • Preprocessing included stop-word removal and normalization.
  2. K-Means Initialization:
    • Initialized exact K-Means in FAISS with the TF-IDF vectors.
  3. Indexing for Optimization:
    • Utilized Inverted File Index (IVF) for faster nearest neighbor search.
    • GPU acceleration was available but unnecessary for this dataset.
  4. Clustering:
    • Standard K-Means steps: assignment to centroids, centroid updates, iteration until convergence.
  5. Evaluation:
    • Silhouette Score was calculated for cluster quality.
    • FAISS K-Means showed a slight improvement over traditional K-Means but remained low overall.

Example:

import faiss

# Create index and perform clustering
index = faiss.IndexFlatL2(dim)  # Exact K-Means
faiss_kmeans = faiss.Kmeans(d=dim, k=num_clusters)
faiss_kmeans.train(tfidf_vectors)

# Get cluster labels and centroids
labels = faiss_kmeans.index.search(tfidf_vectors, 1)[1]
centroids = faiss_kmeans.centroids

Results: FAISS vs Traditional K-Means

Key Issues: