K-Means with FAISS (Facebook AI Similarity Search): Procedure and Results
Overview
After trying traditional K-Means and DBSCAN, we implemented K-Means with FAISS (Facebook AI Similarity Search) to improve clustering performance on large datasets. FAISS K-Means showed a slight improvement in the Silhouette Score, but cluster quality remained unsatisfactory.
FAISS Overview
FAISS is a library by Facebook for fast similarity search and clustering, optimized for high-dimensional datasets. It offers:
- Exact and Approximate K-Means for performance vs accuracy tradeoffs.
- Scalability for large datasets, GPU support, and Approximate Nearest Neighbor Search.
K-Means in FAISS: Implementation Steps
- Data Preparation:
- We used TF-IDF to convert documents into high-dimensional vectors.
- Preprocessing included stop-word removal and normalization.
- K-Means Initialization:
- Initialized exact K-Means in FAISS with the TF-IDF vectors.
- Indexing for Optimization:
- Utilized Inverted File Index (IVF) for faster nearest neighbor search.
- GPU acceleration was available but unnecessary for this dataset.
- Clustering:
- Standard K-Means steps: assignment to centroids, centroid updates, iteration until convergence.
- Evaluation:
- Silhouette Score was calculated for cluster quality.
- FAISS K-Means showed a slight improvement over traditional K-Means but remained low overall.
Example:
import faiss
# Create index and perform clustering
index = faiss.IndexFlatL2(dim) # Exact K-Means
faiss_kmeans = faiss.Kmeans(d=dim, k=num_clusters)
faiss_kmeans.train(tfidf_vectors)
# Get cluster labels and centroids
labels = faiss_kmeans.index.search(tfidf_vectors, 1)[1]
centroids = faiss_kmeans.centroids
Results: FAISS vs Traditional K-Means
- Traditional K-Means: Silhouette Score = 0.2
- FAISS K-Means: Slightly higher but still below ideal.
Key Issues:
- High-dimensional data affects distance metrics, causing poor separation.
- TF-IDF features may not capture important aspects—potentially requiring PCA or feature refinement.