Introduction to Clustering

Overview

In the website categorization pipeline, clustering organizes websites into distinct categories based on their textual content. The chosen algorithm is K-Means, which groups the websites into 90 clusters; cluster quality is evaluated using silhouette scores and PCA visualization. The resulting clusters are then manually labeled to create a labeled dataset for model training.

Clustering Algorithm Selection

K-Means

For partitioning the dataset into distinct categories, I implemented K-Means clustering. The algorithm groups websites into clusters based on the textual content scraped from each website, with each cluster representing a unique category.
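
The K-Means step, together with the silhouette evaluation mentioned in the overview, could be sketched roughly as follows. This is a minimal illustration rather than the pipeline's actual code: it assumes scikit-learn, the sample texts and the `cluster_texts` helper are hypothetical, and the toy data uses 3 clusters instead of the 90 used in the pipeline.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the text scraped from each website.
sample_texts = [
    "buy shoes online discount fashion store",
    "fashion clothing store sale shoes",
    "latest football scores and match results",
    "football league match highlights scores",
    "python programming tutorial code examples",
    "learn programming with code tutorial python",
]


def cluster_texts(texts, n_clusters, seed=0):
    """Vectorize texts with TF-IDF, run K-Means, and score the result."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(vectors)
    # Silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters.
    score = silhouette_score(vectors, labels)
    return labels, score


labels, score = cluster_texts(sample_texts, n_clusters=3)
print(labels, round(float(score), 3))
```

Fixing `random_state` keeps the cluster assignments reproducible between runs, which helps when the clusters are labeled by hand afterwards.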

DBSCAN

I experimented with DBSCAN for its outlier detection capability, but K-Means performed better on the dataset, where the clusters are well separated. DBSCAN was also less efficient on a dataset of this size and less effective on clusters of varying density.
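
For reference, the outlier detection behavior of DBSCAN can be seen in a small sketch like the one below. It assumes scikit-learn; the sample texts and the `eps` and `min_samples` values are illustrative choices, not the settings used in the experiment. DBSCAN assigns the label -1 to points it treats as noise.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample texts; the last one shares no terms with the others.
dbscan_texts = [
    "buy shoes online discount fashion store",
    "fashion clothing store sale shoes",
    "latest football scores and match results",
    "football league match highlights scores",
    "quantum gardening telescope",
]

dbscan_vectors = TfidfVectorizer(stop_words="english").fit_transform(dbscan_texts)

# eps and min_samples are illustrative values, not tuned settings.
dbscan = DBSCAN(eps=0.8, min_samples=2, metric="cosine")
dbscan_labels = dbscan.fit_predict(dbscan_vectors)

# A label of -1 marks a point that DBSCAN considers noise (an outlier).
print(dbscan_labels)
```

Because a single `eps` applies to the whole dataset, one value rarely suits both dense and sparse regions, which is consistent with the difficulty on clusters of varying density noted above.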

Key Considerations

Feature Extraction

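Before clustering, the textual content of each website was converted into numerical vectors using TF-IDF. TF-IDF gives higher weight to terms that are frequent within one website but rare across the others, which made it effective at identifying the terms that matter for categorization.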

Dimensionality Reduction

Because the TF-IDF vectors are high-dimensional, PCA was applied to reduce the number of dimensions while retaining as much of the variance in the data as possible. This step was crucial in improving clustering performance.
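
The reduction step might look like the sketch below, assuming scikit-learn's `PCA`. The sample texts and the choice of 2 components are illustrative (2 components are convenient for plotting); the number of components used in the pipeline is not stated here. The TF-IDF matrix is converted to a dense array first, which is fine for a toy example but can be memory-hungry on a large corpus.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the text scraped from each website.
pca_texts = [
    "buy shoes online discount fashion store",
    "fashion clothing store sale shoes",
    "latest football scores and match results",
    "football league match highlights scores",
    "python programming tutorial code examples",
    "learn programming with code tutorial python",
]

tfidf_vectors = TfidfVectorizer(stop_words="english").fit_transform(pca_texts)

# n_components=2 is an illustrative choice that also suits a 2-D scatter plot.
pca = PCA(n_components=2, random_state=0)
reduced = pca.fit_transform(tfidf_vectors.toarray())

print(reduced.shape)
# Fraction of the total variance kept by the retained components.
print(float(pca.explained_variance_ratio_.sum()))
```

Checking `explained_variance_ratio_` is a quick way to see how much of the original variance survives the reduction before clustering on the reduced vectors.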

Preprocessing for Clustering

Examples