Introduction

Problem Statement

To block unwanted websites, our firewall must be able to categorize websites by their content. Because the number of websites is vast, this categorization must be automated for blocking to be efficient and reliable. The objective is to build a model that predicts the category of any new website.

Project Overview

This project involves creating an automated pipeline that:

  1. Scrapes website content.
  2. Extracts keywords using TF-IDF.
  3. Clusters websites based on similar keywords for manual labeling, producing a labeled dataset.

The end goal is to build a model that can predict the category of new websites, enabling automated website blocking in the firewall.

Approach

1. Scraping Website Content

We use Playwright to automate browser navigation and extract each page's textual content, ignoring irrelevant components such as ads and JavaScript.
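A minimal sketch of this step, under stated assumptions: the function names (`html_to_text`, `scrape_text`), the list of skipped tags, and the timeout are illustrative and not taken from the project. It loads a page with Playwright's synchronous API, then strips non-visible elements with the standard-library HTML parser. Ad filtering is not covered here and would need its own rules.

```python
from html.parser import HTMLParser

# Tags whose contents are not visible page text (assumed list; extend as needed).
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


def scrape_text(url, timeout_ms=30_000):
    """Render `url` in headless Chromium and return its visible text."""
    # Imported lazily so html_to_text works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            return html_to_text(page.content())
        finally:
            browser.close()
```

Parsing the rendered HTML, rather than only reading the body text from the browser, keeps the filtering logic testable offline; whether that trade-off suits the project is a judgment call.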

2. Keyword Extraction

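We extract keywords with Term Frequency-Inverse Document Frequency (TF-IDF), which scores a term highly when it is frequent on one website but rare across the rest of the corpus. The resulting keyword scores are the input to clustering.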
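To make the scoring concrete, here is a dependency-free sketch of textbook TF-IDF, where a term's score is its relative frequency in a document multiplied by log(N / df). The tokenizer, the function names, and the unsmoothed IDF variant are assumptions for illustration; a library implementation such as scikit-learn's `TfidfVectorizer` uses a smoothed formula and would give different numbers.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: lowercase alphabetic words of three or more letters.
TOKEN_RE = re.compile(r"[a-z]{3,}")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def tfidf_vectors(documents):
    """Return one {term: score} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty pages
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def top_keywords(vector, k=5):
    """Return up to k terms with the highest positive scores."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, score in ranked[:k] if score > 0]
```

Terms that occur in every document get a score of zero under this variant, so they never appear among the extracted keywords.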

3. Website Clustering

We use K-Means to group websites with similar keywords. Each cluster is then labeled manually, which produces the labeled dataset.
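The grouping step can be sketched as plain Lloyd's-algorithm K-Means over dense feature vectors. This is an illustrative, dependency-free version: the function names, the random initialization, and the fixed seed are assumptions, and a production pipeline would more likely rely on a library implementation with better initialization.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    # Column-wise mean of a non-empty list of equal-length vectors.
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct input positions.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None

    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = mean_point(members)

    return labels, centroids
```

The number of clusters `k` has to be chosen up front; since the clusters are reviewed by hand anyway, a value somewhat above the expected number of categories is one reasonable starting point.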

4. Category Prediction for New Websites

Once the labeled dataset is prepared, we build a pipeline for predicting categories of new websites.
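The classifier itself is not specified above, so the following is only one possible shape for the prediction step: a small multinomial Naive Bayes model with add-one smoothing, trained on labeled page text. The class name `NaiveBayesCategorizer`, the tokenizer, and the example categories are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tokenizer: lowercase alphabetic words of three or more letters.
TOKEN_RE = re.compile(r"[a-z]{3,}")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


class NaiveBayesCategorizer:
    """Hypothetical text classifier; not the project's confirmed model."""

    def fit(self, texts, labels):
        self.term_counts = defaultdict(Counter)  # label -> term -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.term_counts[label].update(tokenize(text))
        self.vocabulary = {
            term for counts in self.term_counts.values() for term in counts
        }
        self.n_docs = len(labels)
        return self

    def predict(self, text):
        # Terms never seen in training carry no evidence, so drop them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        vocab_size = len(self.vocabulary)

        def log_score(label):
            counts = self.term_counts[label]
            total = sum(counts.values())
            # Log prior plus add-one-smoothed log likelihood of each term.
            score = math.log(self.doc_counts[label] / self.n_docs)
            for term in tokens:
                score += math.log((counts[term] + 1) / (total + vocab_size))
            return score

        return max(sorted(self.doc_counts), key=log_score)


# Usage with made-up training data:
model = NaiveBayesCategorizer().fit(
    ["casino poker bets jackpot", "breaking news weather report"],
    ["gambling", "news"],
)
category = model.predict("poker and casino bonus offers")
```

In the firewall, the predicted category would then be checked against the list of blocked categories; that lookup is outside the scope of this sketch.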

This pipeline allows for automated, scalable website categorization, enabling effective website blocking in the firewall system.