Keyword Extraction Process

Overview

This document outlines the keyword extraction process used in the automated website categorization pipeline. The objective is to extract the most relevant keywords from website content to support categorization.

Chosen Method: TF-IDF

We chose Term Frequency-Inverse Document Frequency (TF-IDF) for keyword extraction due to its balance between efficiency and relevance. Although more advanced methods like BERT were explored, they proved too resource-intensive for our large-scale use case.

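Advantages of TF-IDF:

  • Efficiency: Scores are derived from simple term counts, so the method needs far less compute than BERT-style models and scales to large volumes of pages.
  • Relevance: Terms that are frequent on a page but rare across the corpus score highest, which surfaces keywords specific to that page.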

Implementation Steps

  1. Data Preprocessing: The website content undergoes the following transformations, in order:
    • Lowercasing: Convert text to lowercase so that stop-word matching and term counts are case-insensitive.
    • Punctuation Removal: Eliminate punctuation and special characters.
    • Tokenization: Split text into words (tokens).
    • Stop-word Removal: Remove common words like “the”, “is”.
    • Lemmatization: Reduce words to their base form (e.g., “running” becomes “run”).
  2. TF-IDF Calculation:
    For each term in a document, compute:
    • Term Frequency (TF): How often the term occurs in the document.
    • Inverse Document Frequency (IDF): How rare the term is across all documents in the corpus.
    The TF-IDF score of a term is the product of the two values.
  3. Ranking and Selection: Rank each document's terms by TF-IDF score and select the highest-scoring terms as its keywords.
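
One common way to write the calculation in step 2 is given below. This document does not state which TF-IDF variant the pipeline uses, so the length-normalized TF and the unsmoothed logarithmic IDF are assumptions, as are the symbols: f_{t,d} is the number of times term t occurs in document d, N is the number of documents in the corpus, and n_t is the number of documents that contain t.

```latex
\begin{align*}
\mathrm{tf}(t, d) &= \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \\
\mathrm{idf}(t) &= \log \frac{N}{n_t} \\
\text{tf-idf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\end{align*}
```

Under this formulation a term that appears in every document has an IDF of zero, so it can never be selected as a keyword. Library implementations often add smoothing or vector normalization, which changes the absolute scores but usually not the overall ranking logic.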

Example Workflow

Input:

Website content:
“Artificial intelligence (AI) is transforming industries. AI models are powerful.”

Preprocessing Result:

[“artificial”, “intelligence”, “ai”, “transform”, “industry”, “ai”, “model”, “powerful”]

Vectorization:

Term        TF-IDF Score
ai          0.5
model       0.7
powerful    0.5

The top keywords, ranked by score, would be “model”, “ai”, and “powerful”.
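
To make the workflow concrete, here is a minimal standard-library sketch of the three steps, run on the example input. It is an illustration under stated assumptions, not the pipeline's actual code: the stop-word list, the `LEMMAS` lookup table (a stand-in for a real lemmatizer), the two extra corpus pages, and the function names are all hypothetical. It uses the plain definition of TF-IDF (relative term frequency times log(N / document frequency)) and lowercases before removing stop words.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "is", "are", "a", "an", "and", "of", "to", "in"}

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"transforming": "transform", "industries": "industry", "models": "model"}


def preprocess(text):
    """Lowercase, drop punctuation, tokenize, remove stop words, lemmatize."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(doc_scores, k=3):
    """Rank terms by descending score; ties are broken alphabetically."""
    ranked = sorted(doc_scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical three-page corpus; the first page is the example input.
corpus = [
    "Artificial intelligence (AI) is transforming industries. AI models are powerful.",
    "Fashion models are transforming the clothing industries.",
    "Powerful engines and artificial turf.",
]
tokenized = [preprocess(page) for page in corpus]
scores = tf_idf(tokenized)
print(top_keywords(scores[0]))  # ['ai', 'intelligence', 'artificial']
```

The ranking here differs from the scores in the table because TF-IDF depends on the rest of the corpus: in this made-up corpus “model” and “powerful” also appear on other pages, which lowers their IDF, while “ai” occurs twice and only on the first page.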