Keyword Extraction Process
Overview
This document outlines the keyword extraction process used in the automated website categorization pipeline. The objective is to extract the most relevant keywords from website content to support categorization.
Chosen Method: TF-IDF
We chose Term Frequency-Inverse Document Frequency (TF-IDF) for keyword extraction due to its balance between efficiency and relevance. Although more advanced methods like BERT were explored, they proved too resource-intensive for our large-scale use case.
Advantages of TF-IDF:
- Efficiency: Computationally lightweight compared to deep learning models.
- Simplicity: Easy to implement and interpret.
- Scalability: Can handle large datasets effectively.
Implementation Steps
- Data Preprocessing: The website content undergoes the following transformations:
- Tokenization: Split text into words (tokens).
- Stop-word Removal: Remove common words like “the”, “is”.
- Lowercasing: Convert text to lowercase for consistency.
- Punctuation Removal: Eliminate punctuation and special characters.
- Lemmatization: Reduce words to their base form (e.g., “running” becomes “run”).
- TF-IDF Calculation:
For each term in a document:- Term Frequency (TF): Frequency of the term in the document.
- Inverse Document Frequency (IDF): Rarity of the term across all documents.
- Ranking and Selection: Rank terms by their TF-IDF scores and select top keywords.
Example Workflow
Input:
Website content:
“Artificial intelligence (AI) is transforming industries. AI models are powerful.”
Preprocessing Result:
[“artificial”, “intelligence”, “ai”, “transforming”, “industries”, “models”, “powerful”]
Vectorization:
| Term | TF-IDF Score |
|---|---|
| AI | 0.5 |
| models | 0.7 |
| powerful | 0.5 |
The top keywords would be “models”, “AI”, and “powerful.”