Step 0: Data Collection

In this step, I focused on collecting a comprehensive list of websites for the categorization project. The data collection process involved the following:

1. Extracting URL Logs from Internal Sources

We leveraged URL logs from two institutions, GNDEC Ludhiana and Thapar University, and extracted the domain names visited over the past three years.

Developer Actions:

Example:

from urllib.parse import urlparse

url = 'https://sub.example.com/page'
domain = urlparse(url).netloc  # Outputs: 'sub.example.com'
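
To process a whole log, the same parsing can be applied line by line. A minimal sketch, assuming a plain-text log with one full URL per line (the file name is illustrative):

from urllib.parse import urlparse

domain_list = []
with open('url_log.txt') as log_file:
    for line in log_file:
        url = line.strip()
        if url:
            # Keep only the host portion of each logged URL.
            domain_list.append(urlparse(url).netloc)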

2. Domain Name Extraction

The logs contained full URLs, but only domain names were necessary for the project. Subdomains were removed to ensure focus on primary domains.

Developer Actions:

Example:

domain = '.'.join(urlparse(url).netloc.split('.')[-2:])  # Reduces 'sub.example.com' to 'example.com'
# Note: this simple split misreads multi-part TLDs such as 'example.co.uk'.

3. Filtering and Cleaning Data

The raw list of domains was further filtered to remove irrelevant entries such as hosting services and CDN domains. Duplicate domains were also eliminated.

Developer Actions:

Example:

unique_domains = list(set(domain_list))
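
Filtering out hosting and CDN entries can be sketched with a simple substring blocklist; the patterns below are illustrative, not the actual list used:

# Illustrative blocklist of hosting/CDN patterns; the real list was assembled by hand.
BLOCKED_SUBSTRINGS = ['cloudfront.net', 'akamaiedge.net', 'amazonaws.com', 'fastly.net']

def is_relevant(domain):
    # Drop any domain that matches a known hosting/CDN pattern.
    return not any(blocked in domain for blocked in BLOCKED_SUBSTRINGS)

filtered_domains = [d for d in unique_domains if is_relevant(d)]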

4. Handling Expired Domains

A significant portion of the domains from the logs was expired or unreachable, so each domain was checked for availability before scraping.

Developer Actions:

Example:

import requests

def check_domain_availability(domain):
    # A domain is treated as live if it answers an HTTP request with status 200.
    try:
        response = requests.get(f"http://{domain}", timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        # Covers connection errors, timeouts, and other request failures.
        return False
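
The check can then be applied across the filtered list; a minimal sketch using a thread pool so the per-domain timeout does not add up serially (the worker count is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

# Run the availability checks concurrently; 20 workers is an illustrative setting.
with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(check_domain_availability, filtered_domains))

live_domains = [d for d, ok in zip(filtered_domains, results) if ok]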

5. Augmenting Data with the Top 10 Million Domains

To enhance the dataset, I downloaded a list of the top 10 million domains from Domcop.

Developer Actions:
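
A minimal sketch of folding the downloaded list into the dataset, assuming the Domcop file is a CSV with a 'Domain' column (the exact file name and column header may differ):

Example:

import csv

top_domains = []
with open('top10milliondomains.csv', newline='') as csv_file:
    for row in csv.DictReader(csv_file):
        # 'Domain' is the assumed column name in the Domcop export.
        top_domains.append(row['Domain'].lower())

# Combine with the live domains extracted from the logs, removing duplicates.
combined_domains = list(set(live_domains) | set(top_domains))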

6. Issues with Alternative Data Sources

While evaluating alternative sources such as Kaggle and GitHub datasets, we encountered formatting issues such as missing protocols and inconsistent entries.

Developer Actions:
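
A minimal sketch of the kind of normalization these sources needed, assuming the main problem was URLs recorded without a scheme:

Example:

from urllib.parse import urlparse

def normalize(raw):
    raw = raw.strip().lower()
    # Prepend a scheme if it is missing so urlparse fills in netloc correctly.
    if not raw.startswith(('http://', 'https://')):
        raw = 'http://' + raw
    return urlparse(raw).netloc

print(normalize('Example.com/page'))  # 'example.com'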

7. Final Domain List

After these processes, I had a refined and structured list of domains ready for scraping and categorization.

Developer Actions:
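
A minimal sketch of persisting the result for the scraping stage (the output file name is an assumption):

Example:

# Write the final, deduplicated list to disk; the file name is illustrative.
with open('final_domains.txt', 'w') as out_file:
    out_file.write('\n'.join(sorted(combined_domains)))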