Step 0: Data Collection
In this step, I focused on collecting a comprehensive list of websites for the categorization project. The data collection process involved the following:
1. Extracting URL Logs from Internal Sources
We used URL logs from two colleges, GNDEC Ludhiana and Thapar University, to extract the domain names visited over the past three years.
Developer Actions:
- Log Extraction: I extracted URL logs from the firewall/network monitoring system of both colleges.
- Data Parsing: Using Python, I parsed these logs to isolate the domain names from the URLs.
Example:
from urllib.parse import urlparse
url = 'https://sub.example.com/page'
domain = urlparse(url).netloc # Outputs: 'sub.example.com'
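Applied to an entire log file, the same idea might look like the sketch below. The real firewall logs carry extra fields around each URL, so the one-URL-per-line assumption (and the extract_domains name) is purely illustrative.

from urllib.parse import urlparse

def extract_domains(log_path):
    # Illustrative only: assumes one URL per line; real logs need field parsing first.
    domains = set()
    with open(log_path, encoding='utf-8') as f:
        for line in f:
            url = line.strip()
            if url:
                domains.add(urlparse(url).netloc)
    return domains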
2. Domain Name Extraction
The logs contained full URLs, but only domain names were necessary for the project. Subdomains were removed to ensure focus on primary domains.
Developer Actions:
- Parsing and Cleaning: I used Python’s urlparse module to extract domain names and cleaned the list to remove subdomains.
Example:
domain = '.'.join(urlparse(url).netloc.split('.')[-2:])  # Reduces 'sub.example.com' to 'example.com'
3. Filtering and Cleaning Data
The raw list of domains was further filtered to remove irrelevant entries such as hosting services and CDN domains. Duplicate domains were also eliminated.
Developer Actions:
- Filtering: I implemented custom filters to exclude domains related to hosting services or CDNs (a sketch follows the example below).
- Duplicate Removal: A simple Python script removed duplicates from the dataset.
Example:
unique_domains = list(set(domain_list))
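The hosting/CDN filter itself was keyword-based. The snippet below is only a minimal sketch of that idea; the keywords shown are illustrative, not the exact block list used in the project.

BLOCKED_KEYWORDS = ('cdn', 'cloudfront', 'akamai', 'hosting')

def is_relevant(domain):
    # Drop domains whose name contains any blocked keyword.
    return not any(keyword in domain for keyword in BLOCKED_KEYWORDS)

filtered_domains = [d for d in unique_domains if is_relevant(d)]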
4. Handling Expired Domains
A significant share of the domains in the logs turned out to be expired or unreachable.
Developer Actions:
- Availability Check: I developed a script to check the reachability of each domain.
- Filtering: Expired domains were automatically removed from the list.
Example:
import requests

def check_domain_availability(domain):
    # Treat a domain as live if a plain HTTP request returns status 200.
    try:
        response = requests.get(f"http://{domain}", timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        # Covers connection errors, timeouts, and other request failures.
        return False
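Run over the cleaned list, this check drops anything that no longer responds. The sketch below assumes the filtered_domains list from the earlier step; a real run would likely parallelize the requests, since checking millions of domains sequentially is far too slow.

# Keep only domains that still respond to a plain HTTP request.
reachable_domains = [d for d in filtered_domains if check_domain_availability(d)]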
5. Augmenting Data with the Top 10 Million Domains
To enhance the dataset, I downloaded a list of the top 10 million domains from Domcop.
Developer Actions:
- Data Integration: I merged the cleaned internal domain list with the top 10 million domains from Domcop, ensuring proper formatting for scraping.
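As a rough sketch, the merge can be a simple set union of the two cleaned lists. The file names below are placeholders, and the loader assumes one domain per line; the actual downloads may be CSVs with extra columns, in which case the relevant column would be parsed instead.

def load_domains(path):
    # Placeholder loader: one domain per line, lower-cased and stripped.
    with open(path, encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}

internal_domains = load_domains('internal_domains.txt')
domcop_domains = load_domains('domcop_top10m.txt')

# Union of both sources; using sets deduplicates by construction.
merged_domains = sorted(internal_domains | domcop_domains)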
6. Issues with Alternative Data Sources
While evaluating alternative sources such as Kaggle and GitHub, we ran into formatting problems, including missing URL protocols and inconsistent data.
Developer Actions:
- Formatting Issues: I discarded inconsistent or incomplete lists from Kaggle and GitHub and focused on reliable sources like Domcop.
7. Final Domain List
After these processes, I had a refined and structured list of domains ready for scraping and categorization.
Developer Actions:
- Combining and Preprocessing: I combined multiple sources and ensured the domains had proper “http” or “https” prefixes for smooth scraping.
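A minimal sketch of that prefix normalization, assuming bare domain names as input (the to_url name and the merged_domains list are illustrative):

def to_url(domain):
    # Prepend a scheme only when one is missing; plain HTTP is the assumed
    # default, matching the availability check used earlier.
    if domain.startswith(('http://', 'https://')):
        return domain
    return f"http://{domain}"

urls_to_scrape = [to_url(d) for d in merged_domains]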