Problems in Web Scraping
Challenges in Web Scraping
As the developer of this project, I encountered several technical challenges related to bandwidth, memory usage, and CPU utilization. Below is a summary of these issues.
1. Bandwidth Limitations
Problem: Scraping multiple websites simultaneously using Playwright resulted in timeouts due to bandwidth bottlenecks when opening numerous browser tabs in parallel.
Example: Scraping 100 domains at once led to timeouts on 30% of the websites.
2. Memory Leaks in Playwright
Problem: Running the scraper for extended periods caused Playwright to leak memory, leading to excessive RAM usage and eventually crashing the system.
Example: After 10 hours of continuous scraping, RAM usage reached 90%, causing the process to stop.
3. CPU Bottleneck
Problem: My local setup with a 4-core CPU couldn’t handle the high concurrency of scraping hundreds of websites, resulting in slow performance.
Example: The CPU usage consistently maxed out, significantly increasing scraping times.
Solution: Infrastructure Upgrade
To address all three problems—bandwidth limitations, memory leaks, and CPU bottlenecks—I upgraded to a more powerful server with the following specifications:
- Location: Singapore
- Bandwidth: 1 GBPS dedicated connection
- Memory: 128 GB RAM
- Processor: 8-core CPU
Improvements:
- Bandwidth: The 1 GBPS connection eliminated timeouts, allowing efficient concurrent scraping of hundreds of websites.
- Memory: The 128 GB of RAM provided sufficient headroom to prevent memory leaks from crashing the process. Additionally, I implemented periodic process restarts for memory management.
- CPU Performance: The 8-core processor enabled better parallelism, speeding up the scraping process.
# Example of async scraping in Playwright
async with Playwright() as p:
browser = await p.chromium.launch()
tasks = [scrape_page(url) for url in urls]
await asyncio.gather(*tasks)
# Restart the browser every 100 websites to prevent memory leaks
if counter % 100 == 0:
await browser.close()
browser = await p.chromium.launch()
Network Performance Comparison:
Despite having a 1 GBPS connection on both my local setup and the rented server, the server consistently utilized the full bandwidth, whereas the local connection (on WiFi/Ethernet) maxed out at 400-600 Mbps. The server’s dedicated infrastructure proved far more reliable for sustained, high-performance web scraping.