Challenges in Web Scraping

As the developer of this project, I encountered several technical challenges related to bandwidth, memory usage, and CPU utilization. Below is a summary of these issues.

1. Bandwidth Limitations

Problem: Opening numerous Playwright browser tabs in parallel saturated the available bandwidth, and the resulting bottleneck caused page loads to time out.

Example: Scraping 100 domains at once led to timeouts on 30% of the websites.
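
One way to relieve this pressure is to cap how many tabs are open at once instead of launching everything in parallel. Below is a minimal sketch using an asyncio.Semaphore; the limit of 10, the scrape_all name, and the extraction placeholder are illustrative assumptions, not the project's actual values:

# Sketch: cap concurrent tabs with an asyncio.Semaphore
import asyncio
from playwright.async_api import async_playwright

async def scrape_all(urls, limit=10):
    sem = asyncio.Semaphore(limit)  # at most `limit` tabs share the link
    async with async_playwright() as p:
        browser = await p.chromium.launch()

        async def scrape(url):
            async with sem:
                page = await browser.new_page()
                try:
                    await page.goto(url, timeout=60_000)  # timeout in ms
                    # ... extract data from the page here ...
                finally:
                    await page.close()

        await asyncio.gather(*(scrape(u) for u in urls))
        await browser.close()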

2. Memory Leaks in Playwright

Problem: Running the scraper for extended periods caused Playwright to leak memory, leading to excessive RAM usage and eventually crashing the scraping process.

Example: After 10 hours of continuous scraping, RAM usage reached 90%, causing the process to stop.
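
Beyond adding RAM, the leak can be contained by watching system memory and recycling the browser before usage becomes critical. A minimal sketch, assuming psutil is installed and `p` is the async_playwright handle; the 85% threshold is an illustrative value, not the project's setting:

# Sketch: recycle the browser when system-wide RAM usage crosses a threshold
import psutil

async def maybe_restart(p, browser, threshold=85.0):
    # virtual_memory().percent is system-wide, so it also reflects the
    # memory held by Chromium's child processes
    if psutil.virtual_memory().percent > threshold:
        await browser.close()
        browser = await p.chromium.launch()
    return browser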

3. CPU Bottleneck

Problem: My local setup with a 4-core CPU couldn’t handle the high concurrency of scraping hundreds of websites, resulting in slow performance.

Example: The CPU usage consistently maxed out, significantly increasing scraping times.
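
A common rule of thumb is to derive the concurrency limit from the core count rather than hard-coding it, so the same scraper adapts to whatever machine it runs on. The two-tabs-per-core ratio below is an illustrative guess, not a benchmark result:

# Sketch: size the tab limit to the machine's cores
import asyncio
import os

TABS_PER_CORE = 2  # illustrative ratio; tune against observed CPU usage
limit = (os.cpu_count() or 1) * TABS_PER_CORE
semaphore = asyncio.Semaphore(limit)  # reuse as in the bandwidth sketch above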


Solution: Infrastructure Upgrade

To address all three problems (bandwidth limitations, memory leaks, and CPU bottlenecks), I upgraded to a more powerful server with the following specifications:

  CPU: 8 cores
  RAM: 128 GB
  Network: 1 Gbps dedicated connection

Improvements:

  1. Bandwidth: The 1 Gbps connection eliminated timeouts, allowing efficient concurrent scraping of hundreds of websites.
  2. Memory: The 128 GB of RAM provided sufficient headroom to prevent memory leaks from crashing the process. Additionally, I implemented periodic process restarts for memory management.
  3. CPU Performance: The 8-core processor enabled better parallelism, speeding up the scraping process.
# Example of async scraping in Playwright
import asyncio
from playwright.async_api import async_playwright

async def main(urls, batch_size=10):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        for start in range(0, len(urls), batch_size):
            # scrape_page is the project's own helper; it takes the browser
            # explicitly so a freshly restarted instance is picked up
            batch = urls[start:start + batch_size]
            await asyncio.gather(*(scrape_page(browser, url) for url in batch))
            # Restart the browser every 100 websites to prevent memory leaks
            if (start + batch_size) % 100 == 0:
                await browser.close()
                browser = await p.chromium.launch()
        await browser.close()

Network Performance Comparison:

Despite having a 1 Gbps connection on both my local setup and the rented server, the server consistently utilized the full bandwidth, whereas the local connection (on WiFi/Ethernet) maxed out at 400-600 Mbps. The server's dedicated infrastructure proved far more reliable for sustained, high-performance web scraping.
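
For anyone wanting to reproduce the comparison, sustained throughput can be sampled from the OS network counters while the scraper runs. This sketch assumes psutil and an otherwise idle interface; the 10-second window is an arbitrary choice:

# Sketch: sample sustained receive throughput in Mbps
import time
import psutil

def sample_mbps(seconds=10):
    before = psutil.net_io_counters().bytes_recv
    time.sleep(seconds)
    after = psutil.net_io_counters().bytes_recv
    return (after - before) * 8 / seconds / 1_000_000

print(f"~{sample_mbps():.0f} Mbps sustained")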