79218363

Date: 2024-11-23 16:52:48
Score: 1.5
Natty:

I recently found the answer to my own question. The Washington Post, like doubtless many other websites, uses software that can detect whether a request comes from a browser or from something else. This webpage at ScrapFly describes the techniques used by Akamai, the package the Washington Post employs, to detect non-browser access attempts: Akamai Detection Techniques
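
You can see the problem even before trying any countermeasure: the TLS and HTTP fingerprint of stock cURL is exactly what Akamai looks for. Below is a minimal sketch of such a plain request; the feed URL is the one discussed in this answer, and the exact failure mode (an error status, a challenge page, or a long delay) is an assumption that may vary.

```php
<?php
// Minimal sketch: a stock PHP cURL request whose TLS/HTTP fingerprint
// Akamai can classify as non-browser traffic. The exact failure mode
// (403, challenge page, or a long delay) is an assumption and may vary.
$ch = curl_init('https://www.washingtonpost.com/arcio/rss/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
    CURLOPT_TIMEOUT        => 30,    // the Post is known to delay responses
]);
$body   = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
curl_close($ch);

if ($body === false || $status !== 200) {
    echo "Blocked or challenged: HTTP $status\n";  // a real browser succeeds here
}
```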

While the Washington Post does hinder scraping of their webpages (for instance, by introducing a 10-second delay before responding to a request), they do allow it. It was their own RSS feeds (for example, https://www.washingtonpost.com/arcio/rss/) that they began blocking for non-browsers on August 2, 2024. Browsers can still access these feeds, but the content is XML, with links appearing as plain text: not very useful when displayed on a browser page, and requiring additional steps to process into a useful form.
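
Once you do have the XML in hand, those additional steps are straightforward. Here is a minimal sketch using PHP's SimpleXML, assuming the feed follows the standard RSS 2.0 layout and that $xml holds the body fetched by a browser-like client (as shown further below):

```php
<?php
// Minimal sketch: turning the raw RSS XML into readable title/link pairs.
// Assumes a standard RSS 2.0 layout (channel -> item -> title/link) and
// that $xml holds the feed body fetched by a browser-like client.
$feed = simplexml_load_string($xml);
if ($feed === false) {
    exit("Could not parse the feed XML\n");
}
foreach ($feed->channel->item as $item) {
    printf("%s\n%s\n\n", (string) $item->title, (string) $item->link);
}
```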

The information supplied by the ScrapFly website is sufficient for brewing your own solution, but there is a ready-made alternative in curl-impersonate, at https://github.com/lwthiker/curl-impersonate. It can mimic the behavior of the four major browsers: Chrome, Firefox, Safari, and Microsoft Edge.
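
curl-impersonate ships wrapper scripts (named along the lines of curl_chrome116 or curl_ff117, varying by release) that launch its patched curl with a given browser's handshake and headers. A minimal way to use one from PHP is simply to shell out; the wrapper name below is an assumption, so substitute whichever script your build installs:

```php
<?php
// Minimal sketch: shelling out to one of curl-impersonate's wrapper scripts.
// The wrapper name (curl_chrome116) is an assumption; use whichever script
// your installed build provides, and make sure it is on PATH.
$url = 'https://www.washingtonpost.com/arcio/rss/';
$xml = shell_exec('curl_chrome116 -s ' . escapeshellarg($url));
if (!is_string($xml) || $xml === '') {
    exit("curl-impersonate request failed\n");
}
// $xml now holds the feed body, ready for the SimpleXML parsing shown above.
```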

I needed a PHP solution, so I additionally used kelvinzer0/curl-impersonate-php, which performs the necessary setup and invokes the curl-impersonate executables.
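
For anyone who would rather not pull in the package, the sketch below does by hand roughly what such a wrapper automates: invoke the impersonating executable, capture the body, and parse it. The wrapper-script name is again an assumption, and this is not the package's API; consult its README for that.

```php
<?php
// Dependency-free sketch of what a wrapper package automates: invoke a
// curl-impersonate wrapper script, capture the feed, and parse it.
// The script name (curl_chrome116) is an assumption; adjust to your build.
function fetchFeedItems(string $url): array
{
    $xml = shell_exec('curl_chrome116 -s ' . escapeshellarg($url));
    if (!is_string($xml) || $xml === '') {
        throw new RuntimeException("Request failed for $url");
    }
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        throw new RuntimeException("Unparseable feed from $url");
    }
    $items = [];
    foreach ($feed->channel->item as $item) {
        $items[] = ['title' => (string) $item->title, 'link' => (string) $item->link];
    }
    return $items;
}

print_r(fetchFeedItems('https://www.washingtonpost.com/arcio/rss/'));
```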

Reasons:
  • Blacklisted phrase (0.5): I need
  • Long answer (-1):
  • No code block (0.5):
  • Self-answer (0.5):
  • Low reputation (1):
Posted by: Percy