I am currently developing multiple scrapers that need to be maintained for the next couple of years. My typical approach to traversing large pages is to use Scrapy, a well-maintained Python framework for building scrapers.
Yet for some of my current targets, I am running into typical barriers such as Cloudflare and reCAPTCHA, even when using sensible defaults for user agent and concurrency and integrating scrapy-splash to render JavaScript on the pages.
I experimented a bit with two wrappers for JavaScript browser automation, puppeteer-extra and playwright-extra, as they offer well-working plugins for dealing with Cloudflare, reCAPTCHA, and an array of other challenges.
However, Scrapy offers strong benefits in terms of page traversal, with different strategies for interacting with pages and extracting data from different types of pages.
So my question is two-fold:
Is there a solution that integrates playwright-extra or puppeteer-extra as a plugin for Scrapy? I am aware of scrapy-playwright, but I specifically need to use the additional drop-in plugins for Playwright.
If nothing like this exists, is there a way to call a JavaScript function from Python to handle the requests?
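To make the second question concrete, here is a minimal sketch of the kind of bridge I have in mind. It assumes a hypothetical Node.js script, `fetch_page.js`, that loads the given URL with playwright-extra and prints a JSON object to stdout. The script name, the `fetch_via_node` helper, and the shape of the JSON are placeholders of mine, not an existing API:

```python
import json
import subprocess


def fetch_via_node(url, command=("node", "fetch_page.js"), timeout=60):
    """Run an external command with the URL as its last argument and
    parse whatever JSON it prints to stdout.

    Assumption: the (hypothetical) fetch_page.js script renders the page
    and prints a single JSON object, for example {"url": ..., "html": ...}.
    """
    result = subprocess.run(
        [*command, url],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output as str rather than bytes
        timeout=timeout,      # raises TimeoutExpired if the browser hangs
        check=True,           # raises CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)
```

The returned HTML could then presumably be wrapped in a Scrapy response inside a downloader middleware, but spawning one Node process per request is blocking and slow, so I am not sure this is a sensible direction.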
I would also be grateful for any feedback or tips for making this work that I might not have considered yet.
Thanks a bunch!