How to alter request handling?

Question

Is it possible to alter request handling in Scrapy? For example, if I want a particular URL to be requested not by Scrapy's standard machinery but with Selenium, so that I can work with it through the Selenium driver methods.

How can I do that?



Answers

You do not have to alter the request; you can simply execute Selenium within your spider.

from selenium import webdriver

def parse(self, response):
    browser = webdriver.Firefox()
    for href in response.xpath("//a/@href").extract():
        # resolve relative links before handing them to Selenium
        browser.get(response.urljoin(href))
        # then do other stuff with the Selenium driver
    browser.quit()
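Where the comment says "then do other stuff", you could, for example, hand the Selenium-rendered page back to Scrapy's selectors so the rest of your extraction code stays the same. This is just a sketch; the h1 XPath is only an illustration:

from scrapy.selector import Selector

# Build a Scrapy selector from the page Selenium rendered, then
# extract with the usual XPath/CSS machinery.
sel = Selector(text=browser.page_source)
titles = sel.xpath("//h1/text()").extract()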

But in my experience, there are very few things you cannot do with Scrapy alone. Do you have a link that shows what kind of things you're looking for?



What you want to write is a downloader middleware component. You ask whether it's possible to "alter request handling"; the documentation introduces downloader middleware as a "system for globally altering Scrapy’s requests and responses", and if you read on, it is exactly what it sounds like.

The key method in a DownloaderMiddleware object is process_request. As the docs say:

This method is called for each request that goes through the download middleware.

process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest.

If it returns a Response object, Scrapy won’t bother calling any other process_request() or process_exception() methods, or the appropriate download function; it’ll return that response.

So, you just write a DownloaderMiddleware whose process_request calls Selenium, processes what it gets back, and returns it wrapped in a Response.
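A minimal sketch of such a middleware might look like the following. The SeleniumMiddleware name and the use_selenium meta flag are illustrative choices for this sketch, not anything Scrapy provides:

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    def __init__(self):
        self.browser = webdriver.Firefox()

    def process_request(self, request, spider):
        # Only divert requests the spider has explicitly flagged;
        # returning None lets everything else fall through to
        # Scrapy's normal downloader.
        if not request.meta.get('use_selenium'):
            return None
        self.browser.get(request.url)
        # Do whatever Selenium-level work you need here (clicks, waits, ...),
        # then hand the rendered page back to Scrapy as a Response.
        body = self.browser.page_source
        return HtmlResponse(self.browser.current_url, body=body,
                            encoding='utf-8', request=request)

Because process_request returns a Response here, Scrapy skips the built-in download for that request, exactly as the docs quoted above describe.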

The built-in HttpCacheMiddleware should demonstrate how to do this if it's not obvious.
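To wire it in, you would enable the class in your project's settings.py; the 'myproject.middlewares' path and the 543 priority below are assumptions for the sketch above, not values Scrapy mandates:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumMiddleware': 543,
}

Then any request you create with meta={'use_selenium': True} gets rendered by Selenium, while the rest keep going through Scrapy's standard downloader.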

By: abarnert


This video can help you solve your question :)
By: admin