Scrapy. First response requires Selenium


I'm scraping a website that depends heavily on JavaScript. The main page, from which I need to extract the URLs that will be parsed, is rendered with JavaScript, so I have to modify start_requests. I'm looking for a way to connect start_requests with the LinkExtractor rule and with process_match.

import re
import json
import socket
import datetime

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MatchSpider(CrawlSpider):
    name = "match"
    allowed_domains = ["whoscored.com"]
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class, "match-report")]'), callback='process_match'),
    )

    def start_requests(self):
        url = ''
        browser = Browser(browser='Chrome')
        # should return a request with the html body from Selenium driver so that LinkExtractor rule can be applied

    def process_match(self, response):
        match_item = MatchItem()
        regex = re.compile("matchCentreData = \{.*?\};", re.S)
        match = re.search(regex, response.text).group()
        match = match.replace('matchCentreData =', '').replace(';', '')
        match_item['match'] = json.loads(match)
        match_item['url'] = response.url
        match_item['project'] = self.settings.get('BOT_NAME')
        match_item['spider'] = self.name
        match_item['server'] = socket.gethostname()
        match_item['date'] = datetime.datetime.now()
        yield match_item

A wrapper I'm using around Selenium:

class Browser:
    """selenium on steroids. allows you to create different types of browsers
    plus adds methods for safer calls"""

    def __init__(self, browser='Firefox'):
        """type: silent or not
        browser: chrome or firefox"""
        self.browser = browser
        self._start()

    def _start(self):
        """starts browser"""
        if self.browser == 'Chrome':
            chrome_options = webdriver.ChromeOptions()
            prefs = {"profile.managed_default_content_settings.images": 2}
            chrome_options.add_experimental_option("prefs", prefs)
            self.driver_ = webdriver.Chrome(executable_path='./libcommon/chromedriver', chrome_options=chrome_options)
        elif self.browser == 'Firefox':
            profile = webdriver.FirefoxProfile()
            profile.set_preference("general.useragent.override", random.choice(USER_AGENTS))
            profile.set_preference('permissions.default.image', 2)
            profile.set_preference('', 'false')
            profile.set_preference("webdriver.load.strategy", "unstable")
            self.driver_ = webdriver.Firefox(profile)
        elif self.browser == 'PhantomJS':
            self.driver_ = webdriver.PhantomJS()
            self.driver_.set_window_size(1120, 550)

    def close(self):
        """closes browser"""
        self.driver_.close()

    def return_when(self, condition, locator):
        """returns browser execution when condition is met"""
        for _ in range(5):
            with suppress(Exception):
                wait = WebDriverWait(self.driver_, timeout=100, poll_frequency=0.1)
                wait.until(condition(locator))
                self.driver_.execute_script("return window.stop")
                return True
        return False

    def __getattr__(self, name):
        """ruby-like method_missing: delegates methods not implemented here to
        the attribute that holds the selenium browser"""
        def _missing(*args, **kwargs):
            return getattr(self.driver_, name)(*args, **kwargs)
        return _missing


There are two problems I see after looking into this. Forgive any ignorance on my part; it's been a while since I was last in the Python/Scrapy world.

First: How do we get the HTML from Selenium?

According to the Selenium docs, the driver should have a page_source attribute containing the contents of the page.

browser = Browser(browser='Chrome')
html = browser.driver_.page_source

You may want to make this a function in your browser class to avoid accessing browser.driver_ from MatchSpider.

# class Browser
def page_source(self):
    return self.driver_.page_source 
# end class

html = browser.page_source()
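Once you have the rendered HTML string, pulling out the match-report hrefs is ordinary parsing. A minimal stdlib sketch (MatchLinkParser and the sample HTML are illustrative, not part of the question's code):

```python
from html.parser import HTMLParser

class MatchLinkParser(HTMLParser):
    """Collects href values from anchors whose class contains 'match-report'."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'match-report' in attrs.get('class', ''):
            self.links.append(attrs.get('href'))

parser = MatchLinkParser()
parser.feed('<a class="match-report" href="/Matches/1/Live">Report</a>')
print(parser.links)  # ['/Matches/1/Live']
```

In practice you would let the spider's LinkExtractor rule do this, which is what the middleware approach below makes possible; the sketch just shows that the rendered source is plain HTML from here on.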

Second: How do we override Scrapy's internal web requests?

It looks like Scrapy tries to decouple the behind-the-scenes web requests from the what-am-I-trying-to-parse functionality of each spider you write. start_requests() should "return an iterable with the first Requests to crawl", and make_requests_from_url(url) (which is called if you don't override start_requests()) takes "a URL and returns a Request object". When internally processing a spider, Scrapy creates a plethora of Request objects that are executed asynchronously, and each resulting Response is sent to parse(response)... the spider never actually does the processing from Request to Response.
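To make that contract concrete, here is a toy sketch of what the default start_requests() boils down to. The Request class below is a stand-in for scrapy.Request (reduced to the two fields the sketch needs), and SketchSpider and its URLs are illustrative, not part of your code:

```python
# Toy sketch of the default start_requests() contract. Request is a
# stand-in for scrapy.Request, reduced to the two fields the sketch needs.
class Request:
    def __init__(self, url, dont_filter=False):
        self.url = url
        self.dont_filter = dont_filter

class SketchSpider:
    """Roughly what Scrapy does when you don't override start_requests()."""
    start_urls = ['https://example.com/a', 'https://example.com/b']

    def start_requests(self):
        for url in self.start_urls:
            # start URLs bypass the duplicate filter
            yield Request(url, dont_filter=True)

urls = [r.url for r in SketchSpider().start_requests()]
print(urls)  # ['https://example.com/a', 'https://example.com/b']
```

Note that the spider only yields Request objects; downloading them and turning them into Responses happens elsewhere, which is why the download step is the right place to plug in Selenium.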

Long story short, this means you would need to create middleware for the Scrapy downloader to use Selenium. Then you can remove your overridden start_requests() method and add a start_urls attribute. Specifically, your SeleniumDownloaderMiddleware should override the process_request(request, spider) method to use the above Selenium code.
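A minimal sketch of such a middleware (the class name SeleniumDownloaderMiddleware is my own; the imports are deferred into the method so the class definition itself doesn't require Scrapy or Selenium to be installed):

```python
class SeleniumDownloaderMiddleware:
    """Fetches pages with a Selenium-driven browser instead of Scrapy's
    downloader, so LinkExtractor rules see the JavaScript-rendered DOM."""

    def __init__(self):
        self.driver = None  # lazily created on first request

    def process_request(self, request, spider):
        # Deferred imports: real classes from Scrapy and Selenium.
        from scrapy.http import HtmlResponse
        from selenium import webdriver

        if self.driver is None:
            self.driver = webdriver.Chrome()
        self.driver.get(request.url)
        # Returning a Response from process_request short-circuits the
        # normal download: Scrapy hands this response straight to the
        # spider, so the CrawlSpider rules run against the rendered HTML.
        return HtmlResponse(
            url=request.url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )
```

Enable it via DOWNLOADER_MIDDLEWARES in settings.py and list the JavaScript-dependent main page in start_urls; the spider's rules then apply unchanged.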

By: Sam

