Simulating a browser on Google App Engine

By : Uri

I want to use Selenium or Windmill inside Google App Engine in order to scrape a JavaScript-heavy website. I know that Windmill is written in Python and JavaScript.

Is this possible? If it is, how do I include the library?
If not, could you explain why and provide alternatives?



I searched a little more and saw that Scrapy is pure Python.
Will that work? Does it handle JavaScript?

Both Selenium and Windmill (which I think is now unmaintained) are controllers for a real browser. Usually they spawn a real browser (e.g. Firefox) as a subprocess and control it. I don't think you can do that in App Engine. The closest thing to a pure-code browser that I know of is HtmlUnit, but that's Java. As far as I know there is no equivalent for Python.

By : Keith

Any Python scraping library is unlikely to be able to interpret the JavaScript for you on App Engine, since it would probably require some kind of C extension (like a binding to SpiderMonkey or V8), which would be against the GAE sandboxing.

But if you were to venture over to the Java side, you might have more luck. I know that you can get Rhino running on App Engine, and with a little help from env.js you could emulate the DOM. A quick Google search shows a bunch of scraping tools for Java. It's just a matter of tying it all together.

HtmlUnit looks like it attempts to do just this, but it is unclear whether it is currently App Engine-friendly, as it appears to be threaded.

I believe both Selenium and Windmill only allow you to control a browser, not simulate one. They expect to run in a desktop environment and drive a real browser, which you can't do with App Engine.

You can use the URL Fetch API and an HTML parser like BeautifulSoup to handle page scraping from App Engine.
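
A rough sketch of that fetch-and-parse approach is below. To keep it runnable without third-party packages, it swaps in the standard library's `HTMLParser` where BeautifulSoup would go and plain `urllib` where the URL Fetch API would go. The names `LinkCollector`, `fetch_html`, `extract_links`, and `scrape_links` are made up for illustration. Note that nothing here executes JavaScript, so links written into the page by a script are never seen.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags found in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    # Plain-urllib stand-in (an assumption for this sketch). On the App Engine
    # Python runtime the equivalent would be urlfetch.fetch(url).content
    # from google.appengine.api.
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html):
    # Parse the markup as-is; <script> bodies are treated as raw text,
    # so anything a script would generate at runtime is not included.
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def scrape_links(url, fetch=fetch_html):
    # The fetch function is injectable so it can be replaced with a
    # urlfetch-based one on App Engine, or with a stub in tests.
    return extract_links(fetch(url))
```

Calling `scrape_links("http://example.com/")` would return the `href` values present in the served HTML. If the site builds its content with JavaScript after load, this will come back mostly empty, which is the limitation described in the answers above.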

