Simulating a browser on Google App Engine

By : Uri

I want to use Selenium or Windmill inside Google App Engine in order to scrape a JavaScript-heavy website. I know that Windmill is written in Python and JavaScript.

Is this possible? If it is, how do I install the library?
If not, could you explain why and provide alternatives?



I searched a little more and saw that Scrapy is pure Python.
Will that work? Does it handle JavaScript?

By : Uri

Both Selenium and Windmill (which I think is now unmaintained) are controllers for a real browser. Usually they spawn a real browser (e.g. Firefox) as a subprocess and control it. I don't think you can do that on App Engine. The closest thing to a pure-code browser that I know of is HtmlUnit, but that's Java. As far as I know there is no equivalent for Python.

By : Keith

Any Python "scraping" library is unlikely to be able to interpret the JavaScript for you on App Engine, since it would probably require some kind of C extension (like a binding to SpiderMonkey or V8), which would be against the GAE sandboxing.

But if you were to venture over to the Java side, you might have more luck. I know that you can get Rhino running on App Engine, and with a little help from env.js you could emulate the DOM. A quick Google search shows a bunch of scraping tools for Java. It's just a matter of tying it all together.

HtmlUnit looks like it attempts to do just this, but it is unclear whether it is currently App Engine-friendly, as it appears to use threads.

I believe both Selenium and Windmill only allow you to control a browser, not simulate one. They expect to run in a desktop environment and drive a real browser, which you can't do with App Engine.

You can use the URL Fetch API and an HTML parser like BeautifulSoup to handle page scraping from App Engine.
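
For what it's worth, here is a rough sketch of that fetch-and-parse approach. It assumes Python 3, and it swaps in the standard library's html.parser for BeautifulSoup so it runs without extra packages. The names LinkCollector, extract_links and fetch_html are made up for illustration. Note that this only sees the static markup: anything the page builds with JavaScript will not show up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags found in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href of every <a> tag present in the raw markup."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch_html(url):
    """Hypothetical helper: use URL Fetch on App Engine, urllib elsewhere."""
    try:
        # Assumption: this module is only importable inside App Engine.
        from google.appengine.api import urlfetch
    except ImportError:
        from urllib.request import urlopen
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    result = urlfetch.fetch(url, deadline=10)
    return result.content.decode("utf-8", errors="replace")


# The link written by the script is never seen, because no JavaScript runs.
sample = (
    '<html><body><a href="/a">A</a>'
    '<a href="http://example.com/b">B</a>'
    "<script>document.write('<a href=\"/js\">JS');</script>"
    "</body></html>"
)
links = extract_links(sample)
print(links)
```

On a real page you would call extract_links(fetch_html(url)). If the data you need is loaded by JavaScript, it is often easier to find the request the page makes for it and fetch that URL directly.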
