My goal was to automate the process using a headless browser (that is, a browser that runs without a GUI, allowing you to navigate the web and interact with web pages from your terminal). In this case, I chose to use PhantomJS.
I had a mixed experience. Headless browsing is notoriously difficult to debug, and this case was no different: errors were tough to reproduce, results were inconsistent, and producing informative output was challenging.
Once I got past the initial difficulties, however, PhantomJS was an impressive tool—hats off to the creators, as always.
In this post, I'd like to describe some of the common "gotchas" that I've found associated with PhantomJS and walk through their solutions. (I call them "common" due to: 1. my own experience, and 2. finding similar questions/issues documented on StackOverflow.)
Note that PhantomJS is often used in tandem with CasperJS; it's possible that some of what follows, namely navigating webpages, is made easier with Casper. But I think these gotchas are still valid, even in the face of Casper.
Perhaps the first gotcha is that there are really two contexts in your PhantomJS program: firstly, the PhantomJS program itself; secondly, the webpage open in your headless browser, i.e., access to the DOM. (This is important for subsequent gotchas.)
A litmus test: if you're using jQuery, you're in the latter context.
To run code in the latter context, use page.evaluate (where page is the variable representing the current page "open" in your headless browser). page.evaluate takes as its argument a function to be executed in the context of the webpage. This is incredibly useful. Further, page.evaluate can return a result from the webpage back to your PhantomJS program. Say, for example, that you'd like to grab the text of an element on the current page with ID "foo":
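A minimal sketch of that call, assuming jQuery is available on the page and that page is a PhantomJS WebPage object. The helper name getFooText is hypothetical, used only to keep the example self-contained:

```javascript
// Hypothetical helper: `page` is assumed to be a PhantomJS WebPage object
// (e.g. created via require('webpage').create()) with a page already open.
function getFooText(page) {
  // The function passed to page.evaluate runs inside the webpage,
  // so `$` here refers to the page's own jQuery (assumed to be loaded).
  return page.evaluate(function () {
    return $('#foo').text();
  });
}
```

In a PhantomJS script, var foo = getFooText(page); would leave the element's text in foo.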
You could then use foo in your PhantomJS program, successfully extracting the value from the webpage. Note: return values are limited to simple objects (roughly, anything that can be serialized as JSON), rather than, say, functions, closures, or DOM nodes.
Actually, the code snippet above might not work as expected. I'll repeat it here for clarity:
The problem? The active webpage might not have jQuery loaded.
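One way to confirm this, sketched under the assumption that page is a PhantomJS WebPage object (the helper name pageHasJQuery is hypothetical), is to ask the page itself whether jQuery is defined:

```javascript
// Hypothetical helper: returns true if the open page has jQuery defined.
function pageHasJQuery(page) {
  return page.evaluate(function () {
    // Runs in the webpage. `typeof` is safe even if `jQuery` is undeclared.
    return typeof jQuery !== 'undefined';
  });
}
```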
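The fix is to load it into the page yourself, using either page.injectJs or page.includeJs. The difference between the two is quite nuanced. There's a good discussion here for those interested, but essentially, page.injectJs loads a local script file synchronously, pausing execution until the script is loaded, while page.includeJs loads the script from a URL like any other and calls a callback once it has finished loading. Note: only page.includeJs accepts a callback; page.injectJs returns true or false to indicate whether the script was injected.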
Here's the revised code (credit to the PhantomJS docs):
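A sketch along the lines of the includeJs pattern in the PhantomJS docs. The helper name getFooTextWithJQuery and the jqueryUrl parameter are assumptions for illustration; pass whichever hosted copy of jQuery you prefer:

```javascript
// Assumption: `page` is a PhantomJS WebPage with a page already open, and
// `jqueryUrl` points at a hosted copy of jQuery.
function getFooTextWithJQuery(page, jqueryUrl, callback) {
  // includeJs adds the script to the page and calls back once it has loaded.
  page.includeJs(jqueryUrl, function () {
    // Only now is it safe to rely on `$` inside the page.
    var foo = page.evaluate(function () {
      return $('#foo').text();
    });
    callback(foo);
  });
}
```

In a full script, the callback is also a natural place to call phantom.exit() once you have what you need.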
Similarly, I was often frustrated by the inability to display information logged by my webpage. Recall the split between the context of your PhantomJS program and the webpage open in your headless browser. Well, if you type console.log("Hello, World!") in your PhantomJS program, that will be printed to your terminal. If, however, your webpage tries to log the same message, it will pass by unnoticed! So if your webpage prints a bunch of traces to the console, you'll never see 'em.
Specifically, the following code does nothing because "Hello, World!" is printed in the context of the browser:
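A minimal sketch of such a call, assuming page is a PhantomJS WebPage object (the wrapper name logFromPage is hypothetical):

```javascript
// Assumption: `page` is a PhantomJS WebPage with a page already open.
function logFromPage(page) {
  page.evaluate(function () {
    // This runs inside the webpage, so the message goes to the page's
    // console, not to the terminal running PhantomJS.
    console.log('Hello, World!');
  });
}
```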
So, what if you want to log messages to your terminal from within your webpage? The trick is to use the page.onConsoleMessage event and echo any messages printed in the browser out to your terminal. Try this:
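A minimal sketch, again assuming page is a PhantomJS WebPage object (the wrapper name echoPageConsole is hypothetical):

```javascript
// Assumption: `page` is a PhantomJS WebPage object.
function echoPageConsole(page) {
  // PhantomJS invokes this handler for each console message from the page;
  // logging it here prints it to the terminal running the script.
  page.onConsoleMessage = function (msg) {
    console.log(msg);
  };
}
```

Set the handler before the page logs anything, so that no messages are missed.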
For more, see my StackOverflow answer.
PhantomJS beginners constantly ask how they can wait for something to appear on their webpage before acting. For example, maybe they want a banner to appear and then extract some text from it. Say "#foo" is now a div that loads a few seconds after the page has appeared. If you simply use the following code, you'll get unexpected results, as the banner may not be loaded at the time of query:
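A sketch of the naive query, followed by one possible workaround: poll the page until the element exists. The helper names readBannerNow, waitFor, and readBannerWhenReady, the timeout, and the polling interval are all assumptions for illustration (the polling idea is similar in spirit to the waitfor.js example that ships with PhantomJS), and jQuery is again assumed to be loaded on the page:

```javascript
// Naive version: runs immediately, so '#foo' may not exist yet and the
// result can be an empty string.
function readBannerNow(page) {
  return page.evaluate(function () {
    return $('#foo').text();
  });
}

// Hypothetical polling helper: calls testFn every intervalMs milliseconds
// until it returns true (then onReady) or timeoutMs elapses (then onTimeout).
function waitFor(testFn, onReady, onTimeout, timeoutMs, intervalMs) {
  var start = new Date().getTime();
  var timer = setInterval(function () {
    if (testFn()) {
      clearInterval(timer);
      onReady();
    } else if (new Date().getTime() - start >= timeoutMs) {
      clearInterval(timer);
      onTimeout();
    }
  }, intervalMs);
}

// Waits for '#foo' to appear in the page, then passes its text to callback.
function readBannerWhenReady(page, callback) {
  waitFor(
    function () {
      return page.evaluate(function () {
        return $('#foo').length > 0;
      });
    },
    function () { callback(null, readBannerNow(page)); },
    function () { callback(new Error('Timed out waiting for #foo')); },
    5000, // give up after 5 seconds (arbitrary)
    250   // check four times a second (arbitrary)
  );
}
```

Polling keeps the PhantomJS program responsive while it waits, at the cost of a small delay between the element appearing and the callback firing.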
There's a lot more to PhantomJS than I've managed to go through in this post. And I'm personally excited to check out CasperJS at some point in the future, which seems great as well (in particular, for unit testing). But hopefully the tips and gotchas described in this post can be helpful for beginners.
Posted on September 13, 2013.