
Intro to knysa: Async-Await Style PhantomJS Scripting

Posted by Bo Zou on Jul 21, 2016. Estimated reading time: 11 minutes

Key takeaways

  • knysa enables async-await style asynchronous programming for PhantomJS
  • knysa eliminates the need for currying
  • knysa supports try/catch/finally
  • knysa has better support for browser-side AJAX call
  • knysa program flows naturally

 

PhantomJS is a modern headless (no GUI) browser scriptable with a JavaScript API.  It’s perfect for page automation and testing.  The JavaScript API is brilliant and offers many advantages, but it also suffers from the “callback hell” problem common to JavaScript, i.e. deeply nested callbacks.

There are many libraries and frameworks to help deal with this problem.  For PhantomJS, CasperJS is one such solution that is very popular, but it only mitigates the problem and does not solve it.  knysa, on the other hand, solves the problem elegantly.  Like CasperJS, it allows you to put steps in sequence. Unlike CasperJS, it does not add a lot of boilerplate code (e.g. casper.then(), etc.).  

More importantly, it allows you to use code constructs like if/else/while/break/try/catch/finally to control program flow naturally.

Let’s use an example to illustrate the nesting problem and the idea of knysa.  The following is a CasperJS script to do a Google search of 'CasperJS' and then check if each linked page contains the keyword 'CasperJS':

  • (line 9) open the Google website and wait for the site to be loaded
  • (line 11) after the site is loaded, fill and submit the search form and wait for the response
  • (line 13) process the response:
    • (line 16, 17) visit each link in the response and wait for the linked page to load
    • (line 18-23) when the linked page is loaded, check if the word ‘CasperJS’ exists

The above description is simple and straightforward, but CasperJS’s nesting makes the code flow much more complicated.

 1 var links = [];
 2 var casper = require('casper').create();
 3 function getLinks() {
 4     var links = document.querySelectorAll('h3.r a');
 5     return Array.prototype.map.call(links, function(e) {
 6         return e.getAttribute('href');
 7     });
 8 }
 9 casper.start('http://google.com/', function() {
10     // search for 'CasperJS' from google form
11     this.fill('form[action="/search"]', { q: 'CasperJS' }, true);
12 });
13 casper.then(function() {
14     // aggregate results for the 'CasperJS' search
15     links = this.evaluate(getLinks);
16     for (var i = 0; i < links.length; i++) {
17         casper.thenOpen(links[i]);
18         casper.then(function() {
19             var isFound = this.evaluate(function() {
20                 return document.querySelector('html').textContent.indexOf('CasperJS') >= 0;
21             });
22             console.log('CasperJS is found on ' + links[i] + ':' + isFound);
23         });
24     }
25 });
26 casper.run();
 

As we can see, casper.then() on line 18 is nested inside another casper.then() on line 13.  Such nesting obscures the programming logic and makes the program flow tangled. When the script is executed, instead of only flowing forward, the program flow has 3 intermingled phases:

  1. phase #1 (line 9,13,26) builds a list of steps (anonymous functions) using casper.start() (line 9) and casper.then() (line 13).  The list of steps is then executed by invoking casper.run() (line 26).
  2. phase #2 (line 11,15,16,17,18): as the list of steps is executed, the code in each step (the anonymous functions) is executed.
  3. phase #3 (line 19,20,21,22): more steps are added to the original list of steps and executed.

Hence each level of nesting adds one phase of execution.
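
The phases can be reproduced outside CasperJS with a few lines of plain JavaScript.  The sketch below is a hypothetical miniature of a step queue: the names then, run, queue and order are made up for illustration and are not CasperJS's API.  It only shows why registration order and execution order drift apart once a step registers further steps.

```javascript
// Hypothetical miniature of a step queue; not CasperJS's real implementation.
var queue = [];   // steps registered so far
var order = [];   // records what actually happened, in order

function then(step) {
    queue.push(step);
}

function run() {
    // queue.length is re-read on every pass, so steps added while
    // running are picked up after the ones already registered.
    for (var k = 0; k < queue.length; k++) {
        queue[k]();
    }
}

then(function () {
    order.push('step 1 body');
});
order.push('step 1 registered');

then(function () {
    order.push('step 2 body');
    then(function () {
        order.push('nested step body');
    });
});
order.push('step 2 registered');

run();
console.log(order.join('\n'));
```

Both 'registered' entries are recorded first (phase #1), the step bodies follow (phase #2), and the nested step runs last (phase #3).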

Because of the intermingled phases, the order in which each line of code appears in the script no longer matches the script execution order; for example, line 13 is executed before line 11.  This makes the script harder to read, reason about and thus program.  Another problem is that it is hard to do “if/else” logic or handle any exception.  And a third problem: links[i] on line 22 always prints 'undefined'!

Why?

Because the variable 'i' has already been incremented to links.length in phase #2 by the time line 22 is executed in phase #3.  To fix this problem, we have to resort to currying (18a/18b and 22a). We use a variable ‘link’ to hold the value of links[i] (line 18a) and partly evaluate an anonymous function to return another anonymous function (line 18b):

18         casper.then(function() {
18a            var link = links[i];
18b            return function() {
19                 var isFound = this.evaluate(function() {
20                     return document.querySelector('html').textContent.indexOf('CasperJS') >= 0;
21                 });
22                 console.log('CasperJS is found on ' + link + ':' + isFound);
22a            }
23         }());
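
The effect is easy to reproduce without CasperJS.  In the sketch below, steps, links and seen are made-up names, and the steps array merely stands in for CasperJS's step list.  The point is only that a callback stored inside a loop reads 'i' when it runs, not when it is created, while the immediately invoked wrapper copies the value right away.

```javascript
// Hypothetical stand-in for a step list: callbacks are stored now, run later.
var steps = [];
var links = ['a.html', 'b.html'];
var seen = [];

for (var i = 0; i < links.length; i++) {
    // Broken: the callback looks up links[i] only when it finally runs.
    steps.push(function () {
        seen.push(links[i]);
    });
    // Fixed: the outer function runs immediately and copies links[i] into 'link'.
    steps.push(function () {
        var link = links[i];
        return function () {
            seen.push(link);
        };
    }());
}

// Run the stored steps after the loop has finished (i === links.length).
steps.forEach(function (step) {
    step();
});
console.log(seen);
```

The broken callbacks record undefined because links[links.length] does not exist; the wrapped ones record the link that was current when they were created.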
 

We see that, with currying, ‘link’ now has the correct value, but currying added more nesting and  boilerplate code.  Too bad. Can we do better?

The answer is YES.

In fact, with knysa, we can do much better: we can completely eliminate nesting, along with currying.  The script will be much cleaner and easier to read and the program flows naturally.

Below is the equivalent knysa script for the same problem (note that we introduce an implicit variable ‘kflow’ and some functions on ‘kflow’; some of these functions have the prefix ‘knysa_’, which we will explain later):

  • (line 9) open the Google site and wait for it to load
  • (line 10) after the site is loaded, fill out and submit the search form and wait for the response to come back
  • (line 13) process the response:
    • (line 14) visit each link in the response and wait for the linked page to load
    • (line 15-18) when the linked page is loaded, check if the word ‘CasperJS’ exists

Nesting and currying are gone!  Now the order in which each line of code is executed matches the order in which it appears in the script.  This order also matches the above description.  There is only 1 phase of code flow.  The code is much easier to read, reason about and thus program.

 
 1 var links = [];
 2 var i, num, isFound;
 3 function getLinks() {
 4     var links = document.querySelectorAll('h3.r a');
 5     return Array.prototype.map.call(links, function(e) {
 6         return e.getAttribute('href');
 7     });
 8 }
 9 kflow.knysa_open('http://google.com/');
10 kflow.knysa_fill('form[action="/search"]', { q: 'CasperJS' });
11 links = kflow.evaluate(getLinks);
12 i = -1;
13 while (++i < links.length) {
14     kflow.knysa_open(links[i]);
15     isFound = kflow.evaluate(function() {
16         return document.querySelector('html').textContent.indexOf('CasperJS') >= 0;
17     });
18     console.log('CasperJS is found on ' + links[i] + ':' + isFound);
19 }
20 phantom.exit();

What is the magic?  The magic lies in the fact that although each function call prefixed with 'knysa_' (as in lines 9, 10, and 14) is asynchronous (async), knysa waits (await) for the underlying asynchronous call to finish before continuing to the next line.

knysa treats each script as a flow and assigns it an ID when executing it.  The flow object is exposed through the implicit variable 'kflow'.  The flow ID can be obtained by kflow.getId().

kflow provides a few async-await style browser navigation functions, i.e. knysa_open, knysa_fill, knysa_click and knysa_evaluate.  knysa_open, knysa_fill and knysa_click will wait for the new web page to finish loading:

      a. knysa_open(url): navigate to a page.
      b. knysa_click(selector): click to trigger navigation
      c. knysa_fill(formSelector, values): fill and submit form

knysa_evaluate(func, kflowId[, arg0, arg1, ...]): just like PhantomJS page.evaluate(), can execute arbitrary JavaScript including AJAX on the browser side (sandboxed).  Compared to PhantomJS page.evaluate(), knysa_evaluate has improved support for AJAX.  It suspends script execution.  To resume execution, the code inside 'func' (usually in AJAX success/failure callbacks) must call 'window.callPhantom(data)' with 'data.kflowId' set to 'kflowId'.  Here is an example taken from opl.kns: AJAX is used to renew a book and the script execution is resumed only when the renewal response is received:

oneRenewResult = kflow.knysa_evaluate(renew, kflow.getId(), ...);

 

and the sandboxed function 'renew' has the following lines:

 1    $.ajax({
 2         dataType: 'json',
 3         inline_messaging: 1,
 4         url: form.attr("action"),
 5         data: form.serialize(),
 6         success: function(e) {
 7             console.log("success: " + JSON.stringify(e));
 8             window.callPhantom({kflowId : kflowId, status: 'success', data: e});
 9         },
10         error: function(e) {
11             console.log("failure: " + JSON.stringify(e));
12             window.callPhantom({kflowId : kflowId, status: 'failure', data: e});
13         }
14    });

Execution is resumed only after the AJAX call has finished.  Depending on the result of the AJAX call, oneRenewResult is set to a different value:

   - on AJAX success, line 8 resumes execution and oneRenewResult is set to:

{kflowId : kflowId, status: 'success', data: e}

   - on AJAX failure, line 12 resumes execution and oneRenewResult is set to:

{kflowId : kflowId, status: 'failure', data: e}

Note: it is the whole data passed into window.callPhantom() that is set as the return value of knysa_evaluate().
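
To make the hand-off concrete, here is a minimal sketch of how a PhantomJS-side handler could route the object passed to window.callPhantom() back to the waiting flow.  PhantomJS delivers that object to page.onCallback; everything else (suspendedFlows, makeFlow, and routing on data.kflowId) is a hypothetical reconstruction from the description above, not knysa's actual code.  The page-side call is simulated so the sketch runs without PhantomJS.

```javascript
// Hypothetical registry of suspended flows, keyed by flow ID.
var suspendedFlows = {};

// Hypothetical flow object: resume() stores the value the script will see.
function makeFlow(id) {
    var flow = { id: id, result: undefined };
    flow.resume = function (data) {
        flow.result = data;
        delete suspendedFlows[id];
    };
    return flow;
}

// Same shape as a PhantomJS page.onCallback handler: it receives whatever
// the page passed to window.callPhantom().  Looking up the flow by
// data.kflowId is an assumption based on the article's description.
function onCallback(data) {
    var flow = suspendedFlows[data.kflowId];
    if (flow) {
        flow.resume(data);
    }
}

var renewFlow = makeFlow(7);
suspendedFlows[renewFlow.id] = renewFlow;

// Simulates the page-side call:
// window.callPhantom({kflowId: 7, status: 'success', data: {...}})
onCallback({ kflowId: 7, status: 'success', data: { renewed: true } });
console.log(JSON.stringify(renewFlow.result));
```

The whole object, including kflowId and status, ends up as the result, which matches the note above.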

kflow.sleep(milliseconds) is another async-await function but it is handled specially by knysa.

kflow also provides a few regular (non async-await) functions.  They are taken directly from the CasperJS API:

- open(url)
- click(selector)
- fill(selector)
- getHTML(selector, outer)
- exists(selector)
- download(url, path, method, data)
- getElementAttr(selector, attrName)
- render(path)
- evaluate(func[, arg0, arg1...])

Implement your own async-await style function
To do so, prefix your function name with 'knysa_'.  This informs knysa that it is an async-await style function.  When such a function is called, script execution is suspended.  It is the responsibility of your new async-await function to resume the execution by calling kflow.resume(data).  When execution is resumed, the 'data' passed into kflow.resume will become the return value of the async-await function call.  Here is an example taken from resume.kns: it sleeps 1 second before returning the input ‘num’ multiplied by 100:

 1 function knysa_f1(kflow, num) {
 2     setTimeout(function() {
 3         kflow.resume(num * 100);
 4     }, 1000);
 5     // return num + 10;
 6 }

The value the caller receives from this function is the data passed into kflow.resume(), i.e. num * 100.

Important Note 1: In such async-await functions, regular return values are ignored. For example, even if line 5 is uncommented, the result of 'return num + 10' is simply thrown away.

Important Note 2: The call of an async-await style function must be a statement by itself.  Either:

     knysa_my_func(...);
or
     ret = knysa_my_func(...);

Calls on an object are also supported, i.e.

     myObj.knysa_my_func(...);
or
     ret = myObj.knysa_my_func(...);

The following are not supported:

   1. if (knysa_my_func(...)) ...
      instead please rewrite it as:
          val = knysa_my_func(...);
          if (val) ...
   2. var1 = abc * knysa_my_func(...)
      instead please rewrite it as:
          val = knysa_my_func(...);
          var1 = abc * val;

Here is an example that calls knysa_f1, defined earlier, and assigns the returned value to a variable:

    ret = knysa_f1(5);

When this line is executed, ret will be set to 500 after a 1-second delay.
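
The difference between the two values can be seen with a stub in place of the real flow object.  In the sketch below, makeStubFlow and knysa_times100 are hypothetical names; the stub only records what is passed to resume(), and the timer from knysa_f1 is left out so the example finishes immediately.

```javascript
// Hypothetical stand-in for knysa's flow object: it only records the data
// that the script would receive as the value of the suspended call.
function makeStubFlow() {
    var stub = { resumed: false, value: undefined };
    stub.resume = function (data) {
        stub.resumed = true;
        stub.value = data;
    };
    return stub;
}

// Same shape as knysa_f1 in the article, minus the 1-second timer.
function knysa_times100(kflow, num) {
    kflow.resume(num * 100);
    return num + 10;   // a plain return value, which knysa is said to discard
}

var stub = makeStubFlow();
var plainReturn = knysa_times100(stub, 5);
console.log('resume value: ' + stub.value + ', plain return: ' + plainReturn);
```

In plain JavaScript the caller sees the plain return value (15); under knysa, according to the notes above, the script would see the value handed to resume() (500) instead.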

Exception handling
knysa's exception handling mechanism is surprisingly simple: plain old try/catch/finally constructs.  Such facilities are notably missing in CasperJS.  Example: try.kns.

'catch' example: the following code renders a debugging image upon any exception

    var err;  // variables must be declared at beginning
    ...
    try {
        ...
    } catch (err) {
        kflow.render(image_path);
        console.log(err.stack);
    }

'finally' example: the following code guarantees log-out even on exception:

    // fill and submit form to log in to a web site
    kflow.knysa_fill(...);
    try {
       ...
    } finally {
       // go to the sign out link to log out
       kflow.knysa_open(logout_link);
    }
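
The two patterns combine in the usual JavaScript way.  The sketch below runs as plain JavaScript with a stub in place of 'kflow': stubFlow, the selector and the paths are made up for illustration, and the stub records calls instead of driving a browser.

```javascript
var calls = [];

// Hypothetical stub: records each call instead of touching a page.
var stubFlow = {
    knysa_fill: function (selector) { calls.push('log in via ' + selector); },
    knysa_open: function (url) { calls.push('open ' + url); },
    render: function (path) { calls.push('render ' + path); }
};

var err;   // declared up front, as knysa requires

stubFlow.knysa_fill('#login-form');
try {
    // Simulates a failure somewhere in the middle of the session.
    throw new Error('element not found');
} catch (err) {
    stubFlow.render('debug.png');
} finally {
    // Runs whether or not the body threw, so the session is always closed.
    stubFlow.knysa_open('/logout');
}
console.log(calls.join('\n'));
```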

Caveats:

  1. 'else if' construct is NOT supported, please use nested 'if/else' instead
  2. 'for' loop body may NOT have async-await function calls or 'break' statement, please use 'while' loop instead
  3. all variables must be declared at the beginning, including 'err' in catch(err).
  4. the implicit variable 'kflow' can NOT be used in variable declaration.
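
As a sketch of what the first three caveats look like in practice, the snippet below is written in the restricted style: variables declared first, 'while' instead of 'for', and nested 'if/else' instead of 'else if'.  It is plain JavaScript with no 'knysa_' calls so that it runs anywhere; that knysa accepts it unchanged is an assumption, and the variable names are made up.

```javascript
// Caveat 3: everything is declared at the beginning.
var i, label, labels;

labels = [];
i = -1;
while (++i < 5) {                // caveat 2: 'while' rather than 'for'
    if (i === 3) {
        break;                   // 'break' is allowed inside 'while'
    }
    if (i === 0) {
        label = 'zero';
    } else {                     // caveat 1: nested if/else, no 'else if'
        if (i === 1) {
            label = 'one';
        } else {
            label = 'many';
        }
    }
    labels.push(label);
}
console.log(labels);
```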

Inner working details for the curious:
A knysa script is first transformed into JavaScript before execution.  The converted script is a flow of many steps; each step is a function.  Each function name is encoded with flow control information:

  • Each function is numbered (to determine the execution order)
  • Suffix ‘_async’ means script execution will be suspended.  Script execution will resume when a certain condition is met, e.g. a page response or an AJAX response is received.  Each async-await statement is converted into such a function.
  • Suffix ‘_while’ with no ‘_endwhile_’ in the middle of the function name signals the beginning of a while loop
  • Suffix ‘_while’ with ‘_endwhile_’ in the middle of the function name signals the end of a while loop
  • Although not shown here, 'if/else/try/catch/finally/break' statements are transformed similarly to ‘while’

The following is the transformed JavaScript for the Google search knysa script presented earlier:

var knysa = require("./knysa.js");
function knycon_search_casperjs_10001() {
    var links = [];
    var i, num, isFound;
    function getLinks() {
        var links = document.querySelectorAll("h3.r a");
        return Array.prototype.map.call(links, function(e) {
            return e.getAttribute("href");
        });
    }
    this.n50002_async = function(kflow) {
        kflow.knysa_open("http://google.com/");
    }
    this.n50003_async = function(kflow) {
        kflow.knysa_fill('form[action="/search"]', {
            q: "CasperJS"
        });
    }
    this.n50004 = function(kflow) {
        links = kflow.evaluate(getLinks);
        i = -1;
    }
    this.n50005_while = function(kflow) {
        return ++i < links.length;
    };
    this.n50006_async = function(kflow) {
        kflow.knysa_open(links[i]);
    }
    this.n50007 = function(kflow) {
        isFound = kflow.evaluate(function() {
            return document.querySelector("html").textContent.indexOf("CasperJS") >= 0;
        });
        console.log("CasperJS is found on " + links[i] + ":" + isFound);
    }
    this.n50008_endwhile_n50005_while = function() {};
    this.n50009 = function(kflow) {
        phantom.exit();
    }
}

knysa.knysa_exec(new knycon_search_casperjs_10001());
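
To show how such names can drive execution, here is a toy runner for step objects shaped like the one above.  It is a sketch under stated assumptions: runFlow, DemoFlow, log and pendingEvents are hypothetical, only the naming convention described in this section is modelled, and knysa's real runner (knysa_exec) may work differently.  The pendingEvents array stands in for page-load events so the example runs without PhantomJS.

```javascript
// Toy runner: executes numbered steps in order, suspends on '_async',
// and loops between '_while' and its matching '_endwhile_' marker.
function runFlow(flow) {
    var names = Object.keys(flow).sort(function (a, b) {
        return parseInt(a.slice(1), 10) - parseInt(b.slice(1), 10);
    });
    var pc = 0;   // index of the next step to run
    var kflow = {
        resume: function () { step(); }
    };

    function indexOfEnd(whileName) {
        var marker = '_endwhile_' + whileName;
        for (var k = 0; k < names.length; k++) {
            if (names[k].indexOf(marker) > 0) { return k; }
        }
        return names.length;
    }

    function step() {
        while (pc < names.length) {
            var name = names[pc++];
            var end = /_endwhile_(n\d+_while)$/.exec(name);
            if (end) {
                pc = names.indexOf(end[1]);       // jump back to the loop condition
            } else if (/_while$/.test(name)) {
                if (!flow[name](kflow)) {
                    pc = indexOfEnd(name) + 1;    // condition false: leave the loop
                }
            } else if (/_async$/.test(name)) {
                flow[name](kflow);
                return;                           // suspended until kflow.resume()
            } else {
                flow[name](kflow);
            }
        }
    }
    step();
}

var log = [];
var pendingEvents = [];   // stands in for page-load notifications

function DemoFlow() {
    var i;
    this.n10001 = function () { i = -1; };
    this.n10002_while = function () { return ++i < 2; };
    this.n10003_async = function (kflow) {
        log.push('open page ' + i);
        pendingEvents.push(function () { kflow.resume(); });
    };
    this.n10004 = function () { log.push('check page ' + i); };
    this.n10005_endwhile_n10002_while = function () {};
    this.n10006 = function () { log.push('done'); };
}

runFlow(new DemoFlow());
// Deliver the simulated events one at a time, as an event loop would.
while (pendingEvents.length > 0) {
    pendingEvents.shift()();
}
console.log(log.join('\n'));
```

Each 'open' entry is followed by its 'check' entry only after the simulated event has been delivered, which is the suspend-and-resume behaviour the '_async' suffix stands for.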

Note 1: The above converted JavaScript is only for your reference, as it is an implementation detail.  The knysa implementation may change. For example, a future version might use Promises.  And of course, when PhantomJS fully supports ES6 generators or async/await (ES2017), knysa may not be needed.

Note 2: Although knysa eliminates the need for callbacks to control program flow in a knysa script, knysa itself does use the callback mechanism of PhantomJS, i.e. page.onCallback() and page.onLoadFinished().

Action Time
Now that you have seen how easy and natural it is to program with knysa for PhantomJS, why not try it out yourself?  knysa is hosted on GitHub.  You can start with the examples.  I would also like to hear your feedback. Since knysa is new and has lots of room to improve, you are warmly welcome to contribute.  There are different ways to contribute:

  1. work on an improvement ticket.
  2. provide more example scripts, big or small
  3. or even better, share your knysa scripts that help you do your daily chores so that others can save time and be more productive too.

Acknowledgement:

  1. uglifyjs1 is used to parse the knysa script and generate the corresponding JavaScript
  2. many ‘kflow’ functions were taken directly from CasperJS

About the Author

Bo Zou is a seasoned software developer.  He has experimented with many tools to do web automation, including Perl, HttpUnit, HtmlUnit, Watij etc.  Recently he has been focusing on PhantomJS and Android.

Excellent writeup by GluedTo TheScreen

I cannot wait to try this and must say, the writing quality of this article is excellent. In particular, the grouping of additional info and conversational style. Nice work. Thank you.
