Beyond Webdriver

Thoughts on improving website testing design

Steve Stagg


I was going to start this by misquoting Churchill:

Selenium is the worst tool for website testing, except for all those other tools that have been tried from time to time.

Except that Selenium (webdriver) isn’t really a website testing tool at all, it’s a browser automation tool, and the Selenium IDE has definitely been improved on by other projects. By this point, the joke had lost most of its fun, and I decided not to bother.

Instead, I’ll just make my point: Webdriver is very good at what it does, but the extra tooling needed to build a really powerful testing environment on top of it doesn’t quite exist yet. There are some good bits out there already, but nothing ticks all the boxes. I think we could put together something amazing without too much work.

From my experience, for web testing to be really useful, a number of specific requirements have to be met:


1. Code Independence

One of the things that Selenium offers is a Page Object Model. To an engineer, this may sound great; I’ve previously implemented something similar myself. The basic concept is to provide an automation API on top of each page of your app, allowing tests to interact with it. The idea becomes more seductive when you have non-developer testers around who have to test your app: system details and complexity can be hidden behind a simple API, and the testers can just call simple functions. This is flawed for a number of reasons.

Firstly, it limits the scope of what you’re testing. If an engineer has misunderstood the ‘point’ of a particular feature they implement, and that engineer also writes the automation API for it, then the test is just going to hide that fact behind another abstraction. High-level integration testing, which is what web testing usually is, should check not only that things work, but that they work in a sensible fashion.

In GMP environments, integration tests should (ideally) be written in isolation from the implementation, by a third party, and executed together at the end. If the test has to use a page-specific API to do so, then the isolation is undermined. The selenium POM example page gives a good example of this:

public HomePage loginAs(String username, String password) { … }

This seems simple, the contents of this function will type the username and password into some text fields, and click the login button (maybe). What happens if the login button is actually labelled to the user as ‘Initiate user login session’ however? An independent tester should immediately pick that up as a problem, but the person who just added that button may well not see what the issue is, and the test script that calls loginAs doesn’t even know about the login button so can’t check it.
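To make this concrete, here is a hedged sketch (every name here is invented, and the driver is a stand-in, not real webdriver) of how a page object swallows exactly the detail a test ought to be checking:

```python
class FakeDriver:
    """Stand-in for a real webdriver that just records actions."""
    def __init__(self):
        self.actions = []

    def type_into(self, element_id, text):
        self.actions.append(("type", element_id, text))

    def click(self, element_id):
        self.actions.append(("click", element_id))


class LoginPage:
    """Hypothetical page object: tests call login_as() and never see
    what the login button is actually labelled."""
    def __init__(self, driver):
        self.driver = driver

    def login_as(self, username, password):
        self.driver.type_into("username", username)
        self.driver.type_into("password", password)
        # If the visible label changes to "Initiate user login session",
        # this still passes -- the locator is an internal id.
        self.driver.click("btn_login")


driver = FakeDriver()
LoginPage(driver).login_as("bill", "Passw0rd")
print(driver.actions[-1])  # -> ('click', 'btn_login')
```

The test script only ever sees `login_as`; the button’s user-facing label never crosses the API boundary, so no test written against this API can object to it.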

There’s also the problem of re-use in testing. The traditional example of a perfectly logical, but totally useless test:

FIVE = 5
FOUR = 6

def enterprise_max(a, b):
    """Takes two integers and returns the one with the higher value."""
    return b if a is FOUR else a

def test_enterprise_max():
    assert enterprise_max(FIVE, FOUR) == FIVE

In this example, the code is almost definitely wrong, and yet the test happily passes and test coverage is 100%. This is an extreme example, but I’ve seen more subtle variants of this many times. When your POM is written as a module attached to the implementation code, then this sort of mistake is trivial to make.

The real-world cost of adding a POM is deceptively high, especially if you have a separate test team. It’s a new, separate formal API for automating your system, which means that every feature has to be automated, and exposed over it. If your app has a simple three-layer code model:

The POM is a formal API that must be implemented and maintained by (expensive) engineers

Then by adding a Page model, you’ve just increased the number of layers in your code by a third. And every change in functionality has to be kept up to date in every layer (in practice, layering code doesn’t often isolate functional changes to individual layers; it helps structure the logic).

The POM also has customers (the test engineers and other devs). If you have a separate testing team, these customers likely have no direct control over the API, so any new functionality requests have to be dealt with by the implementation team, and if a code change breaks a test script, then debugging is likely to require someone who wrote the API to help work out, and fix, the problem. Non-developer testers lose a lot of independence and control if their scripts become subordinate to a developer-defined API.

All of the above issues are resolvable, and can be worked around, usually by throwing resources at the problem, perhaps by adding some dedicated automation API engineers, but it’s costly to do, and takes a lot of time. I believe there are cheaper and much better alternatives.

2. Resumability

I prefer my unit test runners to start in less than a second, and to run a single unit test in around 0.1s (on average). This is important for producing high quality code fast, and allowing fearless development.

For browser-based testing, this isn’t possible or practical. Typically browser tests exercise your full application stack, and are by nature slow. Fixture setup can also be exceedingly slow (mitigation strategies for fixtures are a separate topic).

Typically, in my experience, a single selenium test script will run in the order of minutes, or tens of minutes, and I don’t see that as a problem. They are testing realistic workflows, and complex processes, so having long-running tests is not trivial to avoid. A Continuous Integration box can churn through these sorts of tests without too much trouble.

The problems arise when initially writing these tests, or debugging them. If you’re trying to finish off the end of a test script that takes 20 minutes to run, adding a single new operation is extremely expensive if you have to re-run the script from the start to try out each change. Likewise, investigating a failure that only happens 10 minutes into a test run can be painful in the extreme.

In order to work well with these larger scripts (fast fixture management only partly helps here), it’s important to be able to stop and interact with the test script at runtime as easily as possible.

Ideally, it should be possible, if an assertion fails halfway through a test case, to update the steps and continue the script from where it failed without losing any context.

3. Understandable Scripts

This is an easy one: all tests should be understandable, right? But for browser testing, this is arguably much more important than for other types of tests.

It’s acceptable to assume that if you’re unit testing network protocol handlers, then someone with some in-depth knowledge of networking might be needed to read the tests, but functional testing should be about the ‘if I click the red X, then the window goes away’ type of statement. In the formal world, this is proving that the overall vision and functional specs have been adequately satisfied.

There should be no reason why a business owner or non-technical user shouldn’t understand ‘if I click the red X, then the window goes away’, so it should be possible to describe the tests in similar language. The fact that the red X has id=’ui_toolbar_close_button_faded’ is not relevant and should be hidden from the person reading the test script. Selenium and BDD testing are natural allies here, and I’ll talk about Cucumber below.

The ideal is that the business user (manager, product owner) should be able to write such scripts themselves, but this is a far harder problem to solve than you might think. It may be possible, but I’ve yet to see it happen.

4. Test Semantics

Selenium tests should be about proving (for example) ‘a user can log-in to the site’ and not ‘the site can log-in a user’. Logically these are almost identical, but semantically they’re very different.

By the time you get to the stage of writing browser tests, you should be fairly confident that the system can handle logins, but it may well not be clear that the user can do this in a sane way.

What this means for tests is that class names, and element IDs should be avoided wherever possible. If a dev goes in and changes an element id from ‘login’ to ‘authenticate’ that shouldn’t affect the user at all, and it shouldn’t affect the tests. If that dev changed the button text from ‘Login’ to ‘Initiate user session’, then it’s entirely reasonable for both the user and the test to break horribly.

So not only do you want your tests to be readable by people without knowledge of system internals; the best tests will be runnable without knowing about the internals.

Sometimes this is unavoidable: it’s quite hard to find a test runner that can understand that toolbar means ‘the grey horizontal box near the top of the screen’. But often you’ll need to cater to users who may only have a fuzzy notion of what a menubar is anyway. Most people think ‘click save’, not ‘click save that’s in the toolbar’.

This fuzzy way of thinking is often alien to those working with software; we like absolutes and definites. But when you’re emulating a user who doesn’t care about all the beautiful well-defined UI models you’ve built, and just wants to save his damn file, it’s important to address these ambiguities. If the test script says ‘click save’ but there are three ‘save’ buttons on the page, it’s probably better to fix having three save buttons on the same page than to update your test to identify the button you know is the correct one.

As always, there are times when this approach can’t be made to work, and you have to fall back to finding something by ID, or CSS, or X-path, but in that situation, it should be possible (and easiest) to give that system identifier a human-understandable name:

menubar = thing.find_by_css_selector("#menubar")


I went into some detail on these requirements to try and outline why I think they are important. If you have any comments or questions, please leave them, or tweet me @stestagg.

Next I will look at some of the tools and kits currently being used, and how they fit in with the above requirements.

Existing Tools

Native Selenium (Webdriver)

Webdriver provides a lot, it covers the various servers for driving web browsers, and various client APIs. It’s great at automating browsers, but not so great at writing more complex tests:

  • Code independence — The client libraries do nothing to address this (rightfully in my opinion)
  • Resumability — As a raw driver, any resumability functionality is delegated to the client language, and while there are some ways to achieve this, they don’t really count in this context.
  • Understandable scripts — I’ve written many selenium scripts in Python, and none of them have been things I would want to share with anyone non-technical.
  • Test Semantics — The webdriver API actually has a few methods that I would call semantically relevant: things like click and back, but also find_element_by_link_text is a good one. The number of useful methods, however, is too small to write good tests with. It’s too easy to just revert to using find_element_by_css_selector or similar.


BDD tools

The various BDD testing tools out there allow you to define user-readable scripts with ease, and many people use them on top of webdriver to write beautiful test cases, but I feel there are still some things missing:

  • Code independence — This is hard to get right with BDD tools. Typically it’s up to the implementor to define the vocabulary used, and designing an abstract set of verbs is hard. More often, it’s used as a method for annotating tests with human-readable labels (i.e. one action description matches one test step). Lines such as:
When(/^I enter "(.*?)" into the search field$/) do |arg1|

are common, where the defined steps are really a glorified POM implementation. There are some attempts to rectify this, and the lettuce-webdriver project looks like an interesting one.

  • Resumability — I have yet to see a test runner that includes runtime context, and supports decent debugging and resumability. They may exist, but I haven’t used one.
  • Understandable scripts — This is where BDD tools really shine: the test output is clear and (usually) readable. Getting non-technical people to write tests using BDD tools is theoretically possible, but actually quite complex: the language is stricter than people may be used to, grammars tend to be sensitive to whitespace, etc., and unless you have a truly abstracted BDD language, you’re really just using a POM with all the associated communication issues.
  • Test Semantics — Good but not great: because steps are implemented behind the scenes in native app-specific code, they will naturally tend to interact with the app internals more than they should.

On the whole, I think BDD type languages have a part to play in the final picture, but I think more is needed.

Selenium IDE

Selenium is great, webdriver is greater(?), and I wish the Selenium IDE was never released. It was a bold project to build a nice non-technical click-and-drag interface for writing tests, and it helped to kick-start the browser testing movement. Unfortunately it’s not suitable at all for testing platforms under development, and it provides a big negative first experience for people who turn to the IDE first when implementing browser tests.

The basic problem is that an automated recording tool cannot apply semantic logic to a user’s actions. We’re decades away from a computer understanding what a user is trying to do without lots of prompting. What results is a test script that is tightly bound to a particular page structure, and that breaks on every run because of this coupling. There is no state tracking (you can’t really store dynamic variables and refer to them later), and the feature set is small.

To evaluate each requirement:

  • Code independence — Great, the IDE doesn’t know anything about particular implementations.
  • Resumability — Not bad: the script can be edited dynamically, and individual steps re-run, but the UI is flawed, and it doesn’t seem to let you easily resume a partially complete test run without running each step individually.
  • Understandable scripts — The test scripts can be output in a variety of formats, but the IDE never knows the things that are needed to be able to describe what is happening in a useful way. clickAndWait id=btnG will never mean anything to someone not involved in writing the system.
  • Test Semantics — The methods available in the IDE are the same as provided by the underlying Selenium API, so semantically they are poor, elements are referenced by IDs and paths.

In many ways, the Selenium IDE is a ‘get what you pay for’ technology (not in money terms). It looks great, you can install it and immediately record a test script, no learning needed! It’s only when it comes to maintaining these scripts, documenting them, and debugging issues that the real costs of using the IDE become apparent.


Watir

I haven’t used Watir before, and it looks like a great tool for this sort of thing. But from my brief overview of the examples and documentation, it seems to help with, but not concretely address, the requirements listed above.


A proposal

That somewhat lengthy introduction paints a picture of my understanding of the current selenium landscape. I think that with some careful design and a few lines of code, something amazing could emerge. Here are my thoughts so far, however disorganised they may seem.

The idea is to wrap the basic Webdriver API in an adapter that provides user-oriented actions, add a simple (execution) language-independent way of specifying a test script, and then develop some different test runners for different scenarios:

Basic system organisation

1. An action list

Each test script should be composed of a nested list of operations, in some abstract language. I’m proposing a JSON array.

Each item in the array describes a step in the script; each step is of a specific type, and contains contextual information. This may sound very similar to a BDD list, but the crucial points here are: the list is nested, allowing for grouping and contextual operations, and the available actions are based on a built-in vocabulary oriented around user-centric concepts. For example, to click on a standard disclaimer label on the page, and store the full text for later reference:

[
  {
    "action": "find",
    "startswith": "I agree to the terms",
    "and": [
      {"action": "click"},
      {"action": "remember_text", "as": "disclaimer text"}
    ]
  }
]

The advantages of this approach are many:

  1. This file could be hand-written without any additional tools (but the idea is to develop scripts with tools)
  2. It can be easily serialised and parsed by any language or runner
  3. It defines a common, well-defined API that can be publicly shared
  4. By designing the actions carefully, very complex scripts can be built-up with reusable components without compromising on the overall readability and structure.
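To illustrate the second point, here’s a minimal sketch of loading and walking the proposed format (the exact schema, with "action" naming a verb and "and" holding nested child steps, is my assumption):

```python
import json

# Parse a script in the proposed JSON action-list format.
script = json.loads("""
[
  {"action": "find",
   "startswith": "I agree to the terms",
   "and": [
     {"action": "click"},
     {"action": "remember_text", "as": "disclaimer text"}
   ]}
]
""")

def walk(steps, depth=0):
    """Yield (depth, action) pairs, recursing into nested "and" steps."""
    for step in steps:
        yield depth, step["action"]
        yield from walk(step.get("and", []), depth + 1)

for depth, action in walk(script):
    print("  " * depth + action)
```

Any language with a JSON parser can consume the same script, which is what makes the format runner-independent.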

A web-based designer

This is a crucial element of the system: by building a web-based tool for developing scripts, the system can easily be made interactive and cross-platform.

For example, editing, resuming, and running individual tests is simple in this environment; all communication between the editor and the system running the selenium driver is based on the JSON spec above, and can be very simple.

A 5-minute mockup of the interface:

Example of how the Web based editor may look, with live-updating browser screenshots, and ability to run steps on-demand

The UI can be managed purely in Javascript, with drag-and-drop, wizards, and helpers that let non-technical people build up simple scripts. With live-updating browser displays and easy run buttons, it’s possible to build up complex scripts one step at a time.

The UI can (largely) be auto-generated from a static definition of the action types and their arguments, with nesting being automatically added to actions that support that.

A command-line runner

Because the language is well-defined and self-contained, a script can be passed to a very simple command-line runner that can execute it against different browsers, without human interaction.

The output can be very easy to understand. Given the example script above, each step is readable in its own right; in a more complex example, a failure can be traced all the way back, and extra contextual comments can be added to outline the broader direction of the script. It would be easy to filter the output to include more or less detail based on who is interested. Alternatively, an X-unit compatible output is trivial to add.

If a test run fails, then it is simple to open the script up in the editor and reproduce the problem interactively. Adding in video capture and regular screenshots to the command-line runner is trivial.
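As a sketch of how such a runner might report progress (the wording, pass/fail layout, and action set here are assumptions, not the output of any real tool):

```python
def run(steps, execute, indent=0):
    """Execute each step via the supplied callback, printing a readable
    pass/fail line; nested steps are indented under their parent."""
    for step in steps:
        label = step["action"]
        if "startswith" in step:
            label += ' "' + step["startswith"] + '"'
        try:
            execute(step)
        except AssertionError as err:
            print("  " * indent + "FAIL " + label + ": " + str(err))
            return False
        print("  " * indent + "ok   " + label)
        if not run(step.get("and", []), execute, indent + 1):
            return False
    return True


script = [
    {"action": "find", "startswith": "I agree to the terms",
     "and": [{"action": "click"}]},
]
run(script, lambda step: None)
```

The `execute` callback is where a real implementation would talk to webdriver; keeping it separate means the same reporting loop works for dry runs, live runs, and the interactive editor.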

Cucumber adapter

It would also be possible to write an adapter that converts the built-in actions into BDD-style grammar, and translates between that and the underlying JSON format.
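As an illustration of the JSON-to-BDD direction, a tiny adapter might look like this (the sentence templates are invented for illustration, not taken from any BDD tool):

```python
# Map each action type to a human-readable sentence template.
TEMPLATES = {
    "find": 'When I find the element starting with "{startswith}"',
    "click": "And I click on it",
    "remember_text": 'And I remember its text as "{as}"',
}

def to_bdd(steps):
    """Render a nested action list as a flat list of BDD-style lines."""
    lines = []
    for step in steps:
        lines.append(TEMPLATES[step["action"]].format(**step))
        lines.extend(to_bdd(step.get("and", [])))
    return lines
```

Because the templates are data, a team could maintain them independently of both the runner and the scripts, keeping the vocabulary abstract.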

A User-oriented vocabulary

A fairly small collection of ‘things that people might want to do’ should cover most cases. This list has a lot of overlap with the selenium API, but the differences are crucial.

Finding elements

Elements can be found using different criteria: exact text match, text starts with, contains text.

In cases where that doesn’t suffice, then also filtering by tag name or attributes is possible.

The find action can also be customised to be case sensitive, to return a single element or all elements, and to include hidden or disabled elements (excluded by default).
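As a sketch of those matching rules (the element model here is a stand-in dict with the element’s visible text and a visibility flag, not a real webdriver element):

```python
def find(elements, text=None, startswith=None, contains=None,
         case_sensitive=False, include_hidden=False):
    """Filter fake elements by text criteria, mirroring the proposed
    find action's options."""
    def norm(s):
        return s if case_sensitive else s.lower()

    results = []
    for el in elements:
        if not include_hidden and not el.get("visible", True):
            continue  # hidden elements are excluded by default
        t = norm(el["text"])
        if text is not None and t != norm(text):
            continue
        if startswith is not None and not t.startswith(norm(startswith)):
            continue
        if contains is not None and norm(contains) not in t:
            continue
        results.append(el)
    return results
```

The point of expressing the options this way is that every criterion is about what the user can see, not about ids or markup.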

Interacting with elements

click — Click on the current element (only makes sense when nested below a find action)

click on <text> — Find any element (below the current one) that is clickable (link, button, span with click event, etc..) that has the matching text, and click on it.

type <text> — Alias for ‘send_keys’, types text into current element

fill in <label> with <value> — Find a <label> element with text matching <label>, click on it, then type <value> into the input with focus

select <option> — Find a selection option with text matching <option> and select it (typically used after finding a <select> element)

Identify <name> by css: <css> — stores the CSS selector under <name> allowing hard-to-find elements to be identified.

In <name> — any nested commands will be scoped to an element that matches the CSS above. I.e.:

Identify [toolbar] by css [#header #toolbar]
In [toolbar]:
Click [Save]
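A sketch of how Identify and In might fit together (the storage model is an assumption; a real runner would resolve the selector against the page):

```python
class Scope:
    """Registry of human names for CSS selectors, plus a stack of
    active "In" scopes that prefix any later lookups."""
    def __init__(self):
        self.names = {}   # human name -> css selector
        self.stack = []   # names of enclosing "In" scopes

    def identify(self, name, css):
        self.names[name] = css

    def enter(self, name):
        self.stack.append(name)

    def exit(self):
        self.stack.pop()

    def resolve(self, name):
        """Full selector for `name`, prefixed by any enclosing scopes."""
        return " ".join(self.names[n] for n in self.stack + [name])
```

Test writers only ever see the human names; the CSS stays in one place, so a markup change means updating one Identify line rather than every script.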

Structural actions

The above actions cover a lot of simple cases, but problems quickly become more complex and domain-specific. In that case, it may not be possible to avoid having more technical actions in play, but these can be carefully encapsulated to help maintain isolation between non-technical test writers and devs.

Firstly, all executions run in a context that can be used to store variables and state, allowing for more complex tracking of dynamic values. This context is maintained during interactive sessions, and the user can see stored (simple) values in the editor and change them. Variables are simple key->value pairs where the key can be any string.

Include <script> — Most people develop a set of test scripts and accompanying files into a suite. Where this happens, test scripts can include other scripts inline, allowing common settings and utility libraries to be imported.

Define <procedure> with summary: [Text with <inline> <placeholders> <indicated> for caller to fill in] and actions: [nested list of actions] — Defines a stored procedure (list of actions) which will be run with added variables passed in by other callers. In the UI, this may look like (text in [] would be entered by user into textbox, text in italics indicates variable placeholders):

Define [fast login] with summary: [Quickly login as <user> with <password>] and actions:
Browse to [root url]
Fill in [username] with [user]
Fill in [password] with [password]
Click on [login]

Then a script could call the procedure at any point, with the Web IDE providing placeholders and hints dynamically based on the summary:

Do [fast login]: Quickly login as [bill] with [Passw0rd]

The test runner would then dynamically expand that to:

Browse to [root url]
Fill in [username] with [bill]
Fill in [password] with [Passw0rd]
Click on [login]
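Under the hood, the substitution step can be sketched in a few lines (the storage format for procedures here is an assumption, using the fast login example above):

```python
import re

# Hypothetical store of defined procedures: a summary with <placeholders>
# and a body of action lines that reference the same placeholders.
procedures = {
    "fast login": {
        "summary": "Quickly login as <user> with <password>",
        "actions": [
            "Browse to [root url]",
            "Fill in [username] with [<user>]",
            "Fill in [password] with [<password>]",
            "Click on [login]",
        ],
    }
}

def expand(name, **values):
    """Return the procedure's action lines with each <placeholder>
    replaced by the caller-supplied value."""
    body = procedures[name]["actions"]
    return [re.sub(r"<(\w+)>", lambda m: values[m.group(1)], line)
            for line in body]
```

A missing placeholder value raises a KeyError, which is exactly the kind of error the editor could surface while the user is still typing the Do line.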

Finally, if all else fails, then a get-out-of-jail clause would allow (trusted) scripts to run inline code from a script directly:

Evaluate: <statement> — Evaluate code in the test-runner directly, with access to the variable context, underlying selenium driver, and other infrastructure.

Evaluate [assert int(context["current year"]) ==]

These more advanced usages would be present, but discouraged except in genuinely advanced cases. By allowing them to be encapsulated in human-readable BDD-style statements, the flexibility is retained without forcing readability to be sacrificed.

Comparing to the Requirements

  • Code independence — The basic verbs and language used here have no knowledge of individual implementations. Where such links are to be added, they are isolated within the test scripts. This makes coupling test code to implementations hard, which I consider a very good feature.
  • Resumability — Runtime context can be directly edited, steps can be run dynamically and on-demand from the editor, so developing and debugging scripts is trivial. The command-line runner may not have any of this functionality (except PDB?)
  • Understandable scripts — The system can convert any script into human readable output, with support for added plain-text context in the process. It should also be possible (defining the grammar may be hard) to write a full BDD style script definition language that compiles to the internal script format.
  • Test Semantics — This is down to the verbs used and defined, but by providing a built-in set of verbs that set the scene for a user-action oriented test script, it should be much simpler and more natural to write scripts that make sense to a user. This is about directionality more than anything.


None of the system I’ve described is particularly ground-breaking; most of the ideas above are spread around other projects out there, but bringing them all together would produce a really powerful, simple-to-use system.

I’ve put together a prototype which unfortunately I have lost the code for, but the concept is solid, and the prototype proved this system’s usefulness.

By writing up this spec, I’m hoping to promote some debate, and hone ideas before getting stuck into a full implementation that may be useful to a wider audience. If anyone wants to write this with/for me please get in touch, or just start!

If you’ve stuck with it this far, then thank you; you’re very patient, and here’s a small present: