
Robotic Testing of Mobile Apps for Truly Black-Box Automation

Key Takeaways

  • A prototype of Axiz has been built entirely from commodity hardware components, using 3D vision-based self-calibration and a four-axis Arduino-based robotic arm. An external 1080p CMOS camera monitors the test execution.
  • Robotic testing can address the profound shift from desktop to mobile computation and makes it easier to test features such as UI gestures and sensor input.
  • Frameworks such as Appium, Robotium, and UIAutomator can partly support automatic test execution. However, they rely on human test script design, thereby creating a bottleneck.
  • Handheld devices require rethinking what black-box testing really means, including factors such as increased realism, device independence, cost-benefit ratio, and reduced reliance on assumptions.
  • Axiz accurately executed each test event specified in the generated robotic-test cases and passed the required oracle checkpoints.


This article first appeared in IEEE Software magazine. IEEE Software offers solid, peer-reviewed information about today's strategic technology issues. To meet the challenges of running reliable, flexible enterprises, IT managers and technical leads rely on IT Pro for state-of-the-art solutions.


Robots are widely used for many repetitive tasks. Why not software testing? Robotic testing could give testers a new form of testing that’s inherently more black-box than anything witnessed previously. Toward that end, we developed Axiz, a robotic-test generator for mobile apps. Here, we compare our approach with simulation-based test automation, describe scenarios in which robotic testing is beneficial (or even essential), and tell how we applied Axiz to the popular Google Calculator app.

Why Do Robotic Testing?

Robotic testing can address the profound shift1,2 from desktop to mobile computation. This trend is projected to gather steam,3 accelerated by a concomitant shift from desktop to mobile-device ownership. Automated software testing is needed more than ever in this emerging mobile world. However, we might need to rethink some of the principles of software testing.

Mobile devices enable rich user interaction inputs such as gestures through touchscreens and various signals through sensors (GPS, accelerometers, barometers, near-field communication, and so on). They serve a wide range of users in heterogeneous and dynamic contexts such as geographical locations and networking infrastructures. To adequately explore and uncover bugs, testing must be able to take into account complex interactions with various sensors under a range of testing contexts. A survey of mobile-app development indicated that practical mobile-app testing currently relies heavily on manual testing, with its inherent inefficiencies and biases.4 Frameworks such as Appium, Robotium, and UIAutomator can partly support automatic test execution. However, they rely on human test script design, thereby creating a bottleneck.

Fortunately, many advances in automated Android testing research have recently occurred.5–8 However, these techniques use intrusive (partly or fully white-box) approaches to execute the generated test cases. They also assume that testing tools will enjoy developer-level permissions, which isn’t always the case.

Many such techniques need to modify the app code or even the mobile OS, while even the most black-box of approaches communicate with the app under test (AUT) through a test harness. This isn’t truly black-box because it relies on a machine-to-machine interface between the test harness and AUT.

A truly black-box approach would make no assumptions, relying only on the device-level cyber-physical interface between the human and app. Testing at this abstraction level also more closely emulates the experience of real users and thus might yield more realistic test cases. Furthermore, such an approach is inherently device independent, a considerable benefit in situations that might involve more than 2,000 different devices under test.9

A Robotic-Testing Manifesto

Handheld devices require rethinking what black-box testing really means. Their user experience is so different from that of desktop applications that existing machine-to-machine black-box test generation lacks the realism, usage context sensitivity, and cross-platform flexibility needed to quickly and cheaply generate actionable test cases.

This section sets out a manifesto for robotic testing in which the generated test cases execute in a truly black-box (entirely nonintrusive) manner. Table 1 compares manual, simulation-based, and robotic testing.

Increased Realism

For Android testing, MonkeyLab generates test cases based on app usage data.10 Researchers have also published several approaches to generating realistic automated test input for web-based systems.11 However, these automated test-input-based systems don’t target mobile platforms, and the overall body of literature on automated test input generation has paid comparatively little attention to test case realism.

A developer won’t act on a test sequence that reveals a crash if he or she believes that the sequence is unrealistic. Also, all automated test data generation might suffer from unrealistic tests owing to inadequate domain knowledge. Mobile computing introduces an additional problem: a human simply might not be able to perform the tests. For example, they might require simultaneous clicking with more than five fingers.

In comparison, a robotic test harness can physically simulate human hand gestures. Although there might be some human gestures a robot can’t make (and others that a robot can make but no human can replicate), the robotic gestures will at least be physical gestures. As such, those gestures will be closer to true human interaction than the virtual gestures simulated by current nonrobotic test environments, which simply “spit” a generated sequence of events at the AUT.

Device Independence

Existing white-box and (claimed) black-box automated testing requires modifying the behavior of the AUT, the platform, or both. Even techniques regarded as black-box communicate with apps through simulated signals rather than signals triggered through real sensors (for example, touchscreens or gravity sensors) on mobile devices.

As we mentioned before, robotic testing uses the same cyber-physical interface as the human user. It’s also less vulnerable to changes in the underlying platform, API interfaces, and implementation details. In a world where time to market is critical, the ability to quickly deploy on different platforms is a considerable advantage.

A Better Cost–Benefit Ratio

Human-based testing is considerably expensive yet enjoys much realism and device independence. In contrast, current automated test data generation is relatively inexpensive, relying only on computation time, yet it lacks realism and device independence. Robotic testing seeks the best cost–benefit ratio and combines the best aspects of human-based testing and machine-to-machine automated testing.

Although robotic technology has historically proven expensive, we’re witnessing a rapid decrease in robotic technology’s cost. Crowdsourcing, too, is reducing the cost of human-based testing12 but is unlikely to ultimately be cheaper than robotic testing.

Reduced Reliance on Assumptions

Traditional automated testing makes a number of assumptions about the system under test, whereas human-based test data generation makes fewer assumptions. Robotic testing is much closer to human-based testing in the number of assumptions made, yet its ability to generate large numbers of test cases cheaply is much closer to existing automated testing.


Figure 1 shows the Axiz architecture, which contains two high-level components: the robotic-test generator and the robotic-test executor.

The Robotic-Test Generator

The robotic-test generator analyzes the AUT and uses the extracted information (including app categories, static strings, and APIs) to adjust a realism model. This model uses previously collected empirical data containing known realistic test cases.

Table 1: Criteria to consider when choosing manual, simulation-based, or robotic testing

On the basis of observations of human usage, we compute a comprehensive list of properties (for example, the delay between two adjacent events, event types, and event patterns) that capture the underlying real-world test cases’ characteristics and properties. We hope these characteristics capture what it is to be realistic, so that Axiz can use them to guide and constrain automated test data generation.
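As a rough illustration of this step, the sketch below derives the kinds of statistics just described (inter-event delays, event-type frequencies, and short event patterns) from a recorded usage trace. The function name, trace format, and property names are illustrative assumptions, not the actual Axiz implementation.

```python
from collections import Counter
from statistics import mean, stdev

def extract_realism_properties(trace):
    """Derive simple realism statistics from a recorded human usage trace.

    `trace` is a hypothetical format: a list of (timestamp_seconds,
    event_type) pairs, e.g. [(0.0, "tap"), (0.8, "tap"), (1.9, "swipe")].
    """
    # Delays between adjacent events capture human pacing.
    delays = [b[0] - a[0] for a, b in zip(trace, trace[1:])]
    return {
        "mean_delay": mean(delays),
        "delay_stdev": stdev(delays) if len(delays) > 1 else 0.0,
        # How often each event type occurs in real usage.
        "event_type_frequency": dict(Counter(e for _, e in trace)),
        # Bigrams approximate common event patterns (e.g. tap then swipe).
        "event_bigrams": Counter((a[1], b[1]) for a, b in zip(trace, trace[1:])),
    }
```

Generated test cases whose delays or event patterns fall far outside these observed distributions can then be penalized or rejected during the search.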

Figure 1. The architecture of the Axiz robotic-testing system. The robotic-test generator generates realistic tests. The robotic-test executor filters out unexecutable tests and executes the rest.

The robotic-test generator passes the realism model and AUT to the evolutionary-search component, which generates and evolves test cases. These test cases’ realism derives from two aspects of our approach. First, by reusing and extending realistic test cases (for example, Robotium or Appium test scripts), we draw on previous tests manually written by the app testers. Second, by searching a solution space constrained by the realism model, we focus on generating test cases that meet the constraints identified earlier from crowdsourced tests.

We evaluate the generated test cases’ fitness on the basis of their performance (such as code coverage and fault revelation) and realism as assessed by the realism model.
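Since the search is multi-objective, candidates are compared by Pareto dominance over these objectives rather than by a single score. The sketch below shows this comparison under assumed objective names (coverage, faults revealed, realism score); these names and the dictionary-based test-case representation are hypothetical, not the Axiz API.

```python
def fitness(test_case, realism_model):
    """Hypothetical multi-objective fitness vector for a candidate test case.

    All three objectives are maximized: code coverage, distinct faults
    revealed, and the realism score assigned by the realism model.
    """
    return (
        test_case["coverage"],         # fraction of code covered, 0..1
        test_case["faults_revealed"],  # count of distinct crashes observed
        realism_model(test_case),      # realism score, 0..1
    )

def dominates(a, b):
    """Pareto dominance: `a` is no worse than `b` on every objective
    and strictly better on at least one (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
```

An NSGA-II-style search keeps the non-dominated front of such vectors, so a highly realistic test is not discarded merely because another test achieves slightly higher coverage.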

The Robotic-Test Executor

We further validate the test case candidates by executing them on a physical device so that they interact with it in much the same way users or manual testers might do. The robotic-test executor translates the coded test scripts into machine-executable commands for the robot and then executes them on a robotic arm.

The arm interacts with the mobile device nonintrusively, just as a human would. This process requires inverse kinematics and calibration components to make the manipulator act accurately. A camera monitors the mobile-device states. The robotic-test executor further processes image data from the camera through computer vision techniques, which perform object detection and oracle comparison.
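The oracle comparison must tolerate camera noise and lighting variation, since the GUI state is observed through a physical camera rather than read from the device. A minimal sketch of such a tolerant check is shown below, using plain nested lists of grayscale values as a stand-in for the image frames OpenCV would supply; the function name and tolerance value are illustrative assumptions.

```python
def oracle_matches(captured, expected, tolerance=0.02):
    """Compare a captured screen region against the expected oracle image.

    `captured` and `expected` are equally sized 2D lists of grayscale
    pixel values (0-255). The check passes when the mean absolute pixel
    difference stays within `tolerance` of the full 0-255 range, which
    absorbs small camera noise while still flagging a wrong GUI state.
    """
    diffs = [
        abs(c - e)
        for row_c, row_e in zip(captured, expected)
        for c, e in zip(row_c, row_e)
    ]
    mean_diff = sum(diffs) / len(diffs)
    return mean_diff / 255.0 <= tolerance
```

In a real deployment this comparison would typically run on rectified frames, after the calibration step has mapped camera coordinates onto the device screen.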

Finally, the robotic-test executor sends the overall process data logged during the execution process to the test filter to determine whether the candidate test case is executable in a real-world setting. If not, the executor filters it out. Otherwise, Axiz saves the test for reuse.

A Prototype Implementation

We implemented a prototype of Axiz to demonstrate the system’s feasibility (see Figure 2). We built our implementation entirely from commodity hardware components, which are inexpensive, widely available, and interchangeable. We use 3D vision-based self-calibration13 to help calibrate and adjust the robotic manipulator to keep the system working reliably and to serve as input to the oracle comparator.

The manipulator is a four-axis Arduino-based robotic arm. It’s driven by stepper motors with a position repeatability of 0.2 mm. The maximum speed of movement for each axis ranges from 115 to 210 degrees per second (when loaded with a 200-g load, a sufficient maximum for most mobile devices). At the arm’s end is a stylus pen that simulates finger-based gestures.

An external 1080p CMOS camera monitors the test execution. We run the test generator and robot controller on a MacBook Pro laptop with a 2.3-GHz CPU and 16 Gbytes of RAM.

We employ inverse kinematics (in Python) for robotic-arm control. The object detector and oracle comparator are implemented on top of the OpenCV library. The robotic-test generator employs NSGA-II (Non-dominated Sorting Genetic Algorithm II), a widely used multi-objective genetic algorithm, for multi-objective search-based software testing, using our (currently state-of-the-art) tool Sapienz.8 This tool generates sequences of test events that achieve high coverage and fault revelation with minimized test sequence length.
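To give a flavor of the inverse-kinematics step, the sketch below solves the classic closed-form case for a planar two-link arm: given a target point, it returns the joint angles that place the end effector there. This is a simplified teaching example, not the Axiz controller; the actual arm has four axes, but the same closed-form reasoning extends axis by axis.

```python
import math

def two_link_ik(x, y, l1, l2):
    """Closed-form inverse kinematics for a planar two-link arm.

    Returns (shoulder, elbow) joint angles in radians that place the
    end effector at (x, y), for link lengths l1 and l2. Raises
    ValueError when the target lies outside the arm's reach.
    """
    d2 = x * x + y * y
    # Law of cosines gives the elbow angle directly.
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target out of reach")
    elbow = math.acos(cos_elbow)  # elbow-down solution
    # Shoulder angle: direction to target minus the offset the elbow introduces.
    shoulder = math.atan2(y, x) - math.atan2(
        l2 * math.sin(elbow), l1 + l2 * math.cos(elbow)
    )
    return shoulder, elbow
```

A quick sanity check is to feed the angles back through forward kinematics and confirm the end effector lands on the requested point.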

Axiz and the Google Calculator App

The Google Calculator app has had 5 to 10 million installs.14 Although it’s simple, it’s a nontrivial real-world app and thus illustrates the potential for truly black-box robotic testing.

We used the robotic-test generator to generate realistic tests, which we executed using the robotic manipulator. The device under test was a Nexus 7 tablet, with normal user permissions and the official Android OS (without modification). For comparison, we introduced another Nexus 7 on which we allowed more traditional intrusive testing. The second Nexus 7 was directly connected to the robot controller on the MacBook. The test tool for it had developer-level privileges and could modify the OS.

Figure 3 illustrates this process. The MacBook’s interpreter component translated the event instructions into motion specifications for the robotic-arm controller. That controller then transformed the specifications into joint angle instructions on the basis of inverse kinematics. As Figure 3 shows, the robotic arm touched the buttons on the first Nexus 7 to perform testing. The oracle comparator witnessed each test event. After each step of the test execution, it captured images through the external camera and validated the mobile-GUI states.

Axiz accurately executed each test event specified in the generated robotic-test cases and passed the required oracle checkpoints, faithfully maximizing Sapienz’s abilities.

Figure 2. Testing mobile apps with a four-axis robotic arm. We built our implementation entirely from commodity hardware components, which are inexpensive, widely available, and interchangeable. 

A video of Axiz performing this testing is here. In it, we demonstrate Axiz side by side with a traditional automated-testing tool that doesn’t use a robot arm but simply produces a sequence of events. The video demonstrates that the robotic arm, built from cheap commodity hardware, can physically produce the same set of events, but more realistically, thereby achieving greater device independence and realism.


Acknowledgments

We thank Andreas Zeller for his invited talk at the 36th CREST (Centre for Research on Evolution, Search and Testing) Open Workshop,15 during which he presented a playful video of a disembodied synthetic human hand automatically interacting with a mobile device. This was one of the inspirations for our research.


References

  1. F. Richter, “Global PC Sales Fall to Eight-Year Low,” 14 Jan. 2016.
  2. “Global Smartphone Shipments Forecast from 2010 to 2019 (in Million Units).”
  3. “Worldwide Device Shipments to Grow 1.9 Percent in 2016, While End-User Spending to Decline for the First Time,” Gartner, 20 Jan. 2016.
  4. M.E. Joorabchi, A. Mesbah, and P. Kruchten, “Real Challenges in Mobile App Development,” Proc. 2013 ACM/IEEE Int’l Symp. Empirical Software Eng. and Measurement (ESEM 13), 2013, pp. 15–24.
  5. A. Machiry, R. Tahiliani, and M. Naik, “Dynodroid: An Input Generation System for Android Apps,” Proc. 9th Joint Meeting Foundations of Software Eng. (ESEC/FSE 13), 2013, pp. 224–234.
  6. D. Amalfitano et al., “Using GUI Ripping for Automated Testing of Android Applications,” Proc. 27th IEEE/ACM Int’l Conf. Automated Software Eng. (ASE 12), 2012, pp. 258–261.
  7. W. Choi, G. Necula, and K. Sen, “Guided GUI Testing of Android Apps with Minimal Restart and Approximate Learning,” Proc. 2013 ACM SIGPLAN Int’l Conf. Object Oriented Programming Systems Languages & Applications (OOPSLA 13), 2013, pp. 623–640.
  8. K. Mao, M. Harman, and Y. Jia, “Sapienz: Multi-objective Automated Testing for Android Applications,” Proc. 25th Int’l Symp. Software Testing and Analysis (ISSTA 16), 2016, pp. 94–105.
  9. A. Reversat, “The Mobile Device Lab at the Prineville Data Center,” Facebook, 13 July 2016.
  10. M. Linares-Vasquez et al., “Mining Android App Usages for Generating Actionable GUI-Based Execution Scenarios,” Proc. 12th Working Conf. Mining Software Repositories (MSR 15), 2015, pp. 111–122.
  11. M. Bozkurt and M. Harman, “Automatically Generating Realistic Test Input from Web Services,” Proc. IEEE 6th Int’l Symp. Service-Oriented System Eng. (SOSE 11), 2011, pp. 13–24.
  12. K. Mao et al., “A Survey of the Use of Crowdsourcing in Software Engineering,” J. Systems and Software, 2016.
  13. J.M.S. Motta, G.C. de Carvalho, and R. McMaster, “Robot Calibration Using a 3D Vision-Based Measurement System with a Single Camera,” Robotics and Computer-Integrated Manufacturing, vol. 17, no. 6, 2001, pp. 487–497.
  14. “Calculator,” Google, 2016.
  15. A. Zeller, “Where Does My Sensitive Data Go? Mining Apps for Abnormal Information Flow,” presentation at 36th CREST Open Workshop (COW 36), 2014.

About the Authors

Ke Mao is a research student at the Centre for Research on Evolution, Search and Testing (CREST) at University College London. Contact him at

Mark Harman is the director of the Centre for Research on Evolution, Search and Testing (CREST) at University College London. Contact him at

Yue Jia is a lecturer of software engineering at the Centre for Research on Evolution, Search and Testing (CREST) at University College London. Contact him at
