The Combinatorial Design Approach to Automatic Test GenerationBy David M. Cohen, Siddhartha R. Dalal, Jesse Parelius, and Gardner C. Patton Appeared in IEEE Software, September 1996, pp. 83--87. Copyright © Telcordia Technologies and IEEE 1996. TABLE OF CONTENTSAbstractMemo starts The Combinatorial Design Paradigm System Testing Coverage Measurements Fault Detection Comparison of Methods Acknowledgments References ABSTRACTThe combinatorial design method substantially reduces testing costs. The authors describe an application in which the method reduced test plan development from one month to less than a week. In several experiments, the method demonstrated good code coverage and fault detection ability. Designing a system test plan is difficult and expensive. It can easily take several months of hard work.A moderate-size system with 100,000 lines of code can have an astronomical number of possible test scenarios. Testers need a methodology for choosing among them. The ISO 9000 process gives some guidance, specifying that each requirement in the requirements document must be tested. However, testing individual requirements does not guarantee that they will work together to deliver the desired functionality.The combinatorial design method, a relatively new software testing approach, can reduce the number of tests needed to check the interworking of system functions. Combinatorial designs [1] are mathematical constructions widely used in medical and industrial research to construct efficient statistical experiments. [2] At Telcordia, we have developed a system called AETG tm, (Automatic Efficient Test Generator) [1,4] which uses combinatorial design theory to generate tests. Several Telcordia and other groups are using the AETG system for unit, system, and interoperability testing. We performed experiments using AETG to test the effectiveness of the combinatorial design approach. We tested some standard Unix commands and found that the combinatorial design test sets gave good code coverage. In another experiment, we tested modules from two releases of a Telcordia product. The combinatorial design tests uncovered faults in the code and in the requirements that were not detected by the standard test process. In this experiment, the product's developers and requirement writers created the tests, demonstrating that practicing engineers can use the combinatorial design approach to test software effectively. The Combinatorial Design ParadigmTo design a test plan, a tester identifies parameters that determine possible scenarios for the system under test (SUT). Examples of such test parameters are SUT configuration parameters, internal SUT events, user inputs, and other external events. For example, in testing the user interface software for a screen-based application, the test parameters are the fields on the screen. Each different combination of test parameter values gives a different test scenario. Since there are often too many parameter combinations to test all possible scenarios, the tester must use some methodology for selecting a few combinations to test. In the combinatorial design approach, the tester generates tests that cover all pairwise, triple, or n-way combinations of test parameters specified in formal test requirements. Covering all pairwise combinations means that for any two parameters p1 and p2 and any valid values v1 for p1 and v2 for p2, there is a test in which p1 has the value v1 and p2 has the value v2. Using the AETG system the tester specifies a system's test requirements as a set of relations. Each relation has a set of test parameters and a set of values for each parameter. The set of possible test scenarios specified by the relation is the Cartesian product of the value sets for its test parameters. The tester specifies constraints between test parameters either by using multiple relations or by prohibiting a subset of a relation's tests. For each relation, the tester specifies the degree of interaction (for example, pairwise or triple) that must be covered. AETG then uses combinatorial designs to generate tests covering the specified degree of interaction. Testers usually generate test sets with either pairwise or triple coverage. An empirical study of user interface software at Telcordia found that most field faults were caused by either incorrect single values or by an interaction of pairs of values. Our code coverage study also indicated that pairwise coverage is sufficient for good code coverage. The seeming effectiveness of test sets with a low order of coverage such as pairwise or triple is a major motivation for the combinatorial design approach.
Table 1: Relation For A Voice Response Unit Table 1 shows a relation with four parameters that defines test scenarios for a voice response unit, a device that plays voice announcements and collects dialed digits. We abstracted this example from a test plan generated for a telephone network service called AIN (Advanced Intelligent Network). The first parameter specifies the type of announcements to play. It has three values: none, interruptible, and noninterruptible. The second parameter specifies that the user should input no digits, a fixed number of digits, or a variable number terminated by the pound key. The third parameter specifies whether the voice response unit is to make a billing record for the transaction. The final parameter indicates whether the user is accessing the unit by a local phone line or a long-distance trunk. To prevent vacuous test scenarios, the relation has a constraint that prohibits any combination using none for the announcement parameter and none for the digits-wanted parameter. Testing all combinations of the four parameters in Table 1 that satisfy the constraint requires 32 different possible test scenarios. The eight tests shown in Table 2 cover all pairwise combinations of the parameter values, for a 75 percent reduction. Complex testing situations often give more dramatic reductions. For example, 120 binary parameters require 10 tests for pairwise coverage, instead of 2 120 for exhaustive testing.
Table 2: Tests For A Voice Response Unit System TestingThe first and perhaps most important step in defining test requirements is to identify the test parameters. In general, the parameters for unit testing are low-level entities, such as screen fields or message components, to which the tester must assign values. The system documentation usually makes clear what the unit testing parameters should be. The way to define test parameters for system testing, however, is less obvious. The key to defining these parameters is to model the system's functionality, not its interface. This reduces the model's complexity and focuses the model on the system aspects important to testing. Our test plan for a network performance-monitoring system serves as an example. We generated tests for two releases of this equipment. Creating the test requirements for the first release took one week; modifying it for the second release took an hour. Generating a test plan for the second release would normally take well over a month. We based the test requirements on the system's user manual. The test requirements for the final release had a total of 75 parameters with 1029 possible test combinations. The combinatorial design approach required only 28 tests to cover all pairwise parameter combinations for the 75 test parameters. This equipment includes several monitors that track the number of corrupted ATM (asynchronous transfer mode) cells in transmission facilities. The system software contains commands to configure the monitors and display the collected data. To test the system, the tester enters a series of configuration commands and uses an attenuator to corrupt the transmission facility's ATM cells. Then the tester enters display commands to check that the system has correctly collected the data. The configuration commands and the attenuator setting specify all of each test's tester-controllable parameters. Users interface with the system via screens on a data input terminal. The system has several screens for command entry and data display. The screens have several hundred fields. The tester enters a command's name on the first screen and then enters its arguments on the following screens. The command determines which screens follow the initial screen. For example, to give the command that turns a monitor on or off, the tester enters the command's name on the initial screen. The next screen displays each monitor's current on/off status. On this screen, the tester enters either on or off. The following screen confirms the changes. Although the tester gives the system commands by typing into screens, the test relation parameters do not model screen fields; rather, they model the commands and the attenuator. Several test parameters determine each command's arguments. A back-end script translates the test parameters into the appropriate keystrokes. Table 3 shows the test specification of two monitor commands. One command turns the monitor on and off. The other sets the time units and thresholds for the monitor's alarms. The full test specification of the two commands has a copy of these parameters for each monitor. The active status parameter has three values: on, off, and no change. The back-end script translates the test parameter value on into an on entry in the monitor's field on the status screen and a yes entry in the monitor's field on the confirm screen. The value no change becomes either a yes or a no entry in the monitor's field on the status screen and a no entry in the monitor's field on the confirm screen. The back-end script similarly translates the other test parameters.
Table 3: Test Specification of Two Monitor Commands Coverage MeasurementsSeveral recent studies indicate that good code coverage correlates with effectiveness. [5, 6, 7] We measured the coverage of combinatorial design test sets for 10 Unix commands: basename, cb, comm, crypt, sleep, sort, touch, tty, uniq, and wc. We used the ATAC test tool [8] to provide the block, decision, C-uses, and P-uses metrics for the test sets. The pairwise tests gave over 90 percent block coverage. We performed several experiments on the sort command. We selected the sort command because, with 842 lines, it has substantially more code that the other commands. W.E. Wong and his colleagues [8] also experimented with the sort command. They generated a pool of 1,000 random tests from which they created multiple, distinct test sets with various amounts of block coverage. The maximum block coverage provided by a test set was 80-85 percent. The average number of tests in the 29 test sets with 80-85 percent block coverage was 26.97. We based the sort command's test requirements on the documentation in the user manual. Sort has many flags. Several can occur more than once in a command line. Some flags are incompatible. Others cause the test to bypass much of the code. For example, the -c flag specifies that the sort program should only check the input and not sort it. Any test with this flag will not execute the sort code. After studying the documentation, a tester would normally write a test plan including only a few tests with such flags. Since we wanted to automate the process as much as possible, we initially did not partition the flags to avoid these tests. There are many ways to model the sort command. We tried several different models. Table 4 shows the coverage data for two of them. Our models differed from each other in two ways. First, we varied the number of parameters, which we call width, and the maximum number of values for a parameter, which we call height. In general, the number of pairwise tests is at least the product of the number of values for the two parameters with the most values. Consequently, rewriting a model to decrease the maximum number of values per parameter can decrease the number of tests, even if it increases the number of parameters.
Table 4: Coverage of Pairwise Tests for Sort The other factor we varied was whether or not values such as the -c flag were valid or invalid parameter values. A parameter's valid values are tested pairwise against the other parameters' valid values. Consequently, they occur in many tests. For example, if a parameter has three valid values, each will occur in about a third of the tests. A parameter's invalid values are not tested pairwise with the other parameter values but are tested only once. If a value is known to cause a visible fault, making it invalid can reduce the number or tests causing that fault. Although having many tests causing an easily checked fault may not greatly increase the testing cost, the fault may mask the effect of the test's valid values. Consequently, testing the invalid value pairwise with the valid values may decrease coverage. We made two changes to model A, creating an intermediate model not shown in Table 4. First, we made the two illegal output files invalid instead of valid parameter values. As a result, they would be tested individually and would not occur in so many tests. Second, we decreased the maximum number of values per test parameter--in other words, decreased the model's height. We did this by adding additional test parameters and distributing the flags among them. This model had 23 parameters and a height of 4. It required 37 valid tests and two invalid tests (one for each of the two invalid test parameters). It covered 86 percent of the code. The intermediate model generated several tests that caused the sort program to core dump. Nine tests core dumped with a bus error message, and two with a segmentation fault message. Two input files caused the core dumps. We also had 13 tests that exited with the error message "sort: can check only 1 file." Since 24 of the 39 tests were causing these three faults, we decided to rewrite the model. In the next model, B in Table 4, we made the two input files that caused the core dumps invalid instead of valid values. We also made the -c option an invalid value for the parameters specifying the sorting options. These changes did not affect the number of fields or the height, so like the intermediate model, this model also had 23 test parameters and a height of 4. Model B generated 41 tests: 29 pair-wise tests of the valid values and 12 tests of the invalid values. The 29 valid tests gave 90 percent block and 82 percent decision coverage. Including the 12 invalid tests, the 41 tests for model B gave 95 percent block and 86 percent decision coverage. This is better coverage than that of the 126 tests for model A or the 39 tests for the intermediate model. It is also better than the random test coverage described by Wong and his colleagues. These experiments illustrate an important optimization method for the combinatorial design approach. By making some values invalid instead of valid, we increased the block coverage from 86 percent to 95 percent while increasing the number of tests by only two, from 39 to 41 tests. We identified these invalid values either from the error messages produced by the sort program or from its user documentation. By basing our optimizations on the program's error messages and user documentation, we took a "black box" approach to testing. If the source code is available, testers can also use "white box" optimization methods. Fault DetectionWe used the combinatorial design method to test user interface modules from two releases of a Telcordia product. The product's developers and requirements writers wrote the test requirements based on the development requirements. Because we used the Telcordia Mynah TM system to run the tests and compare the results automatically, we made little effort to minimize the number of tests. Some of the problems we found were faults in the code, and others were faults in the requirements. Finding faults in the requirements is important because the requirements play a part in maintenance and future development. Table 5 shows the details of our testing.
Table 5: Fault Detection Experiments on Two Releases of a Telcordia Product Because these tests are generated from high-level test requirements, they can be easily modified when the code is updated. In fact, some of the tests for the 1995 release were revisions of tests for the 1994 release. When we ran our tests, the software was nearing the end of system test, and most defects had already been found. Nevertheless, the combinatorial design test sets found defects the normal test process missed. Several were so severe that the module needed extensive rewriting. One of the developers, who tested his own software, reported that the combinatorial design approach uncovered all the faults (nine code faults in three modules) that he had found during extensive unit testing. Comparison of MethodsBecause the combinatorial design testing approach is relatively new, the literature reports on only two test generation systems using it: Telcordia AETG and CATS (Constrained Array Test System), developed by G. Sherwood at Bell Labs. [9] Although published descriptions of CATS applications and results are scant, the AETG combinatorial design algorithms seem to generate fewer test cases than CATS. For example, Sherwood reports that in its largest known run, CATS generated 240 test cases for all pairwise combinations of 20 factors with 10 values each. Mallows [10] reports that he reduced this by hand analysis to 206 tests. The AETG system requires only 180 tests for this configuration. A related approach [11, 12, 13] is the use of orthogonal arrays, a pairwise combinatorial design with the additional property of requiring every pair of values to occur exactly the same number of times. This requirement is severe and makes orthogonal arrays impractical for testing software. For example, for 100 parameters with two values each, the orthogonal array requires at least 101 tests, while 10 test cases are sufficient to cover all pairs. The combinatorial design approach differs from traditional input testing by giving testers greater ability to use their knowledge about the SUT to focus testing. Testers define the relationships between test parameters and decide which interactions must be tested. Using the combinatorial design method, testers often model the system's functional space instead of its input space. The combinatorial design approach to automatic test generation combines an expressive interaction description language with algorithms that reduce the number of tests needed to cover the specified interactions. Testers can create effective and efficient test plans, often faster than by traditional methods entailing hand optimization. This approach can reduce the testing cost while preserving or increasing test quality. AcknowledgmentsThe authors thank Chris Bailey, Mike Dobiesz, Peter Duchnowski, Eleanor Kaldon, and Ajay Kajla for their participation in the fault detection experiments. We also thank Mary Griffin, Charles Hallowell, Adam Irgon, and Isaac Perelmuter for their support. Finally, we thank Jon Kettenring without whose support this project would not have been undertaken. REFERENCES1. Hall Jr., M., Combinatorial Theory, Wiley Interscience, New York, 1986. 2. Taguchi, G., System of Experimental Design, Quality Resources, 1987. Translation of Jikken keikakuho, Maurzen Co., Tokyo, 1976. 3. Cohen, D. M., S. R. Dalal, A. Kajla, and G. C. Patton, "The Automatic Efficient Test Generator," Proc. Fifth Int'l Symposium on Software Reliability Eng., IEEE CS Press, Los Alamitos, Calif., 1994, pp. 303-309. 4. Cohen, D. M., S. R. Dalal, M. L. Fredman, and G. C. Patton, "The AETG System: a new approach to testing based on combinatorial design," Tech. Report TM-25261, Bell Communications Research, Morristown, N.J., 1995. 5. Plwowarski, P., M. Ohba, and J. Caruse, "Coverage Measurement Experience During Function Test," Proc. 15th Int'l Conf. on Software Engineering, IEEE CS Press, Los Alamitos, Calif., 1993, pp 287-303. 6. Wong, W. E., J. R. Horgan, S. London, and A. P. Mathur, "Effect of Test Set Minimization on Fault Detection Effectiveness," Proceedings of the 17th International Conference on Software Engineering, IEEE CS Press, Los Alamitos, Calif., 1995, pp. 41-50. 7. Wong, W. E., J. R. Horgan, S. London, and A. P. Mathur, "Effect of Test Size and Block Coverage on the Fault Detective Effectiveness," Fifth Int'l Symposium on Software Reliability Engineering, IEEE CS Press, Los Alamitos, Calif., 1994, pp. 230 - 238. 8. Horgan, J. R., and S. London, "ATAC: A Data Flow Coverage Testing Tool for C," Proc. IEEE Conf. Assessment of Quality Software Development Tools, IEEE CS Press, Los Alamitos, Calif., 1992, pp. 2-10. 9. Sherwood, G., "Effective Testing of Factor Combinations," Proc. 3rd Int'l Conf. Software Testing, Analysis and Review, Software Quality Eng., Jacksonville, Fla., 1994. 10. Mallows, C., "Covering Designs in Random Environments," to be published in Festschrift for J. W. Tukey, S. Morgenthaler, ed., 1996. 11. Mandl, R., "Orthogonal Latin Squares: An Application of Experimental Design to Compiler Testing," Comm. ACM, Oct. 1985, pp.1054-1058. 12. Brownlie, R., J. Prowse, and M. Phadke, "Robust Testing of AT&T PMX/StarMail using OATS," AT&T Technical J., Vol. 71, No. 3, 1992, pp. 41-47. 13. Heller, E., "Using Design of Experiment Structures to Generate Software Test Cases," Proc. 12th Int'l Conf on Testing Computer Software, ACM, New York, 1995, pp. 33-41. AETG
is a trademark of Telcordia Technologies MYNAH is a trademark of Telcordia Technologies | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © Telcordia Technologies and IEEE 1996.