Review of "An Active Approach to Voting Verification"
by Ellen Theisen, Executive Director, VotersUnite.Org

Abstract

On May 8, 2005, Jim Drinkard of USA Today published an article referencing "a new Massachusetts Institute of Technology study that found problems with paper backup for electronic voting machines." Mr. Drinkard reported that Ted Selker, Associate Professor of Media Arts and Sciences at MIT and head of the Media Lab's Context-Aware Computing group, oversaw the study.

A report of the study was written by Dr. Selker and Sharon Cohen, one of Dr. Selker's graduate students, as part of the Caltech/MIT Voting Technology Project. Among the many severe weaknesses in the study are four fundamental flaws, any one of which is sufficient to discredit the conclusions.

First, Dr. Selker and Ms. Cohen began their experiment with a clear bias. The experiment, according to the paper, was a comparison of two voting verification methods: voter verified paper audit trail (VVPAT)* and voter verified audio audit transcript trail (VVAATT). Dr. Selker, who has opposed VVPAT for years, introduced VVAATT as an alternative to VVPAT in 2004. The authors do not explain any steps they took to minimize the influence of their obvious bias in comparing a system Dr. Selker opposes to one he invented.

Second, Dr. Selker and Ms. Cohen did not compare two existing audit trail systems. They say, "we created our own VVPAT and VVAATT systems instead of using a commercial DRE." However, the operation of the VVPAT system they created is substantially different from the operation of any existing VVPAT system. Since there are no VVAATT systems presently in use, their study is a comparison of two hypothetical audit systems. Their results can tell us little about a comparison between real verification systems.

Third, the experiment did not test what they claimed to be testing. Even the brief descriptions of their methodology reveal that they actually compared the subjects' short-term memories of meaningless data with their long-term memories of meaningless data. What they showed was that the subjects' short-term memories were better.

Finally, their conclusion — that VVAATT is more accurate and useful — is illogical when the results are evaluated against the criteria they defined for the experiment. They set out to compare the systems on the basis of three criteria: number of feedback errors caught by the subjects, time required to vote, and usability. Their conclusion states that while more errors were caught on the VVAATT system, it required more time to vote, and the subjects were less comfortable with it. Even though the paper system outperformed the audio system in two of their three criteria, they declare the audio system superior.

In fact, the authors' choice of these criteria betrays their lack of understanding of the key issues surrounding a voter-verified audit trail. They declare that usability, time to vote, and the subjects' ability to catch errors in the audit trail are "some of the most important factors to voting and audit systems," ignoring the basic purpose of audit trails, which is to provide a way to audit the machines and recount authenticated records of the ballots. As a result, their experiment fails to compare the usability of the two types of audit trails when they are used for their fundamental purpose.

Furthermore, Dr. Selker and Ms. Cohen fail to give any serious attention to a remarkable and unexpected result: in their evaluation of results, they report that 90% of the subjects said they would recommend VVPAT to their county leaders. Since the authors conclude that VVAATT is the superior system, this incongruous result warrants further exploration. Rather than asking the subjects the reason for their preference, the authors choose instead to hypothesize the cause (unfamiliarity with VVAATT) and then include their hypothesis in the conclusion as a fact.

Dr. Selker's and Ms. Cohen's paper is not a study of voting verification systems. The respect given to their paper by USA Today assumes a level of scholarship that is severely lacking in the authors' work.

* Note that Dr. Selker and Ms. Cohen limited the definition of VVPAT to paper audit trails produced by Direct Recording Electronic (DRE) voting systems. Voting systems with an inherent paper audit trail, such as optical ballot scanners, were not included in their experiment.

1 Introduction

Dr. Selker and Ms. Cohen state that they conducted a comparison of two voting verification systems: a voter verified paper audit trail (VVPAT) and a voter verified audio audit transcript trail (VVAATT). Their goal was to assess the following factors: general usability of the system, the time needed for voters to use the system, and the number of errors voters were able to catch in each audit trail.

They constructed both systems themselves and paid 36 test subjects to cast ballots in eight elections each — four on each system. Each ballot contained eleven races. The VVAATT system provided audio feedback of each selection immediately after the subject made the selection. The VVPAT system provided visual feedback for all races on paper after the subject finished making all selections. It is not clear from the report whether the subjects could compare the paper record with the review screen.

The feedback was designed to contain errors in three of the four ballots cast on each system. The authors recorded the number of errors the subjects reported and noted also when subjects seemed to notice an error they did not report. Dr. Selker and Ms. Cohen found that 14 errors were reported on the VVAATT system; no errors were reported on the VVPAT system. They noted that 25 errors appeared to be noticed but not reported on the VVAATT system; three on the VVPAT system. They found that the VVAATT system took "a third more time" than the VVPAT system.
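For reference, the totals implied by this design (assuming every subject completed all eight elections, and counting each error-containing ballot once, as the figures later in this review appear to do) work out as follows:

36 subjects × 8 elections each = 288 ballots cast in total
36 subjects × 3 error ballots per system = 108 error ballots on each system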

Their questionnaire at the end of the experiment showed that 85% of the subjects agreed there were errors in the audio feedback; 8% agreed there were errors in the paper feedback. In addition, 90% of the subjects said they would recommend the VVPAT system to their county decision-makers.

2 Flaws in the Method

Here are a few of the serious problems with Dr. Selker's and Ms. Cohen's experimental method:

1) They both began the experiment with a strong bias favoring VVAATT. The experiment they chose was a comparison between two vote verification systems: one that Dr. Selker has opposed for years (VVPAT), and one that he has recently invented (VVAATT).

Experienced scientists acknowledge that their own biases can inadvertently skew the results of an experiment. In this experiment, the bias was evident, since the authors were attempting to determine whether Dr. Selker's invention would perform better or worse than a system he opposes. Yet their report is devoid of any discussion of the bias or of any methods they might have used to minimize its impact on the results.

In their brief explanations of the systems, their bias is immediately evident. Their description of the fourth step of the VVPAT verification includes the following:

If the voter accepts the printout then the paper is deposited in a secured ballot box. Otherwise, if the voter rejects the printout, they will have to begin voting again.

On the other hand, their description of the VVAATT system begins with praise:

The voter verified audio audit transcript trail (VVAATT) is a new idea for an audit mechanism that fits more naturally into the voting process.

As they compare the differences between the two systems, their preference for the VVAATT system becomes even clearer (highlighting added):

There are two important differences between the VVPAT and VVAATT systems. One important difference is the timing of the verification process. When using a VVPAT, all verification is delayed till the end of the voting process, however, with a VVAATT verification is done throughout the voting process. Eliminating this time delay means that the voter does not have to rely on their memory to properly verify their votes, in addition, accidental mistakes such as pressing the wrong candidate may be more immediately identified with the audio feedback.

The other main difference between the two systems is that VVAATT provides a multimodal form of verification while VVPAT does not. The VVAATT audio verification complements the visual verification that the voter receives from the DRE. Instead of competing for the voter’s attention, the two forms of feedback actually combine to heighten the voter’s attention and create a more engaging voting process.

2) Dr. Selker and Ms. Cohen did not compare real-world systems.

They designed and built the VVPAT and VVAATT systems used in the experiment. In fact, the way voters interact with the VVPAT system they created differs from a voter's interaction with the VVPAT systems currently on the market.

They claim to be comparing two systems proposed for auditing elections, but they are comparing two systems created specifically for this experiment. While it is clear from their descriptions that the VVPAT system differs from those currently in use, it is impossible to know the extent of the differences. Because of the differences between real-world systems and the systems they compared, the results of their experiment cannot be extrapolated to a comparison of a real VVPAT system and Dr. Selker's VVAATT system.

3) The report provides little information about the methodology.

For example, did the authors choose familiar historical characters or invented names for the candidates on the ballots? How did they ensure that the ballots for one system were comparable to the ballots for the other? What size was the typeface on the VVPAT system? Did they use synthesized speech or human voices for the VVAATT system?

What instructions were the subjects given? Were they told that it was a comparison between two vote verification systems? What incentive did they have to report errors in the verification systems' feedback?

How did Dr. Selker and Ms. Cohen ensure that they did not give the subjects accidental clues that revealed their own bias?

Did half the subjects vote on the VVAATT system first and the other half vote on the VVPAT system first? If so, were there any differences in the results of the two groups?

Or did they all vote on the VVPAT system first, or the VVAATT system first? Were the same four elections used for each system? If the elections were different, were they carefully devised to be equivalent, or were they selected randomly for each subject and each voting event?

The report states that the review screen was always correct. Was it still on the screen when the subjects reviewed the VVPAT? If so, then the results of the experiment indicate that the subjects failed to compare two simultaneously displayed summaries and find the discrepancies between them.

4) Dr. Selker's and Ms. Cohen's experiment fails to replicate the real-life situation they were attempting to evaluate. As a result, they tested short-term memory vs. long-term memory, rather than VVAATT vs. VVPAT.

The only information the report includes about the ballots is that each had "11 races with a mixture of single selection races and multiple selection races." Sometimes the experimenters told the subjects how to vote; sometimes the subjects chose their own candidates. The VVAATT system gave immediate audio feedback after each selection. The VVPAT system gave visual feedback all at once at the end.

A subject making 11 choices, each time choosing among several meaningless (or even historically meaningful) names, is quite likely to forget at least some of their choices by the time they have finished making all 11 selections. On the other hand, they should certainly remember each selection immediately after making it and before going on to the next one.

However, when voters go to the polls to vote for real candidates, they are extremely unlikely to forget how they voted — even days later or possibly years later, because the choices are meaningful.

What Dr. Selker and Ms. Cohen describe is an experiment that compares people's short-term memory of meaningless choices with their long-term memory of meaningless choices. What they discovered is that the subjects' short-term memories seem to be slightly better.

3 Flaws in the Evaluation of Results

Dr. Selker and Ms. Cohen fail to address anomalies in the results.

1) They fail to explore the fact that in 108 elections with feedback errors, in which the wrong choices were read back to the subjects immediately after they pressed a button, the subjects reported the errors only 14 times (less than 13%).

This means there were 94 times when a subject chose a candidate, and the audio immediately stated the name of a different candidate, but the subject failed to report the difference, and in most cases apparently failed to notice the difference.
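The arithmetic behind these figures is simple (again assuming one counted error opportunity per error ballot):

14 reported ÷ 108 error ballots ≈ 0.13, i.e., just under 13% of the audio errors were reported
108 - 14 = 94 audio errors that went unreported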

This suggests serious flaws in either the instructions given to the subjects or the clarity of the audio feedback in the system the experimenters created, yet the experimenters express no concern.

2) The authors report:

Finally, in our post-survey data 85 percent of participants agreed with the statement that there were errors in the audio verification ...
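Assuming all 36 subjects answered the post-survey, the survey percentages reported earlier correspond to roughly the following head counts:

85% × 36 subjects = 30.6, i.e., about 30 people agreeing there were errors in the audio feedback
8% × 36 subjects = 2.9, i.e., about 3 people agreeing there were errors in the paper feedback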

Out of 36 subjects, 14 audio errors were reported, another 25 appeared to be noticed but were not reported, and about 30 people (85%) agreed there were audio errors. If about 30 subjects agreed there were errors, why were so few of those errors reported? Did the subjects understand the instructions? These questions remain unanswered by the report.

3) Dr. Selker and Ms. Cohen point out that 90% of the subjects said they would recommend the VVPAT system to their county election commission, but they dismiss the importance of that result.

Since the subjects did not report any of the 108 feedback errors provided by VVPAT, this incongruous result warrants further exploration. However, the authors simply hypothesize the reason:

We believe that this preference is a result of voters’ familiarity with paper records in other situations and inexperience with audio records.

One of the authors' criteria for evaluating the systems was usability. Since the subjects' preference could be considered a significant indicator of usability, and since the authors' own conclusion treats it as one, their failure to explore this important result is a serious omission in their evaluation of results.

4 Flaws in the Conclusion

The conclusion is naïve, confusing, unwarranted, and illogical.

1) In their conclusion, Dr. Selker and Ms. Cohen make the following naïve statement:

Both VVPAT and VVAATT offer reliable and secure audit technology so evaluating which is preferable must be based on how voters are able to interact with the two systems.

Both authors appear to be uneducated about the complex issues involved in electronic voting. They assert that VVPAT and VVAATT are both reliable and secure, in spite of the fact that within the community of people concerned about an audit trail, there is much serious debate on questions such as how to make it reliable and secure, how to make it usable, and how to make it accessible.

Neither do the authors appear to understand the purpose of an audit trail. Underlying their entire report is the assumption that an audit trail is simply a verification mechanism to ensure that voters record their intent correctly. They seem to be unaware that the purpose of an audit trail is to provide a check on the accuracy of the machines. Their experiment fails to compare the usability of the two systems for recounts and audits.

Since Dr. Selker's and Ms. Cohen's assumption about the purpose of verification systems is false, their efforts reveal nothing about the relative real-world value of the two systems they compared.

2) Dr. Selker and Ms. Cohen claim:

Our studies indicated that VVAATT serves as a much more accurate a[nd] useful audit trail with voters able to identify significantly more errors.

Their use of the word "significantly" exaggerates the actual number of errors reported.

Subjects who heard inaccurate data read back to them immediately after making a choice identified it as inaccurate less than 13% of the time.

Subjects who visually inspected inaccurate data presented to them in a block, after making a block of choices, never identified it as inaccurate.

It is inappropriate to call this slight difference "significant." Instead, the overwhelming number of errors that went unreported on both systems strongly suggests that their experiment was flawed.

3) The authors end their conclusion with the following:

... voters are less familiar and comfortable with an audio record.

However, this explanation is only their supposition about why 90% of the subjects said they would recommend the VVPAT system to their county officials. This conclusion is unwarranted since it is not based on any evidence provided by the experiment.

4) Dr. Selker's and Ms. Cohen's conclusions do not respect the criteria they defined for evaluating the systems. According to their concluding remarks, the VVAATT system was superior in only one of the three criteria they defined at the start of the experiment.

Therefore, based on their own criteria for evaluating the results, their conclusion is illogical. They point out that VVPAT took less time to vote and that the subjects found it more usable. In the one area where VVAATT performed better, it still performed remarkably badly: subjects reported less than 13% of the feedback errors read back to them immediately after they made their selections. Yet Dr. Selker and Ms. Cohen declare VVAATT the superior system.

Conclusion

Dr. Selker and Ms. Cohen start with the presupposition that VVAATT is preferable to VVPAT. They conduct an experiment using seriously flawed methodology. They dismiss significant results that call their hypothesis into question, and — contrary to the evidence provided by the subjects during the experiment — they conclude that VVAATT is preferable.

This report is a discredit to Dr. Selker, Ms. Cohen, the Caltech/MIT Voting Technology Project, and the Massachusetts Institute of Technology.



Thanks to John Gideon and Warren Stewart for their valuable comments.
Copyright © 2005 VotersUnite!