
The law of politics and the politics of law: election law, campaign finance, legislation, voting rights, initiatives, redistricting, and the Supreme Court nomination process
Rick Hasen's web log
April 20, 2005
Cherry-Picking Data to Bolster Statistical Argument in Washington State Recount?
Following up on this post, see this Seattle Times article. Michael McDonald also questions whether the GOP-expert statistical report has failed to include measures of statistical uncertainty in the calculations. UPDATE: McDonald provides more information here. UPDATE 2: You can find the Katz and Gill reports posted here. FINAL UPDATE: Michael McDonald has now analyzed the Katz report and sends along the following analysis:
I have reviewed the method used by Jonathan Katz. I am glad to report that he did indeed calculate a measure of uncertainty for his estimate presented
on p.8 of his report. However, I believe that he used the wrong measure of uncertainty, one that greatly understates the uncertainty of his estimates, as I
will elaborate. The correct measure of uncertainty throws the validity of
his conclusions into doubt. As Dr. Katz mentions, it is unfair to attack
his work when he can't respond because of legal considerations. He may have
a reasonable explanation why he used his approach. So, please keep this in
mind until he has had a chance to defend his work. However, I note that all
I present below is elementary statistics that can be found in any
introductory statistics book.
The task here is to estimate how many 'improper votes' each of the
candidates received so that the election outcome could be known if these
votes were removed. The most appropriate method is to assume the 'improper
votes' are random draws from the population of voters, much as is done
in polling. In the familiar framework of polling, the best guess, or
expected value, of the percentage of support for a candidate is the
percentage from the sample. However, because the sample is a random draw
from a distribution, that value will be off by some amount, which we
familiarly know as the margin of error of the poll.
The method of random sampling can be used in this situation, where we might
think of the 'improper votes' as draws from the population of voters. The
only significant difference from the polling framework in this case is that
the true percentage support for a candidate within a jurisdiction is known:
it is the election result. However, we can still apply the same formula to
describe the margin of error of a poll in this situation (in statistical
lingo, we use p in the place of p-hat).
Here is an example of how to calculate the margin of error around the
expected value, drawn from p.7 of Dr. Katz's report, where 157 'invalid
votes' are alleged to have occurred in King County, where Gregoire
received 57.7% of the vote.
Step 1) 157 'invalid votes' were alleged to have been cast by felons in King
County, where Gregoire won 57.7% of the vote. The expected number of votes
for Gregoire is Np = 0.577*157 = 90.59 (Dr. Katz finds the same number).
Step 2) The standard error of the sample of 'invalid votes' (from any
introductory statistics book) is: s.e. = [p(1-p)/N]^.5 =
[(.577)*(.423)/157]^.5 = .039 (unreported by Dr. Katz).
Step 3) The margin of error for the 95% confidence interval is s.e.*1.96,
or ± .0772. In terms of 'invalid votes,' .0772*157 = 12.13 (unreported by Dr.
Katz).
Step 4) The 95% confidence interval on the vote by felons for Gregoire is
therefore 90.59 ± 12.13, or 78.46 to 102.72 (unreported by Dr. Katz).
Note that half of 157 is 78.5 (unreported by Dr. Katz).
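Steps 1 through 4 can be checked with a few lines of Python. This is only a minimal sketch of the arithmetic described above, not Dr. Katz's or McDonald's actual code; the function name `binomial_interval` and its parameter names are made up for illustration, while the inputs (157 votes, a 57.7% share, and the 1.96 multiplier) come from the steps themselves.

```python
import math

def binomial_interval(n_votes, p, z=1.96):
    """Expected count and 95% interval for n_votes random draws with known share p.

    Hypothetical helper written for this illustration only.
    """
    expected = n_votes * p                      # Step 1: Np
    se = math.sqrt(p * (1 - p) / n_votes)       # Step 2: standard error of the share
    margin_votes = z * se * n_votes             # Step 3: margin of error, in votes
    low = expected - margin_votes               # Step 4: interval bounds
    high = expected + margin_votes
    return expected, se, margin_votes, low, high

# King County figures quoted in the steps above.
expected, se, margin_votes, low, high = binomial_interval(157, 0.577)
half = 157 / 2

print(f"expected votes for Gregoire: {expected:.2f}")
print(f"standard error (share):      {se:.4f}")
print(f"margin of error (votes):     {margin_votes:.2f}")
print(f"95% interval:                {low:.2f} to {high:.2f}")
print(f"interval includes an even split ({half}): {low <= half <= high}")
```

The last line tests the point made in the note: whether half of the 157 votes falls inside the interval.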
Thus, the conclusion is that within King County we cannot state with a
reasonable degree of statistical certainty that Gregoire won a majority of
the 'invalid votes,' much less enough of them to tip the election in favor
of Rossi. Right away, this example casts doubt on the remainder of Dr. Katz's
analysis.
On p.8, Dr. Katz states that the 95% confidence interval for the 'invalid
votes' statewide is 508.03 to 576.03 (I calculate the upper bound from the
information on p.8 of the report). Reversing the math above, the margin of
error is (542.03-508.03)/(542.03+336.97) = 0.0387.
The margin of error that Dr. Katz claims for all 'invalid votes' is .0387.
The statewide margin of error is apparently derived using the statewide
total number of 'invalid votes': s.e. = [(.50)*(.50)/879]^.5 = .0169, and
.0169*1.96 = 0.0331. The numbers don't replicate exactly, but there may be
some rounding issues; one clue is that Dr. Katz's reported statewide margin
of error is 34.00 votes.
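The statewide comparison can be reproduced the same way. The sketch below is only an illustration under stated assumptions: the variable names are invented, the figures 542.03, 336.97 and 508.03 are the ones quoted above from p.8 of the report, and treating 336.97 as the expected votes not going to Gregoire is a reading of the formula above rather than something confirmed against the report.

```python
import math

# Figures quoted above from p.8 of the report (variable names are illustrative).
gregoire_expected = 542.03   # expected 'invalid votes' for Gregoire
other_expected = 336.97      # remaining expected 'invalid votes' (assumed reading)
lower_bound = 508.03         # reported lower end of the 95% interval

total_invalid = gregoire_expected + other_expected

# Margin of error implied by the reported interval, as a share of all invalid votes.
implied_margin = (gregoire_expected - lower_bound) / total_invalid

# Margin of error from the simple statewide formula with p = 0.5.
se_statewide = math.sqrt(0.5 * 0.5 / total_invalid)
formula_margin = 1.96 * se_statewide

print(f"total invalid votes:      {total_invalid:.2f}")
print(f"implied margin (share):   {implied_margin:.4f}")
print(f"formula margin (share):   {formula_margin:.4f}")
print(f"formula margin (votes):   {formula_margin * total_invalid:.2f}")
```

The two margins are close but not identical, which matches the observation that the numbers don't replicate exactly.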
Now, note the King County margin of error was larger than the statewide
margin of error (.0772 for King County vs. .0331 for statewide). Why is
this? The margin of error is related to the sample size. Thus, the margin
of error is larger for the smaller number of 'invalid votes' cast in King
County, and it is smaller for the larger number of 'invalid votes' counted
statewide. Again, in familiar polling terms, the margin of error is smaller
for larger polls, by a factor of (N)^(-.5).
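The N^(-.5) scaling is easy to see numerically. In this small sketch the helper `margin_of_error` is a made-up name, and p is held at 0.5 for every sample size so that only N changes (the King County example above used p = .577 instead); 628 is included simply because it is four times 157.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error, as a share, for n random draws with known share p."""
    return z * math.sqrt(p * (1 - p) / n)

# Quadrupling the sample size (157 -> 628) halves the margin of error.
for n in (157, 628, 879):
    print(f"N = {n:4d}  margin of error = {margin_of_error(n):.4f}")
```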
It is important to understand that working with standard errors can be
counterintuitive. The overall standard error obtained by summing together
all jurisdictions is the square root of the sum of the squares of the
standard errors for each jurisdiction. Unfortunately, under this formula,
adding together standard errors cannot result in a smaller
total standard error than any single component standard error. Thus,
sampling within subgroups using smaller samples, such as precincts,
increases the error, and that error doesn't automatically shrink when summing
across precincts, as might be intuitively assumed when the law of large
numbers is typically applied.
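To make the square-root-of-the-sum-of-squares rule concrete, here is a minimal sketch using standard errors expressed in votes. The three precincts and their figures are entirely hypothetical, chosen only so that they add up to 157 'invalid votes'; they are not taken from Dr. Katz's data, and the helper name `se_in_votes` is invented for this illustration.

```python
import math

# HYPOTHETICAL precincts: (number of 'invalid votes', Gregoire's share of the vote).
# These numbers are made up for illustration and are not from the report.
precincts = [(40, 0.70), (60, 0.55), (57, 0.52)]

def se_in_votes(n, p):
    """Standard error, in votes, of the count for one candidate among n random draws."""
    return math.sqrt(n * p * (1 - p))

component_se = [se_in_votes(n, p) for n, p in precincts]

# Combined standard error: square root of the sum of the squared components.
total_se = math.sqrt(sum(se ** 2 for se in component_se))

for (n, p), se in zip(precincts, component_se):
    print(f"precinct with N = {n}, p = {p:.2f}: s.e. = {se:.2f} votes")
print(f"combined s.e. across precincts: {total_se:.2f} votes")
print(f"combined s.e. >= largest component: {total_se >= max(component_se)}")
```

Because each squared term is non-negative, the combined figure can never fall below the largest single component, which is the point made in the paragraph above.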
Why is this important for Dr. Katz's report? If the statewide expected
value of a 50% vote for Gregoire were applied to the total number of 'invalid
votes,' Gregoire and Rossi would split the votes evenly,
and the case that Rossi really won if the 'invalid votes' were removed
could not be made. Instead, the 'invalid votes' must be assigned to some other
unit; Dr. Katz chooses the precinct. Since precincts where alleged 'invalid
votes' occurred tend to be in more Democratic jurisdictions, the expected
value works in Rossi's favor when the assigned 'invalid votes' are summed
across all precincts.
However, the standard error across precincts IS NOT THE STATEWIDE VALUE. It
is the square root of the sum of the squares of the standard errors within
precincts. I don't have the data to do the full replication, but the
resulting standard error, and the associated margin of error, will be much
larger than reported on p.8 of Dr. Katz's report. The King County example
alone shows that the margin of error will at least be on the order of 7.72%.
Thus, I am very certain that using Dr. Katz's method, it is impossible to
state with a reasonable degree of statistical certainty that Rossi would
have won the election if all of the alleged 'invalid votes' were removed
from the election results.
More simply put, Dr. Katz can't have his cake and eat it, too. He cannot
simultaneously use the statewide measure of uncertainty AND the precinct
level expected values of 'invalid votes' for Gregoire and Rossi. The
statistics must be calculated all at the statewide or all at the precinct
level, neither of which will support a claim that Rossi truly won the
election.
Again, I want to caution that I do not have the replication data to confirm
my strong suspicions and I would like to give Dr. Katz the benefit of the
doubt that it is I who am in (standard) error.
Posted by Rick Hasen at April 20, 2005 11:44 AM