Eight Laws of Statistics

Jim Lewis, PhD and Jeff Sauro, PhD

Statistics doesn’t have a Magna Carta, constitution, or bill of rights to enumerate laws, guiding principles, or limits of power. There have been attempts, however, to articulate credos for statistical practice. Two of the most enduring are based on the work of Robert P. Abelson, a former statistics professor at Yale. If Abelson wasn’t the Thomas Jefferson of statistical laws, he might be the James Madison (author of the Bill of Rights).

In 1995, Abelson published Statistics as Principled Argument, a book targeted at students taking graduate statistics classes to help them understand how to use statistics to develop research narratives using principled arguments informed by statistical findings.

In an earlier article, we discussed Abelson’s styles of statistical rhetoric, reviewing the four he defined and adding a fifth one.

Abelson’s other organizing strategy, presented at the beginning of his book and distributed through the chapters, was a list of eight “laws.” We review these laws in this article.

Abelson’s Eight Laws

1. Chance is lumpy.

Consider the following two strings of heads (H) and tails (T). Which one seems more random—A or B?

Most people pick A, but Sequence A was produced by a human trying to make a sequence that looked random. Sequence B was produced using the Excel RAND() function to simulate coin tosses.

It seems unlikely that in a random sequence you would get, for example, seven heads in a row or even five tails in a row. But maybe that random sequence was unusual? To check, we ran another 50 sets of 50 simulated coin tosses, noting for each set the longest run of consecutive heads or tails, summarized in Figure 1.

Figure 1: Maximum strings of heads or tails in 50 sets of 50 fair coin tosses.

In 50 sets of 50 fair coin tosses, there were always at least four heads or tails in a row, sometimes more (but in this experiment, never more than 12). The giveaway that Sequence A above is not random is that there are never more than three heads or tails in a row—the very thing that makes it look more random.
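To make the lumpiness easy to see for yourself, here’s a minimal Python sketch (our illustration; the original simulation used Excel’s RAND() function) that reruns the exercise: 50 sets of 50 fair coin tosses, recording the longest run of consecutive heads or tails in each set.

```python
import random

def longest_run(tosses):
    """Return the length of the longest run of identical outcomes."""
    longest = current = 1
    for prev, nxt in zip(tosses, tosses[1:]):
        current = current + 1 if nxt == prev else 1
        longest = max(longest, current)
    return longest

random.seed(1)  # any seed shows the same general pattern

# 50 sets of 50 fair coin tosses, as in Figure 1
max_runs = [longest_run([random.choice("HT") for _ in range(50)])
            for _ in range(50)]

print(min(max_runs), max(max_runs))  # runs of 4+ are the norm, not the exception
```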

UX researchers usually don’t encounter data that we expect to follow a 50–50 binomial distribution, but this exercise shows how chance is lumpier than we tend to think. One exception that does come up in our research is when we test two conditions and randomly assign participants to condition A or B using MUIQ®. Because of the lumps, the random assignment often looks a lot less random than expected.

“People generally fail to appreciate that occasional long runs of one or the other outcome are a natural feature of random sequences” (Abelson, 1995, p. 21).

2. Overconfidence abhors uncertainty.

Consistent with the misperception that chance is more regular than its actual lumpiness, people (including researchers) tend to underestimate the extent to which measurements can vary from one sample to another.

“Psychologically, people are prone to prefer false certitude to the daunting recognition of chance variability” (Abelson, 1995, p. 27).

That’s why it’s important to compute confidence intervals around estimated values, whether means or percentages, especially when sample sizes are small. Confidence intervals reveal to researchers the actual precision (or lack thereof) of their measurements, making it harder to be fooled by the randomness of small-sample data. This is also one of the reasons we encourage researchers planning comparative studies (e.g., benchmark studies) to use a sample size large enough to differentiate the signal of a real difference from the noise of sampling error.
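For instance, here’s a minimal Python sketch of the kind of interval we have in mind, using the adjusted-Wald (Agresti–Coull) method for a binomial proportion; the data (9 of 12 participants completing a task) are hypothetical.

```python
from scipy.stats import norm

def adjusted_wald_ci(successes, n, confidence=0.95):
    """Adjusted-Wald (Agresti-Coull) interval for a binomial proportion."""
    z = norm.ppf(1 - (1 - confidence) / 2)       # 1.96 for a 95% interval
    n_adj = n + z**2                             # add z^2 pseudo-trials...
    p_adj = (successes + z**2 / 2) / n_adj       # ...half successes, half failures
    margin = z * (p_adj * (1 - p_adj) / n_adj) ** 0.5
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Hypothetical small-sample result: 9 of 12 participants completed the task
low, high = adjusted_wald_ci(9, 12)
print(f"Observed 75%; 95% CI roughly {low:.0%} to {high:.0%}")  # ~46% to ~92%
```

The wide interval (roughly 46% to 92%) is the point: a 75% completion rate from 12 participants is far less precise than it looks.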

3. Never flout a convention just once.

This rule is tightly connected to the concept of styles of statistical rhetoric. Whether by inclination or in consideration of the requirements of a particular research context, researchers who adopt a statistical style for a series of studies should apply it consistently for the duration of that series.

“Either stick consistently to conventional procedures, or better, violate convention in a coherent way if informed consideration provides good reason for so doing” (Abelson, 1995, p. 70).

For example, we’ve written about the conventional practice of using p < .05 as a criterion when conducting tests of significance. The history of setting alpha to .05 shows that it’s a century-old convention that has some empirical basis, but it is just a convention, not a law of nature. When we adopt our typical liberal style for industrial UX research, we usually use p < .10 as the criterion for statistical significance. This less stringent criterion increases statistical power (the ability to detect real differences), at the cost of a somewhat higher risk of false positives.
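As an illustration of that power difference, here’s a short sketch using statsmodels; the medium effect size (d = 0.5) and group size (n = 30) are hypothetical values chosen for the example.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Hypothetical two-group design: d = 0.5, 30 participants per group
for alpha in (0.05, 0.10):
    power = analysis.power(effect_size=0.5, nobs1=30, alpha=alpha, ratio=1.0)
    print(f"alpha = {alpha:.2f}: power = {power:.2f}")
```

For this design, relaxing alpha from .05 to .10 raises power from roughly .47 to .60.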

Once you have decided on the appropriate rejection criterion for a study, stick with it. Avoid floating between p < .05 and p < .10 just to keep the findings you like and dismiss those you find inconvenient.

4. Don’t talk Greek if you don’t know the English translation.

Abelson brings up this rule in his discussion on dealing with multiple dependent variables. One statistical method, multivariate analysis of variance (MANOVA), combines multiple dependent variables to allow for an overall assessment of statistical significance. In most research settings, however, Abelson argues against “MANOVA mania.”

“The output tables from a MANOVA are replete with omnibus tests, and unless the investigator is sophisticated enough to penetrate beyond this level, the results remain unarticulated blobs” (Abelson, 1995, p. 128).

MANOVA uses linear modeling to combine a set of dependent variables in a way that maximizes the distance between the means of those combinations (centroids), so a significant result indicates only that there is some way to combine the dependent variables that produces a statistically significant difference. There is no consideration of whether that combination makes any practical sense in the real world.

The fundamental problem with MANOVA in applied UX research is that we are often working with measurements that may be somewhat correlated but present different views of the user experience. It makes more sense to analyze such measurements separately (e.g., success rates, ease ratings, completion times) than to blindly combine them and try to make sense of that combination (as opposed to using a principled combination like the Single Usability Metric).
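To make the contrast concrete, here’s a hedged sketch (our own, run on simulated data) that fits the omnibus MANOVA and then the separate univariate tests we’d actually report; the study design and numbers are invented for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(7)
n = 30  # simulated participants per design

# Simulated study: two designs, two UX measures per participant
df = pd.DataFrame({
    "design": ["A"] * n + ["B"] * n,
    "ease": np.concatenate([rng.normal(5.0, 1.0, n), rng.normal(5.6, 1.0, n)]),
    "time": np.concatenate([rng.normal(60, 15, n), rng.normal(52, 15, n)]),
})

# The omnibus MANOVA says only that *some* combination of measures differs
print(MANOVA.from_formula("ease + time ~ design", data=df).mv_test())

# Separate univariate tests keep each measure interpretable on its own
for dv in ("ease", "time"):
    a, b = df[df.design == "A"][dv], df[df.design == "B"][dv]
    t, p = ttest_ind(a, b)
    print(f"{dv}: t = {t:.2f}, p = {p:.3f}")
```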

In general, don’t use complex analytical methods (“talk Greek”) unless you know how to dig into the practical details (“know the English translation”). It’s often the case that you don’t need to “talk Greek” in the first place; all you need to do is conduct the appropriate fundamental analyses.

5. If you have nothing to say, don’t say anything.

It is rare for a study to have no statistically significant outcomes, but it can happen due to bad luck or problems with the research conception. When it does happen, don’t torture the data to extract a false confession: move on to the next study.

“When nothing works, nothing works” (Abelson, 1995, p. 130).

We’ve helped analyze large studies intended to explain relationships between variables such as future usage intention and purchase rates. Often we’ve found either no statistical differences or very modest predictive ability. It can be hard for those commissioning and running an analysis to have nothing to say, but it’s better to say nothing than to make false claims.

6. There is no free hunch.

Based on the saying, “There is no such thing as a free lunch,” this law applies to the generalizability of research findings.

Strictly speaking, the results of any single experiment cannot be generalized outside of the bounds of the context established by its sampling strategy, independent variables, and dependent measures. The willingness to overgeneralize results from specific sets of stimuli and tasks may be a major contributor to the replication crisis.

“This is simply the way the research life is. One does not deserve a general result by wishing it” (Abelson, 1995, p. 142).

One fairly common way to partially address this in UX research is to use a mixed-method approach, especially by combining qualitative and quantitative studies.

7. You can’t see the dust if you don’t move the couch.

The seventh law is also related to the concept of generalizability, specifically the difficulty of generalizing from one context to another (e.g., from Intro to Psychology students to the U.S. population at large, or from the U.S. population to the population of the world). In any given study, it’s difficult to manipulate more than a few context variables, moving the metaphorical couch to see what happens.

At MeasuringU, we conduct a lot of research to help us understand the effects of different contexts that are of interest to UX researchers. We often find that manipulating a context has no discernible effect on outcomes, for example, when changing the wording of item endpoints or styles of response options in rating scales. Sometimes, however, the manipulation does matter; for example, manipulating the culture from which a sample of respondents is drawn.

“The only sure way to have knowledge of a context variable is to vary it” (Abelson, 1995, p. 155).

8. Criticism is the mother of methodology.

Any domain of research has to start somewhere. Because it isn’t possible to design a single study that answers all questions, any study will be open to some sort of criticism. Once a plausible criticism is raised, researchers start planning how to conduct additional research that will be immune to that criticism.

“As research cumulates under pressure from the exchange of counterarguments, previous theoretical generalizations will be supported, modified, or abandoned, and new generalizations may emerge. … Thus, principled statistical argument is not only unavoidable, it is fundamental” (Abelson, 1995, p. 198).

For example, researchers sometimes do not vary the order of presentation of experimental conditions, which confounds the passage of time and previous experience (nuisance variables) with the variable of interest. One way to conduct research that is immune to that specific criticism is to use a Greco-Latin square experimental design, which systematically manipulates order and other context variables so they aren’t confounded with the variable of interest.
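As a sketch of the idea (our construction, not Abelson’s): for an odd number of conditions n, two mutually orthogonal Latin squares can be built with simple modular arithmetic, pairing, say, task order with a second context variable so that neither is confounded with serial position.

```python
def greco_latin_square(n):
    """Two orthogonal Latin squares (works for odd n), paired cell by cell."""
    latin = [[(i + j) % n for j in range(n)] for i in range(n)]      # e.g., tasks
    greek = [[(2 * i + j) % n for j in range(n)] for i in range(n)]  # e.g., contexts
    return [[(latin[i][j], greek[i][j]) for j in range(n)] for i in range(n)]

# In a 5x5 square, every (task, context) pair occurs exactly once, and each
# task and each context appears once in every row and every column.
for row in greco_latin_square(5):
    print(row)
```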

Summary

Abelson’s eight laws are not only witty but also useful constructs for researchers who use statistical analysis to guide their research narratives.

The Eight Laws

  • Chance is lumpy.
  • Overconfidence abhors uncertainty.
  • Never flout a convention just once.
  • Don’t talk Greek if you don’t know the English translation.
  • If you have nothing to say, don’t say anything.
  • There is no free hunch.
  • You can’t see the dust if you don’t move the couch.
  • Criticism is the mother of methodology.
