Hello my name is Jeevan Padiyar. This is my personal and professional blog.

Is your data lying to you?

As the book industry continues to change, we are inundated with statistics about user behavior:

  • 49% of e-book readers are bought as gifts [Bowker]
  • 28% of US adults are avid (5+ hours/week) readers [Verso] - 64MM avid readers
  • The heart of the U.S. romance novel readership is women aged 31-49 who are currently in a romantic relationship. [Romance Writers of America]

These statistical nuggets are great because in isolation they give us a glimpse into why people do what they do, and how we can adjust our business to match market needs. But how often do we blindly accept data because it comes with pretty graphs and sound bites that seem to make sense? Probably more often than we'd like to admit.

The best way to avoid being led astray is to look at what biases have been introduced into a study before using its data to make a decision. Bias is systematic favoritism in the data collection process that produces misleading results. Two types of bias are hazards in studies: selection bias and measurement bias.

  • Selection Bias can occur when the group that is surveyed does not accurately reflect the target of the study, or is simply too small to matter. For example, if a study claims to describe the behavior of all readers in the U.S. but only surveys 30 stay-at-home moms in Indiana, it is hardly representative of every reader in the country.
  • Measurement Bias occurs when the questions asked favor a specific outcome. A survey question like "Do you agree that e-books are replacing print books as the preferred medium?" will deliver very different results than one that asks readers to choose their preferred medium from among e-books, purchased p-books or books checked out from the library.
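
To make selection bias concrete, here is a minimal Python sketch. The population, the two shopper groups and their e-book preferences are all made-up numbers chosen for illustration, not data from any study. It compares a sample drawn at random from everyone with a sample drawn from only one group.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 readers (made-up numbers, illustration only).
# Each reader is a (group, prefers_ebooks) pair.
online_shoppers = [("online", i < 2400) for i in range(3000)]  # 80% prefer e-books
store_shoppers = [("store", i < 1400) for i in range(7000)]    # 20% prefer e-books
population = online_shoppers + store_shoppers


def ebook_share(people):
    """Fraction of the given people who prefer e-books."""
    return sum(prefers for _, prefers in people) / len(people)


true_share = ebook_share(population)  # 0.38 by construction

# Random sample: every reader has the same chance of being picked.
random_estimate = ebook_share(rng.sample(population, 1000))

# Biased sample: only online shoppers are ever surveyed.
biased_estimate = ebook_share(rng.sample(online_shoppers, 1000))

print(f"true share:    {true_share:.2f}")
print(f"random sample: {random_estimate:.2f}")
print(f"biased sample: {biased_estimate:.2f}")
```

The random sample should land close to the true share, while the online-only sample overstates it badly no matter how many people are surveyed. A bigger sample does not repair a sample drawn from the wrong group.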

As you read a study, ask yourself the following questions to determine if the authors tried to mitigate bias. Remember: the target population is the group that you want to generalize about, and the sample is the group that you actually survey in order to make those generalizations.

1.       Is the target population well-defined? (The closely related sampling frame is the list of people from which the sample is actually drawn.) If the target population isn't well-defined, the study may contain people outside the target, or it may exclude people who are relevant. In researching e-book reader purchase behavior, a well-defined population could be American consumers who purchased an e-book reader either online or in a physical store over the last 2 years. But if a study only looked at online shoppers at Christmas, the results could be skewed towards gift givers, and they could not be generalized to consumers who bought e-book readers in stores.

2.       Is the sample randomly selected from the target population? In a truly random sample, every member of the target population has the same chance of being included in the study. When asking this question, be wary of surveys that are conducted exclusively on the web but draw generalizations about all people. Participants in these studies are not randomly selected: they are only a slice of the traffic to a given domain, so at best the results can only ever speak to the habits of the users of the particular site conducting the research.

3.       Does the sample represent the target population? Here it is important to look at all of the characteristics of the target population to see if they are mirrored in the sample. If you are looking to figure out the book purchase habits of Americans, make sure the sample has the same diversity of ethnicity, geographic distribution and age as is reported in the latest U.S. census. 

4.       Is the sample large enough? For a randomly selected sample, the larger the sample, the smaller the margin of error. A quick way to estimate whether a sample is large enough to produce a reasonably small margin of error is to divide 1 by the square root of the sample size (Margin of Error = 1/√Sample Size). So a 1,500-person survey would produce a margin of error of about 2.58%. The sample size in this calculation must be the number of people who responded to the survey, not the number of survey requests that were sent out.

5.       What is the response rate for the survey? The response rate is the share of the people asked to take a survey who actually responded. If the response rate is too low, a study may only reflect people who have a strong opinion about the topic, biasing the results toward their opinions rather than those of the larger and less vociferous target population. A "good" response rate depends on the margin of error that a study is looking to achieve (or that it claims), and the size of the target population being studied. There are two factors to consider here. The first one is a no-brainer: the higher the response rate, the more accurate the study. The second is a little more subtle: the larger the target population being examined, the lower the response rate required for the same level of accuracy. The linked figure helps explain the relationship graphically. (According to the chart, for a study that is looking to achieve a margin of error of +/- 5%, and is studying a population of 2000 people, the response rate needs to approach 20% to achieve the desired result.) At the end of the day, know the response rate and make sure it is consistent with the margin of error that the study purports to achieve.

6.       Do the questions appear to be leading the respondents into a particular answer? If they do, run the other way! This means that the researchers' agenda is adding a measurement bias and the results aren't worth the paper they are printed on. Also be wary of any study that doesn't share its sampling method, sample characteristics and survey questions.
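
The arithmetic behind questions 4 and 5 can be sketched in a few lines of Python. The first function is the 1/√n rule of thumb described above. The second is an assumption on my part: it uses Cochran's sample size formula with a finite population correction, a common textbook method at a 95% confidence level, which is not necessarily the method behind the chart. The function names and default parameters are illustrative.

```python
import math


def quick_margin_of_error(respondents):
    """Rule-of-thumb margin of error, as a fraction: 1 / sqrt(respondents)."""
    return 1 / math.sqrt(respondents)


def required_sample_size(population_size, margin_of_error, z=1.96, p=0.5):
    """Responses needed to reach a given margin of error.

    Assumed method: Cochran's formula with a finite population correction.
    z=1.96 corresponds to 95% confidence; p=0.5 is the most cautious guess
    for the share of people giving any one answer.
    """
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2  # very large population
    n = n0 / (1 + (n0 - 1) / population_size)           # shrink for a finite one
    return math.ceil(n)


print(f"{quick_margin_of_error(1500):.2%}")  # 1,500 respondents

for size in (2000, 200000):
    needed = required_sample_size(size, 0.05)
    print(f"population {size}: {needed} responses ({needed / size:.2%} of the population)")
```

Under these assumptions a population of 2,000 needs 323 responses for a margin of error of +/- 5%, a little over 16% of the population, while a population of 200,000 needs only 384, well under 1%. That is the second, subtler factor from question 5: the share of a population that must respond falls quickly as the population grows.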

In the end, the goal of a survey is to accurately describe a larger population. This can only be done if great care is taken 1) to ensure that the results wouldn't change much if another sample were taken under the same conditions and 2) to reduce the biases that can be introduced into the system.
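
The idea that results shouldn't change much from one sample to the next can be checked with a small simulation. This is a sketch under stated assumptions: the true share of "yes" answers (50%), the survey size and the number of repeats are made-up values, and each respondent is modeled as an independent random draw from a very large population.

```python
import math
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

TRUE_SHARE = 0.5     # assumed share of the population answering "yes" (made up)
SAMPLE_SIZE = 1500   # respondents per simulated survey
REPEATS = 1000       # number of simulated surveys

margin = 1 / math.sqrt(SAMPLE_SIZE)  # rule-of-thumb margin of error


def run_survey():
    """Simulate one survey: each respondent says yes with probability TRUE_SHARE."""
    yes_answers = sum(rng.random() < TRUE_SHARE for _ in range(SAMPLE_SIZE))
    return yes_answers / SAMPLE_SIZE


estimates = [run_survey() for _ in range(REPEATS)]

# Fraction of simulated surveys whose result falls within the margin of error.
within_margin = sum(abs(e - TRUE_SHARE) <= margin for e in estimates) / REPEATS

print(f"margin of error: {margin:.2%}")
print(f"surveys within the margin: {within_margin:.1%}")
```

With truly random sampling, roughly 95% of the simulated surveys should land within the margin of error of the true value. The simulation says nothing about bias: a badly chosen sample or a leading question would give results that repeat just as reliably and are still wrong.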