# Why Select a Sample?


## Chapter: Biostatistics for the Health Sciences: Defining Populations and Selecting Samples

Often, it is too expensive or impossible to collect information on an entire population. For appropriately chosen samples, accurate statistical estimates of population parameters are possible. Even when we are required to count the entire population as in a U.S. decennial census, sampling can be used to improve estimates for important subpopulations (e.g., states, counties, cities, or precincts).
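As a rough illustration of how a well-chosen sample can estimate a population parameter, the sketch below draws a simple random sample from a simulated population. The population size, the true proportion of 0.30, the sample size, and the function name `estimate_proportion` are all assumptions made for this example; none of them come from the text.

```python
import random

def estimate_proportion(population, sample_size, seed=0):
    """Estimate the share of 1s in `population` from a simple random sample."""
    rng = random.Random(seed)                     # fixed seed so the run is repeatable
    sample = rng.sample(population, sample_size)  # sampling without replacement
    return sum(sample) / sample_size

# Hypothetical population: 100,000 people, 30% of whom have some trait (coded 1).
population = [1] * 30_000 + [0] * 70_000
true_proportion = sum(population) / len(population)

estimate = estimate_proportion(population, sample_size=2_000)
print(f"true proportion: {true_proportion:.3f}")
print(f"sample estimate: {estimate:.3f}")
```

Here only 2% of the simulated population is examined, yet the estimate would typically land within a few percentage points of the true value.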

In the 2000 U.S. presidential election, we learned that the outcome in a single state (Florida) was close enough to be in doubt as a consequence of various types of counting errors or exclusion rules. So even when we think we are counting every vote accurately, we may not be; surprisingly, a sample estimate may be more accurate than a “complete” count.

As an example of a U.S. government agency that uses sampling, consider the Internal Revenue Service (IRS). The IRS does not have the staff necessary to review every tax return for mistakes or misrepresentation; instead, it selects a sample of returns, applying statistical methods to make it more likely that returns prone to error or fraud are included.

A second example arises from reliability studies, which may use destructive testing procedures. To illustrate, a medical device company often tests the peel strength of its packaging material. The company wants the material to peel when suitable force is applied but does not want the seal to come open upon normal handling and shipping. The purpose of the seal is to maintain sterility for medical products, such as catheters, contained in the packages. Because these catheters will be placed inside patients’ hearts to treat arrhythmias, maintenance of sterility in order to prevent infection is very important. When performing reliability tests, it is feasible to peel only a small percentage of the packages, because it is costly to waste good packaging. On the other hand, accurate statistical inference requires selecting sufficiently large samples.
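The trade-off between testing cost and precision can be sketched with the usual large-population approximation for the standard error of a sample proportion, sqrt(p(1 - p)/n). The 5% seal-failure rate and the sample sizes below are hypothetical values picked for illustration, and `standard_error` is a helper name introduced here.

```python
import math

def standard_error(p, n):
    """Approximate standard error of a sample proportion (simple random sample)."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical seal-failure rate of 5%; sample sizes are illustrative only.
assumed_failure_rate = 0.05
for n in (25, 100, 400, 1600):
    se = standard_error(assumed_failure_rate, n)
    print(f"n = {n:5d}  standard error = {se:.4f}")
```

Under this approximation, each fourfold increase in the number of packages tested halves the standard error, which is why very small destructive-test samples give imprecise estimates.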

One of the main challenges of statistics is to select a sample in an efficient, appropriate way; the goal of sample selection is to be as accurate as possible in order to draw a meaningful inference about population characteristics from results of the sample. At this point, it may not be obvious to you that the method of drawing a sample is important. However, history has taught us that it is very easy to draw incorrect inferences because samples were chosen inappropriately.

We often see the results of inappropriate sampling in television and radio polls. This subtle problem is known as selection bias. Often we are interested in a wider target population, but the poll is based only on those individuals who listened to a particular TV or radio program and chose to answer the questions. For instance, if there is a political question and the program has a Republican commentator, the audience may be more heavily Republican than the general target population. Consequently, the survey results will not reflect the target population. In this example, we are assuming that the response rate was sufficiently high to produce reliable results had the sample been random.

Statisticians also refer to this type of sampling error as response bias. This bias often occurs when volunteers are asked to respond to a poll. Even if the listeners of a particular radio or TV program are representative of the target population, those who respond to the poll may not be. Consequently, reputable poll organizations such as Gallup or Harris use well-established statistical procedures to ensure that the sample is representative of the population.

A classic example of failure to select a representative sample of voters arose from the Literary Digest Poll of 1936. In that year, the Literary Digest mailed out some 10 million ballots asking individuals to provide their preference for the upcoming election between Franklin Roosevelt and Alfred Landon. Based on the survey results derived from the return of 2.3 million ballots, the Literary Digest predicted that Landon would be a big winner.

In fact, Roosevelt won the election with a handy 62% majority. This single poll destroyed the credibility of the Literary Digest and soon caused it to cease publication. Subsequent analysis of its sampling technique showed that the list of 10 million persons was taken primarily from telephone directories and motor vehicle registration lists. In more recent surveys of voters, public opinion organizations have found random-digit-dialed telephone surveys, as well as surveys of drivers, to be acceptable, because almost every home in the United States has a telephone and almost all citizens of voting age own or lease automobiles and hence have driver’s licenses. The requirement for the pollsters is not that the list be exhaustive but rather that it be representative of the entire population and thus not capable of producing a large response or selection bias. However, in 1936, mostly Americans with high incomes had phones or owned cars.

The Literary Digest poll selected a much larger proportion of high-income families than is typical in the voting population. Also, the high-income families were more likely to vote Republican than the lower-income families. Consequently, the poll favored the Republican, Alf Landon, whereas the target population, which contained a much larger proportion of low-income Democrats than were in the survey, strongly favored the Democrat, Franklin Roosevelt. Had these economic groups been sampled in the appropriate proportions, the poll would have correctly predicted the outcome of the election.
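The mechanism described above can be mimicked with a small simulation. This is only a sketch: the income-group shares, the voting preferences within each group, the poll size, and the function names are all hypothetical values chosen to reproduce the direction of the 1936 bias, not historical data.

```python
import random

def simulate_poll(n, share_high_income, p_dem_high, p_dem_low, rng):
    """Return the Democratic share among n simulated respondents."""
    democrat_votes = 0
    for _ in range(n):
        is_high_income = rng.random() < share_high_income
        p_dem = p_dem_high if is_high_income else p_dem_low
        if rng.random() < p_dem:
            democrat_votes += 1
    return democrat_votes / n

# Hypothetical electorate: 30% high income. Assumed preferences:
# 30% of high-income and 70% of low-income voters favor the Democrat.
P_DEM_HIGH, P_DEM_LOW = 0.30, 0.70
TRUE_SHARE_HIGH = 0.30
true_dem_share = TRUE_SHARE_HIGH * P_DEM_HIGH + (1 - TRUE_SHARE_HIGH) * P_DEM_LOW

rng = random.Random(1936)  # fixed seed so the run is repeatable

# Biased frame: 80% of respondents are high income (assumed figure).
biased = simulate_poll(20_000, 0.80, P_DEM_HIGH, P_DEM_LOW, rng)
# Representative frame: income groups appear in their true proportions.
representative = simulate_poll(20_000, TRUE_SHARE_HIGH, P_DEM_HIGH, P_DEM_LOW, rng)

print(f"true Democratic share:     {true_dem_share:.3f}")
print(f"biased poll estimate:      {biased:.3f}")
print(f"representative poll est.:  {representative:.3f}")
```

With these assumed numbers, the over-sampled high-income group pulls the biased estimate below 50% even though the simulated electorate favors the Democrat, while the proportionally sampled poll stays close to the true share.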
