# How Samples Can be Selected


## Chapter: Biostatistics for the Health Sciences: Defining Populations and Selecting Samples

1. Simple Random Sampling
2. Convenience Sampling
3. Systematic Sampling
4. Stratified Random Sampling
5. Cluster Sampling
6. Bootstrap Sampling


### 1. Simple Random Sampling

Statisticians have found that one of the easiest and most convenient methods for achieving reliable inferences about a population is to take a simple random sample. Random sampling ensures unbiased estimates of population parameters. Unbiased means that the average of the sample estimates over all possible samples is equal to the population parameter. Unbiasedness is a statistical property based on probability theory and can be proven mathematically through the definition of a simple random sample.

The concept of simple random sampling involves the selection of a sample of size n from a population of size N. Later in this text, we will show, through combinatorial mathematics, the total number of possible ways (say Z) to select a sample of size n out of a population of size N. Simple random sampling provides a mechanism that gives an equal chance 1/Z of selecting any one of these Z samples. This statement implies that each individual in the population has an equal chance of selection into the sample.
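To make the counting concrete, the sketch below uses the standard result that the number of distinct samples of size n from a population of size N is the binomial coefficient "N choose n". The helper names `count_possible_samples` and `simple_random_sample`, and the toy population of 10 labeled individuals, are illustrative assumptions rather than anything defined in the text.

```python
import math
import random


def count_possible_samples(N: int, n: int) -> int:
    """Number Z of distinct size-n samples (no repeats, order ignored)."""
    return math.comb(N, n)


def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every size-n subset is equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)


# Hypothetical population: individuals labeled 1 through 10.
Z = count_possible_samples(10, 3)
sample = simple_random_sample(range(1, 11), 3, seed=0)

print(Z)        # 120 possible samples of size 3
print(1 / Z)    # chance of drawing any one particular sample
print(sample)   # one simple random sample (depends on the seed)
```

Even for this tiny population there are 120 possible samples, which is why in practice the selection is left to a random mechanism rather than to an enumerated list.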

In Section 2.4, we will show you a method based on random number tables for selecting random samples. Suppose we want to estimate the mean of a population (a parameter) by using the mean of a sample (a statistic). Remember that we are not saying that the individual sample estimate will equal the population parameter. If we were to select all possible samples of a fixed size (n) from the parent population, when all possible means are averaged we would obtain the population parameter. The relationship between the mean of all possible sample means and the population parameter is a conceptual issue specified by the central limit theorem (discussed in Chapter 7). For now, it is sufficient to say that in most applications we do not generate all possible samples of size n. In practice, we select only one sample to estimate the parameter. The unbiasedness property of sample means does not even guarantee that individual estimates will be accurate (i.e., close to the parameter value).
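The unbiasedness property can be checked by brute force on a population small enough to enumerate. The following is a minimal sketch; the five population values are made up for illustration.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population of five measurements.
population = [2, 4, 6, 8, 10]
n = 2

# Enumerate every possible sample of size n (without replacement).
all_samples = list(combinations(population, n))
sample_means = [mean(s) for s in all_samples]

population_mean = mean(population)
average_of_sample_means = mean(sample_means)

print(len(all_samples))                        # 10 possible samples
print(population_mean)                         # 6
print(average_of_sample_means)                 # 6, equal to the population mean
print(min(sample_means), max(sample_means))    # 3 and 9: single samples can miss badly
```

The average over all ten samples recovers the population mean exactly, yet an individual sample mean can be as low as 3 or as high as 9, which is the distinction between unbiasedness and accuracy drawn above.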

### 2. Convenience Sampling

Convenience sampling is just what the name suggests: the patients or samples are selected by an arbitrary method that is easy to carry out. Some researchers refer to these types of samples as “grab bag” samples.

A desirable feature of samples is that they be representative of the population, i.e., that they mirror the underlying characteristics of the population from which they were selected. Unfortunately, there is no guarantee of the representativeness of convenience samples; thus, estimates based on these samples are likely to be biased.

However, convenience samples have been used when it is very difficult or impossible to draw a random sample. Results of studies based on convenience samples are descriptive and may be used to suggest future research, but they should not be used to draw inferences about the population under study.

As a final point, we note that while random sampling does produce unbiased estimates of population parameters, it does not guarantee balance in any particular sample drawn at random. In random sampling, all samples of size n out of a population of size N are equally possible. While many of these samples are balanced with respect to demographic characteristics, some are not.

Extreme examples of nonrepresentative samples are (1) the sample containing the n smallest values in the population and (2) the sample containing the n largest values. Because neither of these samples is balanced, both can give poor estimates.

For example (regarding point 2), suppose a catheter ablation treatment is known to have a 95% chance of success. That means that we expect only about one failure in a sample of size 20. However, even though the probability is very small, it is possible that we could select a random sample of 20 individuals with the outcome that all 20 individuals have failed ablation procedures.
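The size of that "very small" probability can be worked out directly. The sketch below assumes that the 20 outcomes are independent, each with a 5% chance of failure; independence is an assumption added here for the calculation.

```python
# Assumption: outcomes for the 20 patients are independent.
p_success = 0.95
p_failure = 1 - p_success
n = 20

expected_failures = n * p_failure   # about 1 failure expected in 20 patients
p_all_fail = p_failure ** n         # probability that all 20 procedures fail

print(round(expected_failures, 6))  # 1.0
print(p_all_fail)                   # roughly 9.5e-27
```

Under these assumptions the all-failure sample is possible but astronomically unlikely, which is the sense in which random sampling protects against extreme samples without ruling them out.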

### 3. Systematic Sampling

Often, systematic sampling is used when a sampling frame (a complete list of people or objects constituting the population) is available. The procedure is to go to the top of the list and select the first person, or to start at an arbitrary but specified initial point in the list. The choice of the first point does not really matter; it merely starts the process, but it must be specified to make the procedure repeatable. Then we skip the next n people on the list and select person n + 2 (here n denotes the number of names skipped, not the sample size). We continue in this way, skipping n people and selecting the next one, until we have exhausted the list.

Here is an example of systematic sampling: suppose a researcher needs to select 30 patients from a list of 5000 names (as stated previously, the list is called the sampling frame and conveniently defines the population from which we are sampling). The researcher would select the first patient on the list, skip to the thirty-second name on the list, select that name, and then skip the next 30 names and select the next name after that, repeating this process until a total of 30 names has been selected. In this example, the sampling interval (i.e., number of skipped cases) is 30.

In the foregoing procedure, we designated the sampling interval first. Because the thirtieth name selected is number 900 on the list, we would go through only 900 of the 5000 names and would not exhaust the list. Alternatively, we could select a certain percentage of patients, for example, 1%. That would be a sample size of 50 for a list of 5000. Although the choice of the number of names to skip is arbitrary, suppose we skip 100 names on the list; the first patient will be number 1, the second 102, the third 203, the fourth 304, the fifth 405, and so on until we reach the final one, the fiftieth, number 4950. In this case, we nearly exhaust the list, and the selections are spread evenly throughout the list.
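The list positions in both examples can be reproduced with a few lines of code. This is a minimal sketch; the function name `systematic_positions` and its parameters are illustrative assumptions, and positions are 1-based to match the text.

```python
def systematic_positions(list_size, skip, sample_size=None, start=1):
    """Return 1-based list positions: take `start`, skip `skip` names, take the next, and so on."""
    step = skip + 1  # skipping `skip` names means advancing skip + 1 positions
    positions = list(range(start, list_size + 1, step))
    if sample_size is not None:
        positions = positions[:sample_size]
    return positions


# 30 patients from 5000 names, skipping 30 names between selections.
first = systematic_positions(5000, skip=30, sample_size=30)
print(first[:3], first[-1])    # [1, 32, 63] 900

# 1% sample (50 patients) from 5000 names, skipping 100 names between selections.
second = systematic_positions(5000, skip=100, sample_size=50)
print(second[:5], second[-1])  # [1, 102, 203, 304, 405] 4950
```

With a skip of 30 the thirtieth selection falls at position 900, so most of the list is never reached; with a skip of 100 the fiftieth selection falls at position 4950, close to the end of the list.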

As you can see, systematic sampling is easy and convenient when such a complete list exists. If there is no relationship between the order of the people on the list and the characteristics that we are measuring, it is a perfectly acceptable sampling method. In some applications, we may be able to convince ourselves that this situation is true.

However, there are situations in which systematic sampling can be disastrous. Suppose, for example, that one of the population characteristics we are interested in is age. Now let us assume that the population consists of 50 communities in Southern California. Each community contains 100 people.

We construct our sampling frame by sorting each member according to age, from the youngest to the oldest in each community, and then arranging the communities in some order one after another, such as in alphabetical order by community name. Here N = 5,000 and we want n = 50. One way to choose a systematic sample would be to select the first member from each community.

We could have obtained the sample by selecting the first person on the list and then skipping the next 99. But, thereby, we would select the youngest member from each community, thus providing a severely biased estimate (on the low side) of the average age in the population. Similarly, if we were to skip the first 99 people and always take the hundredth, we would be biased on the high side, as we would select only the oldest person in each community.
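A small simulation illustrates how severe this bias can be. The sketch below builds a made-up sampling frame of 50 communities with 100 people each, with ages drawn uniformly between 0 and 90 and sorted within each community; the age distribution and the random seed are assumptions chosen only for illustration.

```python
import random
from statistics import mean

rng = random.Random(1)

# Hypothetical frame: 50 communities of 100 people, ages sorted within each community.
frame = []
for _ in range(50):
    ages = sorted(rng.randint(0, 90) for _ in range(100))
    frame.extend(ages)

population_mean = mean(frame)

# Systematic samples of size 50 with a step of 100 positions.
youngest = frame[0::100]   # take the first person, then skip 99 each time
oldest = frame[99::100]    # skip the first 99, then take every hundredth

# A simple random sample of the same size, for comparison.
srs = rng.sample(frame, 50)

print(round(population_mean, 1))
print(round(mean(youngest), 1))   # far below the population mean
print(round(mean(oldest), 1))     # far above the population mean
print(round(mean(srs), 1))        # typically much closer
```

Because the sampling step matches the length of each sorted community block, every selection lands at the same point in the age ordering, which is exactly the periodic structure that makes systematic sampling fail.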

Systematic sampling can lead to difficulties when the variable of interest is periodic (with period n) in the sequence order of the sampling frame. The term periodic refers to the situation in which groups of elements appear in a cyclical pattern in the list instead of being uniformly distributed throughout the list. We can consider the sections of the list in which these elements are concentrated to be peaks, and the sections in which they are absent to be troughs. If we skip n people in the sequence and start at a peak value, we will select only the peak values. The same result would happen for troughs. For the scenario in which we select the peaks, our estimate will be biased on the high side; for the trough scenario, we will be biased on the low side.

Here is an example of the foregoing source of sampling error, called a periodic or list effect. If we used a very long list such as a telephone directory for our sampling frame and needed to sample only a few names using a short sampling interval, it is possible that we could select by accident a sample from a portion of the list in which a certain ethnic group is concentrated. The resulting sample would not be very representative of the population. If the characteristics of interest to us varied considerably by ethnic group, our estimate of the population parameter could be very biased.

To realize that the foregoing situation could happen easily, recall that many Caucasians have the surnames Jones and Smith, whereas many Chinese are named Liu, and many Vietnamese are named Nguyen. So if we happened to start near Smith, we would obtain mostly Caucasian subjects, and if we started at Liu, we would obtain mostly Chinese subjects!

### 4. Stratified Random Sampling

Stratified random sampling is a modification of simple random sampling that is used when we want to ensure that each stratum (subgroup) constitutes an appropriate proportion or representation in the sample. Stratified random sampling also can be used to improve the accuracy of sample estimates when it is known that the variability in the data is not constant across the subgroups.

The method of stratified random sampling is very simple. We define m subgroups or strata. For the ith subgroup, we select a simple random sample of size $n_i$. We follow this procedure for each subgroup. The total sample size n is then $n = \sum_{i=1}^{m} n_i$.

The notation $\sum$ stands for the summation of the individual $n_i$'s. For example, if there are three groups, then $\sum_{i=1}^{3} n_i = n_1 + n_2 + n_3$. Generally we have a total sample size “n” in mind.

Statistical theory can demonstrate that in many situations, stratified random sampling produces an unbiased estimate of the population mean with better precision than does simple random sampling with the same total sample size n. Precision of the estimate is improved when we choose large values of $n_i$ for the subgroups with the largest variability and small values for the subgroups with the least variability.
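The procedure amounts to drawing one simple random sample per stratum and pooling the results. The following is a minimal sketch; the function name `stratified_sample`, the three age strata, and the chosen stratum sample sizes are hypothetical. The sizes shown here are proportional to stratum size, which is only one possible allocation; an allocation that favors the more variable strata, as described above, would simply use different values.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample of size sizes[name] from each stratum.

    strata: dict mapping stratum name -> list of units in that stratum
    sizes:  dict mapping stratum name -> stratum sample size
    """
    rng = random.Random(seed)
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}


# Hypothetical population of 1000 patient IDs split into three age strata.
strata = {
    "under_40": list(range(0, 500)),
    "40_to_64": list(range(500, 800)),
    "65_plus": list(range(800, 1000)),
}
# Hypothetical allocation: 25 + 15 + 10 = 50 in total.
sizes = {"under_40": 25, "40_to_64": 15, "65_plus": 10}

sample = stratified_sample(strata, sizes, seed=0)
total_n = sum(len(units) for units in sample.values())
print(total_n)  # 50, the sum of the stratum sample sizes
```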

### 5. Cluster Sampling

As an alternative to the foregoing sampling methods, statisticians sometimes select cluster samples. Cluster sampling refers to a method of sampling in which the element selected is a group (as distinguished from an individual), called a cluster. For example, the clusters could be city blocks. Often, the U.S. Bureau of the Census finds cluster sampling to be a convenient way of sampling.

The Bureau might conduct a survey by selecting city blocks at random from a list of city blocks in a particular city. The Bureau would interview a head of household from every household in each city block selected. Often, this method will be more economically feasible than other ways to sample, particularly if the Census Bureau has to send employees out to the communities to conduct the interviews in person.

Cluster sampling often works very well. Since the clusters are selected at random, the samples can be representative of the population; unbiased estimates of the population total or mean value for a particular parameter can be obtained. Sometimes, there is loss of precision for the estimate relative to simple random sampling; however, this disadvantage can be offset by the reduction in cost of the data collection.
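The city-block example can be sketched in code: blocks are drawn at random, and every household in a selected block is included. The frame of 40 blocks, the household counts, and the function name `cluster_sample` are all made up for illustration and do not describe any actual Census Bureau procedure.

```python
import random


def cluster_sample(clusters, num_clusters, seed=None):
    """Select whole clusters at random and keep every unit inside each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)
    return {name: clusters[name] for name in chosen}


# Hypothetical frame: 40 city blocks, each containing between 5 and 12 households.
frame_rng = random.Random(0)
blocks = {
    f"block_{b:02d}": [f"block_{b:02d}/household_{h}" for h in range(frame_rng.randint(5, 12))]
    for b in range(40)
}

selected = cluster_sample(blocks, num_clusters=4, seed=1)
households = [h for units in selected.values() for h in units]

print(sorted(selected))   # the 4 blocks chosen at random
print(len(households))    # every household in those blocks is interviewed
```

Note that the random selection operates on blocks, not on households; interviewers then visit only four locations, which is the source of the cost saving.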

See Chapter 9 of Cochran (1977) for a more detailed discussion and some mathematical results about cluster sampling. Further discussion can be found in Lohr (1999) and Kish (1965). While clusters can be of equal or unequal size, the mathematics is simpler for equal size. The three aforementioned texts develop the theory for equal cluster sizes first and then go on to deal with the more complicated case of unequal cluster sizes.

Thus far in Section 2.3, we have presented a brief description of sampling techniques used in surveys. For a more complete discussion see Scheaffer, Mendenhall, and Ott (1979), Kish (1965), Cochran (1977), or Lohr (1999).

### 6. Bootstrap Sampling

Throughout this text, we will discuss both parametric and nonparametric methods of statistical inference. One such nonparametric technique is the bootstrap, a statistical technique in which inferences are made without reliance on parametric models for the population distribution. Other nonparametric techniques are covered in Chapter 14. Nonparametric methods provide a means for obtaining sample estimates or testing hypotheses without making parametric assumptions about the distribution being sampled.

The account of the bootstrap in this book is very elementary and brief. A more thorough treatment can be obtained from the following books: Efron and Tibshirani (1993), Davison and Hinkley (1997), and Chernick (1999). An elementary and abbreviated account can be found in the monograph by Mooney and Duval (1993).

Before considering the bootstrap in more detail, let us review sampling with replacement and sampling without replacement. Suppose we are selecting items in sequence from our population. If, after we select the first item from our population, we allow that item to remain on the list of eligible items for subsequent selection and we continue selecting in this way, we are performing sampling with replacement. Simple random sampling differs from sampling with replacement in that we remove each item from the list of possible subsequent selections. So in simple random sampling, no observations are repeated. Simple random sampling uses sampling without replacement.
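The two schemes correspond to two different functions in Python's standard `random` module. This is a brief sketch with a made-up list of five items.

```python
import random

rng = random.Random(0)
items = ["A", "B", "C", "D", "E"]  # hypothetical population of five items

# Without replacement: a selected item is removed from further selection, so no repeats.
without_replacement = rng.sample(items, 5)

# With replacement: every item stays eligible on each draw, so repeats are possible.
with_replacement = rng.choices(items, k=5)

print(without_replacement)  # a reordering of all five items
print(with_replacement)     # five draws, possibly with repeated items
```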

The bootstrap procedure can be approximated by using a Monte Carlo (random sampling) method. This approximation makes the bootstrap a practical, though computationally intensive, procedure. The bootstrap sampling procedure takes a random sample with replacement from the original sample. That is, we take samples from a sample (i.e., we resample).
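A Monte Carlo approximation of this kind might look as follows. This is a minimal sketch rather than the procedure of Section 2.5: the eight observed values, the number of resamples, the choice of the mean as the statistic, and the function name `bootstrap_means` are all assumptions made for illustration.

```python
import random
from statistics import mean


def bootstrap_means(sample, num_resamples=1000, seed=None):
    """Resample the original sample with replacement and record each resample's mean."""
    rng = random.Random(seed)
    n = len(sample)
    return [mean(rng.choices(sample, k=n)) for _ in range(num_resamples)]


# Hypothetical original sample of eight measurements.
observed = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.4]

boot = bootstrap_means(observed, num_resamples=2000, seed=0)

print(len(boot))                    # 2000 bootstrap estimates of the mean
print(round(mean(observed), 3))     # mean of the original sample
print(round(mean(boot), 3))         # average of the bootstrap means, typically close to it
```

Each bootstrap sample has the same size as the original sample, and the spread of the 2000 resampled means gives an empirical picture of the sampling variability of the mean without any parametric model.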

In Section 2.4, we describe a mechanism for generating a simple random sample (sampling without replacement from the population). Because bootstrap sampling is so similar to simple random sampling, Section 2.5 will describe the procedure for generating bootstrap samples.

There are two differences between bootstrap sampling and simple random sampling: first, instead of sampling from a population, a bootstrap sample is generated by sampling from a sample; second, the sampling is done with replacement instead of without replacement. These differences will be made clear in Section 2.5.