### Data Analysis and Probability Standard for Grades 9–12

Expectations

Instructional programs from prekindergarten through grade 12 should enable all students to:

Formulate questions that can be addressed with data and collect, organize, and display relevant data to answer them. In grades 9–12 all students should:

• understand the differences among various kinds of studies and which types of inferences can legitimately be drawn from each;
• know the characteristics of well-designed studies, including the role of randomization in surveys and experiments;
• understand the meaning of measurement data and categorical data, of univariate and bivariate data, and of the term variable;
• understand histograms, parallel box plots, and scatterplots and use them to display data;
• compute basic statistics and understand the distinction between a statistic and a parameter.

Select and use appropriate statistical methods to analyze data. In grades 9–12 all students should:

• for univariate measurement data, be able to display the distribution, describe its shape, and select and calculate summary statistics;
• for bivariate measurement data, be able to display a scatterplot, describe its shape, and determine regression coefficients, regression equations, and correlation coefficients using technological tools;
• display and discuss bivariate data where at least one variable is categorical;
• recognize how linear transformations of univariate data affect shape, center, and spread;
• identify trends in bivariate data and find functions that model the data or transform the data so that they can be modeled.

Develop and evaluate inferences and predictions that are based on data. In grades 9–12 all students should:

• use simulations to explore the variability of sample statistics from a known population and to construct sampling distributions;
• understand how sample statistics reflect the values of population parameters and use sampling distributions as the basis for informal inference;
• evaluate published reports that are based on data by examining the design of the study, the appropriateness of the data analysis, and the validity of conclusions;
• understand how basic statistical techniques are used to monitor process characteristics in the workplace.

Understand and apply basic concepts of probability. In grades 9–12 all students should:

• understand the concepts of sample space and probability distribution and construct sample spaces and distributions in simple cases;
• use simulations to construct empirical probability distributions;
• compute and interpret the expected value of random variables in simple cases;
• understand the concepts of conditional probability and independent events;
• understand how to compute the probability of a compound event.

Students whose mathematics curriculum has been consistent with the recommendations in Principles and Standards should enter high school having designed simple surveys and experiments, gathered data, and graphed and summarized those data in various ways. They should be familiar with basic measures of center and spread, able to describe the shape of data distributions, and able to draw conclusions about a single sample. Students will have computed the probabilities of simple and some compound events and performed simulations, comparing the results of the simulations to predicted probabilities.

In grades 9–12 students should gain a deep understanding of the issues entailed in drawing conclusions in light of variability. They will learn more-sophisticated ways to collect and analyze data and draw conclusions from data in order to answer questions or make informed decisions in workplace and everyday situations. They should learn to ask questions that will help them evaluate the quality of surveys, observational studies, and controlled experiments. They can use their expanding repertoire of algebraic functions, especially linear functions, to model and analyze data, with increasing understanding of what it means for a model to fit data well. In addition, students should begin to understand and use correlation in conjunction with residuals and visual displays to analyze associations between two variables. They should become knowledgeable, analytical, thoughtful consumers of the information and data generated by others.

As students analyze data in grades 9–12, the natural link between statistics and algebra can be developed further. Students' understandings of graphs and functions can also be applied in work with data.

Basic ideas of probability underlie much of statistical inference. Probability is linked to other topics in high school mathematics, especially counting techniques (Number and Operations), area concepts (Geometry), the binomial theorem, and relationships between functions and the area under their graphs (Algebra). Students should learn to determine the probability of a sample statistic for a known population and to draw simple inferences about a population from randomly generated samples.

#### Formulate questions that can be addressed with data and collect, organize, and display relevant data to answer them

Students' experiences with surveys and experiments in lower grades should prepare them to consider issues of design. In high school, students should design surveys, observational studies, and experiments that take into consideration questions such as the following: Are the issues and questions clear and unambiguous? What is the population? How should the sample be selected? Is a stratified sample called for? What size should the sample be? Students should understand the concept of bias in surveys and ways to reduce bias, such as using randomization in selecting samples. Similarly, when students design experiments, they should begin to learn how to take into account the nature of the treatments, the selection of the experimental units, and the randomization used to assign treatments to units. Examples of situations students might consider are shown in figure 7.22.

 Fig. 7.22. Three kinds of situations in which statistics are used

There are many reasons to be careful in conducting surveys and analyzing the results. In the survey example, the ambiguity of the question about computer usage makes it impossible to interpret the results meaningfully. Students designing surveys must also deal with sampling procedures. The goal of a survey is to generalize from a sample to the population from which it is drawn. Students must understand that a sample is most likely to be representative when it has been randomly chosen from the population.

Nonrandomness in sampling may also limit the conclusions that can be drawn from observational studies. For instance, in the observational study example, it is not certain that the number of people riding trains reflects the number of people who would ride trains if more were available or if scheduling were more convenient. Similarly, it would be inappropriate to draw conclusions about the percentage of the population that ice skates on the basis of observational studies done either in Florida or in Quebec. Students need to be aware that any conclusions about cause and effect should be made very cautiously in observational studies. They should also know how certain kinds of systematic observations, such as random testing of manufacturing parts taken from an assembly line, can be used for purposes of quality control.

In designed experiments, two or more experimental treatments (or conditions) are compared. In order for such comparisons to be valid, other sources of variation must be controlled. This is not the situation in the tire example, in which the front and rear tires are subjected to different kinds of wear. Another goal in designed experiments is to be able to draw conclusions with broad applicability. For this reason, new tires should be tested on all relevant road conditions. Consider another designed experiment in which the goal is to test the effect of a treatment (such as getting a flu shot) on a response (such as getting the flu) for older people. This is done by comparing the responses of a treatment group, which gets treatment, with those of a control group, which does not. Here, the investigators would randomly choose subjects for their study from the population group to which they want to generalize, say, all males and females aged 65 or older. They would then randomly assign these individuals to the control and treatment groups. Note that interesting issues arise in the choice of subjects (not everyone wants to or is able to participate—could this introduce bias?) and in the concept of a control group (are these seniors then at greater risk of getting the flu?).
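The two randomization steps in the flu-shot example (random selection of subjects, then random assignment to groups) can be sketched in a few lines of Python. This is a minimal illustration, not a study protocol: the population of 1000 ID numbers, the sample size of 40, the seeds, and the function name `assign_groups` are all assumptions made for the example.

```python
import random

def assign_groups(subject_ids, seed=None):
    """Randomly split subjects into a treatment group and a control group."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)              # random order removes investigator choice
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical population: ID numbers for 1000 people aged 65 or older.
population = range(1000)

# Step 1: randomly choose subjects from the population of interest.
participants = random.Random(1).sample(population, 40)

# Step 2: randomly assign the chosen subjects to the two groups.
treatment, control = assign_groups(participants, seed=2)

print("treatment group:", sorted(treatment))
print("control group:  ", sorted(control))
```

The sketch does not address the issues raised in the text, such as subjects who decline to participate; those would have to be handled in the design itself.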

#### Select and use appropriate statistical methods to analyze data

Describing center, spread, and shape is essential to the analysis of both univariate and bivariate data. Students should be able to use a variety of summary statistics and graphical displays to analyze these characteristics.

The shape of a distribution of a single measurement variable can be analyzed using graphical displays such as histograms, dotplots, stem-and-leaf plots, or box plots. Students should be able to construct these graphs and select from among them to assist in understanding the data. They should comment on the overall shape of the plot and on points that do not fit the general shape. By examining these characteristics of the plots, students should be better able to explain differences in measures of center (such as mean or median) and spread (such as standard deviation or interquartile range). For example, students should recognize that the statement "the mean score on a test was 50 percent" may cover several situations, including the following: all scores are 50 percent; half the scores are 40 percent and half the scores are 60 percent; half the scores are 0 percent and half the scores are 100 percent; one score is 100 percent and 50 scores are 49 percent. Students should also recognize that the sample mean and median can differ greatly for a skewed distribution. They should understand that for data that are identified by categories—for example, gender, favorite color, or ethnic origin—bar graphs, pie charts, and summary tables often display information about the relative frequency or percent in each category.
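The four situations that share a mean score of 50 percent can be compared numerically. The following sketch uses Python's standard `statistics` module; the class size of ten used for the first three situations is an assumption, since the text specifies only the proportions.

```python
from statistics import mean, median, pstdev

# Four score distributions, each with a mean of 50 percent.
score_sets = {
    "all scores 50": [50] * 10,
    "half 40, half 60": [40] * 5 + [60] * 5,
    "half 0, half 100": [0] * 5 + [100] * 5,
    "one 100, fifty 49": [100] + [49] * 50,
}

for label, scores in score_sets.items():
    # pstdev treats the scores as the whole group, not a sample from it.
    print(f"{label:18s} mean={mean(scores)} median={median(scores)} "
          f"sd={pstdev(scores):.2f}")
```

The means agree, but the standard deviations differ widely, and in the last situation the single score of 100 pulls the mean above the median of 49.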

Students should learn to apply their knowledge of linear transformations from algebra and geometry to linear transformations of data. They should be able to explain why adding a constant to all observed values in a sample changes the measures of center by that constant but does not change measures of spread or the general shape of the distribution. They should also understand why multiplying each observed value by the same positive constant multiplies the mean, median, range, and standard deviation by the same factor (see the related discussion in the "Reasoning and Proof" section of this chapter).
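A quick numerical check of these two claims might look like the sketch below. The data values, the shift of 10, and the scale factor of 3 are made up for illustration, and the helper name `summary` is hypothetical.

```python
from statistics import mean, median, pstdev

def summary(values):
    """Return the measures of center and spread discussed in the text."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "sd": pstdev(values),
    }

data = [2, 4, 4, 5, 7, 9, 11]        # hypothetical observations
shifted = [x + 10 for x in data]     # add a constant to every value
scaled = [3 * x for x in data]       # multiply every value by a positive constant

for label, values in [("original", data), ("plus 10", shifted), ("times 3", scaled)]:
    print(label, summary(values))
```

Adding 10 moves the mean and median by 10 and leaves the range and standard deviation unchanged; multiplying by 3 multiplies all four by 3. With a negative multiplier the range and standard deviation would be multiplied by the absolute value of the constant instead.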

The methods used for representing univariate measurement data also can be adapted to represent bivariate data where one variable is categorical and the other is a continuous measurement. The levels of the categorical variable split the measurement variable into groups. Students can use parallel box plots, back-to-back stem-and-leaf, or same-scale histograms to compare the groups. The following problem from Moore (1990, pp. 108–9) illustrates conclusions that can be drawn from such comparisons:

U.S. Department of Agriculture regulations group hot dogs into three types: beef, meat, and poultry. Do these types differ in the number of calories they contain? The three boxplots below display the distribution of calories per hot dog among brands of the three types. The box ends mark the quartiles, the line within the box is the median, and the whiskers extend to the smallest and largest individual observations. We see that beef and meat hot dogs are similar but that poultry hot dogs as a group show considerably fewer calories per hot dog.

Analyses of the relationships between two sets of measurement data are central in high school mathematics. These analyses involve finding functions that "fit" the data well. For instance, students could examine the scatterplot of bivariate measurement data shown in figure 7.23 and consider what type of function (e.g., linear, exponential, quadratic) might be a good model. If the plot of the data seems approximately linear, students should be able to produce lines that fit the data, to compare several such lines, and to discuss what best fit might mean. This analysis includes stepping back and making certain that what is being done makes sense practically.

 Fig. 7.23. Fitting a line to the data displayed in a scatterplot (Adapted from Burrill et al. [1999, pp. 14-15, 20])

The dashed vertical line segments in figure 7.23 represent residuals (the differences between the y-values predicted by the linear model and the actual y-values) for three data points. Teachers can help students explore several ways of using residuals to define best fit. For example, a candidate for best-fitting line might be chosen to minimize the sum of the absolute values of residuals; another might minimize the sum of squared residuals. Using dynamic software, students can change the position of candidate lines for best fit and see the effects of those changes on squared residuals. The line illustrated in figure 7.23, which minimizes the sum of the squares of the residuals, is called the least-squares regression line. Using technology, students should be able to compute the equation of the least-squares regression line and the correlation coefficient, r.
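One way to see what the technology is doing is to compute the least-squares line directly from the standard closed-form formulas and compare its squared residuals with those of other candidate lines. The sketch below is illustrative only: the five data points are invented (the data behind figure 7.23 are not reproduced here), and the function names are hypothetical.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return slope, intercept, and correlation coefficient r."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

def sum_absolute_residuals(xs, ys, slope, intercept):
    """Alternative criterion: sum of absolute residuals."""
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))

# Hypothetical data, roughly linear.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r = least_squares(xs, ys)
print("least-squares line: y =", round(slope, 3), "x +", round(intercept, 3))
print("r =", round(r, 4))
print("squared residuals, best line:   ", sum_squared_residuals(xs, ys, slope, intercept))
print("squared residuals, steeper line:", sum_squared_residuals(xs, ys, slope + 0.1, intercept))
```

Nudging the slope or intercept away from the computed values increases the sum of squared residuals, which is the behavior students see when they drag candidate lines in dynamic software.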

Students should understand that the correlation coefficient r gives information about (1) how tightly packed the data are about the regression line and (2) the strength of the relationship between the two variables. Students should understand that correlation does not imply a cause-and-effect relationship. For example, the presence of certain kinds of eye problems and the loss of sensitivity in people's feet can be related statistically. However, the correlation may be due to an underlying cause, such as diabetes, for both symptoms rather than to one symptom's causing the other.

#### Develop and evaluate inferences and predictions that are based on data

Once students have determined a model for a data set, they can use the model to make predictions and recognize and explain the limitations of those predictions. For example, the regression line depicted in figure 7.23 has the equation y = 0.33x − 93.9, where x represents the number of screens and y represents box-office revenues (in units of \$10,000). To help students understand the meaning of the regression line, its role in making predictions and inferences, and its limitations and possible extensions, teachers might ask questions like the following:

1. Predict the revenue of a movie that is shown on 800 screens nationwide. Of a movie that is shown on 2500 screens. Discuss the accuracy and limitations of your predictions.

2. Explain the meaning of a slope of 0.33 in the screen-revenue context.

3. Explain why the y-intercept of the regression line does not have meaning in the box-office-revenue context.

4. What other variables might help in predicting box-office revenues?
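The arithmetic behind questions 1 through 3 can be sketched as follows, using the regression equation given above. The function name `predicted_revenue_dollars` is hypothetical, and whether 800 or 2500 screens lies inside the range of the original data depends on figure 7.23, which is not reproduced here.

```python
SLOPE = 0.33        # revenue units gained per additional screen
INTERCEPT = -93.9   # revenue units predicted for zero screens
UNIT = 10_000       # one revenue unit is $10,000

def predicted_revenue_dollars(screens):
    """Predicted box-office revenue in dollars from the regression line."""
    return (SLOPE * screens + INTERCEPT) * UNIT

for screens in (800, 2500):
    print(f"{screens} screens: about ${predicted_revenue_dollars(screens):,.0f}")

# Each additional screen is associated with 0.33 units, about $3,300.
print(f"per-screen change: ${SLOPE * UNIT:,.0f}")

# The intercept predicts a negative revenue for zero screens.
print(f"zero screens: ${predicted_revenue_dollars(0):,.0f}")
```

The predictions are about \$1.7 million and \$7.3 million. The model returns a negative revenue for a movie shown on no screens (and for any movie on fewer than about 285 screens), which is one way to see why the y-intercept has no meaning in this context.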

A parameter is a single number that describes some aspect of an entire population, and a statistic is an estimate of that value computed from some sample of the population. To understand terms such as margin of error in opinion polls, it is necessary to understand how statistics, such as sample proportions, vary when different random samples are chosen from a population. Similarly, sample means computed from measurement data vary according to the random sample chosen, so it is important to understand the distribution of sample means in order to assess how well a specific sample mean estimates the population mean.

Understanding how to draw inferences about a population from random samples requires understanding how those samples might be distributed. Such an understanding can be developed with the aid of simulations. Consider the following situation:

Suppose that 65% of a city's registered voters support Mr. Blake for mayor. How unusual would it be to obtain a random sample of 20 registered voters in which at most 8 support Mr. Blake?

Here the parameter for the population is known: 65 percent of all registered voters support Mr. Blake. The question is, How likely is a random sample with a very different proportion (at most 8 out of 20, or 40%) of supporters? The probability of such a sample can be approximated with a simulation. Figure 7.24 shows the results of drawing 100 random samples of size 20 from a population in which 65 percent support Mr. Blake.

 Fig. 7.24. The results of a simulation of drawing 100 random samples of size 20 from a population in which 65 percent support Mr. Blake

Only 2 percent of the samples had 8 or fewer registered voters supporting Mr. Blake. The value 8 occurs well out on the left tail of the histogram. One can reasonably conclude that a sample outcome of 8 or fewer supporters out of 20 randomly selected voters is a rare event when sampling from this population. This kind of exercise can be used to develop the concept of hypothesis testing for a single proportion or mean.
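A simulation like the one summarized in figure 7.24 can be written with Python's standard `random` module. This is a sketch under stated assumptions: each sampled voter is modeled as an independent supporter with probability 0.65, the function names are hypothetical, and the seeds are arbitrary. Results from 100 samples will vary from run to run, so a larger run is included for a steadier estimate.

```python
import random
from collections import Counter

def count_supporters(sample_size, support_rate, rng):
    """Number of supporters in one random sample of voters."""
    return sum(rng.random() < support_rate for _ in range(sample_size))

def simulate(num_samples, sample_size, support_rate, seed=None):
    """Supporter counts for many independent random samples."""
    rng = random.Random(seed)
    return [count_supporters(sample_size, support_rate, rng)
            for _ in range(num_samples)]

# 100 samples of size 20, as in the text.
small_run = simulate(100, 20, 0.65, seed=1)
for value, freq in sorted(Counter(small_run).items()):
    print(f"{value:2d} {'*' * freq}")      # a rough text histogram

# A larger run gives a more stable estimate of the tail proportion.
large_run = simulate(10_000, 20, 0.65, seed=0)
share_at_most_8 = sum(c <= 8 for c in large_run) / len(large_run)
print("share of samples with at most 8 supporters:", share_at_most_8)
```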

In the situation just described, a parameter of the population was known and the probability of a particular sample characteristic was estimated in order to understand how sampling distributions work. However, in applications of this idea in real situations, the information about a population is unknown and a sample is used to project what that information might be without having to check all the individuals in the population. For example, suppose that the proportion of registered voters supporting Mr. Blake was unknown (a realistic situation) and that a pollster wanted to find out what that proportion might be. If the pollster surveyed a sample of 20 voters and found that 65 percent of them support the candidate, is it reasonable to expect that about 65 percent of all voters support the candidate? What if the sample was 200 voters? 2000 voters? As indicated above, the proportion of voters who supported Mr. Blake could vary substantially from sample to sample in samples of 20. There is much less variation in samples of 200. By performing simulations with samples of different sizes, students can see that as sample size increases, variation decreases. In this way, they can develop the intuitive underpinnings for understanding confidence intervals.
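The effect of sample size on variation can be explored with a short simulation. In the sketch below, the true support rate of 0.65, the 500 samples per size, and the seed are assumptions chosen for illustration; the standard deviation of the sample proportions is used as the measure of variation.

```python
import random
from statistics import pstdev

def sample_proportions(sample_size, support_rate, num_samples, rng):
    """Sample proportions of supporters for many random samples."""
    return [
        sum(rng.random() < support_rate for _ in range(sample_size)) / sample_size
        for _ in range(num_samples)
    ]

rng = random.Random(0)
spread = {}
for n in (20, 200, 2000):
    proportions = sample_proportions(n, 0.65, 500, rng)
    spread[n] = pstdev(proportions)
    print(f"n={n:4d}  min={min(proportions):.3f}  max={max(proportions):.3f}  "
          f"sd={spread[n]:.4f}")
```

The spread of the sample proportions shrinks as the sample size grows, which is the intuition behind narrower confidence intervals for larger samples.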

A similar kind of reasoning about the relationship between the characteristics of a sample and the population from which it is drawn lies behind the use of sampling for monitoring process control and quality in the workplace.

#### Understand and apply basic concepts of probability

In high school, students can apply the concepts of probability to predict the likelihood of an event by constructing probability distributions for simple sample spaces. Students should be able to describe sample spaces such as the set of possible outcomes when four coins are tossed and the set of possibilities for the sum of the values on the faces that are down when two tetrahedral dice are rolled.
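For the four-coin example, the sample space and the resulting probability distribution for the number of heads can be listed mechanically. The sketch below assumes fair coins, so that all outcomes are equally likely; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every way four coins can land: HHHH, HHHT, ..., TTTT.
sample_space = ["".join(toss) for toss in product("HT", repeat=4)]
print(len(sample_space), "outcomes:", sample_space)

# Probability distribution for the number of heads (fair coins assumed).
heads_counts = Counter(outcome.count("H") for outcome in sample_space)
distribution = {
    heads: Fraction(count, len(sample_space))
    for heads, count in sorted(heads_counts.items())
}
for heads, probability in distribution.items():
    print(heads, "heads:", probability)
```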

High school students should learn to identify mutually exclusive, joint, and conditional events by drawing on their knowledge of combinations, permutations, and counting to compute the probabilities associated with such events. They can use their understandings to address questions such as those in the following series of examples.

The diagram below shows the results of a two-question survey administered to 80 randomly selected students at Highcrest High School.

• Of the 2100 students in the school, how many would you expect to play a musical instrument?

• Estimate the probability that an arbitrary student at the school plays on a sports team and plays a musical instrument. How is this related to estimates of the separate probabilities that a student plays a musical instrument and that he or she plays on a sports team?

• Estimate the probability that a student who plays on a sports team also plays a musical instrument.

High school students should learn to compute expected values. They can use their understanding of probability distributions and expected value to decide if the following game is fair:

You pay 5 chips to play a game. You roll two tetrahedral dice with faces numbered 1, 2, 3, and 5, and you win the sum of the values on the faces that are not showing.

Teachers can ask students to discuss whether they think the game is fair and perhaps have the students play the game a number of times to see if there are any trends in the results they obtain. They can then have the students analyze the game. First, students need to delineate the sample space. The outcomes are indicated in figure 7.25. The numbers on the first die are indicated in the top row. The numbers on the second die are indicated in the first column. The sums are given in the interior of the table. Since all outcomes are equally likely, each cell in the table has a probability of 1/16 of occurring.

 Fig. 7.25. The sample space for the roll of two tetrahedral dice

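Students can determine that the probability of a sum of 2 is 1/16; of a 3, 1/8; of a 4, 3/16; of a 5, 1/8; of a 6, 3/16; of a 7, 1/8; of an 8, 1/8; of a 10, 1/16. The expected value of a player's "income" in chips from rolling the dice is

$$
2\left(\tfrac{1}{16}\right) + 3\left(\tfrac{1}{8}\right) + 4\left(\tfrac{3}{16}\right) + 5\left(\tfrac{1}{8}\right) + 6\left(\tfrac{3}{16}\right) + 7\left(\tfrac{1}{8}\right) + 8\left(\tfrac{1}{8}\right) + 10\left(\tfrac{1}{16}\right) = \tfrac{88}{16} = 5.5.
$$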

If a player pays a five-chip fee to play the game, on average, the player will win 0.5 chips. The game is not statistically fair, since the player can expect to win.
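The expected-value analysis of the game can be checked by enumerating the sample space with exact fractions. This sketch assumes, as the text does, that each of the 16 outcomes is equally likely; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

FACES = (1, 2, 3, 5)   # numbers on each tetrahedral die
FEE = 5                # chips paid to play

# All 16 equally likely (first die, second die) outcomes.
outcomes = list(product(FACES, repeat=2))

# Probability distribution of the sum.
sum_counts = Counter(a + b for a, b in outcomes)
distribution = {
    total: Fraction(count, len(outcomes))
    for total, count in sorted(sum_counts.items())
}

expected_income = sum(total * p for total, p in distribution.items())
net_gain = expected_income - FEE

for total, p in distribution.items():
    print(f"sum {total:2d}: {p}")
print("expected income:", expected_income)   # chips won per game, on average
print("expected net gain:", net_gain)        # after paying the fee
```

A fair game would have an expected net gain of zero; changing `FEE` shows which fee would make this game fair.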

Students can also use the sample space to answer conditional probability questions such as "Given that the sum is even, what is the probability that the sum is a 6?" Since ten of the sums in the sample space are even and three of those are 6s, the probability of a 6 given that the sum is even is 3/10.
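The same answer follows from the definition of conditional probability, using the probabilities read from the sample space (3 of the 16 outcomes give a sum of 6, and 10 of the 16 give an even sum):

$$
P(\text{sum} = 6 \mid \text{sum is even}) = \frac{P(\text{sum} = 6 \text{ and sum is even})}{P(\text{sum is even})} = \frac{3/16}{10/16} = \frac{3}{10}.
$$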

The following situation, adapted from Coxford et al. (1998, p. 469), could give rise to a very rich classroom discussion of compound events.

In a trial in Sweden, a parking officer testified to having noted the position of the valve stems on the tires on one side of a car. Returning later, the officer noted that the valve stems were still in the same position. The officer noted the position of the valve stems to the nearest "hour." For example, in figure 7.26 the valve stems are at 10:00 and at 3:00. The officer issued a ticket for overtime parking. However, the owner of the car claimed he had moved the car and returned to the same parking place.

The judge who presided over the trial made the assumption that the wheels move independently and the odds of the two valve stems returning to their previous "clock" positions were calculated as 144 to 1. The driver was declared to be innocent because such odds were considered insufficient; had all four valve stems been found to have returned to their previous positions, the driver would have been declared guilty (Zeisel 1968).

 Fig. 7.26. A diagram of tires with valves at the 10:00 and 3:00 positions

Given the assumption that the wheels move independently, students could be asked to assess the probability that if the car is moved, two (or four) valve stems would return to the same position. They could do so by a direct probability computation, or they might design a simulation, either by programming or by using spinners, to estimate this probability. But is it reasonable to assume that the front and rear wheels on one side, or all four wheels, move independently? This issue might be resolved empirically. The students might drive a car around the block to see if its wheels do rotate independently of one another and decide if the judge's assumption was justified. They might consider whether it would be more reasonable to assume that all four wheels move as a unit and ask related questions: Under what circumstances might all four wheels travel the same distance? Would all the wheels travel the same distance if the car was driven around the block? Would any differences be large enough to show up as differences in "clock" position? In this way, students can learn about the role of assumptions in modeling, in addition to learning about the computation of probabilities.
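Both approaches, the direct computation and the simulation, can be sketched in Python. The model here is the judge's assumption, not an established fact about cars: after the car is moved, each valve stem is taken to land on any of the 12 clock positions with equal probability, independently of the others. The function names and the number of trials are illustrative.

```python
import random
from fractions import Fraction

POSITIONS = 12   # valve-stem positions recorded to the nearest "hour"

def exact_probability(wheels, positions=POSITIONS):
    """Chance that every stem returns to its earlier position (independence assumed)."""
    return Fraction(1, positions ** wheels)

def simulated_probability(wheels, trials, positions=POSITIONS, seed=None):
    """Estimate the same chance by simulating random final positions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Label each stem's earlier position 0; a stem "returns" if it lands on 0 again.
        if all(rng.randrange(positions) == 0 for _ in range(wheels)):
            hits += 1
    return hits / trials

print("two stems, exact: ", exact_probability(2))
print("four stems, exact:", exact_probability(4))
print("two stems, simulated:", simulated_probability(2, 200_000, seed=0))
```

If the wheels instead moved as a unit, the product rule used in `exact_probability` would not apply, which is the modeling question the text raises.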

Students could also explore the effect of more-precise measurements on the resulting probabilities. They could calculate the probabilities if, say, instead of recording markings to the nearest hour on the clockface, the markings had been recorded to the nearest half or quarter hour. This line of thinking could raise the issue of continuous distributions and the idea of calculating probabilities involving an interval of values rather than a finite number of values. Some related questions are, How could a practical method of obtaining more-precise measurements be devised? How could a parking officer realistically measure tire-marking positions to the nearest clock half-hour? How could measurement errors be minimized? These could begin a discussion of operational definitions and measurement processes.
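Under the same assumptions of independent wheels and equally likely positions, the effect of finer measurement on the two-stem probability can be written in terms of the number of distinguishable positions, here called k (a symbol introduced for this illustration):

$$
P(\text{both stems return}) = \frac{1}{k^2}: \qquad k = 12 \Rightarrow \frac{1}{144}, \qquad k = 24 \Rightarrow \frac{1}{576}, \qquad k = 48 \Rightarrow \frac{1}{2304}.
$$

As k grows, the probability of an exact match shrinks toward zero, which is one way to motivate working with intervals of values in a continuous distribution.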

Students should be able to investigate the following question by using a simulation to obtain an approximate answer:

How likely is it that at most 25 of the 50 people receiving a promotion are women when all the people in the applicant pool from which the promotions are made are well qualified and 65% of the applicant pool is female?

Those students who pursue the study of probability will be able to find an exact solution by using the binomial distribution. Either way, students are likely to find the result rather surprising.
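Both routes to an answer can be sketched in Python. The model is an assumption: each of the 50 promotions is treated as an independent draw that goes to a woman with probability 0.65, which is the binomial setting the text mentions. The function names, trial count, and seed are illustrative.

```python
import random
from math import comb

def binomial_cdf(k, n, p):
    """Exact probability of at most k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def simulate_at_most(k, n, p, trials, seed=None):
    """Estimate the same probability by repeated random selection."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        women = sum(rng.random() < p for _ in range(n))
        if women <= k:
            hits += 1
    return hits / trials

exact = binomial_cdf(25, 50, 0.65)
approx = simulate_at_most(25, 50, 0.65, 20_000, seed=0)
print("exact binomial probability:", round(exact, 4))
print("simulated estimate:        ", round(approx, 4))
```

Comparing the two printed values gives students a sense of how close a simulation can come to the exact binomial result.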