Statistics

Mean

The mean (sometimes referred to as the average) of a data set is the sum of all data points divided by the number of data points. Let's look at an example to make this clearer.

Let's say you’re given the data set:

4,6,9,11,5,8,9

If you are asked to find the mean, the first step is to add every single data point.

4+6+9+11+5+8+9=52

Next, take the sum of the data set (52) and divide it by the number of data points you’re given. To find that number, simply count the values in the set above. In this case it’s 7.

52/7≈7.43

The mean (average) of your data set is approx. 7.43.
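The same three steps can be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the variable names are chosen here and are not part of the original guide.

```python
# Data set from the example above.
data = [4, 6, 9, 11, 5, 8, 9]

total = sum(data)      # step 1: add every data point
count = len(data)      # step 2: count the data points
mean = total / count   # step 3: divide the sum by the count

print(total, count, round(mean, 2))   # 52 7 7.43
```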

Median

The median of your data set is the middle number when your data is arranged in order. Let's look at the data set from the previous example:

4,6,9,11,5,8,9

The first step to finding the median is to sort your data from smallest to largest.

4,5,6,8,9,9,11

Now that your data is in order, find the position of the middle value. When the number of data points is odd, add 1 to the number of data points and divide by 2.

Here we had 7 data points, so:

(7+1)/2=4

The median value will be the 4th number in the data set:

4,5,6,8,9,9,11

Our median here is then 8.

If your data set has an even number of data points, though, you have to take the average of the two middle numbers.

To locate the two middle values, divide the number of data points by two to find the position of the first middle value, then add one to find the position of the second. For example:

4,5,6,8,9,10,11,14

Here we have 8 values.

8/2=4, so our first middle value is the 4th number, and our second is the 5th.

4,5,6,8,9,10,11,14

Our two middle values are 8 and 9. Add them and divide by two to find your median:

(8+9)/2=8.5

The median of this data set is 8.5.
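Both cases (an odd and an even number of data points) can be sketched as one small Python function. The helper name `median` and its layout are assumptions made for illustration; it simply follows the position rules described above.

```python
def median(values):
    """Return the median using the position rules from the text (illustrative helper)."""
    ordered = sorted(values)          # always sort first
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2       # 1-based position of the middle value
        return ordered[position - 1]  # Python lists are 0-based
    first = n // 2                    # 1-based position of the first middle value
    second = first + 1                # 1-based position of the second middle value
    return (ordered[first - 1] + ordered[second - 1]) / 2

odd_median = median([4, 6, 9, 11, 5, 8, 9])         # 7 data points
even_median = median([4, 5, 6, 8, 9, 10, 11, 14])   # 8 data points
print(odd_median, even_median)   # 8 8.5
```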

Mode

The mode of the data set is the numerical value that occurs the most. In the data set

4,5,6,8,9,9,11

9 occurs twice while all other values only occur once, meaning the mode is 9.

If all numbers only occur once, there is no mode to the data set.

If two numbers are tied for the highest number of occurrences, they are both modes. A data set like this is called bimodal, though this is not tested on the SAT.
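The three cases above (one mode, no mode, two modes) can be sketched with Python's `collections.Counter`. The helper name `modes` is hypothetical, and the second and third data sets are made up purely to show the "no mode" and "two modes" cases; returning an empty list when every value occurs once follows this guide's convention.

```python
from collections import Counter

def modes(values):
    """Return a sorted list of the most frequent values (illustrative helper)."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []   # every value occurs once: no mode, per the convention above
    return sorted(value for value, count in counts.items() if count == highest)

one_mode = modes([4, 5, 6, 8, 9, 9, 11])   # data set from the text
no_mode = modes([1, 2, 3])                 # made-up data: all values occur once
two_modes = modes([1, 1, 2, 2, 3])         # made-up data: 1 and 2 are tied
print(one_mode, no_mode, two_modes)        # [9] [] [1, 2]
```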

Range

The range of a data set is the smallest number subtracted from the largest number. For example:

4,5,6,8,9,9,11

The range of this data set would be 11-4 which is equal to 7. 

Note: if the largest or smallest data point is removed from a data set, the range will most likely be the measure that changes the most. Still, always check if you have the time.
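As a quick check of the note above, the sketch below removes the largest value from the example data set and compares how much the range, median, and mean each change. It uses Python's standard `statistics` module; the helper `data_range` is a name chosen here. The result holds for this particular data set and is not a general proof.

```python
from statistics import mean, median

data = [4, 5, 6, 8, 9, 9, 11]

def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

# Remove the largest data point (11) and compare each measure before and after.
trimmed = sorted(data)[:-1]

range_change = abs(data_range(data) - data_range(trimmed))   # |7 - 5| = 2
median_change = abs(median(data) - median(trimmed))          # |8 - 7| = 1
mean_change = abs(mean(data) - mean(trimmed))                # roughly 0.6

print(data_range(data), range_change, median_change, round(mean_change, 2))
```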

Standard Deviation

On the SAT, you will never be asked to calculate the standard deviation of a data set. You will, however, get questions that test whether you understand what it is.

The standard deviation is a value that measures how much the data set deviates from the mean. The larger the standard deviation, the more the data deviates.

Simply put: The more spread out, the greater the standard deviation. 
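To build intuition (not because the SAT requires the calculation), here is a small sketch comparing two made-up data sets that share the same mean but differ in spread. Both lists are invented for illustration, and `statistics.pstdev` (population standard deviation) is just one reasonable choice of formula.

```python
from statistics import mean, pstdev

# Two invented data sets with the same mean (5).
clustered = [4, 5, 5, 5, 6]    # most values sit near the center
spread_out = [1, 3, 5, 7, 9]   # values sit far from the center

print(mean(clustered), mean(spread_out))       # same mean
print(round(pstdev(clustered), 2))             # smaller standard deviation
print(round(pstdev(spread_out), 2))            # larger standard deviation
```

The clustered list has the smaller standard deviation even though the means match, which is exactly the kind of comparison the SAT asks you to make by eye.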

Example 1
The dot plots shown each represent a data set. Which of the following statements best compares the means and the standard deviations of the two data sets?

In the first data set, the farther you go out from the center, the lower the frequency of the data points becomes. Most of the data points are in the center.

In the second data set, the points are all spread out evenly. They deviate farther from the center.

This means that data set A has the smaller standard deviation.

If you add up all their values and find the means of the data sets, you'll find that they’re equal.

So the answer to this problem is A.

Frequency distributions

A frequency distribution is another way of presenting data. In the previous examples we used lists to show our data, but on the SAT it is much more common to be given some sort of frequency distribution.

Frequency    Height (cm)
5            165
9            170
13           175
7            180

In the table above, the frequency side tells us how often a number occurs in the data set. So in the sample above, five people have a height of 165, nine have a height of 170, thirteen have a height of 175, and seven have a height of 180.

To find the mode, look for the data point with the highest frequency. Since thirteen is our highest frequency here, the mode is 175 cm.

For the mean, multiply each value by its frequency, add up all of those products, and divide by the total number of data points (all frequencies added up). If we want to find the mean of our example above:

(5[freq.]×165[data point])+(9×170)+(13×175)+(7×180)=5890

You'd then divide that number by the number of data points you have (all frequencies added up), which here is 5+9+13+7=34.

5890/34≈173.24

Our mean (average) height in this sample is approx. 173.24 cm
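The mode and mean of the height table can be reproduced with a short Python sketch. Storing the table as a list of (frequency, height) pairs is an assumption made here for illustration.

```python
# Frequency table from the text, as (frequency, height in cm) pairs.
table = [(5, 165), (9, 170), (13, 175), (7, 180)]

# Mode: the height with the highest frequency.
_, mode_height = max(table)   # tuples compare by frequency first

# Mean: sum of (frequency * value) divided by the total frequency.
weighted_sum = sum(freq * height for freq, height in table)
total_count = sum(freq for freq, _ in table)
mean_height = weighted_sum / total_count

print(mode_height, weighted_sum, total_count, round(mean_height, 2))
# 175 5890 34 173.24
```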

Finding the median is quite simple too, though it takes a bit of computation. 

The first step is to find your ‘locator’ number, that is, the position of the value that will be in the middle. Since our example has 34 data points, we divide by two to find the first middle position, then add one to find the second: 34/2=17. So our middle values are going to be the 17th and 18th values.

Now go back to your frequency table, add all the frequencies top to bottom until you reach, or go past, the locator values (17 and 18).

Using our example above, the first frequency is 5, which is less than 17, so add the next frequency.

Our next frequency is 9. 9+5=14, which is still less than 17, meaning that our median does not lie in the first two rows.

The next one is 13. 9+5+13=27. Our running total has now passed both of our locator values, which means that both middle values lie within this row. In this case it's the data point that has a value of 175.

So our median is 175 cm.
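The running-total procedure above can be sketched in Python as follows. The helper name `value_at_position` is hypothetical, and the sketch assumes the table rows are already listed from the smallest value to the largest.

```python
def value_at_position(table, position):
    """Return the value at a 1-based position by adding up frequencies row by row."""
    running_total = 0
    for freq, value in table:
        running_total += freq
        if running_total >= position:   # reached or passed the locator
            return value
    raise ValueError("position is larger than the number of data points")

# (frequency, height in cm) pairs, sorted by height.
table = [(5, 165), (9, 170), (13, 175), (7, 180)]
n = sum(freq for freq, _ in table)      # 34 data points

if n % 2 == 0:
    lower = value_at_position(table, n // 2)        # 17th value
    upper = value_at_position(table, n // 2 + 1)    # 18th value
    median_height = (lower + upper) / 2
else:
    median_height = value_at_position(table, (n + 1) // 2)

print(median_height)   # 175.0
```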

Box Plots

A box plot is simply another way of presenting data. While box plots might look weird to you, they're incredibly simple to understand. Refer to the image below.

image via Khan Academy

The first and last points (which may also be drawn as lines) represent the minimum and maximum of your data set. In the example above, our minimum is 24 and our maximum is 34.

The lines that make up both sides of the box represent the first quartile (Q1) and the third quartile (Q3). Here, our Q1 is 27, and our Q3 is 33.

The line that resides within the box represents the second quartile, which is also the median of the data set.

If you don't know what the terms first, second, and third quartile mean, check out the videos below. That said, the SAT will never ask you a direct question about them.

Link vids here

For the SAT, all you need to know is which value is the median (again, the middle line) and where the minimum and maximum are.

These questions are rare, but when one does come up you will most likely be asked to compare two box-and-whisker plots. The test may ask you which data set has a larger median or to compare the ranges (max - min value). No matter what it is, it’ll be quite simple as long as you know how to read one.

Interpreting results

The study of statistics is mostly used to make predictions about a large group of people, classified as a population. 

Since it’s almost impossible to survey every single person in a population (for example, if I wanted to know how many Americans drink soda, I wouldn't have time to ask every individual person), we can instead use a sample: a small subset of people within a population, used to predict information about the full group.

For example, instead of surveying every single American to see if they drink soda, a good sample might be to survey 1000 Americans from each of the 50 states. 

If 25% of people in the sample say that they drink soda on a regular basis, it would be a decent prediction to say that approximately 25% of all Americans drink soda on a regular basis. 

REMEMBER, this is not an exact figure. Even if exactly 25% of the people who reply say they drink soda, this does NOT mean that exactly 25% of ALL Americans drink soda. Many questions on the SAT will have an answer choice with words such as “exactly” or “must”. These are almost always wrong, so avoid answer choices that claim 100% certainty. During the test, the safest option is to choose the most general claim.

Taking a good sample

For a sample to have any merit, it must properly represent the population. Let's go back to our soda example.  

If I wanted to figure out the proportion of people who drank soda in all of America, and took a sample of 50,000 people from a single town in Illinois, I would have an invalid and nonrepresentative sample.

One town in one state cannot represent all of America; if that town was particularly healthy or didn't have access to these drinks, the proportion would be way lower than the true amount. 

If the town had one of the greatest soda consumption rates across the country, well then the proportion would be way higher. 

A better sample would be the one we talked about above: surveying 1000 people from each of the 50 states. Even though it would end up being the exact same number of people, this sample represents a much larger group. Having representatives from different states, towns, and demographic backgrounds gives us a more accurate reading of the true proportion.
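The difference between the two sampling approaches can be illustrated with a small simulation. Everything in this sketch is invented for illustration: the three "regions", their soda-drinking rates, and the sample size are hypothetical numbers, not real survey data, and the helper `survey` is a name chosen here.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical soda-drinking rates for three equally sized regions (invented numbers).
region_rates = {
    "health-conscious town": 0.10,
    "typical town": 0.25,
    "soda-loving town": 0.40,
}
true_rate = sum(region_rates.values()) / len(region_rates)   # about 0.25 overall

def survey(rate, people):
    """Simulate asking `people` residents; return how many say they drink soda."""
    return sum(random.random() < rate for _ in range(people))

sample_size = 9000

# Nonrepresentative sample: everyone comes from a single town.
one_town = survey(region_rates["health-conscious town"], sample_size) / sample_size

# More representative sample: the same number of people, split evenly across regions.
per_region = sample_size // len(region_rates)
spread_out = sum(survey(rate, per_region) for rate in region_rates.values()) / sample_size

print(round(true_rate, 2), round(one_town, 2), round(spread_out, 2))
```

Both samples contain the same number of people, but only the one drawn from every region should land close to the overall rate.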

Example 1

After reading the question, analyze the answer choices. Any definitive claims should be immediately thrown out.

C and D both fall into this category. They claim that Treatment X “will improve the eyesight” or “will cause a substantial improvement”. They don't claim that it “might improve eyesight” or will “possibly improve eyesight”, but instead state with confidence that it will help no matter what. That simply cannot be proved, meaning we can cross out C and D.

B also falls into this category, and it additionally compares Treatment X to other treatments. The study said nothing about comparing it side by side with other forms of treatment; it only compared people who took Treatment X with people who didn't take any treatment at all. This means B can be crossed out.

This leaves us with A. The answer uses the word “likely” meaning it isn't making a 100% definitive claim. It also doesn't compare it to anything else, meaning there are no real red flags here. Choice A is the correct answer. 

Example 2

The issue here is the sample. The city council wants to assess the opinions of all city residents, but only sampled city residents who owned dogs. People who own dogs would be much more likely to be in favor of a dog park than people who don't. 

Answer choice A would be wrong for this exact reason. We can't say that a majority of city residents are in favor of a dog park if the sample does not properly represent them. So you can cross out A. 

For choice B, there would be no reason to include more dog owners. 500 people is more than enough for a sample; the issue is that it's simply not representative. Adding more dog owners to the sample would be “adding more fuel to the fire”.

Choice C takes the other extreme, claiming that the survey shouldn't have included any dog owners at all. This would still not be representative of the whole town, since some residents do own dogs. Ideally you'd have a mix of dog owners and non-dog owners.

D here is the correct answer. The sample does not represent all residents as it is biased towards only dog owners. 

Margin of Error & Estimating

On the SAT, you may be asked to estimate the proportion or average value of a population based on a sample. As you may already know by now, though, the average value of a sample will not be exactly the same as that of the whole population. That’s where margin of error comes in.


Going back to our soda example, if a sample drawn from all 50 states was taken and 25% of the people in it drink soda, with a margin of error of 4%, then we can say with reasonable confidence that between 21% (25% - 4%) and 29% (25% + 4%) of all Americans drink soda. The larger and more representative your sample, the lower the margin of error becomes. The smaller your sample, the greater the margin of error becomes.
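The add-and-subtract step can be sketched as a tiny Python helper. The function names `plausible_interval` and `within_interval` are hypothetical; the numbers are the ones from the soda example, kept in whole percentage points so the arithmetic stays exact.

```python
def plausible_interval(estimate, margin_of_error):
    """Return (low, high): the estimate minus and plus the margin of error."""
    return estimate - margin_of_error, estimate + margin_of_error

def within_interval(value, low, high):
    """True if a claimed value falls inside the plausible interval."""
    return low <= value <= high

# Soda example: 25% with a margin of error of 4 percentage points.
low, high = plausible_interval(25, 4)
print(low, high)                        # 21 29
print(within_interval(27, low, high))   # True: a plausible claim
print(within_interval(30, low, high))   # False: outside the interval
```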

Example 1

Firstly, find any answer choices that make definitive claims. In this case, choice C uses the word “exactly”. There is no way to prove that the proportion is exactly 0.49, so that can be crossed out. 

Now let's go back to the question. We’re told that the proportion is 0.49 with a margin of error of 0.04. So add and subtract 0.04 from 0.49 to get your boundaries. 

This gets us 0.45 and 0.53. We can now pretty confidently say that the true proportion is in between 0.45 and 0.53.  

Choice B’s claim is very UNLIKELY. The true proportion is much less likely to be lower than the lower bound than to fall within the interval. Choice D makes the same kind of claim, but with the upper bound.

This leaves us with the correct answer, Choice A. 

In some cases, you’ll have to find an actual value using the estimated sample proportion/percentage. This is shown in the example below.

Example 2

The first step here is, of course, to read the question, and to read it carefully. The percentage given is the share of students who have AT LEAST two siblings, while the question wants us to estimate the number of students who have FEWER than two siblings.

Since 34.6% of the students have at least two siblings, the rest have fewer than two. Therefore, the percentage of students who have fewer than two siblings is 100 - 34.6, or 65.4%.

We’re also told that the 26 student sample is representative of all 8th grade students in the state. This means we can treat 65.4% as a good estimate of the percentage of 8th grade students in the state who have fewer than two siblings.

Finally, to estimate the number of students who have fewer than two siblings, we multiply the percentage (in decimal form) by the total number of students, which is 1800 classes × the average number of students per class.

x=0.654(1800×26)

x=30,607.2

The closest answer to this value is choice C, making it the correct answer.
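The whole calculation can be retraced in a few lines of Python. The variable names are chosen here for illustration, and the figure of 26 students per class is taken from the computation above.

```python
percent_at_least_two = 34.6
# Complement: students with fewer than two siblings (rounded to tidy float noise).
percent_fewer_than_two = round(100 - percent_at_least_two, 1)   # 65.4

classes = 1800
students_per_class = 26
total_students = classes * students_per_class                   # 46800

# Convert the percentage to a decimal, then scale up to the whole state.
estimate = percent_fewer_than_two / 100 * total_students
print(total_students, round(estimate, 1))   # roughly 30,607 students
```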

Additional Resources