Chapter 1

Sections 1.1 and 1.2

I. Observation vs. Experiment

A. Observational study: Record data on individuals without attempting to influence the responses. We typically cannot prove anything this way.

B. Experimental study: Deliberately impose a treatment on individuals and record their responses. Influential factors can be controlled.

C. Confounding

1. Two variables (explanatory variables or lurking variables) are confounded when their effects on a response variable cannot be distinguished from each other.

2. Observational studies of the effect of one variable on another often fail because the explanatory variable is confounded with lurking variables.

II. Population vs. Sample

A. Population: The entire group of individuals in which we are interested but can’t usually assess directly

1. A parameter is a number describing a characteristic of the population.

B. Sample: The part of the population we actually examine and for which we do have data

1. A statistic is a number describing a characteristic of a sample.

III. Bad Sampling Methods

A. Convenience sampling: Just ask whoever is around.

B. Voluntary response sampling: Individuals choose whether to be involved. These samples are very susceptible to bias because different people are motivated to respond or not. Call-in and write-in "opinion polls" are common examples and are not considered valid or scientific.

1. Bias: The sample design systematically favors a particular outcome; in voluntary response samples, only people who care enough about an issue bother to reply.

C. Wording effects: Questions worded like “Do you agree that it is awful that…” are prompting you to give a particular response.

D. Undercoverage: Undercoverage occurs when parts of the population are left out in the process of choosing the sample.

IV. Types of Bias

A. Selection bias: Tendency for samples to differ from the corresponding population as a result of systematic exclusion of some of the population.

B. Measurement or response bias: Tendency for samples to differ from the corresponding population because the method of observation tends to produce values that differ from the true value.

C. Nonresponse bias: Tendency for samples to differ from the corresponding population because data are not obtained from all individuals selected for inclusion in the sample.

V. Good Sampling Methods

A. Probability or random sampling: Individuals are randomly selected. No one group should be over-represented.

1. Random samples rely on the absolute objectivity of random numbers.

VI. Simple random samples

A. The simple random sample (SRS) is made of randomly selected individuals. Each individual in the population has the same probability of being in the sample. All possible samples of size n have the same chance of being drawn.

B. How to choose an SRS of size n from a population of size N:

1. Label. Give each member of the population a numerical label of the same length.

2. Table. To choose an SRS, read from Table B successive groups of digits of the length you used as labels. Your sample contains the individuals whose labels you find in the table.
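The label-and-select procedure can also be sketched in code. A minimal sketch, assuming a hypothetical population of 20 labeled individuals (Python's random.sample gives every possible sample of size n the same chance of being drawn):

```python
import random

# Hypothetical population: 20 individuals with two-digit labels 01..20.
population = [f"{i:02d}" for i in range(1, 21)]

random.seed(7)  # fixed seed only so the example is reproducible
srs = random.sample(population, k=5)  # SRS of size n = 5

print(srs)                 # five distinct labels
print(len(set(srs)) == 5)  # no individual appears twice
```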

C. Sampling with replacement: Sampling in which an individual or object, once selected, is put back into the population before the next selection.

1. A sample selected with replacement might include any particular individual from the population more than once

D. Sampling without replacement: Sampling in which an individual or object, once selected, is not put back into the population before the next selection.

1. A sample selected without replacement always includes n distinct individuals from the population
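The difference between the two schemes can be seen directly. A minimal sketch with a hypothetical population of 10, where random.choices samples with replacement and random.sample samples without:

```python
import random

population = list(range(1, 11))
random.seed(1)

# With replacement: an individual may be selected more than once.
with_repl = random.choices(population, k=10)

# Without replacement: always k distinct individuals.
without_repl = random.sample(population, k=10)

print(len(set(without_repl)))  # always 10 distinct values
```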

VII. Other sampling designs

A. A stratified random sample is essentially a series of SRS performed on subgroups of a given population. The subgroups are chosen to contain all the individuals with a certain characteristic.

1. The SRS taken within each group in a stratified random sample need not be of the same size.

B. Multistage samples use multiple stages of stratification. They are often used by the government to obtain information about the U.S. population.

1. Data are obtained by taking an SRS within each stratum at each stage.

2. Statistical analysis for multistage samples is more complex than for an SRS.

C. Cluster Sampling divides the population into subgroups (clusters) and then selects a random sample of clusters.

1. All individuals in the selected cluster are included in the sample.

2. Cluster sampling is most effective when the clusters are heterogeneous (each cluster mirrors the variety of the whole population)

D. 1 in k systematic sampling selects from an ordered arrangement of a population by choosing a starting point at random from the first k individuals and then selecting every kth individual after that.
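The 1-in-k procedure can be sketched as follows, assuming a hypothetical ordered population of 100 and k = 10:

```python
import random

def systematic_sample(population, k):
    """1-in-k systematic sample: random start among the first k,
    then every kth individual after that."""
    start = random.randrange(k)   # random index 0..k-1
    return population[start::k]

random.seed(3)
pop = list(range(1, 101))         # ordered population of 100
sample = systematic_sample(pop, k=10)
print(len(sample))                # 10 individuals, evenly spaced
```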

VIII. Learning about populations from samples (inference)

A. The techniques of inferential statistics allow us to draw inferences or conclusions about a population from a sample.

B. Your estimate of the population is only as good as your sampling design → work hard to eliminate biases.

C. Your sample is only an estimate—and if you randomly sampled again, you would probably get a somewhat different result.

D. The bigger the sample the better.

Sections 1.3 - 1.5

I. Terminology

A. The individuals in an experiment are the experimental units. If they are human, we call them subjects.

B. The explanatory variables in an experiment are often called factors.

C. A treatment is any specific experimental condition applied to the subjects. If an experiment has several factors, a treatment is a combination of specific values of each factor.

D. If the experiment involves giving two different doses of a drug, we say that we are testing two levels of the factor.

E. A response to a treatment is statistically significant if it is larger than you would expect by chance (due to random variation among the subjects). We will learn how to determine this later.

F. The explanatory variables are those variables that have values that are controlled by the experimenter.

1. Also called factors.

G. The response variable is the variable that the experimenter thinks may be affected by the explanatory variables.

1. Measured as part of the experiment but is not controlled by the experimenter.

II. How to experiment badly

A. Subjects → Treatment → Measure response

B. In a controlled environment of a laboratory (especially if human subjects are not being used), a simple design like this one, where all subjects receive the same treatment, can work well.

C. Field experiments and experiments with human subjects are exposed to more variable conditions and deal with more variable subjects.

III. Randomized comparative experiments

A. Experiments are comparative in nature: we compare the response to a treatment with the response to:

1. another treatment

2. no treatment (a control)

3. a placebo

4. or any combination of the above

B. A control is a situation in which no treatment is administered. It serves as a reference mark for an actual treatment (e.g., a group of subjects does not receive any drug or pill of any kind).

C. A placebo is a fake treatment, such as a sugar pill. It is used to test the hypothesis that the response to the treatment is due to the actual treatment and not to how the subject is being taken care of.

IV. Getting Rid of Sampling Bias

A. The best way to exclude biases in an experiment is to randomize the design. Both the individuals and treatments are assigned randomly.

B. A double-blind experiment is one in which neither the subjects nor the experimenter know which individuals got which treatment until the experiment is completed.

C. Another way to make sure your conclusions are robust is to replicate your experiment—do it over. Replication ensures that particular results are not due to uncontrolled factors or errors of manipulation.

V. Completely Randomized Designs

A. In a completely randomized experimental design, individuals are randomly assigned to groups, then the groups are randomly assigned to treatments.

VI. Matched Pairs Design

A. Matched pairs: Choose pairs of subjects that are closely matched— e.g., same sex, height, weight, age, and race. Within each pair, randomly assign who will receive which treatment.

1. It is also possible to just use a single person and give the two treatments to this person over time in random order. In this case, the “matched pair” is just the same person at different points in time.

VII. Block Designs

A. In a block design, subjects are divided into groups, or blocks, prior to the experiment to test hypotheses about differences between the groups.

VIII. Categorical variables: variables that can be put into categories and cannot be divided

IX. Quantitative variables: variables that can be described with numbers and can be divided

Chapter 2

I. Individuals and Variables

A. Individuals are the objects described by a set of data. Individuals may be people, but they may also be animals or things.

B. A variable is any characteristic of an individual. A variable can take different values for different individuals.

II. Categorical and Quantitative Variables

A. Quantitative: Something that can be counted or measured for each individual and then added, subtracted, averaged, etc., across individuals in the population.

B. Categorical: Something that falls into one of several categories. What can be counted is the count or proportion of individuals in each category.

III. Ways To Chart Categorical Data

A. Because the variable is categorical, the data in the graph can be ordered any way we want (alphabetical, by increasing value, by year, by personal preference, etc.).

B. Bar graphs: Each category is represented by a bar.

C. Pie charts: The slices must represent the parts of one whole.

IV. Ways To Chart Quantitative Data

A. Histograms and stemplots: Summary graphs for a single variable. They are very useful to understand the pattern of variability in the data.

B. Line graphs and time plots: Use when there is a meaningful sequence, like time. The line connecting the points helps emphasize any change over time.

V. Interpreting histograms

A. When describing a quantitative variable, we look for the overall pattern and for striking deviations from that pattern. We can describe the overall pattern of a histogram by its shape, center, and spread.

B. A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

C. A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

D. An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

VI. How To Make A Stemplot

A. Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, which is that remaining final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.

B. Write the stems in a vertical column with the smallest value at the top, and draw a vertical line at the right of this column.

C. Write each leaf in the row to the right of its stem, in increasing order out from the stem.

VII. Time Plots

A. Time always goes on the horizontal, or x, axis.

B. The variable of interest goes on the vertical, or y, axis.

C. How you stretch the axes and choose your scales can give a different impression.

Chapter 3

Sections 3.1 - 3.4

I. Measures of Center

A. Describe where the data distribution is located along the number line. It provides information about what is “typical.”

B. The mean or arithmetic average: To calculate the average, or mean, add all values, then divide by the number of individuals. It is the “center of mass.”

C. The median is the midpoint of a distribution—the number such that half of the observations are smaller and half are larger.

D. The mean and the median are the same only if the distribution is symmetrical. The median is a measure of center that is resistant to skew and outliers. The mean is not.
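The resistance of the median (and the sensitivity of the mean) is easy to see with a small made-up data set:

```python
from statistics import mean, median

data = [2, 3, 3, 4, 5]
skewed = [2, 3, 3, 4, 100]   # same data with one extreme value

print(mean(data), median(data))      # 3.4 3
print(mean(skewed), median(skewed))  # 22.4 3 -> the median barely moves
```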

II. Measures of Spread

A. Describe how much variability there is in a data distribution. It provides information about how much individual values tend to differ from one another.

B. The first quartile, Q1, is the value in the sample that has 25% of the data at or below it.

C. The third quartile, Q3, is the value in the sample that has 75% of the data at or below it.

III. The five-number summary and boxplots

A. The Five-Number Summary

1. Minimum

2. First Quartile

3. Median

4. Third Quartile

5. Maximum

B. Boxplots remain true to the data and clearly depict symmetry or skewness.

IV. IQR and outliers

A. The interquartile range (IQR) is the distance between the first and third quartiles (the length of the box in the boxplot)

B. An outlier is an individual value that falls outside the overall pattern.

V. Standard Deviation

A. The standard deviation is used to describe the variation around the mean.

B. First calculate the variance: s² = Σ(x − x̄)² / (n − 1).

C. Then take the square root to get the standard deviation s.
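The two steps can be sketched directly from the definition of the sample variance (made-up data):

```python
from math import sqrt

def sample_variance(xs):
    """s^2 = sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

data = [4, 8, 6, 5, 7]
s2 = sample_variance(data)   # step 1: the variance
s = sqrt(s2)                 # step 2: the standard deviation
print(s2, s)                 # 2.5 and its square root
```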

VI. Choosing among summary statistics

A. Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don’t have outliers. → Plot the mean and use the standard deviation for error bars.

B. Otherwise, use the median in the five-number summary, which can be plotted as a boxplot.

Sections 3.5 - 4.6

3.5 – z-scores and Percentiles

I. The z-score

A. z-score = (x − x̄) / s, where x̄ is the sample mean and s is the sample standard deviation

B. Tells you how many standard deviations the data value is from the mean

C. Positive when the data value is greater than the mean and negative when the data value is less than the mean

D. Useful when the data distribution is mound shaped and approximately symmetric
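The z-score computation is one line; a minimal sketch with hypothetical test scores (mean 70, standard deviation 10):

```python
def z_score(x, mean, sd):
    """Number of standard deviations the value x lies from the mean."""
    return (x - mean) / sd

# Hypothetical example: test scores with mean 70 and sd 10.
print(z_score(85, 70, 10))   # 1.5  -> above the mean
print(z_score(60, 70, 10))   # -1.0 -> below the mean
```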

II. Percentiles

A. For a number r between 0 and 100, the rth percentile is a value such that r percent of the observations in the data set fall at or below that value

Normal Distributions

I. Density Curves

A. A density curve is a mathematical model of a distribution

B. It is always on or above the horizontal axis

C. The total area under the curve, by definition, is equal to 1, or 100%

D. The area under the curve for a range of values is the proportion of all observations for that range

II. Normal Distributions

A. Normal—or Gaussian—distributions are a family of symmetrical, bell-shaped density curves defined by a mean μ (mu) and a standard deviation σ (sigma): N(μ, σ)

B. Because all Normal distributions share the same properties, we can standardize our data to transform any Normal curve N(μ, σ) into the standard Normal curve N(0, 1)

III. Calculating z-scores

A. A z-score measures the number of standard deviations that a data value x is from the mean: z = (x − μ) / σ

B. To calculate the x value when you know the percentile, use x = μ + zσ

C. Find the z-score that corresponds to the percentile, then solve for x
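The percentile-to-x calculation is built into the standard library. A minimal sketch, assuming hypothetical heights distributed N(64, 2.7):

```python
from statistics import NormalDist

# Hypothetical distribution: N(mu = 64, sigma = 2.7).
dist = NormalDist(mu=64, sigma=2.7)

x = dist.inv_cdf(0.90)           # x value at the 90th percentile
z = (x - 64) / 2.7               # the matching z-score (~1.28)
print(round(x, 2), round(z, 2))
```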

3.6 – Avoiding Common Mistakes

I. Watch out for categorical data that look numerical

A. Often, categorical data are coded numerically, but the codes don’t behave like numbers

B. Categorical data cannot be summarized using the mean and standard deviation or median and interquartile range

II. Measures of center don’t tell all

A. They are only one characteristic of the data set

B. Without other information about the variability and distribution shape, you don’t know much about the behavior of the variable

III. Data distributions with different shapes

A. They can have the same mean and standard deviation

IV. Mean and standard deviation sensitivity

A. Both are sensitive to extreme values in a data set, especially if the data size is small

B. If the data have outliers or are markedly skewed, use the median and interquartile range

V. Measures of center and variability

A. Only describe values of a variable, nothing else

VI. Box plots

A. Be careful with small sample sizes

VII. Mound shape

A. Not all distributions are mound shaped

B. Do not use the Empirical Rule if the data is skewed

VIII. Outliers

A. Unusual observations in a data set often provide important information about the variable under study

B. It is important to consider outliers in addition to considering what is typical

C. The values of some summaries can be greatly influenced by outliers

4.1 – Scatterplots and Correlation

I. Bivariate data

A. Consists of measurements or observations on two variables, x and y

B. Can show a linear or nonlinear relationship

C. Can be either positive or negative

D. Explanatory and response variables

1. A response variable measures or records an outcome of a study

a. Plotted on the y axis

2. An explanatory variable explains changes in the response variable

a. Plotted on the x axis

II. Interpreting Scatterplots

A. After plotting two variables on a scatterplot, we describe the relationship by examining the form, direction, and strength of the association

1. Form – linear, curved, clusters, no pattern

2. Direction – positive, negative, no direction

3. Strength – how closely the points fit the form

B. We also look at the deviations from that pattern

1. Outliers

C. Associations

1. Positive association – high values of one variable tend to occur together with high values of the other variable

2. Negative association – high values of one variable tend to occur together with low values of the other variable

3. No relationship – x and y vary independently; knowing x tells you nothing about y

4. The strength of the association between the two variables can be seen by how much variation, or scatter, there is around the main form

III. Outliers

A. An outlier is a data value that has a very low probability of occurrence

1. In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship

IV. The correlation coefficient: r

A. The correlation coefficient is a measure of the direction and strength of a relationship

B. It is calculated using the mean and the standard deviation of both the x and y variables: r = (1/(n − 1)) Σ [(x − x̄)/s_x][(y − ȳ)/s_y]

C. Correlation can only be used to describe quantitative variables

1. Categorical variables don’t have means and standard deviations

D. Part of the calculation involves finding z, the standardized score we used when working with the Normal distribution: z_x = (x − x̄)/s_x for the x values and z_y = (y − ȳ)/s_y for the y values

E. Standardization – allows us to compare correlations between data sets where variables are measured in different units or when variables are different

F. r has no unit

1. Changing the units of variables does not change the correlation coefficient r because we get rid of all our units when we standardize (get z-scores)

G. r ranges from -1 to +1

1. r quantifies the strength and direction of a linear relationship between two quantitative variables

2. Strength – how closely the points follow a straight line

3. Direction is positive when individuals with higher x values tend to have higher values of y

4. The less scatter there is around the linear form, the stronger the correlation (closer to +1 or −1)

V. Influential Points

A. Correlations are calculated using means and standard deviations and thus are not resistant to outliers

B. The outliers that, if removed, would dramatically change the correlation and best fit line are called influential points
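Putting the pieces together, r can be computed from the standardized scores. A minimal sketch with made-up data (statistics.stdev uses the sample standard deviation, matching the n − 1 divisor):

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1), using sample standard deviations."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(correlation(xs, ys), 3))  # moderately strong positive r
```

Note that r is unitless: rescaling xs or ys (say, inches to centimeters) leaves the result unchanged, because the z-scores absorb the units.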

4.2 – Regression

I. Regression Lines

A. The least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is the smallest possible

B. Distances between the points and the line are squared so that all are positive values and can be properly added.

II. The Least Squares Regression Line

A. The distinction between explanatory and response variables is essential in regression

1. If you exchange y for x in calculating the regression line, you will get the wrong line

2. Regression examines the distance of all points from the line in the y direction only

B. There is a close connection between correlation and the slope of the least-squares line

C. The least-squares regression line always passes through the point (x̄, ȳ)

D. The correlation r describes the strength of a straight-line relationship

E. The square of the correlation, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x

F. First we calculate the slope of the line, b, from statistics we already know: b = r (s_y / s_x)

G. Once we know b, the slope, we can calculate a, the y-intercept: a = ȳ − b·x̄

1. where x̄ and ȳ are the sample means of the x and y variables

H. This means that we don’t have to calculate a lot of squared distances to find the least-squares regression line for a data set. We can instead rely on the equation ŷ = a + bx
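These two formulas can be sketched directly. Made-up data chosen to lie exactly on y = 2x + 1, so r = 1 and the fit is exact:

```python
from statistics import mean, stdev

def least_squares(xs, ys, r):
    """Slope b = r * (s_y / s_x); intercept a = ybar - b * xbar."""
    b = r * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 2x + 1, so r = 1
a, b = least_squares(xs, ys, r=1.0)
print(a, b)                # intercept 1.0, slope 2.0
```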

III. Correlation and Regression

A. The correlation is a measure of spread (scatter) in both the x and y directions in the linear relationship

B. In regression we examine the variation in the response variable (y) given change in the explanatory variable (x)

IV. Making Predictions

A. The equation of the least-squares regression allows you to predict y for any x within the range studied, which is called interpolating

B. A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables

C. Two variables are confounded when their effects on a response variable cannot be distinguished from each other

1. The confounded variables may be either explanatory variables or lurking variables

4.3 – Assessing the Fit of a Line

I. Residuals

A. The distances from each point to the least-squares regression line are called residuals. They give us potentially useful information about the contribution of individual data points to the overall pattern of scatter

B. The sum of the residuals is always 0

C. If residuals are scattered randomly around 0, chances are that your data fit a linear model, were normally distributed, and you don’t have any outliers
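A quick check of the zero-sum property, using made-up data whose least-squares fit (a = 0.5, b = 1.4, computed by hand for this sketch) is plugged in directly:

```python
# Residuals from a fitted line y_hat = a + b*x; their sum is ~0
# whenever a and b come from least squares.
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 6]
a, b = 0.5, 1.4   # least-squares fit for these points

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(residuals, round(sum(residuals), 10))  # sum is 0
```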

II. The Coefficient of Determination

A. The coefficient of determination, r², is the proportion of variability in y that can be attributed to an approximate linear relationship between x and y

B. The value of r² is often converted to a percentage

C. r² represents the fraction of the variance in y (vertical scatter from the regression line) that can be explained by changes in x

III. Interpreting the Values of sₑ and r²

Chapter 5

Sections 5.1 - 5.2

I. Two events are independent if the probability that one event occurs is not affected by the occurrence of the other

A. The trials are only independent if there is sampling with replacement

II. S = Sample Space: the set of all possible outcomes. With k possible outcomes per trial and n trials, the sample space contains kⁿ outcomes.

A. Ex. The sample space of a coin tossed 3 times has 2³ = 8 outcomes (2 outcomes per toss, 3 tosses)

B. This count treats order as important (i.e., HT is different from TH); if order does not matter, the number of distinct outcomes is smaller (counted with combinations; see Section 5.3)
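The sample space for repeated trials can be enumerated directly, which makes the counting rule concrete:

```python
from itertools import product

# Sample space of 3 coin tosses: 2 outcomes per toss, 3 tosses
# -> 2**3 = 8 ordered outcomes.
space = list(product("HT", repeat=3))
print(len(space))   # 8
print(space[0])     # ('H', 'H', 'H')
```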

III. Two events are disjoint if they have no outcomes in common and can never happen together.

A. This is written: P(A or B) = P(A ∪ B) = P(A) + P(B)

IV. Theoretical probability: from understanding the phenomenon and symmetries in the problem

V. Empirical probability: also called experimental probability; from our knowledge of numerous, similar, past events

VI. Personal Probability: from subjective considerations

Section 5.3

I. Basic Counting Principle: if one event can be chosen in p different ways and a second event can be chosen in q different ways, then there are p × q ways to choose both.

II. Permutation: An arrangement of objects in which order is important.

A. The number of orderings of all n objects is P(n,n) = n!

B. The number of permutations of n objects taken r at a time is written as P(n,r)

P(n,r) = n! / (n − r)!

C. The number of permutations of n objects of which p are alike and q are alike: n! / (p! · q!)

III. Combination: The arrangement of objects where order does not matter.

A. The number of combinations of n objects taken r at a time is written C(n,r)

C(n,r) = n! / (r! (n − r)!)
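Both counts are available in the standard library, and the like-objects permutation formula is a short expression:

```python
from math import comb, factorial, perm

# P(n, r): ordered arrangements; C(n, r): unordered selections.
print(perm(5, 2))   # 5!/(5-2)! = 20
print(comb(5, 2))   # 5!/(2! * 3!) = 10

# Permutations of n objects with p alike and q alike, e.g. "AABBB":
print(factorial(5) // (factorial(2) * factorial(3)))  # 10
```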

IV. Independent probability: P(A and B) = P(A ∩ B) = P(A) · P(B)

V. Dependent probability: P(A and B) = P(A) · P(B|A)

VI. Mutually exclusive events: two or more events that cannot occur at the same time (disjoint).

A. P(A or B) = P(A) + P(B)

VII. Mutually inclusive events: two or more events that can occur at the same time.

A. P(A or B) = P(A) + P(B) − P(A ∩ B); if A and B are independent, P(A ∩ B) = P(A) · P(B)

Section 5.4

I. Conditional probability: Finding the probability of an event given that some preceding event has already occurred.

A. Reduced sample space – subset of a sample space that contains only those outcomes that satisfy a given condition

II. “Probability of A given B” = P(A|B) = P(A ∩ B) / P(B) = n(A ∩ B) / n(B)
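The reduced-sample-space idea can be sketched by counting outcomes for two dice (hypothetical events chosen for illustration: A = "sum is 8", B = "first die shows 3"):

```python
from fractions import Fraction

# P(A|B) = P(A and B) / P(B), computed by counting outcomes.
space = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]  # 36 outcomes

A_and_B = sum(1 for d1, d2 in space if d1 == 3 and d1 + d2 == 8)
B = sum(1 for d1, d2 in space if d1 == 3)

print(Fraction(A_and_B, B))   # 1/6: one favorable outcome among the
                              # six in the reduced sample space
```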