# Stats Chapter 1 Notes

By walkthatwalk
Nov 19, 2013

Triola, Elementary Statistics with TI-83/84+ Calculator, 3e

Chapter 1: Introduction to Statistics

1-1: Review and Preview

Definitions:

• Data: observations (such as measurements, genders, survey responses) that have been collected.

• Statistics (the subject): a collection of methods for planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on data.

• Population: any complete collection of individuals (people, animals, plants or things) from which we may collect data.

• Census: a collection of data from every member in the population.

• Sample: a subcollection of members from the population.

Example #1: Evaluate each situation, and identify whether it is a population or a sample.

1. American citizens

_______________________

2. Marketing research for a new deodorant

_______________________

3. All registered voters

_______________________

4. People of North America

_______________________

5. A Gallup poll

_______________________

6. The world's births

_______________________

7. A survey of smoking habits of hypertensive patients

_______________________

Important features about a sample:

• Sample data must be collected in an appropriate way. This means that the sample must be obtained through random selection.

• If sample data are not collected in an appropriate way (randomly), the data will not allow us to draw conclusions for the population.

Example #2:

A poll asked 1,000 American adults: “Do you agree that ‘Global Warming’ is a phenomenon that is occurring with foreseeable consequences?”

What is the sample? _______________________________________________

What is the population of interest? ___________________________________________________


1-2: Statistical Thinking

During statistical analysis, we must consider the following factors:

• Context of the data: a description of what the values represent

• Source of the data: where the data came from

• Sampling method: how the individuals from whom the data were collected were chosen

• Conclusions: statements about the statistical analysis that should be clear to those without an understanding of statistics

• Practical implications: any suggestions, implied consequences, or broader-reaching implications in the real world

Practical significance: what the results mean in the real world.

Statistical significance: the difference in outcome (such as for different treatments) is so large that it is very unlikely to be due to chance alone.

1-3: Types of Data

Definitions:

• Individuals are the objects described by a set of data. Individuals may be people (subjects), but they may also be animals or things (experimental units).

Examples: _____________________________________________________________

• A variable is any characteristic of an individual that we plan to collect data on. A variable usually takes different values for different individuals.

Examples: _____________________________________________________________

• Parameter: numerical measurement describing some variable of a population.

• Statistic: numerical measurement describing some variable of a sample.
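To make the parameter/statistic distinction concrete, here is a minimal Python sketch. The population of 100,000 voters, the 60% support rate, and the sample size of 2,500 are all made-up values for illustration only; they are not taken from any example in these notes.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100,000 voters; True means "supports the candidate".
population = [True] * 60_000 + [False] * 40_000

# Parameter: a number computed from EVERY member of the population.
parameter = sum(population) / len(population)

# Statistic: a number computed from a sample drawn from that population.
sample = random.sample(population, 2_500)
statistic = sum(sample) / len(sample)

print(f"parameter (population proportion): {parameter:.3f}")
print(f"statistic (sample proportion):     {statistic:.3f}")
```

The statistic will usually land near the parameter without matching it exactly, which is why the two are given different names.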

Example #3: Statistic versus Parameter

Before the San Diego Mayoral Election in 2005, 2,500 San Diegans were asked who they planned to vote for and 41.5% of that group said they were going to vote for Donna Frye. In the actual election, Donna Frye actually received 46% of the popular vote.

The value 41.5% is a ___________________ because it describes a numerical measurement from the _________________. The value 46% is a _____________________ because it describes a numerical measurement from the ________________.

Quantitative vs. Qualitative Data:

• Quantitative (numerical) variable: variable with data that consist of numbers representing counts or measurements for which arithmetic operations such as adding and averaging make sense. These variables are usually recorded in a unit of measurement such as kilograms, pounds, inches, centimeters, dollars, etc.

Examples: _____________________________________________________________________

• Categorical (qualitative) variable: variable with data that take on category names or labels. The individuals fall into one of several groups or categories.

*Data that take on numbers can still be CATEGORICAL if the numbers don’t count or measure anything.

Examples: _____________________________________________________________________


Quantitative variables can be further broken down into:

• Discrete variables: have data whose possible values are either a finite or a “countable” number.

• Continuous variables: have data with infinitely many possible values that correspond to some scale covering a range of values without gaps, interruptions, or jumps.

Example #4:

The number of eggs a hen lays is a __________________________ variable since we can count the number of eggs. The amount of milk a cow produces is a _______________________ variable since the amount is not restricted to a discrete number of gallons, for example, and could take on amounts like 2.51167983 gallons.

In addition to being classified as quantitative (then either discrete or continuous) or qualitative, data can also be classified into one of four levels of measurement.

Levels of Measurement:

• Nominal: data that consist of names, labels, or categories only. The data cannot be arranged in any order from low to high (example: eye color).

• Ordinal: data that can be arranged in order, but differences between data values either cannot be determined or are meaningless (example: grades in a course A, B, C, D, F).

• Interval: data that can be arranged in order and the difference between any two data values is meaningful, but there is no natural zero starting point (example: year).

• Ratio: data that can be arranged in order, the difference between any two data values is meaningful, AND there is a natural zero starting point (example: price).

Example #5: Determine the level of measurement for the following variables:

Course Grades (A, B, C, D, E, F): _____________________

Temperature in ˚F: ____________________________

Price of college textbooks: __________________________

Political Party: __________________________

1-4: Critical Thinking

We must think critically about the context of the data, source of the data, sampling method, conclusions, and practical implications. If there is a misunderstanding of the context or source of the data, or a flaw in the sampling method, the conclusions drawn from the statistical analysis may be flawed, and thus the practical implications may not apply. Statistics are often misused, and data are sometimes distorted for the purpose of deception! Here are a few sources of bias or error in statistical analysis:

1. Bad Graphs/Misuse of Graphs

2. Bad Samples

3. Correlation being used as Causality

4. Self-Reported Data

5. Small Samples

We will address each of these individually.


1. Bad Graphs/Misuse of Graphs

What is wrong with the graph below?

2. Bad Samples

Some samples are bad in the sense that the method used to collect the data was somehow biased. One example of a biased sample is a voluntary response sample.

Voluntary Response Sample: a sample in which the respondents themselves decide whether or not to be included. Examples would be mail-in, phone-in, and Internet polls.

Why does this result in possible bias?

3. Correlation being used as Causality

This error occurs when an association is found between two variables and it is concluded that one variable causes the other. However, just because two variables are correlated does not imply causality. There is a very strong correlation between shoe size and reading score. As shoe size increases, reading score increases. Does that mean that having a bigger shoe size causes one to be a better reader? Do big feet cause better reading???

4. Self-Reported Data

The moral of the story is: people lie. If you collect data by allowing people to self-report it, the results are likely to be distorted. For example, let’s say we are doing a study about weight. If people are allowed to simply tell the researcher what they weigh, rather than being weighed by the researcher, are the calculations and conclusions about weight going to be accurate?

5. Small Samples

Conclusions based on samples that are far too small, relative to the population of interest, will not be valid. For example, if I am interested in the average number of hours students at SWC study per week and I take a sample of five people, will the conclusion be valid?


Here are some biases that can happen in the collection and reporting process:

1. Loaded Questions

2. Order of Questions

3. Nonresponse

4. Missing Data

5. Self-Interest Study

6. Precise Numbers

7. Deliberate Distortion

Again, we will address each of these individually:

1. Loaded Questions: questions on a survey that are misleading, confusing, or worded to elicit a desired response. This leads to response bias.

Example: Don’t you think it is wrong that there is a proposal to put in a new casino in Jamul that would cause a massive traffic jam in the area and would have a negative impact on the natural wildlife in the area?

2. Order of Questions: People often choose whichever option they hear first.

3. Nonresponse: Nonresponse bias occurs when someone is intended to be in the study/survey, but either refuses to respond or is unavailable to respond.

4. Missing Data: when an individual agrees to take part in the study, but does not answer all questions resulting in missing data for some of the variables.

For example, people with low incomes may not respond to the question “What is your annual income?” but may answer all the other questions in the survey. Thus on the variable “Income” the calculations are invalid because low-income individuals are not represented.

5. Self-Interest Study is a study that may be sponsored by a company or other entity that has an interest in the outcome. For example, if a pharmaceutical company is the one sponsoring a study about one of their new drugs, why might that be a problem?

6. Precise Numbers: Just because the numbers reported are precise does not mean that they are accurate. Consider the claim “There are now 105,215,027 households in the U.S.” Because the number 105,215,027 is so precise, people erroneously assume “hey, that must be accurate!”

7. Deliberate Distortion: Sometimes reports of studies are just LIES. This is why companies often get sued for false claims and advertisements.


1-5: Collecting Sample Data

*The sample data must be collected in an appropriate way so that we can draw accurate conclusions about our data and infer them for our population!

PART 1: Basics of Collecting Data

An observational study observes individuals and measures variables of interest but does not attempt to influence or modify the individuals being studied.

Example: A group of 40 women aged 65 and older was tracked for several years. Researchers looked at the calcium intake of the women and the occurrence of osteoporosis.

An experiment imposes some treatment on individuals and then observes its effects on them.

Example: To research the effects of “dietary patterns” on blood pressure in 450 subjects, subjects were randomly assigned to three groups and had their meals prepared by dieticians. The blood pressure of the subjects was measured at the start and end of the study and compared.

Random Samples:

• Random sample: members from the population are selected in such a way that each individual member has an equal chance of being selected in the sample.

• Simple random sample: the sample of n subjects is selected in such a way that every possible sample of size n has the same chance of being chosen.

• Probability sample: selecting members of a population to be in the sample in such a way that each member has a known (but not necessarily the same) chance of being selected.
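A simple random sample can be drawn in a few lines of Python. This is only a sketch: the population of 500 ID numbers and the helper name `simple_random_sample` are hypothetical, chosen for illustration. It relies on the standard library's `random.sample`, which picks distinct members so that every subset of the requested size is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every possible sample of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, n)

# Hypothetical population: ID numbers 1 through 500.
population_ids = list(range(1, 501))

sample = simple_random_sample(population_ids, 10, seed=42)
print(sorted(sample))
```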

Other Types of Sampling:

• Stratified Sampling: we subdivide the population into non-overlapping groups (or strata) whose members share similar characteristics, e.g. geographical areas, age groups, or genders. A random sample is taken from each stratum and then these smaller samples are combined to form the entire sample.

Example #6: Suppose you were studying the cost of living in San Diego and wanted to sample 1000 homes. A simple random sample of homes in San Diego might, by chance, include homes only in East County, which is not representative of all of San Diego County. How could you divide San Diego into geographic regions (strata)?

From each of these strata, we randomly sample the same number of homes, and then combine them together for a full sample representing San Diego.
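The stratified procedure can be sketched in Python as follows. The four region names, the 5,000 homes per region, and the helper `stratified_sample` are made up for illustration; they are not a suggested answer to how San Diego should be divided.

```python
import random

def stratified_sample(strata, per_stratum, seed=None):
    """Take a simple random sample of `per_stratum` units from each stratum, then combine."""
    rng = random.Random(seed)
    combined = []
    for name, units in strata.items():
        for unit in rng.sample(units, per_stratum):
            combined.append((name, unit))
    return combined

# Hypothetical strata: four made-up regions with 5,000 home IDs each.
strata = {
    region: [f"{region}-home-{i}" for i in range(5_000)]
    for region in ["Region A", "Region B", "Region C", "Region D"]
}

sample = stratified_sample(strata, per_stratum=250, seed=0)
print(len(sample))  # 4 strata x 250 homes each
```

Because every stratum contributes the same number of homes, no region can be left out by chance, which is the weakness of the simple random sample described in the example.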


• Cluster Sampling: the population is divided into (heterogeneous) groups called clusters. We then randomly select clusters and measure all of the individuals within the selected clusters.

Example #7: To measure customer satisfaction, airlines often randomly sample a set of flights (serving as clusters), say 10 from a possible 200 flights, and distribute a survey to every person on the selected flights.
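A rough Python sketch of the airline example follows. The 10-of-200 flight counts come from the example; the 150 passengers per flight, the naming scheme, and the helper `cluster_sample` are assumptions made only so the code runs.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep EVERY member of each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical data: 200 flights, each assumed to carry 150 passengers.
flights = {
    f"flight_{i:03d}": [f"flight_{i:03d}_passenger_{j}" for j in range(150)]
    for i in range(200)
}

selected = cluster_sample(flights, 10, seed=0)
surveyed = [p for passengers in selected.values() for p in passengers]
print(len(selected), len(surveyed))  # 10 flights, every passenger on each one
```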

**Stratified sampling and cluster sampling are often confused. Cluster sampling uses all members from a sample of clusters, whereas stratified sampling uses a sample of members from each stratum!

• Systematic Sampling: This is random sampling with a system! A starting point is chosen at random, and thereafter every kth element is included in the sample (individuals are chosen at regular intervals). This method is often used in industry for quality control: an item is selected for testing from a production line at regular intervals (say, every fifteen minutes) to ensure that machines and equipment are working to specification.

It is first necessary to know the size of the whole population from which the sample is being selected. The appropriate sampling interval, I, is then calculated by dividing the population size, N, by the required sample size, n. (Note: if I is not a whole number, it is rounded to the nearest whole number.)

Example #8: Suppose you want to sample 8 houses from a street of 120 houses. 120/8=15, so every 15th house is chosen after a random starting point between 1 and 15. If the random starting point is 14, then what houses would be selected?
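The interval calculation can be sketched in Python. The street of 120 houses and the sample of 8 come from Example #8; the starting point of 7 and the helper name `systematic_sample` are hypothetical, picked so the sketch does not give away the exercise.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Pick every I-th unit after a random start, where I = N / n rounded to a whole number."""
    interval = round(population_size / sample_size)
    if start is None:
        # Random starting point between 1 and the interval, inclusive.
        start = random.Random(seed).randint(1, interval)
    picks = [start + i * interval for i in range(sample_size)]
    # Drop any pick that would fall past the end of the population.
    return [p for p in picks if p <= population_size]

houses = systematic_sample(population_size=120, sample_size=8, start=7)
print(houses)  # [7, 22, 37, 52, 67, 82, 97, 112]
```

Leaving `start` unset lets the function choose the random starting point itself.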

• Multistage Sampling: a large area, such as a country, is first divided into smaller regions (such as states), and a random sample of these regions is collected. In the second stage, a random sample of smaller areas (such as counties) is taken from within each of the regions chosen in the first stage. Then, in the third stage, a random sample of even smaller areas (such as neighborhoods) is taken from within each of the areas chosen in the second stage. If these areas are sufficiently small for the purposes of the study, then the researcher might stop at the third stage. If not, he or she may continue to sample from the areas chosen in the third stage, etc., until appropriately small areas have been chosen.

Note: a different method of sampling (one of the other sampling methods above) could be used at each level of the multistage sampling process!!!

Example #9: Southwestern College is interested in whether students prefer classes with some online component or no online component. To find out, the college first randomly selects 3 schools within the college campus. From each school, two departments are chosen; from within each department, five classes are chosen; and lastly, from each class, 10 students are chosen. How many students will end up in the sample?


Biased Sampling Methods:

• Voluntary Response Sampling: Individuals are asked to provide information, and all who respond are counted.

• Convenience Sampling: Selects individuals that are easiest to reach.

Other Biases in Sampling:

1. Undercoverage: When some groups in the population are left out of the process of choosing the sample.

2. Nonresponse Bias: When an individual chosen for the sample can’t be contacted or refuses to cooperate.

3. Response Bias: When the behavior of the interviewer, wording of the questions, order of the questions, or incentives given influence the outcome of the survey or questionnaire.

PART 2: BEYOND THE BASICS OF COLLECTING DATA

Types of Observational Studies:

1) Retrospective (case-control) study is a study that looks backwards in time. A typical retrospective design is used when studying a disease that takes a long time to appear. The biggest problem in a retrospective study is that some of the information that we need may be hard to get.

2) Prospective (cohort) study is a study that looks forward in time. The study usually involves taking a group of subjects (cohort) and following them over a long period of time. The outcome of interest should be common and retaining participants in the study is important to avoid bias. These studies usually have less chance of bias than retrospective studies.

3) Cross-sectional study is a study where data are observed, measured, and collected at one point in time.

Parts of an Experiment:

Individuals: the people or things being studied in an experiment. When the individuals are people, the individuals are called subjects.

Factors: the explanatory variables in an experiment. These are what the researchers are modifying/controlling.

Response Variable(s): the measured outcome(s) of interest.

Levels: the specific values of the factor.

Treatment: the experimental condition applied to the individuals in the experiment. If an experiment has several factors, a treatment is a combination of different levels of each factor.
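When an experiment has several factors, the full list of treatments is every combination of one level from each factor. The Python sketch below enumerates them with `itertools.product`; the two factors and their levels (fertilizer and watering) are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical factors, each mapped to its list of levels.
factors = {
    "fertilizer": ["none", "low", "high"],
    "watering": ["daily", "weekly"],
}

# Each treatment is one combination of a level from every factor.
treatments = [dict(zip(factors, combo)) for combo in product(*factors.values())]

for treatment in treatments:
    print(treatment)
print(len(treatments))  # 3 levels x 2 levels = 6 treatments
```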

Example #10: To research the effects of “dietary patterns” on blood pressure in 450 subjects, subjects were randomly assigned to one of three diets (low fat, low carb or usual diet) and to one of two physical activity levels (low or high physical activity). The difference in blood pressure from the start to the finish was the outcome.


Individuals: _____________________________________________

Factors: ________________________________________________

Response Variable: _____________________________________

Levels of Each Factor: ___________________________________________________________

Treatments: ______________________________________________

In order to study the effect of one or more variables on another variable (or variables), it is important to control all other possible influences outside of the treatments being imposed. For this reason, we wish to conduct randomized comparative experiments.

An experiment that uses both comparison of two or more treatments and chance assignment of subjects to treatments is a randomized comparative experiment.
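Chance assignment of subjects to treatments can be sketched in Python as below. The 30 subject labels, the two group names, and the helper `randomly_assign` are hypothetical; the idea is simply to shuffle the subjects and deal them out so that no person chooses who gets which treatment.

```python
import random

def randomly_assign(subjects, treatments, seed=None):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

# Hypothetical subjects and treatment names.
subjects = [f"subject_{i}" for i in range(1, 31)]
groups = randomly_assign(subjects, ["treatment", "control"], seed=3)

for name, members in groups.items():
    print(name, len(members))
```

Dealing the shuffled subjects out in turn keeps the group sizes equal (or within one of each other).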

1. Randomization: Subjects are assigned to treatment groups by chance to ensure that we do not impose personal bias in the selection process. We also use enough subjects in each group to reduce chance variation in the results.

2. Control Groups: Using control groups helps to ensure that we account for known factors that could affect a study’s results. Researchers, however, may be unaware of important factors and not account for them in the experiment.

3. Replication: Repetition or duplication of an experiment so that the results can be confirmed or verified, to ensure that the results we got the first time were not just due to chance.

When we forget to control for or consider other variables’ effects on the response, we may end up with incorrect conclusions about the cause-and-effect relationship between the treatment and the response.

Confounding occurs in an experiment when you are not able to distinguish among the effects of different factors.

Confounding variables: additional explanatory variables that affect the response but are not considered when exploring the explanatory/response relationship.

[Diagram not reproduced here; its labels were X, Y, R, and T.]

Example: A professor wishes to test the effect of his new attendance policy (“your grade will be lowered by one point for every class missed”) and plans to compare the average number of missed classes this semester (with the new policy) with last semester (no policy). However, this semester had “mild” weather, whereas last semester had many days with blizzard conditions. What is the confounding variable?


More Experiment Terminology:

• Placebos: A placebo is a fake treatment, made from an inactive substance like sugar, distilled water, or saline solution. It should have no effect on the outcome measured on the subject.

• Blinding: In an experiment, it is desirable to keep the information about the treatments hidden from the patients and anyone involved with evaluating the patient. This is known as blinding. It is “double blind” if both patients and evaluators are unaware, “single blind” if only the patient is unaware. Blinding prevents conscious or subconscious biases or expectations from influencing the outcome of the study. It minimizes the placebo effect (when subjects receiving the placebo still report improvement in symptoms).

• Statistically Significant: When comparing the treatment groups, if the difference between the two is so large that it would rarely occur by chance, this is called statistically significant.

Types of Experimental Designs:

1. Completely Randomized: randomly assigning subjects into treatment groups and measuring an outcome.

Example #11: A Completely Randomized Design

In the early 1980’s, an experiment was conducted to test if AZT affects survival time of AIDS patients. 1200 AIDS patients between the ages of 35 and 55 with similar severity of disease were selected. 600 were randomly selected to receive the drug AZT. There were three groups among the 600 (10 mg, 20 mg, and 30 mg of AZT). The other half received an inert substance made of sugar. Researchers and patients were unaware which pill was the AZT and which pill was the placebo. Both pills were white in color and had the same shape, smell, and taste.

• Draw a diagram to represent this experiment.

(Identify the subjects, the response variable, the factor levels, the treatments, where randomization occurs, and any factors that have already been controlled for)

• Was “Blinding” used in this experiment?

• What other factors should be controlled for that might affect the response variable?

• What is the placebo in this study?

Suppose for this particular experiment, 347 of the 600 individuals in the placebo group died in a 5-year period, while 110 of the 600 individuals in the treatment groups died in a 5-year period. Is this just a fluke or does the treatment really help extend survival time (statistically significant)?
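One informal way to explore the "fluke" question is a shuffling simulation, sketched below in Python. This approach is not part of these notes; it is an illustration that assumes the treatment makes no difference, reshuffles the 1,200 outcomes into two groups of 600 many times, and counts how often chance alone produces a gap in death rates as large as the one observed. The 2,000 shuffles and the fixed seed are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Counts from the example: deaths within 5 years in each group of 600.
placebo_deaths, treatment_deaths, group_size = 347, 110, 600
observed_diff = placebo_deaths / group_size - treatment_deaths / group_size

# Pool all 1,200 outcomes (1 = died, 0 = survived).
total_deaths = placebo_deaths + treatment_deaths
outcomes = [1] * total_deaths + [0] * (2 * group_size - total_deaths)

n_shuffles = 2_000  # arbitrary number of reshuffles
as_extreme = 0
for _ in range(n_shuffles):
    random.shuffle(outcomes)
    fake_placebo = sum(outcomes[:group_size])    # deaths landing in the first 600
    fake_treatment = total_deaths - fake_placebo  # deaths landing in the other 600
    diff = (fake_placebo - fake_treatment) / group_size
    if abs(diff) >= observed_diff:
        as_extreme += 1

print(f"observed difference in death rates: {observed_diff:.3f}")
print(f"shuffles at least that extreme: {as_extreme} of {n_shuffles}")
```

If almost no shuffles reach the observed gap, chance alone is a poor explanation for the difference, which is what "statistically significant" is meant to capture.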


2. Block Design: When groups of subjects or experimental units are similar, it’s often useful to “block them off” and isolate the variability between groups so we can see the effect of the treatments more clearly. This design allows for comparisons of more than 2 treatments.
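Randomizing within blocks can be sketched in Python as follows. The two fields, the six plots per field, the treatment names A, B, and C, and the helper `randomize_within_blocks` are all hypothetical. The key point is that the shuffle happens separately inside each block, so every block receives every treatment.

```python
import random

def randomize_within_blocks(blocks, treatments, seed=None):
    """Shuffle the members of each block separately, then deal out the treatments."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, members in blocks.items():
        shuffled = list(members)
        rng.shuffle(shuffled)
        for i, subject in enumerate(shuffled):
            assignment[subject] = (block_name, treatments[i % len(treatments)])
    return assignment

# Hypothetical blocks: two fields with six plots each.
blocks = {
    "field_1": [f"field_1_plot_{i}" for i in range(6)],
    "field_2": [f"field_2_plot_{i}" for i in range(6)],
}

assignment = randomize_within_blocks(blocks, ["A", "B", "C"], seed=1)
for subject, (block, treatment) in sorted(assignment.items()):
    print(subject, block, treatment)
```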

[Diagram: Subjects are divided into Block 1, Block 2, …, Block n. Within each block, subjects are randomly assigned to Treatment 1, Treatment 2, …, Treatment n, and the results are observed.]

Common Examples: Agricultural experiments where the blocks are different fields; blocks may also be gender or grade level (1st, 2nd, 3rd, etc.).

Note: if there are two subjects in each block, this is a matched pairs design!

Example #12: Block Design

A researcher is carrying out a study of the effectiveness of two different skin creams in combination with two different retinol treatments (80% retinol and 30% retinol) for the treatment of a certain skin disease. He has 240 women and plans to divide them into 2 groups of 120 subjects each, according to how severe their skin condition is; the 120 most severe cases are the first group, and the remaining moderate cases will be the second group. The members of each group are then randomly assigned to one of the treatment combinations. The women were unaware which skin cream they were using.

• Draw a diagram to represent this experiment.

(Identify the subjects, the response variable, the treatments (including the # of treatments), where randomization occurs, and any factors that have already been controlled for)

• Was “Blinding” used in this experiment? Single or Double or neither?

• What other factors should be controlled for that might affect the response variable?

• Is there a placebo? Why do you think they did not have a control group in this experiment?


3. Matched Pairs Design: This is a reduced block design in which each block contains only two subjects. The researcher must choose pairs of subjects that are as closely matched as possible in characteristics that may have an effect on the outcome. The members of each pair are then randomly assigned to the treatment groups.

[Diagram: Subjects are paired into Pair 1, Pair 2, …, Pair n. Within each pair, subjects are randomly assigned to Treat 1 or Treat 2, and the results are observed.]

Common Examples: Twin studies, before and after studies, taste tests in which one subject tries two brands (e.g. Coke vs. Pepsi). A person can serve as their own pair.

Example #13: Matched Pairs

Researchers at the University of California, Santa Barbara wished to determine whether “music cognition and cognitions pertaining to abstract operations such as mathematical reasoning” were related. To test this, 12 college students listened to Mozart’s sonata for two pianos in D major for 10 minutes and then were given a mathematical reasoning test. The same students were also given a test measuring the same skills after sitting in a room for 10 minutes in complete silence. The order was randomized. The mean score on the test following the Mozart piece was 119 and the mean score following silence was 110. The researchers concluded that subjects performed better on abstract/spatial reasoning tests after listening to Mozart.

• Draw a diagram of the experimental design:

(Identify the subjects, the response variable, the treatments, and where randomization occurs)

• Describe how you could improve this study.

• Was blinding used in this study?

Elementary

Statistics

with

TI-‐83/84+

Calculator,

3e

Chapter

1:

Introduction

to

Statistics

1-1: Review and Preview

Definitions:

• Data: observations (such as measurements, genders, survey responses) that have been collected. •

Statistics (the subject): a collection of methods for planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on data.

•

Population: any complete collection of individuals (people, animals, plants or things) from which we may collect data.

•

Census: a collection of data from every member in the population.

•

Sample: a subcollection of members from the population.

Example #1: Evaluate each situation, and identify whether it is a population or a sample. 1. American citizens

_______________________

2. Marketing research for a new deodorant

_______________________

3. All registered voters

_______________________

4. People of North America

_______________________

5. A Gallup poll

_______________________

6. The world's births

_______________________

7. A survey of smoking habits of hypertensive patients

_______________________

Important features about a sample:

• Sample data must be collected in an appropriate way. This means that the sample must be obtained through random selection.

• If sample data are not collected in an appropriate way (randomly), the data will not allow us to draw conclusions for the population.

Example #2:

A poll asked 1,000 American adults: “Do you agree that ‘Global Warming’ is a phenomenon that is occurring with foreseeable consequences?”

What is the sample? _______________________________________________ What is the population of interest? ___________________________________________________

1

1-2: Statistical Thinking

During statistical analysis, we must consider the following factors: • Context of the data: description of what the values represent • Source of the data: where the data came from

• Sampling method: how the individuals from which the data was collected were chosen • Conclusions: statements about the statistical analysis that should be clear to those without an understanding of statistics

• Practical implications: any suggestions, implied consequences, broader reaching implications in the real word

Practical significance: what the results mean in the real world. Statistical significance: the difference in outcome (such as for different treatments) is so large it cannot be due to chance. 1-3: Types of Data

Definitions:

• Individuals are the objects described by a set of data. Individuals may be people (subjects), but they may also be animals or things (experimental units).

Examples:_____________________________________________________________ A variable is any characteristic of an individual that we plan to collect data on. A variable usually takes different values for different individuals.

•

Examples: _____________________________________________________________ •

Parameter: numerical measurement describing some variable of a population.

•

Statistic: numerical measurement describing some variable of a sample.

Example #3: Statistic versus Parameter

Before the San Diego Mayoral Election in 2005, 2,500 San Diegans were asked who they planned to vote for and 41.5% of that group said they were going to vote for Donna Frye. In the actual election, Donna Frye actually received 46% of the popular vote.

The value 41.5% is a ___________________ because it describes a numerical measurement from the _________________. The value 46% is a _____________________ because it describes a numerical measurement from the ________________.

Quantitative vs. Qualitative Data:

• Quantitative (numerical) variable: variable with data that consists of numbers representing counts or measurements for which arithmetic operations such as adding and averaging make sense. These variables are usually recorded in a unit of measurement such as kilograms, pounds, inches, centimeters, dollars, etc. Examples: _____________________________________________________________________ •

Categorical (qualitative) variable: variable with data that takes on category names or labels. The individuals fall into one of several groups or categories. *Data that take on numbers can still be CATEGORICAL if the numbers don’t count or measure anything. Examples: _____________________________________________________________________

2

Triola,

Elementary

Statistics

with

TI-‐83/84+

Calculator,

3e

Chapter

1:

Introduction

to

Statistics

Quantitative variables can be further broken down into:

• Discrete variables: have data whose possible values are either a finite or a “countable” number. •

Continuous variables: have data whose values are infinite in possibility that correspond to some scale that covers a range of values without gaps, interruptions or jumps.

Example #4:

The number of eggs a hen lays is a __________________________ variable since we can count the number of eggs. The amount of milk a cow produces is a _______________________ variable since the amount is not restricted to a discrete number of gallons, for example, and could take on amounts like 2.51167983 gallons. In addition to be classified as quantitative (then either discrete or continuous) or qualitative, data can also be classified into one of four levels of measurement.

Levels of Measurement:

• Nominal: data that consist of names, labels, or categories only. The data cannot be arranged in any order from low to high (example: eye color).

• Ordinal: data that can be arranged in order, but differences between data values either cannot be determined or are meaningless (example: grades in a course: A, B, C, D, F).

• Interval: data that can be arranged in order and the difference between any two data values is meaningful, but there is no natural zero starting point (example: year).

• Ratio: data that can be arranged in order, the difference between any two data values is meaningful, AND there is a natural zero starting point (example: price).

Example #5: Determine the level of measurement for the following variables:

Course Grades (A, B, C, D, E, F): _____________________

Temperature in ˚F: ____________________________

Price of college textbooks: __________________________

Political Party: __________________________

1-4: Critical Thinking

We must think critically about the context of the data, the source of the data, the sampling method, the conclusions, and the practical implications. If there is a misunderstanding of the context or source of the data, or a flaw in the sampling method, the conclusions drawn from the statistical analysis may be flawed, and thus the practical implications don’t apply. Statistics are often misused, and data are often distorted for the purpose of deception! Here are a few sources of bias or error in statistical analysis:

1. Bad Graphs/Misuse of Graphs

2. Bad Samples

3. Correlation being used as Causality

4. Self-Reported Data

5. Small Samples

We will address each of these individually.

1. Bad Graphs/Misuse of Graphs

What is wrong with the graph below?

2. Bad Samples

Some samples are bad in the sense that the method used to collect the data was somehow biased. One example of a biased sample is a voluntary response sample.

Voluntary Response Sample: a sample in which the respondents themselves decide whether or not to be included. Examples would be mail-in, phone-in, and Internet polls.

Why does this result in possible bias?

3. Correlation being used as Causality

This error occurs when an association is found between two variables and it is concluded that one variable causes the other. However, just because two variables are correlated does not imply causality. There is a very strong correlation between shoe size and reading score: as shoe size increases, reading score increases. Does that mean that having a bigger shoe size causes one to be a better reader? Do big feet cause better reading???

4. Self-Reported Data

Moral of the story: people lie. If you collect data by allowing people to self-report it, the results are likely to be distorted. For example, let’s say we are doing a study about weight. If people are allowed to simply tell the researcher what they weigh, rather than being weighed by the researcher, are the calculations and conclusions about weight going to be accurate?

5. Small Samples

Conclusions based on samples that are far too small, relative to the population of interest, will not be valid. For example, if I am interested in the average number of hours students at SWC study per week and I take a sample of five people, will the conclusion be valid?
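One way to see why small samples are risky is to simulate it. The sketch below builds a made-up population of weekly study hours (the values, the population size, and the seed are all assumptions for illustration, not SWC data), draws many samples of size 5 and of size 500, and compares how much the sample averages bounce around.

```python
import random
import statistics

rng = random.Random(7)  # arbitrary seed so the run is repeatable

# Hypothetical population: weekly study hours for 20,000 students (invented values).
population = [rng.uniform(0, 30) for _ in range(20000)]

def spread_of_sample_means(n, reps=200):
    """Draw `reps` random samples of size n and return the standard
    deviation of their means (how much the sample average varies)."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(reps)]
    return statistics.stdev(means)

spread_small = spread_of_sample_means(5)
spread_large = spread_of_sample_means(500)
print("spread of averages, n=5:  ", round(spread_small, 2))
print("spread of averages, n=500:", round(spread_large, 2))
```

In this sketch the averages from samples of five vary far more from sample to sample than the averages from samples of 500, so any single five-person sample could land well away from the true population average.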

Here are some biases that can happen in the collection and reporting process:

1. Loaded Questions

2. Order of Questions

3. Nonresponse

4. Missing Data

5. Self-Interest Study

6. Precise Numbers

7. Deliberate Distortion

Again, we will address each of these individually:

1. Loaded Questions: when the questions on a survey are worded in such a way that they can be misleading, confusing or worded to elicit a desired response. This leads to response bias. Example: Don’t you think it is wrong that there is a proposal to put in a new casino in Jamul that would cause a massive traffic jam in the area and would have a negative impact on the natural wildlife in the area?

2. Order of Questions: Oftentimes, people choose whatever option they hear first.

3. Nonresponse: bias that occurs when someone who was intended to be in the study/survey either refuses to respond or is unavailable to respond.

4. Missing Data: when an individual agrees to take part in the study, but does not answer all questions resulting in missing data for some of the variables.

For example, people with low incomes may not respond to the question “What is your annual income?” but may answer all the other questions in the survey. Thus on the variable “Income” the calculations are invalid because low-income individuals are not represented.

5. Self-Interest Study is a study that may be sponsored by a company or other entity that has an interest in the outcome. For example, if a pharmaceutical company is the one sponsoring a study about one of their new drugs, why might that be a problem?

6. Precise Numbers: Just because the numbers reported are precise does not mean that they are accurate. Consider the claim “There are now 105,215,027 households in the U.S.” Because the number 105,215,027 is so precise, people erroneously assume “hey, that must be accurate!”

7. Deliberate Distortions

Sometimes reports of studies are just LIES. This is why companies often get sued for false claims and advertisements.

1-5: Collecting Sample Data

*The sample data must be collected in an appropriate way so that we can draw accurate conclusions about our data and infer them for our population!

PART 1: Basics of Collecting Data

An observational study observes individuals and measures variables of interest but does not attempt to influence or modify the individuals being studied.

Example: A group of 40 women aged 65 and older was tracked for several years. Researchers looked at the calcium intake of the women and the occurrence of osteoporosis.

An experiment imposes some treatment on individuals and then observes its effects on them.

Example: To research the effects of “dietary patterns” on blood pressure in 450 subjects, subjects were randomly assigned to three groups and had their meals prepared by dieticians. The blood pressure of the subjects was measured at the start and end of the study and compared.

Random Samples:

• Random sample: members from the population are selected in such a way that each individual member has an equal chance of being selected in the sample.

• Simple random sample: the sample of n subjects is selected in such a way that every possible sample of size n has the same chance of being chosen.

• Probability sample: selecting members of a population to be in the sample in such a way that each member has a known (but not necessarily the same) chance of being selected.
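As a rough illustration of a simple random sample, the sketch below uses Python's standard `random.sample`, which draws n distinct members so that every possible group of size n is equally likely. The population of student ID numbers, the sample size, and the seed are hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return n distinct members of the population, chosen so that
    every possible sample of size n has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population: student ID numbers 1 through 500.
students = list(range(1, 501))
srs = simple_random_sample(students, 10, seed=1)
print(srs)
```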

Other Types of Sampling:

• Stratified Sampling: we subdivide the population into non-overlapping groups (or strata), e.g. geographical areas, age groups, genders that have similar characteristics. A random sample is taken from each stratum and then these smaller samples are combined to form the entire sample.

Example #6: Suppose you were studying the cost of living in San Diego and wanted to sample 1000 homes. Just a Simple Random Sample of homes in San Diego may render, by chance, homes only in East County, which is not representative of all of San Diego County. How could you divide San Diego into geographic regions (strata)?

From each of these strata, we randomly sample the same number of homes, and then combine them together for a full sample representing San Diego.
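The stratified procedure above can be sketched in Python. The four region names, their sizes, and the home IDs are placeholders invented for illustration; they are not a proposed division of San Diego.

```python
import random

def stratified_sample(strata, per_stratum, seed=None):
    """Take a random sample of the same size from every stratum
    and combine the pieces into one overall sample."""
    rng = random.Random(seed)
    combined = []
    for members in strata.values():
        combined.extend(rng.sample(members, per_stratum))
    return combined

# Hypothetical strata: four regions with made-up home IDs.
regions = {
    "Region A": [f"A-{i}" for i in range(4000)],
    "Region B": [f"B-{i}" for i in range(2500)],
    "Region C": [f"C-{i}" for i in range(3000)],
    "Region D": [f"D-{i}" for i in range(1500)],
}

homes = stratified_sample(regions, per_stratum=250, seed=2)
print(len(homes))  # 4 regions x 250 homes each
```

Because every region contributes homes, the combined sample cannot end up concentrated in a single area the way a simple random sample might by chance.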

• Cluster Sampling: the population is divided into (heterogeneous) groups called clusters. We then randomly select clusters and measure all of the individuals within the clusters that have been selected.

Example #7: To measure customer satisfaction, airlines often randomly sample a set of flights (serving as clusters), let’s say 10 from a possible 200 flights, and they distribute a survey to every person on the flights selected.
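A minimal sketch of the airline example follows. The flight labels and the three-passenger lists are invented so the code stays small; only the 10-of-200 selection comes from the example.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters, then keep EVERY member of each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical data: 200 flights, each with a made-up list of 3 passengers.
flights = {
    f"flight-{i}": [f"passenger-{i}-{j}" for j in range(3)]
    for i in range(1, 201)
}

surveyed = cluster_sample(flights, 10, seed=3)
passengers = [p for plist in surveyed.values() for p in plist]
print(len(surveyed), "flights,", len(passengers), "passengers surveyed")
```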

**Stratified sampling and cluster sampling are often confused. Cluster sampling uses all members from a sample of clusters, whereas stratified sampling uses a sample of members from each stratum!

• Systematic Sampling: This is random sampling with a system! A starting point is chosen at random, and thereafter every kth element is included in the sample (individuals are chosen at regular intervals). This method is often used in industry for quality control. An item is selected for testing from a production line (say, every fifteen minutes) to ensure that machines and equipment are working to specification.

It is first necessary to know the whole population size from which the sample is being selected. The appropriate sampling interval, I, is then calculated by dividing population size, N, by required sample size, n. (Note: if I is not a whole number, then it is rounded to the nearest whole number).

Example #8: Suppose you want to sample 8 houses from a street of 120 houses. 120/8=15, so every 15th house is chosen after a random starting point between 1 and 15. If the random starting point is 14, then what houses would be selected?
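The interval calculation and selection rule can be sketched as a small function. The function name and parameters are hypothetical, and the usage line deliberately uses a starting point of 3 rather than 14 so the exercise above is left to the reader.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Pick every I-th unit after a random start, where I = N / n
    rounded to a whole number. Units are numbered 1..N."""
    # Note: Python's round() sends exact .5 ties to the even neighbor.
    interval = round(population_size / sample_size)
    if start is None:
        start = random.Random(seed).randint(1, interval)
    picks = [start + k * interval for k in range(sample_size)]
    return [p for p in picks if p <= population_size]

houses = systematic_sample(120, 8, start=3)
print(houses)  # [3, 18, 33, 48, 63, 78, 93, 108]
```

Calling `systematic_sample(120, 8, start=14)` would give the houses asked for in Example #8.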

• Multistage Sampling: a large area, such as a country, is first divided into smaller regions (such as states), and a random sample of these regions is collected. In the second stage, a random sample of smaller areas (such as counties) is taken from within each of the regions chosen in the first stage. Then, in the third stage, a random sample of even smaller areas (such as neighborhoods) is taken from within each of the areas chosen in the second stage. If these areas are sufficiently small for the purposes of the study, then the researcher might stop at the third stage. If not, he or she may continue to sample from the areas chosen in the third stage, etc., until appropriately small areas have been chosen.

Note: a different method of sampling (one of the other sampling methods above) could be used at each level of the multistage sampling process!

Example #9: Southwestern College is interested in whether students prefer classes with some online component or no online component. To find out, the school first randomly selects 3 schools within the college campus; then from each school two departments are chosen; then from within each department five classes are chosen; lastly, from each class 10 students are chosen. How many students will end up in the sample?
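The nested structure of multistage sampling can be sketched with a recursive function. The states, counties, residents, and stage sizes below are all invented for illustration and use simple random sampling at every stage, which is only one of the options the note above allows.

```python
import random

def multistage_sample(units, sizes, rng):
    """units: nested dicts (regions within regions) ending in lists of individuals.
    sizes: how many to draw at each stage, outermost stage first."""
    k, remaining = sizes[0], sizes[1:]
    if isinstance(units, dict):
        chosen = rng.sample(sorted(units), k)  # pick k sub-regions at this stage
        result = []
        for name in chosen:
            result.extend(multistage_sample(units[name], remaining, rng))
        return result
    return rng.sample(units, k)  # final stage: pick k individuals

# Hypothetical population: 4 states, 5 counties per state, 20 residents per county.
population = {
    f"state-{s}": {
        f"county-{s}-{c}": [f"resident-{s}-{c}-{r}" for r in range(20)]
        for c in range(5)
    }
    for s in range(4)
}

chosen_people = multistage_sample(population, [2, 3, 4], random.Random(4))
print(len(chosen_people))  # 2 states x 3 counties x 4 residents
```

When the same number of units is drawn within every selected group, the final sample size is the product of the stage sizes.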

Biased Sampling Methods:

• Voluntary Response Sampling: Individuals are asked to provide information, and all who respond are counted.

• Convenience Sampling: Selects individuals that are easiest to reach.

Other Biases in Sampling:

1. Undercoverage: When some groups in the population are left out of the process of choosing the sample.

2. Nonresponse Bias: When an individual chosen for the sample can’t be contacted or refuses to cooperate.

3. Response Bias: When the behavior of the interviewer, wording of the questions, order of the questions, or incentives given influences the outcome of the survey or questionnaire.

PART 2: BEYOND THE BASICS OF COLLECTING DATA

Types of Observational Studies:

1) Retrospective (case-control) study is a study that looks backwards in time. A typical retrospective design is used when studying a disease that takes a long time to appear. The biggest problem in a retrospective study is that some of the information that we need may be hard to get.

2) Prospective (cohort) study is a study that looks forward in time. The study usually involves taking a group of subjects (cohort) and following them over a long period of time. The outcome of interest should be common and retaining participants in the study is important to avoid bias. These studies usually have less chance of bias than retrospective studies.

3) Cross-sectional study is a study where data are observed, measured, and collected at one point in time.

Parts of an Experiment:

Individuals: the people or things being studied in an experiment. When the individuals are people, the individuals are called subjects.

Factors: the explanatory variables in an experiment. These are what the researchers are modifying/controlling.

Response Variable(s): the measured outcome(s) of interest.

Levels: the specific values of the factor.

Treatment: the experimental condition applied to the individuals in the experiment. If an experiment has several factors, a treatment is a combination of different levels of each factor.
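To illustrate how treatments arise as combinations of factor levels, the sketch below lists every combination for two factors. The factors and levels (a made-up plant-growth experiment) are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical factors and their levels.
factors = {
    "fertilizer": ["none", "organic", "synthetic"],
    "watering": ["daily", "weekly"],
}

# Each treatment is one level of every factor.
treatments = list(product(*factors.values()))
for t in treatments:
    print(dict(zip(factors, t)))
print(len(treatments), "treatments")  # 3 levels x 2 levels
```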

Example #10: To research the effects of “dietary patterns” on blood pressure in 450 subjects, subjects were randomly assigned to one of three diets (low fat, low carb or usual diet) and to one of two physical activity levels (low or high physical activity). The difference in blood pressure from the start to the finish was the outcome.

Individuals: _____________________________________________

Factors: ________________________________________________

Response Variable: _____________________________________

Levels of Each Factor: ___________________________________________________________

Treatments: ______________________________________________

In order to study the effect of one or more variables on another variable (or variables), it is important to control all other possible influences outside of the treatments being imposed. For this reason, we wish to conduct randomized comparative experiments.

An experiment that uses both comparison of two or more treatments and chance assignment of subjects to treatments is a randomized comparative experiment.
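Chance assignment can be sketched as shuffling the subjects and then dealing them out to the groups in turn. The subject labels, group names, group sizes, and seed below are hypothetical.

```python
import random

def randomize(subjects, treatments, seed=None):
    """Shuffle the subjects, then deal them out to the treatment
    groups one at a time so group sizes stay as equal as possible."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

# Hypothetical experiment: 30 subjects, two treatments and a control group.
subjects = [f"subject-{i}" for i in range(1, 31)]
groups = randomize(subjects, ["treatment A", "treatment B", "control"], seed=5)
for name, members in groups.items():
    print(name, len(members))
```

Because the shuffle, not the researcher, decides who lands in which group, personal bias has no way into the selection.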

1. Randomization: Assign subjects to treatments by chance to ensure that we do not impose personal bias in the selection process, and use enough subjects in each group to reduce chance variation in the results.

2. Control Groups: Using control groups helps to ensure that we account for known factors that could affect a study’s results. Researchers, however, may be unaware of important factors and not account for them in the experiment.

3. Replication: Repetition or duplication of an experiment so that the results can be confirmed or verified, to ensure that the results we got the first time are not just by chance.

When we forget to control for or consider other variables’ effects on the response, we may end up with incorrect conclusions about the cause-and-effect relationship between the treatment and the response.

Confounding occurs in an experiment when you are not able to distinguish among the effects of different factors.

Confounding variables: additional explanatory variables that affect the response but are not considered when exploring the explanatory/response relationship.

[Diagram not recoverable from the source: variables labeled X, R, and T, each shown with Y.]

Example: A professor wishes to test the effect of his new attendance policy (“your grade will be lowered by one point for every class missed”) and plans to compare the average number of missed classes this semester (with the new policy) with last semester (no policy). However, this semester had “mild” weather, whereas last semester had many days with blizzard conditions. What is the confounding variable?

More Experiment Terminology:

• Placebos: A placebo is a fake treatment, made from an inactive substance like sugar, distilled water, or saline solution. It should have no effect on the outcome measured on the subject.

• Blinding: In an experiment, it is desirable to keep the information about the treatments hidden from the patients and anyone involved with evaluating the patients. This is known as blinding. It is “double blind” if both patients and evaluators are unaware, “single blind” if only the patient is unaware. Blinding prevents conscious or subconscious biases or expectations from influencing the outcome of the study. It minimizes the placebo effect (when subjects receiving the placebo still report improvement in symptoms).

• Statistically Significant: When comparing the treatment groups, if the difference between the groups is so large that it would rarely occur by chance, the difference is called statistically significant.

Types of Experimental Designs:

1. Completely Randomized: randomly assigning subjects into treatment groups and measuring an outcome.

Example #11: A Completely Randomized Design

In the early 1980’s, an experiment was conducted to test whether AZT affects the survival time of AIDS patients. 1200 AIDS patients between the ages of 35 and 55 with similar severity of disease were selected. 600 were randomly selected to receive the drug AZT. There were three groups among the 600 (10 mg, 20 mg, 30 mg of AZT). The other half received an inert substance made of sugar. Researchers and patients were unaware which pill was the AZT and which pill was the placebo. Both pills were white and had the same shape, smell, and taste.

• Draw a diagram to represent this experiment. (Identify the subjects, the response variable, the factor levels, the treatments, where randomization occurs, and any factors that have already been controlled for)

• Was “Blinding” used in this experiment?

• What other factors should be controlled for that might affect the response variable?

• What is the placebo in this study?

Suppose for this particular experiment, 347 of the 600 individuals in the placebo group died in a 5-year period, while 110 of the 600 individuals in the treatment groups died in a 5-year period. Is this just a fluke or does the treatment really help extend survival time (statistically significant)?
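One informal way to explore this question is a simulation. If the treatment made no difference, the 457 total deaths would be split between the two groups of 600 purely by chance. The sketch below reshuffles the outcomes many times and counts how often chance alone produces a gap at least as large as the observed one. The number of repetitions and the seed are arbitrary choices, and this is an illustration rather than a formal test.

```python
import random

placebo_deaths, treated_deaths, group_size = 347, 110, 600
observed_gap = placebo_deaths - treated_deaths

rng = random.Random(6)  # arbitrary seed so the run is repeatable

# One entry per patient: 1 = died within 5 years, 0 = survived.
total_deaths = placebo_deaths + treated_deaths
outcomes = [1] * total_deaths + [0] * (2 * group_size - total_deaths)

reps = 1000
as_extreme = 0
for _ in range(reps):
    rng.shuffle(outcomes)                        # deal outcomes out at random
    fake_placebo = sum(outcomes[:group_size])    # deaths in first 600
    fake_treated = sum(outcomes[group_size:])    # deaths in remaining 600
    if abs(fake_placebo - fake_treated) >= observed_gap:
        as_extreme += 1

print("observed gap in deaths:", observed_gap)
print("shuffles with a gap at least that large:", as_extreme, "of", reps)
```

If such gaps almost never appear in the shuffles, the observed difference is the kind that "would rarely occur by chance."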

2. Block Design: When groups of subjects or experimental units are similar, it’s often helpful to “block them off” and isolate the variability between groups so we can see the effect of the treatments more clearly. This design allows for comparisons of more than 2 treatments.

[Diagram: Subjects are split into Block 1, Block 2, ..., Block n. Within each block, subjects are randomly assigned to Treatment 1, Treatment 2, ..., Treatment n, and the results are then observed.]

Common Examples: Agricultural experiments where the blocks are different fields; blocks may also be gender or grade level (1st, 2nd, 3rd, etc.).

Note: if there are two subjects in each block, this is a matched pairs design!

Example #12: Block Design

A researcher is carrying out a study of the effectiveness of two different skin creams in combination with two different retinol treatments (80% retinol and 30% retinol) for the treatment of a certain skin disease. He has 240 women and plans to divide them into 2 groups of 120 subjects each, according to how severe their skin condition is; the 120 most severe cases are the first group, the remaining moderate cases will be the second group. The members of each group are then randomly assigned to one of the treatment combinations. The women were unaware which skin cream they were using.

• Draw a diagram to represent this experiment. (Identify the subjects, the response variable, the treatments (including the # of treatments), where randomization occurs, and any factors that have already been controlled for)

• Was “Blinding” used in this experiment? Single or Double or neither?

• What other factors should be controlled for that might affect the response variable?

• Is there a placebo? Why do you think they did not have a control group in this experiment?
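The key mechanical step in a block design, randomizing separately inside each block, can be sketched as follows. The block names, subject labels, treatment labels, and group sizes are hypothetical and much smaller than in Example #12.

```python
import random

def randomize_within_blocks(blocks, treatments, seed=None):
    """Shuffle the subjects of each block separately, then deal them out
    to the treatments so every block receives every treatment."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, members in blocks.items():
        shuffled = list(members)
        rng.shuffle(shuffled)
        for i, subject in enumerate(shuffled):
            assignment[subject] = (block_name, treatments[i % len(treatments)])
    return assignment

# Hypothetical data: two blocks of 8 subjects and four treatment labels.
blocks = {
    "severe": [f"s{i}" for i in range(8)],
    "moderate": [f"m{i}" for i in range(8)],
}
treatments = ["T1", "T2", "T3", "T4"]

assignment = randomize_within_blocks(blocks, treatments, seed=8)
for subject, (block, treatment) in sorted(assignment.items()):
    print(subject, block, treatment)
```

Since each treatment appears equally often in both blocks, a difference between treatments cannot be explained by one treatment having received mostly severe cases.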

3. Matched Pairs Design: This is a reduced block design in which each block contains only two subjects. The researcher must choose pairs of subjects that are as closely matched as possible in characteristics that may have an effect on the outcome. The members of each pair are then randomly assigned to the treatment groups.

[Diagram: Subjects are paired into Pair 1, Pair 2, ..., Pair n. Within each pair, one member is randomly assigned to Treatment 1 and the other to Treatment 2, and the results are then observed.]

Common Examples: Twin studies, before and after studies, taste tests in which one subject tries two brands (i.e. Coke vs. Pepsi). A person can serve as their own pair.

Example #13: Matched Pairs

Researchers at the University of California, Santa Barbara wished to determine whether “music cognition and cognitions pertaining to abstract operations such as mathematical reasoning” were related. To test this, 12 college students listened to Mozart’s sonata for two pianos in D major for 10 minutes and then were given a mathematical reasoning test. The same students were also given a test measuring the same skills after sitting in a room for 10 minutes in complete silence. The order was randomized. The mean score on the test following the Mozart piece was 119 and the mean score following silence was 110. The researchers concluded that subjects performed better on abstract/spatial reasoning tests after listening to Mozart.

• Draw a diagram of the experimental design: (Identify the subjects, the response variable, the treatments, and where randomization occurs)

• Describe how you could improve this study.

• Was blinding used in this study?