1. Measures of central tendency and dispersion

The mean of a group of numbers is their average value. The median is the number in the middle when you arrange the numbers in ascending order. It is probably immediate to you that the order doesn’t have to be ascending; descending order gives the same middle value. A tricky question is what the median is when the count of numbers is even. In that case, you just take the average of the two numbers in the middle. For example, suppose the numbers given are a, b, c and d, where a < b < c < d; the median is then (b + c)/2. [...] Pr{X-50>2*10}

Using the Chebyshev inequality, we note that θ is 50, the standard deviation is 10, and hence 2 is what we called c in the formula. So, Pr{X-50>2*10} ≤ 1/2² = 0.25.

Notice that the Chebyshev inequality provides a bound for any distribution with a finite mean and variance. It says that if X is picked from a population with mean θ and standard deviation s, the probability that any one realization x is k·s or farther from θ is at most 1/k². I am using k deliberately here, because many of you seemed to have struggled with the idea of a variable c in the original formula during our class discussion.
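To see how loose or tight the bound is in practice, here is a minimal sketch that compares 1/k² with the observed tail fraction of a simulated dataset. The exponential distribution, the seed, and the function names are illustrative assumptions, not part of the original notes.

```python
import random
import statistics


def chebyshev_bound(k):
    """Chebyshev upper bound on Pr{|X - theta| >= k*s}."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(1.0, 1.0 / k**2)


def empirical_tail(samples, k):
    """Fraction of samples lying k standard deviations or more from their mean."""
    theta = statistics.fmean(samples)
    s = statistics.pstdev(samples)  # population standard deviation of the data
    far = sum(1 for x in samples if abs(x - theta) >= k * s)
    return far / len(samples)


# Assumption: a skewed (exponential) population, chosen only for illustration.
rng = random.Random(0)
samples = [rng.expovariate(1.0) for _ in range(10_000)]

for k in (1.5, 2, 3):
    print(k, chebyshev_bound(k), empirical_tail(samples, k))
```

The observed fraction never exceeds the bound, but it is usually far below it, which is why a distribution-specific result can say much more.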

The central limit theorem actually takes us a little farther. According to the central limit theorem,

$$\Pr\left\{ \frac{\bar{X} - \theta}{s/\sqrt{n}} \le c \right\} \approx \Pr\{Z \le c\} = \Phi(c)$$

where X̄ is the mean of a sample of size n, Z is the standard normal variable and Φ is the standard normal distribution function. Using c=2 as before, we get that Φ(2)=0.977 (you can check this value in Uma Sekaran’s book, page 433 in the fourth edition. Such normal distribution tables are generally given at the end of most intermediate-level statistics books). This implies that the probability that the sample mean is more than two of its standard deviations (s/√n) away from the true mean, in either direction, is about 2 × (1 − 0.977) = 0.046.
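The table lookup can also be done in code. The sketch below uses Python's standard-library `statistics.NormalDist` to evaluate Φ(2) and the two-sided tail probability, and prints the Chebyshev bound for the same c for comparison. The helper name `two_sided_tail` is an illustrative assumption.

```python
from statistics import NormalDist


def two_sided_tail(c):
    """Pr{|Z| > c} for a standard normal Z, i.e. 2 * (1 - Phi(c))."""
    phi_c = NormalDist().cdf(c)  # standard normal: mean 0, standard deviation 1
    return 2.0 * (1.0 - phi_c)


c = 2
print(round(NormalDist().cdf(c), 3))  # Phi(2), the table value
print(round(two_sided_tail(c), 4))    # probability of being more than 2 away, either side
print(1 / c**2)                       # Chebyshev bound for the same c
```

The normal approximation gives a tail probability several times smaller than the Chebyshev bound of 0.25, at the price of assuming the sample is large enough for the central limit theorem to apply.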

2. Correlation between two datasets.

Let {xi}i and {yi}i be two datasets, each having n elements. Pearson’s correlation coefficient, r, is given by the following formula:

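$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where x̄ and ȳ are the means of the two datasets.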

Students must not confuse correlation with causation. Correlation means that the two datasets tend to move together. If small x are paired with small y, and large x are paired with large y, the data are positively correlated. However, that does not mean that a small x causes y to be small. The correlation coefficient ranges from -1 to 1. A coefficient of 1 (or -1) means the data lie exactly on an increasing (or decreasing) straight line. A coefficient of 0 means there is no linear relationship between the two datasets; a nonlinear relationship may still exist.
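A minimal sketch of Pearson's coefficient in plain Python follows, assuming the usual definition (sum of products of deviations divided by the square root of the product of the sums of squared deviations). The function name and the two small datasets are made up for illustration: one lies exactly on a line, the other follows y = x² and is perfectly dependent on x yet has a coefficient of zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length datasets."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two datasets of the same length, n >= 2")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a dataset is constant")
    return sxy / math.sqrt(sxx * syy)


xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]  # exact straight line
squared = [x * x for x in xs]     # exact but nonlinear relationship

print(pearson_r(xs, linear))
print(pearson_r(xs, squared))
```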

Suppose you sampled two datasets from their respective populations and got a correlation coefficient of 0.05. Can you claim that the two datasets are correlated? To test whether this correlation coefficient is significant (different from zero), note the following result. If the sample size is n, then the statistic

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

follows Student’s t-distribution with n-2 degrees of freedom under the assumption that the true correlation is zero. What does that mean? It means that if |t| > 1.96 (the large-sample critical value; for small n, look up the critical value in a t table), you can reject the hypothesis of zero correlation at the 5% significance level. To understand more about such statements, we now move to hypothesis testing.
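As a worked illustration, the sketch below evaluates the test statistic for r = 0.05, assuming the standard form t = r·√(n−2)/√(1−r²). The sample sizes 100 and 2000 and the function name are hypothetical choices, not values from the notes.

```python
import math


def correlation_t(r, n):
    """t statistic for testing zero correlation, given sample correlation r and size n."""
    if n <= 2:
        raise ValueError("need n > 2")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


CRITICAL = 1.96  # large-sample two-sided 5% critical value

for n in (100, 2000):  # hypothetical sample sizes
    t = correlation_t(0.05, n)
    print(n, round(t, 3), abs(t) > CRITICAL)
```

With 100 observations a correlation of 0.05 is nowhere near significant; with 2000 it is. The same coefficient can therefore lead to different conclusions depending on the sample size.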

3. Hypothesis Testing

This is the crux of any research. Often, a testable hypothesis is formulated by the researcher. Suppose the math grades of the students of a particular high school look like the following:

[Figure: bar chart of the number of students at each math grade]

The grade is given on a scale of 1-9. The vertical axis gives the number of students who scored the corresponding grade. Out of 108 students in total, 3 got 9. This implies that the probability of a student scoring 9 out of 9 was 3/108, or about 2.8%.
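The arithmetic behind that probability is simply a relative frequency. A small sketch using only the two counts stated in the text (3 students out of 108); the function name is an illustrative assumption:

```python
from fractions import Fraction


def empirical_probability(count, total):
    """Relative frequency of an outcome: count occurrences out of total observations."""
    if total <= 0 or not 0 <= count <= total:
        raise ValueError("need 0 <= count <= total and total > 0")
    return Fraction(count, total)


p_nine = empirical_probability(3, 108)  # students who scored 9, out of all students
print(p_nine, round(float(p_nine), 4))
```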

Now suppose a teacher develops a tonic and claims that it improves students’ grades. Hypothesis testing works something like this. Generally, we first form the null hypothesis that the tonic has no effect. Then a student is selected at random and given the tonic. If the student goes on to score 9, we reject the null hypothesis that the tonic has no effect, if...