Statistics&Probability: Confidence Interval

Sampling

When we draw a sample, we do not need matching. Matching means making the composition of the sample exactly proportional to that of the population.

For example, suppose that in a population of students on campus, 60% of the students are female and 40% are male. A matched sample would reproduce the population on this particular variable, i.e., 60% of the sampled students would be female and 40% male. However, in practice, many variables (such as age, birthplace, etc.) can potentially affect the outcome of interest, and matching on all of them at once is very hard.

Therefore, all we need is to sample randomly, giving every member of the population an equal chance of being selected.
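As a minimal sketch of this idea, we can draw a simple random sample from a made-up campus and check its composition. The population size of 1,000, the 60/40 split, and the names `population` and `share_female` are assumptions chosen only for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical campus: 1,000 students with an invented 60/40 split.
population = ["F"] * 600 + ["M"] * 400

# Simple random sample: every student has the same chance of being picked.
sample = rng.sample(population, 100)

# The sample is not matched, so its share is only close to 60%, not exactly 60%.
share_female = sample.count("F") / len(sample)
print(f"female share in sample: {share_female:.2f}")
```

The share of females in the sample is usually near the population share without any matching, and the same holds for variables we never thought to record.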

Sampling Distribution

Here, we create two populations using 200 LEGO pieces. Each population, population A and population B, has 100 pieces:

sampling_1

The variable of interest, x, is the number of points on a sampled LEGO piece. For population A, we counted the points on every piece (we are the “god” here, so we know the truth about population A) and found that the average of x is 4.35. For population B, the average of x is 5.84.

We randomly select samples from these two populations. We repeat the sampling many times; each time we randomly select 10 LEGO pieces and calculate the mean number of points. Because each repetition uses a different random sample, the sample mean changes from one repetition to the next (sampling variability). Collecting these sample means gives us the sampling distributions for the two populations:

sampling_2

In this case, we used the mean to build the sampling distribution, but we could also use the median, the standard deviation, or a quantile. An important factor that affects the sampling distribution is the sample size. In this case, our sample size is 10. As the sample size grows, the sampling distribution becomes narrower, so our estimate becomes more and more precise.

In practice, we normally do not repeat the sampling many times. We have only one sample from a population and use the mean of this sample to estimate the population mean. We therefore need a probability model to conduct the statistical inference.
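The repeated sampling described above can be sketched in a few lines. The actual composition of the LEGO populations is not given here, so `population_a` below is a hypothetical set of 100 pieces whose counts were invented so that the mean comes out to 4.35; the function name `sampling_distribution` is also ours.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population A: 100 pieces, counts invented so that the mean is 4.35.
population_a = [1] * 15 + [2] * 25 + [4] * 20 + [6] * 15 + [8] * 25

def sampling_distribution(population, sample_size, n_repeats, statistic=statistics.mean):
    """Draw n_repeats random samples and return the chosen statistic of each one."""
    return [statistic(rng.sample(population, sample_size)) for _ in range(n_repeats)]

means_10 = sampling_distribution(population_a, 10, 5000)
means_20 = sampling_distribution(population_a, 20, 5000)

print("population mean:", statistics.mean(population_a))
print("spread of sample means, n=10:", round(statistics.stdev(means_10), 3))
print("spread of sample means, n=20:", round(statistics.stdev(means_20), 3))
```

Passing `statistic=statistics.median` instead would give the sampling distribution of the median. The spread of the sample means shrinks when the sample size goes from 10 to 20.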

Confidence Interval

Our first statistical inference topic is the confidence interval. We use a confidence interval to extend a point estimate (such as the sample mean) to an interval estimate. By doing this, we increase our chance of covering the true value of the population.

How do we use a confidence interval? Assume for now that we actually know the sampling distribution (we are the “god”, and we know the truth), and we want to mark a region (interval) into which most of the sample estimates fall:

sampling_3

The length of the region in the picture above corresponds to a 95% chance. Half of that length is called the margin of error:

sampling_4

For example, if we repeat the sampling 100 times, we expect the sample mean to fall within this range about 95 times:

sampling_5

However, in practice we do not know the true sampling distribution (we are not god), so it is invisible to us:

sampling_6

Based on the sample data and certain assumptions, we can derive this “95%” width:

sampling_7

We can then apply this width (this confidence interval) to individual sample estimates:

sampling_8

Although in practice we do not know where the population truth is, an interval built this way from an individual sample estimate has a 95% chance of covering the invisible truth:

sampling_9
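We can check the coverage claim with a small simulation. This is a sketch under stated assumptions: `population_a` is a hypothetical set of 100 pieces invented to have mean 4.35, and the interval is a standard t interval for a mean (critical value 2.262 for 9 degrees of freedom), which is one common way to derive the width and may not be the method behind the pictures above.

```python
import math
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population A: counts invented so that the mean is 4.35.
population_a = [1] * 15 + [2] * 25 + [4] * 20 + [6] * 15 + [8] * 25
true_mean = statistics.mean(population_a)  # known only because we play "god"

T_CRIT_95_DF9 = 2.262  # t critical value for 95% confidence with 10 - 1 = 9 degrees of freedom

def t_interval(sample, t_crit):
    """Return (low, high): sample mean +/- t_crit * standard error."""
    center = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(len(sample))
    margin = t_crit * std_error
    return center - margin, center + margin

n_repeats = 2000
covered = 0
for _ in range(n_repeats):
    low, high = t_interval(rng.sample(population_a, 10), T_CRIT_95_DF9)
    if low <= true_mean <= high:
        covered += 1

coverage = covered / n_repeats
print(f"share of intervals covering the true mean: {coverage:.3f}")
```

Each interval is built from a single sample, without looking at the truth. The 95% describes how often intervals built this way cover the true mean over many repetitions.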

Margin of error

From the previous discussion, we know that a confidence interval is the estimate +/- the margin of error. We can also use other confidence levels, such as 90%.

We also know that the margin of error is half the width of the confidence interval. If the margin of error is too wide, the result is right with high probability, but it is not very useful. For example, we can predict with 99% confidence that tomorrow’s temperature will be between 50 degrees and 90 degrees. This is a highly confident estimate, but it is not very useful. Therefore, in practice, we want to reduce the margin of error.
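In symbols, one common form for a sample mean looks as follows. The notation is ours, and the formula assumes a sample large enough for the normal approximation to be reasonable:

```latex
\text{CI} = \bar{x} \pm \text{ME},
\qquad
\text{ME} = z^{*} \cdot \frac{s}{\sqrt{n}}
```

Here $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the sample size, and $z^{*}$ is the critical value for the chosen confidence level (about 1.96 for 95% and about 1.645 for 90%).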

There are several factors affecting margin of error:

  1. The overall variability in the population: The higher the variability in the population, the wider the margin of error tends to be.
  2. The confidence level: The higher the confidence we want in our results, the wider the margin of error tends to be.
  3. The sample size: The larger the sample size, the narrower the margin of error tends to be.
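The three factors can be seen directly in a normal-approximation margin of error for a sample mean. This is a minimal sketch: the function name `margin_of_error` and the numbers passed to it are made up for illustration.

```python
import math
from statistics import NormalDist

def margin_of_error(std_dev, sample_size, confidence=0.95):
    """Normal-approximation margin of error for a sample mean."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * std_dev / math.sqrt(sample_size)

baseline = margin_of_error(std_dev=2.0, sample_size=10)
more_variability = margin_of_error(std_dev=4.0, sample_size=10)
higher_confidence = margin_of_error(std_dev=2.0, sample_size=10, confidence=0.99)
larger_sample = margin_of_error(std_dev=2.0, sample_size=40)

print("baseline:         ", round(baseline, 3))
print("more variability: ", round(more_variability, 3))
print("higher confidence:", round(higher_confidence, 3))
print("larger sample:    ", round(larger_sample, 3))
```

Doubling the variability doubles the margin of error, while the sample size has to be multiplied by four to cut it in half, because the sample size enters through a square root.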

For example, let’s go back to our LEGO example. For population A and population B, we have several interval estimates:

sampling_10

As shown:

  1. Population A has a higher variability than population B, therefore the confidence intervals of A are wider.

  2. For each population, as the sample size becomes larger (from 10 to 20), the confidence interval becomes narrower.

  3. The blue lines in the picture are the true means of the two populations. There appears to be a systematic bias in the samples from population B, and the intervals are systematically biased as well: nearly all of the intervals from population B have a sample mean above the truth. This bias did not go away when we increased the sample size. Increasing the sample size reduces the variability of the interval estimate, but it does not reduce bias. If our interval estimate is systematically biased, increasing the sample size to shrink the margin of error may make the estimate converge to the wrong center, giving us the illusion that it is getting closer to the truth.
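The last point can be illustrated with a simulation. Nothing above says what caused the bias in the population B samples, so the mechanism here is purely hypothetical: we assume a piece is picked with a chance proportional to its number of points, as if larger pieces were easier to grab. `population_b` is likewise an invented set of 100 pieces with mean 5.84.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population B: counts invented so that the mean is 5.84.
population_b = [4] * 24 + [6] * 60 + [8] * 16
true_mean = statistics.mean(population_b)

def biased_sample(population, sample_size):
    """Assumed bias: a piece's chance of selection is proportional to its point count."""
    return rng.choices(population, weights=population, k=sample_size)

results = {}
for n in (10, 20, 80):
    means = [statistics.mean(biased_sample(population_b, n)) for _ in range(2000)]
    results[n] = (statistics.mean(means), statistics.stdev(means))
    center, spread = results[n]
    print(f"n={n:>2}: center of sample means={center:.2f}, "
          f"spread={spread:.3f}, true mean={true_mean:.2f}")
```

The spread of the sample means shrinks as n grows, but their center stays above the true mean by roughly the same amount at every sample size.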