Point Estimation

We have learned to estimate many population parameters by computing sample statistics. For example, we estimate the population mean $\mu$ with the sample mean $\bar{x}$ , the variance $\sigma^2$ with $s^2$ , the proportion $\pi$ with $\hat{p}$ (called just $p$ in your textbook), and the correlation $\rho$ with $r$ .

These statistics are all examples of point estimates: each is a single number that represents our best ‘educated guess’ as to the value of the parameter in question.

A point estimate by itself does not indicate any degree of uncertainty. For example, if I were trying to estimate the proportion of Americans with hypertension, I would feel more confident and have less uncertainty if I used a sample of $n=1000$ to get $\hat{p}=\frac{220}{1000}=0.22$ rather than $n=100$ and $\hat{p}=\frac{22}{100}=0.22$ , even though the two point estimates are identical.

Interval Estimation

In interval estimation, instead of using a single value, or point, to estimate a parameter, we estimate the parameter with an interval, or range, of values. A common example of an interval estimate from everyday life is when you see the results of a political poll or public opinion survey reported in the media. Typically, the results will be reported as ‘54.3% of voters plan on voting for Richard Guy for Congress, with a margin of error of plus-or-minus 3.7%’.

That margin of error defines an interval of values, and the primary purpose of this chapter is to study the mathematical methods involved in computing such intervals.

Standard Form of Many Confidence Intervals

Not all confidence intervals have this format, but most of the commonly used ones do. We first compute some sample statistic to estimate the parameter in question. Under certain mathematical assumptions, we know the sampling distribution of that statistic, so we “take” the middle $100(1-\alpha)\%$ of the sampling distribution by multiplying the standard error (the standard deviation of the sampling distribution) by a critical value from the appropriate distribution.

$\text{(sample statistic)} \pm \text{(critical value)} \times \text{(standard error)}$

The quantity to the right of the $\pm$ , (critical value) $\times$ (standard error), is the margin of error of the confidence interval.

A Confidence Interval for a Single Mean $\mu$

For example, suppose we want a $100(1-\alpha)\%$ confidence interval for a mean. If we assume $X \sim N(\mu, \sigma)$ where the value of $\sigma$ is known, then $\bar{X} \sim N(\mu, \sigma/\sqrt{n})$ . So the formula will be:

$\bar{x} \pm z_{1-\alpha/2} \times \frac{\sigma}{\sqrt{n}}$

where $z_{1-\alpha/2}$ is a critical value from the standard normal distribution that corresponds to the given confidence level (i.e. 90%, 95%, 99%, etc.) and $\frac{\sigma}{\sqrt{n}}$ is the standard error of the sample mean.

Determining the Critical Value

The usual confidence level is 95%, defined by taking $\alpha=.05$ so that $100(1-\alpha)=100(1-.05)=100(.95)=95\%$ . Hence, we want the quantiles of the standard normal distribution such that the middle 95% of the curve is between $-z_{1-\alpha/2}$ and $+z_{1-\alpha/2}$ .

Since $\alpha=.05$ , then $\alpha/2=.025$ and we find the 2.5th and 97.5th percentiles of the standard normal distribution. With technology, you could use invNorm(.025) and invNorm(.975) on a TI calculator or qnorm(c(.025,.975),mean=0,sd=1) in R or Normal Quantiles in R Commander.

For 95% confidence, the critical value is $z=\pm 1.96$ . The blue tails in the graph below each have area= $\alpha/2=.025$ and the white area between is equal to 0.95.

Formulas for the Confidence Interval for a Mean

If we choose standard confidence levels, such as 90%, 95%, or 99%, we will use the same critical value over and over. The formulas are:

90% CI for $\mu$ : $\bar{x} \pm 1.645 \times \frac{\sigma}{\sqrt{n}}$
95% CI for $\mu$ : $\bar{x} \pm 1.96 \times \frac{\sigma}{\sqrt{n}}$
99% CI for $\mu$ : $\bar{x} \pm 2.576 \times \frac{\sigma}{\sqrt{n}}$
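These critical values can be reproduced with technology. The course tools are the TI calculator and R; purely as an illustration, here is a minimal sketch in Python using only the standard library. The helper name `z_critical` is made up for this example.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided standard normal critical value z_{1 - alpha/2}."""
    alpha = 1 - confidence
    # The middle 100(1 - alpha)% of the curve ends at the (1 - alpha/2) quantile.
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

Rounded to three decimals, these agree with the 1.645, 1.96 and 2.576 used in the formulas above.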

Computing a Confidence Interval for a Mean

Example: Suppose a random sample of $n=64$ women is collected and their total cholesterol level $X$ is found. We will assume $X$ is normally distributed, with $\sigma=24$ mg/dL. The mean of the sample is $\bar{x}=192.4$ .

The 95% confidence interval is: $\begin{aligned} \bar{x} \pm & 1.96 \times \frac{\sigma}{\sqrt{n}} \\ 192.4 \pm & 1.96 \times \frac{24}{\sqrt{64}} \\ 192.4 \pm & 1.96 \times 3 \\ 192.4 \pm & 5.88 \\ \end{aligned}$

The interval is $(186.52,198.28)$ , with a margin of error of $\pm 5.88$ mg/dL.
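The same calculation can be sketched in code. This is an illustrative Python version (not one of the course tools), assuming $\sigma$ is known; the function name `z_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for a mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value
    margin = z * sigma / math.sqrt(n)                   # critical value x standard error
    return xbar - margin, xbar + margin

# Cholesterol example: n = 64, sigma = 24 mg/dL, sample mean 192.4
low, high = z_interval(xbar=192.4, sigma=24, n=64)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

To two decimals this reproduces the interval computed by hand.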

Changes to a Confidence Interval

What do you think would happen to this interval if:

We increase the sample size from $n=64$ to $n=100$ ?
We increase the confidence level from 95% to 99%?

What Happens When We Change Sample Size?

Our intuition might tell us that the margin of error should decrease if a larger sample size is used, keeping everything else constant. Let’s see what happens to our 95% CI when $n$ increases from 64 to 100.

The 95% confidence interval is: $\begin{aligned} \bar{x} \pm & 1.96 \times \frac{\sigma}{\sqrt{n}} \\ 192.4 \pm & 1.96 \times \frac{24}{\sqrt{100}} \\ 192.4 \pm & 1.96 \times 2.4 \\ 192.4 \pm & 4.704 \\ \end{aligned}$

The interval is $(187.696,197.104)$ , with a margin of error of $\pm 4.704$ mg/dL. The margin of error decreased, as expected.

What Happens When We Change The Confidence Level?

Now, let us see what happens if we find a 99% CI rather than a 95% CI, using $n=64$ .

The 99% confidence interval is: $\begin{aligned} \bar{x} \pm & 2.576 \times \frac{\sigma}{\sqrt{n}} \\ 192.4 \pm & 2.576 \times \frac{24}{\sqrt{64}} \\ 192.4 \pm & 2.576 \times 3 \\ 192.4 \pm & 7.728 \\ \end{aligned}$

The interval is $(184.672,200.128)$ , with a margin of error of $\pm 7.728$ mg/dL. To have a higher degree of confidence, we must allow the margin of error to be larger.
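Both effects can be seen side by side by computing only the margin of error. The sketch below is an illustration in Python with a hypothetical helper `margin_of_error`. It uses unrounded critical values, so the last digit can differ slightly from a hand calculation with 1.96 or 2.576.

```python
import math
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Margin of error z * sigma / sqrt(n) for a mean with known sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / math.sqrt(n)

baseline = margin_of_error(sigma=24, n=64)                      # original example
larger_n = margin_of_error(sigma=24, n=100)                     # larger sample
higher_conf = margin_of_error(sigma=24, n=64, confidence=0.99)  # higher confidence

print(f"n=64,  95%: +/- {baseline:.2f}")
print(f"n=100, 95%: +/- {larger_n:.2f}")
print(f"n=64,  99%: +/- {higher_conf:.2f}")
```

A larger sample shrinks the margin because $n$ sits under a square root in the denominator; a higher confidence level widens it because the critical value grows.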

What if We Don’t Know $\sigma$ ?

It might have struck you as strange that we would know the true standard deviation $\sigma$ of a distribution in a situation where we did not know the true mean $\mu$ and felt it necessary to estimate $\mu$ with a point estimate $\bar{x}$ and a confidence interval.

In most situations, we would not know the true value for $\sigma$ . The sample standard deviation $s$ seems like a natural choice to use in place of $\sigma$ in our confidence interval formulas.

However, the quantity $\frac{\bar{x}-\mu}{s/\sqrt{n}}$ does not precisely follow a normal distribution, especially when $n$ is small. Its sampling distribution is the $t$ -distribution, discovered by William Gosset.

The $t$ -distribution

The $t$ -distribution is similar to the standard normal distribution in that it is:

Bell-shaped
Symmetric
Centered at zero ( $\mu=0$ )

However, the $t$ -distribution has ‘heavier tails’ (i.e. a higher variance) than the standard normal distribution. Visually, it is shorter and flatter than the standard normal curve, as seen on the next slide. Also, each $t$ -distribution is described by a parameter called degrees of freedom, where $df=n-1$ . As $n \rightarrow \infty$ , the $t$ -curve approaches the $z$ -curve.

Comparing the Standard Normal and t distribution

The black line is standard normal, the blue line (very close to the black) is $t$ with 30 df, and the red line (further from the black) is $t$ with 4 df.
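The ‘shorter peak, heavier tails’ pattern can also be checked numerically. The following is a minimal Python sketch (an illustration, not a course tool) that evaluates the standard formula for the $t$ density and compares it with the standard normal density at the center ( $x=0$ ) and in the tail ( $x=3$ ). The function name `t_pdf` is made up for this example.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    # Normalizing constant Gamma((df+1)/2) / (sqrt(df*pi) * Gamma(df/2)), on the log scale
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

z = NormalDist()  # standard normal
for x in (0.0, 3.0):
    print(f"x = {x}: normal {z.pdf(x):.4f}, "
          f"t(30) {t_pdf(x, 30):.4f}, t(4) {t_pdf(x, 4):.4f}")
```

At $x=0$ the $t$ curves sit below the normal curve, and at $x=3$ they sit above it, with 4 df further from the normal than 30 df in both places.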

Confidence Interval Based On the $t$ distribution

The formula for the confidence interval for a mean is changed slightly when we do not know $\sigma$ . We still need to assume $X$ follows a normal distribution, or that we have a large sample size ( $n>30$ ).

$\bar{x} \pm t_{n-1,1-\alpha/2} \times \frac{s}{\sqrt{n}}$

We need to know both $n$ and $\alpha$ to find the critical value. The degrees of freedom are $df=n-1$ and the critical value will change for each sample size.

Critical Values for the $t$ distribution

Suppose $df=15$ and we want a 95% confidence level ( $\alpha=.05$ ). We can use a $t$ -distribution table such as the one I provided on Canvas: use the row with $df=15$ and the column for a central (between) probability of .95. The critical value is $t=\pm 2.131$ .

Confidence Interval Based On the $t$ distribution

If $n=16$ (so $df=15$ ), $\bar{x}=192.4$ and $s=26.5$ , then the 95% confidence interval for $\mu$ is: $\begin{aligned} 192.4 \pm & 2.131 \times \frac{26.5}{\sqrt{16}} \\ 192.4 \pm & 2.131 \times 6.625 \\ 192.4 \pm & 14.12 \\ \end{aligned}$
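As a sketch of the same arithmetic in code (Python, for illustration only): the standard library has no $t$ quantile function, so the critical value 2.131 from the table is passed in directly. The function name `t_interval` is hypothetical. If SciPy is installed, `scipy.stats.t.ppf(0.975, df=15)` is one way to obtain the critical value instead.

```python
import math

def t_interval(xbar, s, n, t_crit):
    """Confidence interval for a mean when sigma is unknown.

    t_crit is the critical value for df = n - 1, looked up in a t table.
    """
    margin = t_crit * s / math.sqrt(n)  # critical value x estimated standard error
    return xbar - margin, xbar + margin

low, high = t_interval(xbar=192.4, s=26.5, n=16, t_crit=2.131)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```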

The $t$ critical value can be obtained with technology. Use invT(.025,15) and invT(.975,15) on a TI calculator or qt(c(.025,.975),df=15) in R or t Quantiles in R Commander. You should still get $t=\pm2.131$ .

Confidence Intervals For a Single Proportion $\pi$

Another parameter that we are interested in is the proportion $\pi$ . We often think of this as a ‘percentage’. Common examples would include the proportion of registered voters that plan on voting for a particular candidate, the proportion of people that have hypertension, or the proportion of newborn bear cubs that survive the first year of life.

The point estimate, or sample statistic, used here is: $\hat{p}=\frac{Y}{n}$

For example, if $Y=312$ voters out of a simple random sample of $n=600$ say that they plan on voting for Richard Guy, then $\hat{p}=\frac{312}{600}=0.52$ .

Sampling Distribution of a Proportion

If we have a simple random sample, we could say that $Y$ , the number of voters that will vote for R. Guy, has a binomial distribution. When $n$ is large enough, then $Y$ and hence $\hat{p}$ will have a sampling distribution that is approximately normal.

If both $n \pi$ and $n(1-\pi)$ are 10 or greater, then: $\hat{p} \dot{\sim} N(\pi,\sqrt{\frac{\pi(1-\pi)}{n}})$

We use this fact to base a $100(1-\alpha)\%$ CI for $\pi$ on the following formula: $\hat{p} \pm z_{1-\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

Confidence Interval for a Proportion

Suppose we have a simple random sample of $n=600$ voters and $Y=312$ say they will vote for Richard Guy. A 90% confidence interval for $\pi$ is:

$\begin{aligned} 0.52 \pm & 1.645 \times \sqrt{\frac{0.52(1-0.52)}{600}} \\ 0.52 \pm & 1.645 \times \sqrt{.000416} \\ 0.52 \pm & 0.034 \\ \end{aligned}$

The 90% CI is (0.486,0.554) and the margin of error is $\pm 3.4\%$ . Notice that this candidate’s interval contains values below 50%, so he would not be confident (at the 90% level) of obtaining a majority of the votes.
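The proportion interval follows the same pattern in code. Below is a minimal Python sketch (illustrative; the name `wald_interval` is made up here) of the formula above, applied to the Richard Guy poll.

```python
import math
from statistics import NormalDist

def wald_interval(successes, n, confidence=0.95):
    """Large-sample (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Y = 312 of n = 600 voters, 90% confidence
low, high = wald_interval(312, 600, confidence=0.90)
print(f"90% CI: ({low:.3f}, {high:.3f})")
print("Interval includes 50%:", low <= 0.5 <= high)
```

The formula relies on the normal approximation, so it is only appropriate when the sample is large enough (the 10-or-greater condition given earlier).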

Margin of Error

You have probably heard or read about the ‘margin of error’ in the media many times. Typically, the margin of error (plus/minus some percentage) is reported to the public. Most CIs in the media are based on the 95% level of confidence, so a critical value of $z=\pm 1.96$ rather than $z=\pm 1.645$ would have been used on the previous slide.

Most margins of error in polls & surveys collected by news organizations or polling companies range from $\pm3\%$ to $\pm5\%$ . This is because sample sizes of a few hundred to about a thousand yield margins of error of this size.

To cut a margin of error in half, you have to not double, but quadruple, the sample size! This is why you will almost never see a margin of error of $\pm 1\%$ , as you would need a sample size of about 10000!
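The relationship between sample size and margin of error can be sketched as follows. This Python illustration assumes a simple random sample and the most conservative case $\hat{p}=0.5$ (where $\hat{p}(1-\hat{p})$ is largest); both helper names are hypothetical.

```python
import math
from statistics import NormalDist

def worst_case_margin(n, confidence=0.95):
    """Margin of error for a proportion, assuming p-hat = 0.5."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(0.25 / n)

def sample_size_for_margin(margin, confidence=0.95):
    """Smallest n whose worst-case margin is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z / (2 * margin)) ** 2)

for n in (400, 1000, 1600):
    print(f"n = {n}: +/- {worst_case_margin(n):.1%}")
print("n needed for +/- 1%:", sample_size_for_margin(0.01))
```

Going from $n=400$ to $n=1600$ (four times the sample) cuts the margin exactly in half, because $n$ enters the formula through $\sqrt{n}$ .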

Example of a Poll

http://www.gallup.com/poll/152528/Congress-Job-Approval-New-Low.aspx

Polling Methodology

If you go to the very bottom of the webpage, Gallup tells you a bit about the methodology.

Poll Details

This Gallup poll used a multistage random sample of $n=1029$ adults, the 95% level for confidence, and they computed a margin of error of approx. $\pm 4\%$ .

If you tried to compute the margin of error of this Gallup poll with the formula shown a few slides ago, your answer would not match theirs: even with the most conservative value $\hat{p}=0.5$ , the formula gives about $\pm 3.1\%$ for $n=1029$ . The reason why is that our formula is based on a simple random sample, but Gallup uses a more complex method of random sampling. The formula Gallup would use is similar, but with a more complicated term for the standard error.

Incorrect Interpretation of the Confidence Interval

Now that you have mastered computing CIs for a single mean $\mu$ and a single proportion $\pi$ , we need to think about how to properly interpret the interval.

What most people want to say is something like this:

My 95% confidence interval for the mean total cholesterol level is $192.4 \pm 14.12$ , so I am 95% sure that the actual mean for total cholesterol is between 178.28 mg/dL and 206.52 mg/dL.

or (this one is even worse!):

Our interval means 95% of the population has a total cholesterol between 178.28 and 206.52 mg/dL

These statements might seem reasonable but are WRONG, even though the statistics were computed correctly.

Correct Interpretation of the Confidence Interval

A correct interpretation of the confidence interval is more subtle and less straightforward.

If we took many, many random samples of size $n$ from our population and computed a $100(1-\alpha)\%$ CI for each sample, then in the long run, $100(1-\alpha)\%$ of the CIs would contain the true value of the parameter.

While this is technically correct, many people find this interpretation confusing and unsatisfying.

Simulating Confidence Intervals

The web page http://www.rossmanchance.com/applets/NewConfsim/Confsim.html will let you simulate and graph the CIs from many, many samples. If you choose Proportions, use Wald as the method (that is the formula we covered).
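The long-run interpretation can also be simulated in a few lines of code. The sketch below is an illustration in Python, not the applet's method: it draws many samples from a normal population with a known mean, builds a 95% interval from each, and counts how often the interval contains the true mean. The population values ( $\mu=190$ , $\sigma=24$ ), the number of repetitions, and the seed are assumptions chosen only for the demonstration.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, confidence=0.95, reps=2000, seed=1):
    """Fraction of simulated z-intervals that contain the true mean mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))  # one sample mean
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / reps

print(f"Proportion of intervals covering mu: {coverage(mu=190, sigma=24, n=64):.3f}")
```

The printed proportion should land close to 0.95. Each individual interval either contains $\mu$ or it does not; the 95% describes the procedure over many samples.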

Bayesian Credible Intervals

Many statisticians use what are called Bayesian methods, which are extensions of Bayes’ Theorem that we studied earlier. These methods can be quite complicated and computationally intensive, and we will not study them in this course.

You might see researchers talking about intervals computed with Bayesian methods in the academic literature. They are often called credible intervals (so you don’t confuse them with the confidence intervals that we have covered). Unlike confidence intervals, it is correct to interpret a credible interval as saying ‘There is a 95% chance the parameter is between the lower and upper limits of the interval’. That is an easier, more natural interpretation, but the catch is the additional complexity in the computation.

Confidence Intervals for a Correlation

You might think there is some similar formula to find the confidence interval for the population correlation $\rho$ that would be based on the sample correlation $r$ . Unfortunately, we do not have a good approximation for the sampling distribution of $r$ as we do for $\hat{p}$ and $\bar{x}$ , so an ‘extra step’ will be needed and the process is a bit more complicated.

We will have to change our sample statistic $r$ into another statistic called Fisher’s $z$ transformation.

$z=\frac{1}{2} \ln(\frac{1+r}{1-r})$

We then find a confidence interval for Fisher’s $z$ . Once we do that, we will ‘undo’ the Fisher’s $z$ transformation to obtain a CI for $\rho$ .

Example of a CI for Correlation

Suppose we have a data set of size $n=67$ with a correlation between two variables $X$ and $Y$ that is $r=0.25$ . We want a 95% CI for $\rho$ .

Step 1: Convert $r$ to Fisher’s $z$ .

$\begin{aligned} z & =\frac{1}{2} \ln(\frac{1+r}{1-r}) \\ & = \frac{1}{2} \ln(\frac{1+0.25}{1-0.25}) \\ & = \frac{1}{2} \ln(\frac{1.25}{0.75}) \\ & = \frac{1}{2} \ln(\frac{5}{3}) \\ & = 0.2554 \\ \end{aligned}$

Step 2: Compute a confidence interval for Fisher’s $z$

$\begin{aligned} & \frac{1}{2} \ln(\frac{1+r}{1-r}) \pm z^* \frac{1}{\sqrt{n-3}} \\ & 0.2554 \pm 1.96 \frac{1}{\sqrt{67-3}} \\ & 0.2554 \pm 0.2450 \\ & (0.0104, 0.5004) \\ \end{aligned}$

At this point we have a 95% CI for Fisher’s $Z$ and NOT for the correlation $\rho$ . We don’t actually care about or want to interpret Fisher’s $Z$ . Unlike the correlation, it can take on any real number and is not bounded by the range $-1 \leq r \leq 1$ .

Step 3: Undo the transformation

Right now we have a CI for Fisher’s $Z$ with values $(L_Z, U_Z)$ . In our example, $L_Z=0.0104$ and $U_Z=0.5004$ .

We actually want $L_\rho$ and $U_\rho$ instead, which are computed as follows:

$L_\rho=\frac{e^{2L_Z}-1}{e^{2L_Z}+1}$ $U_\rho=\frac{e^{2U_Z}-1}{e^{2U_Z}+1}$

(These equations were obtained by solving the Fisher’s $z$ transformation formula for $r$ )

In our problem:

$L_\rho=\frac{e^{2(0.0104)}-1}{e^{2(0.0104)}+1}=0.0104$ $U_\rho=\frac{e^{2(0.5004)}-1}{e^{2(0.5004)}+1}=0.4624$

Our 95% CI for the correlation $\rho$ is $(0.0104, 0.4624)$ . We are done!
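The three steps can be sketched compactly in code. In this Python illustration (the function name `correlation_interval` is made up), Fisher’s $z$ transformation is the inverse hyperbolic tangent, `math.atanh`, and ‘undoing’ it is `math.tanh`, which is the same as the $\frac{e^{2z}-1}{e^{2z}+1}$ formula above.

```python
import math
from statistics import NormalDist

def correlation_interval(r, n, confidence=0.95):
    """Confidence interval for a correlation via Fisher's z transformation."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    fisher_z = math.atanh(r)                 # Step 1: 0.5 * ln((1 + r) / (1 - r))
    half_width = z_crit / math.sqrt(n - 3)   # Step 2: critical value x standard error
    lower_z, upper_z = fisher_z - half_width, fisher_z + half_width
    return math.tanh(lower_z), math.tanh(upper_z)  # Step 3: undo the transformation

low, high = correlation_interval(r=0.25, n=67)
print(f"95% CI for rho: ({low:.4f}, {high:.4f})")
```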

Another way to compute the CI for Correlation

We could also follow the lead of the textbook in the blue box on pages 153-154, and compute the CI as:

Lower limit $L_\rho=\frac{1+r-(1-r)e^{\frac{2z^*}{\sqrt{n-3}}}}{1+r+(1-r)e^{\frac{2z^*}{\sqrt{n-3}}}}$

Upper limit $U_\rho=\frac{1+r-(1-r)e^{\frac{-2z^*}{\sqrt{n-3}}}}{1+r+(1-r)e^{\frac{-2z^*}{\sqrt{n-3}}}}$

where $z^*$ is the critical value from the standard normal distribution for the desired confidence level $100(1-\alpha)$ %.

Our example, recomputed

In our example, $n=67$ , $r=0.25$ and $z^*=1.96$ , so $\frac{2z^*}{\sqrt{n-3}}=\frac{2(1.96)}{\sqrt{64}}=0.49$ , yielding:

$\begin{aligned} L_\rho & =\frac{1+r-(1-r)e^{\frac{2z^*}{\sqrt{n-3}}}}{1+r+(1-r)e^{\frac{2z^*}{\sqrt{n-3}}}} \\ & = \frac{1.25-0.75e^{0.49}}{1.25+0.75e^{0.49}} \\ & = 0.0104 \\ \end{aligned}$

$\begin{aligned} U_\rho & = \frac{1+r-(1-r)e^{\frac{-2z^*}{\sqrt{n-3}}}}{1+r+(1-r)e^{\frac{-2z^*}{\sqrt{n-3}}}} \\ & = \frac{1.25-0.75e^{-0.49}}{1.25+0.75e^{-0.49}} \\ & = 0.4624 \end{aligned}$

The computed interval

We obtain the same interval $(0.0104, 0.4624)$ . This correlation of $r=0.25$ with $n=67$ would be deemed statistically significant at the $\alpha=0.05$ level since the interval does not contain zero.
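As a quick check that the closed-form limits agree with the transform-and-undo route, here is a small Python sketch (illustrative; the function name is hypothetical) that evaluates both for our example.

```python
import math

def correlation_limits_closed_form(r, n, z_crit):
    """Lower and upper limits for rho from the closed-form expressions."""
    k = math.exp(2 * z_crit / math.sqrt(n - 3))
    lower = (1 + r - (1 - r) * k) / (1 + r + (1 - r) * k)
    upper = (1 + r - (1 - r) / k) / (1 + r + (1 - r) / k)  # 1/k = e^(-2z*/sqrt(n-3))
    return lower, upper

lower, upper = correlation_limits_closed_form(r=0.25, n=67, z_crit=1.96)

# Same limits via Fisher's z and its inverse
half_width = 1.96 / math.sqrt(67 - 3)
via_fisher = (math.tanh(math.atanh(0.25) - half_width),
              math.tanh(math.atanh(0.25) + half_width))

print(f"closed form: ({lower:.4f}, {upper:.4f})")
print(f"via Fisher:  ({via_fisher[0]:.4f}, {via_fisher[1]:.4f})")
print("Interval excludes zero:", lower > 0 or upper < 0)
```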

This does NOT necessarily mean that the correlation is clinically or practically important.

CIs for Correlation with Technology

Unfortunately, the TI calculators do not have a function for computing confidence intervals for a correlation coefficient (I think because this is not typically covered in 100-level stats courses).

The confidence interval for $\rho$ could be obtained with R Commander IF you already have a data set. You would go to Statistics, then Correlation test… Ignoring for now the hypothesis test, here’s output for the 95% CI for the correlation between calories and carbo in cereal, using the UScereal dataset.

CI for Correlation with R Commander

## 
##  Pearson's product-moment correlation
## 
## data:  calories and carbo
## t = 10.183, df = 63, p-value = 6.217e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6745945 0.8660255
## sample estimates:
##       cor 
## 0.7887227
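The interval in this output can be checked by hand with Fisher’s $z$ method. The sketch below is an illustration in Python; it assumes $n=65$ (since the output reports $df=n-2=63$ ) and uses the sample correlation as printed, so agreement is only expected up to rounding.

```python
import math
from statistics import NormalDist

r = 0.7887227   # sample correlation from the output
n = 63 + 2      # assumed sample size, because df = n - 2

z_crit = NormalDist().inv_cdf(0.975)         # 95% confidence
half_width = z_crit / math.sqrt(n - 3)
lower = math.tanh(math.atanh(r) - half_width)
upper = math.tanh(math.atanh(r) + half_width)

print(f"95% CI for rho: ({lower:.4f}, {upper:.4f})")
```

To four decimals this matches the limits 0.6746 and 0.8660 reported in the output.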

Week #6–Confidence Intervals (ch. 7)
