MAT 660–Biostatistics (Dr. Christopher J. Mecklin)
February 22, 2016
We have learned to estimate many population parameters by computing sample statistics. For example, we estimate the population mean μ with the sample mean $\bar{x}$, the variance $\sigma^2$ with $s^2$, the proportion π with $\hat{p}$ (called just p in your textbook), and the correlation ρ with r.
These statistics are all examples of point estimates; essentially, a single number that represents our best ‘educated guess’ as to the value of the parameter in question.
With a point estimate, we are not indicating any degree of uncertainty in our estimate. For example, if I were trying to estimate the proportion of Americans with hypertension, I would feel more confident and have less uncertainty if I used a sample of n=1000 to get $\hat{p} = 220/1000 = 0.22$ rather than n=100 and $\hat{p} = 22/100 = 0.22$.
In interval estimation, instead of using a single value, or point, to estimate a parameter, we estimate the parameter with an interval, or range, of values. A common example of an interval estimate from everyday life is when you see the results of a political poll or public opinion survey reported in the media. Typically, the results will be reported as ‘54.3% of voters plan on voting for Richard Guy for Congress, with a margin of error of plus-or-minus 3.7%’.
That margin of error defines an interval of values, and the primary purpose of this chapter is to study the mathematical methods involved in computing such intervals.
While not all confidence intervals have this format, most of the commonly used ones do, where we first compute some sample statistic to estimate the parameter in question. Under certain mathematical assumptions, we will know the sampling distribution of that statistic, so we will “take” the middle 100(1−α)% of the sampling distribution by multiplying the standard error (standard deviation of the sampling distribution) by a critical value from the appropriate distribution.
(sample statistic)±(critical value)×(standard error)
The quantity on the right hand side of the ± is the margin of error of the confidence interval.
For example, suppose we want a 100(1−α)% confidence interval for a mean. If we assume X∼N(μ,σ) where the value of σ is known, then $\bar{X} \sim N(\mu, \sigma/\sqrt{n})$. So the formula will be:
$$\bar{x} \pm z_{1-\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$
where $z_{1-\alpha/2}$ is a critical value from the standard normal distribution that corresponds to the given confidence level (i.e. 90%, 95%, 99%, etc.) and $\sigma/\sqrt{n}$ is the standard error of the sample mean.
The usual confidence level is 95%, defined by taking α=.05 so that 100(1−α)=100(1−.05)=100(.95)=95%. Hence, we want the quantiles of the standard normal distribution such that the middle 95% of the curve is between $-z_{1-\alpha/2}$ and $+z_{1-\alpha/2}$.
Since α=.05, then α/2=.025 and we find the 2.5th and 97.5th percentiles of the standard normal distribution. With technology, you could use invNorm(.025) and invNorm(.975) on a TI calculator, or qnorm(c(.025,.975),mean=0,sd=1) in R, or Normal Quantiles in R Commander.
For 95% confidence, the critical value is z=±1.96. The blue tails in the graph below each have area=α/2=.025 and the white area between is equal to 0.95.
If we choose standard confidence levels, such as 90%, 95%, or 99%, we will use the same critical value over and over. The formulas are:
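90% CI for μ: $\bar{x} \pm 1.645 \times \frac{\sigma}{\sqrt{n}}$
95% CI for μ: $\bar{x} \pm 1.96 \times \frac{\sigma}{\sqrt{n}}$
99% CI for μ: $\bar{x} \pm 2.576 \times \frac{\sigma}{\sqrt{n}}$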
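The three critical values above can be reproduced in R. This is a minimal sketch; the variable names `conf_levels`, `alpha`, and `z_crit` are illustrative and not part of the course materials.

```r
# Standard confidence levels from the slide
conf_levels <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_levels

# Upper critical value z_{1 - alpha/2} for each confidence level
z_crit <- qnorm(1 - alpha / 2)

round(z_crit, 3)  # 1.645 1.960 2.576
```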
Example: Suppose a random sample of n=64 women is collected and their total cholesterol level X is found. We will assume X is normally distributed, with σ=24 mg/dL. The mean of the sample is $\bar{x} = 192.4$.
The 95% confidence interval is:
$$\bar{x} \pm 1.96 \times \frac{\sigma}{\sqrt{n}}$$
$$192.4 \pm 1.96 \times \frac{24}{\sqrt{64}}$$
$$192.4 \pm 1.96 \times 3$$
$$192.4 \pm 5.88$$
The interval is (186.52,198.28), with a margin of error of ±5.88 mg/dL.
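The cholesterol interval can be checked in R with a small helper. This is a sketch only: `z_interval` is a hypothetical function written for this example, not a built-in, and it assumes σ is known.

```r
# Hypothetical helper: z-based CI for a mean when sigma is known
z_interval <- function(xbar, sigma, n, conf = 0.95) {
  alpha <- 1 - conf
  z <- qnorm(1 - alpha / 2)          # critical value
  margin <- z * sigma / sqrt(n)      # critical value times standard error
  c(lower = xbar - margin, upper = xbar + margin, margin = margin)
}

ci <- z_interval(xbar = 192.4, sigma = 24, n = 64, conf = 0.95)
round(ci, 2)  # lower 186.52, upper 198.28, margin 5.88
```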
What do you think would happen to this interval if:
We increase the sample size from n=64 to n=100?
We increase the confidence level from 95% to 99%?
Our intuition might tell us that the margin of error should decrease if a larger sample size is used, keeping everything else constant. Let’s see what happens to our 95% CI when n increases from 64 to 100.
The 95% confidence interval is:
$$\bar{x} \pm 1.96 \times \frac{\sigma}{\sqrt{n}}$$
$$192.4 \pm 1.96 \times \frac{24}{\sqrt{100}}$$
$$192.4 \pm 1.96 \times 2.4$$
$$192.4 \pm 4.704$$
The interval is (187.696,197.104), with a margin of error of ±4.704 mg/dL. The margin of error decreased, as expected.
Now, let us see what happens if we find a 99% CI rather than a 95% CI, using n=64.
The 99% confidence interval is:
$$\bar{x} \pm 2.576 \times \frac{\sigma}{\sqrt{n}}$$
$$192.4 \pm 2.576 \times \frac{24}{\sqrt{64}}$$
$$192.4 \pm 2.576 \times 3$$
$$192.4 \pm 7.728$$
The interval is (184.672,200.128), with a margin of error of ±7.728 mg/dL. To have a higher degree of confidence, we must allow the margin of error to be larger.
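The three margins of error can be compared side by side in R. In this sketch, `margin_z` is a hypothetical helper. Because R uses unrounded critical values, the results may differ from the slide values in the third decimal place.

```r
# Hypothetical helper: margin of error for a z-based CI with known sigma
margin_z <- function(sigma, n, conf) {
  qnorm(1 - (1 - conf) / 2) * sigma / sqrt(n)
}

margins <- c(
  n64_conf95  = margin_z(24, 64, 0.95),   # baseline
  n100_conf95 = margin_z(24, 100, 0.95),  # larger sample, narrower interval
  n64_conf99  = margin_z(24, 64, 0.99)    # higher confidence, wider interval
)
round(margins, 2)
```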
It might have struck you as strange that we would know the true standard deviation σ of a distribution in a situation where we did not know the true mean μ and felt it necessary to estimate μ with a point estimate ˉx and a confidence interval.
In most situations, we would not know the true value for σ. The sample standard deviation s seems like a natural choice to use in place of σ in our confidence interval formulas.
However, the distribution of the quantity $\frac{\bar{x}-\mu}{s/\sqrt{n}}$ does not precisely follow a normal distribution, especially when n is small. The sampling distribution in this case follows the t-distribution, discovered by William Gosset.
The t-distribution is similar to the standard normal distribution in that it is symmetric, bell-shaped, and centered at zero.
However, the t-distribution has ‘heavier tails’ (i.e. a higher variance) than the standard normal distribution. Visually, it is shorter and flatter than the standard normal curve, as seen on the next slide. Also, each t distribution is described by a parameter called degrees of freedom, where df=n−1. As n→∞, the t-curve approaches the z-curve.
The black line is standard normal, the blue line (very close to the black) is t with 30 df, and the red line (further from the black) is t with 4 df.
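The heavier tails also show up in the critical values. As a rough illustration, the sketch below compares the 97.5th percentile of the three curves in the plot; the vector name `crit` is illustrative.

```r
p <- 0.975  # upper percentile used for a 95% interval

crit <- c(
  t_df4  = qt(p, df = 4),   # heavy tails, largest critical value
  t_df30 = qt(p, df = 30),  # already close to the normal
  z      = qnorm(p)         # standard normal
)
round(crit, 3)
```

The critical value shrinks toward the normal value as the degrees of freedom grow, which matches the curves moving closer together.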
The formula for the confidence interval for a mean is changed slightly when we do not know σ. We still need to assume X follows a normal distribution, or that we have a large sample size (n>30).
$$\bar{x} \pm t_{n-1,\,1-\alpha/2} \times \frac{s}{\sqrt{n}}$$
We need to know both n and α to find the critical value. The degrees of freedom are df=n−1 and the critical value will change for each sample size.
If df=15 and we want a 95% confidence level (α=.05), then we can use a t-distribution table such as the one I provided on Canvas. Use the row with df=15 and the column with probability between of .95. The critical value is t=±2.131.
If n=16 (so df=15), $\bar{x} = 192.4$ and s=26.5, then the 95% confidence interval for μ is:
$$192.4 \pm 2.131 \times \frac{26.5}{\sqrt{16}}$$
$$192.4 \pm 14.12$$
The t critical value can be obtained with technology. Use invT(.025,15) and invT(.975,15) on a TI calculator (invT is available on the TI-84, not invNorm, which is for the normal distribution), or qt(c(.025,.975),df=15) in R, or t Quantiles in R Commander. You should still get t=±2.131.
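The full t-based interval from the example can be sketched in R from the summary statistics alone. The variable names are illustrative.

```r
# Summary statistics from the example
n    <- 16
xbar <- 192.4
s    <- 26.5

t_crit <- qt(0.975, df = n - 1)    # critical value with 15 df
margin <- t_crit * s / sqrt(n)     # margin of error
t_ci   <- c(lower = xbar - margin, upper = xbar + margin)

round(t_crit, 3)  # 2.131
round(t_ci, 2)    # lower 178.28, upper 206.52
```

If the raw data were available, t.test(x)$conf.int would give the same interval directly.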
Another parameter that we are interested in is the proportion π. We often think of this as a ‘percentage’. Common examples would include the proportion of registered voters that plan on voting for a particular candidate, the proportion of people that have hypertension, or the proportion of newborn bear cubs that survive the first year of life.
The point estimate, or sample statistic, used here is: $\hat{p} = \frac{Y}{n}$
For example, if Y=312 voters out of a simple random sample of n=600 say that they plan on voting for Richard Guy, then $\hat{p} = 312/600 = 0.52$.
If we have a simple random sample, we could say that Y, the number of voters that will vote for R. Guy, has a binomial distribution. When n is large enough, then Y and hence ˆp will have a sampling distribution that is approximately normal.
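If both nπ and n(1−π) are 10 or greater, then: $$\hat{p} \mathrel{\dot\sim} N\left(\pi, \sqrt{\frac{\pi(1-\pi)}{n}}\right)$$
We use this fact to base a 100(1−α)% CI for π on the following formula: $$\hat{p} \pm z_{1-\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$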
Suppose we have a simple random sample of n=600 voters and Y=312 say they will vote for Richard Guy. A 90% confidence interval for π is:
$$0.52 \pm 1.645 \times \sqrt{\frac{0.52(1-0.52)}{600}}$$
$$0.52 \pm 1.645 \times \sqrt{.000416}$$
$$0.52 \pm 0.034$$
The 90% CI is (0.486,0.554) and the margin of error is ±3.4%. Notice that this candidate’s interval contains values below 50%, so he would not be confident (at the 90% level) of obtaining a majority of the votes.
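The polling example can be reproduced in R. In this sketch, `wald_interval` is a hypothetical helper implementing the formula above; it is not a built-in function.

```r
# Hypothetical helper: Wald CI for a proportion
wald_interval <- function(y, n, conf = 0.95) {
  p_hat <- y / n
  z <- qnorm(1 - (1 - conf) / 2)
  margin <- z * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - margin, upper = p_hat + margin, margin = margin)
}

p_ci <- wald_interval(y = 312, n = 600, conf = 0.90)
round(p_ci, 3)  # lower 0.486, upper 0.554, margin 0.034
```

Note that R's built-in prop.test uses a different method (a score interval, with a continuity correction by default), so its limits will not match this formula exactly.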
You have probably heard or read about the ‘margin of error’ in the media many times. Typically, the margin of error (plus/minus some percentage) is reported to the public. Most CIs in the media are based on the 95% level of confidence, so a critical value of z=±1.96 rather than z=±1.645 would have been used on the previous slide.
Most margins of error in polls & surveys collected by news organizations or polling companies range from ±3% to ±5%. This is because sample sizes of a few hundred to about a thousand yield margins of error of this size.
To cut a margin of error in half, you have to not double, but quadruple the sample size, because the standard error shrinks with √n rather than n. This is why you will almost never see a margin of error of ±1%, as you would need a sample size of about 10000!
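The relationship between sample size and margin of error can be tabulated in R. This sketch assumes a simple random sample, the 95% level, and $\hat{p} = 0.5$ (an assumption made here because it gives the largest possible margin); the sample sizes are chosen only for illustration.

```r
# Illustrative sample sizes; each step from 100 -> 400 and 2500 -> 10000 quadruples n
n <- c(100, 400, 1000, 2500, 10000)

# 95% margin of error for a proportion, assuming p_hat = 0.5 (worst case)
margin <- qnorm(0.975) * sqrt(0.5 * 0.5 / n)

# Margin of error in percentage points for each sample size
data.frame(n = n, margin_pct = round(100 * margin, 1))
```

Quadrupling n from 100 to 400 cuts the margin from about 9.8% to about 4.9%, and a margin near ±1% only appears around n = 10000.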
If you go to the very bottom of the webpage, Gallup tells you a bit about the methodology.
This Gallup poll used a multistage random sample of n=1029 adults, the 95% level for confidence, and they computed a margin of error of approx. ±4%.
If you tried to compute the margin of error of this Gallup poll with the formula shown a few slides ago, your answer would be off by a few tenths of a percent. The reason why is that our formula is based on a simple random sample, but Gallup uses a more complex method of random sampling. The formula Gallup would use is similar, but with a more complicated term for the standard error.
Now that you have mastered computing CIs for a single mean μ and a single proportion π, we need to think about how to properly interpret the interval.
What most people want to say is something like this:
My 95% confidence interval for the mean total cholesterol level is 192.4±14.12, so I am 95% sure that the actual mean for total cholesterol is between 178.28 mg/dL and 206.52 mg/dL.
or (this one is even worse!):
Our interval means 95% of the population has a total cholesterol between 178.28 and 206.52 mg/dL
These statements might seem reasonable but are WRONG, even though the statistics were computed correctly.
A correct interpretation of the confidence interval is more subtle and less straightforward.
If we took many, many random samples of size n from our population and computed a 100(1−α)% CI for each sample, then in the long run, 100(1−α)% of the CIs would contain the true value of the parameter.
While this is technically correct, many people find this interpretation confusing and unsatisfying.
The following web page will let you simulate and graph the CIs from many many samples. http://www.rossmanchance.com/applets/NewConfsim/Confsim.html If you choose Proportions, use Wald as the method (that is the formula we covered).
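The long-run interpretation can also be sketched with a short simulation in R. The population values below (μ=190, σ=24, n=64), the number of repetitions, and the seed are all assumptions chosen for illustration.

```r
set.seed(660)  # arbitrary seed so the run is repeatable

mu    <- 190    # hypothetical true mean (unknown in practice)
sigma <- 24     # known population standard deviation
n     <- 64     # size of each sample
reps  <- 10000  # number of repeated samples

margin <- qnorm(0.975) * sigma / sqrt(n)  # same margin for every sample

# For each sample, record whether its 95% CI contains the true mean
covered <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  abs(mean(x) - mu) <= margin
})

mean(covered)  # proportion of intervals that captured mu; should be near 0.95
```

Any single interval either contains μ or it does not; the 95% describes the procedure across many samples.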
Many statisticians use what are called Bayesian methods, which are extensions of Bayes’ Theorem that we studied earlier. These methods can be quite complicated and computer-intensive, and we will not study them in this course.
You might see researchers talking about intervals computed with Bayesian methods in the academic literature. They are often called credible intervals (so you don’t confuse them with the confidence intervals that we have covered). Unlike confidence intervals, it is correct to interpret a credible interval as saying ‘There is a 95% chance the parameter is between the lower and upper limits of the interval’. That is an easier, more natural interpretation, but the catch is the additional complexity in the computation.
You might think there is some similar formula to find the confidence interval for the population correlation ρ that would be based on the sample correlation r. Unfortunately, we do not have a good approximation for the sampling distribution of r as we do for ˆp and ˉx, so an ‘extra step’ will be needed and the process is a bit more complicated.
We will have to change our sample statistic r into another statistic, called Fisher’s z, using Fisher’s z transformation.
$$z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$$
We then find a confidence interval for Fisher’s z. Once we do that, we will ‘undo’ the Fisher’s z transformation to obtain a CI for ρ.
Suppose we have a data set of size n=67 with a correlation between two variables X and Y that is r=0.25. We want a 95% CI for ρ.
Step 1: Convert r to Fisher’s z.
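$$z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) = \frac{1}{2}\ln\left(\frac{1+0.25}{1-0.25}\right) = \frac{1}{2}\ln\left(\frac{1.25}{0.75}\right) = \frac{1}{2}\ln\left(\frac{5}{3}\right) = 0.2554$$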
Step 2: Compute a confidence interval for Fisher’s z
$$\frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) \pm z^* \frac{1}{\sqrt{n-3}}$$
$$0.2554 \pm 1.96 \frac{1}{\sqrt{67-3}}$$
$$0.2554 \pm 0.2450$$
$$(0.0104, 0.5004)$$
At this point we have a 95% CI for Fisher’s Z and NOT for the correlation ρ. We don’t actually care about or want to interpret Fisher’s Z. Unlike the correlation, it can take on any real number and is not bounded by the range −1≤r≤1.
Step 3: Undo the transformation
Right now we have a CI for Fisher’s Z with values $(L_Z, U_Z)$. In our example, $L_Z = 0.0104$ and $U_Z = 0.5004$.
We actually want $L_\rho$ and $U_\rho$ instead, which are computed as follows:
$$L_\rho = \frac{e^{2L_Z}-1}{e^{2L_Z}+1} \qquad U_\rho = \frac{e^{2U_Z}-1}{e^{2U_Z}+1}$$
(These equations were obtained by solving the Fisher’s z transformation formula for r)
In our problem:
$$L_\rho = \frac{e^{2(0.0104)}-1}{e^{2(0.0104)}+1} = 0.0104 \qquad U_\rho = \frac{e^{2(0.5004)}-1}{e^{2(0.5004)}+1} = 0.4624$$
Our 95% CI for the correlation ρ is (0.0104,0.4624). We are done!
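The three steps can be sketched in R from r and n alone. The sketch relies on the fact that Fisher’s z transformation is the inverse hyperbolic tangent, so base R’s atanh() applies it and tanh() undoes it; the variable names are illustrative.

```r
r <- 0.25
n <- 67

# Step 1: convert r to Fisher's z (atanh(r) equals 0.5 * log((1 + r) / (1 - r)))
z <- atanh(r)

# Step 2: confidence interval for Fisher's z
se   <- 1 / sqrt(n - 3)
z_ci <- z + c(-1, 1) * qnorm(0.975) * se

# Step 3: undo the transformation to get the interval for rho
rho_ci <- tanh(z_ci)

round(z, 4)       # 0.2554
round(z_ci, 4)    # 0.0104 0.5004
round(rho_ci, 4)  # 0.0104 0.4624
```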
We could also follow the lead of the textbook in the blue box on pages 153-154, and compute the CI as:
Lower limit: $$L_\rho = \frac{1+r-(1-r)e^{2z^*/\sqrt{n-3}}}{1+r+(1-r)e^{2z^*/\sqrt{n-3}}}$$
Upper limit: $$U_\rho = \frac{1+r-(1-r)e^{-2z^*/\sqrt{n-3}}}{1+r+(1-r)e^{-2z^*/\sqrt{n-3}}}$$
where z∗ is the critical value from the standard normal distribution for the desired confidence level 100(1−α)%.
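In our example, n=67, r=0.25 and z∗=1.96, so $2z^*/\sqrt{n-3} = 2(1.96)/\sqrt{64} = 0.49$, yielding:
$$L_\rho = \frac{1+r-(1-r)e^{2z^*/\sqrt{n-3}}}{1+r+(1-r)e^{2z^*/\sqrt{n-3}}} = \frac{1.25-0.75e^{0.49}}{1.25+0.75e^{0.49}} = 0.0104$$
$$U_\rho = \frac{1+r-(1-r)e^{-2z^*/\sqrt{n-3}}}{1+r+(1-r)e^{-2z^*/\sqrt{n-3}}} = \frac{1.25-0.75e^{-0.49}}{1.25+0.75e^{-0.49}} = 0.4624$$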
We obtain the same interval (0.0104,0.4624). This correlation of r=0.25 with n=67 would be deemed statistically significant at the α=0.05 level since the interval does not contain zero.
This does NOT necessarily mean that the correlation is clinically or practically important.
Unfortunately, the TI calculators do not have a function for computing confidence intervals for a correlation coefficient (I think because this is not typically covered in 100-level stats courses).
The confidence interval for ρ could be obtained with R Commander IF you already have a data set. You would go to Statistics, then Correlation test… Ignoring for now the hypothesis test, here’s output for the 95% CI for the correlation between calories and carbo in cereal, using the UScereal dataset.
##
## Pearson's product-moment correlation
##
## data: calories and carbo
## t = 10.183, df = 63, p-value = 6.217e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6745945 0.8660255
## sample estimates:
## cor
## 0.7887227
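Output like the above can also be produced at the R console without R Commander. This sketch assumes the UScereal data come from the MASS package (the slides do not say where the dataset lives) and that MASS is installed.

```r
library(MASS)  # assumption: UScereal is the dataset shipped with MASS

# Pearson correlation test; the CI is based on Fisher's z transformation
res <- cor.test(UScereal$calories, UScereal$carbo, conf.level = 0.95)

res$estimate  # sample correlation r
res$conf.int  # 95% CI for rho

# The same interval by hand, following the three steps from earlier
r_obs  <- unname(res$estimate)
n_obs  <- nrow(UScereal)
manual <- tanh(atanh(r_obs) + c(-1, 1) * qnorm(0.975) / sqrt(n_obs - 3))
manual
```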