
Z-statistic for a Confidence Level and Estimating a Confidence Interval

Before looking at what a Confidence Level is, it helps to understand what a Confidence Interval is. It is an estimate computed from sample data: a range of plausible values for an unknown parameter (e.g. the mean or standard deviation of a distribution).
In general, it is an interval for an unknown population parameter, built from the sampling distribution of the estimator.
Read the next few lines before we get to the calculation.

A confidence interval expresses how much uncertainty there is in a particular statistic. It is usually reported as an estimate plus or minus a margin of error. It tells you how confident you can be that the results from a poll or survey reflect what you would find if you could survey the entire population.

Intrinsically, a confidence interval is tied to a confidence level.
Confidence Level and Confidence Interval
Confidence level is expressed as a percentage (for example, a 90% confidence level). It means that if you repeated an experiment or survey over and over again and built an interval each time, about 90 percent of those intervals would contain the true population value. Confidence intervals are your results, usually a pair of numbers.

For example, you survey a group of students to see how many hours they study in a week. You compute an interval at the 99 percent confidence level and get (8, 14). That means you estimate that the average student studies between 8 and 14 hours a week. You can be very confident (99% is a very high level!) in the procedure that produced this range.
The confidence level is tied to alpha (ɑ), the significance level, which you are free to choose. With an ɑ of 10%, the corresponding confidence level is (1 - ɑ), i.e. 90%.

Let us take a practical example here:
Suppose a cold drink machine is adjusted to fill bottles with exactly 1 L. The machine cannot put exactly 1 L in each bottle, so the filled amount is a random variable X. This amount is assumed to be normally distributed around an average of 1 L with a standard deviation of 25 ml. To check whether the machine is correctly calibrated, we take a sample of 25 bottles.
        μ = 1 L
        Assuming x̅ (sample mean) = 0.997 L = 997 ml
        n = 25
If we took more samples of 25 bottles, each would give a slightly different sample mean, e.g. 1.02, 0.99, 0.98.
Further, in our case we can determine the confidence interval by noting that the sample mean of a normally distributed sample is itself normally distributed, with the same expectation μ but with a standard error of:
        σ/√n = 25/ 5 = 5 ml
Using the formula:
           Z = (x̅ - μ) / (σ/√n)
Z follows a standard normal distribution, so for a two-sided interval with ɑ = 0.05 we need the value z* with
P(Z ≤ z*) = 1 - ɑ/2 = 0.975
which gives a Z-statistic of z* = 1.96
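For readers wondering where 1.96 comes from, here is a minimal sketch using Python's standard library. It assumes the two-sided setup described above; the variable names are illustrative and not part of the original post.

```python
from statistics import NormalDist

alpha = 0.05                     # significance level used in the text
tail_prob = 1 - alpha / 2        # cumulative probability to the left of z*

# inv_cdf is the inverse of the standard normal CDF:
# it returns the z value that has tail_prob of the area to its left.
z_star = NormalDist().inv_cdf(tail_prob)

print(round(tail_prob, 3))   # 0.975
print(round(z_star, 2))      # 1.96
```

Splitting ɑ in half is what makes the interval two-sided: 2.5% of the area is left in each tail.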
Now the lower point (working in ml, so x̅ = 997):
LP = x̅ - 1.96*(σ/√n) = 997 - 1.96*5 = 987.2 ml
and the upper point:
UP = x̅ + 1.96*(σ/√n) = 997 + 1.96*5 = 1006.8 ml
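The same arithmetic can be written as a small helper. This is only a sketch under the assumptions of the example (known population standard deviation, normally distributed fills); the function name `z_confidence_interval` is made up for illustration.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) for a mean when the population sigma is known."""
    alpha = 1 - confidence
    z_star = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    standard_error = sigma / math.sqrt(n)          # 25 / 5 = 5 ml here
    margin = z_star * standard_error
    return sample_mean - margin, sample_mean + margin

# Values from the bottle example, all in ml.
lower, upper = z_confidence_interval(sample_mean=997, sigma=25, n=25)
print(f"{lower:.1f} ml to {upper:.1f} ml")   # 987.2 ml to 1006.8 ml
```

Since the target of 1000 ml lies inside this range, the sample gives no reason to think the machine is miscalibrated.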
This means that every time the measurements are repeated, the sample mean x̅ takes another value, and so do the endpoints. In 95% of the cases μ will lie between the endpoints calculated from that mean, but in 5% of the cases it will not.
A few very commonly used Z-statistic (z*) values:
Confidence level    z*
90%                 1.645
95%                 1.96
98%                 2.326
99%                 2.576
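These pairs can be sanity-checked by going the other way: take each z* and measure how much of the standard normal curve lies between -z* and +z*. A rough sketch, with `central_area` as a hypothetical helper name:

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

def central_area(z):
    """Probability that a standard normal value falls between -z and +z."""
    return std_normal.cdf(z) - std_normal.cdf(-z)

for z in (1.645, 1.96, 2.326, 2.576):
    print(f"z* = {z:<5}  central area = {central_area(z):.1%}")
```

Each printed area should round to the confidence level listed next to that z* value.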

Below is a plot for a standard normal distribution showing the share of the area under the curve (the probability) that lies between each +/- z-score. How to calculate the Z-score for a standard normal distribution, and how we get the above percentages, will be covered in upcoming posts.
 
I also recommend reading about two-tailed and one-tailed tests.

There is a common misconception regarding the confidence interval and the confidence level.
A 90% confidence level does not mean that for a given realized interval there is a 90% probability that the population parameter lies within the interval (i.e. a 90% probability that the interval covers the population parameter). Once an interval is calculated, it either covers the parameter value or it does not; it is no longer a matter of probability. The 90% probability relates to the reliability of the estimation procedure, not to a calculated interval.
In short, across repeated samples, about 90% of the interval estimates will contain the true population parameter.
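The "reliability of the procedure" reading can be illustrated with a small simulation. The sketch below reuses the bottle setup (true mean 1000 ml, standard deviation 25 ml, 25 bottles per sample) as an assumption; the function name, number of trials and seed are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage_rate(mu=1000.0, sigma=25.0, n=25, confidence=0.90,
                  trials=10_000, seed=0):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one sample of n bottles and build its interval.
        sample_mean = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if sample_mean - margin <= mu <= sample_mean + margin:
            hits += 1
    return hits / trials

rate = coverage_rate()
print(f"{rate:.3f}")   # should land close to 0.90
```

Any single interval in the loop either contains 1000 or it does not; the 90% only shows up as the long-run fraction over many samples.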

