who has written a children’s book and launched it on the market in two variants at the same time, at the same price. One version has a basic cover design, while the other has a high-quality cover design, which of course cost him more.
He then observes the sales for a certain period and gathers the data shown below.

Cover type        Sold   Not Sold   Total
Low-cost cover     320      180      500
High-cost cover    350      150      500
Total              670      330     1000

Now he comes to us and wants to know whether the cover design of his books has affected their sales.
From the sales data, we can observe that there are two categorical variables. The first is cover type, which is either high-cost or low-cost, and the second is sales outcome, which is either sold or not sold.
Now we want to know whether these two categorical variables are related or not.
We know that when we need to find a relationship between two categorical variables, we use the Chi-square test for independence.
In this scenario, we will use Python to apply the Chi-square test and calculate the chi-square statistic and p-value.
Code:
import numpy as np
from scipy.stats import chi2_contingency

# Observed data: rows = cover type (low-cost, high-cost), columns = (sold, not sold)
observed = np.array([
    [320, 180],
    [350, 150]
])

# correction=False disables Yates' continuity correction
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

print("Chi-square statistic:", chi2)
print("p-value:", p)
print("Degrees of freedom:", dof)
print("Expected frequencies:\n", expected)
Result:
(Output: chi-square statistic ≈ 4.07, p-value ≈ 0.043, degrees of freedom = 1, expected frequencies [[335, 165], [335, 165]])
The chi-square statistic is 4.07 with a p-value of 0.043, which is below the 0.05 threshold. This suggests that cover type and sales are statistically related.
Now we have obtained the p-value, but before treating it as a decision, we need to understand how we arrived at this value and what the assumptions of this test are.
Understanding this will help us decide whether the result we obtained is reliable or not.
Now let’s try to understand what the Chi-Square test actually is.
We have this data (the 2×2 table shown earlier).

By observing the data, we can see that sales for books with the high-cost cover are higher, so we might think that the cover worked.
However, in real life, numbers fluctuate by chance. Even if the cover has no effect and customers pick books randomly, we can still get unequal values.
Randomness always creates imbalances.
Now the question is, “Is this difference bigger than what randomness usually creates?”
Let’s see how the Chi-Square test answers that question.
We already have this formula to calculate the Chi-Square statistic:
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]
where:
χ² is the Chi-Square test statistic
i represents the row index
j represents the column index
Oᵢⱼ is the observed count in row i and column j
Eᵢⱼ is the expected count in row i and column j
First, let’s focus on Expected Counts.
Before understanding what expected counts are, let’s state the hypotheses for our test.
Null Hypothesis (H₀)
The cover type and sales outcome are independent. (The cover type has no effect.)
Alternative Hypothesis (H₁)
The cover type and sales outcome are not independent. (The cover type is associated with whether a book is sold.)
Now, what do we mean by expected counts?
Let’s say the null hypothesis is true, which means the cover type has no effect on the sales of books.
Let’s go back to probabilities.
As we already know, the formula for simple probability is:
\[
P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}}
\]
In our data, the overall probability of a book being sold is:
\[
P(\text{Sold}) = \frac{\text{Number of books sold}}{\text{Total number of books}} = \frac{670}{1000} = 0.67
\]
In probability, when we write P(A∣B), we mean the probability of event A given that event B has already occurred.
\[
\begin{aligned}
&\text{Under independence, cover type and sales are not related.} \\
&\text{This means the probability of being sold does not depend on cover type, which means} \\
&P(\text{Sold} \mid \text{Low-cost cover}) = P(\text{Sold}) \\
&P(\text{Sold} \mid \text{High-cost cover}) = P(\text{Sold}) \\
&P(\text{Sold}) = \frac{670}{1000} = 0.67 \\
&\text{Therefore, } P(\text{Sold} \mid \text{Low-cost cover}) = 0.67
\end{aligned}
\]
Under independence, we have P(Sold | Low-cost cover) = 0.67, which means 67% of low-cost cover books are expected to be sold.
Since we have 500 books with low-cost covers, we convert this probability into an expected number of sold books.
\[
0.67 \times 500 = 335
\]
This means we expect 335 low-cost cover books to be sold under independence.
Based on our data table, we can represent this as E₁₁.
Similarly, the expected value for the high-cost cover and sold is also 335, which is represented by E₂₁.
Now let’s calculate E₁₂ (low-cost cover, not sold) and E₂₂ (high-cost cover, not sold).
The overall probability of a book not being sold is:
\[
P(\text{Not Sold}) = \frac{330}{1000} = 0.33
\]
Under independence, this probability applies to each subgroup, as before.
\[
P(\text{Not Sold} \mid \text{Low-cost cover}) = 0.33
\]
\[
P(\text{Not Sold} \mid \text{High-cost cover}) = 0.33
\]
Now we convert this probability into the expected count of unsold books.
\[
E_{12} = 0.33 \times 500 = 165
\]
\[
E_{22} = 0.33 \times 500 = 165
\]
We used probabilities here to understand the idea of expected counts, but there is also a direct formula to calculate them. Let’s take a look at that too.
Formula to calculate Expected Counts:
\[
E_{ij} = \frac{R_i \times C_j}{N}
\]
Where:
- Rᵢ = row total
- Cⱼ = column total
- N = grand total
Low-cost cover, Sold:
\[
E_{11} = \frac{500 \times 670}{1000} = 335
\]
Low-cost cover, Not Sold:
\[
E_{12} = \frac{500 \times 330}{1000} = 165
\]
High-cost cover, Sold:
\[
E_{21} = \frac{500 \times 670}{1000} = 335
\]
High-cost cover, Not Sold:
\[
E_{22} = \frac{500 \times 330}{1000} = 165
\]
Either way, we get the same values.
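As a quick check, we can compute the expected counts in Python as well. This is a minimal sketch that rebuilds them from the margins of the observed table:

import numpy as np

observed = np.array([[320, 180],
                     [350, 150]])

row_totals = observed.sum(axis=1)   # [500, 500]
col_totals = observed.sum(axis=0)   # [670, 330]
grand_total = observed.sum()        # 1000

# E_ij = (row total x column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)  # [[335. 165.]
                 #  [335. 165.]]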
By calculating expected counts, what we are finding is this: the counts we would expect to see if the null hypothesis were true, that is, if the two categorical variables were independent.
Here, we have 1,000 books and we know that 670 are sold. Now we imagine randomly picking books and labeling them as sold.
After selecting 670 books, we check how many of them belong to the low-cost cover group and how many belong to the high-cost cover group.
If we repeat this process many times, we would obtain values around 335. Sometimes they might be 330 or 340.
We then consider the average, and 335 becomes the central point of the distribution if everything happens purely due to randomness.
This doesn’t mean the count must equal 335, but that 335 represents the natural center of variation under independence.
The Chi-Square test then measures how far the observed count deviates from this central value, relative to the variation expected under randomness.
We calculated the expected counts:
E₁₁ = 335; E₂₁ = 335; E₁₂ = 165; E₂₂ = 165

The next step is to calculate the deviation between the observed and expected counts. To do this, we subtract the expected count from the observed count.
\[
\begin{aligned}
\text{Low-Cost Cover \& Sold:} \quad & O - E = 320 - 335 = -15 \\[8pt]
\text{Low-Cost Cover \& Not Sold:} \quad & O - E = 180 - 165 = 15 \\[8pt]
\text{High-Cost Cover \& Sold:} \quad & O - E = 350 - 335 = 15 \\[8pt]
\text{High-Cost Cover \& Not Sold:} \quad & O - E = 150 - 165 = -15
\end{aligned}
\]
In the next step, we square the differences, because if we simply add the raw deviations, the positive and negative values cancel out, resulting in zero.
This would incorrectly suggest that there is no imbalance. Squaring solves the cancellation problem by letting us measure the magnitude of the imbalance, regardless of direction.
\[
\begin{aligned}
\text{Low-Cost Cover \& Sold:} \quad & (O - E)^2 = (-15)^2 = 225 \\[6pt]
\text{Low-Cost Cover \& Not Sold:} \quad & (15)^2 = 225 \\[6pt]
\text{High-Cost Cover \& Sold:} \quad & (15)^2 = 225 \\[6pt]
\text{High-Cost Cover \& Not Sold:} \quad & (-15)^2 = 225
\end{aligned}
\]
Now that we have calculated the squared deviations for each cell, the next step is to divide them by their respective expected counts.
This standardizes the deviations by scaling them relative to what was expected under the null hypothesis.
\[
\begin{aligned}
\text{Low-Cost Cover \& Sold:} \quad & \frac{(O - E)^2}{E} = \frac{225}{335} = 0.6716 \\[6pt]
\text{Low-Cost Cover \& Not Sold:} \quad & \frac{225}{165} = 1.3636 \\[6pt]
\text{High-Cost Cover \& Sold:} \quad & \frac{225}{335} = 0.6716 \\[6pt]
\text{High-Cost Cover \& Not Sold:} \quad & \frac{225}{165} = 1.3636
\end{aligned}
\]
Now, for every cell, we have calculated:
\[
\frac{(O - E)^2}{E}
\]
Each of these values represents the standardized squared contribution of a cell to the total imbalance. Summing them gives the overall standardized squared deviation for the table, known as the Chi-Square statistic.
\[
\begin{aligned}
\chi^2 &= 0.6716 + 1.3636 + 0.6716 + 1.3636 \\[6pt]
&= 4.0704 \\[6pt]
&\approx 4.07
\end{aligned}
\]
We obtained a Chi-Square statistic of 4.07.
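The same arithmetic takes only a few lines of Python. This sketch reuses the observed and expected arrays from earlier:

observed = np.array([[320, 180], [350, 150]])
expected = np.array([[335, 165], [335, 165]])

# sum of (O - E)^2 / E over all four cells
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)  # about 4.0706 (the 4.0704 above comes from rounding the cell values)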
How do we interpret this value?
After calculating the chi-square statistic, we compare it with the critical value from the chi-square distribution table for 1 degree of freedom at a significance level of 0.05.
For df = 1 and α = 0.05, the critical value is 3.84. Since our calculated value (4.07) is greater than 3.84, we reject the null hypothesis.
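If you don’t want to look this up in a printed table, scipy can produce the same critical value. A small sketch:

from scipy.stats import chi2 as chi2_dist

# inverse CDF at 1 - alpha gives the cut-off for the rejection region
critical_value = chi2_dist.ppf(0.95, df=1)
print(critical_value)  # about 3.84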
The chi-square test is complete at this point, but we still need to understand what df = 1 means and how the critical value of 3.84 is obtained.
This is where things start to get both interesting and slightly confusing.
First, let’s understand what df = 1 means.
‘df’ stands for Degrees of Freedom.
From our data,

We can call this a contingency table, and to be specific, it is a 2×2 contingency table, because it is defined by the number of categories in variable 1 as rows and the number of categories in variable 2 as columns. Here we have 2 rows and 2 columns.
We can observe that the row totals and column totals are fixed. This means that if one cell value changes, the other three must adjust accordingly to preserve these totals.
In other words, there is only one independent way the table can vary while keeping the row and column totals fixed. Therefore, the table has 1 degree of freedom.
We can also compute the degrees of freedom using the standard formula for a contingency table:
\[
df = (r - 1)(c - 1)
\]
where r is the number of rows and c is the number of columns.
In our example, we have a 2×2 table, so:
\[
df = (2 - 1)(2 - 1) = 1
\]
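In code, the degrees of freedom follow directly from the shape of the table. A minimal sketch, reusing the observed array from earlier:

r, c = observed.shape        # (2, 2)
df = (r - 1) * (c - 1)
print(df)  # 1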
We now have an idea of what degrees of freedom mean from the data table. But why do we need to calculate them?
Now, let’s imagine a four-dimensional space in which each axis corresponds to one cell of the contingency table:
Axis 1: Low-cost & Sold
Axis 2: Low-cost & Not Sold
Axis 3: High-cost & Sold
Axis 4: High-cost & Not Sold
From the data table, we have the observed counts (320, 180, 350, 150). We also calculated the expected counts under independence as (335, 165, 335, 165).
Both the observed and expected counts can be represented as points in this four-dimensional space.
So now we have two points in a four-dimensional space.
We already calculated the difference between the observed and expected counts: (−15, 15, 15, −15).
We can write it as −15(1, −1, −1, 1).
In the observed data,

Let’s say we increase the Low-cost & Sold count from 320 to 321 (a +1 change).
To keep the row and column totals fixed, Low-cost & Not Sold must decrease by 1, High-cost & Sold must decrease by 1, and High-cost & Not Sold must increase by 1.
This produces the pattern (1, −1, −1, 1).
Any valid change in a 2×2 table with fixed margins follows this same pattern multiplied by some scalar.
Under fixed row and column totals, many different 2×2 tables are possible. When we represent each table as a point in four-dimensional space, these tables lie on a one-dimensional straight line.
We can refer to the expected counts, (335, 165, 335, 165), as the center of that straight line, and let’s denote that point as E.
The point E lies at the center of the line because, under pure randomness (independence), these are the values we expect to observe.
We then measure how much the observed counts deviate from these expected counts.
We can observe that every point on the line is:
E + x(1, −1, −1, 1)
where x is any scalar.
From our observed data table, we can write it as:
O = E + (−15)(1, −1, −1, 1)
Similarly, every point can be written like this.
The vector (1, −1, −1, 1) defines the direction of the one-dimensional deviation space. We call it a direction vector. The scalar value just tells us how far to move in that direction.
Every valid table is obtained by starting at the expected table and moving some distance along this direction.
For example, any point on the line is (335 + x, 165 − x, 335 − x, 165 + x).
Substituting x = −15, the values become
(335 − 15, 165 + 15, 335 + 15, 165 − 15),
which simplifies to (320, 180, 350, 150).
This matches our observed table.
We can imagine that as x changes, the table moves in only one direction, along a straight line.
This means that the entire deviation from independence is controlled by a single scalar value, which moves the table along that line.
Since all tables lie along a one-dimensional line, the system has only one independent direction of movement. This is why the degrees of freedom equal 1.
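We can verify this one-dimensional structure numerically. A sketch, where the direction vector and x = −15 come from the discussion above:

import numpy as np

E = np.array([335, 165, 335, 165])       # expected table as a point in 4D
direction = np.array([1, -1, -1, 1])     # the only margin-preserving direction

O = E + (-15) * direction
print(O)  # [320 180 350 150], the observed table

# any scalar x gives a table with the same row and column totals
table = (E + 7 * direction).reshape(2, 2)
print(table.sum(axis=1), table.sum(axis=0))  # [500 500] [670 330]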
At this point, we know how to compute the chi-square statistic. As derived earlier, standardizing the deviation from the expected count and squaring it results in a chi-square value of 4.07.
Now that we understand what degrees of freedom mean, let’s explore what the chi-square distribution actually is.
Coming back to our observed data, we have 1,000 books in total. Out of these, 670 were sold and 330 were not sold.
Under the assumption of independence (i.e., cover type does not influence whether a book is sold), we can imagine randomly selecting 670 books out of 1,000 and labeling them as “sold.”
We then count how many of these selected books have a low-cost cover. Let this count be denoted by X.
If we repeat this experiment many times, as discussed earlier, each repetition would produce a different value of X, such as 321, 322, 326, and so on.
Now, if we plot these values across many repetitions, we can observe that they cluster around 335, forming a bell-shaped curve.
Plot: the sampling distribution of X under independence, an approximately normal bell curve centered at 335.
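We can reproduce this sampling distribution with a short simulation. A sketch; under independence, X follows a hypergeometric distribution (670 draws from 500 low-cost and 500 high-cost books):

import numpy as np

rng = np.random.default_rng(0)

# repeatedly draw 670 "sold" books and count how many have a low-cost cover
X = rng.hypergeometric(ngood=500, nbad=500, nsample=670, size=100_000)

print(X.mean())  # close to 335
print(X.std())   # close to 7.44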
From our observed data table, the number of low-cost, sold books is 320. The distribution shown above represents how values behave under independence.
We see that values like 334 and 336 are common, while 330 and 340 are somewhat less frequent. A value like 320 appears to be relatively rare.
But how do we judge this properly? To answer that, we must compare 320 to the center of the distribution, which is 335, and consider how wide the curve is.
The width of the curve reflects how much natural variation we expect under independence. Based on this spread, we can assess how frequently a value like 320 would occur.
For that, we need to perform standardization.
Expected value: \( \mu = 335 \)
Observed value: \( X = 320 \)
Difference: \( 320 - 335 = -15 \)
Standard deviation: \( \sigma \approx 7.44 \)
\[
Z = \frac{320 - 335}{7.44} \approx -2.0179
\]
So, 320 is about two standard deviations below the average.
What we calculated here is a Z-score: the Z-score of 320 is approximately −2.0179.
In the same way, if we standardize every possible value of X, the sampling distribution of X above is transformed into the standard normal distribution, with mean = 0 and standard deviation = 1.

Now, we already know that 320 is about two standard deviations below the average.
Z-score = −2.0179
We already computed a chi-square statistic equal to 4.07.
Now let’s square the Z-score:
Z² = (−2.0179)² = 4.0719, which matches our chi-square statistic (up to rounding).
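A quick numerical check of this relationship, using scipy (a sketch; the Z-score is the value computed above):

from scipy.stats import norm, chi2 as chi2_dist

z = -2.0179
print(z ** 2)  # 4.0719, essentially our chi-square statistic

# squaring a standard normal gives a chi-square with df = 1, so the
# two-sided normal tail equals the chi-square upper tail of z^2
print(2 * norm.sf(abs(z)))         # about 0.0436
print(chi2_dist.sf(z ** 2, df=1))  # the same probability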
If a standardized deviation follows a standard normal distribution, then squaring that random variable transforms the distribution into a chi-square distribution with one degree of freedom.

This is the curve obtained when we square a standard normal random variable Z. Since squaring removes the sign, both positive and negative values of Z map to positive values.
As a result, the symmetric bell-shaped distribution is transformed into a right-skewed distribution, which follows a chi-square distribution with one degree of freedom.
When the degrees of freedom equal 1, we don’t actually need to think in terms of squaring to make a decision.
There is only one independent deviation from independence, so we can standardize it and perform a two-sided Z-test.
Squaring simply turns that Z value into a chi-square value when df = 1. However, when the degrees of freedom are greater than 1, there are multiple independent deviations.
If we just add these deviations together, positive and negative values cancel out.
Squaring ensures that all deviations contribute positively to the total.
That is why the chi-square statistic always sums squared standardized deviations, especially when df is greater than 1.
We now have a clearer understanding of how the normal distribution is linked to the chi-square distribution.
Now let’s use this distribution to perform hypothesis testing.
Null Hypothesis (H₀)
The cover type and sales outcome are independent. (The cover type has no effect.)
Alternative Hypothesis (H₁)
The cover type and sales outcome are not independent. (The cover type is associated with whether a book is sold.)
A commonly used significance level is α = 0.05. This means we reject the null hypothesis only if our result falls within the most extreme 5% of outcomes under the null hypothesis.
From the Chi-Square distribution at df = 1 and α = 0.05, the critical value is 3.84.
The value 3.84 is the critical (cut-off) value. The area to the right of 3.84 equals 0.05, representing the rejection region.
Since our calculated chi-square statistic exceeds 3.84, it falls within this rejection region.

The p-value here is 0.043, which is the area to the right of 4.07.
This means that if cover type and sales were truly independent, there would be only a 4.3% chance of observing a difference this large.
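In code, the p-value is just the upper-tail area of the chi-square distribution beyond the observed statistic. A sketch, using the unrounded statistic from earlier:

p_value = chi2_dist.sf(4.0706, df=1)  # survival function = area to the right
print(p_value)  # about 0.0436, the 0.043 quoted above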
Now, whether these results are reliable or not depends on the assumptions of the chi-square test.
Let’s look at the assumptions for this test:
1) Independence of Observations
In this context, independence means that one book sale should not influence another. The same customer should not be counted multiple times, and observations should not be paired or repeated.
2) Data Must Be Categorical Counts
3) Expected Frequencies Should Not Be Too Small
All expected cell counts should generally be at least 5.
4) Random Sampling
The sample should represent the population.
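Assumption 3 is easy to check programmatically from the expected table that chi2_contingency returned in the code at the start. A minimal sketch:

# expected = [[335, 165], [335, 165]] from chi2_contingency above
print((expected >= 5).all())  # True, so the rule of thumb is satisfied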
Because all the assumptions are satisfied and the p-value (0.043) is below 0.05, we reject the null hypothesis and conclude that cover type and sales are statistically related.
At this point, you might be confused about something.
We spent a lot of time focusing on one cell, for example the low-cost books that were sold.
We calculated its deviation, standardized it, and used that to understand how the chi-square statistic is formed.
But what about the other cells? What about high-cost books, or the unsold ones?
The important thing to realize is that in a 2×2 table, all four cells are linked. Once the row totals and column totals are fixed, the table has only one degree of freedom.
This means the counts cannot vary independently. If one cell increases, the other cells automatically adjust to keep the totals consistent.
As we discussed earlier, we can think of all possible tables with the same margins as points in a four-dimensional space.
However, because of the constraints imposed by the fixed totals, these points do not spread out in every direction. Instead, they lie along a single straight line.
Every deviation from independence moves the table only along that one direction.
So, when one cell deviates by, say, +15 from its expected value, the other cells are automatically determined by the structure of the table.
The whole table shifts together. The deviation is not just about one number; it represents the movement of the entire system.
When we compute the chi-square statistic, we subtract the expected count from the observed count for every cell and standardize each deviation.
But in a 2×2 table, these deviations are tied together. They move as one coordinated structure.
This means that examining one cell is enough to understand how far the entire table has moved away from independence, and what distribution governs it.
Learning never ends, and there is still much more to explore about the chi-square test.
I hope this article has given you a clear understanding of what the chi-square test actually does.
In another blog, we will discuss what happens when the assumptions are not met and why the chi-square test may fail in those situations.
There was a small pause in my time series series. I realized that several topics deserved more clarity and careful thinking, so I decided to slow down instead of pushing forward. I’ll return to it soon with explanations that feel more complete and intuitive.
If you enjoyed this article, you can explore more of my writing on Medium and LinkedIn.
Thanks for reading!