Statistics
Chi-Squared Distribution
The chi-squared distribution is a fundamental concept in statistics. It is closely related to the normal distribution, and it's important that you first understand the normal distribution before diving into chi-squared. I have a tutorial on the normal distribution here. In short, the main difference between the two lies in their use cases: the chi-squared distribution underpins the chi-squared test, which compares two distributions of discrete random variables, while the normal distribution is used primarily for analyzing continuous random variables.
In this blog, I will further unpack the relationship between the two distributions and examine the definition and properties of chi-squared. Additionally, we’ll look at how to leverage the chi-squared distribution as data scientists. We will use a practical example from Major League Soccer (MLS) to illustrate how to compare two distributions of discrete random variables.
Formal Definition of Chi-Squared
If a random variable, X, has a standard normal distribution, then X² has a chi-squared distribution with 1 degree of freedom. The 1 degree of freedom signifies that we have 1 standard-normally distributed continuous random variable that we are squaring (hence the squared in chi-squared). Let’s leverage Python to visualize this concept.
import numpy as np
import plotly.graph_objects as go

# Square 1,000 draws from a standard normal distribution
x = np.random.normal(0, 1, 1000)
chi_squared_manual = x ** 2

fig = go.Figure()
fig.add_trace(go.Histogram(x=chi_squared_manual, name="X<sup>2</sup> manual", histnorm='probability density'))
fig.update_layout(title="Chi2 Distribution with 1 Degree Of Freedom",
                  width=800)
fig.show()
The relationship between the normal distribution and the chi-squared distribution goes even further. If X₁, …, Xₖ are independent standard normal random variables, then the sum of their squares, X₁² + … + Xₖ², follows a chi-squared distribution with k degrees of freedom. Let’s visualize this with Python.
df = 5

def get_chi_square(df):
    # Sum the squares of df independent standard normal samples
    total = np.zeros(1000)
    for i in range(df):
        x = np.random.normal(0, 1, 1000) ** 2
        total = total + x
    return total

chi_squared_manual = get_chi_square(df)

fig = go.Figure()
fig.add_trace(go.Histogram(x=chi_squared_manual, name="X<sup>2</sup> manual", histnorm='probability density'))
fig.show()
[IMG]
Properties of the X² Distribution
The X² distribution with k degrees of freedom has the following four properties:
- Mean — The mean of a chi-squared distribution is always equal to k.
- Standard Deviation — The variance is 2k, so the standard deviation is √(2k).
- Mode — The mode, or peak of the density (which is right-skewed rather than bell-shaped), is k − 2 for k ≥ 2, and 0 otherwise.
- PDF — The probability density function of a chi-squared distribution is a special case of the gamma distribution and is written in terms of the gamma function Γ(k/2), denoted as:
[IMG]
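As a quick sanity check, we can verify the mean and standard deviation properties empirically. This is my own sketch, not part of the original example; the choice of k = 5 and the use of NumPy's built-in chi-squared sampler are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5  # degrees of freedom (arbitrary choice for this check)
samples = rng.chisquare(k, size=200_000)

# Sample mean should land near k, and sample std near sqrt(2k) ~ 3.16
print(samples.mean())
print(samples.std())
```

With 200,000 draws, both estimates typically agree with the theoretical values to about two decimal places.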
The Chi-Squared Test
The main application of the chi-squared distribution is the chi-squared test. If you would like to get a foundational understanding of hypothesis testing, you can do so in my tutorial here. In this section, however, I will uncover what a chi-squared test is, why we use it, and how to conduct one with a real-world example.
A chi-squared test is a statistical test used to compare two distributions of discrete random variables. For our example, we’re looking at the number of goals scored by the home and away clubs for MLS soccer matches across the 2023 season. Our stakeholders want to know if the distribution of the number of goals scored for the home and away teams is the same. To answer this, we can conduct a chi-squared test. The steps to conduct this experiment are as follows:
1. Define the Null Hypothesis
Our null hypothesis in a chi-squared test is that the observed and expected distributions are the same, so for our case:
[IMG]
Null and Alternative Hypotheses for the Chi-Squared Test
[IMG]
Our null hypothesis is that the home team goal distribution is the same as the away team goal distribution. Our alternative hypothesis is that these distributions are not the same.
2. Build a Contingency Table
To test our null hypothesis, we need to build a contingency table with our observed and expected frequencies. A contingency table shows a multivariate frequency distribution in a tabular format. Let’s look at the distribution of goals scored by the home team and the away team across the 2023 MLS season:
import numpy as np
import pandas as pd

# Load the data
df = pd.read_csv('../../data/preprocessed.csv')

# Query for the 2023 MLS season
mls = df[(df["season"] == 2023) & (df["league_id"] == 253)]

# Create the contingency table: one row of goal-count frequencies
# for the home teams and one for the away teams
home = pd.DataFrame(mls.home_goals.astype('str').value_counts().sort_index()).transpose()
away = pd.DataFrame(mls.away_goals.astype('str').value_counts().sort_index()).transpose()
contingency_table = pd.concat([home, away])
contingency_table.index = ['home', 'away']
contingency_table.replace({np.nan: 0}, inplace=True)
contingency_table
[IMG]
3. Compute a Test Statistic
Now we have what we need to compute our test statistic. The formula for computing the test statistic for a chi-squared test is as follows:
[IMG]
In this formula, k is the number of categories, O is the observed counts, and E is the expected counts. We already have our observed counts from the contingency table above, but how do we calculate the expected counts? If we were to calculate them by hand, we would first total the rows and columns of our contingency table.
# Add row and column totals
contingency_table['total'] = contingency_table.sum(axis=1)
contingency_table.loc["total", :] = contingency_table.sum().values
contingency_table
[IMG]
The last row, “total”, gives the combined counts across both teams: 1,042 observations in all (521 matches, each contributing one home and one away observation). Under the null hypothesis, the expected count for each cell is its column total multiplied by its row total, divided by the grand total (bottom right-hand corner). Since both row totals are the same, 521, the home and away rows share identical expected counts.
For example, the expected count for the number of home goals under the null hypothesis:
[IMG]
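To make the rule concrete, here is a minimal sketch of the expected-count calculation on a small made-up 2×3 table (the numbers are invented for illustration and are not the MLS data):

```python
import numpy as np

# Made-up observed counts: 2 groups x 3 categories
observed = np.array([[30., 20., 10.],
                     [25., 15., 20.]])

# Expected count for each cell = (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
expected = row_totals * col_totals / observed.sum()

# Both rows come out as [27.5, 17.5, 15.] because the row totals are equal
print(expected)
```

Note that because both row totals are equal here (60 each), the two rows of expected counts are identical, mirroring the home/away situation above.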
We could calculate these expected frequencies across all categories, and then plug the results into our equation above to obtain our test statistic. Luckily, Scipy Stats has a function that performs this calculation for us:
from scipy.stats import chi2_contingency

# Exclude the "total" row and column before running the test
observed = contingency_table.loc[["home", "away"]].drop(columns="total").values
res = chi2_contingency(observed)
print(res.expected_freq)
print(res.statistic)
print(res.pvalue)
>>> array([[136.5, 177. , 119. , 59.5, 22. ,  5.5,  1.5],
...        [136.5, 177. , 119. , 59.5, 22. ,  5.5,  1.5]])
... 72.58570708122053
... ≈1.2e-13
As you can see, the expected frequencies confirm our manual calculation, and we have also successfully obtained our test statistic and p-value.
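If you want to double-check SciPy against the formula above, you can compute the statistic by hand as Σ (O − E)² / E and compare. Here is a small sketch with a made-up table (not the MLS data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up observed counts (2 x 3, so no Yates continuity correction applies)
observed = np.array([[30., 20., 10.],
                     [25., 15., 20.]])

res = chi2_contingency(observed)
manual_stat = ((observed - res.expected_freq) ** 2 / res.expected_freq).sum()

print(manual_stat)
print(res.statistic)  # matches the manual calculation
```

Note that SciPy applies a continuity correction only for 2×2 tables, so for larger tables the manual Pearson sum matches `res.statistic` exactly.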
4. Reject or Fail to Reject The Null Hypothesis
Remember, our test statistic is evidence against the null hypothesis. The higher the test statistic, the more evidence we have that the distribution of goals for the home and away teams differs. If our test statistic lies above the critical value, we can reject the null hypothesis. Let’s plot our test statistic against the critical value at the 95% confidence level:
from scipy.stats import chi2

# Degrees of freedom from the test result
dof = res.dof

# Get test stat and critical value
t_stat = res.statistic
t_alpha_05 = chi2.ppf(0.95, dof)

# Get pdf data (extend the x-axis far enough to show the test statistic)
x = np.linspace(0, 80, 500)
y = chi2.pdf(x, dof)

# Visualize
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, mode="lines", name="pdf"))
fig.add_trace(go.Scatter(x=[t_alpha_05, t_alpha_05], y=[0, 0.15], mode="lines", name="Critical Value (0.95)"))
fig.add_trace(go.Scatter(x=[t_stat, t_stat], y=[0, 0.15], mode="lines", name="Test Statistic"))
fig.update_layout(showlegend=True)
fig.show()
[IMG]
Conclusion
In this tutorial, we covered the basic intuition behind the chi-squared distribution. We now understand its relationship with the normal distribution as well as its key properties.
We learned its most practical application to the field of data science — hypothesis testing. With a chi-squared test, we can compare distributions and test their equivalence. By analyzing the goal counts of MLS matches, we were able to reject the null hypothesis that the distribution of the number of goals scored by the home and away teams is equal. There was strong evidence against this null hypothesis, indicating that there may indeed be a home-field advantage.
I’m a firm believer that sports provide a vehicle to make statistics, data science, and machine learning more easily digestible. There is no better way to learn than through relatable analogies and examples; however, the concepts I teach are industry-agnostic. You are now equipped with the skills to leverage the chi-squared distribution to test similarities between the distributions of any two discrete random variables. Happy coding!