Chi–squared test

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
import scipy.stats

food = pd.read_pickle("../data/processed/food")

We saw in the section on hypothesis testing that it is possible to perform a statistical test to see if there is an association or relationship between two or more variables. The type of hypothesis test that is appropriate to use depends largely on the level of measurement of your data.

We specified a null hypothesis that there is no association between the household reference person’s NS–SEC and housing tenure. Both NS–SEC and housing tenure are nominal (NS–SEC can be treated as ordinal in some circumstances) so the most appropriate test of association for this null hypothesis is the \(\chi ^ 2\) test, sometimes written chi–squared test, and usually pronounced ‘kai’ (to rhyme with dye) or ‘key’.

To carry out the \(\chi ^ 2\) test we can use the scipy.stats.chi2_contingency() function, which returns the test statistic, the \(p\) value, the degrees of freedom, and an array of expected frequencies, in that order:

In [2]:
scipy.stats.chi2_contingency(
    pd.crosstab(index = food.A094r, columns = food.A121r, margins = False)
)
Out[2]:
(530.06779152390982,
 2.4769639188654904e-109,
 8,
 array([[ 200.1322314 ,  165.61868687,  643.24908173],
        [ 121.78512397,  100.78282828,  391.43204775],
        [ 194.38016529,  160.85858586,  624.76124885],
        [  37.88429752,   31.3510101 ,  121.76469238],
        [ 309.81818182,  256.38888889,  995.79292929]]))

Test statistic

The test statistic summarises, roughly, how far the observed counts in the crosstab are from the counts we would expect if there were no association: the bigger the gap, the larger the statistic. In all my years of statistics I have never worked one of these out by hand, so don’t worry too much about this. It is needed, along with the degrees of freedom, to calculate the significance value; its value on its own is not needed for interpretation.
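If you are curious what the calculation looks like, here is a minimal sketch. The counts are made up purely for illustration (they are not the food data), and the variable names are my own. It computes the statistic by hand and compares it with what scipy.stats.chi2_contingency() returns:

```python
import numpy as np
import scipy.stats

# Illustrative counts only (NOT the food data): 2 rows x 3 columns
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

# Expected counts under no association: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-squared statistic: squared gaps, scaled by the expected count, summed over cells
chi2_by_hand = ((observed - expected) ** 2 / expected).sum()

# scipy should agree (no continuity correction is applied to a 2x3 table)
chi2, p, dof, expected_scipy = scipy.stats.chi2_contingency(observed)
print(chi2_by_hand, chi2)
```

The two printed numbers should agree, which is all the function is doing behind the scenes for the statistic.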

Degrees of freedom

The degrees of freedom are the number of independent pieces of information available to perform the test on (a bit like we saw earlier with the standard deviation, where the DOF used is \(n - 1\) because we set the population mean to be the sample mean). In a cross tab this is the number of rows minus 1, multiplied by the number of columns minus 1, in this case:

In [3]:
(5 - 1) * (3 - 1)
Out[3]:
8

This is because, in this example, once we know rows 1–4 we can calculate row 5 because we know the total. Similarly once we know columns 1–2 we can calculate column 3 because we know the total.
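To make the ‘once we know the totals’ idea concrete, here is a small sketch on a made-up 3x3 table (not the food data; all names are my own). It keeps only the top-left block of cells plus the row and column totals, then rebuilds the rest of the table from them:

```python
import numpy as np

# Illustrative 3x3 table of made-up counts (NOT the food data)
table = np.array([[12, 7, 9],
                  [5, 14, 6],
                  [8, 10, 11]])
row_totals = table.sum(axis=1)
col_totals = table.sum(axis=0)

# Pretend we only know the top-left (rows - 1) x (columns - 1) block, plus the totals
free_cells = table[:-1, :-1]
dof = free_cells.size

# Last column: each row total minus the cells already known in that row
last_col = row_totals[:-1] - free_cells.sum(axis=1)
top_rows = np.column_stack([free_cells, last_col])

# Last row: each column total minus the cells above it
last_row = col_totals - top_rows.sum(axis=0)
rebuilt = np.vstack([top_rows, last_row])

print(dof)
print((rebuilt == table).all())
```

Only the cells in the top-left block are free to vary; everything else is fixed by the totals, which is why the count of those cells is the degrees of freedom.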

We’re not interested in the test statistic or degrees of freedom directly, but these are used to calculate the \(p\) value.

p value

The \(p\) value tells us how likely we would be to observe a relationship or pattern at least as strong as the one we have by chance alone, that is, if there were really no association. We want the \(p\) value to be low, by convention below 0.05. A low value means we are unlikely to see the relationship we have by chance alone, so it is likely that there is a true relationship.

In this case the \(p\) value is so low it is returned in scientific notation. The value is roughly 2.5e-109, which means the decimal point is moved 109 places to the left, i.e.:

0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000025

So, essentially, zero (in fact it’s highly dubious that the p value is known to this level of accuracy, so we treat it as essentially zero). A \(p\) value this small means it is very, very unlikely that we would have observed the relationship we did just by chance, so we can say with some confidence that there is an association or link between NS-SEC and housing tenure.
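To see how the test statistic and degrees of freedom turn into a \(p\) value, here is a minimal sketch using the chi-squared distribution in scipy.stats. The two inputs are copied from Out[2]; the variable names are my own:

```python
import scipy.stats

chi2_statistic = 530.06779152390982  # test statistic from Out[2]
dof = 8                              # degrees of freedom from Out[2]

# sf is the survival function (1 - cdf): the probability of seeing a
# statistic at least this large if there were no association
p_value = scipy.stats.chi2.sf(chi2_statistic, dof)
print(p_value)
```

This should reproduce the tiny \(p\) value in Out[2].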

Assumptions

There are a few important assumptions we must satisfy to use a chi–squared test. One of these is to do with expected frequencies, which are used in calculating the actual chi–squared statistic. In calculating the chi–squared statistic we calculate the expected frequency for each cell. In our example we have 15 cells in our crosstab, so we calculate 15 expected frequencies.

Specifically, no expected frequency should be less than 1, and no more than 20% of expected frequencies should be less than 5. To calculate the expected frequency for each cell we use the formula:

\[E_{ij} = \frac{T_{i} \times T_{j}}{N}\]

where \(E_{ij}\) is the expected frequency of the cell in row \(i\) and column \(j\); \(T_i\) is the total of row \(i\); \(T_j\) is the total of column \(j\); and \(N\) is the table grand total. So the expected frequency for row 1, column 1 is:

\[E_{1, 1} = \frac{1009 \times 864}{4356}\]
In [4]:
(1009 * 864) / 4356
Out[4]:
200.13223140495867

This matches the first value in the array of expected frequencies returned by the scipy.stats.chi2_contingency() function.
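Rather than checking each cell by eye, the two rules of thumb can be checked in a couple of lines. This is a sketch only: the array is copied from Out[2], and the variable names are my own. It also rebuilds every expected frequency from the row and column totals using the formula above:

```python
import numpy as np

# Expected frequencies copied from Out[2]
expected = np.array([
    [200.1322314, 165.61868687, 643.24908173],
    [121.78512397, 100.78282828, 391.43204775],
    [194.38016529, 160.85858586, 624.76124885],
    [37.88429752, 31.3510101, 121.76469238],
    [309.81818182, 256.38888889, 995.79292929],
])

# Expected frequencies share their row and column totals with the observed table
row_totals = expected.sum(axis=1)
col_totals = expected.sum(axis=0)
n = expected.sum()

# E_ij = T_i * T_j / N for every cell at once
rebuilt = np.outer(row_totals, col_totals) / n

# Rule 1: no expected frequency below 1
none_below_1 = (expected >= 1).all()
# Rule 2: no more than 20% of expected frequencies below 5
share_below_5 = (expected < 5).mean()
print(none_below_1, share_below_5)
```

For our table the smallest expected frequency is about 31, so both assumptions are comfortably met.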

When run on a 2x2 contingency table, a chi–squared test is likely to produce \(p\) values that are too small (i.e. it’s more likely to give a false positive, or type I error). To correct this, scipy.stats.chi2_contingency() automatically applies Yates’s continuity correction when you perform a test on a 2x2 table. I’ve never worried about what this is or how it works (although Andy Field’s textbook, as usual, covers it in an accessible way); just know that it has been applied when reporting on a 2x2 table.
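If you want to see what difference the correction makes, scipy.stats.chi2_contingency() has a correction argument that can be switched off. The sketch below uses the 2x2 counts of NS–SEC by tenure from the odds ratio part of this section, typed in by hand; the variable names are my own:

```python
import numpy as np
import scipy.stats

# 2x2 counts (rows: nssec1, other; columns: owned, rented)
table = np.array([[763, 246],
                  [880, 714]])

# Default behaviour: Yates's correction is applied because the table is 2x2
chi2_c, p_c, dof_c, _ = scipy.stats.chi2_contingency(table)

# Same test with the correction switched off, for comparison
chi2_u, p_u, dof_u, _ = scipy.stats.chi2_contingency(table, correction=False)

print(chi2_c, p_c)
print(chi2_u, p_u)
```

The corrected statistic is slightly smaller, and its \(p\) value slightly larger, than the uncorrected one. With a sample this large the difference is too small to change the conclusion.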

Odds ratio

Determining that there is an association is all very well and good, but it tells us nothing of what the size of the effect is. For example, our hypothesis test has determined it is probable that there is an association between the employment grade of the household reference person and tenure, but it does not tell us how much more likely one group is than another to own their home.

For a 2x2 contingency table we can use the odds ratio. The odds ratio is the odds of the event of interest for one group divided by the odds of the event of interest for the other group. For tables larger than 2x2 (as in our case) it is common to see researchers re–state the association to produce a 2x2 table.

For our example of employment grade and housing tenure we can restate it so instead of just measuring an association between all employment grades and tenure types we can calculate the odds ratio of, say, professional and managerial respondents owning their home against all other NS–SEC grades. This results in a 2x2 contingency table:

In [5]:
# keep NS-SEC groups 1 to 3 only (removes unemployed and 'other' NS-SEC);
# .copy() avoids modifying a view of the original data frame
food = food[food.A094r <= 3].copy()

# label NSSEC: group 1 versus groups 2 and 3 combined
food["A094r"] = food.A094r.replace({1: "nssec1", 2: "other", 3: "other"})

# label tenure: codes 1 and 2 are rented, code 3 is owned
food["A121r"] = food.A121r.replace({1: "rented", 2: "rented", 3: "owned"})

pd.crosstab(index = food.A094r, columns = food.A121r, margins = False)
Out[5]:
A121r   owned  rented
A094r
nssec1    763     246
other     880     714

First we specify the odds of the professional and managerial group owning their own home. This is the number of professional respondents who own their home (763), divided by the number of professional respondents who do not own their home (246), or 763:246. We can calculate a single number to represent the odds by dividing:

\[Odds_{1} = \frac{763}{246}\]

or:

In [6]:
763 / 246
Out[6]:
3.1016260162601625

This means that, roughly, for every professional and managerial respondent who does not own their home there are just over three who do.

The odds of other NS–SEC respondents owning their home is again the ratio of them owning to not owning, or 880:714.

\[Odds_{other} = \frac{880}{714}\]

or:

In [7]:
880 / 714
Out[7]:
1.2324929971988796

This means that, among respondents who are not NS–SEC grade 1, for every one who does not own their home there is only just over one who does.

To calculate the ratio of these odds, we simply divide again:

\[OR = \frac{763 / 246}{880 / 714} \approx \frac{3.1}{1.2}\]

or:

In [8]:
(763 / 246) / (880 / 714)
Out[8]:
2.5165465631929043

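So the odds of an NS–SEC grade 1 respondent owning their home are about 2.5 times the odds for respondents of other NS–SEC grades. Note that this is a ratio of odds, not of probabilities, so it is not quite the same as saying they are 2.5 times more likely to own their home.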
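The whole calculation can also be done directly from the 2x2 table. This is a sketch that types the counts from Out[5] in by hand rather than reusing the food data frame, and the variable names are my own. It also prints the proportion of each group who own their home, to show how odds and proportions differ:

```python
import pandas as pd

# 2x2 counts copied from Out[5]
table = pd.DataFrame(
    {"owned": [763, 880], "rented": [246, 714]},
    index=["nssec1", "other"],
)

# Odds of owning for each group: owners divided by non-owners
odds = table["owned"] / table["rented"]

# Odds ratio: odds for NS-SEC grade 1 divided by odds for everyone else
odds_ratio = odds["nssec1"] / odds["other"]

# Proportion owning in each group: owners divided by the group total
proportion_owned = table["owned"] / table.sum(axis=1)

print(odds_ratio)
print(proportion_owned)
```

The odds ratio should match Out[8]. The proportions are closer together than the odds, which is why an odds ratio should be reported as a ratio of odds.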