How to Create a Contingency Table in Python

Contingency tables, also known as cross-tabulations or crosstabs, summarize how often each combination of values occurs across two or more categorical variables. They give a quick picture of how the variables relate and can help reveal interactions between them. In Python, pandas makes it easy to create contingency tables, and seaborn makes it easy to visualize them.

Setting Up

Before we begin, you should have Python installed on your system. Python 3.6 or newer is the minimum assumed here, although recent releases of pandas and seaborn require newer Python 3 versions, so an up-to-date installation is the safer choice. If you don’t have Python installed, you can get it from the official website or install the Anaconda distribution, which includes Python along with a suite of scientific libraries. The necessary libraries can be installed using pip:

pip install pandas seaborn matplotlib scipy

Creating a Contingency Table with Pandas

Pandas is a Python library for data manipulation and analysis. Its crosstab() function builds a contingency table directly from two or more columns of data.

Consider the following example:

import pandas as pd

# Creating a simple dataframe
data = {'Gender':['M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M'],
        'Pet': ['Cat', 'Dog', 'Dog', 'Cat', 'Rabbit', 'Dog', 'Rabbit', 'Cat', 'Dog', 'Dog', 'Cat']}
df = pd.DataFrame(data)

# Creating the contingency table
contingency_table = pd.crosstab(df['Gender'], df['Pet'])

print(contingency_table)

Here, crosstab() computes a simple cross-tabulation of the ‘Gender’ and ‘Pet’ columns of the DataFrame. The first argument supplies the rows and the second supplies the columns, so each cell holds the number of records with that combination of gender and pet.
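To make the counting concrete, the sketch below rebuilds the same table with groupby() and then reads a single cell with .loc. The names manual_table and male_dog_owners are illustrative only and are not part of the original example.

```python
import pandas as pd

data = {'Gender': ['M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M'],
        'Pet': ['Cat', 'Dog', 'Dog', 'Cat', 'Rabbit', 'Dog', 'Rabbit', 'Cat', 'Dog', 'Dog', 'Cat']}
df = pd.DataFrame(data)

contingency_table = pd.crosstab(df['Gender'], df['Pet'])

# Count the rows for each (Gender, Pet) pair, then pivot the pet types into columns
manual_table = df.groupby(['Gender', 'Pet']).size().unstack(fill_value=0)

# Compare the two tables cell by cell
print((manual_table == contingency_table).all().all())

# Look up one cell: the number of 'M' records with a 'Dog'
male_dog_owners = contingency_table.loc['M', 'Dog']
print(male_dog_owners)  # 3
```

Reading cells with .loc uses the category labels themselves, which keeps the lookup readable even when the table grows.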

Visualizing a Contingency Table with Seaborn and Matplotlib

Visualizations can make it easier to understand the relationship between variables. The seaborn library, which is a statistical data visualization library based on matplotlib, can help visualize the contingency table. Let’s plot a heatmap of the contingency table:

import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the heatmap
sns.heatmap(contingency_table, annot=True, cmap="YlGnBu")

plt.title('Contingency Table Heatmap')
plt.xlabel('Pet')
plt.ylabel('Gender')
plt.show()

In this code, seaborn’s heatmap() function is used to plot the contingency table. The ‘annot’ parameter is set to True for the function to annotate each cell with its value. The color scheme is set using the ‘cmap’ parameter.
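As a variation, the following sketch formats the annotations as integers and writes the figure to a file instead of opening a window. The Agg backend and the file name contingency_heatmap.png are assumptions made so the example runs without a display; remove the backend line and call plt.show() for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

data = {'Gender': ['M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M'],
        'Pet': ['Cat', 'Dog', 'Dog', 'Cat', 'Rabbit', 'Dog', 'Rabbit', 'Cat', 'Dog', 'Dog', 'Cat']}
df = pd.DataFrame(data)
contingency_table = pd.crosstab(df['Gender'], df['Pet'])

fig, ax = plt.subplots(figsize=(5, 3))

# fmt="d" prints the counts as integers; cbar_kws labels the color bar
sns.heatmap(contingency_table, annot=True, fmt="d", cmap="YlGnBu",
            cbar_kws={"label": "Count"}, ax=ax)
ax.set_title('Contingency Table Heatmap')

fig.tight_layout()
fig.savefig("contingency_heatmap.png", dpi=150)  # hypothetical output file name
```

Passing an explicit ax keeps the plot tied to one figure, which helps when several heatmaps are drawn in the same script.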

Creating a Contingency Table with Margins

To include the row and column totals in the contingency table, set the margins parameter of the crosstab() function to True:

# Creating the contingency table with margins
contingency_table = pd.crosstab(df['Gender'], df['Pet'], margins=True)

print(contingency_table)

Setting ‘margins’ to True adds an ‘All’ row and an ‘All’ column to the output. The ‘All’ row holds the total for each column, and the ‘All’ column holds the total for each row.
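The totals can be checked against the data, and the margin label can be changed with the margins_name parameter of crosstab(). The sketch below is a minimal illustration; the label 'Total' and the variable names table_with_totals and counts_only are arbitrary choices, not part of the original example.

```python
import pandas as pd

data = {'Gender': ['M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M'],
        'Pet': ['Cat', 'Dog', 'Dog', 'Cat', 'Rabbit', 'Dog', 'Rabbit', 'Cat', 'Dog', 'Dog', 'Cat']}
df = pd.DataFrame(data)

# Same table as before, but with the margins labelled 'Total' instead of 'All'
table_with_totals = pd.crosstab(df['Gender'], df['Pet'],
                                margins=True, margins_name='Total')
print(table_with_totals)

# The grand total should equal the number of rows in the DataFrame
print(table_with_totals.loc['Total', 'Total'] == len(df))

# Drop the totals again to recover the plain table of counts
counts_only = table_with_totals.drop(index='Total', columns='Total')
print(counts_only)
```

Keeping the plain counts in a separate variable is useful because statistical tests expect the observed counts only, without the totals.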

Chi-Square Test of Independence Using a Contingency Table

Beyond creating a contingency table, you may want to test whether the variables involved are independent of each other. The chi-square test of independence can help with this.

The chi2_contingency() function from the scipy library can be used to conduct this test:

from scipy.stats import chi2_contingency

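# Rebuild the table without margins: the 'All' totals must not be passed to the test
observed = pd.crosstab(df['Gender'], df['Pet'])

# Running the chi-square test of independence on the observed counts
chi2, p, dof, expected = chi2_contingency(observed)

print(f'Chi-square Statistic: {chi2:.3f}, p-value: {p:.3f}')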

The chi2_contingency() function returns the test statistic (chi2), the p-value of the test (p), degrees of freedom (dof), and the expected frequencies (expected).
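To show where these values come from, the sketch below recomputes the expected frequencies, the degrees of freedom, and the Pearson statistic by hand and compares them with the output of chi2_contingency(). The variable names ending in _manual are illustrative, and the table is built without margins so that only the observed counts enter the calculation.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

data = {'Gender': ['M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M'],
        'Pet': ['Cat', 'Dog', 'Dog', 'Cat', 'Rabbit', 'Dog', 'Rabbit', 'Cat', 'Dog', 'Dog', 'Cat']}
df = pd.DataFrame(data)
observed = pd.crosstab(df['Gender'], df['Pet'])  # no margins

counts = observed.to_numpy()
row_totals = counts.sum(axis=1)
col_totals = counts.sum(axis=0)
n = counts.sum()

# Expected count under independence: (row total * column total) / grand total
expected_manual = np.outer(row_totals, col_totals) / n

# Degrees of freedom: (number of rows - 1) * (number of columns - 1)
dof_manual = (counts.shape[0] - 1) * (counts.shape[1] - 1)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
chi2_manual = ((counts - expected_manual) ** 2 / expected_manual).sum()

chi2, p, dof, expected = chi2_contingency(observed)
print(np.allclose(expected, expected_manual))
print(dof == dof_manual)
print(np.isclose(chi2, chi2_manual))
```

With only 11 records, several expected counts in this example fall below 5, so the chi-square approximation is rough here and the p-value is best read as an illustration of the mechanics.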

In conclusion, contingency tables are a valuable tool in statistics and data analysis to understand the relationship between categorical variables. Python, with its rich ecosystem of data analysis libraries, provides an efficient and straightforward way to create and work with contingency tables.
