11. GSS 2020 data#
11.1. Overview#
In this section we perform data analysis on 2020 GSS data. We cover the following concepts:
Univariate/Bivariate analysis
Data cleaning
Data manipulation
ANOVA
Cross-tabulation and the chi-square test
11.1.1. Univariate/Bivariate analysis#
Univariate and bivariate analyses are essential steps in exploratory data analysis (EDA). They help you understand the distribution and relationships of variables in your dataset. Here’s a step-by-step guide for each:
11.1.1.1. Univariate Analysis#
Univariate analysis examines the distribution of a single variable. The goal is to understand the central tendency, dispersion, and shape of the variable’s distribution.
Steps:
Identify the Variable Type:
Categorical (e.g., gender, color): Non-numeric data that represent categories.
Numerical (e.g., age, income): Numeric data that can be discrete or continuous.
Descriptive Statistics:
Categorical Variables:
Frequency Distribution: Count the occurrences of each category.
Mode: The category with the highest frequency.
Numerical Variables:
Measures of Central Tendency: Mean, median, mode.
Measures of Dispersion: Range, variance, standard deviation, interquartile range (IQR).
Skewness & Kurtosis: To assess the shape of the distribution.
Visualization:
Categorical Variables:
Bar Chart
Pie Chart
Numerical Variables:
Histogram
Boxplot
Density Plot
Check for Outliers: (Numerical variables)
Use boxplots or z-scores to identify and assess outliers.
Assess Distribution: (Numerical variables)
Identify if the data is normally distributed or skewed using histograms and Q-Q plots.
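The univariate steps above can be sketched as follows. This is a minimal illustration on made-up data: the `toy` DataFrame and its `colour` and `income` columns are hypothetical and are not part of the GSS file.

```python
import pandas as pd

# Hypothetical toy data: one categorical and one numerical variable
toy = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "red", "blue"],
    "income": [30.0, 32.0, 35.0, 31.0, 33.0, 120.0],
})

# Categorical: frequency distribution and mode
colour_counts = toy["colour"].value_counts()
colour_mode = toy["colour"].mode().iloc[0]

# Numerical: central tendency, dispersion, and shape
income = toy["income"]
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
summary = {
    "mean": income.mean(),
    "median": income.median(),
    "std": income.std(),
    "iqr": iqr,
    "skew": income.skew(),
}

# Outliers by the 1.5 * IQR rule (the rule boxplot whiskers are based on)
outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]

print(colour_counts)
print(summary)
print(outliers)
```

The single large income value pulls the mean above the median and produces positive skewness, which is the pattern a histogram or boxplot would show visually.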
11.1.1.2. Bivariate Analysis#
Bivariate analysis examines the relationship between two variables. The aim is to identify patterns, correlations, or associations between them.
Steps:
Identify the Variable Types:
Both Categorical
Both Numerical
One Categorical and One Numerical
Descriptive Statistics:
Categorical vs. Categorical:
Cross-tabulation: Create a contingency table to observe the frequency of combinations of categories.
Chi-Square Test: Test the independence of the variables.
Numerical vs. Numerical:
Correlation Coefficient: Use Pearson (linear relationships) or Spearman (monotonic relationships, which may be non-linear) to measure the strength and direction of the relationship.
Covariance: Assess the direction of the relationship (positive or negative).
Categorical vs. Numerical:
Group Statistics: Calculate summary statistics (mean, median, etc.) for the numerical variable across different categories.
T-test or ANOVA: Test the difference in means between groups.
Visualization:
Categorical vs. Categorical:
Stacked Bar Chart
Mosaic Plot
Numerical vs. Numerical:
Scatter Plot
Line Chart (if time-series data)
Pair Plot (for multiple numerical variables)
Categorical vs. Numerical:
Boxplot (by category)
Violin Plot
Bar Plot with error bars
Identify Relationships:
Categorical vs. Categorical: Look for patterns or significant associations in the contingency table.
Numerical vs. Numerical: Look for trends, clusters, or correlations in the scatter plot.
Categorical vs. Numerical: Check for differences in distributions or central tendencies across categories.
Hypothesis Testing:
Conduct appropriate statistical tests to validate or reject hypotheses about the relationships between variables.
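A minimal sketch of the three bivariate cases, again on made-up data. The column names `group`, `owns_car`, `x` and `y` are hypothetical. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which is its definition and avoids any extra dependency.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "owns_car": ["yes", "no", "yes", "no", "no", "yes"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
toy["y"] = toy["x"] ** 3  # monotonic but non-linear in x

# Categorical vs. categorical: contingency table of counts
table = pd.crosstab(toy["group"], toy["owns_car"])

# Numerical vs. numerical: Pearson, and Spearman as Pearson on the ranks
pearson = toy["x"].corr(toy["y"])
spearman = toy["x"].rank().corr(toy["y"].rank())

# Categorical vs. numerical: summary statistics per group
group_stats = toy.groupby("group")["x"].agg(["mean", "median"])

print(table)
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
print(group_stats)
```

Because `y` increases with `x` but not along a straight line, the Spearman coefficient is 1 while the Pearson coefficient is below 1.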
11.1.2. Step 1:#
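If necessary, select your sample. For example, we keep respondents aged 25 and over, which this tutorial implements as AGEGR10 codes above 2 (confirm the age range behind each code in the codebook).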
import pandas as pd
df = pd.read_csv("/Users/amirrezamousavi/Desktop/lisa/Microdata Coding Project/gss2020_small_new.csv")
filtered_df = df[df['AGEGR10'] > 2].copy()  # copy so new columns can be added to the subset safely
We can run a frequency table for AGEGR10 to check that the sample selection was done correctly.
frequency_table = filtered_df['AGEGR10'].value_counts().sort_index()
print(frequency_table)
AGEGR10
3 6388
4 5920
5 6443
6 5742
7 3225
Name: count, dtype: int64
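The selection-and-check pattern can be tried on a small made-up frame. In this sketch the values of `AGEGR10` are hypothetical stand-ins for the real codes, and calling `.copy()` on the subset is a suggestion rather than something the steps above require: it makes the subset independent of the original frame, so adding columns later does not trigger pandas' chained-assignment warning.

```python
import pandas as pd

# Hypothetical toy frame; the codes 1-4 stand in for age-group codes
toy = pd.DataFrame({"AGEGR10": [1, 2, 3, 3, 4, 2, 1, 4]})

# Keep the rows above the cut-off and make the subset an independent copy
subset = toy[toy["AGEGR10"] > 2].copy()

# Check the selection: no excluded code should appear in the counts
counts = subset["AGEGR10"].value_counts().sort_index()
print(counts)

# Adding a column changes the copy only, not the original frame
subset["flag"] = 1
```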
We can also check to see if we have missing values:
print("number of missing values for each feature:",df.isna().sum())
number of missing values for each feature: PUMFID 0
WGHT_PER 0
AGEGR10 0
MARSTAT 0
PHSDFLG 0
..
LAN_01 0
LANHSD_C 0
LANCH_C 0
INC_C 0
FAMINC_C 0
Length: 264, dtype: int64
11.1.3. Step 2:#
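Removing missing values:
In pandas, the dropna() method removes missing values (i.e., NaN values) from a DataFrame or Series. The check above found no NaN values, so dropna() has nothing to remove at this point: in this file, missing responses are stored as reserved numeric codes (for example 9 in GEN_01), which are recoded to NaN in the steps below.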
# Remove rows with any missing values
df_cleaned = filtered_df.dropna()
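A minimal sketch of why reserved codes matter, using made-up data. The values 99 and 9 are assumed here to be "not stated" codes for illustration, so check the codebook for the real ones. The sketch converts them to NaN and then drops rows only for the variables used in the analysis, instead of calling dropna() on all columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 99 and 9 stand in for reserved "not stated" codes
toy = pd.DataFrame({
    "VISMIN_C": [1, 10, 99, 5, 10],
    "GEN_01": [3, 9, 2, 5, 1],
})

# isna() finds nothing, because the reserved codes are ordinary numbers
print(toy.isna().sum())

# Convert the reserved codes to NaN, column by column
recoded = toy.replace({"VISMIN_C": {99: np.nan}, "GEN_01": {9: np.nan}})
print(recoded.isna().sum())

# Drop rows only where the variables used in the analysis are missing
complete = recoded.dropna(subset=["VISMIN_C", "GEN_01"])
print(len(complete))
```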
To obtain the frequency distribution of the variable VISMIN_C:
# Get frequency distribution
frequency_distribution = filtered_df['VISMIN_C'].value_counts()
print(frequency_distribution)
VISMIN_C
10 15568
99 2003
5 1948
7 1600
6 1532
1 1300
2 1254
8 1136
3 636
9 446
4 295
Name: count, dtype: int64
# Create new variable vismin2 as a copy of vismin_c
import warnings
warnings.filterwarnings('ignore')
filtered_df['vismin2'] = filtered_df['VISMIN_C']
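The equivalent SPSS syntax is: recode vismin2 (1 thru 9=1)(10=0)(99=sysmis). The frequency table above shows that the reserved code in VISMIN_C is 99, not 999.
import numpy as np
# Recode values: 1-9 -> 1 (visible minority), 10 -> 0, anything else -> NaN
filtered_df['vismin2'] = filtered_df['vismin2'].replace({99: np.nan}) # Set 99 to NaN (sysmis)
filtered_df['vismin2'] = filtered_df['vismin2'].apply(lambda x: 1 if 1 <= x <= 9 else (0 if x == 10 else np.nan))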
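The same recode can also be written without a row-by-row `apply`. The sketch below is one possible vectorized alternative using `np.select`. The `codes` Series is made-up data that reuses the VISMIN_C code values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values using the same codes as VISMIN_C
codes = pd.Series([1, 5, 9, 10, 99, 10, 3])

# np.select checks the conditions in order and falls back to `default`
recoded = pd.Series(
    np.select(
        [codes.between(1, 9), codes == 10],  # conditions
        [1.0, 0.0],                          # value for each condition
        default=np.nan,                      # everything else is missing
    ),
    index=codes.index,
)
print(recoded.value_counts(dropna=False))
```

Any code that matches neither condition becomes NaN, so an unexpected value cannot slip into one of the two categories.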
In the next step we give a value label to each category of the new variable vismin2. The equivalent SPSS syntax is: value labels vismin2 0 'not a visible minority' 1 'visible minority'.
labels = {0: 'Not a Visible Minority', 1: 'Visible Minority'}
# Apply labels
filtered_df['vismin2_label'] = filtered_df['vismin2'].map(labels)
As an example, we reverse code the GEN_01 variable so that higher values indicate better health. The equivalent SPSS syntax is: recode genhealth_rev (1=5)(2=4)(3=3)(4=2)(5=1)(9=sysmis). value labels genhealth_rev 1 'poor' 2 'fair' 3 'good' 4 'very good' 5 'excellent'.
# Reverse code the variable
filtered_df['genhealth_rev'] = filtered_df['GEN_01'].replace({9: np.nan}) # Set 9 to NaN (sysmis)
filtered_df['genhealth_rev'] = filtered_df['genhealth_rev'].apply(lambda x: 6 - x if pd.notna(x) else np.nan)
# Define labels
labels = {1: 'Poor', 2: 'Fair', 3: 'Good', 4: 'Very Good', 5: 'Excellent'}
# Apply labels
filtered_df['genhealth_rev_label'] = filtered_df['genhealth_rev'].map(labels)
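Reverse coding generalizes to any scale as (lowest + highest) - value. The sketch below shows this on made-up responses and stores the labels as an ordered categorical, so that tables and plots keep the scale order. The `raw` Series and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy responses on a 1-5 scale, with 9 as the reserved code
raw = pd.Series([1, 2, 3, 4, 5, 9])

valid = raw.where(raw.between(1, 5))    # anything outside 1-5 becomes NaN
low, high = 1, 5
reversed_scores = (low + high) - valid  # 1<->5, 2<->4, 3 stays 3

# An ordered categorical keeps the labels in scale order
order = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
label_values = reversed_scores.map(dict(zip(range(1, 6), order)))
labels = pd.Categorical(label_values, categories=order, ordered=True)

print(reversed_scores.tolist())
print(labels)
```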
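Since ratio variables are limited in the GSS data, I recode SCF100_C (ordinal) into a ratio-level variable, numfriends.
filtered_df['numfriends'] = filtered_df['SCF100_C'].replace({9: np.nan}) # Assuming 9 indicates missing values
The grouped codes 10, 11 and 12 are then replaced with representative values. The equivalent SPSS syntax is: recode numfriends (10=14.5)(11=24.5)(12=30)(999=sysmis).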
filtered_df['numfriends'] = filtered_df['numfriends'].replace({
10: 14.5,
11: 24.5,
12: 30,
999: np.nan # Treat 999 as missing
})
To check that the recode worked as intended, generate a frequency distribution of numfriends and cross-tabulate it against the original SCF100_C. Each original code should map to exactly one new value.
# Frequencies
frequency_distribution = filtered_df['numfriends'].value_counts(dropna=False)
print("Frequencies of numfriends:")
print(frequency_distribution)
# Cross-tabulation
crosstab = pd.crosstab(filtered_df['numfriends'], filtered_df['SCF100_C'], rownames=['numfriends'], colnames=['SCF100_C'])
print("\nCross-tabulation of numfriends by SCF100_C:")
print(crosstab)
Frequencies of numfriends:
numfriends
2.0 5170
3.0 3916
5.0 3135
1.0 3067
4.0 2935
0.0 2689
14.5 2548
6.0 1706
8.0 827
7.0 532
24.5 523
NaN 449
30.0 221
Name: count, dtype: int64
Cross-tabulation of numfriends by SCF100_C:
SCF100_C 0 1 2 3 4 5 6 7 8 10 11 12
numfriends
0.0 2689 0 0 0 0 0 0 0 0 0 0 0
1.0 0 3067 0 0 0 0 0 0 0 0 0 0
2.0 0 0 5170 0 0 0 0 0 0 0 0 0
3.0 0 0 0 3916 0 0 0 0 0 0 0 0
4.0 0 0 0 0 2935 0 0 0 0 0 0 0
5.0 0 0 0 0 0 3135 0 0 0 0 0 0
6.0 0 0 0 0 0 0 1706 0 0 0 0 0
7.0 0 0 0 0 0 0 0 532 0 0 0 0
8.0 0 0 0 0 0 0 0 0 827 0 0 0
14.5 0 0 0 0 0 0 0 0 0 2548 0 0
24.5 0 0 0 0 0 0 0 0 0 0 523 0
30.0 0 0 0 0 0 0 0 0 0 0 0 221
For univariate analysis of a ratio variable such as numfriends, you can calculate various descriptive statistics including frequencies, mean, median, standard deviation, and mode:
# Frequencies
frequency_distribution = filtered_df['numfriends'].value_counts(dropna=False)
# Descriptive statistics
mean = filtered_df['numfriends'].mean()
median = filtered_df['numfriends'].median()
stddev = filtered_df['numfriends'].std()
mode = filtered_df['numfriends'].mode().values # mode() returns a Series, take the values
print("Frequencies of numfriends:")
print(frequency_distribution)
print("\nDescriptive Statistics for numfriends:")
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Standard Deviation: {stddev}")
print(f"Mode: {mode}")
Frequencies of numfriends:
numfriends
2.0 5170
3.0 3916
5.0 3135
1.0 3067
4.0 2935
0.0 2689
14.5 2548
6.0 1706
8.0 827
7.0 532
24.5 523
NaN 449
30.0 221
Name: count, dtype: int64
Descriptive Statistics for numfriends:
Mean: 4.750284205508086
Median: 3.0
Standard Deviation: 5.272239825076114
Mode: [2.]
Bivariate analysis: t-test (comparing the mean number of friends between visible minority and non-visible minority respondents).
from scipy import stats
# Split data into two groups
group1 = filtered_df[filtered_df['vismin2_label'] == 'Not a Visible Minority']['numfriends']
group2 = filtered_df[filtered_df['vismin2_label'] == 'Visible Minority']['numfriends']
# Perform the t-test
t_stat, p_value = stats.ttest_ind(group1, group2, nan_policy='omit')
print("T-test Results:")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")
T-test Results:
T-statistic: 10.102743758610377
P-value: 5.9640357996091385e-24
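The same test can be run on the numeric vismin2 codes instead of the labels, and it gives identical results. The equivalent SPSS syntax is: t-test groups vismin2(0 1)/variables numfriends.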
from scipy import stats
# Split data into two groups
group1 = filtered_df[filtered_df['vismin2'] == 0]['numfriends']
group2 = filtered_df[filtered_df['vismin2'] == 1]['numfriends']
# Perform the t-test
t_stat, p_value = stats.ttest_ind(group1, group2, nan_policy='omit')
print("T-test Results:")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")
T-test Results:
T-statistic: 10.102743758610377
P-value: 5.9640357996091385e-24
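By default, `ttest_ind` assumes that both groups have equal variances (Student's t-test). When that assumption is doubtful, Welch's variant (`equal_var=False`) is a common alternative. The sketch below compares the two on made-up samples and recomputes the Welch statistic by hand. The arrays `group_a` and `group_b` are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical toy samples for two groups
group_a = np.array([4.0, 5.0, 6.0, 5.0, 7.0, 6.0])
group_b = np.array([2.0, 3.0, 2.0, 4.0, 3.0, 1.0])

# Student's t-test (pooled variance), the default of ttest_ind
student = stats.ttest_ind(group_a, group_b)

# Welch's t-test, which does not assume equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# The Welch statistic by hand: difference in means over its standard error
se = np.sqrt(group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b))
t_by_hand = (group_a.mean() - group_b.mean()) / se

print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.4f}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.4f}")
print(f"By hand: t={t_by_hand:.3f}")
```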
11.2. ANOVA#
ANOVA, or Analysis of Variance, is a statistical technique used to determine whether there are statistically significant differences between the means of three or more independent groups. It tests whether at least one group mean differs from the others.
11.2.1. Key Concepts:#
Null Hypothesis (H0): Assumes that all group means are equal.
Alternative Hypothesis (H1): Assumes that at least one group mean is different.
F-statistic: The ratio of the variance between the group means to the variance within the groups. A higher F-statistic suggests that the group means are not all the same.
p-value: Helps determine the significance of the results. A p-value less than a chosen significance level (e.g., 0.05) leads to rejection of the null hypothesis.
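To make the F-statistic concrete, the sketch below computes it by hand for three small made-up groups: the between-group mean square divided by the within-group mean square. The data and variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy data: three groups of three observations
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n = len(groups), len(all_values)

# Between-group sum of squares: group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)  # k - 1 degrees of freedom
ms_within = ss_within / (n - k)    # n - k degrees of freedom
f_stat = ms_between / ms_within
print(f_stat)  # approximately 13 for this toy data
```

Here the between-group sum of squares is 26 and the within-group sum of squares is 6, so F = (26 / 2) / (6 / 6) = 13.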
11.2.2. Types of ANOVA:#
One-Way ANOVA: Compares the means of three or more independent groups based on one independent variable.
Two-Way ANOVA: Examines the influence of two different independent variables on the dependent variable and can also assess the interaction between these variables.
Repeated Measures ANOVA: Used when the same subjects are tested under different conditions (e.g., time points).
11.2.3. Example of One-Way ANOVA:#
Imagine a study where researchers want to test the effect of different diets on weight loss. The independent variable is the type of diet (e.g., Diet A, Diet B, Diet C), and the dependent variable is weight loss. ANOVA can determine if there is a statistically significant difference in weight loss between the diet groups.
11.2.4. Steps to Perform ANOVA:#
State the hypotheses: Formulate the null and alternative hypotheses.
Calculate the F-statistic: Use statistical software or formulas to compute the F-statistic.
Compare the F-statistic to the critical value: Determine whether to reject or fail to reject the null hypothesis.
Post hoc tests (if necessary): If ANOVA indicates a significant difference, post hoc tests (like Tukey’s HSD) can determine which specific groups are different.
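The steps above, including a post hoc test, can be sketched as follows on made-up weight-loss values for three diets. The numbers are invented for illustration, and `scipy.stats.tukey_hsd` is assumed to be available (it was added in SciPy 1.8).

```python
from scipy import stats

# Hypothetical weight-loss values (kg) for three diets
diet_a = [2.0, 2.5, 3.0, 2.2, 2.8]
diet_b = [2.1, 2.6, 2.9, 2.4, 2.7]
diet_c = [5.0, 5.5, 6.0, 5.2, 5.8]

# One-way ANOVA: are all three means equal?
anova = stats.f_oneway(diet_a, diet_b, diet_c)
print("F:", anova.statistic, "p:", anova.pvalue)

# Post hoc: pairwise comparisons, run only after a significant ANOVA
tukey = None
if anova.pvalue < 0.05:
    tukey = stats.tukey_hsd(diet_a, diet_b, diet_c)
    # tukey.pvalue[i, j] is the adjusted p-value for group i versus group j
    print(tukey.pvalue)
```

In this toy data, diets A and B are nearly identical while diet C differs from both, which is the kind of pattern the pairwise p-values are meant to reveal.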
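For our dataframe, we want to compare the mean number of friends between racial groups (the original VISMIN_C). Note that VISMIN_C still contains the code 99, which the code below treats as a group of its own.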
from scipy import stats
# Drop rows with NaN values in the relevant columns
anova_data = filtered_df[['numfriends', 'VISMIN_C']].dropna()
# Perform one-way ANOVA
anova_results = stats.f_oneway(
*[anova_data['numfriends'][anova_data['VISMIN_C'] == group] for group in anova_data['VISMIN_C'].unique()]
)
print("ANOVA F-statistic:", anova_results.statistic)
print("ANOVA p-value:", anova_results.pvalue)
ANOVA F-statistic: 20.016378312830568
ANOVA p-value: 2.0899985562466142e-37
11.2.5. Interpreting the results:#
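F-statistic: The ratio of between-group to within-group variance. The larger it is, the less plausible it is that all group means are equal. Here F is about 20.0.
p-value: If the p-value is below a chosen threshold (commonly 0.05), you can reject the null hypothesis that the means are equal across the groups. Here the p-value (about 2.1e-37) is far below 0.05, so we reject the null hypothesis.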
11.3. Crosstab#
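In this section we cross-tabulate a nominal variable (visible minority, VISMIN_C) with an ordinal variable (highest degree, ED_05).
Create the crosstab. With normalize='columns', each cell is a proportion of its column, so the table shows the distribution of ED_05 within each VISMIN_C category: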
import pandas as pd
crosstab = pd.crosstab(filtered_df['ED_05'], filtered_df['VISMIN_C'], margins=True, normalize='columns')
print(crosstab)
VISMIN_C 1 2 3 4 5 6 \
ED_05
1 0.133846 0.192982 0.114780 0.010169 0.109856 0.089426
2 0.287692 0.234450 0.283019 0.088136 0.141170 0.200392
3 0.041538 0.042265 0.075472 0.077966 0.044148 0.075065
4 0.116154 0.125997 0.183962 0.186441 0.109856 0.156005
5 0.048462 0.057416 0.051887 0.122034 0.069815 0.052219
6 0.190000 0.197767 0.130503 0.427119 0.289528 0.224543
7 0.172308 0.138756 0.144654 0.054237 0.221766 0.186031
99 0.010000 0.010367 0.015723 0.033898 0.013860 0.016319
VISMIN_C 7 8 9 10 99 All
ED_05
1 0.214375 0.069542 0.134529 0.096608 0.108337 0.109892
2 0.226250 0.174296 0.253363 0.213965 0.210185 0.212173
3 0.059375 0.037852 0.044843 0.109584 0.057913 0.085107
4 0.146250 0.112676 0.139013 0.202274 0.110834 0.170611
5 0.049375 0.049296 0.069507 0.053700 0.043934 0.054477
6 0.188750 0.257923 0.230942 0.179471 0.158263 0.195577
7 0.101875 0.292254 0.112108 0.121017 0.131303 0.141244
99 0.013750 0.006162 0.015695 0.023381 0.179231 0.030919
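The difference between a table of counts and a column-normalized table can be seen on a small made-up example. The `toy` frame and its `degree` and `vismin` columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "degree": ["hs", "hs", "ba", "ba", "ba", "hs", "ba"],
    "vismin": ["yes", "no", "yes", "no", "no", "no", "no"],
})

# Raw counts in each cell
counts = pd.crosstab(toy["degree"], toy["vismin"])

# Column proportions: each column sums to 1
col_props = pd.crosstab(toy["degree"], toy["vismin"], normalize="columns")

print(counts)
print(col_props)
print(col_props.sum(axis=0))
```

Proportions are convenient for comparing distributions across groups of different sizes, but they no longer carry the sample size, which matters for significance tests.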
Chi-Square Test
To assess the association between the two variables, you can perform a chi-square test.
from scipy.stats import chi2_contingency
chi2, p, dof, ex = chi2_contingency(crosstab.iloc[:-1, :-1])
print(f"Chi-Square Test Statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")
Chi-Square Test Statistic: 1.1965048877363103
P-value: 1.0
Degrees of Freedom: 60
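Interpret the Results:
Chi-Square Statistic: Measures how far the observed counts diverge from the counts expected if the two variables were independent.
P-value: If this value is below 0.05, it suggests a statistically significant association between the two variables.
Degrees of Freedom (DoF): (number of rows - 1) × (number of columns - 1) of the table passed to the test; here (7 - 1) × (11 - 1) = 60.
Caution: chi2_contingency expects a table of observed counts. The crosstab above was built with normalize='columns', so it holds proportions, and crosstab.iloc[:-1, :-1] drops the last row (ED_05 = 99) together with the All column. The statistic of about 1.2 and the p-value of 1.0 reflect that input and should not be read as evidence of no association. Run the test on a crosstab of raw counts (without normalize) to obtain a valid result.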
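Because `chi2_contingency` expects observed counts, it helps to see how the result changes when proportions are passed instead. The sketch below uses a made-up 2 × 3 table. The row and column names are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical table of observed counts
counts = pd.DataFrame(
    [[30, 10, 20], [20, 40, 30]],
    index=["degree_low", "degree_high"],
    columns=["group_a", "group_b", "group_c"],
)

# Test on counts: this is the valid use
chi2, p, dof, expected = chi2_contingency(counts)
print(f"Counts:      chi2={chi2:.3f}, p={p:.5f}, dof={dof}")

# The same table as column proportions loses the sample size, so the
# statistic shrinks and the p-value is no longer meaningful
props = counts / counts.sum(axis=0)
chi2_props, p_props, _, _ = chi2_contingency(props)
print(f"Proportions: chi2={chi2_props:.3f}, p={p_props:.5f}")
```

The counts give a statistic of about 16.7 with 2 degrees of freedom, while the proportions give about 0.33 for the same pattern, which is why the test should always be fed counts.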