
Exploring Key Predictors of HS CS Enrollment in Washington


Overview

The Jupyter notebook on GitHub, WACSKeyPredictors.ipynb, attempts to determine the key predictors of whether a given high school student is enrolled in a computer science course in Washington state.

The current study did not find a definitive model with real predictive value, and most of the tendencies it revealed are well known. For example, a male student is far more likely to be enrolled in a CS course than other students. However, the study did reveal the relative importance of school income and school size as powerful predictors of whether a student is enrolled in CS. These two factors are touched upon but not fully explored in reports such as the 2023 State of Computer Science Education report.

Neither I (Lawrence Tanimoto) nor E. Brink, who authored the Jupyter notebook and is a senior in Application Development at North Seattle College as of March 2024, have degrees in data science. While we believe this work is worth publishing despite the lack of definitive results, we welcome questions or comments on the methodology, analysis, or ideas for future investigation in preparation for a follow-up study when OSPI releases the 2022-23 report.

Data Source and Methodology

The 2021-22 K–12 Computer Science Education Data Summary Report published by Washington's Office of Superintendent of Public Instruction (OSPI) provides the following data for all public high schools:

  • How many students
  • How many CS student enrollments
  • How many male students
  • How many male CS student enrollments
  • How many female students
  • How many female CS student enrollments
  • How many Gender X students
  • How many Gender X CS student enrollments

with similar data for race/ethnicity, free and reduced lunch (FRL) status, English language learner (ELL) status, and disability status. Cross-tabulated data (e.g., how many Black female students) is not available. Despite these limitations, Washington is one of the leading states in providing detailed data about CS enrollments. Dashboards exploring this data further may be found at the CS for All Washington website, with further details in Washington State CS Education Dashboards (2021-22) on this site.

This study attempted to find the key predictors of student enrollment in computer science by creating synthetic students based on the implied probabilities in the summary report, one-hot encoding and engineering features, and then applying techniques such as decision trees, random forests, and logistic regression.
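The synthetic-student step can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code; the `school` dict, its field names, and the independent-sampling assumption are all ours, motivated by the fact that only marginal counts are published:

```python
import random

def synthesize_students(school, seed=0):
    """Generate synthetic students from a school's marginal counts.

    `school` is a dict of counts in the spirit of the OSPI summary,
    e.g. {"total": 1000, "male": 520, "low_income": 300, "cs_enrolled": 77}.
    Because cross-tabulated data is unavailable, each attribute is
    sampled independently from its implied marginal probability.
    """
    rng = random.Random(seed)
    total = school["total"]
    students = []
    for _ in range(total):
        students.append({
            "male": rng.random() < school["male"] / total,
            "low_income": rng.random() < school["low_income"] / total,
            "cs_enrolled": rng.random() < school["cs_enrolled"] / total,
        })
    return students

# A hypothetical school with a 7.7% CS enrollment rate:
students = synthesize_students(
    {"total": 1000, "male": 520, "low_income": 300, "cs_enrolled": 77})
```

Sampling attributes independently is a real limitation: any correlation between, say, gender and FRL status within a school is lost before modeling even begins.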

Initial Analysis

Initial analysis applied decision trees and logistic regression to synthetic student data without feature engineering or features based on the school that the student attended.

With so many features to consider, the decision tree quickly became too complex to be useful and ran into the “accuracy paradox”. While the initial decision tree was accurate 92.3% of the time, that was only because it always predicted that the student was NOT enrolled in CS. Since the underlying data had a CS enrollment rate of 7.7%, an accuracy of 92.3% was easy to achieve even though the decision tree was useless.

To avoid the “accuracy paradox”, we undersampled the synthetic students who were NOT enrolled in CS so that their number equaled the number of synthetic students who were enrolled in CS, and recreated the decision tree. However, the resulting decision tree with balanced classes was accurate only 52% of the time.
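The undersampling step can be sketched like this, again assuming the list-of-dicts representation from our earlier sketch rather than the notebook's actual data structures:

```python
import random

def undersample_majority(students, label="cs_enrolled", seed=0):
    """Balance classes by undersampling the majority (not-enrolled) class.

    `students` is a list of dicts with a boolean `label` key.
    """
    rng = random.Random(seed)
    positives = [s for s in students if s[label]]
    negatives = [s for s in students if not s[label]]
    # Keep every enrolled student; sample an equal number of the rest.
    balanced = positives + rng.sample(negatives, len(positives))
    rng.shuffle(balanced)
    return balanced
```

With a 7.7% enrollment rate this discards roughly 92% of the non-enrolled students, which is why the balanced dataset is so much smaller than the original.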

The same dilemma that occurred when creating decision trees re-occurred when using logistic regression. We could generate either:

  • a model that always predicted that a student would NOT enroll in CS, using the full synthetic student dataset, or
  • a very inaccurate model, using an undersampled dataset with balanced classes.

However, logistic regression did provide insight into which factors were more influential. Below is a list of the coefficients with an absolute value greater than 0.05:

Asian: 0.114
NA: 0.106 (race/ethnicity: non-applicable)
9: 0.077 (9th grader)
male: 0.064

Low Income: -0.055
female: -0.082
Disability: -0.085
HPI: -0.100  (race/ethnicity: Hawaiian/Pacific Islander)

and charting feature importance from the random forest results in the following:

Feature importance bar chart from random forest with balanced classes
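The coefficient filtering above can be reproduced with a few lines of Python. The values are the ones reported in the text; the small "White" coefficient is purely illustrative (our addition, so the filter has something to drop):

```python
# Coefficients as reported above; "White": 0.02 is illustrative only.
coefficients = {
    "Asian": 0.114, "NA": 0.106, "9": 0.077, "male": 0.064,
    "Low Income": -0.055, "female": -0.082, "Disability": -0.085,
    "HPI": -0.100, "White": 0.02,
}

def influential(coefs, threshold=0.05):
    """Return (name, value) pairs with |value| > threshold, largest first."""
    kept = [(k, v) for k, v in coefs.items() if abs(v) > threshold]
    return sorted(kept, key=lambda kv: kv[1], reverse=True)
```

Calling `influential(coefficients)` returns the eight coefficients listed above, from Asian (0.114) down to HPI (-0.100).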

Feature Engineering

In the hopes of creating a more predictive model, we implemented the following simplifications:

  • Simplified the seven one-hot race/ethnicity features (White, Black, Hispanic-Latinx, Asian, Native American, Hawaiian-Pacific Islander, Two-or-more, Non-applicable) into three (White, Asian, BIPOC or NA not including Asian)
  • Simplified the three one-hot gender features (male, female, gender X) into one (male)
  • Removed the grade-related features (9, 10, 11, 12) on the simplifying assumption that all students pass through those grades and that the other features do not change between grades. Of course, real-life exceptions exist.
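These simplifications can be sketched as a small mapping function. The field names are ours, and we assume one-hot 0/1 inputs as described above:

```python
def simplify_features(student):
    """Collapse the original one-hot features into the simplified set.

    `student` carries one-hot race/ethnicity and gender keys;
    grade features are simply dropped.
    """
    return {
        "White": student.get("White", 0),
        "Asian": student.get("Asian", 0),
        # Everything neither White nor Asian (Black, Hispanic-Latinx,
        # Native American, Hawaiian-Pacific Islander, Two-or-more, NA).
        "BIPOC_or_NA": int(not (student.get("White", 0)
                                or student.get("Asian", 0))),
        "male": student.get("male", 0),
    }
```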

In addition, we added the following features based on the school that each synthetic student attended:

School size:

Small (≤ 300 students) – 0
Medium (300 < students ≤ 1200) – 0.5
Large (> 1200 students) – 1

Note that the cutoffs for small/medium/large were arbitrarily chosen.

School income:

< 20% low income (LI) – 0
20–40% LI – 0.25
40–60% LI – 0.5
60–80% LI – 0.75
> 80% LI – 1.0
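These two engineered features can be sketched as follows, using the cutoffs given above. How exact boundary values (e.g., a school at precisely 40% low income) are binned is our assumption:

```python
def school_features(total_students, low_income_pct):
    """Encode school size and income as ordinal features.

    Size: small (<= 300) -> 0, medium (301-1200) -> 0.5, large -> 1.
    Income: percentage of low-income students, banded in 20-point steps.
    """
    if total_students <= 300:
        size = 0.0
    elif total_students <= 1200:
        size = 0.5
    else:
        size = 1.0

    if low_income_pct < 20:
        income = 0.0
    elif low_income_pct < 40:
        income = 0.25
    elif low_income_pct < 60:
        income = 0.5
    elif low_income_pct < 80:
        income = 0.75
    else:
        income = 1.0
    return {"school_size": size, "school_income": income}
```

Encoding these as ordinal values (rather than one-hot categories) preserves the natural ordering of the bins, which makes the resulting regression coefficients easier to interpret.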

Secondary Analysis

This new model could not escape the “accuracy paradox”. The decision tree, random forest, and logistic regression either always predicted that a synthetic student did not enroll in CS, achieving 92% accuracy when all the data was used, or were not very accurate (56%) when using balanced classes. In fact, the 56% accuracy was only obtained by removing the following features, which had low predictive power in this new model, from consideration: disability status, ELL status, and ‘BIPOC or NA and not Asian’.

However, this secondary analysis revealed the relative importance of the features related to the school that the synthetic student attended (school income level, school size) in predicting whether a student is likely to be enrolled in CS in Washington’s public schools.  

Logistic regression using balanced classes resulted in the following coefficients:

school_size: 0.971
Asian: 0.284
male: 0.090
low_income: 0.005
White: -0.035
school_income: -1.01

and charting the random forest feature importance gives the following:

Feature importance from random forest including school-based features

Closing Thoughts

As we were new to this type of analysis, we were disappointed that we could not find a model that could determine whether a student was enrolled in CS with both accuracy and predictive power.

However, the relative strength of features related to the school the student attended (school income, school size) as predictors was interesting and a little surprising. Washington’s relative weakness in providing a CS elective in small schools has been noted: while 95% of Washington’s large high schools provide a CS class, only 35% of its small high schools do (2023 State of CS Education, pg 101). And while Washington’s relative strength in making CS available in schools with high FRL rates has been lauded (2023 State of CS Education, pg 41), this accessibility has unfortunately not translated into significantly higher participation rates among FRL students.

We expect OSPI to release data for 2022-23 by early summer of 2024 and plan to use the lessons learned from this initial attempt to improve our analysis of the key predictors of HS CS participation in Washington when that data arrives.