Overview
The Jupyter notebook on GitHub, WACSKeyPredictors.ipynb, attempts to determine the key predictors of whether a given high school student is enrolled in a computer science course in Washington state.
The current study failed to find a definitive model with real predictive value. Most of the tendencies it revealed were already well known. For example, a male student is far more likely to be enrolled in a CS course than a student who is not male. However, the study revealed the relative importance of school income and school size as powerful predictors of whether a student is enrolled in CS. These two factors are touched upon but not fully explored in reports such as the 2023 State of Computer Science Education report.
Neither I (Lawrence Tanimoto) nor E. Brink, who authored the Jupyter notebook and is a senior in Application Development at North Seattle College as of March 2024, has a degree in data science. While we believe that this work is worth publishing despite the lack of definitive results, we welcome any questions or comments on the methodology or analysis, as well as ideas for future investigation, in preparation for a follow-up study once OSPI releases the 2022-23 report.
Data Source and Methodology
The 2021-22 K–12 Computer Science Education Data Summary Report published by the Washington Office of Superintendent of Public Instruction (OSPI) provides the following data for all public high schools:
- How many students
- How many CS student enrollments
- How many male students
- How many male CS student enrollments
- How many female students
- How many female CS student enrollments
- How many Gender X students
- How many Gender X CS student enrollments
with similar data for race/ethnicity, free and reduced lunch (FRL) status, English language learner (ELL) status, and disability status. Cross-sectional data (e.g. how many black female students) is not available. Despite some limitations, Washington is one of the leading states in providing detailed data about CS enrollments. Dashboards exploring this data further may be found at the CS for All Washington website with further details in Washington State CS Education Dashboards (2021-22) on this site.
This study attempted to find the key predictors of student enrollment in computer science by creating synthetic students based on the implied probabilities in the summary report, one-hot encoding features and doing feature engineering, and then applying techniques such as decision trees, random forests, and logistic regression.
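To make the idea of synthetic students concrete, here is a minimal sketch for a single dimension (gender) at one school. The counts, the `school` structure, and the function names are hypothetical and are not taken from the OSPI report or the notebook, which may build its synthetic students differently.

```python
import random

# Hypothetical aggregate counts for one school (illustrative only).
school = {
    "male": {"students": 520, "cs": 60},
    "female": {"students": 480, "cs": 20},
}

def implied_rate(group):
    """Share of a group's students with a CS enrollment."""
    return group["cs"] / group["students"]

def synthetic_students(school, seed=0):
    """Create one record per student, enrolled in CS with the group's implied rate."""
    rng = random.Random(seed)
    rows = []
    for gender, group in school.items():
        p = implied_rate(group)
        for _ in range(group["students"]):
            rows.append({"gender": gender, "cs": int(rng.random() < p)})
    return rows

rows = synthetic_students(school)
```

Because enrollment is drawn at random, the synthetic CS counts will only approximate the reported counts for any one school.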
Initial Analysis
Initial analysis applied decision tree and logistic regression analysis to synthetic student data without feature engineering or features based on the school that the student attended.
With so many features to consider, the decision tree quickly became too complex to be useful and ran into the “accuracy paradox”. While the initial decision tree was accurate 92.3% of the time, it was only because it always predicted that the student was NOT enrolled in CS. Since the underlying data had CS enrollment at 7.7%, accuracy of 92.3% was easy to achieve even if the decision tree was useless.
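The arithmetic behind the paradox can be checked with a small sketch. It uses an illustrative set of 1,000 labels with the same 7.7% enrollment rate and a "model" that always predicts NOT enrolled; the helper functions are our own and are not from the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of truly enrolled students that the model identifies."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / sum(y_true)

# 77 enrolled (1) and 923 not enrolled (0), i.e. 7.7% enrollment.
y_true = [1] * 77 + [0] * 923
# A useless model that always predicts "not enrolled".
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)  # 0.923
rec = recall(y_true, y_pred)    # 0.0
```

The model scores 92.3% accuracy while identifying none of the enrolled students, which is why accuracy alone is misleading here.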
To avoid the “accuracy paradox”, we undersampled the synthetic students who were NOT enrolled in CS so that their number equaled the number of synthetic students who were enrolled in CS, and then recreated the decision tree. However, the resulting decision tree with balanced classes was only accurate 52% of the time.
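Balancing the classes in this way can be sketched as follows. The record layout (a dict with a `cs` label) and the `undersample` helper are assumptions for illustration; the notebook may use a library routine instead.

```python
import random

def undersample(rows, label="cs", seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label] == 1]
    neg = [r for r in rows if r[label] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)  # avoid having all rows of one class first
    return balanced

# Illustrative data with 7.7% enrollment.
students = [{"cs": 1} for _ in range(77)] + [{"cs": 0} for _ in range(923)]
balanced = undersample(students)
```

With balanced classes, a model that always predicts one class scores only 50%, so the 52% figure is barely better than that baseline.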
Logistic regression ran into the same dilemma as the decision trees, forcing a choice between:
- A model that always predicted that a student would NOT enroll in CS, when trained on the full synthetic student dataset, and
- A very inaccurate model, when trained on an undersampled dataset with balanced classes.
However, logistic regression did provide insight into which factors were more influential. Below are the coefficients with an absolute value greater than 0.05:
NA: 0.106 (race/ethnicity: non-applicable)
9: 0.077 (9th grader)
male: 0.064
…
Low Income: -0.055
female: -0.082
Disability: -0.085
HPI: -0.100 (race/ethnicity: Hawaiian/Pacific Islander)
Charting feature importance from the random forest results in the following:
Feature Engineering
In the hopes of creating a more predictive model, we implemented the following simplifications:
- Simplified the eight one-hot race/ethnicity features (White, Black, Hispanic-Latinx, Asian, Native American, Hawaiian-Pacific Islander, Two-or-more, Non-applicable) into three (White, Asian, BIPOC or NA not including Asian)
- Simplified the three one-hot gender features (male, female, gender X) into one (male)
- Removed the grade-related features (9, 10, 11, 12) from consideration, on the simplifying assumption that all students have the opportunity to be in those grades and that other features do not change between grades. Of course, real-life exceptions do exist.
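The three simplifications above can be sketched as a single function. The student record layout and the feature name `BIPOC_or_NA_not_Asian` are hypothetical; the notebook's column names may differ.

```python
# Raw fields that are replaced or removed by the simplification.
DROPPED = {"race", "gender", "grade"}

def simplify(student):
    """Collapse race into three flags, gender into one, and drop the grade."""
    race = student["race"]
    # Keep any other features (e.g. low income, ELL, disability) unchanged.
    features = {k: v for k, v in student.items() if k not in DROPPED}
    features.update({
        "White": int(race == "White"),
        "Asian": int(race == "Asian"),
        "BIPOC_or_NA_not_Asian": int(race not in ("White", "Asian")),
        "male": int(student["gender"] == "male"),
    })
    return features

example = simplify(
    {"race": "Hispanic-Latinx", "gender": "female", "grade": 10, "low_income": 1}
)
```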
In addition, we added the following features based on the school that each synthetic student attended:
School size:
Medium (300 students < x <= 1200 students) – 0.5
Large (1200 students < x) – 1
Note that the cutoffs for small/medium/large were arbitrarily created.
School income:
(20 – 40 LI) – 0.25
(40 – 60 LI) – 0.5
(60 – 80 LI) – 0.75
(> 80 LI) – 1.0
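A minimal sketch of how these school-level scores might be computed is shown below. It makes several assumptions that the lists above do not state: that small schools and schools at or below 20 LI score 0, that LI is the percentage of low-income students, and that each income boundary belongs to the lower bin. The function names are our own.

```python
def school_size_score(n_students):
    """Map enrollment to the size score (small = 0.0 is an assumption)."""
    if n_students <= 300:
        return 0.0  # assumed value for small schools
    if n_students <= 1200:
        return 0.5
    return 1.0

def school_income_score(pct_low_income):
    """Map the low-income percentage to the income score (lowest bin = 0.0 is an assumption)."""
    if pct_low_income <= 20:
        return 0.0  # assumed value for the lowest bin
    if pct_low_income <= 40:
        return 0.25
    if pct_low_income <= 60:
        return 0.5
    if pct_low_income <= 80:
        return 0.75
    return 1.0

# Example: a 900-student school where 45% of students are low income.
size_score = school_size_score(900)
income_score = school_income_score(45)
```

Encoding the bins as evenly spaced numbers treats them as ordinal, so a single coefficient describes the trend across bins.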
Secondary Analysis
This new model could not escape the “accuracy paradox”. The decision tree, random forest, and logistic regression models either always predicted that a synthetic student did not enroll in CS, achieving 92% accuracy when all data was used, or were not very accurate (56%) when using balanced classes. In fact, the 56% accuracy was only obtained by removing the following features, which had low predictive power in this new model, from consideration: disability status, ELL status, and ‘BIPOC or NA and Not Asian’.
However, this secondary analysis revealed the relative importance of the features related to the school that the synthetic student attended (school income level, school size) in predicting whether a student is likely to be enrolled in CS in Washington’s public schools.
Logistic regression using balanced classes resulted in the following coefficients:
Asian: 0.284
male: 0.090
low_income: 0.005
White: -0.035
school_income: -1.01
Charting the random forest feature importance gives the following:
Closing Thoughts
As we were new to this type of analysis, we were disappointed that we could not find a model that could determine whether a student was enrolled in CS with both accuracy and predictive power.
However, the relative strength of features related to the school the student attended (school income, school size) as predictive features was interesting and a little surprising. Washington’s relative weakness in providing a CS elective in small schools has been noted: while 95% of Washington’s large high schools provide a CS class, only 35% of its small high schools do (2023 State of CS Education, pg 101). And while Washington’s relative strength in making CS available in schools with high FRL rates has been lauded (2023 State of CS Education, pg 41), this accessibility has unfortunately not translated into significantly higher participation rates among FRL students.
We expect OSPI to release data for 2022-23 by the early summer of 2024 and plan to use the lessons learned from this initial attempt to improve our analysis of the key predictors of HS CS participation in Washington when this data arrives.