Fork this data analysis from my GitHub here
Department of Education Dataset
When I married my wife Catie, we started our marriage with over $50K in debt. Based on various figures I’ve heard over the years, our debt levels were typical for college graduates in the US. We’ve been fortunate: steady employment and years of tough sacrifices enabled us to pay it off and enter a debt-free life, but we know that isn’t typical of people in their 20s.
This formative experience motivated me to pick up the College Scorecard Dataset from the US Department of Education and see what I can learn about student debt — what schools it comes from and who’s most likely to graduate with a lot of it.
Each line on the college scorecard represents a branch of a higher education institution. I use the word branch when I report them separately, but institution when I group affiliated branches together: for example, a main campus and its community branches.
Hundreds of columns then follow describing various performance metrics specific to higher education. These data come from federal reporting from the institutions, federal financial aid data, and tax information. The data do not claim to include private loans outside the federal financial aid process.
Privacy laws prevent reporting on things like debt and taxes at the individual level. These laws are important, but they might obscure our ability to understand a looming social and financial crisis as debt balances continue to soar.
Do first-generation college attendees borrow more than those born to college graduates? Can your sex or family income level predict how much you’ll borrow? Are high debt balances correlated with death rates?
I isolated the columns I wanted to study — specifically:
Notable observations:
This section is about how I had to edit the data in Python to make my analyses possible. If that sounds boring to you, skip down to the Analysis section.
Filtering Columns: Initially, the college scorecard was too large for my machine to import to a notebook. Fortunately, I was able to open the file in a spreadsheet to reduce the size. I removed all data prior to 2004, leaving 10 years for analysis. I also removed the columns relating to things I wasn’t interested in: Completion, Dropout, and Transfer rates, demographics besides the ones you’ll find below, figures related to loan types or grants, repayment rates, and other redundant metrics.
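I did this trimming by hand in a spreadsheet, but the same filtering can be sketched in Python by streaming the file one row at a time instead of loading it all at once. The snippet below is only an illustration: the column names (`year`, `branch_id`, `net_price`, `median_debt`, `completion_rate`) and the sample rows are made up and are not the real College Scorecard headers.

```python
import csv
import io

# Tiny stand-in for the real file. Column names and values are hypothetical,
# not the actual College Scorecard headers or data.
SAMPLE_CSV = """year,branch_id,net_price,median_debt,completion_rate
2003,1,9000,8000,0.55
2004,1,9500,8500,0.56
2013,2,-1200,12000,0.61
"""

# Columns to keep; everything else (here, completion_rate) is dropped.
KEEP_COLUMNS = ["year", "branch_id", "net_price", "median_debt"]


def filter_rows(source, first_year=2004, keep=KEEP_COLUMNS):
    """Stream rows one at a time, keeping only wanted years and columns."""
    for row in csv.DictReader(source):
        if int(row["year"]) >= first_year:
            yield {name: row[name] for name in keep}


# With the real data, `source` would be an open file handle instead.
filtered = list(filter_rows(io.StringIO(SAMPLE_CSV)))
print(filtered)
```

Because `filter_rows` is a generator, only one row is held in memory at a time, so the filtered rows could also be written straight back out to a smaller CSV.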
Imputing Values: Once the file was importable, additional issues surfaced. Some branches had a negative net price; while it may be reasonable to think that occasionally students get scholarships that exceed their tuition and costs and actually earn money by attending school, it’s not reasonable to assume that a branch’s average price would be negative for the entire student body. These values were raised to zero.
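Raising negative values to zero is a one-line operation. Here is a minimal sketch with made-up prices; the function name `floor_at_zero` is just for illustration, and leaving missing values (`None`) untouched is an assumption on my part rather than something the data requires.

```python
def floor_at_zero(net_prices):
    """Replace negative net prices with 0; keep missing values (None) as-is."""
    return [p if p is None else max(p, 0) for p in net_prices]


# Made-up average net prices for a handful of branches.
reported = [9500, -1200, None, 0, 14300]
cleaned = floor_at_zero(reported)
print(cleaned)  # [9500, 0, None, 0, 14300]
```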
Other Prominent Issues: My goal was to answer some demographic questions about the debt students leave school with. Branches did not report average debt, but they did report median debt, along with the count of students included in the median measurement and the same counts split by various demographics. Many of the demographic counts did not sum to the total as you’d expect (for example, the counts of males and females should sum to the count of total students, but did not), so I converted the counts to percentages.
By multiplying the median debt by the count of total students, I could theoretically arrive at a Total Debt amount for all students leaving the branch in a given year. In practice, the results were unreasonably large. Particularly problematic were online-only schools, which in some cases unashamedly report a separate branch for each state despite having no physical presence there. The study’s codebook was unclear on this, but after further investigation I concluded that institutions must have reported student counts in aggregate, and each branch was mistakenly given the total. For example, the University of Phoenix, a prominent online school, reported 71 branches, each listing exactly 279,901 students, which is clearly incorrect.
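To make these steps concrete, here is a rough sketch of three of them: turning demographic counts into shares, spotting institutions whose branches all repeat one identical student count, and estimating Total Debt after splitting the institution-wide count across branches. All rows and field names are invented for illustration, and the even split across branches is a simplifying assumption.

```python
from collections import defaultdict

# Made-up rows; field names and numbers are hypothetical.
branches = [
    {"institution": "Online U", "branch": "AZ", "students": 900,
     "male": 300, "female": 500, "median_debt": 10000},
    {"institution": "Online U", "branch": "CA", "students": 900,
     "male": 310, "female": 490, "median_debt": 12000},
    {"institution": "Online U", "branch": "TX", "students": 900,
     "male": 280, "female": 520, "median_debt": 11000},
    {"institution": "State College", "branch": "Main", "students": 400,
     "male": 190, "female": 200, "median_debt": 8000},
]


def demographic_shares(row, groups=("male", "female")):
    """Convert group counts into shares of the groups' own sum."""
    total = sum(row[g] for g in groups)
    return {g: row[g] / total for g in groups}


def institutions_with_repeated_counts(rows):
    """Find multi-branch institutions where every branch lists the same count."""
    counts = defaultdict(list)
    for row in rows:
        counts[row["institution"]].append(row["students"])
    return {name for name, values in counts.items()
            if len(values) > 1 and len(set(values)) == 1}


def add_total_debt(rows):
    """Split each count evenly across the institution's branches (assumption),
    then multiply by the branch's median debt."""
    n_branches = defaultdict(int)
    for row in rows:
        n_branches[row["institution"]] += 1
    enriched = []
    for row in rows:
        per_branch = row["students"] / n_branches[row["institution"]]
        enriched.append({**row,
                         "students_per_branch": per_branch,
                         "total_debt": per_branch * row["median_debt"]})
    return enriched


print(demographic_shares(branches[0]))
print(institutions_with_repeated_counts(branches))
for row in add_total_debt(branches):
    print(row["institution"], row["branch"], row["total_debt"])
```

Note that multiplying a median by a head count only approximates a total, since a median is not a mean; the sketch inherits that limitation.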
Dividing the student count by branch before multiplying it by the median debt gave much clearer results for the branch’s Total Debt. Using the student demographic percentages, I was able to “distribute” each branch’s Total Debt across demographics to get average debts for the following categories:
One of the many factors contributing to the student debt crisis is the rising price of higher education. Below is the range of net prices (annual tuition and expenses, less average scholarship/grants) for the most recent year of study (2013).
(Not sure how to read box plots? Click here!)
Private institutions are more expensive:
Though the difference is less pronounced, religious institutions tend to be more expensive than secular ones:
These price tags, in addition to various other factors, have contributed to a steadily increasing median debt for separating students for at least 10 years:
With the debt associated with each branch characterized with my chosen demographics, my questions can be examined:
Death rates were recorded for the students from each branch at the 2-year and 8-year mark. As seen below, I found no correlation between a branch’s median debt and the death rate of its former students. A correlation would have been a very concerning sign that rising debt was harming mental health or contributing to some other cause of death. At a minimum, I’m happy to report no meaningful correlation.
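For readers who want to run this kind of check themselves, below is a small sketch of a Pearson correlation coefficient between two columns. The numbers are invented and are not taken from the College Scorecard; the function `pearson_r` is written out by hand only so the formula is visible.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up branch-level values: median debt (USD) and death rate (fraction).
median_debt = [5000, 9000, 12000, 15000, 20000]
death_rate = [0.004, 0.002, 0.005, 0.003, 0.004]

r = pearson_r(median_debt, death_rate)
print(f"Pearson r = {r:.3f}")
```

A value near 0 means no linear relationship, while values near 1 or -1 mean a strong positive or negative one. Pearson's r only captures linear association, so a scatter plot is still worth a look.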
Since each school reported median debt figures for its students and noted how many of those students were male or female, I was able to calculate how much of its total debt for that year went to males and how much went to females. By dividing those amounts by the respective number of male and female students, I was able to calculate an average male/female debt by branch. The two distributions across all branches are shown below:
As a reminder, each row in this data set is a branch of an institution. Therefore, this is a distribution of the average individual debt obligations for the males/females at the branch level. This chart shows that at a majority of branches, men have significantly higher debt obligations than women upon separation (graduation or withdrawal).
Average debt for first-generation students was calculated using the same methodology as male/female students.
Despite having a lower mode and median, the mean debt for first-generation students is raised by a long right tail of extremely high debt balances that just aren’t present in the population of later-generation college students. This indicates that, on average, first-generation students tend to graduate with more debt.
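If it seems odd that a group can have a lower median but a higher mean, the toy example below shows how a few very large balances pull the mean up while barely moving the median. The debt figures are invented purely to illustrate the effect and say nothing about the real data.

```python
from statistics import mean, median

# Invented debt balances (USD). The first group has a long right tail.
first_gen = [8000, 9000, 9000, 10000, 11000, 60000, 90000]
later_gen = [10000, 11000, 12000, 12000, 13000, 14000, 15000]

for label, debts in (("first-gen", first_gen), ("later-gen", later_gen)):
    print(f"{label}: median={median(debts):.0f}, mean={mean(debts):.0f}")
```

In this made-up sample the first group has the lower median but the higher mean, which is the same pattern described above.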
Perhaps surprisingly, debt appears to be positively correlated with your family’s income bracket! This suggests that instead of using wealth to graduate without debt, families opt to use their wealth to get access to even more capital and pursue even more expensive educations for their children.
Originally published on Medium, this project was part of the preparatory section of the Thinkful Data Science boot camp. Questions & Feedback appreciated!