
Student Loans - who has more and where do they come from?

Greg Condit
Apr 12, 2019

Fork this data analysis from my github here

Department of Education Dataset

When I married my wife Catie, we started our marriage with over $50K in debt. Based on various figures I’ve heard over the years, our debt levels were typical for college graduates in the US. We’ve been fortunate: steady employment and extended, tough sacrifices over the years enabled us to pay it off and enter a debt-free life, but we know that isn’t typical of people in their 20s.

This formative experience motivated me to pick up the College Scorecard Dataset from the US Department of Education and see what I could learn about student debt: which schools it comes from and who’s most likely to graduate with a lot of it.

College Scorecard Description:

Each line on the college scorecard represents a branch of a higher education institution. I use the word branch when I report them separately, but institution when I group affiliated branches together: for example, a main campus and its community branches.

Hundreds of columns then follow describing various performance metrics specific to higher education. These data come from federal reporting from the institutions, federal financial aid data, and tax information. The data do not claim to include private loans outside the federal financial aid process.


Privacy laws prevent reporting on things like debt and taxes at the individual level. These laws are important, but they might obscure our ability to understand a looming social and financial crisis as debt balances continue to soar.

Do first-generation college attendees borrow more than those born to college graduates? Can your sex or family income level predict how much you’ll borrow? Are high debt balances correlated with death rates?

I isolated the columns I wanted to study — specifically:

  • Average net price per year, and the median debt for students who separate from the branch (“Separation” includes both graduation and withdrawal)
  • The count of students included in those measurements
  • Various institution demographics, and various student demographics for each branch

Notable observations:

  • Almost half of the 9,091 institutions analyzed are predominantly certificate granting

  • Despite the performance metric sources being primarily federal in nature, fewer than 23% of the branches are publicly controlled. 31% are Private Non-profit, a small portion of which are Religious. The largest category is secular for-profits, at 46% of the institutions graded.


This section is about how I had to edit the data in Python to make my analyses possible. If that sounds boring to you, skip down to the Analysis section.

Filtering Columns: Initially, the college scorecard was too large for my machine to import to a notebook. Fortunately, I was able to open the file in a spreadsheet to reduce the size. I removed all data prior to 2004, leaving 10 years for analysis. I also removed the columns relating to things I wasn’t interested in: Completion, Dropout, and Transfer rates, demographics besides the ones you’ll find below, figures related to loan types or grants, repayment rates, and other redundant metrics.
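I did this trimming by hand in a spreadsheet, but the same idea can be sketched in Python by streaming the file one row at a time, so the whole thing never has to fit in memory. This is only an illustration: the column names and sample values below are made up, and the real College Scorecard headers differ.

```python
import csv
import io

# Hypothetical column names, not the real College Scorecard headers.
KEEP_COLUMNS = ["branch_id", "year", "net_price", "median_debt"]
FIRST_YEAR = 2004  # drop everything before this year


def filter_rows(source, keep_columns=KEEP_COLUMNS, first_year=FIRST_YEAR):
    """Yield one trimmed dict per row, reading the source a line at a time."""
    for row in csv.DictReader(source):
        if int(row["year"]) >= first_year:
            yield {col: row[col] for col in keep_columns}


# Tiny in-memory stand-in for the full file (made-up values).
sample = io.StringIO(
    "branch_id,year,net_price,median_debt,transfer_rate\n"
    "1,2003,9000,8000,0.10\n"
    "1,2004,9500,8500,0.12\n"
    "2,2013,15000,12000,0.05\n"
)

trimmed = list(filter_rows(sample))
```

With a real file, `sample` would be replaced by an open file handle, and the trimmed rows could be written back out with `csv.DictWriter`.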

Imputing Values: Once the file was importable, additional issues surfaced. Some branches had a negative net price; while it may be reasonable to think that occasionally students get scholarships that exceed their tuition and costs and actually earn money by attending school, it’s not reasonable to assume that a branch’s average price would be negative for the entire student body. These values were raised to zero.
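Raising negative prices to zero is a one-line operation. Here is a minimal sketch with made-up numbers; leaving missing values (`None`) untouched is my assumption for the example, not something the dataset dictates.

```python
def floor_at_zero(prices):
    """Raise negative net prices to zero; leave missing values (None) as they are."""
    return [None if p is None else max(p, 0) for p in prices]


# Made-up net prices for five branches, one negative and one missing.
net_prices = [12500, -340, None, 0, 8200]
cleaned = floor_at_zero(net_prices)
```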

Other Prominent Issues: My goal was to answer some demographic questions about the debt students are leaving school with. Branches did not report average debt, but did report median debt. They reported the count of total students included in the median measurement, as well as the same counts split by various demographics. I found that many of the demographic counts did not sum to the total as you’d expect (for example, the count of males and the count of females would be expected to sum to the count of total students, but did not), so I converted the counts to percentages.

By multiplying the median debt by the count of total students, I could theoretically approximate a Total Debt amount for all students leaving the branch in a given year. In practice, the results I was getting were unreasonably large. Particularly problematic were online-only schools, which in some cases unashamedly report a different branch for each state despite having no physical presence. The study’s code book was unclear on this, but after further investigation I concluded that institutions must have reported student counts in aggregate, and each branch was mistakenly given the total. For example, the University of Phoenix, a prominent online school, reported 71 branches, and each listed exactly 279,901 students, which is clearly incorrect.

Dividing the student count by branch before multiplying it by the median debt gave much clearer results for the branch’s Total Debt. Using the student demographic percentages, I was able to “distribute” each branch’s Total Debt across demographics to get average debts for the following categories:

  • Male/Female
  • First-Generation Student / Not First-Generation Student
  • Family Income bracket in annual nominal dollars: under 30K, 30–75K, and 75K+
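The adjustment described above can be sketched roughly as follows. Everything here is illustrative: the field names and numbers are made up, the even split of an institution's student count across its branches is an assumption, and so is computing each share from the sum of the group counts rather than from the reported total.

```python
from collections import Counter

# Made-up rows; the field names are hypothetical.
branches = [
    {"institution": "Online U", "branch": "A", "median_debt": 10000,
     "students": 3000, "male_n": 1100, "female_n": 1650},
    {"institution": "Online U", "branch": "B", "median_debt": 10000,
     "students": 3000, "male_n": 1100, "female_n": 1650},
    {"institution": "Online U", "branch": "C", "median_debt": 10000,
     "students": 3000, "male_n": 1100, "female_n": 1650},
    {"institution": "Small College", "branch": "Main", "median_debt": 8000,
     "students": 500, "male_n": 240, "female_n": 240},
]

# How many branches each institution reported.
branch_counts = Counter(b["institution"] for b in branches)

for b in branches:
    # The group counts do not sum to the total, so turn them into shares.
    share_male = b["male_n"] / (b["male_n"] + b["female_n"])

    # Assumption: the reported count is institution-wide, so split it evenly.
    b["students_per_branch"] = b["students"] / branch_counts[b["institution"]]

    # Approximate total debt, then distribute it by demographic share.
    b["total_debt"] = b["median_debt"] * b["students_per_branch"]
    b["male_debt"] = b["total_debt"] * share_male
    b["female_debt"] = b["total_debt"] * (1 - share_male)
```

In this toy data, each "Online U" branch ends up with a third of the reported 3,000 students instead of all of them, which is what keeps the Total Debt figure from being counted three times over.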

Descriptive Statistics


One of the many factors contributing to the student debt crisis is the rising price of higher education. Below is the range of net prices (annual tuition and expenses, less average scholarship/grants) for the most recent year of study (2013).

(Not sure how to read box plots? Click here!)

Interestingly, the 3 most extreme outliers are a photography school and 2 flight schools.
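For anyone curious how a box plot decides what counts as an outlier, a common convention is to flag anything more than 1.5 times the interquartile range beyond the quartiles. The sketch below uses that convention on made-up net prices; it is not necessarily the exact rule behind the chart above.

```python
from statistics import quantiles


def upper_outliers(values, whisker=1.5):
    """Return values above Q3 + whisker * IQR, the usual box-plot upper fence."""
    q1, _, q3 = quantiles(values, n=4)
    upper_fence = q3 + whisker * (q3 - q1)
    return [v for v in values if v > upper_fence]


# Made-up annual net prices, with one very expensive school.
net_prices = [8000, 9000, 10000, 11000, 12000, 13000, 14000, 80000]
outliers = upper_outliers(net_prices)
```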

Private institutions are more expensive:

Though it’s less decisive, religious institutions tend to be more expensive than secular ones:

These price tags, in addition to various other factors, have contributed to a steadily increasing median debt for separating students for at least 10 years:

With the debt associated with each branch characterized with my chosen demographics, my questions can be examined:

Are high debt balances correlated with death rates?

Death rates were recorded for the students from each branch at the 2-year and 8-year marks. As seen below, I found no correlation between the median debt for a branch and the death rate of its past students. A correlation would have been a very concerning suggestion that increasing debt was harming mental health or contributing to some other cause of death. At a minimum, I’m happy to report no meaningful correlation.
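One common way to put a number on "no correlation" is the Pearson correlation coefficient, which runs from -1 to 1 and sits near 0 when two measures don't move together linearly. Below is a minimal sketch with made-up branch figures; it illustrates the statistic and is not a reproduction of the analysis behind the chart.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: median debt and death rate for five branches.
median_debt = [5000, 9000, 12000, 15000, 20000]
death_rate = [0.010, 0.004, 0.012, 0.005, 0.009]

r = pearson_r(median_debt, death_rate)  # close to zero for this toy data
```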

Can your sex predict how much you’ll borrow for school?

Since each school reported median debt figures for its students and noted how many of those students were male or female, I was able to calculate how much of its total debt for that year went to males and how much went to females. By dividing those amounts by the respective numbers of male and female students, I was able to calculate an average male and female debt by branch. The two distributions across all branches are shown below:

As a reminder, each row in this data set is a branch of an institution. Therefore, this is a distribution of the average individual debt obligations for the males/females at the branch level. This chart shows that at a majority of branches, men have significantly higher debt obligations than women upon separation (graduation or withdrawal).

Do first-generation college attendees borrow more than those born to college graduates?

Average debt for first-generation students was calculated using the same methodology as male/female students.

Despite having a lower mode and median, the mean debt for First Generation students is raised by a long right tail of extremely high debt balances that just isn’t present in the population of latter-generation college students. This indicates that First Generation students tend to graduate with more debt on average.
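If it seems odd that a group can have a lower median but a higher mean, a toy example helps. The numbers below are made up purely to show how a single very large balance drags the mean upward while leaving the median alone.

```python
from statistics import mean, median

# Made-up average debts for five branches in each group.
latter_gen = [9000, 10000, 11000, 12000, 13000]
first_gen = [7000, 8000, 9000, 10000, 60000]  # one long-right-tail value

summary = {
    "latter_gen": {"median": median(latter_gen), "mean": mean(latter_gen)},
    "first_gen": {"median": median(first_gen), "mean": mean(first_gen)},
}
```

Here the first-generation group has the lower median (9,000 versus 11,000) but the higher mean (18,800 versus 11,000), all because of the one 60,000 balance.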

Can your family income level predict how much you’ll borrow for school?

Perhaps surprisingly, debt appears to be correlated positively with your family’s income bracket! This suggests that instead of using wealth to graduate without debt, families opt to use their wealth to get access to even more capital and pursue even more expensive educations for their children.

Further suggested research topics:


Male vs. Female:

  • Is this an income problem (men tend to have fewer resources to pay up front), or a selection problem (men tend to choose more expensive schools), or are there other factors driving this difference? By finding a source of application data (as opposed to ultimate enrollment), and by analyzing school prices against sex proportions, this could be investigated.
  • Do men leaving these schools have higher earnings or better repayment rates to offset their higher debt balances? The initial dataset has repayment and earnings data that could be aggregated to answer this question.
  • How do family dynamics and working tendencies of women vs. men in different cultures affect borrowing and repayment?

First Generation vs. Latter Generation:

  • Again, is this an income problem (first generation students come from less wealthy families) or a selection problem (they tend to choose more expensive schools), or are there other factors? By finding a source of application data (as opposed to enrollment), and by analyzing school prices against income demographics, this could be investigated.
  • Is there a parental experiential advantage? In other words, if your parents had a college experience, and can help you prioritize what is worth taking on debt for and what isn’t, is that advice enough to make a meaningful difference? If there is a way to hold constant other factors such as school price, parental income, and time spent in school, and still have a meaningful sample, any difference found between debt values might suggest such an advantage.

Originally published on Medium, this project was part of the preparatory section of the Thinkful Data Science boot camp. Questions & Feedback appreciated!
