Monday, July 20, 2015

Data Incubator Challenge Question


Credit risk analytics has recently come to the forefront of the financial sector.  With the recent financial crisis, increasing regulation, and greater competition, lending agencies must improve their credit risk modelling and remain adaptive in their strategies for handling credit risk.  Lending can be seen as a continuous series of experiments in which potential recipients (observations) ask for a loan, and the lender must form a hypothesis, given the characteristics of the potential recipient, about whether the recipient will be able to make payments and, if so, how much they will be able to pay back.  Each recipient contributes a valuable data point that informs the lender's future predictions.

It is common for lending agencies to rely on the judgment of underwriters when analyzing loan applications on a case-by-case basis.  While the human element is useful in making critical judgment calls, businesses often show marked improvement by relying on rigorous analytics of historical data.  Some lending agencies have begun to make their lending data, including the payment history of loans, publicly available for research and development.  I propose studying Lending Club's (www.LendingClub.com) loan data from the years 2007 to 2015 in order to examine the potential effects of changes in loan approval rates over time, and to improve the risk assessment of potential loan recipients.

The lending data available from Lending Club has several facets that make it suitable for a data science study: 1) any results found in this study can be immediately useful to other lenders seeking to minimize credit risk, 2) most lending agencies, in particular long-standing companies such as banks, will have massive amounts of potentially untapped data to learn from, and 3) the data, even when presented in a large tabular format, have the characteristics of Big Data sets: incompleteness, complexity (factor levels, numeric values, dates, and text fields), and large volume.  Furthermore, traditional statistical methods such as generalized linear models and classification trees are, on their own, insufficient for such problems because defaults on loans are rare.  In this sample dataset there are just 178 defaults out of 550,583 observations.  Standard statistical methodology alone would simply estimate the probability of default to be approximately 0 for all observations, so methodologies that combine statistics with other types of inference are necessary.
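To make the class imbalance concrete, here is a minimal sketch that uses only the two counts quoted above. It computes the overall default rate and the accuracy of a trivial rule that always predicts "no default"; the variable names are my own and not part of the dataset.

```python
# Counts quoted in the text above.
defaults = 178
total_loans = 550_583

# Overall (pooled) default rate in the sample.
base_rate = defaults / total_loans

# A trivial rule that always predicts "no default" is wrong only on the defaults.
trivial_accuracy = 1.0 - base_rate

print(f"default base rate: {base_rate:.6f}")
print(f"defaults per 10,000 loans: {base_rate * 10_000:.2f}")
print(f"accuracy of always predicting 'no default': {trivial_accuracy:.4%}")
```

With about 3 defaults per 10,000 loans, plain accuracy is a poor yardstick, because the trivial rule already scores above 99.9%. Any comparison against the grades therefore has to be made on the estimated default probabilities themselves.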

Now I present two plots to motivate the problem.  In the first, I plot the proportion of defaults out of the total number of loans each year.  There are two noticeable characteristics: first, there are no defaults on loans issued between 2007 and 2010, and second, there are proportionally more defaults in 2012 than in other years.  This raises two questions.  Are there no defaults between 2007 and 2010 because of the low overall number of loans in those years, or was Lending Club's underwriting better?  And was Lending Club's underwriting in 2012 worse than in other years, or is this a probabilistic anomaly?
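One rough way to frame both questions is to ask how surprising a year's default count would be if every loan defaulted independently at the pooled rate of 178 in 550,583. The sketch below assumes exactly that, which ignores loan age and changes in underwriting, so it is a sanity check rather than an analysis. The function names and the loan volumes in the loop are hypothetical, and the Poisson tail is a rare-event approximation to the binomial.

```python
import math

# Pooled default rate from the counts quoted in the text.
POOLED_RATE = 178 / 550_583

def prob_zero_defaults(n_loans, rate=POOLED_RATE):
    """P(no defaults among n_loans) if each loan defaults independently at `rate`."""
    return (1.0 - rate) ** n_loans

def max_loans_with_plausible_zero(alpha=0.05, rate=POOLED_RATE):
    """Largest n for which seeing zero defaults still has probability >= alpha."""
    return math.floor(math.log(alpha) / math.log(1.0 - rate))

def poisson_tail(k, expected):
    """P(X >= k) for X ~ Poisson(expected)."""
    term = math.exp(-expected)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term                  # accumulate P(X = i)
        term *= expected / (i + 1)   # step to P(X = i + 1)
    return max(0.0, 1.0 - cdf)

# Hypothetical loan volumes, chosen only to show the shape of the calculation.
for n in (1_000, 10_000, 100_000):
    print(n, round(prob_zero_defaults(n), 4))
print("zero defaults stays plausible up to", max_loans_with_plausible_zero(), "loans")
```

Under these assumptions, zero defaults is unremarkable for a period with fewer than roughly nine thousand loans, so the 2007 to 2010 result may simply reflect low volume. For 2012, `poisson_tail(observed_defaults, n_loans_2012 * POOLED_RATE)` would give the chance of seeing at least the observed number of defaults at the pooled rate, with both arguments taken from the actual data.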

The next plot gives the proportion of defaults by grade.  The grade is assigned by Lending Club and is an ordinal set of values which should capture the propensity to default.  That is, we would expect that as the grades get lower (C's, D's, and F's) the proportion of defaults should be higher.  This is the case.  However, this raises the question: can we do better through other statistical methods without overfitting the data set?
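Both plots come down to the same computation: group the loans by a field and take the share of defaults in each group. The sketch below shows one way to do that with the standard library. The field names `grade`, `issue_year`, and `loan_status`, the status value `"Default"`, and the toy rows are all assumptions for illustration, not values taken from the Lending Club files.

```python
from collections import defaultdict

def default_rate_by_group(rows, key, status_field="loan_status", default_value="Default"):
    """Return {group: share of rows in that group whose status equals default_value}."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        if row[status_field] == default_value:
            defaults[group] += 1
    return {group: defaults[group] / totals[group] for group in sorted(totals)}

# Toy rows (hypothetical), only to show the call pattern.
toy_rows = [
    {"grade": "A", "issue_year": 2011, "loan_status": "Fully Paid"},
    {"grade": "A", "issue_year": 2012, "loan_status": "Current"},
    {"grade": "C", "issue_year": 2012, "loan_status": "Default"},
    {"grade": "C", "issue_year": 2012, "loan_status": "Current"},
    {"grade": "F", "issue_year": 2013, "loan_status": "Default"},
]

rate_by_grade = default_rate_by_group(toy_rows, "grade")
rate_by_year = default_rate_by_group(toy_rows, "issue_year")
print(rate_by_grade)
print(rate_by_year)
```

The per-grade rates from the real data serve as the baseline probability estimates that any alternative method would have to beat.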

I propose a study on the Lending Club data to determine, and potentially beat, the estimates of default probability given by the grades.  The method will combine data mining and statistical methodology and will use a set of holdout data to avoid overfitting.  Given more time, I will incorporate the full dataset available after joining the website, which would allow a longitudinal study of payment behaviors instead of just the latest payment information.  I will also include the declined loan data in the study to see whether there were loans that Lending Club should have accepted but did not.  Ideally, I would like to have additional datasets similar to this one in order to compare the estimates of Lending Club against other lenders' methodologies.
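As a sketch of the holdout idea, the code below splits records so that the rare defaults appear in both the training and holdout sets in the same proportion, then scores probability estimates with the Brier score (mean squared error of the predicted probability). Stratifying the split and using the Brier score are my assumptions about a reasonable setup, not a method stated above. The function names are hypothetical.

```python
import random

def stratified_holdout(rows, label_fn, holdout_frac=0.2, seed=0):
    """Split rows into (train, holdout), keeping each label's share the same in both."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_fn(row), []).append(row)
    train, holdout = [], []
    for group in by_label.values():
        rng.shuffle(group)  # shuffles the per-label list, not the caller's rows
        n_hold = round(len(group) * holdout_frac)
        holdout.extend(group[:n_hold])
        train.extend(group[n_hold:])
    return train, holdout

def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    pairs = list(zip(predicted_probs, outcomes))
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)

# Synthetic records (hypothetical): 10 defaults among 1,000 loans.
records = [{"id": i, "defaulted": 1 if i < 10 else 0} for i in range(1000)]
train, holdout = stratified_holdout(records, lambda r: r["defaulted"])
print(len(train), len(holdout), sum(r["defaulted"] for r in holdout))
```

A candidate model would be fit on `train` only. Its predicted probabilities on `holdout` would then be compared, via `brier_score`, against the grade-based default rates estimated from the same training rows.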

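Figure 1: Proportion of defaults out of the total number of loans, by year.
Figure 2: Proportion of defaults by Lending Club grade.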
