The Role of Data in Credit Risk Management

It’s widely understood that data is a critical asset for effective credit risk management. However, defining “data” and determining how to acquire it are complex challenges.

Defining Data

While a single, universally accepted definition of “data” may not exist, here are some key concepts and perspectives:

Information Theory: Data is a symbolic representation of information. Information can be quantified by the extent to which it reduces expected uncertainty.
Bayesian Statistics: Data updates our prior beliefs about the probability distribution of unknown parameters.

These definitions provide a strong foundation for understanding data. Note, however, that a single piece of data does not always reduce uncertainty. For instance, if we initially assign a 90% probability to a coin landing heads (9:1 odds in favor of heads) and then receive evidence with a 1:9 likelihood ratio favoring tails, the posterior odds become 1:1, that is, 50-50. In this case, uncertainty increases. Averaged over all possible measurement outcomes, though, the expected uncertainty still decreases (or, for an uninformative measurement, stays the same).
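The coin example can be checked numerically. The sketch below is only an illustration: it assumes a hypothetical binary measurement ("signal" or "no signal") with P(signal | heads) = 0.1 and P(signal | tails) = 0.9, which gives the 1:9 likelihood ratio from the text, and it measures uncertainty as Shannon entropy in bits. The function and variable names are made up for this example.

```python
import math

def binary_entropy(p):
    """Shannon entropy, in bits, of a two-outcome distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def posterior_heads(prior, p_obs_given_heads, p_obs_given_tails):
    """Bayes' rule: probability of heads after one observation."""
    joint_heads = prior * p_obs_given_heads
    joint_tails = (1 - prior) * p_obs_given_tails
    return joint_heads / (joint_heads + joint_tails)

prior = 0.9                  # initial belief that the coin lands heads
p_sig_h, p_sig_t = 0.1, 0.9  # assumed measurement model (1:9 likelihood ratio)

# Probability of seeing the signal at all, under the prior.
p_signal = prior * p_sig_h + (1 - prior) * p_sig_t

# Posterior belief for each of the two possible measurement outcomes.
post_signal = posterior_heads(prior, p_sig_h, p_sig_t)
post_no_signal = posterior_heads(prior, 1 - p_sig_h, 1 - p_sig_t)

# Posterior entropy averaged over both outcomes.
expected_entropy = (
    p_signal * binary_entropy(post_signal)
    + (1 - p_signal) * binary_entropy(post_no_signal)
)

print("prior entropy:          ", binary_entropy(prior))
print("entropy after signal:   ", binary_entropy(post_signal))
print("expected entropy (avg): ", expected_entropy)
```

Under these assumptions the prior entropy is roughly 0.47 bits, the entropy after the surprising signal rises to 1 bit, and the entropy averaged over both outcomes falls to roughly 0.26 bits.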

In the context of credit risk management, data can be viewed as pieces of information that reduce the expected uncertainty in credit risk predictions. This leads to the crucial questions: what types of data are most useful, and how can we obtain them?
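One way to make “most useful” concrete is to ask how much a candidate data field reduces the expected uncertainty about default. The sketch below computes information gain on a tiny, invented sample; the field names (had_late_payment, applied_on_even_day) and all values are hypothetical and serve only to illustrate the calculation, not to describe real portfolio behavior.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in expected label entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Invented toy sample: 1 = defaulted, 0 = repaid.
defaulted           = [1, 1, 1, 0, 0, 0, 0, 0]
had_late_payment    = [1, 1, 0, 0, 0, 0, 0, 1]  # hypothetical bureau field
applied_on_even_day = [0, 1, 0, 1, 0, 1, 0, 1]  # hypothetical weak field

gain_late = information_gain(had_late_payment, defaulted)
gain_even = information_gain(applied_on_even_day, defaulted)

print("gain from had_late_payment:   ", gain_late)
print("gain from applied_on_even_day:", gain_even)
```

On this toy sample the repayment-related field yields the larger gain. With real data the estimate would need far more observations and out-of-sample validation before it could guide data acquisition.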

Acquiring Relevant Data

Imagine starting a brand-new business in a completely unfamiliar industry, equipped only with raw processing power and no prior knowledge. How would you gather data? Naturally, you’d begin by experimenting at random and quickly learning which insights are useful. Now suppose that, instead, you possessed some prior knowledge of other industries. In that case, you would compare the new industry with those you already know, assigning greater weight to insights from the most similar areas. If you had some prior knowledge of the new industry itself, your position would be stronger; and if you had deep expertise in both the industry and the specific business, you’d be in the best possible position. This thought experiment underscores that data can be generated by taking action, but that the most efficient approach is often to learn from data produced by others’ actions. In short, action generates data.

It’s crucial to recognize that experience alone is insufficient without a grasp of the underlying mechanisms. Blindly imitating without understanding can be detrimental.

Data Type	Description	Source
Repayment History	Borrower’s past repayment behavior	Credit Bureau, Internal records
Income	Borrower’s income	Credit Bureau, Document Submission, Government Tax Data, Proxy data (e.g., utility bills, credit card limits)
Tax Returns	Borrower’s tax returns	Government Tax Data
Loan Application/Records	Borrower’s loan applications and records	Credit Bureau
Product Offering	Borrower’s existing financial products (e.g., credit cards)	Credit Bureau
Transaction Data	Borrower’s transaction history (e.g., wallet, e-commerce)	Internal records, Third-party data providers
Device Data	Borrower’s device information (e.g., fingerprint, location, SMS, call logs)	Device SDKs
Telco Data	Borrower’s data from their telecom provider	Telco Provider
Network Data	Borrower’s connection data, such as contact lists and social network connections	Device and social network data

The availability of data is not without limitations. Data not owned by your company often presents challenges such as inconsistent update frequencies, variable data quality, limited coverage, and granularity issues. A particularly significant hurdle is dealing with “thin-file” data, where the information available is insufficient for reliable predictions. In such cases, accumulating your own repayment history, despite initial losses, is often the most viable approach.
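A thin-file check can be as simple as a rule on how much history exists. The sketch below is a minimal illustration; the function name, its inputs, and the thresholds (3 credit records, 6 months of history) are assumptions chosen for the example, not industry standards.

```python
def is_thin_file(n_credit_records, months_of_history,
                 min_records=3, min_months=6):
    """Flag a borrower whose available history is too sparse to model reliably.

    The default thresholds are illustrative assumptions and should be tuned
    to the portfolio at hand.
    """
    return n_credit_records < min_records or months_of_history < min_months

# Hypothetical applicants: (number of credit records, months of history).
applicants = {"A": (1, 24), "B": (5, 12), "C": (5, 2)}

for name, (records, months) in applicants.items():
    label = "thin file" if is_thin_file(records, months) else "sufficient history"
    print(name, label)
```

Borrowers flagged this way could be routed to a separate policy, for example smaller initial limits, while internal repayment history accumulates.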

The data types listed in the table above are some of the most common ones employed in credit risk management. However, the true value of data is often realized through feature engineering, a process that can be guided by expert knowledge or derived from end-to-end models. How aggressively the raw data is compressed into features significantly influences its utility. As with data acquisition, using data effectively requires both action and experience: expertise develops through iterative application and refinement. While emulating best practices is generally desirable, feature engineering often remains a “black box,” both between companies and within models, making direct replication difficult. In short, there is no shortcut. For general model building and feature engineering concepts, refer to Creating Models of the Environment.
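As a small illustration of expert-guided feature engineering, the sketch below compresses a raw repayment history into a few summary features. The record layout (Repayment), the feature names, and the sample values are all hypothetical; a real pipeline would pick features based on the product and the available data.

```python
from dataclasses import dataclass

@dataclass
class Repayment:
    days_past_due: int   # 0 means the installment was paid on time
    amount_due: float
    amount_paid: float

def repayment_features(history):
    """Compress a list of Repayment records into summary features."""
    n = len(history)
    if n == 0:
        # No history at all: return explicit missing values.
        return {"n_payments": 0, "on_time_ratio": None,
                "max_days_past_due": None, "paid_ratio": None}
    total_due = sum(r.amount_due for r in history)
    total_paid = sum(r.amount_paid for r in history)
    return {
        "n_payments": n,
        "on_time_ratio": sum(r.days_past_due == 0 for r in history) / n,
        "max_days_past_due": max(r.days_past_due for r in history),
        "paid_ratio": total_paid / total_due if total_due else None,
    }

# Invented sample history for one borrower.
history = [
    Repayment(0, 100.0, 100.0),
    Repayment(5, 100.0, 100.0),
    Repayment(0, 100.0, 100.0),
    Repayment(35, 100.0, 50.0),
]
features = repayment_features(history)
print(features)
```

Each feature discards detail: on_time_ratio, for instance, ignores when the late payments occurred. This is the compression trade-off mentioned above, and choosing what to keep is where expertise enters.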