Creating Models of the Environment
Date: October 2024
Last Update: October 2024
Disclaimer: The author’s primary expertise is in credit risk management rather than machine learning. The theoretical content in this chapter represents personal notes that he found interesting and wishes to share with others.
Why Does Supervised Learning Work?
In essence, “Low Training Error + Sufficient Data Relative to Model Complexity \(\implies\) Low Test Error”.
This guarantee holds only when the training and test data are drawn from the SAME distribution.
Note: Readers primarily interested in practical applications may safely skip the proofs in this section. These proofs are included for completeness and deeper theoretical understanding.
Key Concepts
Training Error (\(\text{Train}_S(f)\)): The error of a model \(f\) evaluated on the training dataset \(S\).
Test Error (\(\text{Test}_D(f)\)): The expected error of \(f\) on new, unseen data drawn from the same distribution \(D\).
Hypothesis Class (\(\mathcal{F}\)): The set of all models \(f\) that the learning algorithm can choose from.
Sample Size (\(|S|\)): The number of training examples.
Confidence Level (\(1 - \delta\)): The probability that the bound holds true.
Degrees of Freedom: Informally, the number of parameters or complexity of the model class \(\mathcal{F}\).
Mathematical Formulation
Generalization Error Bound
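The generalization error bound is given by:
\[ \Pr_S\left[\forall f \in \mathcal{F}: |\text{Test}_D(f) - \text{Train}_S(f)| \leq \sqrt{\frac{\log |\mathcal{F}| + \log \frac{1}{\delta}}{|S|}}\right] \geq 1 - \delta \]
Interpretation:
With probability at least \(1 - \delta\) over the draw of the training set \(S\), the difference between test and training error is bounded, simultaneously for every \(f \in \mathcal{F}\), by:
\[ |\text{Test}_D(f) - \text{Train}_S(f)| \leq \sqrt{\frac{\log |\mathcal{F}| + \log \frac{1}{\delta}}{|S|}} \]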
Implications:
Sample Size Effect
Larger training sample size \(|S|\) results in:
Tighter error bounds
Test error closer to training error
Model Complexity Effect
Larger hypothesis class size \(|\mathcal{F}|\) leads to:
Looser error bounds
Potentially larger gap between test and training errors
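To make the two effects concrete, the bound stated in this section can be evaluated numerically. The following is a minimal sketch, not a definitive implementation: the function names `generalization_gap_bound` and `sample_size_for_gap` are hypothetical, and the hypothesis class is assumed to be finite.

```python
import math


def generalization_gap_bound(num_hypotheses: int, sample_size: int, delta: float) -> float:
    """Bound on |test error - train error| that holds with probability >= 1 - delta.

    Uses the form sqrt((ln|F| + ln(1/delta)) / |S|) for a finite class F.
    """
    if num_hypotheses < 1 or sample_size < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need |F| >= 1, |S| >= 1 and 0 < delta < 1")
    return math.sqrt((math.log(num_hypotheses) + math.log(1.0 / delta)) / sample_size)


def sample_size_for_gap(num_hypotheses: int, target_gap: float, delta: float) -> int:
    """Smallest |S| for which the bound above is at most target_gap."""
    return math.ceil((math.log(num_hypotheses) + math.log(1.0 / delta)) / target_gap ** 2)


if __name__ == "__main__":
    # Illustrative values only: |F| = 1000 hypotheses, 95% confidence.
    for n in (1_000, 10_000, 100_000):
        print(n, round(generalization_gap_bound(1_000, n, 0.05), 4))
```

Because \(|\mathcal{F}|\) enters only through its logarithm while \(|S|\) enters directly, multiplying the sample size by ten shrinks the bound by a factor of \(\sqrt{10}\), whereas multiplying the class size by ten only adds a constant to the numerator.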
Proof
The proof follows these key steps:
Setup
Let \(f \in \mathcal{F}\) be any hypothesis
Let \(S\) be a training set of size \(|S|\) drawn i.i.d. from distribution \(D\)
Let \(\text{Test}_D(f)\) and \(\text{Train}_S(f)\) be the test and training errors
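Hoeffding’s Inequality
For a single hypothesis \(f\), fixed before the training set is drawn, with a loss bounded in \([0, 1]\):
\[ \Pr_S[|\text{Test}_D(f) - \text{Train}_S(f)| > \epsilon] \leq 2e^{-2|S|\epsilon^2} \]
Note on Hoeffding’s Inequality:
In probability theory, Hoeffding’s inequality provides an upper bound on the probability that the sum (or average) of bounded independent random variables deviates from its expected value by more than a given amount. Here the random variables are the per-example losses of \(f\) on \(S\): their average is \(\text{Train}_S(f)\) and their expectation is \(\text{Test}_D(f)\).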
Union Bound
Apply to all hypotheses \(f \in \mathcal{F}\):
\[ \Pr_S[\exists f \in \mathcal{F}: |\text{Test}_D(f) - \text{Train}_S(f)| > \epsilon] \leq |\mathcal{F}| \cdot 2e^{-2|S|\epsilon^2} \]
Set Failure Probability
Let the right side equal \(\delta\):
\[ |\mathcal{F}| \cdot 2e^{-2|S|\epsilon^2} = \delta \]
Solve for \(\epsilon\)
\[ \begin{aligned} e^{-2|S|\epsilon^2} &= \frac{\delta}{2|\mathcal{F}|} \\ -2|S|\epsilon^2 &= \ln\left(\frac{\delta}{2|\mathcal{F}|}\right) \\ \epsilon^2 &= \frac{\ln(2|\mathcal{F}|/\delta)}{2|S|} \\ \epsilon &= \sqrt{\frac{\ln(2|\mathcal{F}|/\delta)}{2|S|}} \end{aligned} \]
Final Result
Since \(\Pr_S[\exists f \in \mathcal{F}: |\text{Test}_D(f) - \text{Train}_S(f)| > \epsilon] \leq \delta\), it follows that the probability that all hypotheses satisfy the inequality is at least \(1-\delta\): \(\Pr_S[\forall f \in \mathcal{F}: |\text{Test}_D(f) - \text{Train}_S(f)| \leq \epsilon] \geq 1 - \delta\).
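Substituting the value of \(\epsilon\) derived above, with probability at least \(1-\delta\), every \(f \in \mathcal{F}\) satisfies:
\[ |\text{Test}_D(f) - \text{Train}_S(f)| \leq \sqrt{\frac{\ln(2|\mathcal{F}|/\delta)}{2|S|}} \leq \sqrt{\frac{\log |\mathcal{F}| + \log \frac{1}{\delta}}{|S|}} \]
The second inequality gives the simpler form of the bound. It holds whenever \(|\mathcal{F}|/\delta \geq 2\), because then \(\ln 2 \leq \log |\mathcal{F}| + \log \frac{1}{\delta}\) (with \(\log\) denoting the natural logarithm).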
This proves that with high probability, the difference between test and training error is bounded by a term that depends on the model complexity (\(|\mathcal{F}|\)) and training set size (\(|S|\)).
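The result requires the training and test data to be drawn i.i.d. from the SAME distribution \(D\): the proof uses this assumption in Hoeffding’s inequality, and the bound no longer holds if the test distribution differs.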
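The union-bound argument can be checked empirically on a toy problem. The sketch below is an illustration under stated assumptions rather than part of the proof: the hypothesis class (eleven threshold classifiers on \([0, 1]\)), the noiseless labels, the uniform distribution, and all function names are hypothetical choices made for this example.

```python
import math
import random


def max_gap_exceeds(thresholds, sample_size, epsilon, rng):
    """Draw one training set; report whether any hypothesis has |test - train| > epsilon."""
    xs = [rng.random() for _ in range(sample_size)]
    for t in thresholds:
        # h_t predicts 1 when x >= t; the true label is 1 when x >= 0.5,
        # so h_t errs exactly on the points lying between t and 0.5.
        lo, hi = min(t, 0.5), max(t, 0.5)
        train_error = sum(lo <= x < hi for x in xs) / sample_size
        test_error = hi - lo  # exact under x ~ Uniform(0, 1)
        if abs(test_error - train_error) > epsilon:
            return True
    return False


def failure_rate(num_trials=200, sample_size=200, delta=0.05, seed=0):
    """Fraction of training sets on which the uniform bound is violated, plus epsilon."""
    rng = random.Random(seed)
    thresholds = [i / 10 for i in range(11)]  # finite class, |F| = 11
    epsilon = math.sqrt(math.log(2 * len(thresholds) / delta) / (2 * sample_size))
    failures = sum(
        max_gap_exceeds(thresholds, sample_size, epsilon, rng) for _ in range(num_trials)
    )
    return failures / num_trials, epsilon


if __name__ == "__main__":
    rate, eps = failure_rate()
    print(f"epsilon = {eps:.4f}, observed failure rate = {rate:.3f} (bound: 0.05)")
```

The observed failure rate should sit well below \(\delta\), which reflects how conservative the union bound is: it charges each hypothesis separately even when their errors are strongly correlated.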
The Bias-Complexity Tradeoff
When selecting the hypothesis class \(\mathcal{F}\), we face a tradeoff between a larger, more complex class that is likely to have a small approximation error, and a more restrictive, simpler class that ensures a small estimation error. Prior knowledge about the problem constrains the hypothesis class. For example, in credit risk modeling, if we know that the relationship between a borrower’s debt-to-income ratio and default probability is generally monotonic (higher ratios correspond to higher default risk), we can restrict our hypothesis class to monotonic functions only. This prior knowledge eliminates many possible hypotheses that would violate this relationship, reducing model complexity while maintaining or improving real-world performance.
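The effect of the monotonicity restriction can be quantified in a toy setting. The sketch below assumes, purely for illustration, that the debt-to-income ratio is split into 10 ordered bins and that a hypothesis assigns a binary label (0 = good, 1 = default) to each bin; the function names are hypothetical.

```python
import math
from itertools import product


def count_hypotheses(num_bins: int, monotone_only: bool) -> int:
    """Count binary labelings of ordered bins, optionally only non-decreasing ones."""
    count = 0
    for labels in product((0, 1), repeat=num_bins):
        if monotone_only and any(a > b for a, b in zip(labels, labels[1:])):
            continue  # the label drops from 1 to 0 as the ratio rises: not monotone
        count += 1
    return count


def gap_bound(num_hypotheses: int, sample_size: int, delta: float = 0.05) -> float:
    """Finite-class bound sqrt((ln|F| + ln(1/delta)) / |S|)."""
    return math.sqrt((math.log(num_hypotheses) + math.log(1.0 / delta)) / sample_size)


all_rules = count_hypotheses(10, monotone_only=False)      # 2**10 labelings
monotone_rules = count_hypotheses(10, monotone_only=True)  # one per switch point

if __name__ == "__main__":
    print(all_rules, monotone_rules)
    print(gap_bound(all_rules, 500), gap_bound(monotone_rules, 500))
```

In this setting the prior knowledge shrinks the class from 1024 hypotheses to 11, which tightens the estimation-error bound for the same sample size. Whether the approximation error stays small depends on the monotonicity belief actually being true.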
Theorem. (No-Free-Lunch) Let \(A\) be any learning algorithm for binary classification with respect to the 0–1 loss over a domain \(\mathcal{X}\). Let \(m\) be any number smaller than \(|\mathcal{X}|/2\), representing a training set size. Then, there exists a distribution \(\mathcal{D}\) over \(\mathcal{X} \times \{0,1\}\) such that:
There exists a function \(f : \mathcal{X} \rightarrow \{0,1\}\) with \(L_{\mathcal{D}}(f) = 0\).
With probability of at least \(1/7\) over the choice of \(S \sim \mathcal{D}^m\), we have that \(L_{\mathcal{D}}(A(S)) \geq 1/8\).
This theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner.
While a comprehensive discussion of model selection techniques is beyond the scope of this section, several key methods deserve mention:
AIC (Akaike Information Criterion): Balances model fit against complexity by penalizing the number of parameters
BIC (Bayesian Information Criterion): Similar to AIC but with a stronger penalty for complexity
MDL (Minimum Description Length): Selects models based on their ability to compress the data efficiently
VC Dimension (Vapnik–Chervonenkis Dimension): Measures the capacity of a model class to fit arbitrary data
CV (Cross-Validation): Empirically estimates model performance on unseen data through repeated train-test splits
For a detailed treatment of these methods, readers should consult the references provided below.
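As a brief illustration of how the first two criteria penalize complexity, the sketch below uses the standard definitions \(\text{AIC} = 2k - 2\ln\hat{L}\) and \(\text{BIC} = k\ln n - 2\ln\hat{L}\), where \(k\) is the number of parameters, \(n\) the number of observations, and \(\hat{L}\) the maximized likelihood. The log-likelihood values in the example are made up for illustration.

```python
import math


def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike Information Criterion: lower is better."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood: float, num_params: int, num_obs: int) -> float:
    """Bayesian Information Criterion: lower is better."""
    return num_params * math.log(num_obs) - 2 * log_likelihood


if __name__ == "__main__":
    # Hypothetical fits on n = 1000 observations: a 3-parameter and an 8-parameter model.
    small = {"log_likelihood": -520.0, "num_params": 3}
    large = {"log_likelihood": -512.0, "num_params": 8}
    print("AIC:", aic(**small), aic(**large))
    print("BIC:", bic(num_obs=1000, **small), bic(num_obs=1000, **large))
```

With these made-up numbers AIC prefers the larger model while BIC prefers the smaller one, because the BIC penalty per parameter, \(\ln n\), exceeds the AIC penalty of 2 once \(n\) is larger than about 7.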
Among these approaches, cross-validation has become the de facto standard when sufficient data is available. However, it’s crucial to note that proper cross-validation requires setting aside validation data before performing any data preprocessing or feature selection steps. While this principle may seem straightforward, it is frequently overlooked in practice, potentially leading to optimistic bias in performance estimates.
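The principle of keeping preprocessing inside the cross-validation loop can be sketched in plain Python. This is a minimal illustration, assuming a single numeric feature, standardization as the only preprocessing step, and a toy decision rule; the helper names `k_fold_indices` and `cross_validate` are hypothetical.

```python
import random
import statistics


def k_fold_indices(num_rows: int, k: int, seed: int = 0):
    """Yield (train_idx, valid_idx) pairs that partition range(num_rows) into k folds."""
    idx = list(range(num_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, valid in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, valid


def cross_validate(xs, ys, k=5):
    """Accuracy per fold for a toy rule: predict 1 when the standardized x is above 0."""
    scores = []
    for train, valid in k_fold_indices(len(xs), k):
        # Fit the preprocessing on the training fold only ...
        train_values = [xs[j] for j in train]
        mean = statistics.fmean(train_values)
        std = statistics.pstdev(train_values) or 1.0
        # ... then apply the frozen statistics to the held-out fold.
        preds = [int((xs[j] - mean) / std > 0) for j in valid]
        scores.append(sum(p == ys[j] for p, j in zip(preds, valid)) / len(valid))
    return scores


# Synthetic data for illustration only.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)]
ys = [int(x > 0) for x in xs]
fold_scores = cross_validate(xs, ys, k=5)
```

The key design choice is that `mean` and `std` are recomputed from each training fold. Computing them once on the full dataset before splitting would let information from the held-out rows leak into the preprocessing, which is the source of the optimistic bias mentioned above.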
Feature Selection and Manipulation
Feature Selection
Our discussion so far has focused on abstract models of learning, where the prior knowledge utilized by the learner is fully encoded by the choice of the hypothesis class \(\mathcal{F}\). However, another crucial modeling choice exists: how do we represent the instance space \(\mathcal{X}\)? While there are common techniques for feature selection and learning, the No-Free-Lunch theorem still applies. Here are some widely used approaches:
Filtering: Select the \(k\) features with the highest scores (based on any chosen metric of interest) independently of other features.
Forward Greedy Selection: Start with an empty feature set and iteratively add features that yield the highest performance gain.
Backward Elimination: Start with all features and iteratively remove features whose removal results in the highest performance gain.
Constraining the hypothesis class \(\mathcal{F}\) to use a small subset of features can reduce estimation error and thus prevent overfitting. Additionally, in practical applications, obtaining and processing each feature often incurs computational and financial costs that must be considered.
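Forward greedy selection, as described in the list above, can be sketched as follows. The scoring function is left abstract: in practice it would typically be a cross-validated performance estimate, and the toy gains used here are invented for illustration. The function and feature names are hypothetical.

```python
from typing import Callable, List, Sequence


def forward_greedy_selection(
    candidates: Sequence[str],
    score_fn: Callable[[List[str]], float],
    max_features: int,
) -> List[str]:
    """Add, one at a time, the feature that most improves score_fn; stop when none helps."""
    selected: List[str] = []
    best_score = score_fn(selected)
    while len(selected) < max_features:
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            break
        # Score every one-feature extension of the current set.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        new_score, best_feature = max(scored)
        if new_score <= best_score:
            break  # no candidate improves the score, so stop early
        selected.append(best_feature)
        best_score = new_score
    return selected


# Toy, additive "validation gains" per feature (made-up numbers).
toy_gain = {"debt_to_income": 0.10, "utilization": 0.06, "tenure": 0.02, "noise": -0.01}


def toy_score(features: List[str]) -> float:
    return 0.5 + sum(toy_gain[f] for f in features)


chosen = forward_greedy_selection(list(toy_gain), toy_score, max_features=4)
```

Each round evaluates one candidate per remaining feature, so the procedure needs on the order of \(k \cdot d\) score evaluations for \(d\) features and \(k\) selected, which matters when every evaluation is a full cross-validation run.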
Feature Manipulation
Feature manipulation or normalization transforms features into a new space, often to improve learning algorithm performance. When considering feature representation, we should relate the choice to both the learning algorithm and our prior knowledge about the problem.
A common example in credit risk modeling is the Weight of Evidence (WOE) transformation. WOE converts categorical and continuous variables into a standardized scale by comparing the proportion of good and bad credit outcomes in each category or bin. While WOE can improve model performance by handling non-linear relationships and outliers, its widespread adoption in the financial industry, in the author’s opinion, is primarily driven by regulatory requirements for model interpretability and transparency. The clear relationship between WOE values and default rates makes it easier to explain model decisions to stakeholders and regulators, even though other transformations may achieve similar or better predictive performance.
The use of WOE involves a transformation of data that requires binning. A binning process should follow these principles:
Missing values should be grouped separately
Each bin should contain at least 5% of the total observations
No bin should have zero good or bad outcomes
The relationship between the independent variable and the target variable should be monotonic across bins
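The Weight of Evidence (WOE) for a bin is calculated as:
\[ \text{WOE}_i = \ln\left(\frac{\%\text{ of Good Customers}_i}{\%\text{ of Bad Customers}_i}\right) \]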
Where:
\(i\) represents a specific bin
\(\%\text{ of Good Customers}_i = \frac{\text{Number of Good Customers in bin }i}{\text{Total Number of Good Customers}}\)
\(\%\text{ of Bad Customers}_i = \frac{\text{Number of Bad Customers in bin }i}{\text{Total Number of Bad Customers}}\)
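The calculation can be sketched in a few lines. This is a minimal illustration, assuming the bins have already been formed and that each bin is summarized by its count of good and bad customers; the function name, bin labels, and counts are hypothetical.

```python
import math
from typing import Dict, Tuple


def weight_of_evidence(bin_counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map each bin to ln(% of good customers / % of bad customers).

    bin_counts maps a bin label to (number of good, number of bad) customers.
    """
    total_good = sum(good for good, _ in bin_counts.values())
    total_bad = sum(bad for _, bad in bin_counts.values())
    woe = {}
    for label, (good, bad) in bin_counts.items():
        if good == 0 or bad == 0:
            # WOE is undefined here; the binning principles call for merging such bins.
            raise ValueError(f"bin {label!r} has zero good or bad outcomes")
        woe[label] = math.log((good / total_good) / (bad / total_bad))
    return woe


# Made-up counts by debt-to-income bin, with missing values in their own group.
example_bins = {
    "low": (400, 10),
    "medium": (350, 30),
    "high": (150, 50),
    "missing": (100, 10),
}
example_woe = weight_of_evidence(example_bins)
```

In this example the "low" bin holds 40% of all good customers but only 10% of all bad ones, so its WOE is \(\ln 4 > 0\), while the "missing" bin holds 10% of each and therefore has a WOE of zero.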
The WOE transformation has several useful properties:
It creates a linear relationship with the log odds of the target variable when using logistic regression
The values are symmetric around zero:
WOE = 0 indicates the proportion of good and bad customers is equal
WOE > 0 indicates more good than bad customers
WOE < 0 indicates more bad than good customers
Monotonicity: When the bins are constructed to satisfy the monotonicity principle above, the WOE values are monotonic with respect to the independent variable
Using proportions instead of counts helps avoid the influence of sample size on the transformation
The logarithmic nature accentuates the difference between the proportions of good and bad, highlighting bins with strong discriminative power
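A related measure, Information Value (IV), can be used to assess the overall predictive power of a variable:
\[ \text{IV} = \sum_{i=1}^{n} \left(\%\text{ of Good Customers}_i - \%\text{ of Bad Customers}_i\right) \times \text{WOE}_i \]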
Where \(n\) is the total number of bins.
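IV is closely related to the concept of Kullback-Leibler Divergence (KLD), which measures the difference between two probability distributions \(P\) and \(Q\):
\[ D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \ln\left(\frac{P(i)}{Q(i)}\right) \]
The only difference between IV and KLD is the weighting factor: KLD weights each log-ratio by \(P(i)\), whereas IV weights it by the difference between the good and bad proportions. As a consequence, IV equals the symmetrized divergence \(D_{KL}(G \,\|\, B) + D_{KL}(B \,\|\, G)\), where \(G\) and \(B\) are the distributions of good and bad customers across bins.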
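The relationship between IV and KLD can be verified numerically. The sketch below assumes the per-bin shares of good and bad customers are already available and strictly positive; the function names and the example shares are hypothetical.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """D_KL(P || Q) for discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def information_value(good_share: List[float], bad_share: List[float]) -> float:
    """Sum over bins of (good share - bad share) * ln(good share / bad share)."""
    return sum((g - b) * math.log(g / b) for g, b in zip(good_share, bad_share))


# Made-up shares of good and bad customers across four bins (each list sums to 1).
good = [0.40, 0.35, 0.15, 0.10]
bad = [0.10, 0.30, 0.50, 0.10]

iv = information_value(good, bad)
symmetrized_kl = kl_divergence(good, bad) + kl_divergence(bad, good)
```

Here `iv` and `symmetrized_kl` agree up to floating-point error, since \((g - b)\ln(g/b) = g\ln(g/b) + b\ln(b/g)\) term by term.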
Commonly Used Models
This section highlights the commonly used models in credit risk modeling rather than delving into their technical details. Once we have established that supervised learning works and sufficient data is available, model selection becomes primarily a matter of empirical performance comparison. The most prevalent models in credit risk assessment include:
Logistic Regression: A traditional but robust choice, offering high interpretability
XGBoost: Currently the most widely adopted model in practice, known for its performance
LightGBM: Another powerful gradient boosting framework gaining popularity
Based on industry experience, XGBoost has emerged as the dominant choice due to its balance of performance, scalability, and relative interpretability.
Unstructured Data
As the meme humorously points out, “The entire finance industry runs on Excel.” However, relying solely on structured tabular data has significant limitations when modeling customer behavior:
Loss of Event Order: Classical tabular models aggregate data over time periods, losing the temporal sequence of actions. Two customers with identical transactions in different orders would appear the same to the model, despite potentially meaningful differences in their behavior patterns.
Ignoring Temporal Interactions: These models fail to capture how events interact over time. For example, a large purchase followed by an unusual withdrawal could indicate different risk levels compared to the reverse sequence, but tabular aggregations would treat them identically.
Independent Predictions: Each prediction is made independently, without considering the sequence of past events. This prevents the model from identifying behavioral patterns that develop over time.
To address these limitations, incorporating unstructured and sequential data can provide additional insights beyond traditional tabular approaches. Sequential models can capture temporal dependencies and patterns that may be crucial for accurate risk assessment.
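The loss of event order described above is easy to demonstrate. In the sketch below, the event schema and the aggregate features are hypothetical simplifications chosen for illustration; real tabular feature sets are larger but share the same order-insensitivity.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, float]  # (event type, amount); a hypothetical minimal schema


def tabular_features(events: List[Event]) -> Dict[str, float]:
    """Order-insensitive aggregates of the kind a classical tabular model consumes."""
    return {
        "n_events": len(events),
        "total_purchases": sum(a for kind, a in events if kind == "purchase"),
        "total_withdrawals": sum(a for kind, a in events if kind == "withdrawal"),
        "max_amount": max(a for _, a in events),
    }


# Same three transactions, different order.
customer_a = [("purchase", 5000.0), ("withdrawal", 900.0), ("purchase", 20.0)]
customer_b = [("withdrawal", 900.0), ("purchase", 20.0), ("purchase", 5000.0)]

same_features = tabular_features(customer_a) == tabular_features(customer_b)
```

The two sequences differ, yet `same_features` is true: any model that sees only these aggregates cannot distinguish the two customers, whereas a model that consumes the raw sequence can.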
The field of learning from unstructured data continues to evolve rapidly, and industry adoption will take time. However, the author believes that the use of unstructured data will become increasingly prevalent in credit risk modeling.
Reject Inference
A fundamental assumption in supervised learning is that the training and test data come from the same distribution. However, in credit risk modeling, we often face a significant challenge: we typically only have data from approved applications, not the entire applicant population. This creates what is known as sample selection bias, where our training data distribution differs from the true population distribution. This poses a particular challenge for policy-making, as we need to make decisions that will affect the entire population, not just the approved subset.
Reject inference comprises techniques developed to address this challenge by attempting to estimate model performance on the full population using only the approved sample data. These methods try to infer the likely outcomes for rejected applications based on patterns in the approved data. While promising in theory, implementing reject inference effectively requires careful consideration of assumptions and methodology. As this is an evolving area of practice, the author looks forward to exploring these techniques in greater depth in future work.
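The sample selection bias that motivates reject inference can be illustrated with a small simulation. This sketch does not implement any reject inference method; it only shows how outcomes observed on approved applicants differ from those of the full applicant population. The score distribution, default probabilities, and approval cutoff are all invented for the example.

```python
import random


def simulate_selection_bias(num_applicants: int = 20_000, cutoff: float = 0.5, seed: int = 0):
    """Return (population default rate, default rate among approved applicants)."""
    rng = random.Random(seed)
    population_defaults = approved_defaults = approved = 0
    for _ in range(num_applicants):
        score = rng.random()  # hypothetical credit quality in [0, 1)
        # Riskier (low-score) applicants default more often.
        default = rng.random() < (1.0 - score) * 0.4
        population_defaults += default
        if score >= cutoff:  # only approved loans reveal an outcome
            approved += 1
            approved_defaults += default
    return population_defaults / num_applicants, approved_defaults / approved


if __name__ == "__main__":
    population_rate, approved_rate = simulate_selection_bias()
    print(f"population: {population_rate:.3f}, approved only: {approved_rate:.3f}")
```

Under these assumptions the default rate in the approved sample is roughly half that of the full population, so a model trained and evaluated only on approved accounts sees a materially different distribution from the one a new approval policy would face.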
Model Evaluation Checklist
When evaluating a credit risk model, use the following checklist to ensure the model is robust, compliant, and effective:
Data Assessment
Data Quality
Verify the accuracy and completeness of the data
Handle missing values and outliers appropriately
Data Quantity
Ensure sufficient sample size relative to model complexity
Confirm data representativeness of the target population
Data Distribution
Verify training and testing data come from the same distribution
Monitor for covariate shifts or changes in underlying data over time
Feature Engineering
Relevance and Permissibility
Include features that are relevant and permissible under regulatory guidelines
Exclude features that could introduce bias or violate privacy laws
Transformation Techniques
Apply appropriate transformations such as Weight of Evidence (WOE)
Ensure transformations maintain monotonic relationships with target variables
Variable Selection
Use Information Value (IV) or other metrics to assess predictive power
Remove low-performing variables to simplify the model and reduce overfitting
Model Evaluation
Performance Metrics
Evaluate using relevant metrics (e.g., AUC-ROC, KS Statistic, Gini Coefficient)
Assess performance across different customer segments
Benchmarking
Compare in-sample and out-of-sample performance
Compare performance across time periods
Compare against baseline models or industry standards
Analyze improvements over previous approaches
Interpretability and Explainability
Transparency
Ensure model decisions can be explained clearly
Provide insights into feature importance and influence
Regulatory Compliance
Meet explainability requirements per regulatory guidelines
Document rationale behind model choices and predictions
Validation and Testing
Stress Testing
Evaluate performance under various economic scenarios
Sensitivity Analysis
Analyze how input changes affect predictions
Monitoring and Maintenance
Post-Deployment Monitoring
Implement real-time performance monitoring
Track metrics to detect data drift or model degradation
Periodic Review
Schedule regular assumption and performance reviews
Update model based on new data or changing conditions
Ethical and Legal Considerations
Fairness and Bias
Assess for potential biases against protected groups
Implement bias mitigation strategies
Privacy Compliance
Ensure compliance with data protection regulations
Apply appropriate data anonymization
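The performance metrics named in the checklist (AUC-ROC, KS Statistic, Gini Coefficient) can be computed directly from scores and labels. The following is a minimal sketch for small samples, assuming label 1 marks a bad customer and that a higher score means higher predicted risk; the quadratic-time AUC is chosen for clarity, not efficiency, and the function names are hypothetical.

```python
from typing import List


def auc_roc(scores: List[float], labels: List[int]) -> float:
    """Probability that a random bad (label 1) scores above a random good (label 0).

    Ties count as half a win.
    """
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))


def gini(scores: List[float], labels: List[int]) -> float:
    """Gini coefficient, derived from the AUC as 2 * AUC - 1."""
    return 2.0 * auc_roc(scores, labels) - 1.0


def ks_statistic(scores: List[float], labels: List[int]) -> float:
    """Largest gap between the empirical score CDFs of goods and bads."""
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    return max(
        abs(sum(g <= t for g in good) / len(good) - sum(b <= t for b in bad) / len(bad))
        for t in set(scores)
    )


# Made-up scores: three bad customers followed by four good ones.
example_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
example_labels = [1, 1, 1, 0, 0, 0, 0]
```

All three metrics measure rank separation between goods and bads, so they should be computed on out-of-sample data and, per the checklist, compared across segments and time periods rather than reported as a single number.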
Cycle of Model Development
Given that supervised learning guarantees hold only when training and test data come from the SAME distribution, that the choice of hypothesis class \(\mathcal{F}\) is crucial, and that feature representation influences model performance, we can derive the following development strategy:
| Data Amount | Prior Knowledge | Stage | Data Accumulation Strategy | Model Development Strategy |
|---|---|---|---|---|
| Limited | Without Prior Belief | Early Stage | • Random population sampling where feasible | • Focus on simple models |
| Limited | With Prior Belief | Early Stage | • Leverage external expertise | • Restrict hypothesis class based on prior beliefs |
| Sufficient | Without Prior Belief | Mature Stage | • Systematic data collection across segments | • Begin with interpretable models |
| Sufficient | With Prior Belief | Mature Stage | • Strategically aligned data collection | • Restrict hypothesis class based on knowledge |
References
Shalev-Shwartz, Shai, and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. New York: Cambridge University Press, 2014.
Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer Series in Statistics. New York, NY: Springer, 2017.