Supervised Machine Learning: An Experiential and Applied Session Custom Case Solution & Analysis

Evidence Brief: Case Data Extraction

Financial Metrics

Total Customer Base: 5,000 records.
Historical Conversion Rate: 9.6 percent (480 customers accepted a personal loan in the previous campaign).
Campaign Cost Structure: The case does not explicitly state the dollar cost per contact, but emphasizes the need to reduce the volume of unsuccessful solicitations.
Revenue Drivers: Interest income from personal loans and fee-based services for liability customers.

Operational Facts

Data Dimensions: 14 variables including Age, Experience, Income, ZIP Code, Family size, average monthly credit card spending (CCAvg), Education level, Mortgage value, and existing account types.
Target Variable: Personal Loan (Binary: 1 if accepted, 0 if not).
Technical Process: Follows the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework.
Modeling Requirements: Data partitioning into Training (60 percent) and Validation (40 percent) sets so that overfitting can be detected on data the model has not seen.
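
A minimal sketch of the 60/40 partition follows. It assumes a stratified split, so that both partitions keep roughly the 9.6 percent acceptance rate; the case does not say whether the split is stratified. The function name `stratified_split` and the placeholder label list are illustrative, not part of the case data.

```python
import random

def stratified_split(labels, train_frac=0.6, seed=42):
    """Return (train_idx, valid_idx), splitting each class separately
    so both partitions keep roughly the same class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, valid_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(len(idx) * train_frac)
        train_idx.extend(idx[:cut])
        valid_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(valid_idx)

# Placeholder labels matching the case counts: 480 acceptors among 5,000 records.
labels = [1] * 480 + [0] * 4520
train_idx, valid_idx = stratified_split(labels)
train_pos = sum(labels[i] for i in train_idx)
print(len(train_idx), len(valid_idx), train_pos)  # 3000 2000 288
```

With these counts the training set holds 288 acceptors and the validation set 192, so both partitions preserve the 9.6 percent base rate.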

Stakeholder Positions

Marketing Department: Seeks to increase the success rate of personal loan offers while minimizing the annoyance to customers who are unlikely to respond (90.4 percent did not accept in the previous campaign).
Data Science Team: Tasked with selecting and tuning the optimal classification algorithm (k-Nearest Neighbors, Logistic Regression, or Decision Trees).
Retail Banking Leadership: Focused on growing the loan portfolio without increasing the risk profile of the bank.

Information Gaps

Cost of False Positives: The specific financial penalty for contacting a customer who does not convert.
Cost of False Negatives: The opportunity cost of missing a customer who would have converted.
Data Recency: The time interval between the collection of the training data and the planned execution of the next campaign.
Customer Lifetime Value (CLV): The long-term profitability of a converted loan customer versus a liability-only customer.

Strategic Analysis

Core Strategic Question

How can Universal Bank transition from mass-market solicitation to a precision-targeted acquisition model to maximize personal loan conversion while minimizing marketing waste?

Structural Analysis

The problem is a classic classification challenge. Applying the CRISP-DM framework reveals that the primary bottleneck is not data volume, but the selection of the correct features to predict conversion. The 9.6 percent baseline conversion rate indicates a high degree of class imbalance, meaning a naive model could achieve 90.4 percent accuracy by simply predicting zero for everyone. The strategic focus must shift from overall accuracy to sensitivity (recall) and the lift over the baseline.
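
The accuracy trap described above can be checked with a few lines of arithmetic. The sketch below uses placeholder labels that match the case counts (480 acceptors, 4,520 non-acceptors) and a naive model that predicts zero for everyone; the helper name `confusion_counts` is illustrative.

```python
def confusion_counts(actual, predicted):
    """Return (tp, tn, fp, fn) for binary labels coded 1/0."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

# Placeholder labels matching the case counts, not the actual records.
actual = [1] * 480 + [0] * 4520
predicted = [0] * len(actual)          # naive model: nobody accepts

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(accuracy, recall)                # 0.904 0.0
```

The naive model is right 90.4 percent of the time yet finds none of the 480 acceptors, which is why recall and lift are the more informative measures here.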

Strategic Options

Option 1: K-Nearest Neighbors (k-NN) Classification. This non-parametric approach identifies similar customer profiles.
- Rationale: Effective at capturing non-linear relationships between income, education, and loan acceptance.
- Trade-offs: Requires significant computational power as the dataset grows and is sensitive to the choice of k.
- Resource Requirements: Data normalization and feature scaling.
Option 2: Logistic Regression. A parametric model to estimate the probability of loan acceptance.
- Rationale: Provides clear coefficients that allow management to understand which factors (e.g., Income, CD Account) drive conversion.
- Trade-offs: Assumes linear relationships between features and the log-odds of the outcome.
- Resource Requirements: Statistical validation of assumptions and variable significance testing.
Option 3: Status Quo (Broad Segment Targeting). Continuing to target customers based on simple demographic filters like Income > $100k.
- Rationale: Low technical complexity and no requirement for advanced modeling.
- Trade-offs: High waste and missed opportunities among lower-income but high-propensity segments.
- Resource Requirements: None beyond existing marketing staff.
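
For reference, the linearity assumption noted under Option 2 can be written out explicitly. The symbols are notation introduced here, not taken from the case: $p$ is the probability of loan acceptance, $x_1, \dots, x_m$ are the predictors, and $\beta_0, \dots, \beta_m$ are the fitted coefficients.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m)}}
```

Under this form, a one-unit increase in $x_j$ multiplies the odds of acceptance by $e^{\beta_j}$, which is what makes the coefficients readable for management.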

Preliminary Recommendation

Universal Bank should implement the k-NN model with a k-value optimized via validation error rates. While Logistic Regression offers interpretability, the primary goal is predictive performance to maximize the conversion of the next 5,000 prospects. The non-linear nature of banking behavior—where the interaction between income and family size often dictates credit needs—makes k-NN the superior choice for maximizing the lift in the top two deciles of the prospect list.
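
The selection of k by validation error can be sketched as follows. This is an illustration under stated assumptions, not the case's implementation: the data are synthetic (two invented features and an invented acceptance rule built on an income and family-size interaction), the 500-row sample is kept small so that plain Python runs quickly, and all function names are hypothetical.

```python
import math
import random

def zscore_params(rows):
    """Column means and standard deviations, computed from training rows only."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def scale(rows, means, stds):
    """Apply z-score normalization so no feature dominates the distance."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = sum(train_y[i] for i in order[:k])
    return 1 if 2 * votes > k else 0

def validation_error(train_x, train_y, valid_x, valid_y, k):
    """Share of validation rows that are misclassified for a given k."""
    wrong = sum(knn_predict(train_x, train_y, q, k) != y
                for q, y in zip(valid_x, valid_y))
    return wrong / len(valid_y)

# Synthetic stand-in data (NOT the case data): income in $000 and family size,
# with an invented acceptance rule that depends on their interaction.
rng = random.Random(0)
rows, labels = [], []
for _ in range(500):
    income, family = rng.uniform(10, 200), rng.randint(1, 4)
    rows.append([income, family])
    labels.append(1 if income > 120 and family >= 3 else 0)

# 60/40 partition, then scale both partitions with training statistics.
train_rows, valid_rows = rows[:300], rows[300:]
train_y, valid_y = labels[:300], labels[300:]
means, stds = zscore_params(train_rows)
train_x, valid_x = scale(train_rows, means, stds), scale(valid_rows, means, stds)

# Odd k values avoid tied votes in a two-class problem.
errors = {k: validation_error(train_x, train_y, valid_x, valid_y, k)
          for k in (1, 3, 5, 7, 9)}
best_k = min(errors, key=errors.get)
print(errors, "best k:", best_k)
```

Scaling with training-set statistics only keeps information from the validation set out of the model, so the validation error remains an honest estimate.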

Implementation Roadmap

Critical Path

Phase 1: Data Preparation (Weeks 1-3). Normalize all continuous variables (Age, Income, CCAvg). Convert categorical variables (Education) into dummy variables. This is the prerequisite for any distance-based algorithm.
Phase 2: Model Training and Selection (Weeks 4-6). Run k-NN, Logistic Regression, and Classification Trees on the 60 percent training set. Evaluate performance on the 40 percent validation set using a confusion matrix.
Phase 3: Optimization (Weeks 7-8). Focus on the Decile Lift Chart. The goal is for the top 10 percent of customers, ranked by predicted probability, to contain at least 50 percent of the actual converters, which corresponds to a first-decile lift of 5.
Phase 4: Pilot Deployment (Weeks 9-12). Execute a live marketing campaign on a subset of 1,000 customers using the model predictions. Compare results against a control group.
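
The Phase 3 target can be expressed as a small decile-lift calculation. The sketch below assumes a validation set of 2,000 records with 192 converters (40 percent of the data at the 9.6 percent base rate) and invented scores arranged so that the top decile holds exactly half of the converters. The scores and function names are illustrative only.

```python
def rank_by_score(scores, actual):
    """Actual outcomes ordered from highest to lowest predicted score."""
    pairs = sorted(zip(scores, actual), key=lambda t: t[0], reverse=True)
    return [a for _, a in pairs]

def decile_lift(scores, actual):
    """Lift per decile: response rate in the decile divided by the overall rate."""
    ranked = rank_by_score(scores, actual)
    n = len(ranked)
    overall = sum(ranked) / n
    lifts = []
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        lifts.append((sum(chunk) / len(chunk)) / overall)
    return lifts

def top_decile_capture(scores, actual):
    """Share of all converters that fall in the top-scored 10 percent."""
    ranked = rank_by_score(scores, actual)
    top = ranked[:len(ranked) // 10]
    return sum(top) / sum(ranked)

# Invented example: 200 high-scored customers with 96 converters,
# 1,800 low-scored customers with the remaining 96 converters.
actual = [1] * 96 + [0] * 104 + [1] * 96 + [0] * 1704
scores = [0.9] * 200 + [0.1] * 1800

lifts = decile_lift(scores, actual)
capture = top_decile_capture(scores, actual)
print(round(lifts[0], 2), capture)   # 5.0 0.5
```

Capturing 50 percent of converters in the top 10 percent of the list means a 48 percent response rate in that decile against a 9.6 percent baseline, a lift of 5.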

Key Constraints

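Data Quality: ZIP Code is a high-cardinality categorical variable. Dummy-coding it as-is would add a large number of sparse columns across the 5,000 records, and treating it as a number would create meaningless distances. Unless ZIP Codes are grouped into regions or the variable is dropped, the model will suffer from the curse of dimensionality.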
Class Imbalance: With only 480 positive cases, the model may struggle to learn the characteristics of the minority class without oversampling or adjusting the cutoff probability.

Risk-Adjusted Implementation Strategy

To mitigate the risk of model decay, the bank must implement a feedback loop where the results of the pilot campaign are fed back into the training set. If the k-NN model shows high variance, the team should pivot to an Ensemble method (Random Forest) in the second iteration to improve stability. The implementation assumes a 0.5 probability cutoff, but this must be adjusted based on the actual costs of marketing versus the revenue of a loan.
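
One way to reason about the cutoff adjustment is a break-even rule: contacting a customer has positive expected profit when the predicted probability exceeds the cost per contact divided by the revenue per conversion. The sketch below uses hypothetical figures ($5 per contact, $100 net revenue per accepted loan), because the case lists both costs as information gaps. The function names and the four example probabilities are also invented.

```python
def breakeven_cutoff(cost_per_contact, revenue_per_conversion):
    """Probability above which a contact has positive expected profit."""
    if revenue_per_conversion <= 0:
        raise ValueError("revenue_per_conversion must be positive")
    return cost_per_contact / revenue_per_conversion

def expected_profit(probs, cutoff, cost_per_contact, revenue_per_conversion):
    """Expected campaign profit when contacting everyone scored at or above cutoff."""
    return sum(p * revenue_per_conversion - cost_per_contact
               for p in probs if p >= cutoff)

# HYPOTHETICAL economics: the case does not state these values.
COST, REVENUE = 5.0, 100.0
probs = [0.02, 0.08, 0.30, 0.60]       # invented predicted probabilities

cutoff = breakeven_cutoff(COST, REVENUE)                      # 0.05
profit_default = expected_profit(probs, 0.5, COST, REVENUE)   # only 0.60 is contacted
profit_tuned = expected_profit(probs, cutoff, COST, REVENUE)  # 0.08, 0.30, 0.60 contacted
print(cutoff, profit_default, profit_tuned)
```

Under these assumed figures the break-even cutoff is 0.05, far below 0.5, and lowering the cutoff raises expected profit in the example from about 55 to about 83. The real cutoff depends on the costs the bank has yet to supply.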

Executive Review and BLUF

BLUF

Universal Bank must replace its current marketing approach with a k-Nearest Neighbors (k-NN) predictive model. The previous campaign converted only 9.6 percent of customers, so most of the marketing spend went to non-responsive customers. By partitioning the data and applying supervised learning, the bank can identify the customer profiles most likely to convert, which the analysis expects to be driven largely by the combination of high income, professional education, and existing CD accounts. Implementation should focus on the top two deciles of predicted probability, where the validation lift chart is expected to concentrate the majority of conversions. This transition will increase campaign efficiency and portfolio growth without expanding the underlying risk. APPROVED FOR LEADERSHIP REVIEW.

Dangerous Assumption

The single most consequential premise is that historical data from the previous campaign remains a valid proxy for future behavior. If macroeconomic conditions (e.g., interest rate hikes) have changed since the data was collected, the drivers of loan acceptance will shift, rendering the model obsolete before deployment.

Unaddressed Risks

Algorithmic Bias: The use of ZIP Code data may inadvertently lead to redlining or discriminatory lending patterns, creating significant regulatory and reputational risk. Probability: Medium. Consequence: High.
Overfitting: The model may perform exceptionally well on the validation set but fail in production if it includes noise variables such as ID, which is an arbitrary identifier, or redundant variables such as Experience, which is highly correlated with Age. Probability: High. Consequence: Medium.

Unconsidered Alternative

The analysis focuses on predicting who will accept a loan, but it ignores the probability of default. A more effective strategy would be a two-stage model: first predicting acceptance propensity, then filtering those prospects through a credit-risk model. This ensures the bank is not successfully marketing to high-risk individuals who are likely to accept any credit offer because they cannot obtain it elsewhere.

MECE Analysis of Customer Segments

Segment	Characteristics	Strategic Action
High Propensity / High Credit	High Income, CD Account, Low Mortgage	Primary target for direct solicitation.
High Propensity / Low Credit	High CCAvg, Multi-family, No Securities	Exclude to prevent portfolio risk.
Low Propensity / High Credit	Low Income, Online Only, High Education	Retain as liability customers; do not solicit.
Low Propensity / Low Credit	Low Income, No existing accounts	No action; minimize contact costs.