Risk scores, model interpretability, and LLMs



It's difficult to convey excitement about a computational technique. It's basically impossible if I write like I would in an academic paper. So, today, I'll casually explain why I think risk scores are an interesting class of machine learning models and discuss their applications in data interpretation for both humans and LLMs.

Let's set the stage. You're given a tabular dataset, with a target variable and a set of features. You want to know which features predict the target, and by how much. What do you do? In most cases, you'd fit a GLM or a GAM, estimate coefficients, test for significance, the usual. If you're feeling it, you could also use a tree-based model like XGBoost, look at SHAP values, and pretend you obtained some vital additional insights.

I would like to offer an alternative. Let's use the red wine dataset as an example. It has a few features representing the chemical composition of each wine, and a quality grading ranging from 3 to 8. We'd like to know what impacts wine quality, so we treat this problem as a classic regression task. If you use my library, you can obtain a risk score card that looks like this with relatively little effort (omitting a few features for simplicity):

Base score: 4.7
volatile.acidity      <0.4     [0.4,0.6)  [0.6,0.7)   >=0.7
                      0.4      0.2        0.1         0.0

pH                    <3.3     >=3.3
                      0.1      0.0

sulphates             <0.5     [0.5,0.6)  [0.6,0.7)   >=0.7
                      0.0      0.1        0.3         0.4

alcohol (ABV)         <9.5     [9.5,9.9)  [9.9,10.5)  [10.5,11.3)  >=11.3
                      0.0      0.0        0.2         0.4          0.7

In this table, each range listed next to a feature name is paired with the coefficient directly below it. The ranges are non-overlapping, and the coefficients are chosen to be non-negative. You can evaluate new samples manually: find the range each feature value falls into, add the matching coefficients to the base score, and voilà, you have a prediction. A nice property of scores is that the model's predictions are always constrained to a pre-defined range: they can never fall below the base score plus the smallest coefficient of every feature, or exceed the base score plus the largest. You can obviously constrain the output of any model, but doing so may incur a loss of performance that needs to be carefully evaluated.
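To make the arithmetic concrete, here is a minimal Python sketch that hard-codes the card above and scores a wine by hand. It does not use the library: the names CARD, points, and predict are made up for illustration, and since the card omits a few features, the numbers are only approximate.

```python
from bisect import bisect_right

# Card transcribed from the table above (a few features are omitted there).
# feature -> (cut points between bins, points awarded in each bin)
CARD = {
    "volatile.acidity": ([0.4, 0.6, 0.7], [0.4, 0.2, 0.1, 0.0]),
    "pH": ([3.3], [0.1, 0.0]),
    "sulphates": ([0.5, 0.6, 0.7], [0.0, 0.1, 0.3, 0.4]),
    "alcohol": ([9.5, 9.9, 10.5, 11.3], [0.0, 0.0, 0.2, 0.4, 0.7]),
}
BASE_SCORE = 4.7


def points(feature, value):
    """Look up the points for one feature value.

    Bins are closed on the left, e.g. [0.4, 0.6), so a value equal to a
    cut point belongs to the bin on its right: bisect_right does exactly that.
    """
    cuts, pts = CARD[feature]
    return pts[bisect_right(cuts, value)]


def predict(sample):
    """Base score plus the points of every feature in the sample."""
    return BASE_SCORE + sum(points(f, v) for f, v in sample.items())


wine = {"volatile.acidity": 0.5, "pH": 3.4, "sulphates": 0.6, "alcohol": 10.0}
print(round(predict(wine), 1))  # 4.7 + 0.2 + 0.0 + 0.3 + 0.2 = 5.4

# The prediction range of this (truncated) card follows from the table alone.
lowest = BASE_SCORE + sum(min(pts) for _, pts in CARD.values())
highest = BASE_SCORE + sum(max(pts) for _, pts in CARD.values())
print(round(lowest, 1), round(highest, 1))  # 4.7 6.3
```

The whole model is a lookup and a sum, which is why you can just as well do it with pen and paper.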

Risk scores work under the assumption that there are no feature interactions. This may sound like a strong limitation, but in practice, interaction effects are often small in tabular data, or can be captured with appropriate feature engineering. Unlike with GLMs, we can model non-linear relationships without having to specify them explicitly. You might be thinking that there is no way this model performs that well. Here are my results on the held-out test set, averaged over 50 runs. I'll make sure to make the code available:

--- RMSE ---
GBRS:       Mean = 0.6528, SD = 0.0274
LinearReg:  Mean = 0.6710, SD = 0.0236
XGBoost:    Mean = 0.6013, SD = 0.0217

--- MAE ---
GBRS:       Mean = 0.5019, SD = 0.0108
LinearReg:  Mean = 0.5220, SD = 0.0090
XGBoost:    Mean = 0.6013, SD = 0.0217

Well, it's surprisingly good! I've tested these methods on other datasets and other objectives, and the results are similar (preprint on arXiv). Now here is the hot take: these numbers are not that important. We are talking about a 0.1 change in average MAE for a score ranging from 3 to 8. Other concerns will probably make or break a wine quality prediction system in the real world. The takeaway is not that we have a better MAE and a worse RMSE in this particular example; it's that these models are basically equivalent in terms of performance. I could tune hyperparameters more diligently, or try other methods, and I'll have to do that to get published, but that would be missing the forest for the trees.

Scores let you directly observe the impact of each feature. You can also quantify their relative importance without having to multiply coefficients by feature values, which is what a GLM would require. Now we know that high-alcohol wines get graded higher, and we can quantify roughly by how much. Why? The answer is not in this table. Think of scores as a tool to develop data-driven intuition and build a narrative. I think that's valuable, because decision making is based on factors that are often not quantifiable. In my opinion, inspecting a transparent model such as a score tells you much more than trying to explain a black-box model's behavior post hoc.
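As a rough illustration of reading importance straight off the card, the sketch below ranks features by the spread between their largest and smallest points. Using the spread as an importance measure is my simplification for this example, not something the model defines, and it ignores how many wines actually fall into each range.

```python
# Points per bin, transcribed from the card above (a few features are omitted there).
POINTS = {
    "volatile.acidity": [0.4, 0.2, 0.1, 0.0],
    "pH": [0.1, 0.0],
    "sulphates": [0.0, 0.1, 0.3, 0.4],
    "alcohol": [0.0, 0.0, 0.2, 0.4, 0.7],
}

# Spread = the most a feature can move the prediction on its own.
spread = {feature: max(pts) - min(pts) for feature, pts in POINTS.items()}

# Print features from largest to smallest spread.
for feature, s in sorted(spread.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature:<18}{s:.1f}")
```

On this card, alcohol comes out on top with 0.7 points, and pH barely matters at 0.1.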

Risk scores are used mostly in clinical applications. This sort of rough data-driven assessment is common for screening and determining treatment eligibility. It is infinitely more practical to have a transparent model in those settings rather than a black box, for many reasons. Clinical risk scores are commonly reported in publications, used in clinical trials, and shared across institutions. The best models are not always the ones that make the best predictions; they are the ones that are the most useful.

There is one last particularity of risk scores that I find interesting and that, as far as I know, has not been reported before. You can print them in ASCII, which means you can add them to an LLM context and let the model answer user queries with data-driven insights. This is possible because, while current SOTA reasoning LLMs are still really bad at multiplication, they tend to hallucinate much less when it comes to addition and range checking.

You can use this characteristic to let your user ask something like "My wine has 10% ABV, 0.6 sulphates, and a pH of 3.4, what grade would you expect?". The LLM can then provide a prediction even with partial data, along with the possible changes that would have the most impact. All this without actually calling a tool and doing inference. You can try this out in my chatbot. Is it really that useful? Frankly, I'm not sure, but it might have niche use cases for models that are meant to reason with data.
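If you want to check what the LLM should answer for that example query, here is a small Python sketch of the same reasoning. It is a hand check against the truncated card above, not what the chatbot runs; the function names are hypothetical, and reporting a low/high range for the missing feature is just one way I chose to handle partial data.

```python
from bisect import bisect_right

# Card transcribed from the table above (a few features are omitted there).
CARD = {
    "volatile.acidity": ([0.4, 0.6, 0.7], [0.4, 0.2, 0.1, 0.0]),
    "pH": ([3.3], [0.1, 0.0]),
    "sulphates": ([0.5, 0.6, 0.7], [0.0, 0.1, 0.3, 0.4]),
    "alcohol": ([9.5, 9.9, 10.5, 11.3], [0.0, 0.0, 0.2, 0.4, 0.7]),
}
BASE_SCORE = 4.7


def bin_points(feature, value):
    """Points for the left-closed bin that contains the value."""
    cuts, pts = CARD[feature]
    return pts[bisect_right(cuts, value)]


def partial_prediction(sample):
    """Lowest and highest prediction compatible with the known features."""
    known = sum(bin_points(f, v) for f, v in sample.items())
    missing = [f for f in CARD if f not in sample]
    low = BASE_SCORE + known + sum(min(CARD[f][1]) for f in missing)
    high = BASE_SCORE + known + sum(max(CARD[f][1]) for f in missing)
    return round(low, 1), round(high, 1)


def possible_gains(sample):
    """Points each known feature could still gain by moving to its best bin."""
    return {f: round(max(CARD[f][1]) - bin_points(f, v), 1) for f, v in sample.items()}


# "My wine has 10% ABV, 0.6 sulphates, and a pH of 3.4"
query = {"alcohol": 10.0, "sulphates": 0.6, "pH": 3.4}
print(partial_prediction(query))  # (5.2, 5.6), depending on volatile acidity
print(possible_gains(query))      # {'alcohol': 0.5, 'sulphates': 0.1, 'pH': 0.1}
```

So the expected grade is somewhere between 5.2 and 5.6 on this card, and more alcohol is the change with the most room to help. Everything here is range checks and additions, which is exactly the kind of work the LLM is asked to do.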

So that's why I think scores are cool, and maybe you do too now! I hope you found this post interesting. If you have any questions, comments, or suggestions, please let me know.