Response to discussion on machine learning for IRB models


1: Do you currently use or plan to use ML models in the context of IRB in your institution? If yes, please specify and answer questions 1.1, 1.2, 1.3, 1.4; if no, are there specific reasons not to use ML models? Please specify (e.g. too costly, interpretability concerns, certain regulatory requirements, etc.)

As regards the use of Machine Learning models in the IRB space, our experience in developing and validating IRB models as well as ML models for credit risk purposes suggests that very few Banks are currently using ML models for regulatory purposes in the EU context. By contrast, in many cases we have observed that Banks are using ML models for managerial purposes, such as credit approval, monitoring, limit setting, etc. This tendency, which keeps widening, is creating a gap that jeopardizes compliance with the use test requirements.
Regarding the future use of ML models for IRB purposes, our view is that many Banks are waiting for Regulatory guidance on the use of ML models. In fact, we are aware of several Banks either currently working on the development of such models (so far mainly for managerial purposes, in some cases with the intention of also using them for capital calculation) or already finalizing them, essentially waiting to see whether they can submit them for Regulatory approval.
We think the reasons why there have so far been few IRB applications based on ML models are the following:
- as said, most Banks would not risk investing in Regulatory models while they are waiting for Regulatory Guidelines or, at least, for the results obtained by first movers;
- the pandemic most likely slowed down the development of ML models (a trend that appears to have started shortly before the Covid-19 outbreak), as other urgent matters have emerged in the last two years;
- many Banks still lack sufficient, and sufficiently widespread, internal knowledge to manage the development and validation process;
- linked to the previous point, the required awareness at Senior Management level is not yet adequate;
- some crucial aspects of ML models, such as representativeness, stability, overfitting and interpretability, require specific skills to be properly treated and addressed;
- data and new data sources with the historical depth needed to meet the Regulatory requirements are still scarce.

1.1: For the estimation of which parameters does your institution currently use or plan to use ML models, i.e. PD, LGD, ELBE, EAD, CCF?

In our experience, Banks which are using ML models for IRB purposes have developed ML models for PD.
As regards the use of ML models for managerial purposes, we have observed that Institutions are mostly using or planning to use ML models for credit approval / loan origination, limit setting and monitoring (i.e. in a PD estimation perspective), as well as for LGD, collection, recovery optimization and collateral evaluation (i.e. in an LGD, LGD in-default and ELBE estimation perspective, where models are used for both performing and defaulted assets). In a few cases Banks are using ML models for pre-payment prediction on instalment and mortgage loans, or planning to develop models for EAD and/or CCF estimation.

1.2: Can you specify for which specific purposes these ML models are used or planned to be used? Please specify at which stage of the estimation process they are used, i.e. data preparation, risk differentiation, risk quantification, validation.

As specified in the answer to Question 1.1, Institutions are mostly using ML models for managerial purposes, especially in the areas of credit approval / loan origination, limit setting, monitoring, LGD, collection, recovery optimization and collateral evaluation.
The Institutions we have been following tend to use Machine Learning techniques for the Short List selection, for model estimation, or (usually) for both. Moreover, ML methods are widely used to treat unstructured data (i.e. free text / comments), with the aim of processing it and feeding the model, as well as for validation purposes (e.g. challenger ML models). We have not observed use cases in risk quantification, as the calibration phase is generally executed with traditional techniques.

1.3: Please also specify the type of ML models and algorithms (e.g. random forest, k-nearest neighbours, etc.) you currently use or plan to use in the IRB context?

As stated in the answer to Question 1.2, the Institutions we have been following tend to use Machine Learning techniques for the Short List selection and model estimation. Models are usually not estimated with extremely complex methods (such as Neural Networks or complex versions of XGBoost); instead, a good balance between predictive power / accuracy and complexity / interpretability is sought by leveraging, for instance, simpler and therefore more interpretable versions of Random Forest or XGBoost (where, if the model is properly tested, the risk of overfitting is reduced and, more importantly, the model remains interpretable). In most cases an additional module is developed using ML techniques and then integrated into the “traditional” IRB model framework, thus retaining the possibility to switch off the ML module if needed, for instance to reduce the risk of rejection during the Regulatory approval process.
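For illustration only, the following is a minimal sketch (not any specific Bank's implementation) of the kind of deliberately shallow, depth-constrained tree ensemble described above; the dataframe, column names and hyperparameter values are hypothetical placeholders.

```python
# Minimal sketch: a shallow Random Forest kept interpretable by constraining
# tree depth and leaf size, with train vs. out-of-time AUC as a first check.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def fit_constrained_rf(train: pd.DataFrame, oot: pd.DataFrame,
                       features: list, target: str = "default_flag"):
    """Fit a depth-constrained Random Forest and compare train vs. out-of-time AUC."""
    model = RandomForestClassifier(
        n_estimators=200,
        max_depth=4,           # shallow trees -> fewer interactions, easier to interpret
        min_samples_leaf=200,  # large leaves reduce overfitting on small pockets of data
        random_state=42,
    )
    model.fit(train[features], train[target])
    auc_train = roc_auc_score(train[target], model.predict_proba(train[features])[:, 1])
    auc_oot = roc_auc_score(oot[target], model.predict_proba(oot[features])[:, 1])
    return model, auc_train, auc_oot
```

A large gap between the two AUC values would flag overfitting before the module is integrated into the traditional framework.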

1.4: Are you using or planning to use unstructured data for these ML models? If yes, please specify what kind of data or type of data sources you use or are planning to use. How do you ensure an adequate data quality?

As stated in the answer to Question 1.2, the ML models described above make use of unstructured data. For example, a widespread ML model currently used by Banks for managerial purposes (and in a few cases also for regulatory purposes) is the transactional model, i.e. a specific module developed for PD estimation purposes (both in PD models and in Early Warning Systems) and then integrated into the “traditional” model structure (e.g. financial information, internal behavioural and qualitative data, etc.). Such a module is based on the intra-day transactions of the current accounts and credit cards associated with the evaluated client: transactions carry a free-text description (for example, the reason for the payment and a short description) and therefore need to be categorized for modelling purposes via ML techniques, such as Natural Language Processing methods.
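As a purely illustrative sketch of such a free-text categorization step, assuming a small labelled sample of transaction descriptions is available, a simple TF-IDF plus linear-classifier pipeline could look as follows (texts, categories and the choice of classifier are hypothetical):

```python
# Minimal sketch: categorizing free-text payment descriptions via NLP.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["salary january", "pos payment supermarket", "loan instalment", "atm withdrawal"]
labels = ["salary", "consumption", "debt_service", "cash"]

# Character n-grams are robust to the abbreviations and typos common in payment descriptions.
categorizer = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
categorizer.fit(texts, labels)
print(categorizer.predict(["supermarket pos 123"]))  # e.g. ['consumption']
```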
Since such a data source has not been widely used so far, its quality is for the time being questionable, and the Data Quality process is therefore crucial to avoid introducing strong biases in the model. For this reason, the Data Quality process is extensive and comprises the typical DQ checks carried out for IRB models (technical and business checks as well as consistency checks within and between different tables/sources, for instance monthly internal behavioural balance data vs. intra-day transactions), accompanied by additional checks and imputation techniques for the treatment of transaction characteristics and their classification, aimed at minimizing missing or wrong attributions of the transaction type and therefore increasing the coverage of the data source. In any case, the transaction categorization is based on other attributes of the transactional database, with a dedicated algorithm tailored to the Bank’s data, and is finally back-tested in order to detect potential sources of distortion.
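A minimal sketch of one such consistency check, reconciling the monthly balance movement with the sum of intra-day transactions per account, is shown below; the column names, table layout and tolerance are hypothetical assumptions.

```python
# Minimal sketch: flag account-months where the balance delta is not explained
# by the booked intra-day transactions (columns are placeholders).
import pandas as pd

def check_balance_vs_transactions(balances: pd.DataFrame, transactions: pd.DataFrame,
                                  tol: float = 0.01) -> pd.DataFrame:
    monthly_flows = (
        transactions
        .assign(month=transactions["value_date"].dt.to_period("M"))
        .groupby(["account_id", "month"])["amount"].sum()
        .rename("sum_transactions")
    )
    merged = (balances.set_index(["account_id", "month"])
              .join(monthly_flows, how="left")
              .fillna({"sum_transactions": 0.0}))
    merged["gap"] = (merged["balance_delta"] - merged["sum_transactions"]).abs()
    merged["dq_flag"] = merged["gap"] > tol * merged["balance_delta"].abs().clip(lower=1.0)
    return merged.reset_index()
```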

2: Have you outsourced or are you planning to outsource the development and implementation of the ML models and, if yes, for which modelling phase? What are the main challenges you face in this regard?

In general, many Banks are outsourcing and/or planning to outsource the development of ML models for all the phases of the development process (i.e. the outsourcing is not limited to specific phases only). It is important to highlight that, on the one hand, many Banks are now organizing or planning to organize specific training on ML methods, in order to increase the knowledge of such methods among their employees, while, on the other hand, Institutions are increasingly hiring professionals with ML expertise and building up dedicated teams of Data Scientists.
Nevertheless, even if the model development process is quite often outsourced (not always, as some Banks manage everything in-house), the Banks' professionals follow the development process closely, which makes the challenges less problematic and ensures that they improve their expertise in the ML field. The two main challenges in this regard are:
- expertise in the Data Science field, which is still quite limited in most cases;
- in some cases, difficulty in gaining user acceptance of such innovative methods (by Senior Management and especially by credit experts).

3: Do you see or expect any challenges regarding the internal user acceptance of ML models (e.g. by credit officers responsible for credit approval)? What are the measures taken to ensure good knowledge of the ML models by their users (e.g. staff training, adapting required documentation to these new models)?

As clarified in the answer to Question 2, we foresee some challenges in the internal user acceptance of ML models, at least at the initial stage. In particular, making users accept such innovative methods in the absence of specific expertise is going to be challenging. Nevertheless, in our experience some specific measures can definitely mitigate such drawbacks. For instance, we have observed that training on ML methods is beneficial in this respect (more technical for quantitative staff, rather “high level” for credit officers). It is also crucial, in order to achieve full user acceptance of such models and methods, to involve the experts during the whole development process (as prescribed by the IRB Regulation as well), from the choice of the data source and its categorization up to the Long List construction and the model selection phase. Moreover, a key role is played by model interpretability techniques, such as Shapley Values and LIME.

4: If you use or plan to use ML models in the context of IRB, can you please describe if and where (i.e. in which phase of the estimation process, e.g. development, application or both) human intervention is allowed and how it depends on the specific use of the ML model?

As already reported in the answers to the previous questions, in our experience some Institutions have applied (or are planning to apply) for IRB approval with transactional-based modules that are integrated into the traditional PD model framework as an additional component. ML techniques have been used to transform data and build the Long List of factors, as well as in the Short List selection and model estimation phases. Models are applied and implemented coherently with the estimation process.
Human intervention is allowed in the estimation process, in the Long List construction as well as in the Short List derivation and model estimation phases, where human judgment is crucial, for example, in excluding some factors ex-ante or preferring some features over others in the Short List or in the final model configuration.
Finally, in line with the IRB requirements and provided that all the interpretability requirements are met, the ex-post correction of the model output is allowed under specific circumstances, consistently with a properly designed and regulated override process.

5. Do you see any issues in the interaction between data retention requirements of GDPR and the CRR requirements on the length of the historical observation period?

Concerning the transactional module described in the previous answers, we do not see any conflict between the length of the historical series to be used for model development, as prescribed by the CRR for IRB models, and the data retention rules of the GDPR. Nevertheless, in other cases there might be general issues with data retention and model replicability (for instance, in case of use of Web data or other data sources).

6.a) Methodology (e.g. which tests to use/validation activities to perform).

As already specified in the previous answers, we have experience with ML models used for estimating credit risk, both for regulatory and managerial purposes. Our experience is focused on the use of ML models in the risk differentiation phase only, while risk quantification is always carried out in a traditional way.
As far as the main challenges are concerned, the main topics causing concern (e.g. interpretability, overfitting, ethics, stability, etc.) are manageable for the majority of models (with the exception of complex methods, such as Neural Networks) with currently available Data Science technologies and methods. However, EBA's concerns are fully grounded when read in the right perspective: Banks' structures at all levels (from modellers to validation, up to credit risk analysts and top management) need education and awareness on Artificial Intelligence and Machine Learning topics, in order to avoid the development and use of ML models detached from the credit processes and their logics, which in the worst-case scenario might jeopardize the Bank's business (a situation that may occur when the development process and the use of models in the daily business are not properly steered by credit experts).
Nevertheless, several crucial aspects have to be addressed in developing ML models, such as:
a) Methodology
- Calibration: for the time being, we have observed that the use of such models is limited to the risk differentiation phase, while risk quantification is typically conducted in a more traditional way.
- Representativeness: severe issues might arise, therefore we would recommend following the Regulatory Guidelines for IRB model validation and even extending the representativeness analysis to all the drivers of the Short List, since they are potential candidates for the model (and, in case of issues in the PSI test, excluding such drivers from the model selection; a minimal sketch of a PSI check is provided after this list); moreover, it is crucial that all IRB requirements on the historical depth of data for model estimation are met.
- Stability: models should be very carefully tested from a stability perspective; for instance, a set of tests aimed at measuring predictive power and accuracy over time and across macro scenarios has to be foreseen with the proper frequency, reflecting the potentially higher volatility of ML models' outputs compared with traditional models.
- Overfitting: we deem it necessary to test such models with different techniques, such as Bootstrapping and Cross-Validation; in all cases, the training sample has to be separated from the test sample, a third sample (the hold-out) has to be used to test hyperparameters, and the model finally has to be thoroughly assessed on a fully out-of-time sample as a necessary requirement for acceptance. In addition, specific tests plotting the loss of predictive power or accuracy on an out-of-time sample, as a function of the hyperparameters, have to be run as part of the validation process.
- PiT-ness: ML models might be overly Point-in-Time and might therefore make stress test exercises more difficult, providing biased results; special attention has to be paid to avoiding overly PiT models, for instance by not using factors calculated over very short time horizons (which might reduce the benefit in predictive power, but at the same time makes the model more TTC, i.e. more stable and less volatile, as required by the IRB Regulation).
- Data sources: special attention has to be paid to the data used to develop models, considering not only GDPR requirements and other EU or local Regulations, but also their reliability, and therefore “certification”, as well as the possibility of being “frozen” and stored to ensure the replicability of IRB models (for instance in case of use of Web data or any data which cannot be stored at all or, at least, not for a period sufficient to meet the IRB Regulatory requirements).
- Rating process: in terms of the rating assignment process, the use of complex (and therefore not easily interpretable) models would endanger the typical IRB rating attribution process, which entails the possibility of expert judgement, e.g. overrides. In such a situation overrides could not be applied, as the analyst would not be in a position to understand what the model lacks with respect to their expert opinion and, therefore, where manual intervention is necessary (the same applies during the development process, where expert involvement is prescribed in several phases).
- Replicability: model developments based on large-scale parallel computing infrastructures (which are particularly useful when dealing with big data samples) pose several challenges regarding the exact replicability of the development steps and results, which are not addressed by simple common approaches (e.g. the specification of pseudo-random seeds in the model development phase). In turn, non-replicable models cannot be validated for IRB purposes, which again suggests an appropriate level of training of both model development and validation staff, in order to guarantee the certification of the whole model development process. In addition, part of the difficulty in ensuring replicability lies in the very fast evolution of the open-source libraries and environments typically used for ML models: updating to more recent versions is often mandatory for security reasons (security patches), which makes it hard to freeze the development environments and requires finding the right trade-off between security and freezing needs.
- Data quality: dealing with new information sources, which are generally less well known than traditional data sources, poses challenges in terms of their appropriate management for model development purposes. This is of course particularly true for unstructured data sources (e.g. free texts, Web data), but even structured big data may require several data quality checks and treatments before being used for model development (e.g., in the transactional data example: handling technical account transactions, managing transactions between current accounts of the same owner, checking the coherence between transactions and current account balances, etc.).
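As referenced in the Representativeness item above, a minimal sketch of a Population Stability Index (PSI) check on a single driver follows; bucket construction and the indicative reading thresholds are conventional choices, not prescribed values.

```python
# Minimal sketch: PSI between the development ("expected") and a recent
# ("actual") distribution of one candidate driver.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_buckets: int = 10) -> float:
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_buckets + 1)))
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty buckets at a small value to avoid log(0).
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Common indicative reading: PSI < 0.10 stable, 0.10-0.25 to be investigated, > 0.25 material shift.
```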

6.b) Traceability (e.g. how to identify the root cause for an identified issue).

b) Traceability
Concerning traceability, it is strictly connected with the interpretability of models, which, as said before, is manageable for the majority of models (with the exception of complex methods, such as Neural Networks) with the currently available Data Science technologies and methods. For instance, the use of Shapley Values, LIME and graphical analysis / inspection, together with the adoption of not-too-complex approaches (e.g. avoiding too deep XGBoost configurations) characterized by a relatively limited number of factors, makes models as interpretable as linear models. Such models, which are not too complex and far from being a black box, still perform very well in terms of predictive power. Moreover, the metrics / tools described above should be made available to credit officers (at individual level) to facilitate their assessments and decisions (for instance, by integrating them into the dashboards used for credit origination, pricing, etc.).
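For illustration, a minimal sketch of the global interpretability tooling mentioned above (Shapley values plus graphical inspection) is shown below, assuming a fitted tree-based model and the availability of the `shap` package; `model` and `X_train` are placeholders.

```python
# Minimal sketch: global feature contributions for a tree ensemble via Shapley values.
import shap

explainer = shap.TreeExplainer(model)            # efficient Shapley values for tree ensembles
shap_values = explainer.shap_values(X_train)     # one contribution per feature per observation

# Global view: mean absolute contribution per feature across the whole sample.
shap.summary_plot(shap_values, X_train, plot_type="bar")
```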

6.c) Knowledge needed by the validation function (e.g. specialised training sessions on ML techniques by an independent party).

c) Knowledge needed by the validation function
We have observed that in many Banks the knowledge of ML methods is still neither deep enough nor sufficiently widespread to internally manage the development and validation process; moreover, the required awareness at Senior Management level is not yet adequate.
More specifically, regarding the validation process, it is necessary to:
- organize dedicated training sessions for the professionals who are supposed to validate ML IRB models, since they need to be in a position to understand (and replicate) the model rationale as well as the whole development process in detail during the qualitative assessment of the model;
- envisage an integration of the traditional IRB validation framework to assess crucial aspects of ML models, both from a qualitative and a quantitative perspective, such as stability, interpretability, overfitting, hyperparameter testing, representativeness, etc.; in this respect, the process of setting up thresholds for the traffic-light approach (typically used by many Banks for IRB purposes) might be difficult initially, given the lack of benchmarks (an illustrative traffic-light rule is sketched after this list);
- consider a higher frequency of model monitoring or validation, in light of the instability or tendency towards fast deterioration in predictive power of some ML models.
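As referenced above, a minimal illustrative sketch of a traffic-light rule for ML-specific validation metrics follows; the thresholds are purely hypothetical and would need to be calibrated per Bank and per model.

```python
# Minimal sketch: map an out-of-time AUC drop and a PSI value to a traffic-light outcome.
def traffic_light(auc_drop: float, psi: float) -> str:
    if auc_drop > 0.05 or psi > 0.25:
        return "red"     # e.g. trigger re-estimation / model change assessment
    if auc_drop > 0.02 or psi > 0.10:
        return "amber"   # e.g. trigger more frequent monitoring
    return "green"

print(traffic_light(auc_drop=0.03, psi=0.08))  # -> 'amber'
```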

6.d) Resources needed to perform the validation (e.g. more time needed for validation)?

d) Resources needed to perform the validation
In addition to the proper training of the Validation Function, which is generally not sufficient on its own, at least one senior expert in ML models should supervise the validation process during the first rounds, with an eye on the most critical aspects described above. Moreover, in light of the considerations made at point c), it is necessary to extend the elapsed time of the validation process, since more tests have to be performed and more aspects investigated. Finally, in some specific cases (e.g. complex or unstable models) it would be beneficial to increase the frequency of the model monitoring or validation process or, at least, to design a process where the typical traffic-light approach triggers a more frequent validation in case of bad outcomes or alerts during a validation round.

7: Can you please elaborate on your strategy to overcome the overfitting issues related to ML models (e.g. cross-validation, regularisation)?

We typically adopt different techniques to mitigate overfitting, among them (a sketch of the hyperparameter back-test is shown after this list):
- Use of three samples to validate the model (training, test, hold-out).
- Use of Cross-Validation as well as Bootstrapping to select and test the model.
- Back-testing on different out-of-time samples, in addition to the out-of-sample and hold-out samples.
- Analysis of model stability.
- Back-testing of hyperparameters, by plotting the predictive power of the model on the development sample vs. its predictive power on different out-of-time snapshots, in order to measure the gap (a proxy of overfitting) and its tendency to deteriorate.
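As referenced in the last bullet, the following is a minimal sketch of such a hyperparameter back-test; the data sets (`X_train`, `y_train`, `X_oot`, `y_oot`), the choice of gradient boosting and the depth range are placeholders.

```python
# Minimal sketch: track the gap between development-sample and out-of-time AUC
# as model complexity grows, as a proxy of overfitting.
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

depths, auc_dev, auc_oot = range(1, 8), [], []
for d in depths:
    gbm = GradientBoostingClassifier(max_depth=d, n_estimators=200, random_state=0)
    gbm.fit(X_train, y_train)
    auc_dev.append(roc_auc_score(y_train, gbm.predict_proba(X_train)[:, 1]))
    auc_oot.append(roc_auc_score(y_oot, gbm.predict_proba(X_oot)[:, 1]))

plt.plot(depths, auc_dev, label="development sample")
plt.plot(depths, auc_oot, label="out-of-time sample")
plt.xlabel("max_depth (complexity)")
plt.ylabel("AUC")
plt.legend()
plt.show()
```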

8: What are the specific challenges you see regarding the development, maintenance and control of ML models in the IRB context, e.g., when verifying the correct implementation of internal rating and risk parameters in IT systems, when monitoring the correct functioning of the models or when integrating control models for identifying possible incidences?

As already described in the previous answers, the challenges in development and validation touch different aspects. From the point of view of ML model management, a more comprehensive validation framework as well as more frequent and more attentive monitoring and validations would be beneficial, with the inclusion of tests and thresholds related to the new aspects to be assessed (such as stability, overfitting, decreases in predictive power, etc.), the results of which should trigger specific actions, for instance model refinement or re-estimation, in a wider model risk management perspective.
With regard to the implementation phase, the typical checks on correct implementation, on the soundness of inputs and outputs, and all the other usual Data Quality checks should be performed, as for a traditional IRB model.
Finally, the main challenge in the implementation phase lies in the rating attribution process: if the credit officers are not involved in the development process, or if they do not understand the model and its effects, they might tend to reject the model, might not be able to apply any override or, even worse, might apply too many overrides, often generating a sort of “double counting” of factors, since in most cases the model already considers the factor the analyst would like to take into account, but the credit officer simply does not know or understand this. Model interpretability tools, at both global and local level, play a key role in mitigating such risks, as better described in the answer to Question 15.

9: How often do you plan to update your ML models (e.g. by re-estimating parameters of the model and/or its hyperparameters)? Please explain any related challenges with particular reference to those related to ensuring compliance with Regulation (EU) No 529/2014 (i.e. materiality assessment of IRB model changes).

It is beneficial and strongly recommended to test models more than once a year, for example twice a year, or at least to increase the frequency of the model monitoring process. On average, ML models are re-estimated every three years or when the monitoring dashboard or the validation framework outputs trigger a re-estimation, a refinement or any other material model change.
As regards compliance with the Regulation on Material Model Changes, all requirements have to be met as for other IRB models. In light of such requirements, very frequent re-estimation of ML models is not recommended, and a good compromise has been found in a three-year cycle (even if in some cases a more frequent re-estimation might be beneficial to maintain or improve the model's performance). Clearly, in order to meet the target of re-estimating the model roughly every three years, the performance of the model has to remain approximately stable over this time window, avoiding systematic triggers of earlier re-estimations due to sudden drops in predictive power. This is one of the reasons why we concentrate strongly on the control of overfitting, as better described in the answer to Question 6.

10: Are you using or planning to use ML for credit risk apart from regulatory capital purposes? Please specify (i.e. loan origination, loan acquisition, provisioning, ICAAP).

Most of the ML models we have developed are used for managerial purposes, in particular: loan origination, loan monitoring, loan pricing, limit setting, provisioning, income estimation, collateral evaluation, collection / recovery and their optimization.

11. Do you see any challenges in using ML in the context of IRB models stemming from the AI act?

In general, we see all the challenges stated in the AI Act, especially in the case of more complex methods, such as Neural Networks. On the other hand, if the ML model is not too complex, is developed and validated properly following all the typical IRB standards, meets all the interpretability requirements and, finally, the awareness of ML methods becomes widespread within the Institution, then we do not see any issue.

12. Do you see any additional challenge or issue that is relevant for discussion related to the use of ML models in the IRB context?

The use of Web data in IRB models poses challenges with respect to its storage, reliability and auditability, and to the replicability of the models.
From a methodological standpoint, the likely tendency to use as many pieces of information and drivers as possible in the model estimation, aimed at finding hidden patterns and without analysing the potential drawbacks simply because it is almost zero-cost (thanks to the power of current Big Data infrastructures), might pose serious challenges in terms of the correlation between regulatory parameters, with a potentially dangerous underestimation of capital (for instance when the same transactional data are used in both PD and LGD models), due to the characteristics of the RWA functions, which do not account for the correlation between PD and LGD except through the LGD downturn add-on. This concern is even more relevant in the case of complex AI methods, where the interpretability of the model is questionable or where the model is a black box by nature.

13: Are you using or planning to use ML for collateral valuation? Please specify.

Yes, we are using ML models for collateral evaluation, for example a Real Estate valuation model which, for some of the Institutions we have been working with, is an input to the IRB LGD models.

14. Do you see any other area where the use of ML models might be beneficial?

Concerning the credit risk space, we see the use of ML models as beneficial in the areas described in the answer to Question 10. Moreover, we see it as very useful in the following areas:
- Automatic Data Quality treatment (i.e. an ML framework to detect and treat DQ issues).
- Validation of traditional models using ML challenger models.
- Detection of hidden connections among clients as well as definition of groups of connected clients (e.g. supply chains, economic groups of clients, etc.).
Regarding other risk areas, the application of ML models would cover a dozen use cases, such as AML, fraud detection, operational risk, pre-payment models, etc.

15: What does your institution do to ensure explainability of the ML models, i.e. the use of ex post tools to describe the contribution of individual variables or the introduction of constraints in the algorithm to reduce complexity?

We adopt the following approach as far as the interpretability of the final model is concerned:
- Shapley Values for general model interpretability.
- LIME for model interpretability.
- Graphical analysis on each factor and on the overall model.

The use of such tools allows for easy global model interpretability, i.e. it makes it possible to explain the impact of a specific feature on the model output over the entire training data set. Global interpretability is necessary to ensure comprehension of the main drivers that generate the model outcome, but with advanced ML models able to capture non-linear effects, the analysis of global interpretability is generally not enough: the model may change its behaviour for low/high values of the variables or for particular combinations of model drivers.
For this reason, it is necessary to explain how the model behaves in the proximity of the instance being predicted, using, for example, Shapley Values for local model explainability. Based on cooperative game theory, Shapley values describe how to fairly distribute the payout (in the IRB case, for example, the final model score) among the features depending on their contribution to the total. In this way it is possible to understand the model output at single-instance level (e.g. single client level), providing significant support in eliminating the black-box effect which often characterizes advanced ML models.
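For illustration, a minimal sketch of such a local (single-client) explanation follows, assuming a fitted tree-based model and a recent version of the `shap` package (with the Explanation API); `model`, `X` and the row index are placeholders.

```python
# Minimal sketch: Shapley contributions for a single obligor.
import shap

explainer = shap.TreeExplainer(model)
client = X.iloc[[42]]                  # one obligor to be explained
contrib = explainer(client)            # Shapley contributions for this client only
shap.plots.waterfall(contrib[0])       # how each feature moves the score from the base value
```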

With respect to the introduction of constraints, we prefer simpler models to very complex ones, especially when the difference in predictive power and accuracy is not large enough to justify the complexity and its consequences. We usually select the model parsimoniously, i.e. by plotting the complexity of the model (expressed in terms of number of factors and their interpretability) against its predictive power and stopping the selection mechanism when, on an expert basis, the best trade-off between complexity and performance gain is reached (a minimal sketch of this criterion is shown below).
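The following sketch illustrates one possible formulation of this parsimonious selection criterion under stated assumptions: features are added greedily and selection stops once the out-of-sample AUC gain falls below an expert-set threshold. Data, the candidate list and the threshold are hypothetical.

```python
# Minimal sketch: greedy, parsimonious feature selection with an AUC-gain stopping rule.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def select_parsimonious(candidates, X_tr, y_tr, X_te, y_te, min_gain=0.002):
    selected, best_auc = [], 0.5
    for feat in candidates:                 # candidates assumed pre-ranked (e.g. by univariate power)
        trial = selected + [feat]
        model = RandomForestClassifier(max_depth=4, n_estimators=200, random_state=0)
        model.fit(X_tr[trial], y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te[trial])[:, 1])
        if auc - best_auc < min_gain:       # extra complexity no longer pays off -> stop
            break
        selected, best_auc = trial, auc
    return selected, best_auc
```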
Moreover, we perform the following analyses during the development process, in order to find the best candidate:
- Use of Cross-Validation as well as Bootstrapping to select and test the model.
- Back-testing on different out-of-time samples, in addition to the out-of-sample and hold-out samples.
- Analysis of model stability.
- Back-testing of hyperparameters, by plotting the predictive power of the model on the development sample vs. its predictive power on different out-of-time snapshots, in order to measure the gap (a proxy of overfitting) and its tendency to deteriorate.

16. Are you concerned about how to share the information gathered on the interpretability with the different stakeholders (e.g. senior management)? What approaches do you think could be useful to address these issues?

We are concerned only to the extent that there is not enough awareness of ML models, which is a potential issue for the time being. For the ML models described in the previous answers we are not concerned, as all the proper measures are taken in terms of interpretability. On the other hand, we would be concerned if we had to explain complex ML methods, for which interpretability cannot be ensured at the moment.
In our view, the best approach is to prepare a document in which the tests carried out are presented and explained, in order to show how the model works in real life, together with practical examples of the model's use in the daily business and evidence of the use of the model by credit officers and other relevant stakeholders.

17: Do you have any concern related to the principle-based recommendations?

No; on the contrary, principle-based recommendations, similar to the recommendations provided for traditional IRB models, would be very beneficial in providing clarity and would most likely encourage Institutions to apply for IRB approval with ML models, thus closing the gap between regulatory and managerial models, which is otherwise likely to become more problematic in the future.

Name of the organization

Prometeia SpA