Journal of Systems & Management, 2022, Vol. 31, Issue (2): 255-269. DOI: 10.3969/j.issn.1005-2542.2022.02.005


A Credit Risk Evaluation Model for Imbalanced Data Classification Based on Class Balanced Loss Modified Cross Entropy Function

YANG Lian 1,2, SHI Baofeng 1,2, DONG Yizhe 3

  1. College of Economics and Management, Northwest A&F University, Yangling 712100, China; 2. Research Center on Credit and Big Data Analytics, Northwest A&F University, Yangling 712100, China; 3. University of Edinburgh Business School, Edinburgh, EH8 9JS, UK
  • Online: 2022-03-28  Published: 2022-04-07

Chinese title (translated): A Credit Risk Evaluation Model for Imbalanced Samples Based on Class Balanced Loss Modified Cross Entropy

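YANG Lian 1,2, SHI Baofeng 1,2, DONG Yizhe 3

  1. College of Economics and Management, Northwest A&F University, Yangling 712100, Shaanxi, China; 2. Research Center on Credit and Big Data Analytics, Northwest A&F University, Yangling 712100, Shaanxi, China; 3. University of Edinburgh Business School, Edinburgh EH8 9JS, UK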
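  • About the author: YANG Lian (born 1988), female, PhD candidate. Research interest: rural finance.
  • Funding:
    National Natural Science Foundation of China, General Program (71873103, 72173096); National Natural Science Foundation of China, Key Program (71731003); Soft Science Research Project of the Rural Revitalization Expert Advisory Committee of the Office of the Central Rural Work Leading Group and the Ministry of Agriculture and Rural Affairs (2021-22); Zhonghe Nongxin Starry Sky Program (K4030218167); Zhongying Young Scholars Program of Northwest A&F University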

Abstract: Imbalanced credit scoring data sets lead models to over-recognize non-default samples and under-recognize default samples. To address this problem, this paper builds a credit risk evaluation model by introducing the class balanced loss function into the cross entropy loss. The resulting BPNN-CBCE (back propagation neural network-class balanced cross entropy) model is compared with the BPNN-CE (back propagation neural network-cross entropy), SVM (support vector machine), DT (decision tree), RF (random forest), and KNN (K-nearest neighbor) models to verify its effectiveness in predicting credit risk on 1 534 farmer loan records from a financial institution in China. Its robustness is then tested on the German credit data published by UCI (University of California, Irvine). The results show that, for the farmer loan data, the default recall of the BPNN-CBCE is 41.3 percentage points higher than that of the five comparison models, and its AUC (area under curve) is 15.6 percentage points higher. For the German credit data, the BPNN-CBCE model also outperforms the BPNN-CE, SVM, DT, RF, and KNN models in both AUC and default recall. The BPNN-CBCE credit risk evaluation model therefore identifies default samples well in the imbalanced credit data of farmers and can reduce the losses that financial institutions incur by misjudging default customers. The contributions of this paper are as follows. First, the balance factor ω in class balanced loss is used to adjust the weights of the non-default and default sample losses in the target loss, which compensates for the inability of the cross-entropy loss function to adjust these weights and overcomes the over-recognition of non-default samples and under-recognition of default samples caused by sample imbalance. Second, the random covering method is used to sample non-default or default samples without replacement until the whole sample space X_non-default or X_default is fully covered, from which the number of effective samples for non-default or default loan customers is calculated. Third, the scope of application of class balanced loss is expanded, providing a new approach to credit risk evaluation with imbalanced samples. The method is robust and can be directly applied to credit risk assessment in financial institutions.
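The two evaluation metrics named in the abstract, default recall and AUC, can be illustrated with a minimal Python sketch. The labels, scores, 0.5 decision threshold, and function names below are made-up assumptions for illustration only; they are not the paper's data or code. A default is coded as 1 and a non-default as 0.

```python
import numpy as np

def default_recall(y_true, y_pred):
    """Share of actual defaults (label 1) that were predicted as default."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    is_default = y_true == 1
    return float((y_pred[is_default] == 1).mean())

def auc(y_true, scores):
    """AUC as the probability that a random default receives a higher
    score than a random non-default (ties count as one half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all default/non-default pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Illustrative toy data (hypothetical): 4 non-defaults, 2 defaults.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.6, 0.3, 0.7, 0.4]       # predicted default probability
y_pred = [int(s >= 0.5) for s in scores]      # assumed 0.5 threshold

print(default_recall(y_true, y_pred))  # 0.5: one of two defaults is caught
print(auc(y_true, scores))             # 0.875: 7 of 8 pairs ranked correctly
```

Default recall depends on the chosen threshold, whereas AUC depends only on how the scores rank defaults against non-defaults, which is one reason the two are reported together.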

Key words: credit evaluation, class balanced loss, BP (back propagation) neural network, cross entropy, microfinance

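Abstract (translated from the Chinese version): Traditional credit risk prediction models over-recognize non-default samples and under-recognize default samples. To address this, the Class Balanced Loss function is introduced into credit risk evaluation, and a credit risk evaluation model for imbalanced samples based on Class Balanced Loss modified cross entropy is constructed. The model is compared with five classification models (cross entropy neural network, support vector machine, decision tree, random forest, and K-nearest neighbor) to verify the effectiveness of BPNN-CBCE in predicting credit risk on 1 534 farmer loan records from a financial institution in China; on this basis, the German credit data published by UCI is used to verify the robustness of the BPNN-CBCE model. The results show that, on the farmer data, the BPNN-CBCE model generally outperforms the BPNN-CE, SVM, DT, RF, and KNN models in AUC and default recall. Specifically, the default recall of BPNN-CBCE is 41.3 percentage points higher than that of the five comparison models, and its AUC is 15.6 percentage points higher. On the German data set, the BPNN-CBCE model also outperforms all five comparison models in AUC and default recall. The BPNN-CBCE credit evaluation model therefore identifies default samples in imbalanced farmer credit data well and can effectively reduce the losses that financial institutions incur from misjudging customers. Innovations and features: (1) The balance factor ω in Class Balanced Loss is used to increase the weight of default samples and decrease the weight of non-default samples in the target loss, objectively adjusting the weights of the positive and negative sample losses. This compensates for the inability of the cross entropy function to adjust the loss weights of the two classes and overcomes the over-recognition of non-default samples and under-recognition of default samples caused by sample imbalance. (2) Taking data overlap into account, the random covering method is used to sample default and non-default loans without replacement, so as to cover the full sample spaces X_default and X_non-default without overlap and to calculate the effective number of samples for the two classes of loan customers. This reflects the fact that, because of the intrinsic similarity among real data, a newly added sample is increasingly likely to be a near duplicate of existing samples as the sample size grows, and it also keeps the reweighting of the two classes' losses, which is based on effective samples, objective. Introducing the Class Balanced Loss function from image recognition into credit evaluation both extends the scope of application of Class Balanced Loss and offers a new approach to credit risk evaluation with imbalanced samples.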
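The reweighting idea in items (1) and (2) can be sketched in Python as follows. This is a minimal illustration, assuming the effective-number formula E_n = (1 - β^n) / (1 - β) from the original Class Balanced Loss formulation in image recognition. The hyperparameter β, the class counts, the normalisation of the weights, and all function names are assumptions made for this sketch; the paper's exact definition of the balance factor ω may differ.

```python
import numpy as np

def effective_number(n, beta):
    """Effective number of samples E_n = (1 - beta^n) / (1 - beta)."""
    n = np.asarray(n, dtype=float)
    return (1.0 - beta ** n) / (1.0 - beta)

def class_balanced_weights(counts, beta):
    """Per-class weights proportional to 1 / E_n.
    Assumption: weights are rescaled to sum to the number of classes."""
    w = 1.0 / effective_number(counts, beta)
    return w * len(counts) / w.sum()

def cb_cross_entropy(y_true, p_default, counts, beta=0.999):
    """Binary cross entropy with class-balanced weights.
    y_true: 0 = non-default, 1 = default; p_default: predicted P(default);
    counts: [n_non_default, n_default] in the training data."""
    y = np.asarray(y_true, dtype=int)
    p = np.clip(np.asarray(p_default, dtype=float), 1e-12, 1 - 1e-12)
    w = class_balanced_weights(counts, beta)
    per_sample = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(np.mean(w[y] * per_sample))

# Hypothetical class counts, not taken from the paper's data.
counts = [900, 100]
weights = class_balanced_weights(counts, beta=0.999)
print(weights)  # the minority (default) class receives the larger weight
```

With β = 0 every class has an effective number of 1, so the weights are equal and the loss reduces to plain cross entropy; as β approaches 1 the weights approach inverse class frequency.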

Key words (translated from the Chinese version): credit evaluation, Class Balanced Loss, BP neural network, cross entropy, microfinance
