Ann Oper Res (2014) 216:191–203
SVM classification for imbalanced data sets using a multiobjective optimization framework
Ays¸egül As¸kan · Serpil Sayın
Published online: 15 January 2013 © Springer Science+Business Media New York 2013
Abstract Classification of imbalanced data sets in which negative instances outnumber the positive instances is a significant challenge. These data sets are commonly encountered in real-life problems. However, performance of well-known classifiers is limited in such cases.
Various solution approaches have been proposed for the class imbalance problem using either data-level or algorithm-level modifications. Support Vector Machines (SVMs) that have a solid theoretical background also encounter a dramatic decrease in performance when the data distribution is imbalanced. In this study, we propose an L1-norm SVM approach that is based on a three objective optimization problem so as to incorporate into the formulation the error sums for the two classes independently. Motivated by the inherent multi objective nature of the SVMs, the solution approach utilizes a reduction into two criteria formulations and investigates the efficient frontier systematically. The results indicate that a comprehensive treatment of distinct positive and negative error levels may lead to performance improvements that have varying degrees of increased computational effort.
Keywords SVM · Imbalanced data · Multiobjective optimization · Efficient frontier 1 Introduction
The class imbalance problem, also known as the Imbalanced Data Set Problem is a remarkable challenge that is studied extensively in the knowledge discovery and data mining field (Japkowicz and Stephen 2002 and Chawla and Japkowicz 2004). The majority of the classification algorithms assume training sets are well balanced and misclassification costs are equal. The aim of these algorithms is usually maximizing overall accuracy. However such
Garanti Teknoloji, Evren Mahallesi, Koçman Caddesi No:34 Günes¸li, 34212, ˙Istanbul, Turkey e-mail: firstname.lastname@example.org
S. Sayın ()
College of Administrative Sciences and Economics, Koç University, Sarıyer, 34450, ˙Istanbul, Turkey e-mail: email@example.com 192 Ann Oper Res (2014) 216:191–203 a balance does not exist in some real-life data sets. Instead, one of the classes can be represented by only a few examples where the other class is represented by a large number of examples and the problem arises when the class of interest is the minority in the sample set.
In a binary classification problem, the class having more examples is called the majority class and the one having fewer examples is called the minority class. Without loss of generality, we assume the majority class consists of negative examples and the minority class consists of positive examples.
Some of the application domains in which imbalance exists are, detection of fraud in telephone calls (Fawcett and Provost 1997), detection of credit card fraud (Chan and Stolfo 1998), response rate in direct marketing (Ling and Li 1998), rare medical diagnoses (Witten and Frank 2000), detection of oil spills in satellite radar images (Kubat et al. 1998) and spotting unreliable telecommunication customers (Ezawa et al. 1996). The ratio of imbalance in these data sets may vary from 1 : 10 to more dramatic situations like 1 : 100,000 (Provost and Fawcett 2001).
For an application domain like credit card fraud detection, the search is focused only on the minority class. In other words the aim of the classification is to distinguish the minority class. However, when there is an imbalanced distribution in the data set a typical classifier would be biased towards one class because it has the goal of maximizing overall accuracy.
In other words the classifier would tend to classify all examples as negative leading to a high overall accuracy but in the meantime it would possibly misclassify many—if not all— positive examples. In applications such as diagnosing cancer, such false negatives would not be acceptable.
In this paper, we propose an SVM model with three objective functions in order to deal with positive and negative empirical errors explicitly in addition to the typical margin maximization objective. Such a multiobjective framework makes it possible to characterize the complete trade-off information among three objectives. Since obtaining the efficient frontier in its entirety is challenging for the three objective case, we propose reductions to bicriteria optimization models for which the entire efficient frontier may be traced rather practically. Through experimentation on real data sets, we investigate whether a thorough search benefits the accuracy especially with respect to the classification of positive instances. The organization of the paper is as follows. In the next section, we provide an overview of the literature on imbalanced data sets. In Sect. 3, we describe the three objective SVM model and introduce our solution approach. Section 4 contains results of our computational experiments. We conclude in Sect. 5 with a discussion of contributions, limitations and directions for further research. 2 Overview of approaches for the data imbalance problem
Many strategies have been proposed in order to handle the class imbalance problem. Surveys
Japkowicz and Stephen (2002), Visa (2005), Chawla and Japkowicz (2004), Kotsiantis et al. (2006), Weiss (2004), Gu et al. (2008) are suggested for detailed information. Mainly, the proposed approaches can be classified into two groups as data level and algorithm level approaches.
Data level approaches are based on the idea of resampling the data to remedy the imbalance. These methods come in various forms like random undersampling, random oversampling and improved sampling techniques. The major drawback of undersampling the data is loss of information which may hurt the data mining process. The major drawback of oversampling is that it increases the likelihood of overfitting because positive instances are replicated until a balance is achieved (Gu et al. 2008). Additionally, random