Fraud Detection using Python

September 17, 2018 in Python Articles

Written by Kevin McCarty


An important issue confronting retailers and other businesses today is the prevalence of credit card fraud. This issue recently hit home, as my son was a victim a week before I wrote this.

We can apply machine learning to help detect credit card fraud, but there is a bit of a problem in that the vast majority of transactions are perfectly legitimate, which reduces a typical model’s sensitivity to fraud.

As an example, consider a logistic regression model running against the Credit Card Fraud dataset posted on Kaggle. You can download it here:

https://www.kaggle.com/mlg-ulb/creditcardfraud

To follow along you will need an installation of Python with the following packages:

NumPy
Pandas
SciKit-Learn

You can get all those packages, and many more, with the Anaconda installation which you can find at:

https://www.anaconda.com/download/

To begin with, start off with the necessary imports.

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, cohen_kappa_score
from sklearn.metrics import f1_score, recall_score

We need NumPy for some basic mathematical functions and Pandas to read in the CSV file and create the data frame. We will use several functions from sklearn.metrics to evaluate the results from our models.

Next, we need to create a couple of helper functions. PrintStats will compile and display the results from a model. Here is the code:

def PrintStats(cmat, y_test, pred):
   # separate out the confusion matrix components
   # (scikit-learn layout: rows are actual classes, columns are predicted classes)
   tneg = cmat[0][0]
   fpos = cmat[0][1]
   fneg = cmat[1][0]
   tpos = cmat[1][1]
   # calculate F1, Recall scores
   f1Score = round(f1_score(y_test, pred), 2)
   recallScore = round(recall_score(y_test, pred), 2)
   # calculate and display metrics
   print(cmat)
   print( 'Accuracy: '+ str(np.round(100*float(tpos+tneg)/float(tpos+fneg + fpos + tneg),2))+'%')
   print( 'Cohen Kappa: '+ str(np.round(cohen_kappa_score(y_test, pred),3)))
   print("Sensitivity/Recall for Model : {recall_score}".format(recall_score = recallScore))
   print("F1 Score for Model : {f1_score}".format(f1_score = f1Score))

PrintStats takes as parameters a confusion matrix, test labels and prediction labels and does the following:

  1. Separates the confusion matrix into its constituent parts.
  2. Calculates the F1, Recall, Accuracy and Cohen Kappa scores.
  3. Prints the confusion matrix and all the calculated scores.
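To make the arithmetic behind these scores concrete, here is a minimal sketch that derives accuracy, recall, precision and F1 directly from the four cells of a two-class confusion matrix. It assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]]. The function name metrics_from_cmat and the sample counts are made up for illustration and are not part of the article's code.

```python
def metrics_from_cmat(cmat):
    """Derive basic scores from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cmat
    total = tn + fp + fn + tp
    return {
        # share of all predictions that were correct
        "accuracy": (tp + tn) / total,
        # share of actual positives that were found
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # share of predicted positives that were right
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # harmonic mean of precision and recall
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }

# Made-up counts: 92 actual negatives, 8 actual positives
scores = metrics_from_cmat([[90, 2], [3, 5]])
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Even in this toy case accuracy is 95% while recall is only 62.5%, which is the gap the rest of this article is about.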

We also need a function, called RunModel, to actually train the model and generate predictions against the test data. Here is the code:

def RunModel(model, X_train, y_train, X_test, y_test):
   model.fit(X_train, y_train.values.ravel())
   pred = model.predict(X_test)
   matrix = confusion_matrix(y_test, pred)
   return matrix, pred

The RunModel function takes as input the untrained model along with all the test and training data, including labels. It trains the model, runs the prediction using the test data, and returns the confusion matrix along with the predicted labels.

With these two functions created, it’s time to see if we can create a model to do fraud detection. Fraud detection is generally considered a two-class problem. In other words, a transaction is either:

Class #1: Not fraud

Or

Class #2: Fraud

Our goal is to try to determine to which class a particular transaction belongs. Step #1 is to load the CSV data and create the classes. This code will do the trick:

df = pd.read_csv('../Datasets/creditcard.csv')
class_names = {0:'Not Fraud', 1:'Fraud'}
print(df.Class.value_counts().rename(index = class_names))

It generates the following result:

Not Fraud    284315
Fraud           492
Name: Class, dtype: int64

This is a fairly typical dataset. Out of 284,807 transactions, 492 (about 0.17%) were labelled as fraudulent. It may not seem like much, but each transaction represents a significant expense. Together, all such fraudulent transactions may represent billions of dollars of lost revenue each year. The imbalance also poses a problem for detection. Such a small percentage of fraudulent transactions makes it more difficult to weed out the offenders from the overwhelming number of good transactions.
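The degree of imbalance is easy to quantify from the two counts printed above. The short sketch below just redoes that arithmetic in plain Python; the variable names are illustrative.

```python
# Class counts as printed by value_counts() above
counts = {"Not Fraud": 284315, "Fraud": 492}

total = sum(counts.values())
fraud_share = counts["Fraud"] / total
valid_per_fraud = counts["Not Fraud"] / counts["Fraud"]

print(f"Total transactions: {total}")
print(f"Fraud share: {fraud_share:.2%}")
print(f"Valid transactions per fraud: {valid_per_fraud:.0f}")
```

With well under 1% of the rows in the positive class, a model can be right almost all of the time without ever flagging a fraudulent transaction.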

Step #2 is to define the features we want to use. Normally, we want to apply some dimension reduction and feature engineering to our data, but that is another article (or two). Instead we’ll just use nearly all of the columns (everything except Time, the first column) with the following code:

# columns 1-29 are the features; column 0 (Time) is skipped
feature_names = df.iloc[:, 1:30].columns
# column 30 (Class) is the target
target = df.iloc[:, 30:].columns

data_features = df[feature_names]
data_target = df[target]

With the dataset defined, step #3 is to split the data into training and test sets. To do this, we need to import another function and run the following code:

from sklearn.model_selection import train_test_split
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(
   data_features, data_target, train_size=0.70, test_size=0.30, random_state=1)

The train_test_split function uses a randomizer to separate the data into training and test sets. 70% of the data is for training and 30% is for testing. The random_state argument fixes the split so that the same rows are used on every run. The NumPy seed set beforehand covers later steps that draw from NumPy’s global random generator.
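The effect of a fixed seed can be illustrated without scikit-learn. The sketch below is a simplified stand-in for a seeded random split, not train_test_split's actual implementation; split_indices is a hypothetical helper, and the ten-row dataset is made up.

```python
import random

def split_indices(n_rows, train_fraction, seed):
    """Shuffle row indices with a seeded generator, then cut them into train and test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = round(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

# Two runs with the same seed produce exactly the same split
train_a, test_a = split_indices(10, 0.70, seed=1)
train_b, test_b = split_indices(10, 0.70, seed=1)

print(train_a == train_b and test_a == test_b)
print(len(train_a), len(test_a))
```

Because the shuffle is driven entirely by the seed, any difference in results between two runs can be attributed to the model rather than to a different split.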

For step #4, we pick a machine learning technique, or model. Perhaps the most common two-class machine learning technique is logistic regression. We will use that for this first test:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
cmat, pred = RunModel(lr, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)

The output from this run should look like this:

[[85293    15]
 [   57    78]]
Accuracy: 99.92%
Cohen Kappa: 0.684
Sensitivity/Recall for Model : 0.58
F1 Score for Model : 0.68

You might initially think the model did a good job. After all, it got 99.92% of its predictions correct. That is true, except if you look closely at the confusion matrix you will see the following results:

85293 transactions were classified as valid that were actually valid
15 transactions were classified as fraud that were actually valid (type 1 error)
57 transactions were classified as valid that were fraud (type 2 error)
78 transactions were classified as fraud that were fraud

So, while the accuracy was great, we find that the algorithm misclassified more than 4 in 10 fraudulent transactions. In fact, if our algorithm simply classified everything as valid, it would still have an accuracy of about 99.8% but be entirely useless! So, accuracy is not a reliable measure of a model’s effectiveness on data this unbalanced. Instead, we look at other measures like the Cohen Kappa, Recall, and F1 score. In each case, we want to achieve a score as close to 1 as we can.
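The comparison with an "everything is valid" classifier can be checked from the confusion matrix shown above. The sketch below computes that baseline accuracy and reproduces the Cohen Kappa by hand, assuming the scikit-learn layout [[TN, FP], [FN, TP]]. The helper cohen_kappa_from_cmat is a hypothetical name written for this illustration; the article itself uses sklearn's cohen_kappa_score.

```python
def cohen_kappa_from_cmat(cmat):
    """Cohen's kappa for a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cmat
    total = tn + fp + fn + tp
    # Agreement actually observed between labels and predictions
    observed = (tn + tp) / total
    # Agreement expected by chance, from the row (actual) and column (predicted) totals
    expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / total ** 2
    return (observed - expected) / (1 - expected)

cmat = [[85293, 15], [57, 78]]  # logistic regression result shown above
(tn, fp), (fn, tp) = cmat
total = tn + fp + fn + tp

# Labelling every transaction as valid gets all valid rows right and all fraud rows wrong
baseline_accuracy = (tn + fp) / total
baseline_kappa = cohen_kappa_from_cmat([[tn + fp, 0], [fn + tp, 0]])

model_accuracy = (tn + tp) / total
model_kappa = cohen_kappa_from_cmat(cmat)

print(f"Baseline accuracy: {baseline_accuracy:.2%}, kappa: {baseline_kappa:.3f}")
print(f"Model accuracy:    {model_accuracy:.2%}, kappa: {model_kappa:.3f}")
```

The two accuracies differ by less than a tenth of a percentage point, while kappa separates them clearly: the baseline scores 0 because it does no better than chance agreement.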

Maybe another model will work. How about a RandomForest classifier? The code is similar to logistic regression:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, n_jobs=4)
cmat, pred = RunModel(rf, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)

Trying this classifier will get you results similar to the following:

[[85297    11]
 [   31   104]]
Accuracy: 99.95%
Cohen Kappa: 0.832
Sensitivity/Recall for Model : 0.77
F1 Score for Model : 0.83

That’s quite a bit better. Note the accuracy went up slightly, but the other scores showed significant improvements as well. So, one way to improve our detection is to try different models and see how they perform. Clearly changing models helped.

But there are other options too. One is over-sampling the fraud records or, conversely, under-sampling the valid records. Over-sampling means adding fraud records to our training sample, thereby increasing the overall proportion of fraud records. Under-sampling means removing valid records from the sample, which has the same effect. Changing the sampling makes the algorithm more “sensitive” to fraud transactions.
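As a rough illustration of the two ideas, the sketch below rebalances a made-up sample using only the standard library. It shows plain random resampling under simplifying assumptions (rows are tuples, classes are balanced exactly 50/50); the helper names undersample, oversample and fraud_share are hypothetical.

```python
import random

def undersample(majority, minority, rng):
    """Keep every minority row plus an equally sized random subset of majority rows."""
    return rng.sample(majority, len(minority)) + minority

def oversample(majority, minority, rng):
    """Keep every row and add randomly repeated minority rows until the classes match."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

def fraud_share(rows):
    """Fraction of rows labelled as fraud."""
    return sum(1 for label, _ in rows if label == "fraud") / len(rows)

rng = random.Random(0)  # seeded so the sketch is repeatable
valid = [("valid", i) for i in range(1000)]  # made-up majority class
fraud = [("fraud", i) for i in range(10)]    # made-up minority class

under = undersample(valid, fraud, rng)
over = oversample(valid, fraud, rng)

print(len(valid + fraud), fraud_share(valid + fraud))  # original sample
print(len(under), fraud_share(under))                  # under-sampled
print(len(over), fraud_share(over))                    # over-sampled
```

Both approaches end at the same class proportion, but under-sampling throws data away while over-sampling repeats the few fraud rows many times, so each has its own risk.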

Going back to the logistic regression classifier, let’s see how some under-sampling might improve the overall performance of the model. There are also more sophisticated techniques, such as SMOTE and ADASYN, which over-sample by generating synthetic minority-class records. In our case, let’s under-sample in order to achieve an even split between fraud and valid transactions. It will make the training set pretty small, but the algorithm doesn’t need a lot of data to come up with a good classifier:

fraud_records = len(df[df.Class == 1])
# pull the indices for fraud and valid rows
fraud_indices = df[df.Class == 1].index
normal_indices = df[df.Class == 0].index
# randomly collect equal samples of each type (without replacement)
under_sample_indices = np.random.choice(normal_indices, fraud_records, replace=False)
# the indices are row labels, so select the rows with .loc
df_undersampled = df.loc[np.concatenate([fraud_indices, under_sample_indices]), :]
X_undersampled = df_undersampled.iloc[:, 1:30]
Y_undersampled = df_undersampled.Class
X_undersampled_train, X_undersampled_test, Y_undersampled_train, Y_undersampled_test = train_test_split(
   X_undersampled, Y_undersampled, test_size=0.3)
lr_undersampled = LogisticRegression(C=1)
# run the new model
cmat, pred = RunModel(lr_undersampled, X_undersampled_train, Y_undersampled_train,
   X_undersampled_test, Y_undersampled_test)
PrintStats(cmat, Y_undersampled_test, pred)

Now look at the new results:

[[138    1]
 [ 22  135]]
Accuracy: 92.23%
Cohen Kappa: 0.845
Sensitivity/Recall for Model : 0.86
F1 Score for Model : 0.92

The accuracy went down, but all of the other scores went up. Looking at the confusion matrix, you can see a much higher percentage of correct classifications of fraudulent data.

Unfortunately, there is no free lunch. A higher number of fraud classifications almost always means a higher number of valid transactions classified as fraudulent too. Now try the “new” logistic regression classifier against the original test data. Keep in mind that the under-sampled training set was drawn from the full dataset, so it may share some rows with this test set:

cmat, pred = RunModel(lr_undersampled, X_undersampled_train, Y_undersampled_train, X_test, y_test)
PrintStats(cmat, y_test, pred)

This time, the results are:

[[83757    1551]
 [   16     119]]
Accuracy: 98.17%
Cohen Kappa: 0.129
Sensitivity/Recall for Model : 0.88
F1 Score for Model : 0.13

The algorithm was far better at catching fraudulent transactions (16 missed, down from 57) but far worse at mislabeling valid transactions as fraud (1551, up from 15).
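Putting the two logistic regression confusion matrices side by side shows where the F1 score went. The sketch below recomputes recall, precision, false positive rate and F1 from the matrices printed above, assuming the scikit-learn layout [[TN, FP], [FN, TP]]; the dictionary keys are just labels chosen for this illustration.

```python
runs = {
    "original": [[85293, 15], [57, 78]],
    "under-sampled": [[83757, 1551], [16, 119]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in runs.items():
    summary[name] = {
        "recall": tp / (tp + fn),               # share of fraud that was caught
        "precision": tp / (tp + fp),            # share of fraud alerts that were real
        "false_positive_rate": fp / (fp + tn),  # share of valid rows flagged as fraud
        "f1": 2 * tp / (2 * tp + fp + fn),
    }
    print(name, {k: round(v, 4) for k, v in summary[name].items()})
```

Recall improves, but precision collapses: in the under-sampled run only about 7% of the transactions flagged as fraud are actually fraudulent, and that is what drags F1 down to 0.13.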

As a data scientist, you have to determine at what point the tradeoff is worth it. Generally, the cost of missing a fraudulent transaction is many times greater than the cost of misclassifying a good transaction as fraud. Your job is to find that balance point in your model training and proceed accordingly.
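One way to reason about that balance point is to attach a cost to each kind of error and compare totals. The sketch below does this for the two logistic regression results above. The cost figures are purely hypothetical placeholders, not real business numbers, and total_cost is an illustrative helper rather than part of the article's code.

```python
def total_cost(cmat, cost_missed_fraud, cost_false_alarm):
    """Total error cost for a confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cmat
    return fn * cost_missed_fraud + fp * cost_false_alarm

original = [[85293, 15], [57, 78]]
undersampled = [[83757, 1551], [16, 119]]

# Hypothetical costs: a false alarm costs 1 unit, a missed fraud costs 10 or 100 units
for cost_missed_fraud in (10, 100):
    cost_original = total_cost(original, cost_missed_fraud, 1)
    cost_undersampled = total_cost(undersampled, cost_missed_fraud, 1)
    print(cost_missed_fraud, cost_original, cost_undersampled)

# Cost ratio at which the two models break even:
# extra false alarms accepted per additional fraud caught
break_even = (1551 - 15) / (57 - 16)
print(f"Break-even cost ratio: {break_even:.1f}")
```

Under these assumed costs, the under-sampled model only pays off once a missed fraud costs more than roughly 37 times as much as a false alarm. Below that ratio, the original model is the cheaper choice.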


 

Kevin McCarty is a freelance Data Scientist and Trainer. He teaches data analytics and data science to government agencies, military services, and businesses in the US and internationally.


Kevin has a PhD in computer science and is a data scientist consultant and Microsoft Certified Trainer for .NET, Machine Learning and the SQL Server stack. He also trains and consults on Python, R and Tableau. Kevin has taught for Accelebrate all over the US and in Africa.


