Using Python to Find Fraud

September 17, 2018 in Python Articles

Written by Kevin McCarty


An important issue confronting retailers and other businesses today is the prevalence of credit card fraud. This issue recently hit home, as my son was a victim just a week before I wrote this article.

We can apply machine learning to help detect credit card fraud, but there is a catch: the vast majority of transactions are perfectly legitimate, which reduces a typical model’s sensitivity to fraud.

As an example, consider logistic regression run against the Credit Card Fraud dataset posted on Kaggle. You can download it here:

https://www.kaggle.com/mlg-ulb/creditcardfraud

To follow along, you will need an installation of Python with the following packages:

NumPy
Pandas
scikit-learn

You can get all those packages, and many more, with the Anaconda distribution, which you can find at:

https://www.anaconda.com/download/

To begin, start with the necessary imports.

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, cohen_kappa_score
from sklearn.metrics import f1_score, recall_score

We need NumPy for some basic mathematical functions and Pandas to read in the CSV file and create the data frame. We will use a number of functions from sklearn.metrics to evaluate the results from our models.

Next, we need to create a couple of helper functions. PrintStats will compile and display the results from a model. Here is the code:

def PrintStats(cmat, y_test, pred):
   # separate out the confusion matrix components
   # (rows are actual classes, columns are predicted classes; fraud = 1 is the positive class)
   tneg = cmat[0][0]
   fpos = cmat[0][1]
   fneg = cmat[1][0]
   tpos = cmat[1][1]
   # calculate F1 and Recall scores
   f1Score = round(f1_score(y_test, pred), 2)
   recallScore = round(recall_score(y_test, pred), 2)
   # calculate and display metrics
   print(cmat)
   print( 'Accuracy: '+ str(np.round(100*float(tpos + tneg)/float(tpos + tneg + fpos + fneg),2))+'%')
   print( 'Cohen Kappa: '+ str(np.round(cohen_kappa_score(y_test, pred),3)))
   print("Sensitivity/Recall for Model : {recall_score}".format(recall_score = recallScore))
   print("F1 Score for Model : {f1_score}".format(f1_score = f1Score))

PrintStats takes as parameters a confusion matrix, test labels and prediction labels and does the following:

  1. Separates the confusion matrix into its constituent parts.
  2. Calculates the F1, Recall, Accuracy and Cohen Kappa scores.
  3. Prints the confusion matrix and all the calculated scores.

We also need a function, called RunModel, to actually train the model and generate predictions against the test data. Here is the code:

def RunModel(model, X_train, y_train, X_test, y_test):
   model.fit(X_train, y_train.values.ravel())
   pred = model.predict(X_test)
   matrix = confusion_matrix(y_test, pred)
   return matrix, pred

The RunModel function takes as input the untrained model along with all the test and training data, including labels. It trains the model, runs the prediction using the test data, and returns the confusion matrix along with the predicted labels.

With these two helper functions in place, it’s time to see if we can build a model to do fraud detection. Fraud detection is generally considered a two-class problem. In other words, a transaction is either:

Class #1: Not fraud

Or

Class #2: Fraud

Our goal is to try to determine to which class a particular transaction belongs. Step #1 is to load the CSV data and create the classes. This code will do the trick:

df = pd.read_csv('../Datasets/creditcard.csv')
class_names = {0:'Not Fraud', 1:'Fraud'}
print(df.Class.value_counts().rename(index = class_names))

It generates the following result:

Not Fraud    284315
Fraud           492
Name: Class, dtype: int64

This is a fairly typical dataset. Out of nearly 300,000 transactions, only 492 were labelled as fraudulent. That may not seem like much, but each fraudulent transaction can represent a significant loss, and together such transactions may represent billions of dollars of lost revenue each year. The imbalance also poses a problem for detection: such a small percentage of fraud transactions makes it harder to weed out the offenders from the overwhelming number of good transactions.
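To see just how lopsided the split is, you can also express the counts as percentages. This is an optional check, using the same data frame and class_names mapping defined above:

# show the class distribution as a percentage of all transactions
print(df.Class.value_counts(normalize=True).rename(index = class_names) * 100)

Fraud comes out to roughly 0.17% of all transactions.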

Step #2 is to define the features we want to use. Normally, we want to apply some dimension reduction and feature engineering to our data, but that is another article (or two). Instead, we’ll just use the whole set of feature columns here with the following code:

feature_names = df.iloc[:, 1:30].columns
target = df.iloc[:, 30:].columns

data_features = df[feature_names]
data_target = df[target]

With the dataset defined, step #3 is to split the data into training and test sets. To do this, we need to import another function and run the following code:

from sklearn.model_selection import train_test_split
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(data_features, data_target, train_size=0.70, test_size=0.30, random_state=1)

The train_test_split function uses a randomizer to separate the data into training and test sets: 70% of the data is for training and 30% is for testing. The random seed is fixed so that the same split is produced on every run.
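As an optional sanity check, and assuming the variables created above, you can confirm that the training split carries the same severe imbalance as the full dataset:

# confirm the training labels are still overwhelmingly 'Not Fraud'
print(y_train['Class'].value_counts().rename(index = class_names))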

For step #4, we pick a machine learning technique, or model. Perhaps the most common two-class machine learning technique is logistic regression. We will use that for this first test:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
cmat, pred = RunModel(lr, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)

The output from this run should look like this:

[[85293    15]
 [   57    78]]
Accuracy: 99.92%
Cohen Kappa: 0.684
Sensitivity/Recall for Model : 0.58
F1 Score for Model : 0.68

You might initially think the model did a good job. After all, it got 99.92% of its predictions correct. That is true, but if you look closely at the confusion matrix you will see the following results:

85293 transactions were classified as valid that were actually valid
15 transactions were classified as fraud that were actually valid (Type I errors)
57 transactions were classified as valid that were actually fraud (Type II errors)
78 transactions were classified as fraud that were actually fraud

So, while the accuracy was great, we find that the algorithm misclassified more than 4 in 10 fraudulent transactions. In fact, if our algorithm simply classified everything as valid, it would still have an accuracy above 99.8% yet be entirely useless! So accuracy alone is not a reliable measure of a model’s effectiveness. Instead, we look at other measures such as the Cohen Kappa, Recall, and F1 scores. In each case, we want to achieve a score as close to 1 as we can.
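To see why, consider a do-nothing baseline. The following sketch (an illustration, not part of the original workflow) scores a "classifier" that simply predicts every transaction as valid:

# baseline: predict 'Not Fraud' for every test transaction
baseline_pred = np.zeros(len(y_test), dtype=int)
print('Baseline accuracy: ' + str(np.round(100*(baseline_pred == y_test.values.ravel()).mean(), 2)) + '%')
print('Baseline recall: ' + str(recall_score(y_test.values.ravel(), baseline_pred)))

The baseline's accuracy comes out above 99.8%, yet its recall is 0 because it never catches a single fraudulent transaction.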

Maybe another model will work. How about a RandomForest classifier? The code is similar to logistic regression:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, n_jobs=4)
cmat, pred = RunModel(rf, X_train, y_train, X_test, y_test)
PrintStats(cmat, y_test, pred)

Trying this classifier will get you results similar to the following:

[[85297    11]
 [   31   104]]
Accuracy: 99.95%
Cohen Kappa: 0.832
Sensitivity/Recall for Model : 0.77
F1 Score for Model : 0.83

That’s quite a bit better. The accuracy went up only slightly, but the other scores showed significant improvements. So one way to improve our detection is to try different models and see how they perform; clearly, changing models helped. But there are other options too. One is over-sampling the fraud records or, conversely, under-sampling the valid records. Over-sampling means adding fraud records to the training sample, thereby increasing the overall proportion of fraud records; under-sampling means removing valid records from the sample, which has the same effect. Either way, changing the sampling makes the algorithm more “sensitive” to fraud transactions.
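For example, a naive form of over-sampling simply duplicates fraud rows (sampling them with replacement) until the two classes are the same size. The sketch below only illustrates the idea; the code that follows uses under-sampling instead:

# naive over-sampling: resample the fraud rows with replacement
fraud_idx = df[df.Class == 1].index
normal_idx = df[df.Class == 0].index
over_sample_idx = np.random.choice(fraud_idx, len(normal_idx), replace=True)
df_oversampled = df.loc[np.concatenate([normal_idx, over_sample_idx])]
print(df_oversampled.Class.value_counts())

Libraries such as imbalanced-learn offer smarter variants (SMOTE and ADASYN, mentioned below) that synthesize new minority-class examples rather than duplicating existing ones.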

Going back to the logistic regression classifier, let’s see how some under-sampling might improve the overall performance of the model. There are specific techniques, such as SMOTE and ADASYN, designed to strategically sample unbalanced datasets. In our case, let’s under-sample in order to achieve an even split between fraud and valid transactions. It will make the training set pretty small, but the algorithm doesn’t need a lot of data to come up with a good classifier:

fraud_records = len(df[df.Class == 1])
# pull the indices for fraud and valid rows
fraud_indices = df[df.Class == 1].index
normal_indices = df[df.Class == 0].index
# randomly collect an equal-sized sample of valid rows
under_sample_indices = np.random.choice(normal_indices, fraud_records, False)
df_undersampled = df.iloc[np.concatenate([fraud_indices, under_sample_indices]), :]
X_undersampled = df_undersampled.iloc[:, 1:30]
Y_undersampled = df_undersampled.Class
X_undersampled_train, X_undersampled_test, Y_undersampled_train, Y_undersampled_test = train_test_split(X_undersampled, Y_undersampled, test_size=0.3)
lr_undersampled = LogisticRegression(C=1)
# run the new model
cmat, pred = RunModel(lr_undersampled, X_undersampled_train, Y_undersampled_train, X_undersampled_test, Y_undersampled_test)
PrintStats(cmat, Y_undersampled_test, pred)

Now look at the new results:

[[138    1]
 [ 22  135]]
Accuracy: 92.23%
Cohen Kappa: 0.845
Sensitivity/Recall for Model : 0.86
F1 Score for Model : 0.92

The accuracy went down, but all of the other scores went up. Looking at the confusion matrix, you can see a much higher percentage of correct classifications of fraudulent data.

Unfortunately, there is no free lunch. A higher number of fraud classifications almost always means a correspondingly higher number of valid transactions classified as fraudulent as well. Now try the “new” logistic regression classifier against the original test data:

cmat, pred = RunModel(lr_undersampled, X_undersampled_train, Y_undersampled_train, X_test, y_test)
PrintStats(cmat, y_test, pred)

This time, the results are:

[[83757  1551]
 [   16   119]]
Accuracy: 98.17%
Cohen Kappa: 0.129
Sensitivity/Recall for Model : 0.88
F1 Score for Model : 0.13

The algorithm was far better at catching fraudulent transactions (only 16 missed, versus 57 before) but far worse at mislabeling valid transactions (1551 false alarms, versus 15 before).
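One way to quantify that tradeoff is precision: of the transactions flagged as fraud, how many really were fraud? A quick check, using sklearn's precision_score (not part of the original imports):

from sklearn.metrics import precision_score
# precision: fraction of flagged transactions that are actually fraudulent
print('Precision for Model : ' + str(round(precision_score(y_test, pred), 2)))

For this run, precision falls to roughly 0.07, compared with about 0.84 for the original logistic regression (78 correct flags out of 93).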

As a data scientist, you have to determine at what point the tradeoff is worth it. Generally, the cost of missing a fraudulent transaction is many times greater than the cost of misclassifying a good transaction as fraud. Your job is to find that balance point in your model training and proceed accordingly.
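As a final illustration, here is a minimal sketch of how you might compare models on estimated cost rather than raw accuracy. The dollar figures are purely hypothetical placeholders; substitute your own business numbers:

# hypothetical costs: a missed fraud is far more expensive than a false alarm
COST_MISSED_FRAUD = 500   # assumed average loss per undetected fraudulent transaction
COST_FALSE_ALARM = 5      # assumed cost of reviewing a valid transaction flagged as fraud

def EstimatedCost(cmat):
   fpos = cmat[0][1]   # valid transactions flagged as fraud
   fneg = cmat[1][0]   # fraudulent transactions that slipped through
   return fneg * COST_MISSED_FRAUD + fpos * COST_FALSE_ALARM

print('Estimated cost of errors: ' + str(EstimatedCost(cmat)))

With these made-up costs you can score each confusion matrix and pick the model, and the sampling strategy, that minimizes the total.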


 

Accelebrate offers Python training onsite and online.



Kevin McCarty

Kevin has a PhD in computer science and is a data scientist consultant and Microsoft Certified Trainer for .NET, Machine Learning and the SQL Server stack. He also trains and consults on Python, R and Tableau. Kevin has taught for Accelebrate all over the US and in Africa.

