Predicting Fraud with Autoencoders and Keras

Overview

In this post we will train an autoencoder to detect credit card fraud. We will also demonstrate how to train Keras models in the cloud using CloudML.

The basis of our model will be the Kaggle Credit Card Fraud Detection dataset, which was collected during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

The dataset contains credit card transactions by European cardholders made over a two-day period in September 2013. There are 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.

Reading the data

After downloading the data from Kaggle, you can read it into R with read_csv():

library(readr)
df <- read_csv("data-raw/creditcard.csv", col_types = list(Time = col_number()))

The input variables consist of only numerical values which are the result of a PCA transformation. In order to preserve confidentiality, no more information about the original features was provided. The features V1, …, V28 were obtained with PCA. There are however 2 features (Time and Amount) that were not transformed.
Time is the seconds elapsed between each transaction and the first transaction in the dataset. Amount is the transaction amount and could be used for cost-sensitive learning. The Class variable takes value 1 in case of fraud and 0 otherwise.

Autoencoders

Since only 0.172% of the observations are frauds, we have a highly unbalanced classification problem. With this kind of problem, traditional classification approaches usually don’t work very well because we have only a very small sample of the rarer class.

An autoencoder is a neural network that is used to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. For this problem we will train an autoencoder to encode non-fraud observations from our training set. Since frauds are supposed to have a different distribution than normal transactions, we expect that our autoencoder will have higher reconstruction errors on frauds than on normal transactions. This means that we can use the reconstruction error as a quantity that indicates whether a transaction is fraudulent or not.

If you want to learn more about autoencoders, a good starting point is this video from Larochelle on YouTube and Chapter 14 from the Deep Learning book by Goodfellow et al.

Visualization

For an autoencoder to work well we have a strong initial assumption: that the distribution of variables for normal transactions is different from the distribution for fraudulent ones. Let’s make some plots to verify this. Variables were transformed to a [0,1] interval for plotting.

We can see that distributions of variables for fraudulent transactions are very different from normal ones, except for the Time variable, which seems to have the exact same distribution.
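
The plotting code is not shown here. As a minimal sketch of one way to draw such a comparison, assuming the data frame df read above and the dplyr, tidyr and ggplot2 packages, the hypothetical helper plot_class_densities() (not part of any package) rescales each variable to [0,1] and overlays the densities of the two classes:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical helper: rescales every column except Class to [0, 1]
# and draws one density per class for each variable.
plot_class_densities <- function(data) {
  data %>%
    mutate(across(-Class, ~ (.x - min(.x)) / (max(.x) - min(.x)))) %>%
    pivot_longer(-Class, names_to = "variable", values_to = "value") %>%
    ggplot(aes(x = value, fill = factor(Class))) +
    geom_density(alpha = 0.5) +
    facet_wrap(~ variable, scales = "free_y") +
    labs(x = "Scaled value", y = "Density", fill = "Class")
}

# With the data read above (assumed to exist in the session):
# plot_class_densities(df)
```

Variables whose two densities overlap almost completely carry little information for separating frauds from normal transactions.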

Preprocessing

Before the modeling steps we need to do some preprocessing. We will split the dataset into train and test sets and then we will min-max normalize our data (this is done because neural networks work much better with small input values). We will also remove the Time variable as it has the exact same distribution for normal and fraudulent transactions.

Based on the Time variable we will use the first 200,000 observations for training and the rest for testing. This is good practice because when using the model we want to predict future frauds based on transactions that happened before.
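
A minimal sketch of such a time-based split, assuming the dplyr package and the data frame df read above. The helper split_by_time() is hypothetical and only illustrates the idea: rank rows by Time, send the earliest ones to training, and drop Time from both sets.

```r
library(dplyr)

# Hypothetical helper: the n_train earliest rows (by Time) go to training,
# the rest go to testing; Time is dropped from both sets.
split_by_time <- function(data, n_train) {
  ranked <- data %>% mutate(.rank = row_number(Time))
  list(
    train = ranked %>% filter(.rank <= n_train) %>% select(-Time, -.rank),
    test  = ranked %>% filter(.rank > n_train) %>% select(-Time, -.rank)
  )
}

# With the data read above (assumed to exist in the session):
# splits <- split_by_time(df, 200000)
# df_train <- splits$train
# df_test <- splits$test
```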

Now let’s work on normalization of inputs. We created 2 functions to help us. The first one gets descriptive statistics about the dataset that are used for scaling. Then we have a function to perform the min-max scaling. It’s important to note that we applied the same normalization constants for training and test sets.

library(purrr)

#' Gets descriptive statistics for every variable in the dataset.
get_desc <- function(x) {
  map(x, ~list(
    min = min(.x),
    max = max(.x),
    mean = mean(.x),
    sd = sd(.x)
  ))
}

#' Given a dataset and normalization constants it will create a min-max normalized
#' version of the dataset.
normalization_minmax <- function(x, desc) {
  map2_dfc(x, desc, ~(.x - .y$min)/(.y$max - .y$min))
}

Now let’s create normalized versions of our datasets. We also transformed our data frames to matrices since this is the format expected by Keras.
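
The code for this step is not shown here. A minimal sketch in base R, assuming training and test data frames that each contain a Class column plus the predictors; prepare_inputs() is a hypothetical helper that mirrors the min-max scaling described above and takes the scaling constants from the training set only:

```r
# Hypothetical helper: min-max scales the predictors of both sets with the
# minimum and maximum taken from the training set only, and returns
# matrices (inputs) and label vectors.
prepare_inputs <- function(train, test) {
  predictors <- setdiff(names(train), "Class")
  x_train <- as.matrix(train[, predictors, drop = FALSE])
  x_test <- as.matrix(test[, predictors, drop = FALSE])
  mins <- apply(x_train, 2, min)
  ranges <- apply(x_train, 2, max) - mins
  rescale <- function(m) sweep(sweep(m, 2, mins, "-"), 2, ranges, "/")
  list(
    x_train = rescale(x_train), y_train = train$Class,
    x_test = rescale(x_test), y_test = test$Class
  )
}

# With the split data frames (assumed to exist in the session):
# inputs <- prepare_inputs(df_train, df_test)
# x_train <- inputs$x_train; y_train <- inputs$y_train
# x_test <- inputs$x_test; y_test <- inputs$y_test
```

Because the constants come from the training set, scaled test values can fall outside [0,1]; that is expected and avoids leaking test-set information into preprocessing.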

We will now define our model in Keras, a symmetric autoencoder with 4 dense layers.

library(keras)
model <- keras_model_sequential()
model %>%
  layer_dense(units = 15, activation = "tanh", input_shape = ncol(x_train)) %>%
  layer_dense(units = 10, activation = "tanh") %>%
  layer_dense(units = 15, activation = "tanh") %>%
  layer_dense(units = ncol(x_train))

summary(model)
___________________________________________________________________________________
Layer (type)                         Output Shape                     Param #      
===================================================================================
dense_1 (Dense)                      (None, 15)                       450          
___________________________________________________________________________________
dense_2 (Dense)                      (None, 10)                       160          
___________________________________________________________________________________
dense_3 (Dense)                      (None, 15)                       165          
___________________________________________________________________________________
dense_4 (Dense)                      (None, 29)                       464          
===================================================================================
Total params: 1,239
Trainable params: 1,239
Non-trainable params: 0
___________________________________________________________________________________

We will then compile our model, using the mean squared error loss and the Adam optimizer for training.

model %>% compile(
  loss = "mean_squared_error", 
  optimizer = "adam"
)

Training the model

We can now train our model using the fit() function. Training the model is reasonably fast (~ 14s per epoch on my laptop). We will only feed to our model the observations of normal (non-fraudulent) transactions.

We will use callback_model_checkpoint() in order to save our model after each epoch. By passing the argument save_best_only = TRUE we will keep on disk only the epoch with the smallest loss value on the test set.
We will also use callback_early_stopping() to stop training if the validation loss stops decreasing for 5 epochs.

checkpoint <- callback_model_checkpoint(
  filepath = "model.hdf5", 
  save_best_only = TRUE, 
  period = 1,
  verbose = 1
)

early_stopping <- callback_early_stopping(patience = 5)

model %>% fit(
  x = x_train[y_train == 0,], 
  y = x_train[y_train == 0,], 
  epochs = 100, 
  batch_size = 32,
  validation_data = list(x_test[y_test == 0,], x_test[y_test == 0,]), 
  callbacks = list(checkpoint, early_stopping)
)
Train on 199615 samples, validate on 84700 samples
Epoch 1/100
199615/199615 [==============================] - 17s 83us/step - loss: 0.0036 - val_loss: 6.8522e-04
val_loss improved from inf to 0.00069, saving model to model.hdf5
Epoch 2/100
199615/199615 [==============================] - 17s 86us/step - loss: 4.7817e-04 - val_loss: 4.7266e-04
val_loss improved from 0.00069 to 0.00047, saving model to model.hdf5
Epoch 3/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.7753e-04 - val_loss: 4.2430e-04
val_loss improved from 0.00047 to 0.00042, saving model to model.hdf5
Epoch 4/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.3937e-04 - val_loss: 4.0299e-04
val_loss improved from 0.00042 to 0.00040, saving model to model.hdf5
Epoch 5/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.2259e-04 - val_loss: 4.0852e-04
val_loss did not improve
Epoch 6/100
199615/199615 [==============================] - 18s 91us/step - loss: 3.1668e-04 - val_loss: 4.0746e-04
val_loss did not improve
...

After training we can get the final loss for the test set by using the evaluate() function.

loss <- evaluate(model, x = x_test[y_test == 0,], y = x_test[y_test == 0,])
loss
        loss 
0.0003534254 

Tuning with CloudML

We may be able to get better results by tuning our model hyperparameters. We can tune, for example, the normalization function, the learning rate, the activation functions and the size of hidden layers. CloudML uses Bayesian optimization to tune hyperparameters of models as described in this blog post.

We can use the cloudml package to tune our model, but first we need to prepare our project by creating a training flag for each hyperparameter and a tuning.yml file that will tell CloudML what parameters we want to tune and how.

The full script used for training on CloudML can be found at https://github.com/dfalbel/fraud-autoencoder-example. The most important modifications to the code were adding the training flags:

FLAGS <- flags(
  flag_string("normalization", "minmax", "One of minmax, zscore"),
  flag_string("activation", "relu", "One of relu, selu, tanh, sigmoid"),
  flag_numeric("learning_rate", 0.001, "Optimizer Learning Rate"),
  flag_integer("hidden_size", 15, "The hidden layer dimension")
)

We then used the FLAGS variable inside the script to drive the hyperparameters of the model, for example:

model %>% compile(
  optimizer = optimizer_adam(lr = FLAGS$learning_rate), 
  loss = 'mean_squared_error'
)

We also created a tuning.yml file describing how hyperparameters should be varied during training, as well as what metric we wanted to optimize (in this case it was the validation loss: val_loss).

tuning.yml

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: val_loss
    maxTrials: 10
    maxParallelTrials: 5
    params:
      - parameterName: normalization
        type: CATEGORICAL
        categoricalValues: [zscore, minmax]
      - parameterName: activation
        type: CATEGORICAL
        categoricalValues: [relu, selu, tanh, sigmoid]
      - parameterName: learning_rate
        type: DOUBLE
        minValue: 0.000001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE
      - parameterName: hidden_size
        type: INTEGER
        minValue: 5
        maxValue: 50
        scaleType: UNIT_LINEAR_SCALE

We describe the type of machine we want to use (in this case a standard_gpu instance), the metric we want to minimize while tuning, and the maximum number of trials (i.e. number of combinations of hyperparameters we want to test). We then specify how we want to vary each hyperparameter during tuning.

You can learn more about the tuning.yml file in the TensorFlow for R documentation and in Google’s official documentation on CloudML.

Now we are ready to send the job to Google CloudML. We can do this by running:

library(cloudml)
cloudml_train("train.R", config = "tuning.yml")

The cloudml package takes care of uploading the dataset and installing any R package dependencies required to run the script on CloudML. If you are using RStudio v1.1 or higher, it will also allow you to monitor your job in a background terminal. You can also monitor your job using the Google Cloud Console.

After the job is finished we can collect the job results with:
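
A minimal sketch, assuming the job was submitted with cloudml_train() as above and that Google Cloud credentials are already configured; the cloudml package provides job_collect(), which by default collects the most recently submitted job:

```r
library(cloudml)

# Collect the results of the most recent job (assumes it has finished).
job_collect()
```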

This will copy the files from the job with the best val_loss performance on CloudML to your local system and open a report summarizing the training run.

Since we used a callback to save model checkpoints during training, the model file was also copied from Google CloudML. Files created during training are copied to the “runs” subdirectory of the working directory from which cloudml_train() is called. You can determine this directory for the most recent run with:
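
A minimal sketch, assuming the tfruns package is installed and the working directory contains the “runs” subdirectory; latest_run() returns the most recent run and run_dir is its directory:

```r
library(tfruns)

# Directory of the most recent training run under "runs"
latest_run()$run_dir
```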

[1] runs/cloudml_2018_01_23_221244595-03

You can also list all previous runs and their validation losses with:

ls_runs(order = metric_val_loss, decreasing = FALSE)
                    run_dir metric_loss metric_val_loss
1 runs/2017-12-09T21-01-11Z      0.2577          0.1482
2 runs/2017-12-09T21-00-11Z      0.2655          0.1505
3 runs/2017-12-09T19-59-44Z      0.2597          0.1402
4 runs/2017-12-09T19-56-48Z      0.2610          0.1459

Use View(ls_runs()) to view all columns

In our case the job downloaded from CloudML was saved to runs/cloudml_2018_01_23_221244595-03/, so the saved model file is available at runs/cloudml_2018_01_23_221244595-03/model.hdf5. We can now use our tuned model to make predictions.

Making predictions

Now that we have trained and tuned our model we are ready to generate predictions with our autoencoder. We are interested in the MSE for each observation and we expect that observations of fraudulent transactions will have higher MSEs.

First, let’s load our model.

model <- load_model_hdf5("runs/cloudml_2018_01_23_221244595-03/model.hdf5", 
                         compile = FALSE)

Now let’s calculate the MSE for the training and test set observations.

pred_train <- predict(model, x_train)
# Row-wise sum of squared errors: proportional to the per-observation MSE
mse_train <- apply((x_train - pred_train)^2, 1, sum)

pred_test <- predict(model, x_test)
mse_test <- apply((x_test - pred_test)^2, 1, sum)

A good measure of model performance in highly unbalanced datasets is the Area Under the ROC Curve (AUC). AUC has a nice interpretation for this problem: it is the probability that a fraudulent transaction will have a higher MSE than a normal one. We can calculate this using the Metrics package, which implements a wide variety of common machine learning model performance metrics.
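
The two values shown next are the AUC on the training and test sets. A minimal sketch of the computation, written in base R so that the probability interpretation is explicit; auc_by_rank() is a hypothetical helper, and auc() from the Metrics package should return the same numbers when given the labels and the per-observation errors:

```r
# Rank-based AUC (Mann-Whitney form): the probability that a randomly chosen
# positive has a higher score than a randomly chosen negative.
auc_by_rank <- function(actual, score) {
  r <- rank(score)                      # average ranks, so ties count as 1/2
  n_pos <- as.numeric(sum(actual == 1))
  n_neg <- as.numeric(sum(actual == 0))
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# With the objects computed above (assumed to exist in the session):
# auc_by_rank(y_train, mse_train)
# auc_by_rank(y_test, mse_test)
```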

[1] 0.9546814
[1] 0.9403554

To use the model in practice for making predictions we need to find a threshold (k) for the MSE; then if MSE > k we consider that transaction a fraud (otherwise we consider it normal). To define this value it’s useful to look at precision and recall while varying the threshold k.

library(ggplot2)

possible_k <- seq(0, 0.5, length.out = 100)
precision <- sapply(possible_k, function(k) {
  predicted_class <- as.numeric(mse_test > k)
  sum(predicted_class == 1 & y_test == 1)/sum(predicted_class)
})

# The `+` must end the first line, otherwise R evaluates qplot() on its own
qplot(possible_k, precision, geom = "line") +
  labs(x = "Threshold", y = "Precision")

recall <- sapply(possible_k, function(k) {
  predicted_class <- as.numeric(mse_test > k)
  sum(predicted_class == 1 & y_test == 1)/sum(y_test)
})
qplot(possible_k, recall, geom = "line") +
  labs(x = "Threshold", y = "Recall")

A good starting point would be to choose the threshold with maximum precision, but we could also base our decision on how much money we might lose from fraudulent transactions.

Suppose each manual verification of fraud costs us $1, but if we don’t verify a transaction and it’s a fraud we will lose this transaction amount. Let’s find for each threshold value how much money we would lose.

cost_per_verification <- 1

lost_money <- sapply(possible_k, function(k) {
  predicted_class <- as.numeric(mse_test > k)
  sum(cost_per_verification * predicted_class + (predicted_class == 0) * y_test * df_test$Amount) 
})

qplot(possible_k, lost_money, geom = "line") + labs(x = "Threshold", y = "Lost Money")

We can find the best threshold in this case with:
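
A minimal sketch, assuming the vectors possible_k and lost_money computed above; best_threshold() is a hypothetical helper that returns the candidate threshold with the smallest total cost:

```r
# Hypothetical helper: the threshold whose associated cost is lowest
# (the first one, if several thresholds tie).
best_threshold <- function(thresholds, costs) {
  thresholds[which.min(costs)]
}

# With the vectors computed above (assumed to exist in the session):
# best_threshold(possible_k, lost_money)
```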

[1] 0.005050505

If we needed to manually verify all frauds, it would cost us ~$13,000. Using our model we can reduce this to ~$2,500.
