Credit card fraudsters are always changing their behavior, developing new tactics. For banks, the damage isn’t just financial; their reputations are also on the line. So how do banks stay ahead of the crooks? For many, detection algorithms are essential.
Given enough data, a supervised machine learning model can learn to detect fraud in new credit card applications. The model gives each application a score, typically between 0 and 1, that indicates the likelihood that it’s fraudulent. The bank then sets a threshold above which it treats an application as fraudulent. That threshold is typically chosen to keep false positives and false negatives at a level the bank finds acceptable.
False positives are the genuine applications that have been mistaken for fraud; false negatives are the fraudulent applications that are missed. Each false negative has a direct financial impact on a bank because it corresponds to a financial loss. False positives are a little trickier to map to a financial figure since they represent the opportunity cost of losing a customer.
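To make the trade-off concrete, here is a minimal sketch of how a threshold turns scores into false positives and false negatives. The scores, labels, and threshold values are hypothetical and chosen only for illustration; they are not the client’s data.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    true_positives: int = 0
    false_positives: int = 0
    true_negatives: int = 0
    false_negatives: int = 0


def confusion_counts(scores, is_fraud, threshold):
    """Count outcomes when every score at or above `threshold` is flagged as fraud."""
    counts = Counts()
    for score, fraud in zip(scores, is_fraud):
        flagged = score >= threshold
        if flagged and fraud:
            counts.true_positives += 1
        elif flagged and not fraud:
            counts.false_positives += 1   # genuine application mistaken for fraud
        elif fraud:
            counts.false_negatives += 1   # fraudulent application that was missed
        else:
            counts.true_negatives += 1
    return counts


# Hypothetical model scores and the true outcome for six applications.
scores = [0.05, 0.20, 0.45, 0.60, 0.85, 0.95]
is_fraud = [False, False, True, False, True, True]

low_threshold = confusion_counts(scores, is_fraud, threshold=0.5)
high_threshold = confusion_counts(scores, is_fraud, threshold=0.9)
```

With these made-up numbers, raising the threshold removes the false positive but lets one more fraudulent application through, which is the balance the bank has to strike.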
Given the impact fraudulent activity can have, it is vital to have an effective detection system. With fraudsters changing behavior constantly, a detection system is only effective if it can match this rate of change. This is the problem we were helping one of our clients solve.
A lengthy process
For the past few years, we’ve worked with a global financial institution on a variety of projects, including one to help improve their credit card application fraud detection.
Figure 1: The existing system's behavior
1. Customer applications go to the client’s application service
2. The service sends the application to an external vendor
3. The vendor sends a decision on whether that application is fraudulent to the service
4. The service approves or denies the application
The fraud decision is made in three steps:
Filter the application through the rules, or known signs of fraud. For example, the rules would flag an application if it contained a known fraudulent e-mail or SSN.
Filtered applications go to the model. This model is a decisioning algorithm trained on historical data. For each application, the model will give it a score from 0 to 1 representing the fraud probability.
The strategies determine what decision to make based on the threshold.
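The three steps above can be sketched as a small pipeline. Everything here is hypothetical: the rule list, the field names, the stand-in scoring function, and the threshold are placeholders, and we assume that a rule hit leads to a denial without scoring, which the vendor’s real system may handle differently.

```python
# Hypothetical list of e-mail addresses already known to be fraudulent.
KNOWN_FRAUD_EMAILS = {"mule@example.com"}


def rule_flags(application):
    """Step 1: return the names of any rules the application trips."""
    flags = []
    if application["email"] in KNOWN_FRAUD_EMAILS:
        flags.append("known_fraud_email")
    return flags


def model_score(application):
    """Step 2: stand-in for a trained model; returns a fraud score from 0 to 1."""
    return 0.9 if application["ip_distance_km"] > 1000 else 0.1


def decide(application, threshold=0.5):
    """Step 3: a simple strategy that turns rule flags and a score into a decision."""
    if rule_flags(application):
        return "deny"  # assumption: a rule hit short-circuits the model
    score = model_score(application)
    return "deny" if score >= threshold else "approve"


flagged_by_rule = decide({"email": "mule@example.com", "ip_distance_km": 5})
genuine = decide({"email": "ana@example.com", "ip_distance_km": 5})
flagged_by_model = decide({"email": "ana@example.com", "ip_distance_km": 5000})
```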
As fraudsters’ behavior changes, the rules, model, and strategies need to adapt. Any change to the rules, model or strategies requires a governance process to ensure the changes are not biased. An example bias would be denying credit card applicants based solely on age or location.
What’s more, these changes have to get on the backlog and release cycle for any third-party service providers. The vendor would then prioritize the changes against requests from other clients. This proved to be a lengthy process. When combined with the governance delays, it could easily take a year or two to get updates into production. And that gives the fraudsters a huge opportunity to exploit the system.
Introducing continuous delivery to machine learning
We had two goals for this system:
Reduce the time taken to update the model
Increase the model’s accuracy
Figure 2: Proposal for achieving the first goal
1. The customer’s application goes to the application service
2. The vendor receives it and filters it through the rules
3. The vendor sends the application (plus any extra data) back to the service
4. The service sends the application to all Challenger models and the Champion model
5. The champion model sends the score back to the service
6. The service sends the score back to the vendor
7. Given the score, the strategies decide if the application is fraud
8. The service approves or denies the application
We introduce the concept of challenger models, which are models that score applications in production but do not make fraud decisions. The client’s fraud strategy team looks at these scores and analyzes the models’ performance. When a model is performing well enough, the strategy team promotes it to become the champion. The score from the champion model goes to the strategies and helps make the fraud decisions. With challenger models, we ensure that the most accurate model score is used to make a fraud decision.
The champion model and all the challenger models exist inside the client’s system. This removes the need to wait on the vendor's backlog when the champion model needs updating.
This process is practicing continuous delivery with machine learning. Consider the challenger models as a staging environment for testing models against production data. When a model is ready to 'be deployed to production,' it’s promoted to the champion model. This promotion process reduces the time from a model’s inception to being deployed in production.
This system allows data scientists to experiment with different model algorithms, parameters, and feature sets against production data. Promoting a model to champion carries the risk that comes with making real decisions in production. Because the model has already performed well against production data as a challenger, a setup we call shadow mode, this risk is reduced.
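As a rough illustration of the champion and challenger idea, the sketch below keeps several models side by side, logs every score, and returns only the champion’s score for decision making. The class name `ModelRegistry`, the model names, and the constant-score models are hypothetical stand-ins, not the client’s implementation.

```python
class ModelRegistry:
    """Holds one champion and any number of challengers (hypothetical helper)."""

    def __init__(self, champion_name, models):
        if champion_name not in models:
            raise KeyError(champion_name)
        self.models = dict(models)
        self.champion_name = champion_name
        self.score_log = []  # every score is kept for later analysis

    def score(self, application):
        """Score with all models, but return only the champion's score."""
        champion_score = None
        for name, model in self.models.items():
            value = model(application)
            self.score_log.append((name, value))
            if name == self.champion_name:
                champion_score = value
        return champion_score

    def promote(self, name):
        """Make an existing challenger the new champion."""
        if name not in self.models:
            raise KeyError(name)
        self.champion_name = name


# Stand-in models that return fixed scores, for illustration only.
registry = ModelRegistry(
    "current",
    {"current": lambda app: 0.2, "candidate": lambda app: 0.8},
)
score_before = registry.score({"id": 1})
registry.promote("candidate")
score_after = registry.score({"id": 2})
```

Promotion here is a one-line change of which score is returned, which is the property that shortens the path from a model’s inception to production.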
Digging into data
To reach the second goal of improving the model’s accuracy, data munging (transforming and mapping data into more useful formats) needs to occur before model training. Data munging for historical data can take many forms, such as filtering, transforming, and deriving features.
One metric that can determine what happens with the data is cardinality — how many unique values are in a set. For instance, fields/columns that have a cardinality of 1 aren't useful for training at all. This is because a model cannot learn anything from a single value that has no variation across a dataset.
A similar problem occurs for columns with very high cardinality. Example columns with high cardinality are first name, email address, and street address. High cardinality makes the training process slower and can cause overfitting, where a model performs well on the data it was trained on but scores poorly on new data.
In both cases, it is better to drop these columns from the dataset before training the model.
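A cardinality check like this is easy to automate. The sketch below is one possible approach, not the code we used: the sample rows are made up, and the `max_ratio` cut-off for "very high" cardinality is an assumption that would need tuning on real data.

```python
def cardinality(rows, column):
    """Number of unique values a column takes across all rows."""
    return len({row[column] for row in rows})


def columns_to_drop(rows, max_ratio=0.9):
    """Columns with a single value, or with nearly one unique value per row.

    `max_ratio` is a hypothetical cut-off: a column is dropped when its
    cardinality divided by the number of rows exceeds it.
    """
    total = len(rows)
    dropped = []
    for column in rows[0]:
        unique = cardinality(rows, column)
        if unique == 1 or unique / total > max_ratio:
            dropped.append(column)
    return dropped


# Made-up applications: country never varies, first_name is unique per row.
rows = [
    {"country": "US", "first_name": "Ana", "card_type": "gold"},
    {"country": "US", "first_name": "Ben", "card_type": "basic"},
    {"country": "US", "first_name": "Cy", "card_type": "gold"},
    {"country": "US", "first_name": "Di", "card_type": "basic"},
]

dropped = columns_to_drop(rows)
```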
In other cases, instead of filtering or dropping data, it may be more useful to transform it. Transforming high cardinality columns, such as email address, will reduce variation. A good example of this transformation is to drop the first part of the email and only keep the domain name.
For example, eight unique email addresses that share three domain names move from a cardinality of 8 to a cardinality of 3. In a real-world scenario, millions of email addresses will reduce to a handful of domain names. Thus, the model can now determine the likelihood of fraud based on certain domain names.
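The domain transformation can be sketched in a few lines. The eight addresses below are hypothetical and use reserved example domains; the helper name `email_domain` is ours.

```python
def email_domain(email):
    """Keep only the part after the last '@', normalised to lower case."""
    return email.rsplit("@", 1)[-1].strip().lower()


# Eight hypothetical addresses spread across three domains.
emails = [
    "ana@example.com",
    "ben@example.com",
    "cy@Example.com",
    "di@example.org",
    "eve@example.org",
    "fay@example.org",
    "gus@example.net",
    "hal@example.net",
]

domains = [email_domain(email) for email in emails]
cardinality_before = len(set(emails))
cardinality_after = len(set(domains))
```

Lower-casing the domain is a small extra step so that addresses differing only in capitalisation do not inflate the cardinality again.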
Another technique data scientists use is engineering “features” that improve the model's accuracy. We can derive certain features from patterns in the data. A good example is the distance between the credit card application's zip code and the applicant’s known IP address location. If the distance is greater than a certain number, we could infer fraudulent activity.
Figure 3: Matching stated addresses with IP addresses
Needless to say, someone could legitimately apply for a credit card while on a business trip. Many credit card partners, such as airlines, offer an in-flight credit card application service. While this is a use case that could look like fraud, a well-trained model should handle it because its decisions are based on a particular combination of variables. A single variable won’t bias the decision too much.
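One way to derive the distance feature is the haversine formula for great-circle distance. This sketch assumes the zip code and the IP address have already been resolved to latitude and longitude by some lookup, which is not shown; the function names and the 500 km limit are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_features(zip_location, ip_location, limit_km=500.0):
    """Derive the raw distance and a far-from-home flag (limit is hypothetical)."""
    distance = haversine_km(*zip_location, *ip_location)
    return {"ip_distance_km": distance, "ip_far_from_zip": distance > limit_km}


# Approximate coordinates for New York and Los Angeles.
features = distance_features((40.7128, -74.0060), (34.0522, -118.2437))
```

Feeding the model both the raw distance and the flag leaves it free to weigh distance alongside other variables rather than treating it as a verdict on its own.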
Taking advantage of these techniques will produce a more effective model. With more accurate models and the ability to switch them out quickly, there is a better chance of catching the fraudsters. To make this possible, we needed to redesign their fraud detection system.
Designing the system
The system's new architecture has two parts: the data scientist workflow and the developer workflow. The goal of the data scientist workflow is to take data and output a trained model. The developer workflow then uses the trained model to score applications in production.
Data scientist workflow
The goal of the data scientist workflow is to use historical data to train a machine learning model. Our goal was to provide a space for experimentation with transformation code and training algorithms.
Figure 4: Algorithm training
1. The training data goes through data transformation
2. H2O receives the transformed data and the desired decisioning algorithm
3. H2O outputs the trained model and training statistics as Java code
4. The Java code is packaged as a jar file
5. A binary repository stores the jar file
With this system, we wanted to promote data science independence. The data scientists should be able to write all the transformation code in a language they are familiar with. In this system, the transformation code is written in Python.
For training the model, we use the AI framework H2O. One of the benefits was that it could use Python for the training and output the model as Java code, which is what the majority of the client’s systems were written in.
This system encourages Continuous Integration for a team of data scientists. As the team builds a model, a pipeline can track the transformation code and check the model’s accuracy. Besides training the model, H2O also outputs model statistics, including the validation score. To ensure quality, the build will fail if its validation score falls below a predefined level.
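A quality gate of this kind can be a very small script in the pipeline. The sketch below assumes the training statistics have already been read into a dictionary with a `validation_score` key; that key name and the 0.80 minimum are hypothetical, and reading the statistics from H2O is not shown.

```python
import sys

MIN_VALIDATION_SCORE = 0.80  # hypothetical predefined level


def check_validation_score(stats, minimum=MIN_VALIDATION_SCORE):
    """Raise if the model's validation score is below the agreed minimum."""
    score = stats["validation_score"]
    if score < minimum:
        raise ValueError(
            f"validation score {score:.3f} is below the minimum {minimum:.3f}"
        )
    return score


def build_step(stats):
    """Return a process exit code: 0 lets the build continue, 1 fails it."""
    try:
        check_validation_score(stats)
    except ValueError as error:
        print(f"Build failed: {error}", file=sys.stderr)
        return 1
    return 0


passing_exit_code = build_step({"validation_score": 0.91})
failing_exit_code = build_step({"validation_score": 0.62})
```

In a real pipeline the script would end with `sys.exit(build_step(stats))` so that the build server sees the non-zero exit code.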
The data scientist workflow ends with storing models as jar files in a binary repository.
Developer workflow
The first step in the developer workflow is to service-enable these models. To do this, the developer takes a jar file and wraps it in a simple Java service. This can be a simple service because it only needs to expose the model's score method. Because there aren't many complexities, we use Spark Java as the web framework.
Figure 5: Service enabling models
We need to ensure the scoring hasn't changed after transforming the model into a Java service. We achieve this through certification testing. We include an example application and its H2O model score in the jar file. When we send the example application to the service, we compare the resulting score to the H2O score. If they are within a certain tolerance range, due perhaps to differences in machine precision, we know the conversion from Python to Java went well.
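The comparison at the heart of a certification test is a tolerance check. In this sketch, `score_fn` stands in for a call to the deployed service, and the example application, reference score, and tolerance are all hypothetical values.

```python
import math


def certify(score_fn, example_application, reference_score, abs_tol=1e-6):
    """Return True when the service score matches the reference within tolerance.

    `score_fn` is a stand-in for calling the Java service; `reference_score`
    is the score recorded at training time and shipped with the model.
    """
    service_score = score_fn(example_application)
    return math.isclose(service_score, reference_score, rel_tol=0.0, abs_tol=abs_tol)


example_application = {"id": "example-1"}  # hypothetical packaged example

# A service that differs only by floating-point noise passes...
passes = certify(lambda app: 0.7312346, example_application, 0.7312345)
# ...while a genuinely different score fails.
fails = certify(lambda app: 0.74, example_application, 0.7312345)
```

An absolute tolerance is used here because scores are bounded between 0 and 1; a relative tolerance would behave oddly for scores close to zero.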
Now that our models have passed certification testing, we deploy them in “shadow mode.” This means they run against production data but do not influence real decisions. The diagram below shows how production data reaches the models in shadow mode and how the application service works with our system to get the score from the model. The application service receives applications and sends them to the vendor for rule filtering. Once the service has the score, it sends it to the vendor to filter through the strategies.
Figure 6: Deploying models in shadow mode
1. The application service sends the application to the Decisioning & Analytics platform
2. The platform sends the application to both the champion model and onto a message queue
3. The Champion model sends the application’s score to the platform
4. The platform sends the score to the application service
5. The shadow mode models pick up applications from the message queue
6. The shadow mode models send scores to the message queue
7. The platform receives scores from the message queue
8. The platform sends the scores to a model predictions store
The model may need us to do some transformation to the application data in the Decisioning & Analytics Platform. In this case, the data scientists write this transformation code in Python, and we use Tornado, a Python web framework, to turn this transformation code into a service.
Steps one to four are synchronous since the system needs the champion model score to make a final decision. Asynchronously, the models in shadow mode pick up applications from the message queue. If there were only one model in shadow mode, there would be no need for the message queue: the platform could send the application straight to the model. The queue allows for more models in shadow mode without updating the platform for each model.
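The split between the synchronous champion call and the asynchronous shadow models can be illustrated in a single process. This is only a sketch: the `Topic` class is an in-memory stand-in for a real message queue, threads stand in for separately deployed model services, and the models return fixed, made-up scores.

```python
import queue
import threading


class Topic:
    """In-process stand-in for a message queue that fans out to every subscriber."""

    def __init__(self):
        self._inboxes = []

    def subscribe(self):
        inbox = queue.Queue()
        self._inboxes.append(inbox)
        return inbox

    def publish(self, message):
        for inbox in self._inboxes:
            inbox.put(message)


STOP = object()  # sentinel telling a shadow worker to shut down


def shadow_worker(name, model, inbox, scores):
    """Pick up applications from the queue and record a score for each."""
    while True:
        application = inbox.get()
        if application is STOP:
            break
        scores.put((name, application["id"], model(application)))


def handle_application(application, champion_model, topic):
    """Publish for the shadow models, then score synchronously with the champion."""
    topic.publish(application)
    return champion_model(application)


topic = Topic()
scores = queue.Queue()
shadow_models = {"challenger_a": lambda app: 0.3, "challenger_b": lambda app: 0.7}
workers = []
for name, model in shadow_models.items():
    worker = threading.Thread(
        target=shadow_worker, args=(name, model, topic.subscribe(), scores)
    )
    worker.start()
    workers.append(worker)

champion_score = handle_application({"id": 1}, lambda app: 0.2, topic)

topic.publish(STOP)
for worker in workers:
    worker.join()

shadow_scores = []
while not scores.empty():
    shadow_scores.append(scores.get())
shadow_scores.sort()
```

Adding a third shadow model only means subscribing one more worker; `handle_application` does not change, which mirrors the reason for putting a queue between the platform and the shadow models.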
We store the shadow model scores so that the fraud strategy team can use them to measure model performance. When a model performs well, the team can decide to promote a shadow model to become the champion model.
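Once the true outcome of each application is known, the stored scores can be compared model by model. The sketch below uses the area under the ROC curve as the comparison metric, which is our assumption for illustration; the strategy team may rely on other measures. The scores and labels are made up.

```python
def roc_auc(scores, is_fraud):
    """Area under the ROC curve, computed by comparing every fraud/genuine pair.

    Equals the probability that a randomly chosen fraudulent application
    receives a higher score than a randomly chosen genuine one (ties count half).
    """
    fraud_scores = [s for s, fraud in zip(scores, is_fraud) if fraud]
    genuine_scores = [s for s, fraud in zip(scores, is_fraud) if not fraud]
    if not fraud_scores or not genuine_scores:
        raise ValueError("need at least one fraudulent and one genuine application")
    wins = 0.0
    for fraud_score in fraud_scores:
        for genuine_score in genuine_scores:
            if fraud_score > genuine_score:
                wins += 1.0
            elif fraud_score == genuine_score:
                wins += 0.5
    return wins / (len(fraud_scores) * len(genuine_scores))


# Hypothetical outcomes and stored scores for the same four applications.
is_fraud = [False, False, True, True]
champion_scores = [0.1, 0.6, 0.4, 0.9]
shadow_scores = [0.1, 0.3, 0.6, 0.9]

champion_auc = roc_auc(champion_scores, is_fraud)
shadow_auc = roc_auc(shadow_scores, is_fraud)
```

The pairwise loop is quadratic, which is fine for a sketch but would be replaced by a rank-based calculation on production-sized data.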
One of our goals was to decrease the time it takes to update a model. This system does that by introducing continuous delivery to machine learning. In shadow mode, we see how models perform in production but avoid the risk of them making the wrong decision. In this way, shadow mode is a staging environment for machine learning models. Practicing continuous delivery with machine learning can be very beneficial. As stated earlier, fraudsters are changing continuously. With the above system, we are enabling our system to change at their pace.
Tying it all together
Below is an overview of the value stream we were delivering to the client.
Figure 7: The client's value stream
Data ingestion involves munging the client’s historical data. After cleaning, filtering and transforming the data, it's passed on for model training. After training, model metadata gets published for evaluation by a governance group. This metadata includes:
Training data quality metrics. What are the missing fields and how many of them are there?
Training and validation periods. How far apart is the validation period from the training period?
Most important features contributing to model decisions
Model accuracy reports (ROC curve)
When governance approves the metadata, the model gets published to a binary repository. The development team then takes the model and publishes it as a service. These services get deployed in "shadow mode." The client’s fraud strategy team monitors their performance within this environment.
Once a given model outperforms others, it gets passed on for a final governance evaluation. This happens before promoting the model to become "champion." Once a model becomes "champion," it is now responsible for producing the score used to make a fraud decision.
The additional governance mitigates the risk that comes from making active decisions. Deploying models to "shadow mode" is much less risky since they do not make decisions. Instead, they only log scores. The final governance step is to ensure the best model is making decisions.
This value stream is not specific to credit card application fraud. It can be used to deploy any kind of machine learning model. For example, these models can tackle other problems such as:
Determining a pricing strategy for loans or interest rate decisions
Determining credit card marketing and preferences based on consumer habits
What this value stream is delivering is a custom machine learning platform as a service.
A big thank you to our fellow Thoughtworkers — Karun AB, David Johnston and Gareth Morgan, for their valuable feedback and suggestions for this article.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.