Fighting Fraud with Real-time Data Hubs

Rupert Nicolay
6 min read · Jan 3, 2020

For financial services firms, fraud is not just a cost and a regulatory concern: each false positive or false negative also represents a disruptive experience for the customer.

Unsurprisingly, enhancing fraud detection is an ongoing priority — and many firms have turned to advanced analytics and machine learning models to deliver improvements. In fact, fraud detection has shown up as the number one use case for machine learning and AI in banking in a number of surveys. Vendors are building machine learning and AI into their fraud detection products and banks have been experimenting building their own models.

Common to many of these initiatives is the use of new data streams. In particular, there are three data related trends that stand out:

  1. Enriched transaction data: Payment schemes worldwide are adopting the ISO 20022 standard, which allows a richer set of data to travel with the payment instruction; this may be useful in payments-related fraud detection models. Similarly, with the introduction of 3D Secure 2 (3DS2) for online card payments, richer data is exchanged during the transaction and used to estimate risk and to decide what form of additional verification may be required. 3DS2 moves beyond the original 3D Secure approach of enforcing basic multi-factor verification for ecommerce transactions.
  2. Data streamed from multiple transactional sources in real-time: Banks have started to build what some have termed ‘fraud hubs’, where data is streamed from multiple transactional systems in near real-time and then used for fraud detection in a holistic way. Data feeds may include card transactions, payments, online banking and ATM activity, and even loyalty partner interactions. Data may include not just transactions but also authentications, navigation, beneficiary management actions and more. Interestingly, these same combined datasets can often be useful for other purposes such as cross-sell and upsell modelling, credit default prediction and more.
  3. Cross-bank data sharing for payment fraud detection, and particularly AML efforts: While at an early stage, some banks have started discussions with others around sharing data for these use cases. An example of such an early-stage effort is the ‘Transaction Monitoring Nederland’ initiative, which aims to evaluate transactions from a number of Dutch banks. The models used for fraud detection in these multi-party scenarios, and the security approach required if transaction payloads are not to be shared, differ considerably from the approach a bank might take on its own.

The remainder of this article focuses on the second trend above.

Streaming data into a hub offers the obvious benefit of detecting abnormal customer behaviour, and hence possible fraud, in a more detailed way. Online card purchases, cardholder-present transactions in store, authentications, location info, payments initiated in mobile banking interfaces, ATM visits and more can all be combined as part of the anomaly detection process. A cardholder-present transaction in one location and an ATM withdrawal made at the same time in another may, for example, be identified as anomalous. Beyond near real-time fraud detection, these models may also help surface longitudinal changes in customer behaviour that should be investigated as part of a broader know-your-customer responsibility.
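As an illustration, one simple rule that becomes possible once card and ATM streams are combined in a hub is an ‘impossible travel’ check. The sketch below is a minimal, illustrative version of that idea; the field names, event structure and speed threshold are assumptions for the example, not from any real system:

```python
import math
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def impossible_travel(evt_a, evt_b, max_speed_kmh=900):
    """Flag two events for the same customer whose implied travel speed
    exceeds a plausible maximum (default roughly airliner speed)."""
    dist = haversine_km(evt_a["lat"], evt_a["lon"], evt_b["lat"], evt_b["lon"])
    hours = abs((evt_a["ts"] - evt_b["ts"]).total_seconds()) / 3600
    if hours == 0:
        return dist > 0.1  # simultaneous events must be co-located
    return dist / hours > max_speed_kmh

# A cardholder-present purchase in London and an ATM withdrawal in
# Amsterdam five minutes later imply a physically implausible speed:
pos = {"lat": 51.5074, "lon": -0.1278, "ts": datetime(2020, 1, 3, 12, 0)}
atm = {"lat": 52.3676, "lon": 4.9041, "ts": datetime(2020, 1, 3, 12, 5)}
print(impossible_travel(pos, atm))  # True
```

In practice such rules would sit alongside statistical models rather than replace them, and the events would arrive from different source systems via the hub's ingestion layer.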

Let’s look at some of the considerations in successfully implementing a fraud hub.

Ingesting data at the rate required (which may vary considerably by time of month and season) is a key challenge. In particular, consideration must be given to:

  • Integration into the source system pipelines — either via an enterprise service bus (which might present capacity constraints), through direct integration to a pipeline or via the use of pre-built adaptors that might work with queuing or event hub systems in use.
  • The rate of change in source system data schemas and how this will be managed as part of fraud hub operations. For example, activity logs for a frequently refreshed mobile banking app may see frequent updates necessitating flexibility and resources to accommodate these updates.
  • The possible need to calculate (on the fly and at speed) derived features that may be useful for modelling. For example, a rolling average number of cardholder-not-present transactions per month, or maximum and average ATM withdrawal values. These derived features may be helpful in creating anomaly detection models. Some form of time series smoothing might also be required for certain transaction streams.
  • In what format to land master (customer and product holding) and transactional data.
  • How to justify infrastructure expenditure to process and store the data when the business impact of future models is often not fully known.
  • Cloud environments have been popular in early efforts due to their dynamic scalability.
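To make the derived-features point concrete, here is a minimal sketch of an in-memory calculator that maintains a cardholder-not-present count and rolling ATM withdrawal statistics per customer. It uses a count-based window as a simple stand-in for the time-based windows and stream-processing infrastructure a production hub would use, and all names are illustrative:

```python
from collections import defaultdict, deque

class RollingFeatures:
    """Maintain simple per-customer derived features over a sliding
    window of the most recent `window` ATM transactions."""

    def __init__(self, window=100):
        self.atm = defaultdict(lambda: deque(maxlen=window))
        self.cnp_count = defaultdict(int)

    def update(self, customer_id, txn):
        """Fold one incoming transaction into the running features."""
        if txn["channel"] == "atm":
            self.atm[customer_id].append(txn["amount"])
        elif txn["channel"] == "cnp":  # cardholder-not-present
            self.cnp_count[customer_id] += 1

    def features(self, customer_id):
        """Return the current derived features for a customer."""
        amounts = self.atm[customer_id]
        return {
            "cnp_count": self.cnp_count[customer_id],
            "atm_max": max(amounts) if amounts else 0.0,
            "atm_avg": sum(amounts) / len(amounts) if amounts else 0.0,
        }

rf = RollingFeatures(window=3)
rf.update("c1", {"channel": "atm", "amount": 100.0})
rf.update("c1", {"channel": "atm", "amount": 300.0})
rf.update("c1", {"channel": "cnp", "amount": 20.0})
print(rf.features("c1"))  # cnp_count 1, atm_max 300.0, atm_avg 200.0
```

At real transaction volumes this logic would live in a streaming engine with persistent state rather than in process memory, but the shape of the computation is the same.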

Security for the data and modelling environment is a further important consideration — particularly when public cloud resources are used for the hub:

  • Some form of anonymization of key PII attributes might be done in the source pipeline before transmission to the hub. Security principles at the bank may dictate the technique used. Where possible, hashing attributes is preferable as it preserves patterns that models may recognize. For example, in one case the customer’s email domain was significant in detecting phishing-originated attacks that led to anomalous transactions.
  • Microsoft’s open source Presidio libraries are a useful tool for inline anonymization at scale and across platforms.
  • Most data platforms provide for security at rest and in transit. Structured data platforms often provide dynamic data masking and related features.
  • Since many of the transactional data sets will land as large semi-structured files that may be combined with structured master or other transactional data, some form of unified data access control may be necessary. Tools such as BlueTalon are useful in this space. Usage auditing is often also necessary.
  • Although the number of data scientists initially building models for anomaly detection may be small, it is worth thinking ahead to further use cases for the real-time data hub and the convenience with which security can be applied to a wider user community.
  • For cloud environments, security policy enforcement across the cloud configuration with ongoing monitoring may also be applied.
  • Trusted Execution Environments (TEEs) appear as an emerging option for cases where data from multiple banks is being brought together and limited sharing of transaction payload data is required — or perhaps where a bank runs different product lines in a very highly federated organization structure.
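As a concrete example of the hashing approach described above, the sketch below pseudonymises an email address by keyed-hashing the local part while keeping the domain, so that domain-level patterns (such as phishing campaigns concentrated on one provider) remain visible to models. The key handling shown is a placeholder; a real deployment would hold keys in a secrets store, not in source code:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # placeholder; manage via a secrets store in practice

def pseudonymise_email(email, key=SECRET_KEY):
    """Hash the local part of an email address with a keyed HMAC but
    keep the domain, preserving domain-level patterns for modelling
    while removing the directly identifying part."""
    local, _, domain = email.partition("@")
    digest = hmac.new(key, local.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{digest}@{domain}"

print(pseudonymise_email("alice@example.com"))  # e.g. '3f1a…@example.com'
```

Because the HMAC is deterministic for a given key, the same customer maps to the same pseudonym across streams, so joins in the hub still work; rotating the key breaks linkage to older data, which is itself a design decision to make deliberately.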

Banks may choose to use a combination of ISV devised detection models and bespoke models developed by themselves. Model development tooling is likely to be driven by the data science team’s experience and preference. Considerations may include:

  • The availability of training data that connects verified fraud incidents with the combination of data that will be aggregated in the hub. In some cases training data may have to be accumulated once the hub has been built.
  • The ability to deploy models to a highly scalable execution environment.
  • Model retraining, evolution and release management — including testing of releases.
  • Leveraging off-the-shelf anomaly detection, time series smoothing and other services in conjunction with bespoke model approaches.
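As a deliberately trivial stand-in for the off-the-shelf or bespoke anomaly models mentioned above, the sketch below flags values that sit unusually far from the mean of a customer's recent history, measured in standard deviations. The threshold and data are illustrative; a real hub would use far richer models over the combined feature set:

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Return the indices of values more than `threshold` standard
    deviations from the mean of the series."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # a constant series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]

# A customer's recent ATM withdrawals with one outsized amount:
withdrawals = [50, 60, 55, 52, 58, 54, 57, 53, 56, 900]
print(zscore_anomalies(withdrawals, threshold=2.5))  # [9]
```

Even this toy example shows why the derived features discussed earlier matter: the model is only as good as the per-customer history it can see at scoring time.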

Other areas to give thought to include:

  • Data models to be used when landing data in the hub, particularly for the master data. An option that has recently become available is to use Microsoft’s Banking Accelerator open Common Data Model for master customer and product data and to land transactional datasets in Azure Data Lake, which supports interaction with data stored in the Common Data Model.
  • The orchestration of actions post detection. This may involve initiating cases in existing fraud case management systems, suspending cards or profiles, generating outbound text messages to customers or similar. Data available in the hub may provide insight into the best action to take or communication channel to use. Cloud-based tools bring orchestration and integration capabilities.
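The orchestration step might be sketched as a simple rule-driven dispatcher that maps a detection alert to follow-up actions such as suspending a card, opening a case or messaging the customer. The action names, alert fields and score thresholds here are hypothetical; real systems would hand these off to existing case management and channel integrations:

```python
def orchestrate(alert):
    """Map a detection alert to a list of follow-up actions.
    Thresholds and action names are illustrative placeholders."""
    actions = []
    if alert["score"] > 0.9:
        # High confidence: stop further spend immediately.
        actions.append(("suspend_card", alert["card_id"]))
    if alert["score"] > 0.7:
        # Medium-or-higher confidence: investigate and notify.
        actions.append(("open_case", alert["customer_id"]))
        actions.append(("send_sms", alert["customer_id"]))
    return actions

alert = {"score": 0.95, "card_id": "C123", "customer_id": "U1"}
print(orchestrate(alert))
```

The interesting design question is less the dispatch logic than choosing the action and channel per customer, which is exactly where the hub's combined data can help.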

Fraud hubs offer exciting potential with uses that extend beyond fraud detection. Business results will vary considerably based on the data ingested, the sophistication of existing fraud solutions and the quality of the models built and maintained, but reductions in missed fraud events and in false positives, both in excess of 10%, appear to be achievable.

With the right attention given to security, cloud environments are often a great fit for building a fraud hub.


Rupert Nicolay

In my role at Microsoft I define blueprints for what our Services teams worldwide do to help our Financial Services customers achieve more. Views are my own.