Electricity + Control

The rising cost of insurance in South Africa is compounded by an increase in fraud plaguing the industry. To reduce this cost, researchers in the Department of Computer Science at the University of Pretoria investigated the development of a secure framework using Big Data Science to predict whether an insurance claim is fraudulent.

This was done within the constraints of new legislation, such as the Protection of Personal Information (POPI) Act, which limits how data about a person can be stored and what it can be used for. Insurance fraud can be defined as the wrongful or criminal deception of an insurance company for the purpose of wrongfully receiving compensation or benefits.

Using Big Data Science to predict insurance claims fraud

Insurance fraud can further be split into hard and soft fraud, or planned and opportunistic fraud. Criminals who create false accidents or injuries are an example of planned (hard) fraud. In comparison, policy holders who inflate their claims to increase their monetary gain are an example of opportunistic (soft) fraud. Having large data sets adds little value unless enhanced insight can be gained from them. Therefore, there is an overlap between Big Data, data science and predictive analytics.

The financial services and insurance sector is becoming increasingly susceptible to cyber-security threats, such as malware, phishing and the abuse of internal access rights, as well as theft or loss of hardware and information. To mitigate this risk, the private and public sectors must undertake preventative measures, such as technical (hardware and software) prevention, training and vendor management. One prevention method involves masking any personal information that can be used to uniquely identify a person. If data is hashed thoroughly enough that policy holders’ information is no longer valuable to an external attacker, then the cyber-security risk can be seen to be partially mitigated.

Research framework

The Big Data Science process can be split into four phases: data preparation, machine learning, knowledge application and model maintenance.

The data preparation phase comprises data pre-processing (removing outliers, fixing or removing missing values and repairing discrepancies in fields such as names), data masking, data extraction, data cleaning, data import and data transformation. Open-source tools such as ‘OpenRefine’ are used to standardise the fields in order to eliminate any discrepancies that may affect the results. In normal operating procedures, pre-processing would ideally happen after data extraction. However, due to the sensitivity of the data, pre-processing needs to occur before data masking. Data masking is necessary in cases where anonymity is important.

Because this research aims to protect privacy, data masking is an integral step in the process. The data needs to be anonymised in such a way that a policy holder cannot be identified by a record, but where the data would still be usable in the machine learning process. This includes preventing a policy holder from being uniquely identified through alternative personal information, such as gender, postal code and birth date. Data extraction is the process whereby the data that has already been pre-processed and masked is moved from the source system into the target data source.
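The masking step can be sketched as follows. This is a minimal illustration only: the article does not state which hashing scheme was used, so the keyed SHA-256 digest (HMAC), the secret key and the field names below are all assumptions.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored outside the data set.
SECRET_KEY = b"replace-with-a-securely-stored-key"

# Hypothetical field names treated as personally identifying.
PERSONAL_FIELDS = ("name", "id_number", "postal_code", "birth_date")


def mask_value(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest of a normalised field value."""
    normalised = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalised, hashlib.sha256).hexdigest()


def mask_record(record: dict) -> dict:
    """Mask the personal fields of a record and leave other fields unchanged."""
    return {
        field: mask_value(str(value)) if field in PERSONAL_FIELDS else value
        for field, value in record.items()
    }


# Hypothetical claim record.
claim = {"name": "Jane Doe", "postal_code": "0002", "sum_insured": 250000}
masked = mask_record(claim)
```

Because the same input always produces the same digest, records belonging to one policy holder remain linkable for machine learning even though the original values are no longer readable. This is also why discrepancies in fields such as names need to be repaired before masking: two spellings of one name would otherwise produce two unrelated digests.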

Considerations such as different operating systems, software platforms, communication protocols and the structure of the data need to be taken into account. To apply this research to short-term insurance in countries other than South Africa, the data extracted should be generic enough to apply to property and casualty insurance globally. To do this, the Association for Cooperative Operations Research and Development (ACORD) standards for data extracts should be used. Once the data has been moved into a staging area for data cleaning, the second round of data pre-processing takes place.

The original data pre-processing step exists to prevent erroneous data after masking. The data-cleaning step, however, includes the bulk of the pre-processing. Once the data has been cleaned, it is stored. In a Big Data Science solution, the data is often stored in a cluster computing file system, because the cluster computing framework accommodates large-scale analytics. Although data transformation in an extract, transform and load (ETL) process often includes data cleaning, this step focuses on filtering and converting a finite set of input values to a finite set of output values. To run Apriori Association Rules on a model, the variables in the model generally need to be discrete rather than continuous.

To do this, the variables are discretised; an example of this is placing the continuous variables in factors that express a numeric range instead of a floating point number. Another example of data transformation in this use case is the creation of calculated fields. For the short-term insurance application, the date difference between the policy start date and the date of the claim is calculated. Once the data has been fully pre-processed, cleaned, masked and transformed, the machine learning phase begins.
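Both transformations can be sketched in a few lines. The bin edges and the example dates below are hypothetical; the article does not state which ranges were used.

```python
from datetime import date


def discretise(value: float, edges: list[float]) -> str:
    """Map a continuous value to a labelled range defined by ascending edges."""
    for lower, upper in zip(edges, edges[1:]):
        if lower <= value < upper:
            return f"[{lower}, {upper})"
    return f">= {edges[-1]}" if value >= edges[-1] else f"< {edges[0]}"


def days_to_claim(policy_start: date, claim_date: date) -> int:
    """Calculated field: days between the policy start date and the claim date."""
    return (claim_date - policy_start).days


# Hypothetical bin edges for the sum insured.
SUM_INSURED_EDGES = [0, 100_000, 500_000, 1_000_000]

band = discretise(250_000.0, SUM_INSURED_EDGES)            # "[100000, 500000)"
gap = days_to_claim(date(2016, 1, 1), date(2016, 1, 15))   # 14
```

The calculated field is itself continuous, so in practice it would also be passed through a discretisation step before an association rule algorithm is applied.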

Machine learning techniques to predict fraud are not commonly published, because they need to be kept confidential for security reasons. The machine learning phase comprises two sub-phases: model training and accuracy testing. To train the model, one needs to import the data from the Big Data structure. Predictive model creation is often seen as an iterative process instead of a single-phase approach. The model is trained by adding half of the data set to the model and applying the data mining or machine learning algorithm to this data set. This process is repeated with different combinations of variables.
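Dividing the data set into a training half and a testing half can be sketched as follows. Shuffling the records with a fixed seed before splitting is an assumption made here for reproducibility; the article does not describe how the halves were selected.

```python
import random


def split_in_half(records: list, seed: int = 42) -> tuple[list, list]:
    """Shuffle a copy of the records and return (training, testing) halves."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


claims = list(range(10))  # stand-in for ten claim records
training, testing = split_in_half(claims)
```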

The second half of the data set is used to test the model. Once the model has been created, its accuracy needs to be tested. Here, the variables that skew the results or are unnecessary are removed from the training data set.

Applying extracted knowledge to systems is a key step in the Big Data Science process. The main purpose of knowledge elicitation is to provide end-users with automatic recommendations. Instead of requiring analysts to portray the important knowledge generated from these systems in the form of rules, ratings and concept maps, it can be beneficial to create automatic rules that can be applied to existing systems. The rules that are generated by algorithms are often not immediately usable by end-user systems. Therefore, the rules are converted to XML, imported into the system and applied to new claims without having to rebuild hard-coded business rules.
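The conversion of generated rules to XML might look like the sketch below. The element and attribute names form a hypothetical schema, since the article does not describe the format expected by the end-user system, and the example rule is invented for illustration.

```python
import xml.etree.ElementTree as ET


def rules_to_xml(rules: list[dict]) -> str:
    """Serialise association rules to an XML string (hypothetical schema)."""
    root = ET.Element("rules")
    for rule in rules:
        node = ET.SubElement(root, "rule", lift=str(rule["lift"]))
        antecedent = ET.SubElement(node, "antecedent")
        for field, value in rule["antecedent"].items():
            ET.SubElement(antecedent, "condition", field=field, value=value)
        consequent = ET.SubElement(node, "consequent")
        for field, value in rule["consequent"].items():
            ET.SubElement(consequent, "condition", field=field, value=value)
    return ET.tostring(root, encoding="unicode")


# Invented example rule: claims made soon after the policy start date.
example_rules = [
    {
        "antecedent": {"days_to_claim": "[0, 30)"},
        "consequent": {"fraud_indicator": "yes"},
        "lift": 1.8,
    }
]
xml_text = rules_to_xml(example_rules)
```

Keeping the rules in a data file rather than in program code is what allows them to be regenerated and re-imported without rebuilding hard-coded business rules.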

If these business rules are applied to new claims, automatic recommendations can be made as to whether there is a fundamental issue with a claim. If the Apriori Rule indicates that a claim is possibly fraudulent, this recommendation is made. If a business rule is broken based on Apriori Rules, this will also be indicated. Once the model has been generated and applied, the rules should not be static. Very few organisations maintain their models long after they have been created.
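Applying imported rules to a new claim can be sketched as a simple matching step. The rule structure, field names and recommendation texts below are hypothetical and serve only to illustrate the idea.

```python
def rule_matches(claim: dict, antecedent: dict) -> bool:
    """True when every antecedent condition holds for the claim."""
    return all(claim.get(field) == value for field, value in antecedent.items())


def recommend(claim: dict, rules: list[dict]) -> list[str]:
    """Collect the recommendation of every rule whose antecedent matches."""
    return [r["recommendation"] for r in rules if rule_matches(claim, r["antecedent"])]


# Hypothetical rules, as they might look after being imported.
rules = [
    {"antecedent": {"days_to_claim": "[0, 30)", "sum_insured": "[500000, 1000000)"},
     "recommendation": "possibly fraudulent"},
    {"antecedent": {"policy_status": "lapsed"},
     "recommendation": "business rule broken"},
]

new_claim = {"days_to_claim": "[0, 30)", "sum_insured": "[500000, 1000000)",
             "policy_status": "active"}
flags = recommend(new_claim, rules)  # ["possibly fraudulent"]
```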

Maintaining a successful predictive analytics model is an integral step, as it can minimise overheads, increase reuse of the model and improve performance.

To facilitate the aforementioned process, a specific architecture to perform Big Data Science and produce automatic recommendations is suggested. File systems such as HDFS within Hadoop can be maintained using a data warehouse structure such as Hive. Hive translates an SQL-like language to MapReduce to facilitate data manipulation. To perform machine learning on the data within Hadoop, analytics tools are widely used. These Data Science tools include software such as R, ‘RapidMiner’, SAS, SQL and Python, of which R is the most commonly used. Alternative versions of R, such as Microsoft R Open and Microsoft R Server (Revolution R), have been developed to make the analytics performed by R more scalable and able to work with bigger data sets.

Alternatively, there are machine learning tools that form part of the Apache Software Foundation, such as Mahout. It is suggested that R is used for optimal results with smaller data sets. Mahout is suggested for bigger data sets to increase performance.

Testing the model on insurance claims fraud

Following the aforementioned process and architecture, the application of Big Data Science to insurance claims fraud was tested with privacy preserving data mining (PPDM). The validity of a rule is often classified using values such as lift, confidence and support. Lift measures the degree to which event A and event B are not independent.

Confidence, however, is the ratio of the number of transactions with the correct input and output factors to the number of transactions with the correct input factors. Lastly, support is the number of transactions with the correct input and output factors that generate a rule; it can also be expressed as a percentage of the total number of transactions in the record set. For this research, the researchers used lift as the most valuable indicator of the importance of a rule.
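The three measures can be computed directly from a set of transactions, as in the sketch below. Support is expressed here as a fraction of all transactions, and the four example claims are invented for illustration.

```python
def support(transactions: list[set], items: set) -> float:
    """Fraction of transactions that contain every item in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions: list[set], antecedent: set, consequent: set) -> float:
    """Support of antecedent and consequent together, divided by antecedent support."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)


def lift(transactions: list[set], antecedent: set, consequent: set) -> float:
    """Confidence divided by consequent support; 1.0 indicates independence."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Four invented claims, each a set of discretised items.
claims = [
    {"early_claim", "fraud"},
    {"early_claim", "fraud"},
    {"early_claim"},
    {"late_claim"},
]
a, c = {"early_claim"}, {"fraud"}

rule_support = support(claims, a | c)      # 2 of 4 claims = 0.5
rule_confidence = confidence(claims, a, c) # 0.5 / 0.75, about 0.67
rule_lift = lift(claims, a, c)             # about 0.67 / 0.5 = 1.33
```

A lift above 1 indicates that the antecedent and the consequent occur together more often than would be expected if they were independent.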

They therefore created a model with 34 distinct fields. These fields included information such as sum insured, total policy revenue, insurer, broker, policy start date, policy end date, personal information, and calculated fields such as loss ratio and the difference between the date of the claim and the policy start date. From these fields, the Apriori model was run with a confidence threshold of 0.5 and a support threshold of 0.4. With these thresholds, 181 distinct rules were generated. Importantly, personal information did not filter through to the rules, because the measures of validity for a rule include support. The personal information of a policy holder would not occur frequently enough to create a rule in a sufficiently large data set. However, setting high thresholds for lift, confidence and support might reduce the knowledge gained from the machine learning algorithm. One could therefore reduce the support threshold to infer more information.
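The effect of the support threshold on personal information can be illustrated with a minimal, pure-Python Apriori sketch. This is not the researchers' implementation or data: the five transactions and the "holder=" tokens are invented, and only the thresholds of 0.4 (support) and 0.5 (confidence) are taken from the article.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search: return {itemset: support} for frequent itemsets."""
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: k / n for c, k in counts.items() if k / n >= min_support}
        frequent.update(level)
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Derive (antecedent, consequent, confidence, lift) tuples from frequent itemsets."""
    rules = []
    for itemset, itemset_support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = itemset_support / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, consequent, conf, conf / frequent[consequent]))
    return rules


# Invented transactions; each holds one unique, masked policy-holder token.
transactions = [
    {"holder=a1", "early_claim", "fraud"},
    {"holder=b2", "early_claim", "fraud"},
    {"holder=c3", "early_claim"},
    {"holder=d4", "late_claim"},
    {"holder=e5", "late_claim"},
]

frequent = frequent_itemsets(transactions, min_support=0.4)
rules = association_rules(frequent, min_confidence=0.5)
```

Each "holder=" token occurs in only one of the five transactions, giving it a support of 0.2, so it is discarded at the first level and cannot appear in any rule. Lowering the support threshold to 0.2 would let these tokens through, which is why masking becomes more important as the threshold is reduced.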

This could be done to gain knowledge based on personal details where the number of times that the same information occurs is low. The data-masking step of the Big Data Science process then becomes highly important, if it is assumed that the intent of the machine learning algorithm is to find rules based on any facts. For example, if a syndicate committed fraud multiple times with the same alias across different insurers, brokers and agents, the necessity to mask personal data is clear.

This can be seen as a viable way to apply a machine learning algorithm to data while maintaining anonymity. From the research presented, the value of using Big Data Science to predict and prevent insurance claims fraud can be seen. Techniques that add value and can reduce cost will surely be welcomed by the insurance industry of South Africa and abroad, due to the increase in the cost of insurance.

Although predicting insurance claims fraud through Big Data Science cannot be said to be completely accurate, the value that is gained is the indication that there are fundamental issues with a claim and that a claim breaks rules. Finally, if the consequent of an association rule is the fraudulent claim indicator, then one can predict with confidence that a claim matching the rule is fraudulent. The POPI Act, however, reduces the chance that Big Data Science can be used across brokers and insurers.

With the use of PPDM, data hashing and the anonymisation of data, there is a higher chance that adopting Big Data Science will be acceptable. Insurers and brokers can have a higher assurance that they are not at risk of losing policy holders to competitors if they share this information, as there is not enough information to identify a person. In conclusion, it is proposed that if the insurance industry can agree to share information in this way, the incidence of fraud can be reduced. This will benefit insurers individually and the industry as a whole.

Courtesy of David Kenyon and Professor Jan Eloff, University of Pretoria