
Curious Case of Actuarial Science, Geocoding and Machine Learning

“Life insurance is losing its appeal in the U.S. In 1965, Americans purchased 27 million policies, individually or through employers. In 2016, a population that was more than 50 percent larger still bought only 27 million policies. The share of Americans with life insurance has fallen to less than 60 percent, from 77 percent in 1989. Why this is happening remains a puzzle.” 

- Peter R. Orszag, Vice Chairman of Investment Banking at Lazard and Bloomberg Columnist

This article illustrates how Geocoding uncovers untapped value within generally overlooked Financial Risk Management categories, such as Life and Annuity, and how it can help address the modern-day business challenges Orszag describes. While Geocoding in Big Data is gaining prominence within Property & Casualty (P&C), we believe the real opportunity lies in actuarial adoption of an AI framework capable of processing consumable inputs that weren't visible before the 'Ease of Geocoding' era.

Establishing this premise for Life and Annuity, we then pivot toward crafting a general-purpose Geo-inclusive architecture that can help actuaries of all disciplines apply Machine Learning to a new generation of business problems, such as dwindling subscribers, and to risk-attributed challenges, such as Adverse Selection.


Nearly all data in the insurance business has a location attribute, e.g. the address of a policyholder or the location of an incident. However, many insurance companies have not utilized this component beyond billing and mailing. Randall E. Brubaker, FCAS, observes that several providers still rely on zip code-based risk calculations, notwithstanding the fact that zip codes are inconsistent in shape and subject to change. A location, on the other hand, helps determine the precise address of risk.

With the advent of Geocoding, location-based risk computation is not only simplified, but the efficiency of Big Data can drive millions of such computations in a matter of seconds. As a quick reminder, Geocoding is the process of assigning a unique identifier (UID) to an address, location, or geographic shape. This capability lets an analyst view an underwritten business through the lens of demographic or geographic attributes, which can be combined in such representations to investigate possible relationships. Let's see how this applies within Life and Annuity.
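To make the UID idea concrete, here is a minimal sketch of one common scheme, geohash encoding, which maps a latitude/longitude pair to a short string where nearby locations share a common prefix. This is an illustrative stand-in; production geocoders resolve addresses first and may use other identifier schemes.

```python
# Minimal geohash encoder: maps latitude/longitude to a short, sortable
# string UID; nearby locations share a common prefix.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=9):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, even = [], True
    while len(bits) < precision * 5:
        # Alternate between refining longitude and latitude.
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
    # Pack every 5 bits into one base-32 character.
    chars = []
    for i in range(0, len(bits), 5):
        idx = 0
        for b in bits[i:i + 5]:
            idx = (idx << 1) | b
        chars.append(BASE32[idx])
    return "".join(chars)

# Two locations on the same Manhattan block share a long common prefix,
# which is what makes the UID useful for Block Group-level aggregation.
print(geohash(40.7484, -73.9857))
print(geohash(40.7486, -73.9860))
```

Because prefixes nest, aggregating policies by the first few characters of the UID groups them into progressively finer geographic cells.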

Life and Annuity

Life and Annuity firms set aside reserves to hedge on market indexes based on assumptions generated with actuarial models. These models create buckets of population based on a few factors, usually five or fewer. The factors are used to segment a book of business into an even set of distributions so that aggregate ratios on attrition, withdrawals, and other metrics can be computed. This supports several analytical workloads, such as trend analysis on a year-over-year basis.
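The bucketing step above can be sketched in a few lines of pandas; the field names (`issue_age`, `lapsed`) and the lapse rate are hypothetical, purely to show the mechanics of even segmentation and aggregate ratios.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical book of business: one row per contract.
book = pd.DataFrame({
    "issue_age": rng.integers(25, 75, 10_000),
    "lapsed": rng.random(10_000) < 0.08,  # toy ~8% lapse rate
})

# Segment the book into five even buckets by issue age (quintiles),
# then compute the aggregate lapse ratio per bucket, the kind of
# figure that feeds year-over-year trend analysis.
book["age_bucket"] = pd.qcut(book["issue_age"], q=5)
lapse_ratios = book.groupby("age_bucket", observed=True)["lapsed"].mean()
print(lapse_ratios)
```

With only one or two such factors the buckets are easy to compare year over year, which is exactly why models have historically stopped at five or fewer.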

According to Sankar Virdhagriswaran, IBM fellow of Cognitive Decision Support Services, five or fewer variables make it nearly impossible to identify emerging trends attributed to a specific population or location. With Adverse Selection arising from this limited set of variables, policies are priced on broad characteristics, shifting the burden of coverage to Underwriting. The resulting wide classification of risk categories leads to higher pricing on annuity contracts and, subsequently, a larger pool of funds allocated for market hedging, which in turn limits investment options.

This inflexibility impairs insurers' ability to react quickly to changing market conditions or to create timely products that could be marketed in locations better suited to specific subscriber populations, such as Gen X and Millennials. It could be one of the contributing factors to Orszag's puzzle of the drop in Life coverage from 77 percent in 1989 to less than 60 percent today.

What if a rich vocabulary of statistically significant 'Geocodable' variables were available not just at the zip code level but at a granular Block Group level, helping address risk entities quarterly rather than yearly?

Addressing Adverse Selection, such granular variables would raise questions that zip codes cannot answer. For example, how does a daily commute from a low-traffic-density area to a gridlocked location impact my mortality? Should my Life premium be adjusted if I relocated from clean outskirts to an industrial belt within the same zip code? How much does my life expectancy change if I moved from a hurricane-prone region in the east to an earthquake zone in the west? And what if actuarial assumptions in Life accounted for a similar set of external attributes to determine the risk of attrition, withdrawal, and surrender?

In 2009, JAMA Internal Medicine published experiments revealing three groups of statistically significant variables applicable to health risk calculations. The accompanying table lists them along with their levels of explained variance (R²).

This builds a case for a similar trial to identify 'Geocodable' variables for modeling, not just within Life and Annuity but across all disciplines of actuarial science. Moreover, the process of surfacing these hidden attributes has to be made simple and scalable so that actuaries can adopt it without investing heavily in technology. Let's evaluate how a self-service framework can help.

Geo-Inclusive AI Framework for Actuarial Learning

A modern intelligent framework cannot be sustained without accounting for economies of scale. A study by IDC and IDG Enterprise shows companies experiencing exponential growth of data, led by Geolocation with IoT a close second. As more data is collected, streams of significant attributes are expected to follow, with Geolocation, Wearables, and Socioeconomic data continuously landing in Big Data.

With the affordability of Big Data (at one-twentieth the cost), insurance firms have every reason to extend their modern data architectures to be Geo-inclusive and enjoy a low cost of failure in enterprise-wide experimentation. Note that Geocoding in Big Data can introduce hundreds of clean, ready-for-consumption variables, requiring actuaries to apply Dimension Reduction, a common process in Machine Learning, to consolidate the number of random variables.

This leads us to envision an Information Rendering Framework (IRF) that can recommend relevant attributes. Mixing incumbent variables with Geocoded ones, IRF applies Dimension Reduction techniques such as Principal Component Analysis (PCA) and Correlation-based Feature Selection (CFS) to select the best collection of variables for training and evaluation. These variables can then be fed to Machine Learning techniques such as REPTree, Multi-Linear Regression, Artificial Neural Networks, and Random Tree Classifiers within actuarial cycles to improve the precision with which applicant risk is predicted.
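The two reduction routes can be sketched with scikit-learn on synthetic data. The design matrix here is hypothetical, and `SelectKBest` with an F-test is a simplified stand-in for full CFS (which also penalizes inter-feature redundancy); both paths illustrate how IRF could shrink a mixed pool of incumbent and Geocoded variables before model training.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
# Hypothetical design matrix: 12 incumbent + Geocoded variables per policy.
X = rng.normal(size=(500, 12))
# Toy target driven by a few of the columns, plus noise.
y = 2.0 * X[:, 0] + X[:, 3] - X[:, 7] + rng.normal(scale=0.5, size=500)

# Route 1 - PCA: project correlated variables onto orthogonal components.
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)

# Route 2 - correlation-based selection (simplified CFS): keep the k
# original features most associated with the target.
selector = SelectKBest(score_func=f_regression, k=5)
X_cfs = selector.fit_transform(X, y)

print(X_pca.shape, X_cfs.shape)  # both reduce 12 columns to 5
```

A practical difference worth noting: CFS-style selection keeps original, interpretable variables (useful under actuarial governance), while PCA components mix all inputs and are harder to explain to a regulator.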

A quantitative exercise on Kaggle by Noorhannah Boodhun and Manoj Jayabalan, using a dataset with 128 attributes volunteered by Prudential Life, revealed that the REPTree algorithm performed best with the CFS method, whereas Multi-Linear Regression performed best with PCA.

In both cases, Machine Learning enhanced accuracy and relevance within the actuarial cycle, predicting applicant risk with better precision.

With better classification and higher representation, IRF not only addresses the Adverse Selection issue but also validates results for minimal bias and a reduced probability of overlooked 'Geocodable' attributes. To validate this, let's study two cases from the Institute and Faculty of Actuaries, where actuaries successfully identified Geocoded datasets applying strategies similar to IRF.

Exposure Management

Supervised Machine Learning was successfully applied in Exposure Management as a data cleansing tool to predict missing 'Geocodable' property attributes such as year built and number of floors. The results showed that Stochastic Gradient Boosting was the most accurate learning algorithm for both variables.

The Total Insured Value, or property value, was the most influential feature in predicting the number of floors of a building, while the latitude and longitude of a property had the greatest influence in predicting the year a building was built. Other key findings: the 'number of floors' model had a Poisson deviance error of 1 floor, and the 'year built' model had a root mean squared error (RMSE) of 11.51 years. The report finds Machine Learning beneficial for learning definite patterns from training data in order to complete missing values.
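The imputation pattern described above can be sketched with scikit-learn's gradient boosting on a synthetic portfolio. The geography, field names, and the relationship between location and 'year built' are invented for illustration; only the workflow (train on complete records, predict the missing ones, check RMSE) mirrors the case study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 1000
# Hypothetical exposure records: coordinates, Total Insured Value (TIV),
# and the attribute we will pretend is partially missing.
lat = rng.uniform(50.0, 58.0, n)
lon = rng.uniform(-5.0, 1.0, n)
tiv = rng.lognormal(mean=12.0, sigma=0.6, size=n)
# Toy assumption: newer buildings cluster toward the north-east.
year_built = (1900 + 10 * (lat - 50) + 5 * (lon + 5)
              + rng.normal(scale=5.0, size=n)).round()

# Mask ~20% of 'year built' to mimic incomplete records.
missing = rng.random(n) < 0.2
X = np.column_stack([lat, lon, np.log(tiv)])

# Train on complete records; impute the masked ones.
model = GradientBoostingRegressor(random_state=0)
model.fit(X[~missing], year_built[~missing])
imputed = model.predict(X[missing])

rmse = np.sqrt(np.mean((imputed - year_built[missing]) ** 2))
print(f"RMSE on held-out 'year built': {rmse:.1f} years")
```

Because the true values of the masked rows are known here, the RMSE directly measures imputation quality, the same yardstick (11.51 years) the case study reports.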

Interest Rate Forecasting

Interest rate forecasting matters in various actuarial practice areas such as Life Insurance, Asset Liability Management, Liabilities Valuation, and Capital Modelling. This case study describes a model that reads and performs sentiment analysis on central bank communications. Central banks, like the Bank of England (BoE), exert vast influence on interest rates via monetary policy, and the tone or sentiment of their communications sets expectations in the market.

Supervised Machine Learning techniques, including Geocoding of location expressions in Twitter messages, were used to train an ensemble model that classified BoE communications in a fully automated and scalable way. The results of the sentiment analysis fed an interest-rate forecasting model which, given the inherent uncertainty of forecasting, produced a range of feasible outcomes.
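A toy version of the classification stage might look like the following: a TF-IDF representation feeding a soft-voting ensemble of two simple classifiers. The snippets and hawkish/dovish labels are invented stand-ins for labelled BoE text; the case study's actual ensemble and features are not specified here.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled snippets of central-bank language:
# 1 = hawkish (rates likely up), 0 = dovish (rates likely down).
texts = [
    "inflation pressures are building and tightening may be warranted",
    "the committee stands ready to raise rates to curb inflation",
    "robust growth supports a gradual withdrawal of stimulus",
    "upside risks to prices justify a firmer policy stance",
    "weak demand argues for continued accommodative policy",
    "the committee will maintain low rates to support employment",
    "subdued inflation allows policy to remain highly supportive",
    "downside risks warrant further easing of monetary conditions",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Soft voting averages the two classifiers' predicted probabilities,
# so the ensemble's confidence can feed a rate-forecasting model as a
# graded sentiment signal rather than a hard label.
ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[("lr", LogisticRegression()),
                    ("nb", MultinomialNB())],
        voting="soft",
    ),
)
ensemble.fit(texts, labels)
print(ensemble.predict(["policy must tighten to contain price pressures"]))
```

In a real pipeline the probability output, not the 0/1 label, would be the input to the interest-rate model, which is what allows it to express a range of feasible outcomes.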


Actuarial professionals are no strangers to Big Data and unstructured datasets, which makes adoption of IRF an easier curve to follow. Moreover, Geocoding and Machine Learning relieve actuaries of the requirement that all data be assigned to limited groups of pre-determined classifications.

Geocoding on Big Data opens up a broader range of analysis of how an insured entity relates to its location, demographic segment, and so on. While Big Data allows modelers to run assumptions and subsequent computations on hundreds of millions of views across multiple classifications, Machine Learning helps automate complex attribute validation involving far more than five variables.

Change Management

Organizationally, actuaries will face several friction points when fitting a data science lifecycle within actuarial control cycles, since one is unregulated while the other is heavily regulated. An integrated path from strategy planning to production is therefore critical for such initiatives to succeed.

To avoid derailing such projects, V-Squared Founder and Chief Data Scientist Vin Vashishta suggests an organizational strategy of running the data science team as a startup within the company, so that incompatible parts of the process can be removed quickly. Instead of taking 1-2 years, the integration timeline drops to 3-6 months. This also brings clarity from an ROI standpoint, as revenue planning is based on validated prototypes rather than hopeful projects. Companies adopting this strategy will be well positioned for the next generation of challenges facing the Insurance industry.

If you are an actuary or an FRM professional and have comments or questions, please feel free to reach out to me at

