Matching rules and confidence scores

You should decide, and clearly document, what rules you will use for matching. For example, decide whether you will allow synonyms for first names, such as William and Bill, and how you will deal with common difficulties.
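A first-name synonym rule can be sketched as a lookup table. This is only an illustration: the table contents and the function names `canonical_first_name` and `first_names_match` are hypothetical, and a real service would need a much larger, documented list.

```python
# Hypothetical synonym table: maps informal first names to a canonical form.
# A real matching service would maintain and document a far larger list.
FIRST_NAME_SYNONYMS = {
    "bill": "william",
    "will": "william",
    "liz": "elizabeth",
    "beth": "elizabeth",
}


def canonical_first_name(name: str) -> str:
    """Return a trimmed, lower-case first name, mapped through the synonym table."""
    cleaned = name.strip().lower()
    return FIRST_NAME_SYNONYMS.get(cleaned, cleaned)


def first_names_match(a: str, b: str) -> bool:
    """Treat two first names as equal if they share a canonical form."""
    return canonical_first_name(a) == canonical_first_name(b)
```

With this table, `first_names_match("Bill", "William")` is true, while names that are not listed only match when they are identical after trimming and lower-casing.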

You should also decide which extra attributes to use to increase confidence in your matches, for example passport number. You can then design your matching service to prompt the GOV.UK Verify Hub to request additional attributes from the identity provider.

Your matching rules should aim to reduce the number of false positives and negatives. A false positive is where your matching service returns a match that is wrong. A false negative is where your matching service returns no match, but a record for the individual does exist in the datasource(s).

You can minimise false positives by making sure your datasource(s) are clean and accurate before using them for matching.

You can minimise false negatives by thoroughly testing your matching strategy on your datasource(s), analysing the results, and iterating your matching rules.
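One way to analyse test results is to run your matching rules over a set of test cases where the correct answer is already known, then count both kinds of error. The sketch below assumes a hypothetical `matcher` function that returns a record identifier or `None`, and test cases supplied as pairs of attributes and the expected record identifier (`None` where no record exists).

```python
from typing import Callable, Hashable, Iterable, Optional, Tuple


def count_errors(
    cases: Iterable[Tuple[Hashable, Optional[int]]],
    matcher: Callable[[Hashable], Optional[int]],
) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) for a labelled test set.

    Each case is (attributes, expected_record_id); expected_record_id is
    None when no record exists for that individual.
    """
    false_positives = 0
    false_negatives = 0
    for attributes, expected_id in cases:
        found_id = matcher(attributes)
        if found_id is not None and found_id != expected_id:
            false_positives += 1  # a match was returned, but it is wrong
        elif found_id is None and expected_id is not None:
            false_negatives += 1  # no match returned, but a record exists
    return false_positives, false_negatives
```

Comparing these two counts before and after each change to your rules shows whether an iteration has improved the matching strategy or only traded one kind of error for the other.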

Fuzzy matching

As well as looking for exact matches, you should consider looking for records that are an approximate match. The process for finding approximate matches is known as fuzzy or probabilistic matching.

As with exact matching, you will need to decide the rules for this matching process. The more relaxed your rules, the more ‘fuzzy’ your matching and the more likely you are to receive false positives. The tighter your rules, the less ‘fuzzy’ your matching and the more likely you are to receive false negatives.

However, if the risk of the wrong fuzzy match is low, you may decide to allow a fuzzy match with a low confidence score to access your service.

Deterministic matching relies on comparing the exact data from the identity provider with an existing record to find a precise match; fuzzy or probabilistic matching instead looks at whether there is enough shared information to be considered a match.

There are several fuzzy matching techniques you can use to search for matches, including:

  • Levenshtein
  • phonetic algorithms

Levenshtein

It can also be useful to assess how much the record and data from the identity provider disagree. You can do this by measuring the number of changes that need to be made to a string for it to match with another. The number of changes is known as the edit or Levenshtein distance.

The higher the Levenshtein distance, the greater the difference between the 2 strings.

For example, turning ‘Smith’ into ‘Smithy’ requires only one inserted character, y, so the pair has a Levenshtein distance of 1. ‘Smithy’ and ‘Smithe’ also have a Levenshtein distance of 1, because replacing y with e counts as a single substitution. ‘Smith’ and ‘Smythe’ have a Levenshtein distance of 2, as matching them requires replacing i with y and inserting e.

If you want to reduce the risk of false positives, you might choose to restrict fuzzy matching to cases with a maximum Levenshtein distance of 2 or 3.
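The distance can be computed with a standard dynamic-programming approach that counts insertions, deletions and substitutions as one edit each. The sketch below is a minimal illustration, not a prescribed implementation; the helper `within_distance` and its default limit of 2 are assumptions chosen to mirror the maximum distance suggested above.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def within_distance(a: str, b: str, max_distance: int = 2) -> bool:
    """Hypothetical rule: accept a fuzzy match up to max_distance edits,
    ignoring letter case."""
    return levenshtein(a.lower(), b.lower()) <= max_distance
```

For instance, `levenshtein("Smith", "Smithy")` returns 1. Short names need care: a limit of 2 or 3 edits is a large proportion of a 4-letter surname, so you may want a lower limit for short strings.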

Phonetic algorithms

You can use phonetic algorithms such as Metaphone and Soundex to help identify words or names that sound similar.

These algorithms can help encode homophones in the same way so they can be matched by your matching service. For example, Metaphone transforms words with a ‘ck’ sound to ‘k’.

This can be particularly helpful if your data was collected through telephone calls with users. For example, if a user called Philip gave his information to a telephone operator who entered his first name as Phillip, a phonetic algorithm would encode both spellings in the same way, so a search for Philip with a single l would still find his record.
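As an illustration of phonetic encoding, the sketch below implements the commonly described American Soundex rules: keep the first letter, replace later consonants with digits, collapse adjacent letters that share a digit, and pad the result to 4 characters. It is a minimal sketch for unaccented names; the names `SOUNDEX_CODES` and `soundex` are chosen for this example.

```python
# Digit codes for consonants; vowels, H, W and Y have no code.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code for a name, or "" if it has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous_code = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters that share a code
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous_code:
            encoded += code
        # Vowels reset previous_code to "", so a repeated code after a vowel is kept.
        previous_code = code
    return (encoded + "000")[:4]
```

Here `soundex("Philip")` and `soundex("Phillip")` both give `P410`, so comparing the codes rather than the raw strings would treat the two spellings as the same name.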

Confidence scores

You should assign a ‘score’ to indicate how confident you are in a match. You should aim to produce a single match with the highest confidence score.

For example, a high confidence score could suggest first name, last name, date of birth, and address all match to a record in your local datastore(s).

The risk of a mismatch to your service will affect the level of confidence required. For example, if your service is at high risk of identity fraud and uses LOA2, you will need a higher confidence in your matches than a service with a low risk of identity fraud.
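A simple scoring scheme can be sketched as a weighted sum over the attributes that match, with a required score that depends on the risk to your service. Everything specific here is an assumption made for illustration: the attribute names, the weights, the `high_risk` and `low_risk` thresholds, and the rule that a tie for the top score returns no match.

```python
from typing import List, Optional

# Hypothetical weights: how much each matching attribute adds to the score.
WEIGHTS = {"first_name": 20, "last_name": 25, "date_of_birth": 30, "address": 25}

# Hypothetical thresholds: higher-risk services demand a higher score.
REQUIRED_SCORE = {"high_risk": 90, "low_risk": 60}


def confidence_score(candidate: dict, record: dict) -> int:
    """Sum the weights of attributes that are present and equal in both."""
    return sum(
        weight
        for attribute, weight in WEIGHTS.items()
        if candidate.get(attribute) is not None
        and candidate.get(attribute) == record.get(attribute)
    )


def best_match(candidate: dict, records: List[dict], risk: str) -> Optional[dict]:
    """Return the single highest-scoring record, or None if the top score
    is below the required score or is shared by more than one record."""
    scores = [(confidence_score(candidate, record), record) for record in records]
    if not scores:
        return None
    top_score = max(score for score, _ in scores)
    top_records = [record for score, record in scores if score == top_score]
    if top_score < REQUIRED_SCORE[risk] or len(top_records) != 1:
        return None
    return top_records[0]
```

In this sketch a record that matches on everything except address scores 75, which is enough for the `low_risk` threshold but not for `high_risk`. Exact equality could be replaced by a fuzzy comparison for some attributes, with a lower weight for approximate matches.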

You should test your matching rules and confidence scores regularly and iterate them based on the results.