
Tech5 - Face SDK FAQs

1. What is the operating score threshold of a system, and what are the guidelines for choosing it?

Biometric systems always have some error rate due to various factors (such as data quality), so the operating threshold of a system is chosen as a tradeoff between the acceptable error rates. In generic terms, one chooses how much risk can be taken and how much inconvenience caused by the system is acceptable. This translates to the risk of missing a bad person (false reject) versus the inconvenience of stopping innocent people (false accept). The choice depends on the use case and application. In most real-life applications, the error rate used as a baseline is the false accept rate (FAR), as it directly relates to the operational cost of the system (inconvenience due to secondary or manual checks). System operators therefore pick a FAR and check whether the false reject rate (FRR) realized by the algorithm at that FAR (the risk of missing a person) is acceptable. Note: the cost of a miss is hard to calculate and depends on the use case (for example, missing one terrorist versus missing one person trying to get an extra food stamp).

Let’s take an example. Suppose your use case allows you to accept the following:

If 1 million fraudulent people attempt to get approved and it is acceptable for only one of them to gain access to the system, then you choose the score threshold that corresponds to a false accept rate of 1 in 1,000,000 (one million); that is, you consider a 1:1,000,000 chance of letting an intruder in to be an acceptable FAR. For this you choose the score that corresponds to a FAR of 10^-6, which is 6 in our case (the score is -log10(FAR)).
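As a minimal sketch of this arithmetic, assuming the score is the base-10 value -log10(FAR) as the examples in this FAQ imply, the following Python snippet converts an acceptable false accept rate into a score threshold. The function name is illustrative and is not part of the SDK.

```python
import math

def far_to_score(far: float) -> float:
    """Convert a false accept rate (0 < far <= 1) into a -log10(FAR) score."""
    if not 0.0 < far <= 1.0:
        raise ValueError("far must be in the interval (0, 1]")
    return -math.log10(far)

# One acceptable false accept per 1,000,000 impostor attempts.
threshold = far_to_score(1 / 1_000_000)
print(round(threshold, 6))
```

For an acceptable FAR of 1 in 1,000,000 this gives a threshold of 6, matching the example above.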

Note: In a 1:1 use case (eKYC or authentication), this means 1 million people would have to try to break the system using the same ID for one of them to get approved, which is impractical. One might ask why the threshold cannot simply be set higher so that there are no false accepts. However, as the false accept rate is reduced, the false reject rate goes up; in other words, the system starts rejecting genuine people. The operator therefore has to make a careful decision and choose the sweet spot.

As a system operator, it is common to tune the system over time to find a good tradeoff between both error rates. The other option is to take a database that already exists in the system and perform a comprehensive test to decide on an acceptable threshold. Note also that the logic of choosing a threshold differs slightly between 1:1 and 1:N use cases.

For 1:N, higher thresholds are recommended: each search compares the probe against the entire database, so a lower threshold means more false matches and thus more false alarms and extra work.
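To illustrate why 1:N searches call for higher thresholds, the sketch below estimates how false matches accumulate over a database. It assumes that each of the N comparisons is independent and has the same per-comparison FAR of 10^(-score), which is a simplification; the function names are illustrative only.

```python
import math

def expected_false_matches(score_threshold: float, gallery_size: int) -> float:
    """Expected number of false matches per search: N * FAR."""
    if score_threshold <= 0:
        raise ValueError("score_threshold must be positive")
    far = 10.0 ** (-score_threshold)
    return gallery_size * far

def prob_any_false_match(score_threshold: float, gallery_size: int) -> float:
    """Probability of at least one false match per search: 1 - (1 - FAR)^N."""
    if score_threshold <= 0:
        raise ValueError("score_threshold must be positive")
    far = 10.0 ** (-score_threshold)
    # log1p/expm1 keep precision when FAR is tiny.
    return -math.expm1(gallery_size * math.log1p(-far))

enrolled = 100_000_000
for score in (6, 9):
    print(score,
          expected_false_matches(score, enrolled),
          prob_any_false_match(score, enrolled))
```

Under these assumptions, a database of 100 million at a threshold of 6 yields roughly 100 expected false matches per search, while a threshold of 9 yields about 0.1.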

The best and recommended way to choose a threshold is to perform testing on your own test database, which can be a small representative part of your real database or the whole database. Testing involves measuring error rates at different score thresholds and deciding which threshold makes sense for the use case at hand. We can offer tools (ROC plotting) and guidance to perform this step. This ensures that the algorithm scoring and the chosen thresholds are correlated with the system data. If the customer is not in a position to perform such testing, we can recommend thresholds for the use case based on our extensive experience in this field and the measurements we have done on large databases.
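The measurement step can be sketched as follows. This is a hedged illustration, not a Tech5 tool: it assumes you already have two lists of scores from your test database, one from genuine (same-person) comparisons and one from impostor (different-person) comparisons, and that a comparison is accepted when its score is at or above the threshold. All names and the toy data are hypothetical.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when a comparison is accepted if score >= threshold."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))

# Toy scores for illustration only; a real test needs far more comparisons.
genuine = [9.1, 8.4, 7.7, 6.9, 5.2, 10.3]
impostor = [0.3, 1.1, 0.8, 2.5, 6.2, 0.1, 1.9, 0.4]

for t in (5, 6, 7):
    far, frr = error_rates(genuine, impostor, t)
    print(f"threshold={t}: FAR={far:.3f}, FRR={frr:.3f}")
```

Sweeping the threshold in this way shows the tradeoff directly: raising it lowers FAR and raises FRR. Measuring very low FARs reliably requires a correspondingly large number of impostor comparisons.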

For example (these are just examples; we strongly recommend consulting with us):

  1. For eKYC (1:1), we can recommend a score threshold of 6, which means a false accept rate (FAR) of 1 in 1 million; for good-quality (ID category) enrollment, we can expect an accuracy of 99-99.9% depending on the database.
  2. For 1:N search applications, we suggest a threshold of 9 to get 99% accuracy if your database has 100 million people enrolled.

2. What is the simplest way to choose the score threshold?

Choose the score based on the chance of misidentifying a person (that is, of getting a false match) that you consider acceptable, and on the corresponding accuracy (TAR = 1 - FRR) you get at that score threshold:

Score 5: 1 in 100,000 chance the person is different
Score 6: 1 in 1,000,000 chance the person is different
Score 7: 1 in 10,000,000 chance the person is different
Score 8: 1 in 100,000,000 chance the person is different

Be aware that these estimates are not universal and depend on your particular database. It’s highly recommended to test the chosen score on your particular database.

We can help you choose the score if you provide us with the following information:

  1. Use case description
  2. Ethnic structure of your DB
  3. Samples of your DB (~50-200 images of enrolled and probe)

3. How does scoring work for face modality?

Score is -logFAR

These nodes load Tech5 biometrics technologies SDK to create templates and return quality scores. Each node loads only one technology to avoid disruption in services due to other technology failures. These are stateless nodes and thus can be configured for auto recovery and automatic scaling.

A score in biometrics is a measure of similarity or dissimilarity between samples. In other words, it expresses the level of confidence that the person is or is not the same. Each vendor uses a different range, and even when the range is the same, the scores within it can have a totally different meaning. Tech5 uses -logFAR scores.

The -logFAR score is an abstract number that correlates with the false match rate. Simply put, if the score is 6, the algorithm estimates a FAR of 10^-6; that is, the chance of this being a false match is 1 in 1 million, or there is a 1 in a million chance that the person is not the same.

So the interpretation is simple: the higher the score, the higher the confidence that the algorithm is not making a mistake.

The score is on a logarithmic scale:

Score 1 means FAR of 1/10 i.e. 1 in 10 chance the person is different
Score 2 means FAR of 1/100 i.e. 1 in 100 chance the person is different
Score 3 means FAR of 1/1,000 i.e. 1 in 1,000 chance the person is different
….
Score 6 means FAR of 1/1,000,000 i.e. 1 in 1,000,000 chance the person is different

Score 9 means FAR of 1/1,000,000,000 i.e. 1 in 1,000,000,000 chance the person is different and so on
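The table above follows directly from FAR = 10^(-score). As a small sketch (the function names are illustrative and not part of the SDK), the reverse conversion from a score to a FAR and to "1 in N" odds can be written as:

```python
def score_to_far(score: float) -> float:
    """Convert a -log10(FAR) score back into a false accept rate."""
    return 10.0 ** (-score)

def describe(score: float) -> str:
    """Render a score as '1 in N' odds, rounded to the nearest whole N."""
    odds = round(10.0 ** score)
    return f"Score {score}: about 1 in {odds:,} chance the person is different"

for s in (1, 3, 6, 9):
    print(describe(s))
```

Each additional point of score therefore corresponds to a tenfold reduction in the estimated chance of a false match.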

To standardize the interpretation of scores as shown above, the algorithm’s native similarity scores are mapped to -logFAR. To do so, our research department uses large databases to plot the native scores and then finds a corresponding mapping to the -logFAR score. We use this method of scoring as it is the most objective among the scoring methods used by the industry.
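The idea of such a mapping can be sketched as follows. This is not Tech5's actual calibration procedure, which this FAQ does not detail; it is a simplified illustration that assumes a list of native similarity scores from known impostor (different-person) comparisons and estimates the FAR at a native score as the fraction of impostor scores at or above it. All names and the toy data are hypothetical.

```python
import bisect
import math

def build_mapper(impostor_native_scores):
    """Return a function mapping a native score to an empirical -log10(FAR)."""
    ordered = sorted(impostor_native_scores)
    total = len(ordered)

    def to_neg_log_far(native_score: float) -> float:
        # Number of impostor scores at or above the given native score.
        at_or_above = total - bisect.bisect_left(ordered, native_score)
        # Floor the count at 1: the sample cannot support a FAR below 1/total.
        far = max(at_or_above, 1) / total
        return -math.log10(far)

    return to_neg_log_far

# Toy data: 1,000 impostor scores spread evenly over [0, 1).
mapper = build_mapper([i / 1000 for i in range(1000)])
print(round(mapper(0.9), 3))
print(round(mapper(0.99), 3))
```

The floor in the sketch shows why large databases are needed: with only 1,000 impostor comparisons, no score above 3 can be estimated from the data.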

Some companies use a native score, such as 0 to 1 or 0 to 100%. This method of scoring is very subjective and cannot easily be correlated with the expected operating point of the system. Any decision-making system, biometrics included, has error rates such as the false accept rate and the false reject rate. One chooses to operate the system at acceptable error rates based on the use case, so knowing the relationship between score and error rate is critical. Since a 0 to 100% score does not easily correlate with these error rates, choosing the score threshold is left to the interpretation of the operator. This type of scoring can be seen as a simplified method that gives a false assurance of confidence to a non-specialist operating the system. Its biggest pitfall is that it hides the false match rate. A 98% match by itself means nothing, as the same confidence can be obtained at different false match rates. In other words, the system is saying “I am 98% sure this person is the same, but I can’t tell how many non-matching people who attempt will get the same score.”

It is important to note that the accuracy of algorithms should not be compared on the absolute scores produced by the system, as they mean nothing by themselves. Accuracy should be compared based on error rates, and knowing the relation between error rates and score values makes this easy. Some vendors therefore provide a FAR-related score, where you can understand what is behind the numbers because they correlate with FAR. Other vendors use a normalized native score range or a percentage, which is simply designed to look more “simple” and “understandable” to users with limited understanding.

4. Do we support 32-bit?

Only the Android SDK for the ARMv7 architecture supports 32-bit.

5. What questions need to be answered to size a system (such as database size, number of parallel transactions at peak, etc.)?

For ABIS or server side FR SDK

  1. Is the use case 1:1 or 1:N?
  2. Number of faces in the enrollment DB, if it’s 1:N.
  3. Number of requests per second (probe images sent to be matched against the database).
  4. Peak parallel transactions. This is especially important when there is a hard SLA/restriction on maximum response time.
  5. Response time expectation or restriction.
  6. Description of the workflow or requested features (such as detection and recognition, detection only, or recognition only, and any additional features like age and gender recognition).
  7. Can template creation be done on the edge/client devices? This helps leverage edge compute power and requires less hardware in the backend.

For EagleEye

  1. Number of cameras or video streams that will be connected to one system
  2. Number of faces in the enrollment DB
  3. Estimated traffic: how many people per second pass by each camera/all cameras.
  4. Length of the track: how many images per person need to be stored in the archive. By default we store the 1 best frame per person, but we can also store more.
  5. Storage time of the archive (where we store all detected faces). In other words, how long are the stored faces expected to be retained?
  6. Is there a requirement for ad hoc 1:N face recognition searches on the archive?

6. Can you recognize identical twins?

Face recognition works by analyzing facial features and measuring the similarity between them. The algorithms are designed to be robust to variations in features such as hairstyle, skin tone, skin texture (makeup), and facial hair while still maintaining accuracy on a large database.

By nature, identical twins have very similar DNA, which means they have very similar facial features. Identical twins constitute a small percentage of the total population, and algorithms are designed to solve the problem for the larger percentage. In many cases, even humans are unable to tell the difference. Given that algorithms are designed to be robust and to work on large populations, the standard version of the algorithm is by design unable to distinguish between identical twins. Algorithms that are claimed to reliably distinguish identical twins either call for special equipment or high-resolution images, or work only on small databases; neither is practical for real-life use cases. Such an algorithm would not be robust to age gaps, for instance. The remarkable ability of FR to recognize faces with heavy makeup, after plastic surgery, with a large age gap, etc. is exactly what prevents it from telling twins apart, because all the differences between their faces lie within the normal distribution of differences in the face of a single person.

Why do some companies claim that they can recognize identical twins?

As mentioned above, on a limited dataset with constraints one can tune the system to achieve some differentiation between twins. But such systems or algorithms will not be robust to the large variation in faces observed in real-life, large-scale practical systems, which makes them suitable only for very specific use cases or for marketing purposes.

The following is an excerpt from the NIST report:

https://nvlpubs.nist.gov/nistpubs/ir/2018/NIST.IR.8238.pdf

“Twins: One component of the residual errors is that which arises from incorrect association of twins. The more accurate recognition algorithms tested here are incapable of distinguishing twins, not just identical (monozygotic) but also same-sex fraternal (dizygotic) twins. A twin, when present in an enrollment database will invariably produce a false positive if the twin is searched. Of the five algorithms tested, all incorrectly identify twins against eachother, except in many cases where the fraternal twins are of different sex. The inset table shows how often Twin A is not retrieved when Twin A, or Twin B, is searched. Twins constitute around 3.4% of all live infants in 2016 such that system operators might annotate twins in databases, and establish training and procedures to handle false positive outcomes.”

These are just a few of the FAQs. If you have further questions, please do not hesitate to ask us at
support@tech5-sa.com

THANK YOU