Created
September 6, 2020 22:08
Simple sample to resolve identity
module IdentificationProcessing

open System
open Domain
open ContractDomain

type DataSubjectIdentityResolver =
    string          // the name of the type
        -> obj      // the object instance to look into
        -> string   // the owner identity

// Here is the tricky part of the identification process: we may have information about the instance of the
// object to look into, but we may also have none. In that case, we use the internal
// scoring processor, which is able to score how well the data matches.
// The default case of the resolver should use the DataSubjectIdentifier, but the implementation must be adapted
// so that the definitions fit the actual needs.
let resolveOwner : DataSubjectIdentityResolver =
    fun s i ->
        match s with
        | "ContractEvent" ->
            // TODO: use a generic approach since we don't necessarily have the dependencies in the scope of the processing
            let c : ContractEvent = downcast i
            c.PartnerId
        // to be continued...
        | _ -> "-"

(*
The idea is to reduce complexity by filtering the eligible data subjects.
First, we look for data subjects in the index by running a query
that uses all the fields present in the source vector. Then we use
the result to perform a proximity evaluation for each item in the result set.

Let's imagine the following object:

    const o = {email: "[email protected]", firstname: "john", lastname: "doe", phone: "+41795477978"}

The system transforms it into the following Lucene query:

    q = "email:[email protected] OR firstname:john OR lastname:doe OR phone:+41795477978"

Then it runs the search and, based on the result, performs an evaluation using both
a distance on each provided field and a score according to the presence of the field in the result.
A result may contain only some of the provided fields. For instance, it may contain only the email,
the phone and the firstname.

A table gives the weight of each field in the comparison.
Note: in our case, the weight of each field should be relative, so it does not matter whether the
weights sum to a probability.
For instance, we may have this table:

    table = {
        email: 0.6
        firstname: 0.05
        lastname: 0.05
        phone: 0.3
    }

Here, we follow the idea that a matching email provides a confidence of 60% that both items match.
Suppose that, in our example, the result only has the email field and the phone field.
We set the probability gate to 0.8, which means that a full match (email and phone) passes and
every other comparison fails, which is a bit aggressive in terms of filtering. To tackle that,
we add the distance comparison, which weights the result. The distance is transformed into a match
rate, which is multiplied by the field's weight to obtain the final result.

Let's express that mathematically. Let
- fw: the field's weight,
- d: the match rate derived from the distance between the field in the result and the field to check for,
- p: the proximity between two vectors.
We have

    p = \sum_{i=0}^{n} fw_i \cdot d_i

where
- fw_i is the field weight at position i in the vector,
- d_i is the match rate of the item at position i in the vector.

As stated earlier, the threshold is 0.8 (which should actually be set by the business). So if the
matching probability is over this threshold, we have a potential match.

Rules for result processing should be defined by the business. For instance, we can use the following definition:
- If no match is found, the identity is created (and indexed; we have to take care of the
  potential latency of the index versus the velocity of the event stream).
- If only one match is found, we associate the event with the identity.
- If more than one match is found, we generate an alert to let a manager select the identity,
  and the system associates the event with all potential identities (the association is made with the flag 'potential').
  When the manager selects the correct identity, the events are reviewed so that all events associated with the other
  identities are dropped and the 'potential' flag is removed.

To address the velocity of the stream versus the latency of the indexing
process, we can keep TTL-based local records (with a TTL at least equal to the average maximum indexing
process duration). All local records are added to the search result. Since all records are TTL-based, we only keep a relevant
and acceptable number of records in the processing window.
*)
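
// Worked example of the proximity formula above. This is an illustrative sketch, not part of
// the original design: `exampleProximity` and `exampleScore` are hypothetical names, and the
// match rates are made up. Each pair is (field weight, match rate).
let exampleProximity (weightedRates: (float * float) list) : float =
    weightedRates |> List.sumBy (fun (fw, d) -> fw * d)

// Assumed input: the email matches exactly (rate 1.0), the phone differs by 1 character
// out of 12 (rate 11/12), and firstname and lastname are absent from the result (rate 0.0).
let exampleScore =
    exampleProximity [ (0.6, 1.0); (0.05, 0.0); (0.05, 0.0); (0.3, 11.0 / 12.0) ]
// 0.6 * 1.0 + 0.3 * (11/12) is about 0.875, so this candidate passes the 0.8 gate even
// though the phone is not an exact match.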

// Defines a partial representation of a data subject.
// See the F# implementation of the event hub stack for information
// about the real representation of a data subject (which is more abstract).
type DataSubjectInfo = {
    id: string          // unique id of the data subject in our system
    Email: string       // email of the data subject
    Firstname: string   // firstname of the data subject
    Lastname: string    // lastname of the data subject
    Phone: string       // phone of the data subject
    // to be completed...
}

// Defines a resolver able to map the name of a field to its weight (a float)
and FieldWeightResolver = string -> float

// Defines a data source to look into in order to load data subjects
and DataSubjectSearchProvider =
    string                          // the query to launch for searching
        -> DataSubjectInfo list     // the list of data subjects found

// Defines a service able to calculate the distance between two strings
and DistanceCalculator = (string * string) -> int

// Returns the match rate associated with the distance between two strings
and DistanceScorer = DistanceCalculator -> (string * string) -> float

// Defines a scorer that compares two data subjects
and DataSubjectIdentityMatcher =
    DistanceScorer                  // dependency: reference to the scorer to use
        -> DistanceCalculator       // dependency: reference to the service able to calculate the distance
        -> FieldWeightResolver      // dependency: reference to the service able to weight a field
        -> DataSubjectInfo          // the data subject to check for
        -> DataSubjectInfo          // the data subject to check against
        -> float                    // identity matching rate

// Defines the filtering process to apply to the search results
and DataSubjectIdentityFilter =
    DataSubjectIdentityMatcher      // dependency: reference to the service able to match two identities
        -> DistanceScorer           // dependency: reference to the scorer to use
        -> DistanceCalculator       // dependency: reference to the service able to calculate the distance
        -> FieldWeightResolver      // dependency: reference to the service able to weight a field
        -> float                    // configuration: the threshold over which a match is accepted
        -> DataSubjectInfo          // the data subject to check the identity for
        -> DataSubjectInfo list     // the list of data subjects to check against
        -> DataSubjectInfo list     // the list of matching identities

// Defines the data subject identifier service
and DataSubjectIdentifier =
    DataSubjectSearchProvider           // dependency: the service able to look for data subjects
        -> DataSubjectIdentityFilter    // dependency: the filter to apply to the search results
        -> DataSubjectIdentityMatcher   // dependency: the service able to match two identities
        -> DistanceScorer               // dependency: reference to the scorer to use
        -> DistanceCalculator           // dependency: reference to the service able to calculate the distance
        -> FieldWeightResolver          // dependency: reference to the service able to weight a field
        -> float                        // configuration: the threshold over which a match is accepted
        -> DataSubjectInfo              // the data subject to fetch identities for
        -> DataSubjectInfo list         // the list of matching identities

(* ==================== IMPLEMENTATION ======================= *)

// Dummy implementation based on the rule presented in the documentation.
// The field name is lowercased first because the names obtained by reflection on
// DataSubjectInfo ("Email", "Firstname", ...) are capitalized and would otherwise all get 0.0.
let resolveFieldWeight: FieldWeightResolver =
    fun s ->
        match s.ToLowerInvariant() with
        | "email" -> 0.6
        | "firstname" -> 0.05
        | "lastname" -> 0.05
        | "phone" -> 0.3
        | _ -> 0.0

// Simple recursive implementation of the Levenshtein distance calculation.
// l1 and l2 are the lengths of the prefixes of s1 and s2 being compared.
let levenshteinDistance: DistanceCalculator =
    fun (s1,s2) ->
        let s1' = s1.ToCharArray()
        let s2' = s2.ToCharArray()
        let rec dist l1 l2 =
            match (l1,l2) with
            | (l1, 0) -> l1
            | (0, l2) -> l2
            | (l1, l2) ->
                if s1'.[l1-1] = s2'.[l2-1] then dist (l1-1) (l2-1)
                else
                    let d1 = dist (l1-1) l2
                    let d2 = dist l1 (l2-1)
                    let d3 = dist (l1-1) (l2-1)
                    1 + Math.Min(d1, Math.Min(d2,d3))
        dist s1.Length s2.Length
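
// Alternative sketch, not part of the original gist: the plain recursion above recomputes the
// same sub-problems and can take exponential time on dissimilar strings. The hypothetical
// `levenshteinDistanceDp` below computes the same distance with the classic two-row
// dynamic-programming table, in time proportional to s1.Length * s2.Length.
// Its signature has the same shape as DistanceCalculator, so it could be swapped in.
let levenshteinDistanceDp: (string * string) -> int =
    fun (s1, s2) ->
        // prev.[j] = distance between the first (i-1) chars of s1 and the first j chars of s2
        let mutable prev : int[] = Array.init (s2.Length + 1) (fun j -> j)
        for i in 1 .. s1.Length do
            let curr : int[] = Array.zeroCreate (s2.Length + 1)
            curr.[0] <- i
            for j in 1 .. s2.Length do
                let cost = if s1.[i - 1] = s2.[j - 1] then 0 else 1
                let deletion = prev.[j] + 1
                let insertion = curr.[j - 1] + 1
                let substitution = prev.[j - 1] + cost
                curr.[j] <- min (min deletion insertion) substitution
            prev <- curr
        prev.[s2.Length]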
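
// Naive implementation of a scorer: 1.0 for identical strings, 0.0 for completely different ones.
// The distance is divided by the longer length so that the rate stays within [0, 1].
let naiveDistanceScorer: DistanceScorer =
    fun distance (a,b) ->
        let maxLen = max a.Length b.Length
        // Two empty strings carry no evidence of a match; this also avoids dividing by zero.
        if maxLen = 0 then 0.0
        else 1.0 - (distance (a,b) |> double) / (maxLen |> double)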

// Resolves a property to a (name, value) pair
let resolve (i: Reflection.PropertyInfo) (ds: DataSubjectInfo): (string * string) =
    // `string` maps a null value to "" instead of throwing
    (i.Name, string (i.GetValue(ds)))

// Transforms a DataSubjectInfo into a list of properties (reflection)
let toPropertyList (ds: DataSubjectInfo) = (Array.toList(ds.GetType().GetProperties()))

// Transforms a DataSubjectInfo into a list of (name, value) pairs
let tuplize (ds: DataSubjectInfo) =
    let rec f (l: Reflection.PropertyInfo list) (acc: (string*string) list) =
        match l with
        | [] -> acc
        | head :: tail -> f tail ((resolve head ds) :: acc)
    f (ds |> toPropertyList) []

// Transforms the fields of a data subject into a Lucene query
let makeLuceneQuery (ds: DataSubjectInfo) : string =
    let rec f (l: Reflection.PropertyInfo list) (acc: string list) =
        match l with
        | [] -> acc
        // TODO: drop empty field here
        | head :: tail -> f tail ((resolve head ds |> fun (a,b) -> (sprintf "%s:%s" a b)) :: acc)
    // ["email:..."; "phone:..."; ...]
    let parts = f (ds |> toPropertyList) []
    // email:... OR phone:... [...]
    parts |> String.concat " OR "
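
// Sketch, not part of the original gist: makeLuceneQuery interpolates field values as-is, but
// Lucene's classic query syntax reserves characters such as '+', '-' and ':' (the phone value
// "+41795477978" starts with one). The hypothetical helper `escapeLuceneTerm` backslash-escapes
// those characters; it could be applied to the value before formatting each "name:value" part.
// It assumes the index is queried through the classic query parser syntax.
let escapeLuceneTerm (term: string) : string =
    // Characters treated as special by the classic query syntax
    let special = "\\+-!():^[]\"{}~*?|&/"
    term
    |> String.collect (fun c ->
        if special.IndexOf(c) >= 0 then "\\" + string c else string c)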

// Sums, over all fields, the match rate of the field multiplied by its weight
let dataSubjectIdentityMatcher: DataSubjectIdentityMatcher =
    fun score dist resolveWeight refDs ds ->
        let s = score dist
        let refDs' = refDs |> tuplize
        let ds' = ds |> tuplize
        let rec f i acc =
            match i with
            | 0 -> acc
            | _ ->
                let (k,v) = refDs'.[i-1]
                let (_,v') = ds'.[i-1]
                let w = resolveWeight k
                let sc = s (v,v')
                f (i-1) (acc + sc * w)
        f (refDs'.Length) 0.0

// Keeps the candidates whose matching rate against refDs is over the threshold t
let dataSubjectIdentityFilter: DataSubjectIdentityFilter =
    fun matches score dist resolveWeight t refDs ds ->
        let s = matches score dist resolveWeight
        ds |> List.filter (fun d -> (s refDs d) > t)
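
// Searches the index for candidates, then filters them by matching rate
let dataSubjectIdentifier: DataSubjectIdentifier =
    fun search filter matches score dist resolveWeight threshold ds ->
        // use the injected weight resolver rather than the module-level resolveFieldWeight
        let f = filter matches score dist resolveWeight threshold
        let l = ds |> makeLuceneQuery |> search
        f ds l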
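
// Usage sketch, not part of the original gist: it shows how the pieces above could be wired
// together. `sampleIndex`, `exampleSearch` and `exampleIdentify` are hypothetical names, the
// record values are made up, and the in-memory provider stands in for the real index
// (it ignores the query and returns every record it holds).
let sampleIndex : DataSubjectInfo list =
    [ { id = "42"; Email = "john.doe@example.org"; Firstname = "jon"; Lastname = "doe"; Phone = "+41795477978" } ]

let exampleSearch : DataSubjectSearchProvider =
    fun _query -> sampleIndex

// Returns the records of sampleIndex whose matching rate against `incoming` is over 0.8
let exampleIdentify (incoming: DataSubjectInfo) : DataSubjectInfo list =
    dataSubjectIdentifier
        exampleSearch
        dataSubjectIdentityFilter
        dataSubjectIdentityMatcher
        naiveDistanceScorer
        levenshteinDistance
        resolveFieldWeight
        0.8
        incoming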