On Photos' Face Recognition algorithm
I wrote an earlier post about face tagging in Photos and the challenge for users of working through their library and getting everything tagged right. I also promised then some thoughts on how the face recognition algorithms themselves could be improved. I continue to think that improving the user workflow around face tagging is more important than improving the algorithms, but there's no reason not to fight on both fronts.
Mainly, the trouble comes from the fact that my kids look alike. Photos can easily tell my own face from my wife's, but it can't tell my 2-year-old (let's call him S) from my 4-year-old (let's call him T). I've come across hundreds of photos where S was tagged as T, and have come to notice the things that help me as a human tell the faces apart where the Vision API falls short.
First off, the big one: what mostly confuses Photos is that in the pictures of T from when he was 2, he looks very close to what S looks like now at 2. It compares these pictures without accounting for the passing of time, and gets confused. You could say this is an edge case - but families taking pictures of their children, and siblings looking alike at young ages, aren't exactly rare. Maybe Photos could somehow take the passing of time into account - either internally split the Person into different age blocks as if they were different People altogether, or parametrize the DNN or SVM that they use? I have no idea what the right way forward is, but this is the #1 failure mode for us.
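To make the "age blocks" idea concrete, here is a minimal sketch in Python. Everything in it is my own assumption for illustration: I have no idea how Photos stores faces internally, the birth dates and two-number "embeddings" are made up, and `best_match` is a hypothetical helper, not a Photos or Vision call. The point is only that a face gets compared against the reference for the age each person had on the capture date.

```python
from datetime import date
import math

def age_in_years(birth, taken):
    """Whole years between a birth date and the photo's capture date."""
    return taken.year - birth.year - ((taken.month, taken.day) < (birth.month, birth.day))

def best_match(face, taken, people):
    """Return the closest person, comparing only within the matching age block.

    people maps a name to (birth date, {age in years: reference embedding}).
    All of this data is hypothetical.
    """
    best_name, best_dist = None, math.inf
    for name, (birth, refs_by_age) in people.items():
        ref = refs_by_age.get(age_in_years(birth, taken))
        if ref is None:
            continue  # no reference for this person at that age
        dist = math.dist(face, ref)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name

# Made-up data: T at age 2 looks almost exactly like S at age 2.
people = {
    "T": (date(2014, 3, 1), {2: (0.10, 0.10), 4: (0.80, 0.80)}),
    "S": (date(2016, 3, 1), {2: (0.12, 0.10)}),
}
match_2018 = best_match((0.10, 0.10), date(2018, 6, 1), people)
match_2016 = best_match((0.10, 0.10), date(2016, 6, 1), people)
```

With this made-up data, the same face comes out as S for a 2018 photo (T was 4 by then) and as T for a 2016 photo (S had no age-2 reference yet), which is the behavior I'd like to see.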
Second, I've had dozens of pictures where it suggests that S is T, even though T is standing right next to him and is already tagged in the same picture. If a Person is already tagged in a picture, the prior probability of a second face also being that Person could be reduced (though probably not to zero: mirrors complicate things, and whether or not you want faces in mirrors tagged is a matter of personal taste).
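As a rough sketch of what I mean, assuming the recognizer hands back a score per candidate Person (the scores, the `penalty` value and the `rescore` helper below are all made up by me, not anything Photos exposes):

```python
def rescore(scores, already_tagged, penalty=0.2):
    """Down-weight candidates already tagged in the same picture.

    scores maps a name to a hypothetical match score in [0, 1].
    The penalty is above zero, so a very strong match (say, a mirror)
    can still win.
    """
    adjusted = {
        name: score * (penalty if name in already_tagged else 1.0)
        for name, score in scores.items()
    }
    return max(adjusted, key=adjusted.get)

# T is already tagged next to this face, so a slight lead for T is not enough.
siblings = rescore({"T": 0.6, "S": 0.5}, already_tagged={"T"})
# A near-certain second T (a mirror, perhaps) still comes out as T.
mirror = rescore({"T": 0.9, "S": 0.1}, already_tagged={"T"})
```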
Third, it regularly suggests People who were nowhere near a certain event, even when most of the surrounding pictures were correctly tagged. Since it knows the exact time and location of these photos and has already grouped them into a Moment, maybe it could take that into account when picking the most likely Person out of the candidate matches? If a Person is already tagged in a Moment, the prior probability of that Person being in another picture from that Moment could be increased.
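Here is one way this could look, again only a sketch under my own assumptions: the per-face scores are invented, and `moment_prior` and `pick_person` are hypothetical helpers. The prior is just a smoothed count of how often each candidate is already tagged in the Moment, multiplied into the face score.

```python
from collections import Counter

def moment_prior(moment_tags, candidates, smoothing=1.0):
    """Smoothed share of existing tags in the Moment for each candidate.

    The smoothing keeps People who are not tagged yet at a non-zero prior.
    """
    counts = Counter(moment_tags)
    total = sum(counts[c] + smoothing for c in candidates)
    return {c: (counts[c] + smoothing) / total for c in candidates}

def pick_person(likelihoods, moment_tags):
    """Combine hypothetical face scores with the Moment prior; return the best name."""
    prior = moment_prior(moment_tags, list(likelihoods))
    posterior = {name: likelihoods[name] * prior[name] for name in likelihoods}
    return max(posterior, key=posterior.get)

# The face alone slightly favors Grandpa, who is tagged nowhere else in this Moment.
face_scores = {"Grandpa": 0.55, "S": 0.45}
with_context = pick_person(face_scores, moment_tags=["S", "S", "T", "S"])
without_context = pick_person(face_scores, moment_tags=[])
```

With the surrounding tags taken into account the pick flips to S; without any tags in the Moment it stays with the face score alone.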
Fourth and last, I have some Live Photos where the face is hard to recognize in the key frame, but easy to recognize in the frames before or after. I wonder if the face recognition algorithms could be fed all the frames of a Live Photo? It would drastically increase the number of data points for the algorithms to train on (though also the time the analysis would take to complete).
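A toy version of that, assuming each frame of a Live Photo can be scored separately (the per-frame scores and the `classify_frames` helper are hypothetical, and I'm using the extra frames at recognition time here, which is the simplest variant to sketch):

```python
def classify_frames(frame_scores):
    """Sum hypothetical per-frame match scores and return the best name.

    frame_scores is a list with one dict per frame, mapping name to score.
    """
    totals = {}
    for scores in frame_scores:
        for name, score in scores.items():
            totals[name] = totals.get(name, 0.0) + score
    return max(totals, key=totals.get)

frames = [
    {"S": 0.70, "T": 0.30},  # frame before the key frame
    {"S": 0.45, "T": 0.55},  # key frame: face partly turned away
    {"S": 0.80, "T": 0.20},  # frame after the key frame
]
key_frame_only = classify_frames([frames[1]])
all_frames = classify_frames(frames)
```

In this made-up example the key frame on its own says T, while the three frames together say S.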
I'm sure once the above is taken into account, there will be more things to work on - but for now, in my experience, getting these four fixed would already bring us from the 70% or so that it tends to score today up to 90%, which would be a welcome step forward for us!