Borders and Biometrics: Boundaries of Computer Vision
10:55PM Aug 2, 2020
facial recognition systems
Hello everyone, and welcome back for the final talk.
Before we jump into the talk, we want to again thank all attendees, presenters, and volunteers; your commitment to HOPE and our community has made this event possible, even in these difficult and trying times. Next up is Charlie Meyers. Charlie has worked on autonomous vehicles and has begun a PhD program in Sweden that focuses on the intersection of computer security, embedded devices, and machine learning. His talk today is titled "Borders and Biometrics: Boundaries of Computer Vision," and it will discuss how machine learning, in particular facial recognition technology, has been rapidly adopted and deployed by law enforcement as a way to justify sentencing, policing, and border control. We'll have a Q&A session with Charlie after his talk, so be sure to submit your questions via the Matrix chat window. Now, over to you, Charlie. Take it away.
I hope you're enjoying HOPE 2020. I'm simplymathematics, my GitHub handle at least, and this talk is "From Borders to Biometrics: The Boundaries of Surveillance." A little bit about me: I have a Bachelor of Science in applied mathematics and a Master of Science in data science, both from CUNY. I founded a nonprofit in New York City that provides free Wi-Fi, and I built a supernode and wrote some firmware for it. Then I worked on self-driving vehicles at Volkswagen on the verification and validation team, and now I investigate adversarial attacks on machine learning as part of my PhD. My research interests include machine learning and machine learning security, especially in the context of data privacy and optimization. Machine learning is a huge field; I'm less interested in building new models and more interested in verifying the ones we're already applying and making sure that they can handle outliers, and that, for example, a self-driving car system trained in Germany will work in other places around the world, whether the difficulty is predicting different weather or pedestrian patterns, or just detecting people differently based on the color of their skin.
This hasn't been so much of a problem yet, but in the future,
adversarial attacks on machine learning systems will be cheap and widely available, and so we need to prove that our machine learning models are capable of ignoring them or defending against them. A lot of this comes down to distributed and parallel processing, because we have to do a lot of things very quickly in machine learning; that's the whole point, to deal with gigabytes or terabytes or petabytes of data. In particular for this field it's important because you end up doing a lot of subset analysis, or a lot of model optimization, in ways that can be totally parallelized. So you can either throw more cores at it or more time at it, and usually processors are cheaper. But there's an upper limit to the number of cores you can have, so for certain models and certain applications it has to be done in a distributed fashion. Collectively, this area is called trustworthy machine learning; it's a growing field and it's pretty interesting, so check it out. In this talk, you should expect a brief crash course on machine learning, a brief history of facial recognition systems, a survey of machine-learning-based surveillance systems, and finally a look at how these systems can be defeated. So what is AI? Well, it's kind of a hard question to answer. These terms don't mean a whole lot, and they encompass a lot of statistical and probabilistic ideas that wouldn't be totally foreign to anyone taking undergraduate math courses like linear algebra or calculus.
And they're deterministic, so they're really mathematical machines: you put some numbers in, you get a result, and it can be replicated over and over.
So it's inaccurate to say that we don't have any idea how machine learning, or particularly neural networks, work. It's just that the decision steps they take are not the decision steps we take, and we have to do extra work to make their decision steps interpretable. Really this is a weakness rather than a magic feature: we think of it as emergent intelligence but, as we'll see today, it's just as much emergent stupidity. Machine learning pipelines, as they're called, are made up of a bunch of interlocking, cooperative components, where you have some data that is fed into a model that's fit to it. There may or may not be new user input
coming in, and then there's some kind of output. Since we're looking at facial recognition systems, you'll see "person A" and "person B" as the choices on all of these diagrams. A little note here: you can only train your facial recognition system on an explicit set of people. If I train a model to tell apart the people in the Senate, it will not make any sense to apply that model to people in the House of Representatives. That's not necessarily obvious to a lot of people, but the boundaries of your model are defined explicitly by the boundaries of your dataset, and for these kinds of classification problems there's not really any work that's gone into exploring beyond explicit classification.
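That closed-set behavior can be seen in a toy sketch. Everything here is made up for illustration: the "faces" are random feature vectors, and the senator/representative labels are just stand-ins; this is not a real face pipeline.

```python
# Sketch of the closed-set problem described above: a classifier can only
# ever answer with labels it saw during training. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend each "face" is a 4-dimensional feature vector for two senators.
senator_a = rng.normal(loc=0.0, scale=0.5, size=(20, 4))
senator_b = rng.normal(loc=3.0, scale=0.5, size=(20, 4))
X = np.vstack([senator_a, senator_b])
y = ["Senator A"] * 20 + ["Senator B"] * 20

model = LogisticRegression().fit(X, y)

# A House member's "face" drawn from a totally different distribution:
house_member = rng.normal(loc=-5.0, scale=0.5, size=(1, 4))

# The model has no "unknown" option; it must pick one of its two classes.
print(model.classes_)               # only the people in the training set
print(model.predict(house_member))  # confidently answers with a senator anyway
```

The model's answer for the out-of-set person is guaranteed to be one of the two training classes, which is exactly the boundary-of-the-dataset problem.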
The problem is that there are a bunch of ways to attack these systems. You can poison them.
You can evade detection. You can attack the model itself and get access to its parameters, which is called a model extraction attack. Or you can look at the output from a bunch of model queries,
Maybe by using their paid service or
something similar, or by using a proxy model you build yourself, and you can reproduce the original data, which is terrifying in the context of facial recognition systems, because it means that a particularly motivated attacker could find all of the individuals that comprise the original dataset. So there are a couple of ways to divide up machine learning systems. First there's the regression-or-classification divide. Regression you may be familiar with from Excel: you have a bunch of data and you're trying to draw a line through it. In this case it might be housing prices, where you have a bunch of other variables that determine the price and you fit your line to them. Then there are classification problems, where rather than trying to find a function that approximates your data, you're trying to find a function that separates it; as you can see here, there are a bunch of different ways to do that. Additionally, the housing example is a supervised system, in that your housing prices are a known value, the y value, and your other variables are your x, so you can evaluate the error explicitly. These, by contrast, are unsupervised systems. The orange and blue colors are there for us to see how these algorithms clustered the data, but there is, so to speak, no correct answer here. Because this is unsupervised there are no ground-truth labels, and you have to come up with other techniques to build some kind of confidence in the results.
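The supervised/unsupervised split above can be sketched in a few lines. The housing numbers and cluster positions are invented for the example; the point is only that regression has a known y to score against, while clustering does not.

```python
# Minimal sketch of supervised regression vs. unsupervised clustering,
# on synthetic data, as described above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Supervised: housing price is a known y, so error can be measured directly.
sqft = rng.uniform(500, 3000, size=(100, 1))
price = 100 * sqft[:, 0] + rng.normal(0, 5000, size=100)  # true slope: 100
reg = LinearRegression().fit(sqft, price)
print("learned price per sqft:", reg.coef_[0])  # recovers roughly 100

# Unsupervised: two blobs of points, no labels, so no "correct answer".
blob_a = rng.normal(0, 1, size=(50, 2))
blob_b = rng.normal(5, 1, size=(50, 2))
points = np.vstack([blob_a, blob_b])
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(points)
print("cluster sizes:", np.bincount(clusters))
```

With the regression, we can check the learned slope against the known truth; with the clustering, all we can do is inspect the grouping, which is the speaker's point about needing other confidence-building techniques.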
So facial recognition systems have been around for decades, since the '60s. Some of the earliest systems on computers were built for facial recognition, precisely because of the military applications.
Because of the military applications, it was very important to get this done. It originated with laboriously measuring people's faces, like the width of the nose or the distance between the eyes, and creating a data sheet for each person, which would stand up pretty well to problems with scaling or rotation or aging. But it was very laborious and it took a long time to do these measurements, so you needed a kind of captive audience of volunteers to get the faces from. So Microsoft, IBM, and the US Army created internal databases of these measurements that were massive by the standards of the '90s, so that they could move beyond the small-scale tests popular within a small university computer science department. One of these became the FERET dataset, which has been widely cited and is still in use. Now, models rely on 3D face scans, images, or videos, and have upwards of 99.999% accuracy for some groups of people. However, there are problems with gender and racial disparity, which the MIT researcher Joy Buolamwini, and I hope I'm not butchering her name, has documented. The current George Floyd protests have brought this technology into the spotlight, and several vendors have withdrawn from the market, including Microsoft and IBM.
Historically, there have been different detection methods. There are matrix decomposition techniques that work well on other types of signal data; for example, one technique is very good for isolating a sound in a noisy room. But this does not work out very well for visual detection systems; it peaks around 80% accuracy. Landmark-based detection does better, because it's robust to scaling, aging, and facial feature changes. Then there are correlation approaches, which work on average faces: averaging each class so that each class has a kind of prototypical face, and comparing these to some input face. But these are also matrix decomposition algorithms, and they're beaten by glasses, facial hair, aging, head rotation, and even minor variations in the person's face. Neural networks overcome all of these and can figure out their own features, like the location of the nose, by themselves, but they do this at a cost to security.
So here's an example of the matrix decomposition approach. These are called eigenfaces, where essentially you have a collection of independent components. These are not individual faces, but they are kind of prototypical faces that can be used to distinguish individual faces, and the math of this ends up being like four or five lines of code, versus modern systems, where you have these massive neural networks with a bunch of convolutional layers that pass cascading filters across the image, with compression, and then you repeat this process several times. At the end you get an output of probabilities, you find the one that is highest, and you say that's the face you think it is. It's a very strange mathematical machine that can have many millions or billions or hundreds of billions of parameters. In between those decomposition systems and neural networks, there is the Viola-Jones algorithm. What it does is turn an image into a set of geometric features, and it quickly rejects images that don't have certain features. The way it does that is by looking at an image through these filters. It will take some raw image data in black and white and multiply it against another matrix, which in the case of A would be zeros on the left-hand side and 255 on the right-hand side for an 8-bit image. What that does is essentially look for a long vertical line. Then B will look for a short horizontal line, C will look for a shorter vertical line, and D will look for a diagonal. The algorithm picks these filters in such a way that it is designed to quickly reject things that are not faces, because it has to do this kind of filtering across many different orders of scale: this algorithm works on an image with a single face as well as on images of a crowd of people.
So the problem is that a lot of the research in the '90s was based on groups that do not reflect the human population; they were overwhelmingly white and male, as we can see from this chart. Here I'm comparing the more typical datasets: FERET, developed by the Army in the '90s, and the Labeled Faces in the Wild (LFW) dataset, which is a collection of famous celebrities, so overwhelmingly Hollywood actors,
disproportionately American and white.
And then there's FairFace, which is a dataset that was intentionally created to be as close as possible to an actual random sample of humans. So, as you can see, when we look at the luminance values of the people in these images, or at least the foregrounds,
We find that
you have almost a bimodal distribution for the FERET database, which is the smallest one down here. So it's not very good; we would expect this to be normally distributed.
Then for the LFW dataset, it is normally distributed, and it looks fine until we realize that both of these datasets are significantly lighter than what we would expect from a dataset built randomly from the people on the planet.
What this means is that these detection systems, which are used to find where faces are in an image, then get used as part of the data collection process, as part of the training process, as part of this cyclical spiral of things being ignored. And it's not all bad: as we can see, faces are generally detected. The FERET dataset performs the worst here because it includes profile pictures, people turned to the left or right at 45-degree angles; there are face detectors trained specifically on those angles, and this one was trained specifically for frontal views. So this verifies that it does not work very well on rotational changes. But what we see here is that the OpenFace detector performs much better on the celebrity dataset than it does on the realistic population dataset, and perhaps that's due to problems in the training stage. I don't really know, because the OpenFace system, or at least the data they used to train it, is proprietary or not released; I don't know if anyone actually has a business interest in it. And we can see that this carries over into the ratio of performance: when we look at different racial groups of people in FairFace, we find that the difference between the smallest and largest gap is around 1%, which is totally fine; that would be well within the range of statistical randomness. But that is not true for the 8.5% and 11% gaps you see in FERET and LFW. So we see that when we use these biased detectors to build facial recognition systems, or to build the facial recognition datasets that become those systems, we have this cascading problem, where the system is 99% accurate for white males and 85 to 89% accurate for minority groups. So I started looking into
why this would be happening. I looked into OpenFace and OpenCV, which are two big libraries for doing this kind of facial recognition, and deep down in their source code they're both pulling in dlib at runtime, a library that uses the Viola-Jones detection algorithm trained on an overwhelmingly white dataset. And what we see here is precisely that:
when we train on these overwhelmingly white datasets, our performance goes way, way down. This is the dlib detector versus the neural-network-based one in OpenFace, and in the best case our detection system is about as good as a coin flip. In the worst case, it only detects about 10 percent of the faces.
And, you know, that was really abstract and kind of hard to conceptualize. But then I found this very good report about an algorithm called PULSE, which de-pixelates images.
I looked into PULSE and found what it was using as the base detector for the generative adversarial network it uses to create the novel images, and it turns out it was using dlib. So dlib takes this pixelated picture of Obama on the left, and because of the way it has encoded whiteness and maleness, it produces the images on the right. There's a saying in machine learning that is always, always true: garbage in, garbage out. The data used to train these systems is very, very important. This chart is part of the FairFace study; they evaluated several different datasets for their racial composition, and here's what they found. So one of the ways that our models become inaccurate is through bias, bias that we didn't intend to include, and it's bias that comes from larger systems in society: from data collection methods, from what kinds of things get funding, and so on. And then there are intentional attacks, which we have discussed a bit; the famous example is that a few pieces of tape on a stop sign will make a vehicle think it's a speed limit sign, which could have disastrous consequences. As I said a little before, there are four different types of attacks. There's poisoning, where you inject data into the training process so that you can manipulate the model. There's evasion, where you're trying to get the model to misclassify some data; for example, you want it to misclassify your spam as ham, or misclassify your botnet as a legitimate user. Then there are extraction attacks, where you extract the model itself and create a copy of it by examining enough input-output pairs, and you can create a copy of the input data by doing the same thing. So of poisoning and evasion: poisoning attacks the model, and evasion attacks the user input.
They're essentially the same, mathematically or conceptually speaking; they just happen at different points in the process, before or after training. The general idea is that you have some kind of class boundary. With poisoning, you introduce a new data point that shifts this class boundary such that things are misclassified. With evasion, you craft your input in such a way that your malicious example falls on the other side of the line. So they're essentially the same, and they're rooted in this geometric interpretation of the data. Here we have two-dimensional datasets, but where this gets tricky is with very large data points, which might have tens or hundreds of dimensions; your average megapixel image has a million dimensions, because each pixel becomes its own independent vector component.
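That boundary-shifting idea can be made concrete with a toy model. This is not any specific attack from the literature: the "model" is a hand-rolled nearest-centroid classifier in one dimension, whose boundary is the midpoint of the two class means, so a few mislabeled points visibly move the boundary and flip a chosen input.

```python
# Toy numeric version of the boundary-shift picture: poisoning moves one
# class mean, which moves the decision boundary, which flips a target input.
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid "training": the model is just the per-class mean."""
    return {label: X[y == label].mean() for label in np.unique(y)}

def predict(centroids, x):
    """Classify x by the nearest class centroid."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

X = np.array([0.0, 0.1, -0.1, 2.0, 2.1, 1.9])
y = np.array([0, 0, 0, 1, 1, 1])
target = 1.2  # sits just on the class-1 side of the clean boundary (~1.0)

clean = fit_centroids(X, y)
print(predict(clean, target))     # classified as class 1

# Poison: inject a few points labeled 0 deep inside class 1's region.
X_poisoned = np.append(X, [2.5, 2.5, 2.5])
y_poisoned = np.append(y, [0, 0, 0])
poisoned = fit_centroids(X_poisoned, y_poisoned)
print(predict(poisoned, target))  # the same input now lands on the other side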
You know, it becomes impossible to visualize images in this way. What this means is that tiny, tiny numerical coincidences can sum up across these million dimensions in such a way that things are very confidently misclassified by detection systems. As you can see, you add an imperceptible amount of noise to this image of a panda and you can convince the computer that it is a gibbon; in fact, it is more confident that this adversarial gibbon exists than it was that the benign panda existed. And then there are inversion attacks, which look at the output from the model.
And so, model output these days comes from these neural networks, where you have these convolutional systems. There's a mathematical formula at the top, where you have a sum and two functions, f and g, where f is just sliding across the frame that is g. So if g is a 10-by-10 grid, then f slides across that 10-by-10 grid so that every block of cells is evaluated separately, and at different scales. That's what a convolution is, this fancy star symbol. And neural networks are just a bunch of convolutions with some weights: the convolutions happen at the nodes, the weights are the lines between the nodes that get added up, and then you have some input layer where you have your image data, then a bunch of convolutional layers, and then some kind of output layer that produces a probability you can interpret as a classification.
Now, I know that was a little hard to understand, but hopefully this image helps. What a convolution does is: you have some input image g and some function f, and you get some feature map, and then it keeps doing this kind of reduction and compression until you end up with just a probability at the end. But the main problem is that when we look at all of these systems, they're just kind of simple linear algebra steps: you have some additions, you've got some multiplications, some vector multiplications, but none of these functions is cryptographically safe; that is, they're invertible. So all you really need is a certain number of observations, for a certain number of parameters, to determine some function that inverts the model itself. What we find is that this is now just a linear problem with linear constraints, and that means that if we have 10 parameters in the model, we only need 10 observations. Certainly this slows us down a little, in the sense that neural networks can have hundreds of billions of parameters, but I can also train a proxy model that approximates your model fairly accurately with a much smaller number of parameters, so I may only need a few million, or tens of thousands, which in the age of automation and cloud computing is relatively trivial, especially since a lot of these services are set up as paid services, so they wouldn't necessarily see your exploratory attacks as malicious if you're paying for them.
Now, outside the context of facial recognition systems, these kinds of attacks have huge potential to disrupt financial markets, for example, because you have high-speed traders competing with each other to move very large chunks of the market very quickly. And if Goldman Sachs knows that Chase is going to move in a certain direction before Chase moves, they can hedge against that and manipulate the market.
At the same time, if a botnet or a hacker wanted to mount a DDoS attack, it would be easy to do with these techniques, because you could have a massive distributed system constantly probing, say, Cloudflare or Google to get through the DDoS protections and figure out which parameters do and don't work. This attack also allows for deepfakes, as we'll see.
So there is an inference attack, which instead of just exposing the model exposes the original data. It works under the same mathematical assumptions, but you need as many input-output pairs as the data has dimensions. Again, that's kind of an oversimplification: you can potentially do this with a small fraction of the number of dimensions. So if this is a one-megapixel image, that's a million dimensions, but you might be able to get away with only 10,000 input-output pairs, which sounds like a lot but isn't at the scale of cloud computing. So what does that mean? Well, there is widespread use. Facial recognition systems are used by Frontex to track immigrants across Europe, particularly because a lot of people are coming from war-torn places and don't necessarily have identity documents. They're already being used in airports in the US and the EU; in the EU, I ran into one last year that was being used to replace passports. In the wake of COVID... I guess that was this year. Wow, seems like a long time ago. It was being used to replace passports so that no one had to touch anything, and it was running Windows XP, which was a little terrifying. Even recently they've been used in upstate New York, for safety, to make sure that everyone on campus was supposed to be on campus, and that was very controversial. Additionally, there have been reports that the NYPD has been using Amazon's facial recognition system to find suspects using just composite sketches. As you can see, the system is doing a good job of matching the sketches, but humans are failing to accurately represent people from memory when the sketch is made. So these systems are, with very high confidence, matching people, but the people that witnesses imagine are not the guilty ones. This runs against the NYPD's internal data, which says there were only five misidentifications out of what appears to be slightly more than 10,000 uses of the technology. So there's some disagreement between the press and the NYPD about what's going on and how widespread and how useful these tools are.
So I mentioned Joy Buolamwini at the beginning of the talk, but I wanted to bring up her research. She did a great job of evaluating commercial systems: she used IBM's, Microsoft's, and other commercial facial analysis systems and evaluated them for disparities across gender and skin tone. As you can see, males outperform females and lighter people outperform darker people. This is from 2017, and we find that after her first work was published, things began to change. You can see that the audit reduced error across all areas within a year, particularly for darker females and darker males, which is good because those were the groups not doing well.
This follow-up research, as you can see, was done in 2018. Then, in 2020, the ACLU sued the Trump administration, calling these systems threats to civil liberties and privacy, because Americans cannot take precautions against the covert or massive capture of images. San Francisco and Boston banned the use of facial recognition by law enforcement. The EU considered a ban but stalled amid COVID and some difficulties in dealing with Frontex and Interpol. However, this technology has already been rolled out in airports in the US and Europe, and the EU ban wouldn't have covered that anyway, since it was only considered for public spaces. And these tools are already used for repression and surveillance in China's Uyghur internment camps. So this is actively a huge problem. But what's interesting is that these systems are still very fragile. We went over all the ways to attack them, and all you're really doing is shifting this boundary condition: no matter where you draw the boundary, I can always craft an example that crosses it.
Here's a T-shirt. As you can see, all the other people are detected, but this guy doesn't even show up as a person. This woman is confused for a bunch of license plates, because her dress is overloading the detector. This one breaks the Viola-Jones detection algorithm, because it undermines its central assumptions about eyebrows and the bridge of one's nose and how those appear geometrically to a camera. And the Department of Homeland Security is now worried about COVID masks and the role they play in the George Floyd protests; they're afraid that subversive groups will use the occasion of widespread public face masks to do vandalism and other riotous acts.
Which is funny, because they sell these anti-surveillance face masks on Amazon that combine the techniques seen in the T-shirt with just covering your face. This has even made it into the world of high fashion. This is called dazzle makeup, and this designer has come up with different ways to break facial recognition systems. As you can see, they rely on playing with the geometries and the assumptions about shadows on human faces.
Finally, we've got this other system, which projects constantly morphing faces onto your own, kind of a high-tech version of this. So, you know, these systems are trivial to break. They're very dangerous. They're very accurate, but only under the narrow circumstances intended by the original researchers. I hope you now know at least a little bit about how to deal with these systems, and I hope you enjoyed my talk.
So thanks, Charlie, that was a super enlightening talk. Could you tell us a little bit about how you got into the field of facial recognition? So, I wouldn't...
I wouldn't necessarily say that's my field; the mathematics is pretty generic and works for all kinds of classification problems. If you have a different dataset, the same tools can be used to differentiate a stop sign from a speed limit sign, for example. I chose the topic, though, because it's been in the news a lot recently, and I was particularly inspired by the MIT researcher Joy Buolamwini, and I'm sure I butchered her name, but she had investigated the commercial offerings, like Google Cloud and AWS and the Microsoft tools for facial recognition that we know to be used by law enforcement, and evaluated their efficacy at classifying people based on things like gender and skin tone and race. And since she had established that these algorithms were racist, I was interested in why. So I took this approach where I looked at a bunch of different datasets to show that it wasn't a function of the models but a function of the data, and there's a long history of how these things get built and how these systemic factors influence models decades later. So how
many parallels are there between the work you did on autonomous vehicles and what you're doing now?
So I didn't really do computer vision work there; I was on the verification and validation team. I would get vehicles more or less from the factory, plus our computer systems, and I would do provisioning and setup and that kind of stuff, and then I would verify the continual operation of the systems. But the idea for the actual paper came from a discussion over lunch there, where I was talking to one of the guys from the computer vision team. I asked him how they were dealing with this racism problem, and whether they had come up with some statistical technique, in particular because we were in Germany, which isn't, you know, known to be very diverse. How are these cars going to play out in Dubai, or India, or New York City, where people look very different? It became a big company-wide discussion, actually, around this topic of how we prevent this from happening to our systems. And, I mean, we really didn't have an answer at the time, so I wanted to take some of the first steps. Then I found this PhD program, which was expanding on this idea of how we trust a model beyond some accuracy number: what does it mean for a model to be 99% accurate on some test dataset, relative to the real world, and can we find these implicit biases before we try to deploy them?
So there's some back and forth in the chat, especially about the portion of your presentation talking about masks, the effectiveness of masks in trying to foil facial recognition technology. Could you elaborate a bit more on that, and particularly on the question following that, about whether wearing reflective sunglasses actually works, such as the mirrored kind? So,
um, so what I found is that, in general, for these complex neural network systems in particular, there will always be some adversarial attack you can do, and it might always be slightly different, but there will always be some boundary condition that you can reshape or play with, or you can sit right on the line between person and not-person. At one point in the talk I show a woman in a dress covered with license plates, and she still very much looks like a woman, but because there are so many license plates, it skews the math so that the confidence goes towards the license plates rather than the human. In the same way, masks may be an adversarial attack in the short term against facial recognition systems, but there's no shortage of public pictures of people in masks, and it wouldn't be hard to train a model on a larger set of data and then fine-tune it to work with masks, if you only have pictures of, say, celebrities in masks but want to see how this applies to regular people. So DHS is worried about masks being effective in stopping facial recognition, and as far as I'm aware, during the George Floyd protests the people caught by facial recognition were not wearing masks. But it is trivial, from a technological standpoint, to make a new system that doesn't care about masks.
So do you have any suggestions for oversight or auditing of these algorithms and their effects, any recommendations or ideas? So,
I would defer again to Joy's research on this. In the last couple of years she's developed an auditing platform available through the MIT Media Lab, where she works. I don't have the link handy, but that's Joy Buolamwini, for people listening out there. She's a Black woman who is far more equipped to talk about how these things affect her community; I was just trying to get at some of the root causes. And I don't know if there's a solution other than open-sourcing these models. In another case, there's the model that guides sentencing guidelines, COMPAS, which is trained on previous arrest data and previous recidivism data, which, knowing what we know about the US justice system, has its historical racism problems. So it's trained on historically racist data, and someone sued to release this data, and they lost; they were not allowed to see what was making the decisions in the model, because it was proprietary, even though it was being used by public courts. So I don't think courts are equipped to handle this, and I'm sure many of you will agree, if you think of some of the Facebook congressional hearing questions and how lost our legislative branch seems to be when it comes to the tech sector.
Indeed. Well, we're at time, but thank you so much for the talk, Charlie; it was really fascinating and enlightening for all of us, and very important in this particular time that we live in. So on behalf of HOPE 2020, all our attendees and presenters and volunteers, thanks for sharing this with us today. Thank you. All right, everyone, join us at the top of the hour for the closing ceremonies; we're going to wrap this all up in a big fiesta. So over to you, HOPE Ground Control.