
Katharine Jarmul and Ethical Machine Learning

Mar 16, 2019

Today on The InfoQ Podcast, Wes Reisz talks with Katharine Jarmul about privacy and fairness in machine learning algorithms. Jarmul discusses what’s meant by Ethical Machine Learning and some things to consider when working towards achieving fairness. Jarmul is the co-founder at KIProtect, a machine learning security and privacy firm based in Germany, and is one of the three keynotes at QCon.ai.

Key Takeaways

Ethical machine learning is about practices and strategies for creating more ethical machine learning models. There are many highly publicized/documented examples of machine learning gone awry that show the importance of the need to address ethical machine learning.
One of the first steps in preventing bias in machine learning is awareness. You should take time to identify your team goals and establish fairness criteria, which should be revisited over time. These criteria can then be used to set the minimum level of fairness allowed in production.
Laws like GDPR in the EU and HIPAA in the US provide privacy and security to users and have legal implications if not followed.
Adversarial examples (like the DolphinAttack, which used inaudible ultrasonic sounds to activate voice assistants) can be used to fool a machine learning model into hearing or seeing something that’s not there. Machine learning models are increasingly becoming an attack vector for bad actors.
Machine learning is always an iterative process.
Zero-Knowledge Computing (or Federated Learning) is an example of machine learning at the edge and is designed to respect the privacy of an individual’s information.

Show Notes

What is KIProtect?

01:15 We publicly announced it on the day GDPR came into effect.
01:25 We had been working on it for a couple of months before that, so it’s about a year old.
01:30 My co-founder, Andreas Dewes, and I are really passionate about making privacy and data security easy for machine learning scientists.
01:50 This was a passion of ours, and we knew it didn’t have to be as hard as it was, so we founded KIProtect (KI stands for Künstliche Intelligenz, the German for artificial intelligence).

What is ethical machine learning?

02:20 There has been an explosion of interest from the machine learning research community in creating ethical machine learning models.
02:25 I think a lot of this came from people who saw human biases reflected in machine learning models.
02:40 There are numerous examples: algorithms that downgraded resumes from women – which is really disturbing; it had learned it from real data.
02:55 There was an algorithm in Google Photos that classified black people as gorillas.
03:00 There was a Google Translate problem where if you translated “He’s a secretary; she’s a professor” it would translate it back into “She’s a secretary; he is a professor”.
03:15 These have been well-researched and documented from both consumer pressure and calling it out, as well as the research community: groups that are concerned about fairness and transparency in machine learning.
03:45 There was a recent paper which analysed predictive policing and showed that it could create a feedback loop, causing it to target certain races of people because of a positive-reinforcement problem.

How do you go about addressing biases?

04:10 It’s really problematic. Some people argue that we can’t ever tamper with the data, because it represents reality; it shouldn’t change reality, it should reflect reality.
04:30 It’s a big debate in the community; some of the first areas in which we’ll see change are the ones that are more regulated.
04:40 For example, Google did some research on credit scoring and how you could create a fair credit scoring model.
05:00 Banks are held to higher standards of anti-discrimination, so if you’re going to automate something that is covered by anti-discrimination laws, you want to build something that isn’t going to discriminate in an unfair way.
05:10 One way of describing what we do in machine learning is building discriminators - we are asking the machine to make a decision.
05:25 There is this tension between the goals that we have as machine learning engineers and the fact that we don’t want to increase unfairness or unequal treatment in our world.

What are some of the strategies to get started?

06:00 A really good first step is awareness that it’s a problem.
06:05 The next step is taking time to define what the team goals and criteria are and, if you use KPIs, what they are and how they are fair.
06:30 If it’s not built into the team discussion, it’s difficult to integrate at a later stage.
06:40 Once you define that, then it’s taking a look at the task and the data and defining some of the fairness criteria or measurements to determine if it’s behaving in a fair way or not.
07:10 Defining that - from an engineering perspective - means that you have the criteria and you can say that you have met the requirements.
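
As an illustration of turning a fairness criterion into something a team can check, here is a minimal sketch in Python. It assumes one particular criterion (the gap in positive-decision rates between groups, often called demographic parity) and a hypothetical threshold of 0.10; the podcast does not prescribe a specific metric or value, so both are placeholders to be replaced by whatever the team agrees on.

```python
from collections import defaultdict

# Hypothetical limit agreed by the team; not a value from the podcast.
MAX_ALLOWED_GAP = 0.10


def selection_rates(decisions, groups):
    """Return the share of positive decisions (1) for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for decision, group in zip(decisions, groups):
        totals[group] += 1
        positives[group] += decision
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(decisions, groups):
    """Largest difference in selection rate between any two groups."""
    rates = selection_rates(decisions, groups)
    return max(rates.values()) - min(rates.values())


def meets_fairness_criterion(decisions, groups, max_gap=MAX_ALLOWED_GAP):
    """True when the measured gap is within the agreed limit."""
    return demographic_parity_gap(decisions, groups) <= max_gap


# Toy model decisions (1 = approved) for two illustrative groups.
decisions = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_gap(decisions, groups)
print(f"gap={gap:.2f}, ok={meets_fairness_criterion(decisions, groups)}")
```

A check like this could run alongside the usual accuracy tests, so that a model which misses the agreed criterion is not promoted to production.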

There are a lot of regulations in both Europe and the US that need to be followed.

07:55 We have GDPR in Europe, which gives citizens extra rights, and it also puts an extra burden on software teams making sure that privacy by design is implemented.
08:15 It’s quite unfortunate to see a lot of misinformation about GDPR - the number of websites I couldn’t visit in the week after it came into effect was surprising.
08:40 It’s interesting to think about GDPR and what it means to handle data.
08:50 If you work in healthcare or finance in the US, you’ve already been working on this and are held to stronger regulations than in other industries.
09:00 For example, some of our customers in the US in those areas understood GDPR and it was only a few extra hoops for certification.
09:20 For healthcare, there’s the HIPAA regulation; for finance there are a lot of different regulations and case work.
09:25 We’re seeing a little bit of a societal backlash against bigger technical companies - we’re seeing that with Amazon’s headquarters being pushed out of New York.
09:45 We’re seeing a bit of negativity towards some of the technology.
09:50 One of the larger concerns is about privacy and data rights - I don’t want you to sell my data to anyone, or be part of a data leak, or have to change my password.
10:10 These are all data privacy and security questions that people are starting to ask - and this is prompting some regulators to propose new data privacy regulations.
10:25 There was one passed in California which talked about selling user data, that you needed to get their permission first.
10:40 There have been several other states that are proposing new privacy regulations since the beginning of 2019.
10:50 We can debate whether regulation is the best way to go, but they are reacting to public opinion.
11:10 This is essentially a political campaign that a lot of politicians aren’t necessarily running on.

What regulations are there in the US?

11:40 I think most of the US ones are about data markets and buying and selling customer data, as well as profiling of users.
11:50 For example, when using automated decisions when you apply for a bank account or when applying for a job, should an algorithm be able to decide whether you move forward in the process?
12:10 This is something that a lot of people want to help automate, so we have to find a balance between what’s comfortable for an average person.
12:20 I call it the “mom test” – if you build something, would you be OK if mom entered her data and filled it out?
12:30 That’s an easy test to apply.
12:40 So we need to automate things in a more transparent way - when you insert transparency then you’ll be able to make an informed decision about whether or not to use it.

What does an attack vector for a learning model look like?

13:20 People involved with machine learning or AI will probably have heard of ‘adversarial examples’.
13:30 These are a type of security attack where we can fool the machine learning model into seeing or hearing something that’s not there.
13:40 The DolphinAttack [https://arxiv.org/pdf/1708.09537.pdf] used inaudible ultrasonic sounds to activate voice assistants.
14:00 These types of attacks, which trick the machine learning model into doing or seeing something, are quite popular.
14:10 I gave a talk at CCC [https://media.ccc.de/v/34c3-8860-deep_learning_blindspots] on different ways of using adversarial learning.
14:20 There are also new attack vectors which aim to extract information from the model or steal the model itself.
13:40 There are numerous attacks which can extract personal or sensitive information from models, depending on the access or how the models were trained.

How do you execute an attack?

15:00 Usually the attack vector involves a series of queries to an API, so you think of it from a pentest perspective.
15:10 You query the API and you plan to attack it from a vulnerable region of the feature space.
15:15 You try to give it as little information as possible and see how much information it could give you.
15:20 Then you can utilise the information or the responses or probabilities that you get back, and use that to form an attack vector - which is trying to maximise or minimise the probability.
15:40 This can give you information about the multi-dimensional space of the decision algorithm or decision tree.
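
To make the query-based idea concrete, here is a small, self-contained sketch. Everything in it is an assumption for illustration: `toy_scoring_api` is a local stand-in for a model endpoint with a single input feature and a hidden threshold, and the "attack" is just a binary search on the returned probability. Real attacks work in many dimensions, but the principle of learning about the decision boundary from responses is the same.

```python
import math


def toy_scoring_api(income):
    """Stand-in for a remote model endpoint: returns P(approve) for one feature."""
    secret_threshold = 42_000.0  # hidden from the caller in a real deployment
    return 1.0 / (1.0 + math.exp(-(income - secret_threshold) / 1_000.0))


def find_decision_boundary(query, low, high, tolerance=1.0):
    """Binary-search the input value at which the probability crosses 0.5.

    Assumes query(low) < 0.5 <= query(high) and a monotonic response.
    Returns the estimated boundary and the number of queries used.
    """
    queries = 0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        queries += 1
        if query(mid) >= 0.5:
            high = mid
        else:
            low = mid
    return (low + high) / 2.0, queries


estimate, queries_used = find_decision_boundary(toy_scoring_api, 0.0, 100_000.0)
print(f"estimated boundary: {estimate:.0f} after {queries_used} queries")
```

A handful of queries is enough to locate the hidden threshold here, which is one reason returning raw probabilities from a public endpoint deserves some thought.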

What other things should people consider when deploying machine learning models?

16:05 We are asking how we can make machine learning more secure - for us, the important thing is to make sure no personal information is stored in the model.
16:25 From an engineering perspective, you want to make sure that it doesn’t memorise any personal data.
16:30 You also don’t want the machine learning model to be a huge security risk.
16:40 As it becomes increasingly easy to attack these models - the research is progressing in this area - you could be exposed to network scanners.
16:55 We need to think about machine learning models in the same way as other APIs that connect with the public internet.
17:05 If it’s going to touch the public internet then it should be just as secure as a database.
17:15 Think about API keys, request limits, and also think about removing or protecting personal data before you train the model.
17:30 We specialise in detection and protection of that private data via a cryptographic method that we use.
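
The API key and request limit advice above could look something like the following sketch. It is a generic sliding-window limiter written for illustration only; the class name, the key `"demo-key"`, and the limits are all hypothetical, and a production service would normally rely on its gateway or framework for this.

```python
import time


class RequestLimiter:
    """Allow at most `limit` requests per known API key within `window_seconds`."""

    def __init__(self, valid_keys, limit, window_seconds, clock=time.monotonic):
        self.valid_keys = set(valid_keys)
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.history = {}   # api_key -> timestamps of recent accepted requests

    def allow(self, api_key):
        """Return True if the request may proceed to the model."""
        if api_key not in self.valid_keys:
            return False  # unknown key: reject before touching the model
        now = self.clock()
        recent = [t for t in self.history.get(api_key, []) if now - t < self.window]
        allowed = len(recent) < self.limit
        if allowed:
            recent.append(now)
        self.history[api_key] = recent
        return allowed


# Demonstration with a fake clock so the example is deterministic.
fake_now = [0.0]
limiter = RequestLimiter({"demo-key"}, limit=3, window_seconds=60,
                         clock=lambda: fake_now[0])
burst = [limiter.allow("demo-key") for _ in range(4)]
fake_now[0] = 61.0  # move past the window
after_window = limiter.allow("demo-key")
unknown_key = limiter.allow("not-a-key")
print(burst, after_window, unknown_key)
```

Limiting the number of queries per key also raises the cost of the query-based attacks discussed earlier, since those depend on many probing requests.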

What other recommendations do you have regarding machine learning adoption?

17:45 One thing to think about with machine learning is that it’s always going to be an iterative process; if you think it’s going to be great and 97% accurate, you might notice an accuracy drop in production.
18:10 Deploying to production isn’t a one-click and go; it means maintaining your models in production.
18:20 How do you create a machine learning engineering pipeline, where you can iteratively improve your models and pull out models that aren’t working while looking forward to future evolutions?
18:40 Let’s say that it took two weeks or two months to train your first model - the sample of your training data is going to need to be continuously updated, for example, with population change.
19:10 You need to plan on releasing new models, scheduling when they are deployed and what happens if you need to roll back.
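
One way to picture the "maintain it in production and be ready to roll back" point is a small monitor that compares recent accuracy against the accuracy measured before release. This is a sketch under assumptions: the class, the baseline of 0.97, the allowed drop of 0.05, and the window size are illustrative, and it presumes that true labels eventually become available for production predictions.

```python
from collections import deque


class AccuracyMonitor:
    """Track rolling accuracy in production and flag when it drops too far."""

    def __init__(self, baseline_accuracy, max_drop, window_size):
        self.baseline = baseline_accuracy
        self.max_drop = max_drop
        self.outcomes = deque(maxlen=window_size)  # True where prediction was right

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_roll_back(self):
        """Only decide once the window is full, to avoid reacting to noise."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.baseline - self.rolling_accuracy() > self.max_drop


monitor = AccuracyMonitor(baseline_accuracy=0.97, max_drop=0.05, window_size=10)
predictions = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
actuals     = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]  # last two predictions are wrong
for predicted, actual in zip(predictions, actuals):
    monitor.record(predicted, actual)
print(monitor.rolling_accuracy(), monitor.should_roll_back())
```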
19:25 SageMaker from Amazon and others are pretty exciting.
19:30 If you’re already using CI you might think, "Yeah, I should automate this," but typically the data science team is in the corner with their Jupyter notebooks and no-one has tested it.
19:55 They should be tested from both a software engineering perspective and a security perspective.
20:10 One of the things we’re working on releasing is a robustness evaluation toolkit for machine learning models.
20:15 We’re calling it Algoneer - you can engineer your algorithms in a way to evaluate them for robustness and ethical treatment in a better way.
20:20 There are also a number of academic libraries that have been released around this - we’re going to try and pull in those that make sense.

What does robustness look like?

20:30 Robustness for us (and most of the research) means a few different things.
20:35 Firstly, security; when we think about adversarial examples (also known as ‘wild patterns’), models have to be robust against attacks that use them.
21:00 Secondly, it has to be robust against random input or random information - it can’t leak information or return nonsensical information.
21:15 I built a tool last year called Data Fuzz that helps do fuzzing on the data.
21:20 Thirdly, it must not leak personal or private information.
21:25 It must not be usable as a way to breach the data that was used to train it.
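
The second robustness point, random input, can be sketched as a simple fuzz test. This is a generic illustration and not the API of the Data Fuzz tool mentioned above: `toy_model` and `naive_model` are hypothetical stand-ins, and the only property checked is that the model always returns a finite probability between 0 and 1 without raising.

```python
import math
import random


def toy_model(features):
    """Stand-in model: logistic score over three features, with clamping."""
    weights = [0.8, -0.5, 0.3]
    z = sum(w * x for w, x in zip(weights, features))
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def naive_model(features):
    """Same model without clamping; extreme inputs make math.exp overflow."""
    weights = [0.8, -0.5, 0.3]
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def fuzz_model(predict, n_features, trials=1000, seed=0):
    """Feed random inputs of wildly varying magnitude; collect any failures."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        magnitude = 10.0 ** rng.randint(-3, 300)
        features = [rng.uniform(-magnitude, magnitude) for _ in range(n_features)]
        try:
            output = predict(features)
        except Exception as exc:  # a crash on odd input counts as a failure
            failures.append((features, repr(exc)))
            continue
        if not (isinstance(output, float) and math.isfinite(output)
                and 0.0 <= output <= 1.0):
            failures.append((features, output))
    return failures


print("clamped model failures:", len(fuzz_model(toy_model, 3)))
print("naive model failures:", len(fuzz_model(naive_model, 3)))
```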

What do you think about moving machine learning to mobile devices?

21:55 There are a few things that come to mind.
22:00 Firstly, there’s quite a lot of work going on in terms of zero-knowledge computing, or distributed computing within a privacy space.
22:05 Essentially, this is the same concept as blockchain but applied to machine learning - this is known as federated learning.
22:15 This is based on the idea that no one person should have complete knowledge of the data set or the training set, so that we can build a machine learning dataset together that respects the privacy of the individual information.
22:30 Note that the machine learning model does have the information in it, so you have to think about the security of the final model.
22:40 This is a goal for those who want to work with highly secure data sets, such as health data.
22:50 Secondly, when we think of the average machine learning at the edge, we think of a TensorFlow model on your phone.
23:00 The model might be trained to predict words in your text, for example.
23:15 What you need to address is that the models have been exclusively trained on private information - and if someone were able to get access to the model it would then be a personal database of contacts and other personal information.
23:30 When you back up that model, or transfer it to a new device, you need to treat the model like any other private information - if you move it to the cloud you need to secure it against attacks.
24:00 Thirdly, you may want to collect information from the edge - Apple is doing well on this.
24:10 Apple has released papers [https://machinelearning.apple.com] on differential privacy at scale [https://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html].
24:15 Differential privacy is one way we can measure the risk to privacy based on the information we have and how we can query that information.
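
The federated learning idea from the start of this section, where no single party sees the whole training set, can be sketched with a toy version of federated averaging. The setup is an assumption made for illustration: each client holds a few private `(x, y)` pairs, fits a one-parameter model `y = weight * x` locally, and sends back only its updated weight, which the server averages weighted by sample count.

```python
def local_update(weight, data, learning_rate=0.1, epochs=20):
    """Gradient descent for y ~ weight * x on one client's private data."""
    for _ in range(epochs):
        gradient = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * gradient
    return weight


def federated_round(global_weight, client_datasets):
    """Each client trains locally; the server only sees weights and counts."""
    updates = [(local_update(global_weight, data), len(data))
               for data in client_datasets]
    total_samples = sum(count for _, count in updates)
    return sum(weight * count for weight, count in updates) / total_samples


# Two clients whose raw data never leaves their own list (true relation: y = 2x).
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(0.5, 1.0), (1.5, 3.0), (1.0, 2.0)],
]
weight = 0.0
for _ in range(5):
    weight = federated_round(weight, clients)
print(f"learned weight: {weight:.4f}")
```

As noted above, the final model still contains information derived from the data, so the shared weights themselves need protecting.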

What do you mean by differential privacy?

24:40 The idea of differential privacy came about when it was proven that there is no way to do fully anonymised data release.
24:50 Cynthia Dwork wrote a paper on differential privacy [https://www.microsoft.com/en-us/research/publication/differential-privacy/] and was able to prove this mathematically.
25:00 There might be other information publicly available that the attacker can use to obtain information about that person.
25:05 For example, you could have “Katharine is 3 in shorter than the average US woman”, and then you find information about the height of the average US woman.
25:15 Even though this is anonymised, you have actually exposed something about Katharine.
25:30 This kind of private disclosure was proven mathematically.
25:40 So Cynthia proposed differential privacy as a way of more precisely measuring how much information an attacker could gain about any one person.
26:10 The idea is that you have a database and you have an attacker who knows the state of the database before, or has run multiple queries against it previously.
26:20 Then you add one person to the database, and the attacker is able to query the database again - you can calculate how much information they could gain by interpreting the responses.
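
The before-and-after scenario can be illustrated with the Laplace mechanism, a standard way to answer a counting query with differential privacy. This is a sketch under assumptions: the ages, the query, and epsilon = 0.5 are made up. Adding one person changes a count by at most 1, so noise with scale 1/epsilon is used, and the likelihood of any given answer then differs between the two databases by a factor of at most e^epsilon.

```python
import math
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise, sampled as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Noisy count; a count has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def laplace_density(x, mean, scale):
    """Probability density of a Laplace distribution at x."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)


def over_50(age):
    return age > 50


epsilon = 0.5
rng = random.Random(7)
ages_before = [34, 51, 29, 62, 45]   # true count over 50: 2
ages_after = ages_before + [58]      # one person added, true count: 3

noisy_before = private_count(ages_before, over_50, epsilon, rng)
noisy_after = private_count(ages_after, over_50, epsilon, rng)

# How much more likely is any answer under one database than the other?
worst_ratio = max(
    laplace_density(x / 10.0, 3, 1.0 / epsilon)
    / laplace_density(x / 10.0, 2, 1.0 / epsilon)
    for x in range(-100, 100)
)
print(f"worst-case ratio {worst_ratio:.3f} vs bound {math.exp(epsilon):.3f}")
```

The bound on that ratio is the quantity differential privacy measures: a smaller epsilon means the attacker learns less from the new answer, at the cost of noisier results.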
26:35 When we think about differential privacy from a distributed computing or an edge problem - it’s probably the safest way to push things to the edge so that there’s no central storage.
27:00 iOS devices are collecting statistics about your usage, and one of the examples from the paper is about emoji use.
27:15 The keyboard designers wanted to know what emojis should they recommend, or should they sort by common use - so they wanted to collect statistics on their use.
27:30 They collected the information in a differentially private way - the statistics were differentially privatised and then sent to the server and aggregated in an anonymised way.
28:10 The idea that we can make informed data analytics choices even when we have anonymised collection protects the user while trying to derive business insights.
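
To show how statistics can be privatised on the device before being sent, here is a sketch using randomized response, a simple local differential privacy technique. It is a stand-in chosen for brevity, not the mechanism from Apple’s paper, and the 30% usage rate, epsilon = 1.0, and population size are invented. Each device reports a possibly flipped bit, and the server corrects for the known flip probability when aggregating.

```python
import math
import random


def truth_probability(epsilon):
    """Probability of reporting the true bit under randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomize_bit(true_bit, epsilon, rng):
    """Run on the device: report the true bit, or its opposite, at random."""
    return true_bit if rng.random() < truth_probability(epsilon) else 1 - true_bit


def estimate_rate(reports, epsilon):
    """Run on the server: undo the expected effect of the random flips."""
    p = truth_probability(epsilon)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


epsilon = 1.0
rng = random.Random(42)
# 100,000 simulated users; 30% of them actually use a given emoji.
true_bits = [1] * 30_000 + [0] * 70_000
reports = [randomize_bit(bit, epsilon, rng) for bit in true_bits]
estimated = estimate_rate(reports, epsilon)
print(f"estimated usage rate: {estimated:.3f}")
```

No individual report can be trusted on its own, which protects the user, yet the aggregate estimate lands close to the true rate.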
28:30 At KIProtect we’re releasing our first anonymised analytics platform which allows you to do just this.
28:40 That will be our first SaaS offering, which allows you to sign up and collect fully anonymised analytics with differential privacy guarantees.
28:50 We hope this is the future of how we collect data at the edge: in a way that respects privacy and is easy.

Tell us about your QCon AI keynote.

29:15 I’m diving more into Cynthia Dwork’s work: she’s been working on mathematical definitions of privacy for some time.
29:25 I’m also going to be diving into some of her work, and work that builds upon it, on connecting privacy to fairness.
29:35 When we think about fairness, we can come up with a lot of definitions: but some of those definitions can seem unfair.
29:45 Fairness is that I’m treated equally to people that are different from me in race, background, gender, religion and so forth.
30:00 This doesn’t necessarily mean that I automatically get a bonus because I’m a woman; this means I need to be fairly evaluated in some way.
30:10 Cynthia Dwork’s idea on this, which I’ll present, is: what if the algorithm or the model could never learn my gender?
30:25 If it cannot learn my gender then it cannot discriminate against me - and then it’s more likely that I’ll be compared to other people like me.

If you don’t have gender, how can you audit it?

31:00 This is taken into account when we train for privacy and fairness at the same time.
31:15 Firstly, we need to know the private information somewhere, so that we can determine whether that was learned or not.
31:20 Then what we use in the evaluation process is that we ask it whether it can guess this based on what it knows.
31:30 A lot of times, it will reveal other information that is linked in some way, but not directly to the personal information.
31:40 For example, a model may not be able to learn your gender, but it may know that you went to an all-women’s college - so you can guess.
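
The evaluation step described here, asking whether the protected attribute can be guessed from what the model knows, can be sketched as a simple proxy check. This is an illustrative approach and not the method from Dwork’s work: it takes one feature (a hypothetical `college` column), guesses the most common gender for each feature value, and compares that accuracy with always guessing the majority. A real audit would measure this on held-out data and across combinations of features.

```python
from collections import Counter, defaultdict


def proxy_leakage(feature_values, protected_values):
    """Return (baseline_accuracy, proxy_accuracy) for guessing the protected
    attribute, first blindly and then from a single feature."""
    n = len(protected_values)
    baseline = Counter(protected_values).most_common(1)[0][1] / n

    by_value = defaultdict(Counter)
    for feature, protected in zip(feature_values, protected_values):
        by_value[feature][protected] += 1
    # For each feature value, guess the most common protected value.
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return baseline, correct / n


# Hypothetical audit data: gender is held separately, for evaluation only.
college = ["womens", "womens", "mixed", "mixed", "mixed", "mixed"]
gender = ["f", "f", "f", "m", "m", "m"]
baseline_acc, proxy_acc = proxy_leakage(college, gender)
print(f"baseline {baseline_acc:.2f}, with college feature {proxy_acc:.2f}")
```

A large gap between the two numbers suggests the feature acts as a proxy for the protected attribute, even though the attribute itself was never given to the model.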

Katharine Jarmul will be at QCon.AI in April.

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.