
Machine Learning 101



Grishma Jena gives an overview of Machine Learning and delves deep into the pipeline used - right from fetching the data, the tools and frameworks used to creating models, gaining insights and telling a story.


Grishma Jena is a Data Scientist with the UX Research and Design team at IBM Data & AI in San Francisco. She works across portfolios along with user research and design teams and uses data to understand users' struggles and find opportunities to enhance their experience.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Jena: I'm just going to start off with a little bit about me. I work at IBM in San Francisco as a data scientist, and I work primarily with user research teams. That's been a very interesting journey. We look at all of the user data and try to understand how we can help make their journey seamless and just better and have them buy our products. I have a background in machine learning and natural language processing. Outside of this, outside of my day job, I love public speaking, which is why I am here. If anyone wants to talk to me about public speaking as well, feel free to go ahead. This is my 16th conference this year, so I'm going to end this year with a bang.

Before I start my talk, I have a few content trigger warnings to give out. Towards the end of the talk, we will briefly be touching on breast cancer and domestic abuse with the gaslighting. If anyone feels uncomfortable with this, please feel free to step out of the room or do whatever it is that you need to do. Self-care is number one. Just wanted to give that disclaimer.

Before we begin, I have two questions. How much data do you think is produced every year?

Participant 1: A lot.

Jena: Yes, definitely a lot, but let's try to quantify it.

Participant 2: Petabytes.

Jena: Petabytes. Ok.

Participant 3: One thousand petabytes.

Jena: One thousand petabytes. Ok. I love the numbers. Do you think it would be more than petabytes? Anyone?

Participant 4: Number of stars in the galaxy.

Jena: Number of stars in the galaxy. I like that comparison. The correct answer is 16.3 zettabytes. They actually had to come up with a new unit called zettabytes just to be able to measure the amount of data they have. One zettabyte is one trillion gigabytes. That's a lot of zeros. My next question is, how much data does the brain hold? I'm talking on a good day, not during a whiteboarding interview where it seems like it's zero bytes. We've all been there, done that. We'll probably do it again. Let's talk about a good-day scenario.

Participant 5: Zettabytes.

Jena: Ok. Anyone else who wants to enter a guess? Do you think it's going to be more than the amount of data we have in the universe? Ok, some yeses, some nos. The correct answer is 2.5 petabytes. One zettabyte was 1 trillion gigabytes, and 1 petabyte is 1 million gigabytes. It's smaller by orders of magnitude. This is just to give you an idea of the amount of data that's present in the world today. To give you a little more context, you can think of 2.5 petabytes as your favorite TV show playing for 3 million hours. I know you all watch "Friends" reruns. Just imagine that playing for 3 million hours.

We generate this huge amount of data, and we don't even know the amount of data we're generating. It's 2.5 exabytes per day, which is the amount of data that 5 million laptops can hold, or 90 years of HD-clarity video, which is a lot. To deal with all of this data, we definitely need some tools, some techniques, and machine learning is one of them. Estimates for 2020, which is just a few months away, say that the amount of data in the digital universe is going to be 44 zettabytes, with every human being producing 1.7 megabytes per second. Can you just imagine the amount of data we'll all be generating at that rate? If you think of an iPad Air, which is 0.29 inches thick and holds 128 GB of memory, and you try to condense all of the data in the digital universe into stacks of iPad Airs, the stack would reach the distance between the earth and the moon 6 times. You could just have a bridge of iPads between the earth and the moon.


Very important, let's try to take out some of the buzzwords, some of the technical jargon and clear out what exactly they mean. Very simply put, data is any piece of information that you can process, that you can store, that you can manipulate, that gives you some insight. Data science, particularly, refers to methods, steps, techniques that you use to get insights from the data that you have at hand.

There's no real consensus on how much data counts as big data, but one traditional definition says that if traditional database systems, like your RDBMS, cannot contain the amount of data you have, it's big data. An alternative definition is that if the amount of data you have cannot fit on your local machines, it's big data. Artificial intelligence refers to the study of developing intelligent agents, or agents which seem to have some sort of human cognition to them. Finally, machine learning, which is going to be our term for the day, is how you make computers or programs understand that there are patterns present within the data without explicitly telling them, "Look, here's a pattern" or, "There's a pattern." That's what we'll be delving into deeper today.

This is what our data pipeline looks like. You start off with the question and the data, and then you go on to the processes of wrangling, cleaning, exploring, creating models, validating, and telling your story. Hopefully, at the end of it, you have some insight, some action item that you can take away, which would hopefully help your stakeholders and your business.

I threw out a bunch of words there, but what exactly do all of them mean? I'm going to hopefully make it clearer for you in the next few slides. We start off with understanding what is the question that we need to answer. That question could be of different types, from very different fields. It could be something like, who are the next 1,000 customers your business is going to lose and why? It could be looking at credit card transactions and trying to predict if there is a fraudulent transaction. Or it could even be something in the social and community domain, like trying to predict housing prices, which I think this area needs because they're getting higher and higher by the day.

Data Everywhere

There's a lot of data and a million data sources available. Let's take a look at how we can categorize them. We can mainly put them into three buckets. The first one is structured, where you think of your row-and-column, standard table format. That would be your databases, or your Excel spreadsheets. It's called structured because you expect some structure or some definition to be present in that data. The total opposite of that would be unstructured, where there is no inherent structure present. Think of your songs, your audio, your videos, and even documents. There is a third category called semi-structured, which falls in between those two. You expect some amount of structure, but within that structure, you don't really know what to expect. The best example of this would be an email message. You know what the email headers would be: a subject line, receiver, sender, what time it was sent. But within the email itself, you don't really know what the content is. It could be text, images, audio, video, pretty much anything. That would be an example of semi-structured data.

Hopefully, you have a question, and then you have the data at hand. Sometimes you have the data at hand first and then ask, "What are the kinds of questions we can ask?" It's totally ok to do it either way. After you have those at hand, you begin with the process of cleaning. It sounds boring and mundane: "I don't want to go through the data and see how many people have put 200 as their age, or declined to give me their state so that I can't do any analysis on that." It sounds very boring, and yes, it is harder than it looks. You'd be surprised to know that there was a survey conducted of data scientists asking where they spend most of their time. It turns out that data scientists spend 80% of their time cleaning the data. That's a massive time sink, but unfortunately, it is a necessary evil. The reason is that if you don't do it right the first time around, you might go through the steps of creating your model, see that it doesn't perform well, and then have to go back to square one and start doing it all over again. It's a really crucial step that needs to be considered.

How exactly would you go about cleaning or wrangling the data? Let's say, for instance, you have a database where you're collecting survey responses from people, their demographic information. Let's say someone writes "CA" for California and someone writes "California." How does the computer understand that CA and California mean the same thing? It's going to be a little difficult. You can start off with standardizing the data, where you manually go and say, "If it's CA, it's California," maybe with some sort of mapping of state abbreviations to names.
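As a minimal sketch of that standardization step in pandas (the frame, column name, and mapping here are hypothetical, purely for illustration):

```python
import pandas as pd

# Hypothetical survey responses: the same state written several ways.
responses = pd.DataFrame({"state": ["CA", "California", "calif.", "NY", "New York"]})

# One canonical spelling per known variant.
state_map = {
    "ca": "California",
    "california": "California",
    "calif.": "California",
    "ny": "New York",
    "new york": "New York",
}

# Lowercase first so "CA" and "ca" hit the same key, then map.
responses["state"] = responses["state"].str.lower().map(state_map)
```

In practice you'd build the mapping from a full table of state abbreviations rather than typing it out by hand.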

Another thing could be, let's say you have a field called your zip code, which is a five-digit number, and then you have a field called age, which is maybe two to three digits. How do you tell the computer that, "Hey, just because age is smaller in magnitude than a zip code doesn't mean that the zip code should be given more importance?" You need some scaling to be done for your numerical values.
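One simple way to sketch that scaling, with pandas and made-up values. (Note that a zip code is really a categorical identifier; it's scaled here only to mirror the example above.)

```python
import pandas as pd

# Hypothetical records: zip codes are five digits, ages are two.
df = pd.DataFrame({"zip_code": [94105, 10001, 60601], "age": [25, 42, 67]})

# Min-max scale every column into [0, 1] so raw magnitude alone
# doesn't make zip_code look more important than age.
scaled = (df - df.min()) / (df.max() - df.min())
```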

The third thing to do is take care of missing values. Real-world data, unfortunately, is extremely messy. People are not going to answer things, people are going to write invalid values. You're just going to be wondering, "What the heck was this person thinking?" If you're lucky enough that you either have a huge amount of data at hand or a very small percentage of missing values, you can go ahead and discard them. Oftentimes, we are not as lucky. What you might do is try to replace those missing values using things like interpolation. If you have time series data and the value at X is missing, you could look at the value at X minus one and at X plus one, and then take the average of those. That's absolutely ok to do, considering the context of the variable.
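The interpolation idea can be sketched in pandas with a few hypothetical readings:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings with one missing value at position 2.
series = pd.Series([10.0, 12.0, np.nan, 16.0])

# Linear interpolation fills the gap from its neighbors:
# here (12 + 16) / 2 = 14.
filled = series.interpolate(method="linear")
```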

Sometimes you might even have categorical variables. Let's say, for instance, you have a database of emails and their subject lines. The categories could be spam and not spam. You have to convert those to numbers, maybe one if it's a spam email and zero if it's not. Again, there might be a lot of duplicates, so it makes sense to get rid of them so that the model doesn't wrongly conclude that just because something occurs multiple times, it needs to be weighted accordingly.
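A pandas sketch of both steps, on made-up emails:

```python
import pandas as pd

# Hypothetical email data with a text label and an exact duplicate row.
emails = pd.DataFrame({
    "subject": ["WIN A PRIZE!!!", "Team meeting at 3", "WIN A PRIZE!!!"],
    "label":   ["spam",           "not_spam",          "spam"],
})

# Drop exact duplicates so repeated rows don't get extra weight.
emails = emails.drop_duplicates()

# Turn the category into a number the model can use: spam -> 1, else 0.
emails["is_spam"] = (emails["label"] == "spam").astype(int)
```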

We went through the process of taking a question, getting some data, and then trying to clean the data. Hopefully, we now have a clean data set. Next, we go on to data exploration. Think of it as getting your feet wet with the data, where you're trying to make yourself familiar with the different values present for all of the features. What do the distributions look like? There's a lot of graphing and visualization involved in this step, because you're trying to understand what the data looks like normally. Are there any interesting points? Is there something surprising you found, where you go, "Wait, I didn't expect to see this"? Or something that got confirmed, where you go, "Yes, I did expect this. Let me try to form some sort of hypothesis"? Then at the end of this stage, you take those hypotheses you formed and try to prove or debunk them in the next few steps.
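Two pandas calls typically drive that first look at the data (the data here is hypothetical):

```python
import pandas as pd

# A tiny hypothetical dataset to get our feet wet with.
df = pd.DataFrame({
    "age":   [25, 42, 67, 31],
    "state": ["California", "New York", "California", "California"],
})

# Numeric columns: count, mean, quartiles, min, max in one call.
summary = df["age"].describe()

# Categorical columns: how often does each value occur?
counts = df["state"].value_counts()
```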

Model Building

We have the data, we have the question, we did the cleaning, we did the exploration, and now we're going into the model building, which is considered the meaty part, the proper machine learning part.

One of the first steps, before you go on to build a model, is feature engineering, where hopefully, at the end of your data exploration stage, you have understood what might be indicative or important features. Using those, you could construct slightly more meaningful features. That's where feature engineering comes in helpful. Domain knowledge is really important for this. Let's say you're dealing with medical data. If you're not a medical practitioner, it's going to be hard for you to understand what the different terms mean. That's where you call in the medical practitioners, who are the domain experts, and they help you understand what a feature means, or that two features mean the same thing, so maybe you can just drop one.

One good example of this would be, let's say you have a timestamp as one of the features. A timestamp on its own is not going to tell you much, but if you transform that timestamp to say what day of the week it was, what month of the year it was, what season it was, and see if you have any daily patterns, monthly patterns, seasonal patterns, that would be an important feature. After you've done feature engineering, you take all of the data you have and do some sort of split on it. You take the majority of the data and use it for training the model, and then the remaining portion you use for testing the model.
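A pandas sketch of both steps, with hypothetical timestamps and an 80/20 split:

```python
import pandas as pd

# Hypothetical events with only a raw timestamp attached.
df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2019-11-11 09:30", "2019-07-04 18:00", "2019-12-25 08:15",
])})

# A raw timestamp says little; derived calendar features say much more.
df["day_of_week"] = df["timestamp"].dt.day_name()
df["month"] = df["timestamp"].dt.month
df["hour"] = df["timestamp"].dt.hour

# Hold most of the data out for training, the rest for testing.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
```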

Just think of it as how we as humans learn. Let's say you're sitting in a classroom or taking a course. That is your training part, where you're trying to understand what the different concepts mean, what some keywords mean, and explaining to yourself, "Ok, this is what I've learned." The testing part is when you have to take a quiz or a test on it, and you see how well you have captured the knowledge you learned. That's training and testing. Machine learning models work the same way.

Then you go on to the process of creating the machine learning model. We'll go into the details of this in the next few slides. Models can primarily be of two types, supervised and unsupervised. I'll say what that means in the next slide. Then there are some parameters associated with the model: inherent ways of tweaking the algorithm so that it understands what decisions it should take. That is where the model parameters come in.

Then you train the model on the training data. It's very important to guard against the concept of overfitting. We'll go into this deeper in the next slide as well. Finally, you want to evaluate the model on unseen data, which is the test data that you had kept aside and not used to train the model. This is oftentimes an iterative process where you try different features, so you might not get the right answer the first time. Unfortunately, a lot of machine learning is like that, where it's a process of trial and error, trial and error. Has anyone heard of the quote that wrongly gets attributed to Albert Einstein? It says, "The definition of insanity is doing the same thing over and over again and expecting different results." I say that's actually machine learning. Of course, you can have multiple models chained together to develop a bigger, stronger model.

There's an interesting example. There were these scientists who wanted to create a machine learning model by taking pictures of huskies and werewolves and seeing if the model could differentiate between the two. It would take an image as the input, and as the output, it would hopefully say "husky" if it's a husky, and "werewolf" otherwise. These scientists spent a lot of time and a lot of resources, and they started getting around 90%, 95% accuracy, which was amazing. They were absolutely ecstatic: "We're going to go to conferences and present. We're going to be bosses in the machine learning field, everybody is going to look up to us, and we're going to be super famous, viral overnight."

Then, when they actually started evaluating it, they realized they hadn't built a machine to identify if it was a husky or a werewolf. What they had built instead was a snow detector. How did we go from huskies and werewolves to snow? That doesn't make sense, does it? Actually, it does. The reason was that all of the images they used of huskies to train the model had snow in the background. The model wrongly learned that if there's something white in the background, or if there's snow present in the image, it's going to be a husky. Unfortunately for them, the model was right most of the time, if not always. This was the model overfitting on the data it was given. The model becomes so specific to the training data that it fails to generalize, to be good on general input instead of that really specific thing. That's a warning: please be careful that you don't overfit the model to the data.
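The same failure mode can be reproduced in a few lines with scikit-learn (assuming it's available), using pure noise so there is nothing real to learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Features and labels that are pure noise: there is no real pattern.
X_train = rng.normal(size=(100, 5))
y_train = rng.integers(0, 2, size=100)
X_test = rng.normal(size=(100, 5))
y_test = rng.integers(0, 2, size=100)

# An unconstrained tree happily memorizes the training set...
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)

# ...but on unseen noise it can't beat chance. That gap is overfitting.
test_acc = tree.score(X_test, y_test)
```

The gap between the training score and the test score is exactly why you keep a test set aside.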

Machine Learning Approaches

We spoke about some approaches for machine learning, particularly supervised and unsupervised. Let's see what they mean. Supervised: just think of it as there being some sort of teacher or supervisor present that's going to tell something to the model, but in this case, the teacher is just labels present in the data. This would be where you have the emails database, and you're trying to see if an email is spam or not spam. The spam or not spam labels give supervision to the model. There's a specific answer or a specific output that you're expecting.

The opposite end of the spectrum is unsupervised learning, where there is no supervision, no labels or categories present with the data. You're ok with that, because you're not really interested in finding the labels and predicting; you just want to know the inherent structure of the data. Or can you maybe find some clusters or groups present in the data?

There is one approach that falls in between these two, called semi-supervised learning, which is a combination of both, exactly as it sounds. Let's say you have an image repository, and some of the images have captions with them. The captions, in this case, would serve as the labels, so that's the supervised part of it. Some of your images don't have any captions. What you could do is create a model and train it on the supervised part, the ones that have labels or captions. Then you could try to predict the labels for the ones that don't have any. It's in between the two, which is why it's called semi-supervised learning.

There is another category called reinforcement learning, which is really popular in robotics. Think of it as a robot or an agent which has some understanding of the environment it is in, and of the actions it could take to meet a certain objective. Depending on the actions it takes, or whether or not it meets the objective, you can either reward it or penalize it. It's like how you train your pets or your children: "If you do this well, I'm going to give you a chocolate, I'm going to give you a treat. If you do this badly, maybe I'm going to snatch away your iPad and ground you."

For all of this, we use a tool called a Jupyter Notebook. I primarily use Python for machine learning and data science, and Jupyter is really good for that. Towards the end, when I show the demo and you hopefully get to see some code in action, we'll take a look at the Jupyter Notebook, and I'll explain a little bit more about it. Basically, since machine learning is a little different from software development, you can't just run a script and cross your fingers. You need to evaluate at every step, at every stage, what output you're getting, and maybe do some visualizations, do some tweaking. That's why a Jupyter Notebook is really helpful: it's interactive, and you can have these different cells or pieces of code where you execute just a little bit and then see what it looks like.


Now we go on to the algorithms. The first one we're going to cover is classification, which is a type of supervised learning. Remember, supervised learning has labels or categories; classification does too. You're trying to understand, given a data point, is it category A or category B? For example, is it a dog or is it a cat? A slightly more complex example would be, given a patient's medical history, how likely is it that the patient is at risk of suffering a heart attack or cardiac arrest? The features you'd use for this would be the patient's medical notes, any symptoms being expressed, their lifestyle choices, their behavioral choices, and so on. Again, this is supervised because you have a label which says, "It's likely that this person is going to get a heart attack" or "No, this person is healthy and he or she is going to be fine."

The second type of algorithm is called regression, which is, again, a type of supervised learning, but it's a little different from classification in the sense that you're trying to predict a numerical value or a numerical output. Let's say, for example, given your company's performance, what is the expected revenue for the next quarter? That's going to be a numerical value, as opposed to some sort of a category. A slightly more complex example would be, given some features, how likely is it that there's going to be an outbreak of an epidemic around the world? I think this example is from a flu epidemic a few decades back. The features you would use are things like, has there been a lot of international travel recently? Because that's going to help the pathogens spread. Has there been a lot of contamination of land or water sources? Those features would help inform your model, and then hopefully you can say, if the outbreak started here, these are the communities that are at the highest risk of getting the disease.
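A minimal regression sketch with scikit-learn, using made-up revenue figures (the relationship here is deliberately a perfect straight line so the prediction is easy to check by eye):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: quarterly marketing spend ($M) vs. revenue ($M).
spend = np.array([[1.0], [2.0], [3.0], [4.0]])
revenue = np.array([5.0, 7.0, 9.0, 11.0])   # revenue = 2 * spend + 3

model = LinearRegression().fit(spend, revenue)

# Regression predicts a numerical value, not a category.
next_quarter = model.predict(np.array([[5.0]]))[0]   # 2 * 5 + 3 = 13
```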

The third type of algorithm is clustering, where, unlike supervised learning, there are no labels or categories present, and you just want to understand the inherent structure of the data. Can you form some groups or clusters with the data? A real-life example would be, given a huge database of user transactions, can you find some segments of customers? Maybe there are some customers who like to shop daily and probably spend $100 daily. I don't know who those shoppers are, but I'm guessing they exist. Or it could be a totally different type who logs in to Amazon or your shopping sites only in December and spends $1,000 on all the Christmas and holiday shopping. Those would be two examples of customer segments you might see.
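A sketch of exactly that segmentation with scikit-learn's KMeans, on made-up customer data. Note that no labels are given; the two groups are inferred from the data alone:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [purchases per month, average spend per purchase].
customers = np.array([
    [30.0, 100.0], [28.0, 110.0], [31.0, 95.0],   # frequent small shoppers
    [1.0, 900.0], [2.0, 1100.0], [1.0, 1000.0],   # once-a-year big spenders
])

# Ask for two clusters; the algorithm finds the structure on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
labels = km.labels_
```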

Another type of algorithm we use in machine learning is anomaly detection. Take credit card transactions, for example. How many of you have received some sort of alert or a call asking, "Was this actually you using your credit card?" It happens often. These machine learning systems have a baseline for what's considered normal. If the transaction is happening at a very odd time of day, or in a totally different country which you have never been to and don't live in, those are red flags that trigger the anomaly detection algorithm. That's when you receive an alert or an email from the bank saying, "I think something is up. This is suspicious activity."
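One common way to sketch this is with scikit-learn's IsolationForest, fitted on a baseline of normal behavior (the transactions below are hypothetical):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Baseline of normal transactions: [amount in dollars, hour of day].
normal = np.column_stack([
    rng.normal(50, 10, size=500),   # typical amounts around $50
    rng.normal(14, 2, size=500),    # mostly afternoon purchases
])

detector = IsolationForest(random_state=0).fit(normal)

# A $5,000 charge at 3 a.m. is far outside the learned baseline.
flag = detector.predict(np.array([[5000.0, 3.0]]))[0]   # 1 = normal, -1 = anomaly
```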

Reinforcement Learning

Remember we talked about a third type, reinforcement learning. This is an example of it. What they're trying to do is teach a robotic arm to flip pancakes. The scientist here is giving it some input, holding its hand and showing it how to flip pancakes, and the arm tries to learn from that. Now the robotic arm wants to try it on its own. At first it's really gentle and cautious. Then it's a little more daring, and that failed. That also failed. It's just failing and failing.

Then the scientists said, "Let's use motion capture and see if it improves." It actually started to do pretty well. Finally, after 50 attempts, it started to flip pancakes successfully. That's going to be the future of breakfast: robots flipping pancakes.

Model Validation

We'll quickly touch upon a few more topics, and then we'll go into the demos, where we'll try to spend a little bit more time seeing what all of these steps look like in code.

You went through the process of getting the data, having a question you want to answer, cleaning the data, exploring the data, building a model, and trying to protect it against overfitting, because you don't want a snow detector. All that's great, but how do you know if a model is good enough? How do you validate the performance of a model? We have some metrics that are used in the field. One of the most common is accuracy: how many predictions is it getting right? Then we have the notion of false positives and false negatives. A false positive is wrongly positive: it was actually negative, but the system or the machine labeled it as positive, so it's a false positive. It's the converse for a false negative.

Using false positives and false negatives, we calculate two other metrics known as precision and recall, which we'll be seeing in the demo. These metrics help you understand how well the model is doing, and even, where the model is not doing well, whether there are particular areas or particular types of input where it struggles. We also use something known as a confusion matrix, which shows you the false positives and false negatives so you can see how and where the model is faring well or badly. When we talk about false positives and false negatives, do you think they would have equal weightage?

Participant 6: Depends on the use case.

Jena: Yes, that is the answer to all of the questions in machine learning, by the way. It depends. Unfortunately, if you're expecting good answers from me in the Q&A, I'm just going to say, "It really depends on your use case."

We were talking about credit card fraud detection. A false positive would be the system wrongly saying that there was a fraudulent transaction: it was actually indeed you using the card, but the system said, "It was an anomaly. It was a fraudster." A false negative is where it was actually a fraudster using it, and the system says, "No, everything is fine. This person spends a lot anyway, so it's not a fraudster." Let's think about this.

What do you think is more important to catch, a false positive or a false negative? I'm hearing a mix of answers. Let's consider the worst-case scenario. Let's say a fraudster is indeed using your credit card and the system lets it go. It's identity theft: you maybe lose out on some money, and the person racks up a lot of debt. Let's look at the converse, where it was actually you using the credit card, but the system decides it was a fraudster. What happens? You get an alert, you get annoyed. You say, "It was me all along. I'm probably going to change my bank. You're good for nothing." You're going to be really annoyed for maybe a few hours, or at most a day. What's the weightage over here? What's worse? It might be a little simpler in this case, but consider when these machines are employed everywhere in the world for absolutely different uses.

For example, customs at the airport want to flag someone down for being suspicious and see if that person is a terrorist or not. In cases like these, it becomes a huge ethical challenge, and there needs to be a lot of talk about the ethical, social, and humanitarian considerations of what's worse. Is it catching someone who is not a terrorist and thinking they are one, or is it letting an actual terrorist go? It's not easy. Similarly, it translates into the medical domain. What's worse, telling someone they have cancer when they don't, or telling someone they are healthy when they do have cancer? That's something to think about.
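The trade-off being weighed here is exactly what precision and recall quantify. With hypothetical counts for a fraud detector, they fall straight out of the four confusion matrix cells:

```python
# Hypothetical counts for a fraud detector on 1,000 transactions.
tp = 80    # fraud correctly flagged (true positives)
fp = 10    # legitimate transactions wrongly flagged (false positives)
fn = 20    # fraud the system missed (false negatives)
tn = 890   # legitimate transactions correctly passed (true negatives)

# Precision: of everything we flagged, how much was really fraud?
precision = tp / (tp + fp)                   # 80 / 90

# Recall: of all the real fraud, how much did we catch?
recall = tp / (tp + fn)                      # 80 / 100

# Accuracy alone can mislead on imbalanced data like this.
accuracy = (tp + tn) / (tp + fp + fn + tn)   # 970 / 1000
```

Note that always predicting "not fraud" would still score 90% accuracy on these counts, which is why precision and recall matter.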

Data Visualization and Storytelling

Hopefully, you have created your models and you're happy with the validation and the performance. To tie it all together at the end, you need some storytelling and visualization tools, where you go back to your key stakeholders, go back to your business problem, and try to answer the original question you had, whether it was something business-related or community-related. It's really important for you to tell a story. There's a joke that goes: you spend about two years trying to get access to the data, because data access is difficult at times. You spend maybe another two years trying to understand what that data means and creating models out of it. You spend probably a year writing reports, and at the end of these five years (two plus two plus one), you have one slide to present, and everyone says, "That's what you did for the past five years? One slide? What do I pay you for?"

It needs to be really powerful. You need great communication skills to be able to say, "This is our machine learning model," and to explain the model and what you think is happening. That's where the whole topic of having interpretable models is really important. It's really necessary for you to go back and say, "This is the data we had, this is our business problem or our problem statement, and this is what the model gave as an output, so I think these would be the next steps for us to take."

Before we jump to that, let's take a quick peek at the code. Our first example is, given the characteristics of a breast mass, can we predict if it's malignant or benign? What type of learning is this? Supervised, because you have two categories, malignant or benign. This data set was originally a collection of images, but it was reduced to just the numerical properties derived from those images. What you have at the start is a collection of numbers for multiple features. They mainly describe the properties of the breast mass: things like area, concavity, perimeter, and so on.

You look at the different features present, and then you go through the process of cleaning and exploring the data. Let's look at the division of malignant and benign. About 60% is malignant and 30% or 40% is benign. That's ok. It could be worse in terms of the imbalance, but this is ok. The next step is to look and see if there are any null values present in the dataset. That's part of data cleaning, because those null values might interfere with our machine learning model creation. Similarly, we look at other rows and columns in the data.

This is where we do the process of data exploration. Like I mentioned, it uses a lot of visualizations and graphs. What we've done here is plot the two distributions, for malignant and benign masses, for each of the features. This will hopefully inform our decision about which features could be really good, which features don't make sense at all, or just help us understand what the model might output. The pink one is malignant, and the green is benign. Let's look at this one over here, worst concave points. That's the last one, the third one. You can see that there is a difference in the distributions; they seem to be almost linearly separable. If you draw a line in the middle, they're going to be separable. We can make the observation that worst concave points is probably going to be an important feature for our model.
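A quick numeric stand-in for those distribution plots: comparing the per-class mean of "worst concave points" already hints at how separable the two classes are. (Again an assumption that scikit-learn's breast cancer dataset is in play; the talk's plots are not reproduced here.)

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target  # 0 = malignant, 1 = benign

# Mean "worst concave points" per class: malignant masses score much
# higher, which is why the two distributions look almost linearly separable.
print(df.groupby("target")["worst concave points"].mean())
```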

Then we go through the process of taking all of the data and splitting it into training and testing, because we don't want "snow detectors" - models that latch onto spurious cues in data they have already seen. Then finally, we create a decision tree out of it, which is pretty much the way humans function, based on decisions: if you're hungry, hopefully go eat, and if you're sleepy, hopefully go sleep and don't spend two hours on Netflix. This is the decision tree that the model gave. If you look over here, at the second level of the tree, the feature is called worst concave points, the same one we expected. Remember the almost linearly separable distributions? The decision tree is telling us that worst concave points is probably an important, indicative feature, which falls in line with the hypothesis we had.
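The split-and-train step looks roughly like this. The hyperparameters here (test size, tree depth, random seed) are illustrative choices, not the speaker's exact settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the data so the model is evaluated on examples
# it has never seen during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A shallow tree keeps the model small enough to read and interpret
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))  # accuracy on the held-out test set
```

Keeping `max_depth` small is also what makes it possible to draw the tree and inspect which features it splits on, as the talk does with worst concave points.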

Then after you've created the model, you need to test it. In this case, precision would be: of all the patients we predicted have cancer, how many of them actually have cancer? Recall, while sounding similar, is slightly different: of all the patients that actually have cancer, how many did we correctly predict have cancer? We end up with this model having 90% accuracy, which is good for a first trial. This is what the confusion matrix looks like; you can see where exactly the model needs more work.
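Those definitions map directly onto scikit-learn's metrics. In this sketch (same illustrative setup as before), "malignant" - label 0 in this dataset - is treated as the positive class, matching the talk's framing of "predicted have cancer".

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# precision: of all predicted-cancer patients, how many truly have cancer?
print(precision_score(y_test, y_pred, pos_label=0))
# recall: of all patients with cancer, how many did we catch?
print(recall_score(y_test, y_pred, pos_label=0))
# confusion matrix: shows exactly where the model's mistakes are
print(confusion_matrix(y_test, y_pred))
```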

The next example is regression, again supervised learning, where you're trying to predict, say, housing prices. You go through the same steps of cleaning and exploring the data, but there's one particular step that's a little different. Over here, we're plotting the distribution to see what the data looks like. This is a correlation matrix, which basically maps all of the features against each other and shows whether there's any correlation between each pair of features. If you're trying to predict housing prices, what do you think are going to be important attributes? Location, maybe how affluent the neighborhood is considered. Number of rooms - studios and one-beds are supposed to be cheaper than a five-bedroom.
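Computing a correlation matrix is a one-liner in pandas. The Boston housing data this kind of demo traditionally used has been removed from recent scikit-learn releases, so this sketch builds a tiny synthetic housing-style table just to show the mechanics; the column names `lstat`, `rooms`, and `price` are stand-ins.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
lstat = rng.uniform(2, 35, n)   # % of lower-income residents (stand-in)
rooms = rng.uniform(4, 9, n)    # average rooms per dwelling (stand-in)
# Price rises with rooms, falls as lstat rises, plus some noise
price = 5 * rooms - 0.5 * lstat + rng.normal(0, 2, n)

df = pd.DataFrame({"lstat": lstat, "rooms": rooms, "price": price})
# Pairwise correlations between every feature; this is the table
# you would render as a heatmap
print(df.corr())
```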

This is what we are doing in the next step. This is the variable called LSTAT, which represents the proportion of the neighborhood considered lower-income and below. You can see it actually has a negative correlation with the housing price, because a more affluent neighborhood means a higher price, and this characteristic is the opposite of affluence, so it's a negative correlation. The second graph that you see is a positive correlation with the number of rooms: the more rooms, the higher the price. Similarly, you go ahead, split it into training and testing, and build a model out of it.
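The final split-and-fit step for regression follows the same pattern as classification. This sketch reuses the synthetic housing-style data from the correlation example; `LinearRegression` is an illustrative choice, since the talk doesn't pin down the exact estimator used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(2, 35, n),  # lstat stand-in
    rng.uniform(4, 9, n),   # rooms stand-in
])
y = -0.5 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 2, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# The learned coefficients recover the correlations seen earlier:
# negative for lstat, positive for rooms
print(model.coef_)
print(model.score(X_test, y_test))  # R^2 on the held-out test set
```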


Let me quickly jump to the ethics part of it. If you have pets and you have a Roomba, your pet poops and the Roomba spreads the poop everywhere - ouch. Who do you think is responsible for this? The pet was doing its job, the Roomba was doing its job. That's where the question of fairness, accountability, and transparency comes into play: who exactly do you blame when the algorithms do something wrong?

There have been cases of gaslighting using Amazon Alexa, at least until a few months ago. When Alexa received a call, it didn't explicitly ask for your permission to pick it up. It would just ring and then accept the call, and people could hear what you were talking about in the house. A lot of domestic abusers actually used that to gaslight their partners.

Google developed this image recognition algorithm a few years ago, which, unfortunately, went on to label people with darker skin complexions as gorillas. Google, one of the biggest companies out there with absolutely infinite resources. If they can do something like this, what hope do the rest of us have?

This is a video that went viral a few years ago, which basically shows that even your soap dispensers can be racist. The teams behind these technologies don't actually test them on a diverse range of people. That's why the soap dispenser works fine for those with fair skin, but if you have darker skin, it fails to detect that your hand is there and just doesn't dispense anything. Bias affects technology in every possible way.

This went absolutely viral just a few days back; it's about the Apple Card. I think this is the creator of Ruby on Rails: despite him and his wife having the same joint income and the same assets, the wife got a much, much lower credit limit on the Apple Card. When they investigated, it turned out the wife actually had the higher credit rating. That's because the machine learning algorithms they're using are biased.

Whenever you're working with machine learning or data, this is a great checklist to make sure the data you have is representative and fair - not just at the beginning, but continuing to stay that way throughout the lifetime of the model. Make sure to have a diverse team, not just in terms of demographics, but also in terms of opinions and backgrounds, and make a list of the different ways the tech could go wrong or be misused.


Finally, this is a recap. We took a look at what machine learning is and what the different steps are. Hopefully, you got a good idea from the examples of what it looks like in code. Lastly, the ethical considerations are very important. This is a list of resources if you want to get deeper into machine learning. My number one recommendation would be: go out there, pick up a dataset for something that you're passionate about - it could be sports, tech, cooking, anything - play around with that data and see if you can find something interesting and amazing in it. I assure you, you're going to enjoy building some algorithms with it.




Recorded at:

Jan 07, 2020