00:22 Wes Reisz: Worldwide, there have been 96 million cases of the coronavirus, with over 2 million deaths attributed to the disease. In particular, places like the US, India, and Brazil have been some of the hardest-hit areas. In the US alone, 400,000 deaths have been attributed to this disease, roughly the same as the number of American soldiers who died in World War II. Today, I thought we’d talk about how tech is combating major diseases such as the coronavirus. While the coronavirus certainly has our attention, it won’t be the sole focus of what we talk about today. We’ll talk about things like cancer and heart disease. We’ll also talk about some of the challenges when working with private health care data and some of the techniques and things that still need to be solved when dealing with this type of data, things like safety and ethics. We’ll be talking about ways of using this data in a responsible and effective way.
01:08 Wes Reisz: Hello and welcome to the InfoQ Podcast. I’m Wes Reisz, one of the hosts for the podcast. Today’s guest is Carin Meier. Carin is a data engineer at Reify Health. Reify Health develops software that accelerates the development of new and lifesaving therapies. Carin is an avid functional developer. She’s a committer and PPMC member for Apache MXNet, and you’ve seen her keynote at places like OSCON, Strange Loop, and most recently at QCon Plus (held towards the end of last year).
01:34 Wes Reisz: The next QCon Plus, which is an online version of the QCon you know and love, will be taking place over two weeks between May 17th and 28th of 2021. QCon Plus focuses on emerging software trends and practices from the world’s most innovative software shops. All 16 tracks are curated by domain experts to help you focus on the topics that matter the most in software today. Tracks include leading full-cycle engineering teams, modern data pipelines, and continuous delivery: workflows and platforms. You’ll learn new ideas. You’ll learn new insights from over 80 software practitioners and innovator/early adopter companies, all across software. Spaced over two weeks, just a few hours a day, these expert-level technical talks provide real-time interactive sessions, regular sessions, async learning, and additional workshops to help you validate your software roadmap. If you’re a senior software engineer, architect or team lead and want to take your technical learning and personal development to a whole new level this year, join us at QCon Plus this May 17th to 28th. You can visit qcon.plus for more info.
02:34 Wes Reisz: With that, let’s jump in. Carin, thank you for joining us on the podcast.
02:37 Carin Meier: Thank you for having me. I’m excited to be here.
02:40 What is Apache MXNet?
02:40 Wes Reisz: Yeah. I’m excited to work this out. I thought we’d jump right in and start with Apache MXNet. It seems like a good way to bridge right into this topic. By way of introduction, you’re a committer and PPMC member on the project. What is it? What is MXNet?
02:53 Carin Meier: Yeah, so Apache MXNet is a machine learning library, and we’re all very familiar with that. The thing that I really enjoy about it is that it follows the Apache model. I was able to come there as just an interested party wanting to use this and realizing there was a gap: there were no Clojure bindings for the library. I was able to get involved and contribute that binding so I could bring the Clojure community to it, and then also help cross-pollinate ideas between the functional communities and the regular Python developers. More generally, I think that the Apache model is a great one for openness across not only different programming languages but across different cultures and different nations in the world. I think it’s a great place.
03:47 Wes Reisz: There’s a bunch of deep learning libraries, for example, out there. What is Apache MXNet’s focus?
03:52 Carin Meier: It’s incubating, so it’s not fully graduated. I’ll put that in there. That’s something that you’ve always got to say until you graduate that you’re incubating Apache. The focus is … It’s a full-fledged machine learning library. You can do deep neural networks with it, but it really focuses on being efficient and fast, as opposed to some of the other ones.
04:13 How are you seeing AI/ML being used to combat the coronavirus?
04:13 Wes Reisz: On this podcast, I wanted to see how tech, in particular ML and AI, is affecting and being involved with the fight against the coronavirus. That was the original premise. What are you seeing in the machine learning space as ways that the disease is being combated?
04:30 Carin Meier: Yeah, everybody knows where we are in the pandemic. It’s like a big spotlight has been shone on this area. I think it’s to great effect: we’ve seen great strides with Google’s AlphaFold and with many people using machine learning to help generate the possible solutions that we’ve come up with for our vaccines, which is fantastic. Also, in all the little supporting ways: machine learning has been applied to just about every way that you can accelerate the work, such as looking at the results of the trials, or sifting through papers to find possible correlations between symptoms and bring things to the forefront that we couldn’t discover in any sort of timely fashion. Then of course, you think about machine learning in every other supporting way. Amazon could still ship my things to me, even though the whole supply chain had been disrupted. We can definitely point to the fact that we have a vaccine now, but there is also everything that’s supporting that and accelerating us: how Zoom was able to scale, how schools were able to move to online learning.
05:47 Carin Meier: All of that has been accelerated in ways that we just can’t count by these techniques and our technology today.
05:55 Wes Reisz: Talking about Zoom, I was at a friend’s, and they have a son. I think he’s six years old, and (I was there for just a few minutes, keeping my social distancing, of course) off at the table, I could see him with eight other six-year-olds on the Zoom meeting. It was just the craziest thing to watch a group of six-year-olds doing a reading exercise or a writing exercise on Zoom. It’s just amazing how the whole human experience had to change with this pandemic.
06:21 Carin Meier: Yeah. We’re in it. Machine learning is around us so much now that it’s like water and air.
06:27 Wes Reisz: Yeah, totally. Are there any specific cases? I know you can’t mention very specifics, but are there any specific cases that you can point to, that we can talk about?
06:36 Carin Meier: There is the CORD-19 Open Research Dataset that the Semantic Scholar team at the Allen Institute for AI developed, in partnership with the global research community, to accelerate finding insights in all the related published papers. That was an interesting one. The one that I’m interested in right now is… We can talk about it later, but it was from Google AI. It’s a paper that came out where they’re talking about bringing concept-based model explanations to Electronic Health Records. Actually, there’s been all sorts of … We’ll get into this later about how to make machine learning more trustworthy and reliable, but there’s been exciting breakthroughs in that area as well.
07:21 Wes Reisz: When we were collaborating on some notes, this was one of the ones that was down there, but what is a concept-based model explanation? I went through and looked at it a little bit, but I guess I didn’t quite follow what exactly is meant by concept-based model explanations.
07:36 Carin Meier: I guess they always have abbreviations for this. That’s TCAV, and this is out from Google, of course, doing a lot of great research in this area. It’s bridging the gap in interpretability. In your traditional models, say you’d have this person who has high blood pressure. We could point to all the little factors and then follow them through in a big decision tree: if/then, and here you get the answer at the end. You could really point the way and follow it like a ball through a maze to the end.
08:11 Carin Meier: In these deep learning models, of course, you’ve just got this huge black box full of billions or trillions of connections. When you get the model’s output at the end, you ask, how could it possibly get to this answer? This approach, as I understand it, takes concepts like high blood pressure and represents each one as an additional concept vector that’s added to the input, which then makes it easier to interpret and to follow through those decisions. It’s an approach to interpretability that vectorizes the concept and blends it in, almost like a symbolic blend, but people would probably argue with that.
08:55 Wes Reisz: It’s using the domain to actually explain the model itself right?
08:59 Carin Meier: Right.
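As a rough illustration of the "concept vector" idea discussed above, here is a minimal sketch in plain Python. It is not the method from the Google paper: the activations are made-up two-dimensional vectors, the model head is a toy function, and the concept direction is a simple difference of means (the published TCAV approach trains a linear classifier instead). All names and numbers are hypothetical.

```python
# Simplified, TCAV-style sketch. Assumptions: hand-made activations, a toy
# model head, and a difference-of-means concept direction standing in for
# the linear classifier used in the published approach.

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def concept_direction(concept_acts, random_acts):
    """Unit vector pointing from 'random' activations toward 'concept' ones."""
    c = mean_vector(concept_acts)
    r = mean_vector(random_acts)
    diff = [ci - ri for ci, ri in zip(c, r)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def toy_logit_gradient(activation, weights):
    """Toy head: logit = sum(w_i * a_i^2), so d(logit)/d(a_i) = 2 * w_i * a_i."""
    return [2.0 * w * a for w, a in zip(weights, activation)]

def tcav_style_score(activations, weights, direction):
    """Fraction of examples whose logit increases along the concept direction."""
    positive = 0
    for act in activations:
        grad = toy_logit_gradient(act, weights)
        sensitivity = sum(g * d for g, d in zip(grad, direction))
        if sensitivity > 0:
            positive += 1
    return positive / len(activations)

# Hypothetical activations for records with and without the concept
# (for example, "high blood pressure").
concept_acts = [[2.0, 0.1], [1.8, 0.0], [2.2, 0.2]]
random_acts = [[0.1, 0.1], [0.0, 0.2], [0.2, 0.0]]
direction = concept_direction(concept_acts, random_acts)

weights = [1.0, -0.5]
patient_acts = [[1.0, 0.5], [0.5, 1.0], [-0.2, 0.3], [1.5, 0.1]]
score = tcav_style_score(patient_acts, weights, direction)
print(f"share of examples sensitive to the concept: {score:.2f}")
```

A score near 1.0 would suggest the prediction tends to move with the concept, which is the kind of domain-level statement a clinician can reason about.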
09:01 What type of work does Reify Health do?
09:01 Wes Reisz: I mentioned in the intro that you work at Reify Health. What are some of the things that you all are doing there?
09:05 Carin Meier: Yeah. Reify Health, we focus on a particular bottleneck to the clinical trial industry. We’re all very interested in how fast things get through clinical trials and not only the COVID vaccines but lifesaving cancer therapies for breast cancer, all sorts of horrible diseases. There’s potential lifesaving treatments out there. The faster we can get it through clinical trials and understand if they’re going to work or not, the better for everybody. Our company works on the particular bottleneck of enrollment. Before you get the trial and try the drugs on the patients, this is actually getting enough people enrolled in the trial.
09:52 Carin Meier: There’s a lot of opportunity in speeding up that whole process and making it more effective so you can get the trial actually going. That’s where we put all our resources. Right now, our team is building out a data pipeline, which is interesting in itself and the healthcare domain, because you have a lot of data privacy and sensitive information. Then you have of course, different countries involved that have different rules about things. Being able to use that data and route it and protect it and being able to leverage it in an analytical fashion with machine learning … There’s a lot of interesting technical challenges. That’s where we are. We’re working with accelerating enrollment in this area.
10:39 Why has explainability been a challenge in machine learning?
10:39 Wes Reisz: As you were talking a bit about the concept-based model explanations and some of the challenges like regions with data, particularly in cases like GDPR, there’s a lot of challenges with using data in machine learning: accuracy, safety, ethics, all these kinds of things. I thought we’d shift a bit and talk about some of the challenges that exist in working with this data. Let’s start off. You mentioned already explainability. That is the ability to explain how a decision is made, not as weighted numbers bouncing through thousands of pegs on this Plinko board that is a machine learning model, but in simple English, or maybe not simple English but the domain of the business. Why is that a problem? Why has that traditionally been a problem with deep learning and machine learning?
11:26 Carin Meier: I think it’s just the scale. We’ve got random forests too, that might have this problem as you get to scale as well. I think it’s anything where you get beyond somebody being able to sit down and look at a computer programming model or a flow sheet, or however you want to describe it, and you’re not able to fully understand how a computer program got to the answer. Certainly with the deep learning models where you’ve got everything vectorized, you’ve got nonlinearity flowing through huge numbers of parameters and you get to the end, and it says, hey, that’s a cat.
12:06 Wes Reisz: The way that I like to always envision it is back in my software experience, I remember building rules engines. Rules engines, you could retrace the path and be able to say because of this decision, we had the next decision, we had the next decision. Those were great and we built them larger and larger and larger. Then all of a sudden convolutional neural networks came along and we could replace this massive rules engine with all these different, again, Plinko boards on how things bounce through the system with something like the convolutional neural network, which was great. It was a lot less code and it was a lot easier to manage from the rules engine, but how it got to that result was lost. The things like what you talked about with explainability with that concept-based model explanation seemed like a way of addressing that. It’s not just a nice to have anymore. It’s legally required by things like GDPR in the European Union.
13:00 Carin Meier: There’s a great conference that goes on every year called NeurIPS. They just had a really great tutorial on interpretability for these machine learning models. That’s actually freely available. I encourage everybody to go look at it, especially if you’re using machine learning models and interested in this. They went into … Basically with simple models … Like you said, with the rules engine, you can trace it through, but once something gets big enough that you can’t, you have to move to a post-hoc explainability. You can’t trace it from the beginning. You can only look afterward, with a percentage, and say this is why it did what it did. They have some nice tools out there for this, especially with text-based models. When you have a snippet of text and you ask it a question based on that, like who was the President in year X, then it’ll highlight the words according to how relevant each word was to the answer that it derives. That’s post-hoc explainability.
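One simple way to get the kind of word highlighting described above is occlusion: remove each word in turn and see how much the model’s score drops. The sketch below assumes a toy scoring function standing in for a real text model; the function, its weights, and the snippet are all hypothetical, and real tools typically use more sophisticated attribution methods.

```python
# Occlusion-based post-hoc relevance, sketched with a toy scorer.
# Assumption: toy_answer_score stands in for a real model's confidence.

def toy_answer_score(words):
    """Hypothetical confidence that the snippet supports the answer."""
    evidence = {"president": 0.5, "1863": 0.3, "lincoln": 0.2}
    return sum(evidence.get(w.lower(), 0.0) for w in words)

def occlusion_relevance(words, score_fn):
    """Relevance of each word = score drop when that word is removed."""
    baseline = score_fn(words)
    relevance = []
    for i in range(len(words)):
        without = words[:i] + words[i + 1:]
        relevance.append((words[i], baseline - score_fn(without)))
    return relevance

snippet = "Lincoln was the President in 1863".split()
relevance = occlusion_relevance(snippet, toy_answer_score)

# Print a crude text "highlight": longer bars mean more relevant words.
for word, drop in relevance:
    print(f"{word:10s} {'#' * round(drop * 10)}")
```

Because the explanation is computed after the fact from inputs and outputs alone, it works even when the model itself cannot be traced step by step.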
14:03 Carin Meier: You can look afterward and say, this word doesn’t look like it’s quite right. Then of course you’d have to go through the whole bother of trying to debug it. That’s a whole different thing, if you didn’t like the answer that it got. It’s interesting. If you have that insight into seeing how the model is working, then you can start to address other balances like accuracy and safety. How accurate do you need it to be to solve your problem? Maybe a machine learning model isn’t even worth it to you, if you don’t need to be that accurate and so don’t need that trade-off. If you do need that accuracy, how can you safely use it? If you have an explanation, can you insert a human into the process and have them double-check the answer? I think going down to the core of this, we have a wonderful tool in machine learning, but it definitely doesn’t replace thinking. Thinking just pushes to a broader picture of how can you incorporate this in this process? Do you need to incorporate this in the process? Do you understand your problem? What is your problem? That’s the hard stuff.
15:19 Wes Reisz: I like that bit about the human in the AI loop because I think a lot of times people think about AI and machine learning as just making all these decisions. Certainly, they do, but in many cases, it’s augmenting a human’s ability to, I guess, react on data more appropriately. I can remember talking about Stitch Fix, for example. Stitch Fix, it’s not in the healthcare space, but it does clothing recommendations for people. There’s still an individual there. They use machine learning extensively to give recommended sets of clothes and patterns of things to a person who then makes that final recommendation to the subscriber. I think that’s a really good way of thinking about how machine learning and AI is being used. It helps, it augments the person’s ability to get to a set of data where the real decision can be made faster, I think.
16:11 Carin Meier: Exactly. I think the analogy earlier on was, we want machine learning to be like an Iron Man suit.
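The human-in-the-loop pattern discussed above can be sketched as a review queue in which the model only suggests and a person always decides. This is an illustrative sketch, not anything Reify Health or Stitch Fix is known to use; the field names, case identifiers, and the 0.8 threshold are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    case_id: str         # hypothetical identifier
    recommendation: str  # what the model suggests
    confidence: float    # model's own confidence, 0.0 to 1.0
    explanation: str     # e.g. top factors from a post-hoc explainer

def triage(suggestions, urgent_below=0.8):
    """Every suggestion goes to a human; low-confidence ones are reviewed first."""
    urgent = [s for s in suggestions if s.confidence < urgent_below]
    routine = [s for s in suggestions if s.confidence >= urgent_below]
    return urgent + routine

queue = triage([
    Suggestion("case-1", "eligible", 0.95, "age and lab values in range"),
    Suggestion("case-2", "not eligible", 0.55, "conflicting medication history"),
    Suggestion("case-3", "eligible", 0.82, "matches inclusion criteria"),
])

for s in queue:
    print(f"{s.case_id}: {s.recommendation} ({s.confidence:.2f}) because {s.explanation}")
```

Carrying the explanation alongside each suggestion gives the reviewer something concrete to double-check.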
16:19 What are some of the core challenges of using machine learning/ deep learning in the healthcare space?
16:19 Wes Reisz: I like that. I like that. Yeah. Let’s talk about the Iron Man suit. What are some of the challenges with creating this Iron Man suit? Things like you mentioned, accuracy, safety, you’ve already talked about explainability. What are some of the core challenges on being able to leverage machine learning, deep learning in this healthcare space?
16:36 Carin Meier: I think those are the key things that are holding us back: trust, basically. Healthcare and the medical environment is a high-trust environment. Whatever tools we use, we need to understand them and to be able to trust them because they’re making a deep impact on people’s lives. The amount of trust that you need to pick a sweater for a person is not the amount of trust that you need to decide whether a person should get a life-saving treatment or not. Google and other big companies are tackling this problem. We need to find ways that we can make sure that privacy on the individual level is being preserved in these models, that they’re explainable in some way so that we can trust them, and that we can find best practices (I don’t like to say best practices, in a lot of ways) so that we can incorporate them into our businesses and our models.
17:41 Carin Meier: I’ll just expand on that. The reason I don’t like to say best practices is because people use that as an excuse not to think. They’re just like, I don’t need to think about this. The best practice is the way we do it. This is our purpose. Our purpose is to be here and to think about our problems and to think about the trade-offs of every solution, and come to the best possible solution. Just taking an out-of-the-box answer and saying, we can use this, and not thinking about it is doing a disservice to everyone.
18:09 Wes Reisz: Yeah. It leads us into some of the problems we’ve heard where ML models have gone wrong, for example. That reminds me of a cartoon I remember seeing years ago about design patterns. It was before a developer hears about design patterns, after a developer hears about it, and then after they have more experience leveraging design patterns. The first one, their code’s going all over the place. Then the second one everything’s a design pattern. Every single design pattern that they could possibly imagine is implemented end to end. Then at the end of it, it’s like, here’s just some simpler code that may happen to use a pattern. Once you learn about these things… Oh, I have to try to put them everywhere. It’s not always the best approach.
18:45 Carin Meier: Right. I think that’s led to some of the problems that we’ve had with machine learning models lately.
18:51 How do you balance data needed for an answer with the privacy/safety of the people represented in the dataset?
18:51 Wes Reisz: Let’s talk about safety. In particular, there’s one in the healthcare space that I think seems like a real challenge. I remember a few years back, there was a book by Cathy O’Neil, Weapons of Math Destruction, that talked about systemic bias in data and reinforcing pre-existing inequity with machine learning models. That went to things like removing race, for example, from data sets when decisions are being made. In healthcare, race can be very important. People with certain ethnic backgrounds may be more inclined to having certain diseases like heart disease, for example, or high blood pressure or things like that. How do you balance privacy and safety with things like race when it comes to machine learning models, when it may be important to the decision, but it’s been used for reinforcing pre-existing inequity? How do you balance?
19:44 Carin Meier: I think the first step is recognizing that there is a problem with this, and then you have to approach it carefully. Luckily now, it’s been circulated that there is a problem in datasets. Just because you put it all in a model doesn’t mean that the answer is perfectly free of human bias, because we fed it this data. Data in, data out. It doesn’t go away just because it’s a machine learning model; that’s our core truth. Making sure that your data is a good representative set to begin with is your fundamental thing. Of course, with sensitive data like race and ethnicity, you have the additional concern that this is individuals’ very sensitive data and you need to protect it. This is where differential privacy comes in, if people haven’t heard of it. It’s a technique with which you can protect an individual person’s information while still gathering statistical insights on the whole. You can still get the core learnings that you need, but without compromising the individual’s privacy.
20:53 Wes Reisz: The way that I understand differential privacy, and correct me if I’m wrong because I’m sure it’s not accurate, is that rather than showing an individual, you show it in an aggregated set? That way, the privacy of the individual is respected, but the data is still presented. Is that accurate?
21:08 Carin Meier: It’s got more math behind it. I’m not a math expert, but a statistical fuzzing method, I guess, is an appropriate way to think about it. There’s also ways that you can use this in training the deep learning models in a distributed fashion as well. That way, the machine learning model is trained on that fuzzed data itself. The individual data never actually reaches the final model, which is an important thing as well. I don’t want to get too far down in the differential privacy, but that’s another technique that’s used to be able to safely extract insights into race and ethnicity. That is an important component to making sure that it is not biased. Then there’s another process at the end, evaluating your model. Does your model have any bias in either direction? It’s all throughout the process. It’s at the beginning, looking at your data coming in, how you actually train the model, how you evaluate the model, and then a circular feedback loop. Let’s get humans in it and make sure that it’s doing the things that we want it to do in a safe manner.
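The "statistical fuzzing" described above is often formalized with the Laplace mechanism: calibrated random noise is added to an aggregate answer so that no single person’s presence changes the result much. The sketch below is a minimal illustration for a counting query, assuming made-up records and an arbitrary privacy budget (epsilon); it is not a production-grade implementation and does not cover the distributed-training case mentioned in the conversation.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two independent
    exponential samples with mean `scale` is Laplace-distributed."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Noisy count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical records: (age, has_high_blood_pressure)
records = [(34, False), (61, True), (47, True), (29, False), (70, True)]
rng = random.Random(0)  # seeded only so the sketch is repeatable

noisy = private_count(records, lambda r: r[1], epsilon=0.5, rng=rng)
print(f"noisy count of high blood pressure cases: {noisy:.2f}")
```

Smaller values of epsilon add more noise and give stronger privacy; the statistical insight survives because the noise averages out over large populations while still masking any one individual.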
22:24 Wes Reisz: Tying back to what we were talking about, that human in the AI loop before. I think what I’ve just heard is it’s important for humans to audit the decisions that are coming out to make sure that they make sense. Is that accurate?
22:35 Carin Meier: Yeah. Just like any sort of code. You need to test it to make sure that your code is right. That’s at the lower levels of, did it actually get the right answer that you wanted. Is the model accurate? Then the higher level, is it trustworthy? Can you explain how it got this answer? Why did it say that this person should get this treatment? Then how do we make sure that that isn’t biased? That’s another question. It goes up and up in scope, and how do we safely incorporate this into our business practice? What happens if it’s wrong? I think that’s one of the reasons why so many people are attracted to this area because the problems are tough and they’re important, and they’re changing. A lot of people are attracted to computer science in our industry because we like solving problems, and there’s a lot of problems to solve in this domain that directly impact everyone.
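One of many possible ways to test a model for bias, in the spirit of testing any other code, is to compare how often it recommends a positive outcome for each group. The sketch below computes per-group positive rates and the gap between them (often called a demographic parity gap). The groups and predictions are invented for illustration, and the choice of metric is an assumption; in healthcare, differing rates can be clinically legitimate, so a gap is a prompt for human review, not proof of bias.

```python
from collections import defaultdict

def positive_rate_by_group(rows):
    """rows: iterable of (group, predicted_positive) pairs."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, predicted_positive in rows:
        totals[group] += 1
        if predicted_positive:
            positives[group] += 1
    return {g: positives[g] / totals[g] for g in totals}

def parity_gap(rates):
    """Largest difference in positive rate between any two groups."""
    return max(rates.values()) - min(rates.values())

# Hypothetical model outputs: (group label, model recommended treatment?)
predictions = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

rates = positive_rate_by_group(predictions)
gap = parity_gap(rates)
print(rates, f"gap={gap:.2f}")
```

A check like this can run alongside accuracy tests every time the model is retrained, feeding the circular feedback loop described earlier.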
23:29 What are some of the ethical questions at the heart of machine learning today?
23:29 Wes Reisz: We seem only to be creating more problems with our society that have to be solved. We talked about privacy, we talked about explainability, we talked about safety, but one we haven’t talked about is ethics. Just because we can doesn’t mean we should. What are some of the ethical questions that are at the heart of machine learning today, deep learning today?
23:48 Carin Meier: Wow. Yeah.
23:51 Wes Reisz: I don’t know how I’d answer that question, so go.
23:55 Carin Meier: I think you ask broad questions, you can get broad answers.
24:00 Wes Reisz: Good response. As soon as I said it, I thought that was an unfair question to ask.
24:04 Carin Meier: The answer is, should we?
24:05 Wes Reisz: It depends. Maybe.
24:09 Carin Meier: That’s the thing. There was a good example of it in the news. I think it was in England when the whole pandemic hit and people couldn’t take their end exams. I think they just said, let’s just put a machine learning model on all your prior test exams, and we’ll just predict what you would have gotten on this test.
24:29 Wes Reisz: That’s a great idea.
24:32 Carin Meier: It’s pretty much the same as what you would have gotten. So what? You can’t go to college now.
24:40 Wes Reisz: Yeah. I definitely would not have gone to college under that arrangement.
24:44 Carin Meier: Yes, in that case, you can, and we did, but should we, sort of thing.
24:51 Wes Reisz: Do no harm. I think that’s a good answer. That’s the best way to end it.
24:55 Carin Meier: I know people at various points, people are always like, “Computer science people should be a guild. We should have an ethics statement just like doctors.” It’s interesting not even getting into whether we should have a guild, but if we had an ethics statement, like the Hippocratic Oath for doctors, what would it be? What would our oath be?
25:16 Can people without a PhD contribute to the field of machine learning and data engineering?
25:16 Wes Reisz: There’s a lot of discipline. There’s a lot of fields that are in machine learning. I think that there’s a perception that to be involved with these data pipelines that do the work that you’re doing requires a PhD to be able to… Is that true? Does it require a PhD to be able to get involved and contribute in a meaningful way to things like deep learning solutions to the coronavirus?
25:38 Carin Meier: I would say definitely not. PhDs are helpful. If you have a PhD, please come and help us, but it’s not required. I think data engineering as a field is one of the fastest-growing fields, just because we need good engineers. We need good engineers to build out our pipelines and to apply engineering practice to building models and maintaining them. That’s what the whole of our software industry has trained us to do, and we need it applied to this. Also we need just curious people generally to innovate. I think you were saying before about design patterns. Once you learn about design patterns, everything’s design patterns until… I think a lot of that is happening with deep learning and deep learning models right now. We’ve got one dominant model and one dominant way that we’re thinking about intelligence. That’s not necessarily the best or the only way. We need more people to come to this with curious minds, bring their backgrounds, whether it’s philosophy, whether it’s game development, whatever it is, so we, as humanity, can press forward and look at all these different solutions and find the best ones.
26:55 What are some of the big things in machine learning we’re set to solve in 2021?
26:55 Wes Reisz: Absolutely. I come at it from a web developer background. I’m a Java developer who comes from a web environment. These models still have to run. It still takes someone to be able to take that machine learning model, wrap it into a service and be able to operationalize it into a platform. There’s so many roles that are needed to be able to tackle the problems that machine learning can help solve. We’re at the very beginning of the year, 2020 is in our rearview mirror, thankfully. What do you hope that we’re going to solve in 2021? What do you think are the big things we’re set to solve?
27:29 Carin Meier: I’m going to bring it down to the scope of machine learning.
27:32 Wes Reisz: Yeah, there you go. Sorry. Let me qualify that. What are some of the things in the machine learning and deep learning space that you think we’re poised to solve in 21?
27:39 Carin Meier: Trust. Building trust in these techniques and in the models so we can use them responsibly and effectively in our healthcare and other areas that we need high trust. That’s a big, big gap right now for us.
27:57 Wes Reisz: Carin, thank you so much. I think we’ve been working on this since the end of last year. Thank you for working on this with me through the holidays and into the New Year. It was fun to sit down and chat finally.
28:07 Carin Meier: Thanks again.