216 - Yang and Henle - Machines learning to classify insecticides as toxic to bees

Transcript

Andony Melathopoulos: [00:00:00] Oh, faithful listener, thank you so much for sticking with the podcast. It has been a particularly busy, very productive time here at the OSU Pollinator Health Lab. Lots of great research we've been up to, but in the midst of it all, I've carried my handy field recorder with me everywhere I've gone, and I've got a whole hopper of podcasts ready to release, from Georgia to southern Alberta.

To Hood River, where we talked with somebody who's restoring a buried water pipeline to pollinator habitat, to just down the street, where there's a school that has integrated mason bees and honey bees into their curricula. So lots of great episodes to come, including today's episode.

A story is very cool when multiple people send it to your inbox, and you can always do that. If you come across some story you're interested in, send it to me and I'll follow it up and track it down. But this story happened right here at Oregon State University, just down the street, under my nose.

I didn't even know about it. Cory Simon, who's an assistant professor of chemical [00:01:00] engineering, and Xiaoli Fern, an associate professor of computer science, oversaw this great project on using machine learning to try and predict the toxicity of pesticides to bees. And today I've got the lead author on the paper, Ping Yang,

who's a graduate student here at Oregon State University, and Adrian Henle, who is a doctoral student in chemical engineering. So without further ado, let's dive into this whole question of whether you can train a machine to look at molecules and predict whether they're toxic to bees, this week on PolliNation.

All right. I'm so excited today to welcome Adrian and Ping to PolliNation. Welcome, Adrian.

Adrian Henle: Yeah. Thanks for having me.

Andony Melathopoulos: And welcome, Ping.

Ping Yang: Yeah. Hello, this is Ping.

Andony Melathopoulos: So I guess just to begin with, this is an intriguing paper. I was really [00:02:00] excited to have a number of people refer it to me. And so maybe just to start with, Ping, you're over a couple blocks from me in engineering.

Where did the idea come from to use machine learning on this really specific problem of bee toxicity? How did your lab develop this thought?

Ping Yang: First of all, actually, we had a thought to use this kind of graph kernel thing in our machine learning method.

We do some research in this machine learning field, and we found a paper that interested us, and also the pesticides. Pesticides are used very widely in agriculture and are an economical means to control weeds, pests, and pathogens. So we use pesticides to increase

the expected crop yield and also the quality, and to contribute to food security. But we think that widespread pesticide use [00:03:00] has some negative externalities on our ecosystem and human health. Especially, it will kill some agriculturally beneficial species that are not the target, such as the main character in our paper, bees.

So we think that we can use some machine learning algorithms to speed up the prediction of the pesticides' toxicity, and we can get the result very quickly, and it's usable for our further research in the lab. So we developed this algorithm, and it helps us.

Andony Melathopoulos: And I remember we were talking earlier that one of the starting points was that the Environmental Protection Agency already collects data on acute toxicity to bees, the short-term exposure in a laboratory setting. And so the lab already had this great data set to start this machine learning [00:04:00] process on chemistries.

So maybe let's get into the methodologies a little bit. You've got these chemicals, and I guess maybe here's a way into this part of the conversation. People have heard about machine learning before. We know when you go into Google Photos, you can type somebody's face in and it'll find all the... Clearly there's a lot of capacity, but I don't know if people often think about chemicals.

What's the data that goes into machine learning, apart from the toxicity? Maybe Adrian, just give us a sense: what is the input? What goes into the machine learning? What is it learning from?

Adrian Henle: So in our case, and in a lot of cases for chemical machine learning, what we're looking at is what's called a graph representation.

Anybody who's taken high school chemistry has drawn the little diagrams where you have an atom with a line to another atom and so on. And if you consider the atoms to be nodes or vertices in a graph [00:05:00] and the bonds to be edges of the graph, you end up with a flattened-out picture of the connectivity of the molecule.

And so that's really what the model is learning: A connects to B, which connects to C and back to D, and so on. So you could think of it as the shape of the molecule, but it's the shape without space.
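As a rough illustration of the graph representation described here (a sketch, not the code from the paper), ethanol can be written down as a list of atom labels plus a list of bonds. The names `atoms`, `bonds`, and `adjacency` are hypothetical, and hydrogens are left implicit by assumption.

```python
# Ethanol (CH3-CH2-OH) with hydrogens left implicit: two carbons and one oxygen.
# All names here are illustrative assumptions, not taken from the paper's code.
atoms = ["C", "C", "O"]      # nodes (vertices), indexed 0, 1, 2
bonds = [(0, 1), (1, 2)]     # edges: C0-C1 and C1-O2

# Build an adjacency list: for each atom, which atoms is it bonded to?
adjacency = {i: [] for i in range(len(atoms))}
for a, b in bonds:
    adjacency[a].append(b)
    adjacency[b].append(a)

# Print the connectivity only; no 3D coordinates are stored anywhere.
for i, neighbors in adjacency.items():
    print(f"{atoms[i]}{i} connects to", [f"{atoms[j]}{j}" for j in neighbors])
```

Nothing in this structure records bond angles or positions in space, which is the "shape without space" idea.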

Andony Melathopoulos: Okay. Cuz molecules clearly have all sorts of orientations around the bonds, but this is a flat representation, and what it has to learn from is this data set?

And I suppose, you were telling me about the distinctions between different types of machine learning: with some machine learning you tell it what to look for, and with others you just let the computer learn on its own. Tell us a little bit about the methodology that was used here.

Adrian Henle: So here, the data set ultimately comes as what's called a SMILES string.

Andony Melathopoulos: SMILES string? Smiles?

Adrian Henle: Yeah. Okay. Looking happy. It's an acronym that somebody [00:06:00] backronymed together, but what it really means is it's a way to write out the structure of a molecule in text.

And our code reads that in and constructs the graph representation from that. And then for every SMILES string in the data set, there's a classification: toxic or nontoxic. The model does a whole lot of analysis of those graph structures, learning which motifs happen more often in the toxic versus nontoxic classifications.

And if you give it a new SMILES string, it'll look at the graph, think about it for a little bit, and say, yeah, this has more in common with the toxic than not, so I'm gonna predict toxicity.
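To make the "text in, graph out" step concrete, here is a minimal sketch of a parser for a very small subset of SMILES. It is an assumption-laden toy, not the authors' code: the function name `parse_simple_smiles` is hypothetical, and it only understands the atoms C, N, and O, the bond symbols `-`, `=`, `#`, and parenthesized branches. Rings, charges, aromatic atoms, and two-letter elements are not handled; a real project would use a cheminformatics library instead.

```python
def parse_simple_smiles(smiles):
    """Turn a tiny SMILES subset into (atoms, bonds).

    atoms is a list of element symbols; bonds is a list of
    (atom_index, atom_index, bond_order) tuples.
    Hypothetical helper for illustration only.
    """
    bond_orders = {"-": 1, "=": 2, "#": 3}
    atoms, bonds = [], []
    branch_stack = []   # atoms to return to when a branch closes
    previous = None     # index of the atom the next atom bonds to
    order = 1           # order of the next bond (single unless stated)

    for ch in smiles:
        if ch in ("C", "N", "O"):
            atoms.append(ch)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index, order))
            previous = index
            order = 1
        elif ch in bond_orders:
            order = bond_orders[ch]
        elif ch == "(":
            branch_stack.append(previous)
        elif ch == ")":
            previous = branch_stack.pop()
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


# Acetic acid, written in SMILES as CC(=O)O
atoms, bonds = parse_simple_smiles("CC(=O)O")
print(atoms)
print(bonds)
```

In the workflow described above, each parsed graph would then be paired with its toxic or nontoxic label before any learning happens.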

Andony Melathopoulos: Okay, great. So Ping, if I get this straight, this combination of the toxicity, whether it's toxic or non-toxic, and this flattened kind of chemical structure gets fed into the machine learning.

What happens then? And how do you know that it's actually learning? How do [00:07:00] you confirm that it's actually learning something?

Ping Yang: So actually, if you put some, like Adrian says, SMILES strings into our machine learning method, it actually just compares the different and the common parts. Actually, it's not parts, it's patterns in the graph. And the patterns in the graphs will be compared, because we already know that some pesticides are toxic or not toxic.

So we use these pesticides to get the whole thing first, then we compare these toxic pesticides with others whose toxicity we don't know. And then if we compare and find patterns that show up very frequently in both pesticides, then we can be fairly confident [00:08:00] that these new pesticides must also be toxic.

Okay. This is how we compared it.

Andony Melathopoulos: So you've had some knowns, and then you ran a number through, and if it could guess them correctly, then that gave you a sense that it was really learning properly. It was not just randomly assigning; it was accurate. So tell us what the results were like.

Was it able to learn, from a molecule's structure, whether it's a toxic or non-toxic product?

Ping Yang: Just based on the molecule structure, we cannot definitely say that it is toxic or non-toxic. But from our learning, it's more like comparing the, actually, we use the word "walk." We use the walk in the graph kernel. Actually, we are using walks to count all the different parts and also the common parts between two different patterns or two different graphs.

Andony Melathopoulos: All right, wait. So what do you mean [00:09:00] by walk? I know what walking is in non-engineering, non-machine-learning terms. So what does walking mean in the model?

Ping Yang: Yeah, a walk in a graph is just like, so graphs are basically constructed from some nodes and some edges, and the edges connect the nodes.

And so a walk is actually stepping from one node to another node, and to another. So, like steps between the nodes. Okay. So that's a walk. So when you do a walk, you can compare two different walks between two different patterns or two different graphs. So, like, if I fly from San Francisco to Portland,

and you fly from Los Angeles to San Francisco and then to Portland, then we have the common walk called San Francisco to Portland.

Andony Melathopoulos: Okay. That makes [00:10:00] sense. Okay. So I imagine people who fly that general route may be more similar to each other than to the people who fly, I don't know, Anchorage to Edmonton or something.

It might be a much different kind of group of people. And so, getting back to this question: the walking is, in some ways, you're walking from the atoms inwards and looking for similar walks. And getting back to the question of whether it can predict toxicity, I guess it can look for similarities.

Yeah. Okay.

Ping Yang: Yeah. Yeah. The graph kernel is actually comparing the similarity between different graphs.
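One simple way to turn "count the walks two graphs have in common" into a number is sketched below. This is an illustrative fixed-length walk count, assuming atom labels are the only thing compared; it is not presented as the exact kernel or code used in the paper, and the names `walk_label_counts` and `walk_kernel` are hypothetical. In terms of the flight analogy, each walk is an itinerary, and the score counts pairs of matching itineraries.

```python
from collections import Counter


def walk_label_counts(labels, edges, length):
    """Count label sequences of all walks with `length` steps in one graph."""
    neighbors = {i: [] for i in range(len(labels))}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Start with zero-step walks (single nodes), then extend one step at a time.
    walks = [(i,) for i in range(len(labels))]
    for _ in range(length):
        walks = [w + (n,) for w in walks for n in neighbors[w[-1]]]

    # Two walks "look the same" if they visit the same sequence of atom labels.
    return Counter(tuple(labels[i] for i in w) for w in walks)


def walk_kernel(graph_a, graph_b, length):
    """Similarity = number of pairs of walks (one per graph) with equal labels."""
    counts_a = walk_label_counts(*graph_a, length)
    counts_b = walk_label_counts(*graph_b, length)
    return sum(counts_a[s] * counts_b[s] for s in counts_a if s in counts_b)


# Heavy-atom graphs as (labels, edges); hydrogens are left implicit.
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
methanol = (["C", "O"], [(0, 1)])

# One-step walks shared by both: C->O and O->C, one match each.
print(walk_kernel(ethanol, methanol, 1))  # 2
```

A similarity score like this can be fed to a classifier trained on molecules with known labels, so that a new molecule is predicted toxic when it is more similar to the known toxic ones.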

Andony Melathopoulos: Okay. So Adrian, it seems like it was really successful at looking at compounds that were related to one another and being able to tell if it's a toxic one or not.

Did I get that right?

Adrian Henle: Yeah, that's right. Yeah. So [00:11:00] in the data set we had about half a dozen different classifications of commonly used pesticides: fungicides, miticides, and herbicides. All things where you would wanna know: if I spray this on my crops, is it going to do what I want, and only what I want, or is it gonna do some other things off target?

And from all of those data together, yeah, the model was able to make pretty good predictions. I think Ping said before, the whole-data-set accuracy was somewhere in the 70 to 80% range, which at first I didn't think was all that good, but we had an invited speaker last term, Professor Taylor Sparks

from the University of Utah, who also runs a podcast on materials science called Materialism. Yeah. Great name. Awesome. And a great guy, very friendly. And so we were telling him about this and he said, 80% is actually fantastic. Cuz think about what it would take for you, a human being, to look at a few thousand potential new molecules.

And could you, in a day or [00:12:00] less, tell me with 80% accuracy if these things are going to have off-target toxicity? And the answer is absolutely not.

Andony Melathopoulos: Yeah, having done some of those studies here at OSU, it takes a lot of effort. You need to get the bees, you need to get them at the right stage. You need to make sure you have good laboratory practices

so it's auditable. It's a big undertaking. So this could be a great tool for making quick screens of new chemistries, like, we'd better really put our attention to this one, cuz this looks like it could be a real toxic-to-bees kind of product.

Adrian Henle: Yeah, we see this as a prioritization tool.

So if you have a big database of molecules that you might want to make and test out for a new product, eventually you're gonna have to do this real, in vivo testing to determine toxicity. That's the slow, expensive, difficult part. The ones that the model says are almost definitely going to be toxic to the wrong thing,

you probably just won't make those. Focus your attention on other things that have a better chance of being [00:13:00] appropriately targeted.

Andony Melathopoulos: I'm glad you raised that. And I wonder, Ping, what do you see as some of the applications of this, and what are some of the next steps with this work?

Ping Yang: For this work, actually, we don't only want to test the toxicity of pesticides. For our further research on this project, we actually want to expand the area of predictions so we can get other things, because we can test a lot of different small molecules. We can test the similarity between different small molecules, so we can not only predict the toxicity of pesticides, but also other things.

Actually, something we are trying to do, because we are a lab doing simulation of metal-organic frameworks, is the storage or the [00:14:00] adsorption of different gases in them. Yeah. So that's our further research. But for the pesticides, what we can do is increase the accuracy of our prediction of this toxicity, because our algorithm is not perfect at predicting all different kinds of pesticides.

So there are some ways to increase our accuracy.

Andony Melathopoulos: I see. Okay. So a next step might be, you're at 70 or 80%, but I imagine there must be other methodologies, or other ways to look at these molecules, that may get it even sharper.

Wow, this is amazing. Adrian, is there anything that you wanted to add to that in terms of applications? You talked about, like, a screening tool. Are there any other ways that you envision this technology being taken up, or maybe next steps?

Adrian Henle: The [00:15:00] real easy sort of lateral move on it would be, this doesn't have to be about bee toxicity.

The method should work if you have some set of molecules and some property that you wanna be able to predict. If you wanna predict, are these things going to be harmful to people? Are these things going to be useful for the treatment of a specific disease or condition? These are all equally answerable questions using the method that we described in the paper.

Then, in terms of how to really take the technology to the next step: as you mentioned, there are a plethora of other machine learning techniques that we could either supplant ours with or use to help refine what we've done. Interpretability would be a really cool thing, I think, to tackle next, because right now we know how well the method works, but we don't really know why.

And that's a common problem in machine learning. The model does what it does. And in the end, you can give the model to other people, but it's a huge set of numbers all written down, and they don't mean anything [00:16:00] to the human mind. So it would be really cool to come up with something that would look at what our model does computationally and show us: here's the thing that really mattered.

Here's what I saw in the data. Because a human organic chemist learns really well that way: here's a chunk of molecule, do or do not put this into the next thing that you make.

Andony Melathopoulos: Oh, I see. So another next step is that, cuz the machine learns and you don't really know why, it just tells you this is toxic, but what if you knew

there's a kind of kink in the molecule here, this combination? Then somebody who's designing new products can make sure that they're not toxic to bees. It's like, I know to avoid that in the synthesis.

Adrian Henle: Yeah. Human beings can learn from the model that way.

Andony Melathopoulos: Oh, fantastic. This has been really wonderful talking with you both.

Congratulations on your publication, and we are looking forward to hearing more about your work here over in horticulture. [00:17:00]

Adrian Henle: All right. Yeah, thank you again for having us. It's been fun.

Ping Yang: Yeah, thank you.

A new study uses machine learning to classify whether a pesticide is toxic to bees or not.

Ping Yang and Adrian Henle are graduate students at Oregon State University in Cory Simon's research group, studying the application of novel machine learning techniques to predicting properties of chemical systems.
