Search for relevant subsets of binary predictors in high dimensional regression for discovering the lead molecule

Mameli, Valentina; Slanzi, Debora; Poli, Irene; Green, Darren V. S.

doi:10.1002/pst.2117

One of the main problems in drug discovery is identifying small molecules, modulators of protein function, that are likely to be therapeutically useful. Common practice relies on screening vast libraries of small molecules (often 1–2 million molecules) to identify a molecule, known as a lead molecule, that specifically inhibits or activates the protein function. To search for the lead molecule, we investigate the molecular structure, which generally consists of an extremely large number of fragments. The presence or absence of particular fragments, or groups of fragments, can strongly affect molecular properties. We study the relationship between a molecule's properties and its fragment composition by building a regression model in which the predictors, binary variables indicating the presence or absence of fragments, are grouped into subsets, and a bi-level penalization term is introduced to handle the high dimensionality of the problem. We evaluate the performance of this model in two simulation studies, comparing different penalization terms and different clustering techniques for deriving the best predictor subset structure. In both studies the data sets are small relative to the number of predictors under consideration. The results of these simulation studies show that our approach can generate models that identify key features and provide accurate predictions. The good performance of these models is then demonstrated on real data on the MMP-12 enzyme.
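To make the idea of bi-level penalization concrete, here is a minimal sketch in Python. The abstract does not say which penalties are compared, so the sparse group lasso is used purely as an assumed stand-in: it penalizes whole groups of coefficients and individual coefficients within each group. The function names (`sgl_penalty`, `sgl_prox`), the parameters `lam` and `alpha`, and the toy numbers are hypothetical and do not come from the paper.

```python
import math


def sgl_penalty(beta, groups, lam, alpha):
    """Sparse group lasso penalty (an assumed example of a bi-level penalty).

    lam * (alpha * sum_j |beta_j| + (1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2)
    """
    l1 = sum(abs(b) for b in beta)
    l2 = sum(
        math.sqrt(len(g)) * math.sqrt(sum(beta[j] ** 2 for j in g))
        for g in groups
    )
    return lam * (alpha * l1 + (1 - alpha) * l2)


def sgl_prox(beta, groups, lam, alpha, step=1.0):
    """Proximal operator of step * sgl_penalty for non-overlapping groups."""
    out = list(beta)
    for g in groups:
        # Within-group level: soft-threshold each coefficient.
        thr = step * lam * alpha
        s = [math.copysign(max(abs(beta[j]) - thr, 0.0), beta[j]) for j in g]
        # Group level: shrink the whole group, dropping it if its norm is small.
        norm = math.sqrt(sum(v * v for v in s))
        t = step * lam * (1 - alpha) * math.sqrt(len(g))
        scale = max(1.0 - t / norm, 0.0) if norm > 0 else 0.0
        for j, v in zip(g, s):
            out[j] = scale * v
    return out


if __name__ == "__main__":
    # Hypothetical coefficients for four fragment indicators in two groups.
    beta = [3.0, 4.0, 0.1, -0.1]
    groups = [[0, 1], [2, 3]]
    print(sgl_penalty(beta, groups, lam=1.0, alpha=0.5))
    print(sgl_prox(beta, groups, lam=1.0, alpha=0.5))
```

In this toy call the second group has only small coefficients, so the proximal step sets the whole group to zero, while the first group is shrunk but kept. That is the sense in which such a penalty selects both subsets of fragments and individual fragments within them.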