Importance of programming language expressions for donor materials in modeling
The programming language expression is one of the important aspects of the ML approach, as it transforms raw data into a machine-readable representation (25). Different expressions of the same molecule encode very different chemical information, or present that information at different levels of abstraction. A desirable expression should cover almost all the features of the molecule while containing no redundant information. Here, a set of ML models is used to explore the different expressions of a molecule by comparing their prediction accuracy for the PCE.
The image of a chemical structure is a direct and original expression of a molecule (Fig. 1B). However, features connected with the PCE are not explicitly reflected in an image and are regarded as hidden features. To overcome this problem, we use deep learning, which can extract features from images. The confusion matrix shown in Fig. 2A indicates the performance of the deep learning model. The prediction accuracies of the best-performing deep learning model for the first (0 to 2.99%) and second (above 3.00%) categories are 70.79 and 67.90%, respectively. The overall accuracy is 69.41%. The unsatisfactory performance of the deep learning model with images as the expression is attributed to the small size of our database (deep learning models typically require large training sets). When the number of molecules in the database reaches 50,000, the accuracy of a deep learning model can exceed 90% (19). Fully training a deep learning model usually requires a large database containing millions of samples (35, 36). Here, each category has only hundreds of molecules, making it difficult for the model to extract enough information to achieve high accuracy. Fine-tuning a pretrained model (36) can considerably reduce the amount of data required, but thousands of samples are still needed to provide a sufficient number of features. Therefore, increasing the size of the database is one solution when using images to express molecules.
The SMILES code provides another original expression of a molecule (Fig. 1B) (31). Through a traversal of the whole chemical structure, a string containing information on atoms, bonds, rings, aromaticity, and branches is obtained according to established rules. The results of using SMILES as inputs for the BP, DNN, RF, and SVM models are shown in Fig. 2B. The cross-validated average accuracies of all four methods are low; the highest, achieved by the RF model, is only 67.84%. There are two possible reasons: (i) SMILES is still close to raw data, and unlike deep learning, the four classic ML methods cannot extract hidden features. As shown later, a further conversion, e.g., to fingerprints, is needed for these classic ML methods. (ii) As mentioned above, zeros are appended to pad the SMILES strings of different molecules to the same length. These zeros may disturb the process of building logical relationships in the models. Thus, SMILES performs worse than images as a descriptor of the molecules for predicting the PCE class.
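The zero-padding issue described in (ii) can be illustrated with a minimal sketch. The encoding below (character-to-integer mapping, pad with 0s) is a common convention and an assumption on our part, not the paper's documented preprocessing, and the SMILES strings are illustrative fragments only.

```python
# Sketch (assumed preprocessing): encode SMILES strings as fixed-length
# integer vectors, appending 0s so all molecules share the same length.

def encode_smiles(smiles_list):
    """Map each character to an integer ID (1-based) and pad with 0s."""
    chars = sorted({c for s in smiles_list for c in s})
    vocab = {ch: i + 1 for i, ch in enumerate(chars)}
    max_len = max(len(s) for s in smiles_list)
    return [[vocab[c] for c in s] + [0] * (max_len - len(s))
            for s in smiles_list]

# Two illustrative strings of different lengths (hypothetical fragments)
encoded = encode_smiles(["CCCCCCc1ccsc1", "CCCCCCc1cc2sc3cc(sc3c2s1)C"])
assert len(encoded[0]) == len(encoded[1])  # equal lengths after padding
assert encoded[0][-1] == 0  # trailing 0s carry no chemical information
```

The trailing zeros are indistinguishable, to a classic ML model, from a "real" feature value, which is one plausible way they interfere with learning.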
Molecular descriptors describe the properties of a molecule using an array of real numbers rather than expressing the chemical structure directly (32). Here, two kinds of descriptors (PaDEL and RDKit) with different data sizes are used. The PaDEL descriptor (table S1) (37) consists of 1875 different types of descriptors, which can be classified as one-dimensional (1D) descriptors (i.e., the number of certain groups or atoms), 2D descriptors (i.e., graph invariants and molecular properties), and 3D descriptors (i.e., geometry). Figure 2C depicts the results of using the PaDEL descriptor as input for the BP, RF, and SVM models. The RF model attained the best performance (average accuracy as high as 76.27%), far superior to the BP and SVM models. Note that our DNN model cannot process such a long array of real numbers in this experiment. The RDKit descriptor (38) (196 bits; table S2) is much shorter than the PaDEL descriptor (1875 bits), implying that it contains less information. The results of using RDKit as the input are shown in Fig. 2D. The RF model again attains the best performance. However, the prediction accuracy with RDKit (75.29%) is only 1% lower than with PaDEL (76.27%). In contrast, the accuracy with RDKit (67.65%) for the BPNN model is better than with PaDEL (62.35%). The accuracy with RDKit for the SVM model is 47.65%, below 50% (random classification), suggesting that SVM cannot establish a logical relationship between RDKit descriptors and the PCE. These results indicate that a larger descriptor set includes more descriptors that are irrelevant to the PCE, which degrades the ANN performance. Conversely, a small data dimension means that the chemical information is insufficient to train SVM models effectively. Therefore, finding appropriate descriptors directly related to the target property is the key when using molecular descriptors as inputs in ML approaches.
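One practical way to look for descriptors related to the target, as the paragraph above recommends, is to inspect random-forest feature importances. The sketch below uses synthetic data (the 50 "descriptors" and the target rule are invented for illustration, not taken from the paper's database):

```python
# Sketch: ranking descriptors by random-forest feature importance on
# synthetic data, where only descriptors 3 and 17 determine the class.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))            # 300 "molecules", 50 descriptors
y = (X[:, 3] + X[:, 17] > 0).astype(int)  # only two descriptors matter

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
imp = rf.feature_importances_
# the two truly relevant descriptors carry well above average importance
assert imp[3] > imp.mean() and imp[17] > imp.mean()
```

Pruning descriptors with near-zero importance before retraining is one way to avoid the irrelevant-descriptor problem observed with the full PaDEL set.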
Molecular fingerprints are designed for large-scale database screening and take the form of an array of bits (39), in which "1"s and "0"s denote the presence or absence of particular substructures or patterns in the molecule. Here, seven types of fingerprints are used as inputs to train the BPNN, DNN, RF, and SVM models. The influence of the fingerprint length on the prediction performance of the different models is also considered. The results of using the different types of fingerprints as inputs are summarized in Fig. 3.
MACCS fingerprints (40) have 166 bits, making them the shortest considered here. Despite this short length, the similarity between the fingerprints of different molecules is relatively low. For example, P3HT and PTB7 each have 166 bits in total, of which 26 bits differ, giving a "degree of difference" of 15.66% (the complete MACCS fingerprints are shown in table S3). However, the results of using MACCS fingerprints as the input are unsatisfactory (the highest average accuracy, achieved by the RF model, is only 72.35%) because of the limited information they contain. PubChem fingerprints (41) have 876 bits, longer than MACCS. However, the differences between molecules are small for PubChem: the degree of difference is only 10.39% for P3HT and PTB7, implying that most of the bits are the same for these two materials. The small difference among molecules suggests that the substructures described by PubChem exist in most of the molecules, so models will struggle to distinguish molecules from one another. Although an RF model can obtain an average accuracy of 74.90%, we cannot conclude that PubChem fingerprints are suitable as an expression of a molecule for screening OPV donor materials.
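The "degree of difference" quoted above amounts to a normalized Hamming distance between two same-length bit arrays. A minimal sketch (the bit patterns below are toy placeholders, not the actual MACCS keys of P3HT and PTB7):

```python
# Sketch: "degree of difference" between two equal-length fingerprints,
# i.e., the percentage of bit positions at which they disagree.

def degree_of_difference(fp_a, fp_b):
    assert len(fp_a) == len(fp_b)
    diff = sum(a != b for a, b in zip(fp_a, fp_b))
    return 100.0 * diff / len(fp_a)

# 26 differing bits out of 166 reproduces the 15.66% quoted in the text
fp1 = [0] * 166
fp2 = [1] * 26 + [0] * 140
print(round(degree_of_difference(fp1, fp2), 2))  # → 15.66
```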
The FP2 fingerprint (42) has 1020 bits; it is a path-based fingerprint that indexes small-molecule fragments as linear segments of up to seven atoms. The performances of the four ML methods are stable and satisfactory, with the SVM model achieving the highest average accuracy of 74.51% (Fig. 3D). In addition, the Extended fingerprint (1021 bits) extends the Chemistry Development Kit fingerprint (43) with additional bits describing ring features. The prediction results for the Extended fingerprint are similar to those for the FP2 fingerprint, with the best performance obtained by the RF method (Fig. 3C) at an average accuracy of 77.06%.
Both Daylight (44) and Hybridization fingerprints (43) have 1024 bits, but the information expressed by these two fingerprints is quite different. Daylight fingerprints represent the pattern for each atom and its nearest neighbors, whereas Hybridization fingerprints take into account SP2 hybridization states rather than aromaticity. However, the verification results are similar when these two fingerprints are used as inputs (Fig. 3). The highest average accuracies (obtained by the RF models) for the Daylight and Hybridization fingerprints are 79.02 and 78.24%, respectively. We note that the best combination of programming language expression and ML algorithm over all models is the Hybridization fingerprint with RF, which achieves a prediction accuracy of 81.76%. Moreover, the prediction performances of the FP2, Extended, Daylight, and Hybridization fingerprints are close to one another. These fingerprints are organized by different representation rules but have similar lengths (around 1000 bits). The similar prediction performance of different fingerprints of almost the same length indicates that the fingerprint length, rather than the contents of the fingerprints, has the more notable impact on the prediction of PCE.
The Morgan fingerprint (45) is the longest, having 2048 bits. For the BPNN model, the Morgan fingerprint performs worse than most of the fingerprints with lengths of around 1000 bits. Notably, the other ML models still give satisfactory results, and the highest average accuracy, 79.80%, is obtained with the SVM model.
From the results described above, we can conclude that, generally, the performances of all ML models improve as the fingerprint length increases from 166 to 1024 bits. This is understandable, since longer fingerprints include more chemical information. In particular, the DNN, RF, and SVM models can establish an accurate relationship between the chemical structure and the PCE when the length of the fingerprint exceeds 1000 bits, while BPNN performs best with fingerprints of around 1000 bits. This may be due to the relatively poor data-processing capability of BPNN, as the activation functions used in BPNN are imperfect (more details are given in the Supplementary Materials). A long fingerprint carries much more information than BPNN requires, which may "mislead" the model and place excessive pressure on the computation (making the model difficult to converge). In addition, the overall results suggest that molecular fingerprints with lengths above 1000 bits are the most suitable and effective inputs for building ML models to predict the PCE, owing to their accessibility and the abundance of chemical information they contain.
Considering that a higher threshold value is more meaningful when designing highly efficient materials, we increased the threshold from 3 to 10%. As mentioned in Methods, increasing the threshold reduces the number of molecules in the database. We trained RF models with Daylight fingerprints as the input. When the threshold is set at 10%, the average prediction accuracy is 86.67%, but the SD is large (±11.58%), which may be due to the small database containing only 100 molecules.
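The link between a small database and a large accuracy SD can be illustrated with a simple simulation. The sketch below treats each cross-validation fold's accuracy as a binomial draw around an assumed true accuracy of 87%; the fold counts and trial numbers are illustrative assumptions, not the paper's actual protocol.

```python
# Sketch: smaller databases mean smaller CV folds, and fold accuracies
# estimated from fewer samples fluctuate more (larger SD).
import random
import statistics

def cv_accuracy_sd(n_molecules, true_acc=0.87, folds=5, trials=2000, seed=1):
    rng = random.Random(seed)
    fold_size = n_molecules // folds
    accs = [sum(rng.random() < true_acc for _ in range(fold_size)) / fold_size
            for _ in range(trials)]
    return statistics.pstdev(accs)

small = cv_accuracy_sd(100)   # ~20 molecules per fold
large = cv_accuracy_sd(1000)  # ~200 molecules per fold
assert small > large  # smaller folds give noisier accuracy estimates
```

This sampling noise alone does not explain all of the observed ±11.58%, but it sets a floor on the variance one should expect from a 100-molecule database.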
Screening for high PCE donor material via ML
To efficiently predict the PCE of donor materials, four ML methods are used, and their performances with the different programming language expressions are summarized in Fig. 4A. The RF method performs best because its strategy is to choose multiple features rather than all features from the input when establishing the relationship (46), which is advantageous when dealing with complex and long inputs. For example, only the RF model performs well when SMILES, PaDEL, and RDKit descriptors are used to represent the materials.
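The feature-subset strategy described above corresponds, in scikit-learn terms, to the `max_features` setting of a random forest: each split considers only a random subset of the input columns. The sketch below demonstrates this on synthetic 1024-bit "fingerprints" (the data and the class rule are invented for illustration):

```python
# Sketch: a random forest on long binary inputs, where each split
# samples only sqrt(n_features) candidate columns (max_features="sqrt").
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(400, 1024))     # 400 molecules, 1024-bit inputs
y = (X[:, :8].sum(axis=1) >= 4).astype(int)  # class set by a few key bits

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0)
scores = cross_val_score(rf, X, y, cv=5)     # 5-fold cross-validation
assert scores.mean() > 0.5  # beats random classification on held-out folds
```

Considering random column subsets at each split keeps individual trees cheap and decorrelated even when the input is a long fingerprint, which is consistent with RF's robustness across the expressions tested here.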
To further verify the reliability of our ML models, we designed 10 new small molecular donor materials (D1 to D10, whose chemical structures are available in fig. S2). The OPV fabrication process can be found in the Supplementary Materials. To the best of our knowledge, nine of them have not been reported yet, and one was published very recently (22). Originating from the well-studied A-π-D-π-A structure and the highly efficient BTR molecule developed by us (47), these 10 donor materials can be divided into three groups with variations in the A (end group), π (link), D (core), and side-chain groups. Donors D1, D2, D6, and D9 have the same π-D-π structure but different A moieties, while donors D3, D4, and D5 have chlorination or alkyl chain modification on the D part. In donors D7, D8, and D10, the π links were modified.
As shown in Fig. 4B, the OPV devices are based on a typical normal cell architecture. D3 and D7 used IDIC and Y3 as acceptors, respectively, while the other eight donors used PC71BM as the acceptor. The donor:acceptor blend film is sandwiched between a poly(3,4-ethylenedioxythiophene):poly(styrenesulfonate) (PEDOT:PSS)–coated indium tin oxide (ITO) transparent anode and a [2-(1,10-phenanthrolin-3-yl)naphth-6-yl]diphenylphosphine oxide (DPO) electron transport layer. Ag was used as the back cathode. After fabrication, these devices were tested under AM1.5G illumination in ambient conditions to investigate their photovoltaic performance. The current density−voltage (J–V) curves of the OPV devices are displayed in Fig. 4C, and the photovoltaic performance parameters are summarized in table S4.
Before the experiments, we used our RF models with a threshold of 3% to evaluate these 10 materials. Three representative fingerprints, i.e., FP2, Hybridization, and Daylight, were selected to express the chemical structures of the 10 new molecules. The results are displayed in table S5. The comparison between the RF model predictions and the experimental PCE values is shown in Fig. 4D. Eight of the 10 molecules are classified into the correct category, while two materials (D8 and D10) that exhibited low PCEs (less than 3%) are classified into the above-3% category. Note that the prediction result signifies the potential of a material for OPV application; these two materials may thus be further improved by optimizing the experimental conditions.
In addition, these 10 new materials have also been evaluated by the model using 10% as the threshold. The prediction results are displayed in table S6 and fig. S3. The model with a 10% threshold classifies eight molecules into the correct category. In general, the predicted PCE classes agree well with the experimental results. The experimental outcomes indicate that a minor change in structure can produce a large difference in PCE values. Encouragingly, these minor modifications can be identified by an optimized ML model, leading to favorable predictions. Although the ML model produces a prediction by comparing similarities, we believe the similarity features learned by the models are complex: they likely go beyond structural similarity and may include abstract features such as the locations and connections of various substructures.