Biology Methods and Protocols2021Full TextOpen Access

MultiPep: a hierarchical deep learning approach for multi-label classification of peptide bioactivities

Alexander G. B. Grønning, Tim Kacprowski, Camilla Schéele

37 citations2021Open Access — see publisher for license terms1 related compound

Research Article — Peer-Reviewed Source

Original research published by Grønning et al. in Biology Methods and Protocols. Redistributed under Open Access — see publisher for license terms. MedTech Research Group provides these references for informational purposes. We do not conduct original research. All studies are the work of their respective authors and institutions.

Abstract

Peptide-based therapeutics are here to stay and will prosper in the future. A key step in identifying novel peptide-drugs is the determination of their bioactivities. Recent advances in peptidomics screening approaches hold promise as a strategy for identifying novel drug targets. However, these screenings typically generate an immense number of peptides and tools for ranking these peptides prior to planning functional studies are warranted. Whereas a couple of tools in the literature predict multiple classes, these are constructed using multiple binary classifiers. We here aimed to use an innovative deep learning approach to generate an improved peptide bioactivity classifier with capacity of distinguishing between multiple classes. We present MultiPep: a deep learning multi-label classifier that assigns peptides to zero or more of 20 bioactivity classes. We train and test MultiPep on data from several publically available databases. The same data are used for a hierarchical clustering, whose dendrogram shapes the architecture of MultiPep. We test a new loss function that combines a customized version of Matthews correlation coefficient with binary cross entropy (BCE), and show that this is better than using class-weighted BCE as loss function. Further, we show that MultiPep surpasses state-of-the-art peptide bioactivity classifiers and that it predicts known and novel bioactivities of FDA-approved therapeutic peptides. In conclusion, we present innovative machine learning techniques used to produce a peptide prediction tool to aid peptide-based therapy development and hypothesis generation.

Full Text
01

Abstract

Abstract Peptide-based therapeutics are here to stay and will prosper in the future. A key step in identifying novel peptide-drugs is the determination of their bioactivities. Recent advances in peptidomics screening approaches hold promise as a strategy for identifying novel drug targets. However, these screenings typically generate an immense number of peptides and tools for ranking these peptides prior to planning functional studies are warranted. Whereas a couple of tools in the literature predict multiple classes, these are constructed using multiple binary classifiers. We here aimed to use an innovative deep learning approach to generate an improved peptide bioactivity classifier with capacity of distinguishing between multiple classes. We present MultiPep: a deep learning multi-label classifier that assigns peptides to zero or more of 20 bioactivity classes. We train and test MultiPep on data from several publically available databases. The same data are used for a hierarchical clustering, whose dendrogram shapes the architecture of MultiPep. We test a new loss function that combines a customized version of Matthews correlation coefficient with binary cross entropy (BCE), and show that this is better than using class-weighted BCE as loss function. Further, we show that MultiPep surpasses state-of-the-art peptide bioactivity classifiers and that it predicts known and novel bioactivities of FDA-approved therapeutic peptides. In conclusion, we present innovative machine learning techniques used to produce a peptide prediction tool to aid peptide-based therapy development and hypothesis generation.

02

Introduction

Identifying bioactivities of peptides is becoming increasingly important. Illuminating the biological activity of peptides will help shedding light on their impact on various processes and help revealing peptides that can be used for therapies against diseases and disorders. A number of biologically available peptides have already now proven efficient for use as drugs for various diseases, including diabetes, osteoporosis, hypertension, cancer, and a variety of endocrine disorders [ 1 , 2 ]. Though peptide-based drug development is still hampered by certain challenges, such as physiochemical instability and short half-life [ 3 , 4 ], it can be expected that the future with its technical advances will contain more therapies involving peptides [ 2 , 3 ]. At the same time, mass spectrometry-based peptidomics technologies are rapidly developing, allowing for the discovery of previously unknown molecules [ 5 , 6 ]. Although a subset of these peptides likely represent potent metabolic signaling molecules, the field currently lacks an efficient approach to scan for specific bioactivities, which further allow meaningful follow-up experiments for target validation. Thus, accurately predicting the bioactivity of peptides is important for identifying novel drug candidates as well as new and unknown peptides. Many tools have been developed for predicting the bioactivity of peptides. Table 1 shows a selection of important state-of-the-art binary classifiers that can group peptides into different bioactivity classes. Table 1: State-of-the-art binary classifiers for bioactivity prediction Prediction class Algorithm type(s) Name/description of tool General bioactivity Neural network PeptideRanker [ 7 ] Antimicrobial peptides Neural network Convolutional long short-term memory neural network [ 8 ] Deep-AmPEP30 [ 9 ] Anticancer peptides Support vector machine mACPpred [ 10 ] Neuropeptide peptides Various PredNeuroP [ 11 ] Toxic peptides Support vector machine ToxinPred [ 12 ] Hemolytic peptides Various HLPpred-Fuse [ 13 ] Antioxidant peptides Neural network AnOxPePred [ 14 ] Though many of the peptide bioactivity predictors presented in Table 1 offer a wealth of additional information next to a prediction score [ 9 , 13–15 ], a common denominator for all of them is that they are binary classifiers. For example, the classifier PeptideRanker can predict whether a peptide has a bioactivity or not, and Deep-AmPEP can distinguish short antimicrobial peptides (AMPs) from short non-AMPs. This automatically implies that peptides either have one or none bioactivity, which is not in line with reality [ 15–25 ]. Getting an overview of peptides’ bioactivities is important when seeking to unravel therapeutics-relevant peptides. Further, an overview of peptides’ bioactivity profiles is vital when analyzing mass spectrometry-based peptidomics data, where a key step is to rank peptide candidates for downstream functional studies. A few tools that can predict multiple bioactivity classes simultaneously exist in the literature. PEPred-suite [ 26 ] uses eight random forest (RF) models to predict how peptides belong to eight different classes. Peptipedia [ 27 ] is another tool that, among other things, can predict different bioactivities of peptides by using 44 RF models trained on data from a large number of peptide databases. A combination of several machine learning algorithm types has been utilized for the prediction of peptide bioactivities ( Table 1 ). However, through the years, deep learning-based models have delivered state-of-the-art results in the field of biological sequence analysis [ 9 , 8 , 28 , 29 ]. And though much have been achieved already regarding peptide bioactivity prediction ( Table 1 ), the possibilities with deep learning are endless, and allows for continuous improvements. Here, we present MultiPep, a deep neural network multi-label classifier that can predict the bioactivity of peptides based on 20 bioactivity classes. The tool can take peptides of length 2–200 amino acids that can consist of natural amino acids (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V). MultiPep utilizes convolutional neural networks [ 30 ] for predicting the peptide class belonging, and can classify peptides into zero or more bioactivity classes based on their intrinsic amino acid patterns. The architecture of MultiPep is inspired by a dendrogram from a hierarchical clustering of our defined bioactivity classes’ mutual overlaps ( Fig. 1 ). An overlap occurs when two or more classes contain the same peptides. The 20 output nodes of the network are divided into “network class-clades,” which have outputs that are agglomeratively combined until two single binary predictions are made. The architecture ensures that extra penalties are added while training, if a sequence is predicted to be in, for example, a wrong network class-clade. Figure 1: Dendrogram template and overall architecture of the convolutional neural network. ( A ) Dendr

03

Materials and methods

Data and databases The data used for training, validation, and testing of our models were downloaded from APD3 [15], BioDADPep [ 16 ], BIOPEP-UWM [ 18 ], CancerPPD [ 19 ], CAMPR3 [20], DBAASP [2 1 ], LAMP2 [22], NeuroPedia [ 23 ], NeuroPep [ 24 ], PeptideDB [ 25 ], and SATPdb [ 17 ]. APD3, CAMPR3, DBAASP, LAMP2, and SATPdb all contain many peptides with experimentally validated antimicrobial function, such as “antibacterial,” “antifungal,” and “antivirus.” But, also peptides annotated with many other types of bioactivities ranging from “toxic” and “chemotactic” to “wound-healing” and “enzyme inhibitor.” The databases are both curated (CAMPR3, DBAASP, and SATPdb) and based on peptides found in the literature (APD3) and scientific literature and authoritative databases (LAMP2). CancerPDD contains experimentally validated anticancer peptides collected from published research articles, patents, and from other databases. PeptideDB and BIOPEP-UWM contain peptides belonging to many different bioactivity classes. PeptideDB contains bioactive peptides from 2820 metazoan species extracted through searches in Swiss-Prot and Trembl protein databases, using BLAST alignment tools and other in silico methods, whereas BIOPEP-UWM is a smaller but continuously curated database. BioDADPep is a database that contains peptides with anti-Type I and/or Type II diabetes properties found based on literature search on PubMed and searches on other published databases. The NeuroPep database contains neuropeptides extracted from the literature, UniProt and other databases, such as Neuropedia. All the entries in NeuroPep have been manually checked [ 24 ]. Neuropedia is a database with neuropeptides based on in-house mass spectrometry data. For all databases, where it was possible, only peptides with experimentally validated bioactivity were downloaded. Duplicates and peptide sequences containing unnatural amino acids (B, J, O, U, X, and Z) were removed as well.

04

Defined bioactivity classes

Based on the downloaded peptides and their annotations, 20 bioactivity classes were defined (see Supplementary Information and Supplementary Table S1 ). The complete list of classes and how the used databases have contributed to their generation can be seen in Table 2 . Some database class contributions were based on overlaps. For instance, the class “ACE inhibitor” stems only from the database BIOPEP-UWM, however, the database LAMP2 contributed with 39 peptides, because it had peptides from another class that overlapped with the peptides from the “ACE inhibitor” class. Table 2: Classes and databases. Classes CAMP3 LAMP2 APD3 SATPdb DBAASP BIOPEP-UWM PeptideDB NeuroPedia CancerPDD BioDADPep NeuroPep Total class size 0 ACE inhibitor 1 39 1 687 29 973 4 1 1 65 3 973 1 Antibacterial 2104 11 516 2858 4533 10310 512 1059 16 265 31 37 13 538 2 Anticancer 245 1575 419 1177 1919 104 121 1 440 14 6 2426 3 Antidiabetes 11 46 17 125 41 189 9 3 2 1091 3 1112 4 Antifreeze 0 0 0 0 0 0 192 0 0 0 0 192 5 Antifungal 1457 4747 1955 2818 4172 219 559 11 197 22 27 5342 6 Antihypertensive 1 73 6 1664 51 783 10 3 5 107 8 1672 7 Antimicrobial 1882 13 446 2125 8887 3681 239 2544 15 225 9 57 14 362 8 Antioxidative 10 38 28 81 35 649 6 4 1 15 5 675 9 Antiparasite 133 479 190 342 388 46 94 5 20 6 10 503 10 Antivirus 735 4219 990 3918 1611 90 196 8 36 6 14 4500 11 Cellcellsignaling 11 38 15 679 30 22 378 392 0 3 391 682 12 Cytokines_ growthfactors 46 66 64 24 30 11 3729 0 1 0 1 3760 13 Dipeptidyl peptidase inhibitor 0 58 2 169 50 459 6 4 1 158 6 459 14 Drugdelivery 11 108 11 1484 96 17 44 29 5 1 45 1484 15 Hemolytic 183 1299 342 1279 1045 61 111 0 51 1 2 1339 16 Neuropeptide 18 70 26 473 33 108 2101 552 2 4 3822 3926 17 Opioid 0 5 0 27 1 117 12 9 2 0 16 117 18 Peptidehormone 22 149 28 499 33 34 6943 495 1 2 2077 6943 19 Toxic 411 1746 523 3840 1416 310 2404 2 76 1 10 5793 Bioactivity classes were only created if 100 or more peptides were available for that class, and if the class had more than two peptides that were unique for that specific class. We believed that if the classes did not live up to these criteria, the classes would be to information-poor for a classifier to use. Some of the classes had major overlaps, while other classes had none ( Supplementary Fig. S2 ). The largest class was the Antibacterial class with 14 362 peptides and the smallest was the Opioid class with 117 peptides. Analyses to see how the defined bioactivity classes differed in terms of overall peptide lengths and amino acids distributions were made ( Supplementary Figs S3 and S4 ). Some of the bioactivity classes are very general (antimicrobial, peptide hormone, and antihypertensive) while other classes are more specific (ACE inhibitor, opioid, and dipeptidyl peptidase inhibitor). No negative class with peptides without bioactivity was defined, as the many different classes generated negative classes for each other. If required, the applied databases can be used to acquire more information about the classes.

05

Overall network architecture

MultiPep is a deep neural network implemented using Keras ( http://www.keras.io ) and Tensorflow [ 34 ] in Python. MultiPep’s architecture is inspired by a dendrogram from a hierarchical clustering of our defined bioactivity classes’ mutual overlaps ( Fig. 1A and Supplementary Fig. S2 ). The max-normalized class overlap values were clustered using the complete linkage type and Euclidean distance as metric. We used the hierarchical clustering algorithm provided by SciPy [ 35 ]. The cutoff for the clustering was based on SciPy’s default cutoff value. Based on the classes’ divisions, a dendrogram template that consisted of five lower level class clades with two, six, three, five, and four classes in each, respectively, was created. Above the class clades, upper levels that connected the lower levels until all class clades were amalgamated were made ( Fig. 1A ). Using the dendrogram template, our deep neural network was constructed ( Fig. 1B ). The network has “network class clades,” which consist of small convolutional neural networks (CNNs) and class-clade-specific output layers. See the section “Class-clade-specific convolutional neural networks” for a description of the CNNs. The network class-clade-associated output layers is connected to upper level output layers. Only the network class-clade-associated output layers (Output 4_1, Output 4_2, Output 4_3, Output 4_4, and Output 4_5 in Fig. 1B ) contain weights. These layers are regulated by the sigmoid activation function. The upper level output layers (Output 1, Output 2_1, Output 2_2, and Output 3 in Fig. 1B ) are concatenation layers that merge the two connected lower level output layers’ maximum output values. Table 3 shows which bioactivity classes the upper level output layers represent. To provide an example, if Output level_2_2_2 outputs 0.8 given a peptide, this peptides is predicted to belong to one of the bioactivity classes “antifreeze,” “cytokines/growth factors,” “antioxidative,” “drugdelivery,” “opioid,” “ACE inhibitor,” “antihypertensive,” “antidiabetes,” or “dipeptidyl peptidase inhibitor” by a score of 0.8. Table 3: Bioactivity classes represented by upper-level output layers Layer outputs Combined classes Output_1_1 Hemolytic, toxic, antiparasite, anticancer, antibacterial, antifungal, insecticides, antimicrobial, and antivirus Output_1_2 Cell–cell signaling, neuropeptide, peptide hormone, antifreeze, cytokines/growth factors, antioxidative, drugdelivery, opioid, ACE inhibitor, antihypertensive, antidiabetes, and dipeptidyl peptidase inhibitor Output_2_1_1 Hemolytic and toxic Output_2_1_2 Antiparasite, anticancer, antibacterial, antifungal, insecticides, antimicrobial, and antivirus Output_2_2_1 Cell–cell signaling, neuropeptide, and peptide hormone Output_2_2_2 Antifreeze, cytokines/growth factors, antioxidative, drugdelivery, opioid, ACE inhibitor, antihypertensive, antidiabetes, and dipeptidyl peptidase inhibitor Output_3_1 Antifreeze, cytokines/growth factors, antioxidative, drugdelivery, and opioid Output_3_2 ACE inhibitor, antihypertensive, antidiabetes, and dipeptidyl peptidase inhibitor

06

Class-clade-specific convolutional neural networks

The CNNs from the different network class clades have the same architecture, which is shown in Fig. 2 . Each of the CNNs are connected to the input layer. The input layer passes n input examples as a 3D tensor of size n × 4000 × 1 to seven 1D convolutional layers within the different CNNs. The input data are passed as a 3D tensor, as this is required by Keras’ 1D convolutional layers. See the section “Generating data for training and testing” for a more detailed explanation of the input data size. Figure 2: Architecture of network class-clade convolutional neural networks. All layers with dark-gray backgrounds (except for the input layer) have weights, whereas the layers with white backgrounds are mathematical operations or dropout layers. Below the name of each layer (except for the dropout layers), the sizes of the layers’ output tensors are written. Arrows show the flow of information and how the layers are connected. What constitutes the convolutional neural network (CNN) is encapsulated by a light-gray box. “Conv4” means a 1D convolutional layer with a kernel size of four one-hot-encoded amino acids. The same logic applies to the remaining convolutional layers. “Dense500” means a dense layer with 500 nodes. The convolutional layers have 40 kernels with no bias parameters and they convolve over 4, 6, 10, 16, 22, 30, and 40 one-hot encoded amino acids, respectively. Their strides are 20, which is the length of 1 one-hot encoded amino acids. The convolutional layers perform valid convolutions. The 3D tensors that are outputted by the convolutional layers are max pooled along their second axes. The max pooled convolutional output values are concatenated, such that a ( n × 280 ) tensor is created. Each of the n input examples are then normalized by a division of their summed absolute values. This is done to ensure that the network focuses on amino acid patterns and does not predict based on sequence length. Dropout with a rate = 0.2 is applied on the normalization layer. The normalization layer is connected to a chain of three dense layers with 500 nodes where dropout is applied on each of them (rate = 0.5). See the section “Training settings and network initialization” to read more about the used regularization techniques. In the top of Fig. 2 , the CNN is connected to an output layer. This could, for instance, be “CNN1” that is connected to “Output 4_1” in Fig. 1B . All layers with weights that are not output layers consist of Parametric Rectified Linear Units (PReLU) [ 36 ].

07

Generating data for training and testing

Models based on 10-fold cross-validation (CV) were trained. To do this, the peptide classes were divided into 10 bins of similar sizes. To distribute the peptides classes uniformly across the bins, all peptides belonging to all specific sets of classes were counted. If the count value was greater than 10, the peptides were split into the 10 bins, and if the count value was lower than 10, the peptides were uniformly distributed at random across the bins. Like in Reference [ 11 ], all peptide sequences were one-hot encoded before introduced to the MultiPep. All sequences were zero-padded until they reached a length of 200 one-hot encoded amino acids. Training sets, validation sets, and testing sets with a size distribution of 0.8, 0.1, and 0.1, respectively, were generated. For all peptide sequences, labels or targets were created. One indicated membership of a class and zero indicated that they were not members of the specific class.

08

Training settings and network initialization

The start parameters of the convolutional layers were initialized using orthogonal weights. The initial weights of the dense non-output layers were sampled from a uniform distribution with min = - 1 and max = 1 . The initial weights of the output layers were sampled from a uniform distribution with min = 0.001 and max = 0.05 . The bias parameters of the dense layers and output layers were initialized as zeros. The Adam [ 37 ] update function was used with a learning rate of 0.0005. The remaining hyper-parameters for the Adam update function were set to the Keras default values. The batch size was 128 and dropout with a rate of 20% and 50% on the normalization layer and the dense layers, respectively, was applied ( Fig. 2 ). As a final regularization technique, early stopping after 20 epochs was used. All data were shuffled before an epoch was started. To acquire optimal models, weights for the network class clades were saved whenever their performances on the validation set had improved. Additionally, the weights of the entire network were saved whenever the overall performance had increased. The early-stopping count-down was set to zero whenever the parameters of a “network class clade” were saved. The performance was measured in terms of loss.

09

The customized MCC loss function

To increase the loss penalty during training, the BCE loss function was extended with a customized version of the MCC. The MCC function is given by (1) MCC = TP · TN - FP · FN TP + FP TP + FN TN + FP ( TN + FN ) , where TP = true positives , TN = true negatives , FP = false positives , and FN = false negatives . MCC values range from −1 to 1, where −1 indicates complete disagreement and 1 indicates a perfect agreement [ 38 ]. Here, the MC C loss function for a single sigmoid output node is calculated by (2) T P * = ∑ i N y ^ i y i , (3) T N * = ∑ i N 1 - y ^ i ( 1 - y i ) , (4) F P * = ∑ i N y ^ i 1 - y i , (5) F N * = ∑ i N 1 - y ^ i y i , (6) MC C loss = 1 - T P * · T N * - F P * · F N * T P * + F P * T P * + F N * T N * + F P * T N * + F N * , where T P * , T N * , F P * , and F N * are modified under certain circumstances. If y i = 0 for all instances in a mini-batch, then T P * is set to be minimum 1. If y i = 1 for all instances in a mini-batch, then T N * is set to be minimum 1. If y ^ i = 0 for all instances in a mini-batch, then F P * is set to be minimum 1. If y ^ i = 1 for all instances in a mini-batch, then F N * is set to be minimum one. These modifications are done to avoid that the denominator of the MC C loss function becomes zero. Again, with a single sigmoid output node, the loss for a single predicted mini-batch ranges between 0 and 2. Zero is given to a perfectly predicted mini-batch and two is given to perfectly miss-predicted mini-batch that contains both y i = 1 and y i = 0 instances. In the situation where y i = 1 and y ^ i = 0 or y i = 0 and y ^ i = 1 for all instances in a mini-batch, the loss can never reach exactly 2 due to the T P * and T N * definitions. So, to avoid “losing” loss while training, the loss is set to 2 if | ∑ i N y i - ∑ i N y ^ i | = N , where N is the number of examples in the mini-batch. In the situation where y i = 1 and y ^ i = 1 or y i = 0 and y ^ i = 0 for all instances in a mini-batch, the loss can never reach exactly 0 due to the F P * and F N * definitions. To circumvent this, the MC C loss was implemented such that its outputs are multiplied by 0 if | ∑ i N y i - ∑ i N y ^ i | = 0 .

10

The complete loss function with MC C loss

The entire loss function for a single output node given all instances of mini-batch is defined as (7) Loss = - 1 N ∑ i = 1 N [ y i · log ⁡ y ^ i + 1 - y i · log ⁡ 1 - y ^ i ] + MC C loss , where N is the number of examples in the mini-batch. The loss function can be seen as two-step function where the MC C loss values are calculated for the entire batch in the first step and then added to the BCE loss values, which then are averaged. During the updating steps while training, the loss is summed across all output nodes.

11

Loss function with class weights

Models using class weights were trained, which is a standard way of training models on data with imbalanced class-size distributions [ 39 ]. The class weights were based on the data from the training sets and were generated using the “compute_class_weight” function from Scikit-learn [ 40 ]. This means that a class weight, C l j , was generated for every j class. While training, the loss of a single output node was calculated using (8) Los s Cl = - 1 N ∑ i = 1 N y i · log ⁡ y ^ i + 1 - y i · log ⁡ 1 - y ^ i · C l j , where N is the number of examples in the mini-batch and C l j is the class weight of the class in question. During the updating steps while training, the loss is summed across all output nodes.

12

Webtool and stand-alone-program

A Numpy [ 41 ] version of MultiPep based on the trained weights was implemented. It is available on https://agbg.shinyapps.io/MultiPep/ . Users can choose between using the average of all CV models’ prediction as the final output or to get the max of all CV models’ predictions of the different classes as the final output. This is true for the webtool and the stand-alone-program. The webtool rounds the predictions to three decimals, while the stand-alone-program provides a higher number of decimals. The stand-alone-program is available at https://github.com/scheelelab/MultiPep . If predicting more than 500 in one batch of peptide sequences, we recommend using the stand-alone-program.

13

Performance metrics

The performance of MultiPep and peptide predictors compared with MultiPep was evaluated using the following metrics: (9) MCC = TP · TN - FP · FN TP + FP TP + FN TN + FP ( TN + FN ) , (10) F 1 score = 2 T P 2 T P + FP + FN , (11) Precision = TP TP + FP , (12) Recall = TP TP + FN , (13) Accuracy = TP + TN TP + TN + FP + FN , where TP = true positives , TN = true negatives , FP = false positives , and FN = false negatives .

14

Results

Architecture rationale We designed the architecture of MultiPep with inspiration from a dendrogram from a hierarchical clustering of our defined bioactivity classes’ mutual overlaps. The performed clustering of the max-normalized class overlap values provided a proxy for the classes’ peptide similarities. We hypothesized that the network would make more exact predictions when we grouped similar classes together and connected them to the same CNN. The CNNs would then be forced to learn how to distinguish between similar peptide patterns using the same network parameters. Further, the hierarchical structure of MultiPep ensures that an extra penalty is added while training, if a peptide is predicted to be, for example, in a wrong class clade or if the output layer of the network class clade produces only false negatives or positives. This happens because a, for example, false positive in an erroneous class clade will propagate through all connected upper output layers and thus produce more loss via these layers.

15

MCC loss function rationale

It has been suggested that the MCC function is a good performance metric for imbalanced data sets [ 42 ]. Thus, inspired by the work in Abhishek and Hamarneh [ 43 ], we found it intriguing to use a customized version of MCC function as parameter-free loss function for our data set with an imbalanced class–size distribution. However, the MCC function as is was in our opinion not suited for a classifier that, while training, iteratively looks at mini-batches of training examples. Especially not when training on data sets with small classes. For example, the MCC function will include division by 0, if for example, only true negatives or true positives are predicted in a mini-batch ( Equation 6 ). Our loss function is a combination of the standard BCE and our MC C loss function, as we found that this combination yielded the best results. In this context, the MC C loss function produces extra penalties for the predictions with a low MCC score.

16

Model performances

Using data derived from multiple peptide databases, we conducted 10-fold CVs and tested the models on their associated test sets. We trained several models using two different loss functions. The neural network architecture and all hyper-parameters and regularization settings were always the same. First, we trained models using our loss function with the MC C loss penalty, where we applied our training scheme that saved the parameters of the best “network class clades” whenever performance on the validation sets increased. Secondly, while training these models, we saved the parameters of the entire network whenever the overall performance had increased. Thirdly, we ran a 10-fold CV using the weighted BCE loss function. We used the training scheme, where the parameters of the “network class clades” were saved whenever their performances increased and the training scheme where we saved all network parameters whenever the overall performance had increased. This produced 4 × 10 different models, which we used for a comparative study to find the best way of training the network. Tables 4 and 5 show the mean and standard deviations of the performances of the trained models on the CV test sets using MCC and the F1 score as performance metrics. In the tables, the mean values have been compared and the standard deviations are stated to provide readers with an overview of the robustness of the models. Similar tables showing the performance on the test sets, but using precision, recall, and accuracy as performance metrics are available in the Supplementary Information ( Supplementary Tables S2–S4 ). Additionally, plots of receiver operating characteristic (ROC) curves of MultiPep models’ predictions of the test sets can be found in the Supplementary Information (Supplementary Figs S5–S14). Tables showing the performance on the validation sets can as well be found in the Supplementary Information ( Supplementary Tables S5–S9 ). The classification threshold was 0.5 and all mean values in the tables have been rounded to three decimals, whereas all standard deviation values have been rounded to two decimals. We did not use the upper layers’ predictions for the comparisons, as we were only interested in finding how the models performed on the bioactivity classes. The performances on the test sets and validation sets of the individual CV models trained using the “save the individual network class-clade” training scheme can be found in Supplementary Tables S10–S29 . Table 4: Mean and standard deviation of MCC of CV models on test sets Bioactivity class/output name BCE + MC C loss BCE + MC C loss − overall lowest loss Weighted BCE Weighted BCE—overall lowest loss Output 1_1 0.839 ± 0.01 0.843 ± 0.01 0.803 ± 0.01 0.799 ± 0.01 Output 1_2 0.856 ± 0.01 0.859 ± 0.01 0.81 ± 0.02 0.847 ± 0.01 Output 2_1_1 0.782 ± 0.01 0.788 ± 0.01 0.764 ± 0.02 0.778 ± 0.01 Output 2_1_2 0.849 ± 0.01 0.853 ± 0.01 0.813 ± 0.01 0.802 ± 0.01 Output 2_2_1 0.895 ± 0.01 0.894 ± 0.01 0.875 ± 0.01 0.878 ± 0.01 Output 2_2_2 0.826 ± 0.01 0.83 ± 0.01 0.751 ± 0.02 0.811 ± 0.01 Output 3_1 0.798 ± 0.01 0.798 ± 0.01 0.711 ± 0.03 0.787 ± 0.02 Output 3_2 0.766 ± 0.03 0.765 ± 0.02 0.749 ± 0.02 0.751 ± 0.02 Hemolytic 0.531 ± 0.04 0.539 ± 0.03 0.514 ± 0.03 0.52 ± 0.03 Toxic 0.785 ± 0.01 0.789 ± 0.01 0.768 ± 0.02 0.782 ± 0.01 Antimicrobial 0.675 ± 0.01 0.674 ± 0.01 0.617 ± 0.01 0.587 ± 0.01 Antivirus 0.611 ± 0.02 0.614 ± 0.02 0.593 ± 0.01 0.569 ± 0.03 Antiparasite 0.281 ± 0.05 0.173 ± 0.1 0.299 ± 0.06 0.289 ± 0.07 Anticancer 0.51 ± 0.03 0.488 ± 0.02 0.513 ± 0.03 0.47 ± 0.03 Antibacterial 0.702 ± 0.01 0.703 ± 0.01 0.659 ± 0.01 0.644 ± 0.01 Antifungal 0.487 ± 0.02 0.471 ± 0.02 0.411 ± 0.03 0.335 ± 0.06 Cell–cell signaling 0.553 ± 0.02 0.552 ± 0.02 0.544 ± 0.03 0.548 ± 0.04 Neuropeptide 0.78 ± 0.02 0.771 ± 0.02 0.74 ± 0.02 0.757 ± 0.03 Peptide hormone 0.845 ± 0.01 0.844 ± 0.01 0.82 ± 0.01 0.828 ± 0.01 Antifreeze 0.966 ± 0.04 0.976 ± 0.03 0.976 ± 0.03 0.977 ± 0.02 Cytokines/growth factors 0.932 ± 0.01 0.936 ± 0.01 0.851 ± 0.02 0.923 ± 0.01 Antioxidative 0.467 ± 0.06 0.461 ± 0.06 0.378 ± 0.09 0.483 ± 0.07 Drugdelivery 0.598 ± 0.05 0.588 ± 0.04 0.248 ± 0.13 0.556 ± 0.05 Opioid 0.701 ± 0.11 0.676 ± 0.11 0.67 ± 0.11 0.733 ± 0.1 ACE inhibitor 0.511 ± 0.02 0.511 ± 0.02 * 0.502 ± 0.03 0.498 ± 0.03 Antihypertensive 0.668 ± 0.03 0.664 ± 0.02 0.652 ± 0.02 0.655 ± 0.02 Antidiabetes 0.596 ± 0.03 0.58 ± 0.03 0.587 ± 0.03 0.587 ± 0.03 Dipeptidyl peptidase inhibitor 0.564 ± 0.05 0.564 ± 0.04 0.557 ± 0.03 0.559 ± 0.08 Average of bioactivity classes: 0.638 0.629 0.595 0.615 Best total 8 7 2 3 Notes : Bold values indicate the model with the best performance. The asterisk symbols indicate that more than three decimals are needed to reveal the highest values. At the two bottom rows, “Best total” shows the number of times the average of the models are the best and “Average of bioactivity classes” shows the average of the columns above, but only for the bioactivity classes. Table 5: Mean and stand

17

Comparisons against state-of-the-art binary classifiers

We next aimed to compare MultiPep against existing state-of-the-art binary bioactivity classifiers. We could not find a range of tools necessary for comparing all bioactivity classes predicted by MultiPep. Thus, we chose a few different classifiers based on their performances and bioactivity focus ( Tables 1 and 6 ). We only used the tools’ pre-trained models available on the tools’ webpages or on GitHub ( Table 6 ). As many of the chosen classifiers did not take peptides of length 2–200, it was necessary to create different sets of peptide sequences that were compatible with the tools’ input formats (see Supplementary Table S30 for sizes of peptide sets). Table 6: Comparisons against state-of-the-art peptide bioactivity predictors MCC F1 score Precision Recall Accuracy MultiPep 0.804 0.91 0.983 0.848 0.897 Neural network by Veltri et al . [ 8 ], link 0.526 0.836 0.773 0.909 0.78 MultiPep 0.813 0.94 0.984 0.9 0.917 Deep-AmPEP30, link 0.657 0.9 0.92 0.88 0.858 RF-AmPEP30, link 0.712 0.914 0.94 0.889 0.879 MultiPep 0.604 0.631 1.0 0.461 0.824 mACPpred, link 0.459 0.653 0.512 0.9 0.688 MultiPep 0.879 0.892 0.959 0.833 0.973 PredNeuroP, link 0.698 0.722 0.579 0.959 0.901 MultiPep 0.677 0.77 0.997 0.627 0.817 ToxinPred, link 0.567 0.687 0.943 0.54 0.76 MultiPep 0.724 0.725 1.0 0.569 0.928 HLPred-Fuse, link 0.435 0.511 0.355 0.908 0.708 MultiPep 0.588 0.552 1.0 0.381 0.913 AnOxPePred, link 0.29 0.394 0.377 0.413 0.822 Notes : Bold values indicate that the performance is better than the compared tool(s). All values in the table have been rounded to three decimals. Links to the webtools/github of the tools are inserted next to the names of the tools in the first column from the left. For the tool by Veltri et al. [ 8 ], Deep-AmPEP30 and RF-AmPEP30, the positive peptides were from the “antibacterial” class whereas the negative peptides were drawn from the classes “cell–cell signaling,” “neuropeptide,” and “peptide hormone.” For mACPred, ToxinPred, HLPpred-Fuse, and AnOxPePred, the positive peptides were from the classes “anticancer,” “toxic,” “hemolytic,” and “antioxidative,” respectively, and the negative peptides were as well drawn from the classes “cell–cell signaling,” “neuropeptide,” and “peptide hormone.” For PredNeuroP, we sampled positive peptides from the “neuropeptide” class and negative peptides from the classes “antiparasite,” “anticancer,” “antibacterial,” “antifungal,” “antimicrobial,” and “antivirus.” For the tool by Veltri et al. [ 8 ], we used their newest model for the predictions. For ToxinPred, we used the tool’s Swiss-Prot-based SVM model. For AnOxPePred, we used the tools “Peptide Mode” and only used its “Free Radical Scavenger”-prediction mode. Additionally, we compared MultiPep with THPep, a tool specialized in finding tumor homing peptides [ 44 ]. Here, the positive peptides were from the class “anticancer” and the negative peptides were from the classes “cell–cell signaling,” “neuropeptide,” and “peptide hormone.” For all peptide sets created, we ensured that there were no overlaps between the positive and negative peptides. We did not make any effort to create balanced data sets, as this should not matter for the comparisons. For the comparisons, we used the MultiPep model trained using our “save the individual network class-clade” approach that produced the lowest overall loss on the test set ( Supplementary Tables S15 and S25 ). All peptides from the above-described peptide sequence sets were drawn from the model’s test set, as the peptides in this set contain unseen peptides that have not been part of model’s training. We did not consider if the sequences in the smaller sequence sets were part of the different tools’ training sets. All peptide sequence sets are available on https://github.com/scheelelab/MultiPep . We observed that MultiPep in general outperforms existing peptide prediction tools when subject to our model’s test set ( Table 6 and Supplementary Table S31 ). While the recall ability of MultiPep is bested by the other tools except for Deep-AmPEP30, RF-AmPEP30, and ToxinPred, MultiPep consistently outperforms the other tools on the MCC score, precision, and accuracy.

18

Comparing prediction error of MultiPep against PeptideRanker

As an additional analysis, we compared MultiPep with PeptideRanker. PeptideRanker is a binary classifier that can predict general bioactivity of peptides. Thus, we found it interesting to compare MultiPep against this tool. Using PeptideRanker, we predicted all peptides of the test set of the model with the lowest loss on the test trained using our “save the individual network class-clade” training scheme ( Supplementary Tables S15 and S25 ). The set contained 4585 peptides and we did not consider whether any of the used peptides from out test set were part of the training data of PeptideRanker. For both MultiPep and PeptideRanker, instead of using peptides with no bioactivity, we used all peptides of the test set as true positives and found the error of the tool’s predictions. We calculated the error of PeptideRanker’s predictions by ∑ i = 1 N 1 - round p N , where N is the number of examples in the test set, p is the predictions, and round ( ) is a function that rounds values to their nearest integer. The rounding function implies that the threshold for the predictions was 0.5. For MultiPep, we took the predictions on the Output_1 output layer and found how they diverged from the labels of that layer. We calculated ∑ i = 1 N ∑ j = 1 k | y j - round ( p j ) | N , where N is the number of examples in the test set, is an element of the row vector y with labels, is an element of the row vector p with predictions, and round ( ) is a function that rounds values to their nearest integer. Also, to make a more direct estimate of whether MultiPep can assign a prediction above 0.5 to bioactive peptides, we calculated an additional error percentage by ∑ i = 1 N 1 - round max ( p ) N , where N and p are the same as above and max ( ) is a function that finds the maximum value of the row vector p . With this approach, we calculated the prediction errors ( Table 7 ). This analysis shows that MultiPep, with a low prediction error, successfully identifies a higher number of bioactive peptide from a list of peptides with known bioactivity. Table 7: Prediction error of MultiPep compared with PeptideRanker Program names Rounded prediction error MultiPep (Output_1, max ) 0.074 MultiPep (Output_1) 0.140 PeptideRanker 0.299 Note : Final prediction errors rounded to three decimals.

19

MultiPep versus peptipedia

Peptipedia is a tool that can predict many different bioactivity classes simultaneously. The authors describe how they have trained 44 different RF models that each can predict belonging to a specific bioactivity class [ 27 ]. As described for the comparisons above, we used the MultiPep model trained using our “save the individual network class-clade” approach that produced the lowest overall loss on the test set ( Supplementary Tables S15 and S25 ). We used the test set of this model to generate a test-subset that was used as input to the Peptipedia models ( Supplementary Table S30 ). Noteworthy, as the Peptipedia webtool was unavailable during the writing of this article, we retrieved the Peptipedia predictions via personal communication with the authors of the article presenting Peptipedia [ 27 ]. In the prediction data received by the Peptipedia authors, 41 bioactivity classes were included. The Peptipedia prediction is available on https://github.com/scheelelab/MultiPep . Both MultiPep and Peptipedia can predict many different bioactivity classes. Supplementary Table S32 presents an overview of classes that were deemed comparable and incomparable, which was based on the name of the classes. For this grouping of the classes, it was taken into account how the MultiPep classes were generated ( Supplementary Information ). If a MultiPep bioactivity class was deemed comparable with more than one Peptipedia bioactivity class, the predictions of these Peptipedia classes were merged such that, for every predicted peptide, the max value of the predictions was used. Comparable bioactivity classes are expected to contain similar and identical peptides. Table 8 shows how MultiPep and Peptipedia performed on the generated test-subset. With the exception of the recall score for the antiparasite and the antifungal classes, we observed that MultiPep outperforms Peptipedia on all performances for predicting the compared bioactivity classes. Table 8: Performance of MultiPep and Peptipedia MCC F1 score Precision Recall Accuracy Bioactivity classes M P M P M P M P M P Hemolytic 0.599 0.023 0.609 0.064 0.655 0.044 0.569 0.115 0.977 0.895 Toxic 0.785 −0.052 0.798 0.172 0.952 0.115 0.686 0.341 0.952 0.551 Antimicrobial 0.689 0.065 0.791 0.48 0.806 0.355 0.777 0.74 0.863 0.463 Antivirus 0.634 0.021 0.647 0.171 0.834 0.116 0.528 0.324 0.939 0.666 Antiparasite 0.281 0.039 0.269 0.031 0.409 0.016 0.2 0.533 0.988 0.647 Anticancer 0.563 0.08 0.564 0.142 0.746 0.086 0.453 0.419 0.961 0.719 Antibacterial 0.721 0.254 0.811 0.497 0.775 0.477 0.851 0.519 0.876 0.671 Antifungal 0.493 0.211 0.556 0.32 0.544 0.214 0.569 0.637 0.889 0.668 Neuropeptide 0.802 0.249 0.82 0.296 0.815 0.377 0.825 0.243 0.967 0.896 Drugdelivery 0.653 0.111 0.65 0.143 0.804 0.133 0.545 0.154 0.98 0.937 Antihypertensive 0.679 0.0 0.672 0.0 0.557 0.0 0.848 0.0 0.982 0.978 Antidiabetes 0.641 0.046 0.646 0.05 0.697 0.097 0.602 0.034 0.986 0.973 Notes : Bold values indicate that the performance is better than the compared tool. All values in the table have been rounded to three decimals. M, MultiPep; P, Peptipedia.

20

MultiPep models versus logistic regression models

We next aimed to compare our MultiPep deep learning models with linear logistic regression models. Therefore, we trained 10 logistic regression models using our peptide data and our 10-fold CV scheme. We used the logistic regression functionalities provided by the Python package Scikit-learn [ 40 ]. We initialized the following hyper-parameters as follows: random_state = 1234, max_iter = 1000, and class_weight = “balanced.” The remaining hyper-parameters were set to their default values. The peptides were one-hot encoded before introduced to the logistic regression models. When comparing the performances of the two model types, it can be seen that MultiPep outperforms the logistic regression models ( Tables 4 and 5 and Supplementary Tables S2–S9 , S33 , and S34 ). Nevertheless, the logistic regression models seem to have a generally better recall performance.

21

Prediction of FDA-approved therapeutic peptides

We wanted to test utility of MultiPep to classify FDA-approved therapeutic peptides. Thus, we generated a list of peptide sequences from THPdb [ 45 ], containing FDA-approved peptides. We ensured that none of the peptides were part of our training, validation, or test sets before we predicted. We predicted the example peptides from THPdb as is and did not use any additional information ( Table 8 ). For the predictions, we used the average of all 10 CV models trained using our “save the individual network class-clade” training scheme. However, we also predicted the therapeutic peptides using the max score of the CV models ( Supplementary Table S35 ). The threshold for shoving prediction scores was set to 0.15 ( Table 9 and Supplementary Table S35 ). Table 9: Prediction of FDA-approved therapeutic peptides Therapeutic peptides Predictions Aprotinin (Trasylol, Bayer Pharmaceuticals, Th1158 ) Antibacterial: 0.486 Antimicrobial: 0.627 Glatiramer acetate (Copaxone, Teva Pharmaceutical Industries, Th1113 ) Antimicrobial: 0.815 Antivirus: 0.232 Lucinactant ( Surfaxin , Discovery Laboratories , Th1146 ) Hemolytic: 0.631 Toxic: 0.321 Antibacterial: 0.977 Antimicrobial: 0.276 Pramlintide (Symlin, AstraZeneca, Th1100 ) Cell–cell signaling: 0.430 Neuropeptide: 0.995 Peptide hormone: 0.997 Liraglutide (Saxenda, Novo Nordisk, Th1124 ) Cell–cell signaling: 0.225 Neuropeptide: 0.893 Peptide hormone: 0.982 Teduglutide (Gattex, NPS Pharmaceuticals, Th1137 ) Cell–cell signaling: 1.0 Neuropeptide: 0.999 Peptide hormone: 1.0 Bivalirudin (Angiomax, The Medicines Company , Th1006 ) Neuropeptide: 0.690 Peptide hormone: 0.806 Sermorelin (Sermorelin acetate, Emd serono inc ., Th1157 ) Neuropeptide: 0.812 Peptide hormone: 0.997 Metreleptin (Myalept, Amylin Pharmaceuticals, Th1208 ) Neuropeptide: 0.873 Peptide hormone: 1.0 Corticotropin (H.P. Acthar, Questcor Pharmaceuticals, Th1104 ) Neuropeptide: 0.839 Peptide hormone: 0.999 Thymalfasin (Zadaxin, SciClone Pharmaceuticals, Th1110 ) Neuropeptide: 0.173 Peptide hormone: 0.567 Aldesleukin (Proleukin, Chiron Corp., Th1036 ), Anakinra (Kineret, Amgen Inc., Th1023 ), Filgrastim (Neulasta, Amgen Inc., Th1082 ), Interferon beta-1b (Betaseron, Bayer, Th1057 ), Interferon alfacon-1 (INFERGEN, Kadmon Pharmaceuticals, Th1058 ), Interferon gamma-1b (Actimmune, InterMune Inc., Th1030 ), Peginterferon alfa-2a (Pegasys, Hoffman-La Roche Inc., Th1008 ), Oprelvekin (Neumega, Genetics Institute Inc., Th1033 ), Palifermin (Kepivance, Amgen Inc., Th1034 ), and Sargramostim (Leucomax, Novartis, Th1017 ) Cytokines/growth factors: 1.0 Notes : The names of the therapeutic peptides are written in Column 1 together with the peptides’ brand names, designer company, and a link to the peptides in THPdb. The second column contains MultiPep’s predictions. Predictions in bold are above threshold at 0.5. The predictions are based on the average of all 10 CV models. The threshold for showing prediction scores is 0.15. Aprotinin is an inhibitor of plasmin, trypsin, chymotrypsin, kallikrein, thrombin, and activated protein C that is used to control bleeding during surgery [ 46 ]. However, MultiPep predicts that it has broad-spectrum antimicrobial properties, which is consistent with additional research on the effects of Aprotinin [ 47 , 48 ]. Glatiramer acetate is a drug that is used to treat relapsing-remitting multiple sclerosis and has an average of 40–100 residues [ 49 ]. Interestingly, it has been found that Glatiramer acetate has efficient antibacterial properties [ 50 ], which is in line with the predictions of MultiPep. Moreover, MultiPep detected minor antiviral properties of Glatiramer acetate. However, this needs to be verified. Lucinactant is used to treat respiratory distress syndrome in infants. It contains two phospholipids and a high concentration of sinapultide (also known as KL-4), a synthetic peptide designed to have similar activity to surfactant protein B [ 51 , 52 ]. MultiPep classifies the peptide sinapultide as an antibacterial and hemolytic peptide with minor antimicrobial and toxic properties. Though Lucinactant is to be considered safe compared with alternative surfactants, it is associated with certain side-effects, such as transient pallor, dose interruption, and endotracheal obstruction [ 51 ], which may explain the toxic profile of sinapultide predicted by MultiPep. Further studies are needed to verify, if sinapultide has antibacterial properties. Pramlintide is used for treatment of insulin-using patients with type 2 or type 1 diabetes mellitus [ 53 ]. Pramlintide is an analog of the neuroendocrine hormone amylin and it works via similar mechanisms [ 53 ]. To this end, MultiPep detected high association with the classes “neuropeptide” and “peptide hormone” which indicates that MultiPep was able to capture the general mechanism of action the peptide-drug. Liraglutide, Teduglutide, and Metreleptin are analogs of GLP-1, GLP-2, and leptin, respectively [ 54–56 ]. They have all been predicted t

22

Discussion

In this work, we construct and demonstrate the utility of a new multi-label classifier, MultiPep. Moreover, we test our novel loss function where BCE synergistically is merged with a customized version of the MCC function. Our results suggest that using our loss of function, instead of weighted BCE, is beneficial when training using large multi-label datasets containing an imbalanced class–size distribution. Whether this is a general tendency needs to be verified. Additionally, we use an innovative neural network architecture and a new training scheme to find optimal network parameters. We demonstrate that our novel training scheme on average produces models that are better than when saving all parameters of a network simultaneously whenever performance has increased. Further, we show that MultiPep on our data surpasses state-of-the-art peptide bioactivity classifiers, and that it can predict the general bioactivities of FDA-approved therapeutic peptides. It should be noted that the comparisons with the other tools were made using our data only; thus, the results do not suggest that MultiPep is a better tool as such. If the classifiers were trained on the same data the outcome might have been different. The results demonstrate that MultiPep is better on the currently used data. This was consistent with our goal for the comparisons, namely to show how the different pre-trained models performed on a previously unseen dataset. MultiPep is designed to be a tool that can grant scientists an overview of peptides’ bioactivities. It does not offer a wealth of additional predictive information like other peptide bioactivity classifiers [ 9 , 12–14 ]. Thus, we suggest that MultiPep can be used in a pipeline, where the predictions of MultiPep can be further evaluated using tools that have more narrow and specialized classification foci. All in the interest of the progress of research. MultiPep is not the only tool that simultaneously can predict more bioactivity classes of peptides. Previous tools in the literature, PEPred-Suite and Peptipedia, also predicts multiple classes for peptides. The Peptipedia models performed rather poorly on the applied test-subset. As implied above, the Peptipedia models were trained on other data, and their bioactivity classes were probably defined differently. For the comparison, we deemed that a range of MultiPep bioactivity classes and Peptipedia bioactivity classes were comparable ( Supplementary Table S32 ). We assumed that when the class names were similar or identical, then predictions of models trained on these classes would be somewhat overlapping. This did not seem to be the case. So, are the MultiPep and Peptipedia bioactivity classes just incomparable and of different origin? Though MultiPep’s performance was better, the binary classifiers that were tested against MultiPep ( Table 6 ) performed reasonably on our test subsets with our defined bioactivity classes. This indicates that MultiPep and these models agree on the definition of the used classes and that these bioactivity classes and included peptides contain essential class-specific features. Altogether, this suggests that the problem does not lie with the bioactivity classes defined in the data presented in this work. In addition, the peptide data used by MultiPep and Peptipedia stem from a number of identical databases [ 27 ]. Therefore, it is very unlikely that MultiPep’s and Peptipedia’s defined bioactivity classes are of completely different origin. The difference in performance might rather be explained by the algorithms used to create the tools. PEPred-Suite and Peptipedia utilize individual RF models to predict bioactivity classes of peptides. MultiPep on the other hand uses a single deep learning-based model with sub-models to predict bioactivities of peptides. In other words, PEPred-Suite and Peptipedia solve many binary classification problems, whereas MultiPep turns it into a multi-label classification problem. Another difference is that PEPred-Suite and Peptipedia use elaborate techniques to encode the peptide sequences, and MultiPep uses a simple one-hot encoding. Many peptide encoding techniques exist and peptide encoding in general is a field that is gaining a lot of attention [ 77 , 78 ]. PEPred-Suite uses adaptive feature representation strategy, where they, among other things, use 10 feature encoding algorithms, which together efficiently capture local and global compositional information and well as position-specific residue information and physiochemical information [ 26 ]. Peptipedia encodes peptides using representations of physicochemical properties and transforms them using Fourier transforms [ 27 , 79 ]. Although some general selection rules have been suggested, it is difficult to find a single universally optimal peptide encoding technique [ 77 , 78 ]. As it has been found that deep learning models require little encoding for the classification process [ 77 ], we chose to use a simple and, in

23

Supplementary data

Supplementary data are available at Biology Methods and Protocols online.

24

Data availability

All data in this study are available either through https://github.com/scheelelab/MultiPep or via direct communication with the authors. The webtool is available at https://agbg.shinyapps.io/MultiPep/ .

25

Code availability

Data and scripts for generating input data for MultiPep, the MultiPep stand-alone program, all MultiPep versions and parameters used in this article, as well as sequences predicted by the compared bioactive predictors are available at https://github.com/scheelelab/MultiPep .

26

Author contributions

C.S. and A.G.B.G. conceived and designed the project. MultiPep was implemented, trained, and tested by A.G.B.G. The MultiPep stand-alone program and webtool were implemented by A.G.B.G. All analyses and comparisons were carried out by A.G.B.G. All authors read and contributed to the manuscript.

27

Funding

Novo Nordisk Foundation Center for Basic Metabolic Research is an independent Research Center, based at the University of Copenhagen, Denmark and supported by an unconditional donation from the Novo Nordisk Foundation ( www.cbmr.ku.dk ) (Grant number NNF18CC0034900).

28

Supplementary Material

bpab021_Supplementary_Data Click here for additional data file.

Article Details
DOI10.1093/biomethods/bpab021
PubMed ID34909478
PMC IDPMC8665375
JournalBiology Methods and Protocols
Year2021
AuthorsAlexander G. B. Grønning, Tim Kacprowski, Camilla Schéele
LicenseOpen Access — see publisher for license terms
Citations37