Abstract
Background
Total body clearance of biological drugs is for the most part dependent on the receptor mechanisms (receptor mediated clearance) and the concentration of antibodies aimed at administered drug – anti-drug-antibodies (ADA). One of the significant factors that induces the increase of ADA level after drug administration could be the aggregates present in the finished product or formed in the organism. Numerous attempts have been made to identify the sequence fragments that could be responsible for forming the aggregates – aggregate prone regions (APR).
Purpose
The aim of this study was to find physiochemical parameters specific to APR that would differentiate APR from other sequences present in therapeutic proteins.
Methods
Two groups of amino acid sequences were used in the study. The first one was represented by the sequences separated from the therapeutic proteins (n = 84) able to form APR. A control set (CS) consisted of peptides that were chosen based on 22 tregitope sequences.
Results
Classification model and four classes (A, B, C, D) of sequences were finally presented. For model validation Cooper statistics was presented.
Conclusions
The study proposes a classification model of APR. This consists in a distinction of APR from sequences that do not form aggregates based on the differences in the value of physicochemical parameters. Significant share of electrostatic parameters in relation to classification model was indicated.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
Background
Therapeutic proteins are one of the fastest developing types of drug. A pharmacokinetic profile as well as the effect of biological drugs are the result of many complex interactions with the immune system (Mould and Green 2010; Dostalek et al. 2012). One of the key pharmacokinetic parameters of biological drugs is clearance (EMA 2012). Total body clearance of biological drugs is for the most part dependent on the receptor mechanisms (receptor mediated clearance) and the concentration of antibodies aimed at the administered drug – anti-drug-antibodies (ADA) (Datta-Mannan et al. 2007; Wang and Chow 2010). ADA are produced in the organism as a response to most of the biological drugs including humanized molecules and completely human monoclonal antibodies. The issue of immune response to biological drugs is treated by the health authorities (FDA 2014). Apart from binding (binding antibodies), ADA can also neutralize drugs present in the organism (neutralizing antibodies) (Hsu et al. 2014). One of the significant factors that induces the increase of ADA level after drug administration could be the aggregates present in the finished product or formed in the organism (Chennamsetty et al. 2009; Wang et al. 2009). Protein aggregations, depending on their structure, can exhibit different immunogenicity. Even a slight quantity of formed aggregations after an administration of biological drug may induce a significant increase of ADA level (Rosenberg 2006). Insufficient knowledge about the possibility of aggregation process induction at the stage of drug design may endanger safety and efficacy of biological drug application in the clinical phase. However, the impact of aggregation on safety during biological drug research applications may often be difficult to predict. The reason for this is the specificity of the immune response. In extreme cases the dynamics of this response are extraordinarily high and the equivalent is not seen in small molecules.
Numerous attempts have been made to identify the sequence fragments of different proteins that could be responsible for forming the aggregates (Pawar et al. 2005; Tartaglia et al. 2005). In the case of therapeutic proteins, it is known that these fragments are short, hydrophobic sequences (aggregation prone regions – APR) that in favorable conditions initiate the aggregation process (Wang et al. 2009). A large number of APR in protein sequences could be connected with higher ability to form aggregates in vivo. This, in turn, can have a significant impact on the concentration of free form drug in blood and number of side effects (Rosenberg 2006). The aggregates in finished drugs are usually identified with the use of physicochemical methods such as: size exclusion chromatography, analytical ultracentrifugation, electrophoresis, light scattering etc. However, these methods have some limitations (Tatkiewicz et al. 2015). They are not always sensitive enough or they do not determine the share of aggregates of various structures in a single sample. Most of the commonly used methods allow the determination of hydrodynamic size or the molecular weight of an aggregate. These measurements ascertain or exclude the presence of aggregates only after their formation. They do not identify a danger connected with the ability of determined protein to form the aggregates. Hence, the significant tool to assess the risk relating to the ability of forming the aggregates is in silico analysis (Agrawal et al. 2011). Currently, different kinds of software based on phenomenological methods, statistical models, Monte Carlo simulations, scoring matrices, decision trees, Bayesian models etc. (Wang et al. 2009; Tsolis et al. 2013) are used to find APR in the sequences of therapeutic proteins.
The aim of this study was to find physiochemical parameters specific to APR that would differentiate APR from other sequences present in therapeutic proteins.
Methods
Sequences selection
Two groups of amino acid sequences were used in the study. The first one was represented by the sequences separated from the therapeutic proteins (n = 84) able to form aggregation bridges – APR (Wang et al. 2009) (Table 1; sequences 1–84). A control set (CS) consisted of peptides that were chosen based on 22 tregitope sequences (Epivax Inc. 2007).
The shortest sequences in the APR group consisted of 5 amino acids. 46.4 % of APR sequences consisted of only 5 amino acids. A CS was created also based on the peptides with the length of 5 amino acids. From each tregitope (n = 22) two sequences were chosen (Table 1; sequences 85–106). The first one was made of the first five amino acids of each tregitope (1–5) and second one was made of the next consecutive five amino acids of each tregitope (6–10). This way, CS sequences (n = 44) of the length of 5 amino acids each were obtained. One of the sequences from the CS was removed from the analysis (VVSVL). This sequence was the same as one of the APR sequences. Another one (VSWYQ) was also removed from the CS group as a result of double selection from the group of tregitopes during conducted procedures. This way, the final number of CS sequences was 42 (Table 1, sequences 107–148).
Tregitopes were used to build CS as they are short amino acid sequences present in the structure of many therapeutic proteins. After protein internalization, these sequences are responsible for the modulation of an immune response by influencing the regulatory T cells. The effect of tregitope presentation by MHC-II is a tolerogenic action (De Groot et al. 2013). The presence of tregitopes in therapeutic protein sequences (except vaccines) is a desired element considering the suppression of immune response in relation to the administered protein.
Physicochemical parameters calculations
In the first phase of physicochemical characterization of analyzed sequences physicochemical parameters of single amino acids were calculated. 16 parameters were taken from PubMed® database (XLogP3, rotatable bond count, heavy atom count, formal charge, complexity, isotope atom count, defined atom stereocenter count, undefined atom stereocenter count, defined bond stereocenter count, undefined bond stereocenter count, covalently-bonded unit count). Analysis of 51 physicochemical parameters of single amino acids was completed using QikProp 3.1 from Schrödinger package (v 31207) software (Grabowski et al. 2012). QikProp was run in the normal mode. Three-dimensional structures of compounds were prepared in LigPrep 2.2 using settings recommended in the QikProp’s user manual (Schrödinger 2015). In the initial phase of study, 62 parameters and features of physicochemical structure were used. They were calculated separately for each amino acid that was a part of the examined sequences.
In the second phase, physicochemical parameters for whole sequences were calculated. In this phase arithmetic expression value (AEx) was created with the use of eight clue physicochemical parameters (Table 1). In cases of such parameters as: number of non-conjugated amine groups (AM), number of carboxylic acid groups (AC), number of non-trivial (not CX3), non-hindered (not alkene, amide, small ring) rotatable bonds (ROT), number of ring atoms not able to form conjugated aromatic systems (NON), ionization potential (IP) and hydrogen bond acceptors (HBA), the sum of the values of particular parameter (SP) calculated for each amino acid separately SP = (SP i = n i ) + (SP ii × n ii ) + (SP iii × n iii ) …, where SPi-iii – structure parameter calculated for particular amino acid, n – amount of particular amino acid in the sequence, i, ii, iii – particular amino acids was calculated.
In case of predicted apparent Caco-2 cell permeability (QPCaco) and solubility (QPlogS), SP was an arithmetic mean of particular physicochemical parameters calculated according to the formula:
where N – number of non-replayed amino acids in sequence. This way calculated physicochemical parameters (Table 1) were used to search for the correlation that could differentiate APR from CS.
Creation of arithmetic expressions
The search for differences in each SP between the APR and CS groups (Table 1) did not yield any significant findings. Therefore, an attempt to create an arithmetic expression value (AEx) consisting of several different SP was made. To this end, the method published earlier was used (Grabowski et al. 2012). Many compilations of SP were tested (not published data), as the result of which the arithmetic statement was distincted: Ln(AM − IP + AC × ROT) − (QPCaco − NON). A value QPlogS and this arithmetic statement were used to classify the sequence groups and to exhibit a significance of differences between classes (APR and CS). QPlogS is a physicochemical parameter of a complex character. This focuses the information of solubility in water. However, this information combines many properties linked to a molecule solubility and its electrostatical character. Hence, there was an attempt to use that parameter in presented model.
Statistical analysis and model validation
Statistical analysis was performed with the use of GraphPad Prism 6.0 software. All relationships were confirmed by Mann-Whitney test (Zc statistics) and differences with p <0.05 were regarded as statistically significant. Arithmetic mean (M), standard deviation (SD), lower and higher 95 % confidence intervals for M (CI low, CI high), and standard error (SE) was calculated (\( SD/\sqrt{N} \), where – N is total number of sequences (APR and CS)). Sample size of training set (APR) was positively verified by Toplis ratio (ratio of the number of chemicals in the training set to the number of descriptors in the AEx is >5:1) (OECD 2007; ECHA 2008).
Classification model and four classes (A, B, C, D) of sequences were finally presented. Currently, the Cooper statistics is the most widely used method of classification model validation (Fang and Fang 2013; Fang et al. 2013; Zambrano et al. 2015). That is why, for model validation Cooper statistics based on Bayesian approach (sensitivity – Sn, specificity – Sp, accuracy – Ac, error rate – Er, positive predictivity – Pp, negative predictivity – Np, false positive (over-classification) rate – FPoc, false negative (under-classification) rate – FNuc, proportion of active chemicals in a population – Pas) was presented. Cooper statistics was calculated using equations: S p = T n /(T n + F p ), S n = T p /(T p + F n ), A c = (T p + T n )/N where E r = [N − (T p + T n )]/N, P p = T p /(T p + F p ), N p = T n /(T n + F n ), FP oc = F p /(F p + T n ), FN uc = F n /(F n + T p ), P as = T n /(T n + F p ), where Tp (true positive) is the number of compounds correctly classified as APR, Tn (true negative) is the number of compounds correctly classified as CS, Fn (false negative) is the number of APR compounds classified as CS, Fp (false positive) is the number of CS compounds classified as APR and N is total number of sequences (APR and CS). Each calculated value was multiplied by 100 and expressed as %, model was verified as validated if Cooper statistics is significantly greater than 50 % (OECD 2007).
Results
The mean values of SP calculated for particular groups of sequences were presented in Table 2 (Table 2).
At the initial stage of study, the differences in SP between APR and CS were searched. As the result, they could differentiate significantly between these two groups. The analysis of single parameters did not yield its expected results. In case of comparative analysis of SP calculated for the tregitope sequences and APR, significant differences were identified (p <0.05), for instance in relation to HBA↔IP. After a selection of shorter sequences (CS) from the same tregitope sequences, though, it turned out that the differences in relation to HBA↔IP were not significant (Fig. 1). The significant differences (p <0.05) in relation to values SP of tregitopes and APR were also stated for correlations: FISA↔AC × DN 0.5/SA, FISA↔Vol, Vol↔HBA, QPlogS↔FISA, where FISA – hydrophilic component of the solvent accessible surface area, AC × DN 0.5/SA – index of cohesive interaction in solids, Vol – total solvent accessible volume in cubic angstroms (Å2) using a probe with a 1.4 Å radius. However, all the same correlations were not significant for the CS selected from the tregitopes. At the next stage of study, SP was used to create arithmetic statement (AEx) that allowed differentiation of APR (n = 84) from CS (n = 42) with a sensitivity of 79.76 %. After statement of a correlation AEx↔QPlogS, the sequences APR and CS were differentiated on 4 different classes (A, B, C, D). The range of classes are characterized with the values of parameters QPlogS and AEx. A definition of class includes the values: QPlogs > 0 and AEx < 0 (class A), QPlogs ≥ 0 and AEx > 0 (class B), QPlogs < 0 and AEx ≥ 0 (class C), QPlogs ≤ 0 and AEx ≤ 0 (class D), (Fig. 2). A range specific for APR illustrates class D on Figure 2. As a result of using AEx, only 20.24 % of APR were incorrectly recognized as sequences not connected with the aggregation process (class B and C on Fig. 2). And only one of 42 CS sequences was recognized as a sequence potentially dangerous and classified to class D. As a result, 97.67 % of CS sequences were classified as not possessing any features connected with forming the aggregates (Table 3) – class A, B, C on the Figure 2.
Out of 127 sequences (a sum of APR and CS) only one was present in A class. 67 APR sequences were classified to class D. Other sequences were in classes: B and C. The proposed classification model did not allow total separation of APR from CS. During analyses, it turned out that four out of 42 CS sequences had regions that were repeated in APR. These regions contained hydrophobic amino acids such as F, I, L, M and N. These regions were: LMI, LYL, TDF and QYN. Only one APR (CQQYN) was classified to class B instead of class D. In relation to CS, none of mentioned regions (LMI, LYL, TDF, QYN) impacted on incorrect classification of CS. Every CS sequence possessing the mentioned regions in its structure was assigned to class C (EEQYN) or B (LMIYE; NTLYL; GTDFT).
Significant difference (p <0.05) between AEx value calculated for APR (n = 84) and CS (n = 42), Zc = 7.172 was stated. Moreover, significant difference (p <0.05) was also stated between QPlogS value calculated for APR (n = 84) and QPlogS calculated for CS (n = 42) p <0.05, Zc = 6.270. Calculated Cooper statistics was Sn = 79.76 %, Sp = 97.62 %, Ac = 85.71 %, Er = 14.29 %, Pp = 98.53 %, Np =70.69 %, FPoc = 2.38 %, FNuc = 20.24 % and Pas = 85.71 %.
Discussion
This study attempted to use the physicochemical parameters of single amino acids to detect APR sequences in therapeutic proteins. The introduced method involved analysis using software used previously mostly for calculations of physicochemical parameters of small molecules. This method uses the analysis of physicochemical parameters of single amino acids and bases on a prediction of final parameter (SP). This parameter, in turn, is the basis for creating a sequence or region characterization based on AEx, constructed of many SP (AEx = SP ⟺ SP ⟺ SP…., where ⟺ means mathematic operation). AEx with QPlogS was used to construct a model, where 4 sequence classes were defined. Class D includes APR sequences, and classes A, B, C – sequences that do not have the same influence on aggregation bridges forming.
In the course of the study, it was stated that using the long amino acid sequences to verify the presented model implemented false positives. In long tregitope sequences AEx had a value significantly different from AEx calculated for APR. However, this may result from the existence of feature camouflage of the shorter CS (CS derived from tregitopes, n = 5).
The study proposes a classification model of APR consisting in a separation of APR based on the differences in the value of QPlogS and AEx in relation to sequences that do not form aggregates. A value of water solubility or hydrophobicity of APR with reference to APR has been discussed in many studies (Wang et al. 2009; Tsolis et al. 2013; Zbilut et al. 2003; Wu et al. 2014). The significance of this feature in relation to APR was also confirmed in this study. Moreover, it was stated that the charge characterization of particular amino acids present in analyzed sequences has a significant correlation with APR. It is indicated by the presence in AEx of such parameters as: ionisation potential, number of amine groups or number of carboxylic acid groups. At least three parameters used to construct AEx relate to the charge characterization of analyzed sequences. The presence of IP in AEx does not seem to be accidental. The IP value is determined, among other things, in relation to the oxidative potential of amino acids. It is known that IP value is connected with the proton-donating or proton-accepting character of the amino group and carbonyl groups of amino acids (Hirakawa 2014). IP is a parameter indicative of the molecular ability to transfer positive ion. Therefore IP is connected with the oxidative reactions of amino acids (Rooman and Wintjens 2014). On the other hand, oxidation of some amino acids (histidine, methionine, cysteine, tryptophan, tyrosine) may have influence on the increase of aggregation forming dynamics (Li et al. 1995).
Although some sequences were not classified correctly, validation parameters confirm the predictive quality of the model. Based on the calculations, it can be deduced that the finding of APR in the protein structure with the use of parameters used so far for small molecules is possible. The study confirmed previous observations concerning the influence of short, hydrophobic protein sequences on the initiation of the protein aggregation process. Additionally, the significant share of electrostatistic parameters including IP in relation to classification parameters was indicated.
Conclusions
The study proposes a classification model of APR consisting in a distinction of APR based on the differences in the structure in relation to sequences that do not form aggregates. Key parameters for validation of the presented model include: number of non-conjugated amine groups, number of carboxylic acid groups, number of non-trivial (not CX3), non-hindered (not alkene, amide, small ring) rotatable bonds, hydrogen bond acceptors, predicted apparent Caco-2 cell permeability, ionization potential, number of ring atoms not able to form conjugated aromatic systems and solubility.
This presented model allows selection of APR’s in the protein sequence in non-clinical drug development process.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
References
Agrawal NJ, Kumar S, Wang X. Aggregation in protein-based biotherapeutics: computational studies and tools to identify aggregation-prone regions. J Pharm Sci. 2011;100:5081–95.
Chennamsetty N, Voynov V, Kayser V, Helk B, Trout B. Design of therapeutic proteins with enhanced stability. PNAS. 2009;106:11937–42.
Datta-Mannan A, Witcher DR, Tang Y, Watkins J, Wroblewski VJ. Monoclonal antibody clearance. Impact of modulating the interaction of IgG with the neonatal Fc receptor. J Biol Chem. 2007;282:1709–17.
De Groot AS, Cousens L, Mingozzi F, Martin W. Tregitope peptides: the active pharmaceutical ingredient of IVIG? Clin Dev Immunol. 2013. doi:10.1155/2013/493138.
Dostalek M, Gardner I, Gurbaxani BM, Rose RH, Chetty M. Pharmacokinetics, pharmacodynamics and physiologically-based pharmacokinetic modelling of monoclonal antibodies. Clin Pharmacokinet. 2012;52:83–124.
ECHA. Guidance on information requirements and chemical safety assessment. Chapter R.6: QSARs and grouping of chemicals. May 2008. Guidance for the implementation of REACH. European Chemicals Agency. 2008. p. 1–134.
EMA. Guideline on similar biological medicinal products containing monoclonal antibodies – non-clinical and clinical issues. EMA/CHMP/BMWP/403543/2010:1–16. 2012.
Epivax Inc. Regulatory t cell epitopes, compositions and uses thereof US 20090018067 A1, CA2677073A1. 2007. US Patent, http://www.google.com/patents/US20090018067.
Fang Y, Fang J. Discrimination of soluble and aggregation-prone proteins based on sequence information. Mol Biosyst. 2013;9(4):806–11.
Fang Y, Gao S, Tai D, Middaugh CR, Fang J. Identification of properties important to protein aggregation using feature selection. BMC Bioinformatics. 2013;14:314.
FDA. Guidance for Industry Immunogenicity Assessment for Therapeutic Protein Products. 2014. p. 1–36.
Grabowski T, Jaroszewski JJ, Gad SC, Feder M. Correlation between in silico physicochemical characteristics of drugs and their mean residence time in human and dog. Int J Toxicol. 2012;31:25–33.
Hirakawa K. Theoretical study of the effects of amino acids on one-electron oxidation of a nucleobase: adenine residue can be a hole-trapping site. Pure Appl Chem Sci. 2014;2:41–8.
Hsu L, Snodgrass BT, Armstrong AW. Antidrug antibodies in psoriasis: a systematic review. Br J Dermatol. 2014;170:261–73.
IUPAC. Nomenclature and symbolism for amino acids and peptides. Pure Appl Chem. 1983;56:595–624.
Li S, Schoneich C, Borchardt RT. Chemical instability of protein pharmaceuticals: mechanisms of oxidation and strategies for stabilization. Biotechnol Bioeng. 1995;48:490–500.
Mould DR, Green B. Pharmacokinetics and pharmacodynamics of monoclonal antibodies concepts and lessons for drug development. Biodrugs. 2010;24:23–39.
OECD. The OECD Guidance Document on the Validation of (Quantitative) Structure-Activity Relationships [(Q)SAR] Models. OECD Environmental Health and Safety Publications, Series on Testing and Assessment No. 69. OECD Environment Directorate Joint Meeting of the Chemicals Committee and the Working Party on Chemicals, Pesticides and Biotechnology. ENV/JM/MONO(2007)2. 2007. p. 1–154.
Pawar AP, Dubay KF, Zurdo J. Prediction of “aggregation-prone” and “aggregation-susceptible” regions in proteins associated with neurodegenerative diseases. J Mol Biol. 2005;350:379–92.
Rooman M, Wintjens RJ. Sequence and conformation effects on ionization potential and charge distribution of homo-nucleobase stacks using M06-2X hybrid density functional theory calculations. Biomol Struct Dyn. 2014;32:532–45.
Rosenberg AS. Effects of protein aggregates: an immunologic perspective. AAPR J. 2006;8:501–7.
Schrödinger LLC. QuicProp 3.1 from Schrödinger v 31207. 2015. http://www.schrodinger.com/products/14/17/ Accessed 1 Jan 2015.
Tartaglia GG, Cavalli A, Pellarin R, Caflisch A. Prediction of aggregation rate and aggregation-prone segments in polypeptide sequences. Protein Sci. 2005;14:2723–34.
Tatkiewicz W, Elizondo E, Moreno E. Methods for characterization of protein aggregates. Methods Mol Biol. 2015;1258:387–401.
Tsolis AC, Papandreou NC, Iconomidou VA, Hamodrakas SJ. A consensus method for the prediction of ‘Aggregation-Prone’ peptides in globular proteins. PLoS ONE. 2013;8(1):e54175. doi:10.1371/journal.pone.0054175.
Wang YM, Chow AT. Development of biosimilars-pharmacokinetic and pharmacodynamic considerations. J Biopharm Stat. 2010;20:46–61.
Wang X, Das TK, Singh SK, Kumar S. Potential aggregation prone regions in biotherapeutics: a survey of commercial monoclonal antibodies. MAbs. 2009;1:254–67.
Wu H, Kroe-Barrett R, Singh S, Robinson AS, Roberts CJ. Competing aggregation pathways for monoclonal antibodies. FEBS Lett. 2014;588:936–41.
Zambrano R, Jamroz M, Szczasiuk A, Pujols J, Kmiecik S, Ventura S. AGGRESCAN3D (A3D): server for prediction of aggregation properties of protein structures. Nucleic Acids Res. 2015;43(W1):W306–13.
Zbilut JP, Colosimo A, Conti F. Protein aggregation/folding: the role of deterministic singularities of sequence hydrophobicity as determined by nonlinear signal analysis of acylphosphatase and Abeta(1–40). Biophys J. 2003;85:3544–57.
Acknowledgements
Special thanks are expressed to Liam Healy for correcting grammar and language of the manuscript.
Funding
The Authors declare no sources of funding for the research.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
All authors declare no actual or potential conflict of interest including any financial, personal or other relationships with other people or organizations. Additionally, the Authors declare no prior interactions with the in Current Drug Safety regarding the submitted manuscript.
Authors’ contributions
All Authors have contributed substantially to the design, performance, analysis and reporting of the work (MM, TG - designed study, performed study; TG, KO - collected data, analyzed data, wrote paper). All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Marczak, M., Okoniewska, K. & Grabowski, T. Classification model of amino acid sequences prone to aggregation of therapeutic proteins. In Silico Pharmacol. 4, 6 (2016). https://doi.org/10.1186/s40203-016-0019-4
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s40203-016-0019-4