Kalamazoo, Michigan
March 22, 2024
March 23, 2024
11
10.18260/1-2--45644
https://peer.asee.org/45644
On Amino Acid Modeling with Efficient Neural Architecture Search - An AutoML approach

I. INTRODUCTION

State-of-the-art algorithms in the proteomics domain for de novo sequencing and peptide analysis report high accuracy: Tran et al. claim that DeepNovo achieves 97.2 to 99.5% accuracy in reconstructing mouse antibody samples [1]. However, as the protein sequencing domain expands rapidly, new automated tools are required to scale this success, because these algorithms require hand-tuning of parameters, which makes them labor intensive. The Neural Architecture Search (NAS) technique has shown great promise in the image classification and object detection domains, in the AutoML projects of Google and others. This paper assesses its performance in the proteomics domain by applying it to amino acid prediction.

II. PEPTIDE ANALYSIS AND AMINO ACID PREDICTION

De novo peptide sequencing is the pattern recognition of charged b- and y-ions in a mass spectrum. Manually, one can sit down and calculate the atomic mass units derived from the MS/MS spectrum. The calculations interpret the energy released from b- and y-ions, which are then matched to amino acid residues [2]. Current approaches in machine learning use manually designed architectures to learn the features of mass spectra for predicting peptides. These architectures have been trained to provide fast, efficient, and accurate coverage (see Figure 1) [1].

III. APPLICATION TO PROTEOMICS: AMINO ACID MODELING

Where previous approaches used neural architectures hand-tailored to the datasets [1], we use Efficient NAS (ENAS) for the automatic generation and training of custom models on a wide variety of datasets: about 7 low-resolution and 9 high-resolution datasets from the PRIDE Peptidome library [12], a proteomics repository with large, annotated datasets. These have been used in the construction and training of complex neural networks [3]–[8]. We assess ENAS's viability for identifying a spectrum's amino acids.
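As an illustration of the manual calculation described in Section II, the following is a minimal sketch of how singly charged b- and y-ion m/z values follow from a peptide sequence, and how the mass gap between two adjacent peaks can be matched back to a residue. It is not the authors' method. The function names `fragment_ions` and `residue_from_gap` and the 0.01 Da tolerance are assumptions made for illustration; the residue masses are standard monoisotopic values, and the peptide is assumed to be unmodified.

```python
# Standard monoisotopic residue masses in daltons, rounded to 5 decimals.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # mass of a proton (charge carrier)
WATER = 18.01056   # mass of H2O, carried by every y-ion


def fragment_ions(peptide):
    """Return (b_ions, y_ions) m/z lists for charge 1+ (hypothetical helper).

    b_ions is ordered b1, b2, ...; y_ions is ordered from the longest
    suffix to the shortest, so y_ions[i] complements b_ions[i].
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal prefix
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal suffix
    return b_ions, y_ions


def residue_from_gap(mass_gap, tolerance=0.01):
    """Return residues whose mass matches a gap between adjacent peaks.

    The tolerance (in daltons) is an assumed value for illustration.
    """
    return sorted(
        aa for aa, mass in RESIDUE_MASS.items()
        if abs(mass - mass_gap) <= tolerance
    )
```

For example, `fragment_ions("GAS")` gives two b-ions whose difference equals the mass of alanine, so `residue_from_gap` applied to that difference returns `["A"]`. Leucine and isoleucine share the same mass, which is why a gap of 113.08406 Da matches both.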
The identified amino acids represent key features for de novo sequencing. These features can then be used as labels for a spectrum dataset, which in turn can serve as input to a de novo peptide prediction algorithm.

IV. RESULTS

ENAS computes an order of magnitude better than NAS. For comparative results with respect to DeepNovo, we used the Escherichia coli dataset [9]. Without data optimization, ENAS derived a model that yielded a maximum amino acid identification rate of 70.6%. Over the course of 2 hours, a total of 2000 models were generated, and two of those models had an identification rate of at least 70%. In comparison, DeepNovo reported an amino acid identification precision of 52.3% for the same Escherichia coli dataset [1]. The full paper will report extensive results from the low- and high-resolution datasets tested in DeepNovo.
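To make the idea of amino acid labels and an identification score concrete, here is a minimal sketch under stated assumptions. The abstract does not define how labels are encoded or how the identification rate is computed, so both are assumed here: a 20-element multi-hot vector marks which amino acids are present in a spectrum's peptide, and the score is precision (correctly predicted amino acids divided by all predicted amino acids). The names `encode_labels` and `amino_acid_precision` are hypothetical.

```python
# The 20 standard amino acids in one-letter code, in a fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_labels(peptide):
    """Multi-hot label: 1 where the amino acid occurs in the peptide.

    Assumed encoding for illustration; the source does not specify one.
    """
    present = set(peptide)
    return [1 if aa in present else 0 for aa in AMINO_ACIDS]


def amino_acid_precision(true_labels, predicted_labels):
    """Precision over multi-hot labels for a batch of spectra.

    Assumed metric: true positives / predicted positives, summed over
    all spectra. Returns 0.0 when nothing was predicted.
    """
    true_positives = 0
    predicted_positives = 0
    for truth, pred in zip(true_labels, predicted_labels):
        for t, p in zip(truth, pred):
            if p == 1:
                predicted_positives += 1
                if t == 1:
                    true_positives += 1
    if predicted_positives == 0:
        return 0.0
    return true_positives / predicted_positives
```

With a true peptide `"GAS"` and a prediction that corresponds to `"GAT"`, two of the three predicted amino acids are correct, so the score is 2/3. A search procedure such as ENAS could rank candidate models by a score of this kind, although the actual metric behind the reported figures may differ.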
George, J., Gupta, A., & Fong, A. (2024, March). Towards Streamlining the Process of Building Machine Learning Models for your Artificial Intelligence Applications. Paper presented at 2024 ASEE North Central Section Conference, Kalamazoo, Michigan. 10.18260/1-2--45644
ASEE holds the copyright on this document. It may be read by the public free of charge. Authors may archive their work on personal websites or in institutional repositories with the following citation: © 2024 American Society for Engineering Education. Other scholars may excerpt or quote from these materials with the same citation. When excerpting or quoting from Conference Proceedings, authors should, in addition to noting the ASEE copyright, list all the original authors and their institutions and name the host city of the conference.