Towards Streamlining the Process of Building Machine Learning
Models for your Artificial Intelligence Applications

Joseph George; Ajay Gupta; Alvis Fong

Conference: 2024 ASEE North Central Section Conference
Location: Kalamazoo, Michigan
Publication Date: March 22, 2024
Start Date: March 22, 2024
End Date: March 23, 2024
Page Count: 11
DOI: 10.18260/1-2--45644
Permanent URL: https://peer.asee.org/45644

Abstract

On Amino Acid Modeling with Efficient Neural Architecture Search: An AutoML Approach

I. INTRODUCTION

State-of-the-art algorithms in the proteomics domain for de novo sequencing and peptide analysis report high accuracy: Tran et al. claim that DeepNovo achieves 97.2 to 99.5% accuracy in reconstructing mouse antibody samples [1]. However, as the protein sequencing domain rapidly expands, new automated tools are required to scale this success, because these algorithms require hand-tuning of parameters, which makes them labor intensive.

The Neural Architecture Search (NAS) technique has shown great promise in the image classification and object detection domains in AutoML projects at Google and elsewhere. This paper assesses its performance in the proteomics domain by applying it to amino acid prediction.

II. PEPTIDE ANALYSIS AND AMINO ACID PREDICTION

De novo peptide sequencing is the pattern recognition of the charged b- and y-ions in a mass spectrum. Manually, one can sit down and calculate the atomic mass units derived from the MS/MS spectrum. The calculations interpret the energy released from b- and y-ions, which are then matched to an amino acid residue [2].

Current approaches in machine learning use manually designed architectures to learn the features of mass spectra for predicting peptides. These architectures have been trained to provide fast, efficient, and accurate coverage (see Figure 1) [1].

III. APPLICATION TO PROTEOMICS: AMINO ACID MODELING

Where previous approaches used neural architectures hand-tailored to the datasets [1], we use Efficient NAS (ENAS) to automatically generate and train custom models on a wide variety of datasets: about 7 low-resolution and 9 high-resolution datasets from the PRIDE Peptidome library [12], a proteomics repository with large, annotated datasets. These datasets have been used in the construction and training of complex neural networks [3]–[8].

We assess the viability of ENAS for identifying a spectrum's amino acids. These amino acids represent key features for de novo sequencing. The features can then be used as labels for a spectrum dataset, and the labeled dataset can in turn be used as input to a de novo peptide prediction algorithm.

IV. RESULTS

ENAS is an order of magnitude more computationally efficient than NAS. For comparative results with respect to DeepNovo, we used the Escherichia coli dataset [9]. Without data optimization, ENAS derived a model that yielded a maximum amino acid identification rate of 70.6%. Over the course of 2 hours, a total of 2000 models were generated, and two of those models had an identification rate of at least 70%. In comparison, DeepNovo reported an amino acid identification precision of 52.3% for the same Escherichia coli dataset [1]. The full paper will report extensive results from the low- and high-resolution datasets tested in DeepNovo.
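To make the manual calculation described in Section II concrete, the following is a minimal sketch of reading residues from the mass gaps between consecutive b-ions. It is not the ENAS-based method the abstract evaluates; it only illustrates the hand calculation. The function names (`b_ion_ladder`, `read_residues`) and the tolerance are hypothetical choices made for this illustration, y-ions are ignored, and only a subset of the standard monoisotopic residue masses is included (leucine and isoleucine share the same mass, so only L is listed).

```python
# Monoisotopic residue masses in daltons (standard values; subset for brevity).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
}
PROTON = 1.00728  # mass of the proton carried by a singly charged b-ion


def b_ion_ladder(peptide):
    """Return theoretical singly charged b-ion m/z values b1..bn."""
    ladder, total = [], PROTON
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


def read_residues(b_ions, tol=0.01):
    """Match each gap between consecutive b-ions to a residue mass.

    Returns one call per gap; "?" marks a gap with no unique match.
    The first residue is not recovered, since it has no preceding peak.
    """
    calls = []
    for low, high in zip(b_ions, b_ions[1:]):
        gap = high - low
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - gap) <= tol]
        calls.append(matches[0] if len(matches) == 1 else "?")
    return calls


if __name__ == "__main__":
    ladder = b_ion_ladder("GALVE")  # hypothetical example peptide
    print(read_residues(ladder))
```

Real spectra contain noise, missing peaks, and mixed ion types, which is why learned models are used instead of this direct lookup.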

Citation

George, J., Gupta, A., & Fong, A. (2024, March). Towards Streamlining the Process of Building Machine Learning Models for your Artificial Intelligence Applications. Paper presented at 2024 ASEE North Central Section Conference, Kalamazoo, Michigan. https://doi.org/10.18260/1-2--45644

TY - CPAPER
AU - Joseph George
AU - Ajay Gupta
AU - Alvis Fong
CY - Kalamazoo, Michigan
DA - 2024/03/22
PB - ASEE Conferences
TI - Towards Streamlining the Process of Building Machine Learning
Models for your Artificial Intelligence Applications
UR - https://peer.asee.org/45644
DO - 10.18260/1-2--45644
ER -

Paper Authors

Joseph George, Western Michigan University

Ajay Gupta, Western Michigan University

Alvis Fong, Western Michigan University