Desperately Seeking Standards: Using Text Processing to Save Your Time

Conference

2021 ASEE Virtual Annual Conference Content Access

Location

Virtual Conference

Publication Date

July 26, 2021

Start Date

July 26, 2021

End Date

July 19, 2022

Conference Session

TS3: Working with Students

Tagged Division

Engineering Libraries

Page Count

13

DOI

10.18260/1-2--36932

Permanent URL

https://peer.asee.org/36932

Download Count

34

Paper Authors

Halle Burns, University of Nevada, Las Vegas. ORCID: orcid.org/0000-0003-2346-2876

Halle Burns is the Data Librarian at the University of Nevada, Las Vegas University Libraries. In addition, she is certified as an instructor with The Carpentries. Her current research interests include data literacy, digital humanities, and improving the accessibility of data science and technology education.

Susan B. Wainscott, University of Nevada, Las Vegas. ORCID: orcid.org/0000-0001-9994-0956

Susan Wainscott is the Engineering Librarian for the University of Nevada, Las Vegas University Libraries. She holds a Master of Library and Information Science from San Jose State University and a Master of Science in Biological Sciences from Illinois State University. As liaison librarian to several departments at UNLV, she teaches information literacy for many students, provides reference assistance to the campus and community, and maintains the collection in assigned subject areas. Her research interests include information literacy instruction and assessment, the notion of threshold concepts, the effect a student’s emotional state has on their learning, and improving access to technical literature.

Abstract

Purpose/Hypothesis We aim to analyze our standards-use, interlibrary loan, and document-delivery-request data on a more regular basis to inform collections management decisions. However, manually searching for standards titles within interlibrary loan and document-delivery-request data is time-consuming and therefore unlikely to happen regularly. We were also interested in a method that could be applied to large blocks of text, such as theses and dissertations.

Design/Method To detect engineering standards and other standards documents in tabular datasets as well as in large blocks of text, we first developed a regular expression using Python in Jupyter Notebooks. Regular expressions (or regex), used for text processing and querying, identify patterns within written text. We tested the pattern against sample text that included known standards such as ANS 10.5-2006. We also checked it against words and phrases it should not match, including web addresses and mathematical equations. As a proof of concept, the text processing code was evaluated against a collection of sample PDF dissertations, one of which included standards documents in the text and references list.
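A minimal sketch of the kind of pattern described above, using Python's built-in `re` module. The pattern, the name `STANDARD_RE`, and the helper `find_standards` are illustrative assumptions, not the authors' actual regular expression, which is not given on this page. The sketch assumes a designation looks like an uppercase acronym followed by a number with optional parts or a year.

```python
import re

# Hypothetical pattern (an assumption, not the authors' regex):
# an issuing-body acronym, a separator, then a designation number.
STANDARD_RE = re.compile(
    r"\b[A-Z]{2,}(?:/[A-Z]{2,})*"    # acronym(s), e.g. ANS or ANSI/ASHRAE
    r"[ -]"                          # space or hyphen separator
    r"[A-Z]{0,2}\d+(?:[.\-:]\d+)*"   # number with optional parts or year
    r"\b"
)


def find_standards(text):
    """Return every substring of `text` that looks like a standard designation."""
    return STANDARD_RE.findall(text)


# Known standards that the pattern should match.
print(find_standards("See ANS 10.5-2006 and ASTM D638-14."))

# Text that should not match: a web address and an equation.
print(find_standards("visit https://peer.asee.org/36932 or solve x = 10.5 - 2006"))
```

Requiring at least two uppercase letters before the number is one way to skip lowercase web addresses and bare arithmetic, at the cost of missing designations written in mixed case.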

As there are many variations in how a standard can be designated, we were unable to restrict the regex matching criteria any further. As a result, false positives appeared, such as “state name and zip code” combinations, report numbers, and chemical formulas. To help separate true results from false positives, we expanded the regular expression to also pull the words surrounding each match, giving context to the results. This does not prevent false positives, but it allows us to quickly distinguish a false positive from an actual match.
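The context idea can be sketched as follows. This version collects neighboring words in Python rather than inside the regular expression itself, which is a swapped-in simplification; the pattern, the function name `matches_with_context`, and the `window` parameter are all assumptions made for illustration.

```python
import re

# Hypothetical designation pattern (an assumption, not the authors' regex).
STANDARD_RE = re.compile(
    r"\b[A-Z]{2,}(?:/[A-Z]{2,})*[ -][A-Z]{0,2}\d+(?:[.\-:]\d+)*\b"
)


def matches_with_context(text, window=3):
    """Return (match, context) pairs, where context includes up to
    `window` whitespace-separated words on each side of the match."""
    results = []
    for m in STANDARD_RE.finditer(text):
        words_before = text[:m.start()].split()
        words_after = text[m.end():].split()
        before = words_before[-window:] if window > 0 else []
        after = words_after[:window] if window > 0 else []
        context = " ".join(before + [m.group(0)] + after)
        results.append((m.group(0), context))
    return results


sample = ("tested according to ASTM D638-14 before mailing to "
          "Las Vegas, NV 89154 for review")
for match, context in matches_with_context(sample):
    print(f"{match!r:16} <- {context}")
```

In this sample the state and zip code pair is still flagged, but the surrounding words make it easy for a reviewer to dismiss it at a glance.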

Once the pattern was identified, it was then applied (using Python and the pandas package) to compiled spreadsheets to identify standards in tabular collections data. We compared these results to an earlier manual search performed on the same data set. We also tested the text processing method on a set of dissertations.
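Applying a pattern to tabular request data might look like the sketch below. To stay dependency-free it uses the standard-library `csv` module in place of pandas, which is what the authors report using. The column names `request_id` and `title`, the sample rows, the pattern, and the function `flag_requests` are illustrative assumptions.

```python
import csv
import io
import re

# Hypothetical designation pattern (an assumption, not the authors' regex).
STANDARD_RE = re.compile(
    r"\b[A-Z]{2,}(?:/[A-Z]{2,})*[ -][A-Z]{0,2}\d+(?:[.\-:]\d+)*\b"
)

# Invented sample rows standing in for an interlibrary loan export.
SAMPLE_CSV = """request_id,title
1,Introduction to Heat Transfer
2,ASTM D638-14 Standard Test Method for Tensile Properties of Plastics
3,ANSI/ASHRAE 62.1-2019 Ventilation for Acceptable Indoor Air Quality
"""


def flag_requests(csv_text, column="title"):
    """Return (row, matches) for each row whose `column` contains a
    candidate standard designation."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        hits = STANDARD_RE.findall(row.get(column) or "")
        if hits:
            flagged.append((row, hits))
    return flagged


for row, hits in flag_requests(SAMPLE_CSV):
    print(row["request_id"], hits)
```

With pandas, a comparable step could pass the same pattern string to `Series.str.findall` on the title column of a DataFrame.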

Results The new method required 25% less time to complete, and the outcomes were similar. While we predicted that more standards would be located using the text processing method compared to a manual search, the text processing method missed three standards that were previously detected, and located one standard that had not been previously detected. The regular expression also successfully detected standards documents mentioned in large blocks of text.
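The comparison between the manual and automated searches can be expressed with set operations. The sketch below is an assumption about how such a comparison could be tallied; the function `compare_detections` and the request identifiers are invented for illustration and are not the study's data.

```python
def compare_detections(manual_ids, automated_ids):
    """Split request IDs into those found by both methods, those only
    the manual search found, and those only the regex found."""
    manual, automated = set(manual_ids), set(automated_ids)
    return {
        "found_by_both": sorted(manual & automated),
        "missed_by_regex": sorted(manual - automated),
        "new_from_regex": sorted(automated - manual),
    }


# Invented IDs, for illustration only.
manual_hits = ["r01", "r02", "r03", "r04"]
regex_hits = ["r01", "r02", "r05"]
print(compare_detections(manual_hits, regex_hits))
```

Reviewing the `missed_by_regex` group is a natural starting point for deciding whether the pattern needs to be loosened.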

Conclusion We developed and assessed an open-source text processing method to flag potential standards mentioned in text and tabular datasets. This method is a substantial improvement over manual searching, providing similar results in a quarter of the time. The new method requires less than half a standard workday to analyze 10,000 interlibrary loan or document delivery requests. Our pilot test of the method on large blocks of text shows that it will also detect standards used in materials that are not regularly indexed for citations, such as theses and dissertations, technical reports, and other gray literature.

Burns, H., & Wainscott, S. B. (2021, July), Desperately Seeking Standards: Using Text Processing to Save Your Time. Paper presented at 2021 ASEE Virtual Annual Conference Content Access, Virtual Conference. 10.18260/1-2--36932

ASEE holds the copyright on this document. It may be read by the public free of charge. Authors may archive their work on personal websites or in institutional repositories with the following citation: © 2021 American Society for Engineering Education. Other scholars may excerpt or quote from these materials with the same citation. When excerpting or quoting from Conference Proceedings, authors should, in addition to noting the ASEE copyright, list all the original authors and their institutions and name the host city of the conference. - Last updated April 1, 2015