Desperately Seeking Standards: Using Text Processing to Save Your Time

Halle Burns; Susan B. Wainscott

Conference: 2021 ASEE Virtual Annual Conference Content Access
Location: Virtual Conference
Publication Date: July 26, 2021
Start Date: July 26, 2021
End Date: July 19, 2022
Conference Session: TS3: Working with Students
Tagged Division: Engineering Libraries
Page Count: 13
DOI: 10.18260/1-2--36932
Permanent URL: https://peer.asee.org/36932
Download Count: 413

Paper Authors

Halle Burns University of Nevada, Las Vegas orcid.org/0000-0003-2346-2876

Halle Burns is the Data Librarian at the University of Nevada, Las Vegas University Libraries. In addition, she is certified as an instructor with The Carpentries. Her current research interests include data literacy, digital humanities, and improving the accessibility of data science and technology education.

Susan B. Wainscott University of Nevada, Las Vegas orcid.org/0000-0001-9994-0956

Susan Wainscott is the Engineering Librarian for the University of Nevada, Las Vegas University Libraries. She holds a Master of Library and Information Science from San Jose State University and a Master of Science in Biological Sciences from Illinois State University. As liaison librarian to several departments at UNLV, she teaches information literacy for many students, provides reference assistance to the campus and community, and maintains the collection in assigned subject areas. Her research interests include information literacy instruction and assessment, the notion of threshold concepts, the effect a student’s emotional state has on their learning, and improving access to technical literature.

Abstract

Purpose/Hypothesis We aim to analyze our standards-use, interlibrary loan, and document-delivery-request data on a more regular basis to inform collections management decisions. However, manually searching for standards titles within interlibrary loan and document-delivery-request data is time-consuming and therefore unlikely to happen regularly. We were also interested in a method that could be applied to large blocks of text, such as theses and dissertations.

Design/Method To detect engineering standards and other standards documents in tabular datasets as well as in large blocks of text, we first developed a regular expression using Python in Jupyter Notebooks. Regular expressions (or regex), used for text processing and querying, identify patterns within written text. We tested the pattern against sample text that included known standards, such as ANS 10.5-2006. We also checked it against words and phrases it should not match, including web addresses and mathematical equations. As a proof of concept, we evaluated the text processing code against a collection of sample PDF dissertations, one of which included standards documents in its text and reference list.
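
The testing approach described above can be sketched as follows. The expression shown here is a hypothetical stand-in written for illustration, since the abstract does not reproduce the authors' pattern; the name `STANDARD_PATTERN`, the helper `find_standards`, and the sample sentences (other than the designation ANS 10.5-2006) are assumptions. The sketch checks one sample that should match and two that should not.

```python
import re

# Hypothetical pattern, written for illustration only; it is NOT the
# expression developed in the paper.
STANDARD_PATTERN = re.compile(
    r"\b[A-Z]{2,6}(?:/[A-Z]{2,6})*"  # issuing body, e.g. ANS or ANSI/ASHRAE
    r"[ -]?"                         # optional space or hyphen
    r"[A-Z]?\d+(?:\.\d+)*"           # designation number, e.g. 10.5 or D638
    r"(?:[-:]\d{2,4})?"              # optional year suffix, e.g. -2006
    r"\b"
)


def find_standards(text):
    """Return every substring of text that looks like a standard designation."""
    return STANDARD_PATTERN.findall(text)


# Sample text that should and should not produce a match (illustrative).
should_match = ["The code follows ANS 10.5-2006 throughout."]
should_not_match = [
    "See https://www.example.com/page for details.",
    "y = 3x + 2",
]

for sample in should_match:
    assert find_standards(sample), sample
for sample in should_not_match:
    assert not find_standards(sample), sample
```

A pattern this loose is deliberately permissive, so it is best read as a candidate flagger rather than a validator.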

Because standards can be designated in many different ways, we were unable to restrict the regex matching criteria any further. As a result, false positives appeared, such as “state name and zip code” combinations, report numbers, and chemical formulas. To help distinguish true matches from false positives, we expanded the regular expression to also capture the words surrounding each match, giving context to the results. This does not prevent false positives, but it allows us to quickly tell a false positive from an actual match.
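
One way to attach context to each match is sketched below. Note the swap in technique: rather than widening the expression itself, as the paragraph above describes, this sketch keeps a hypothetical pattern unchanged and slices the neighbouring words out of the text, which avoids one match absorbing an adjacent standard as context. The pattern, the function name `matches_with_context`, the `n_words` parameter, and the sample sentence are all assumptions.

```python
import re

# Hypothetical candidate pattern (an assumption, not the authors' expression).
CANDIDATE = re.compile(
    r"\b[A-Z]{2,6}(?:/[A-Z]{2,6})*[ -]?[A-Z]?\d+(?:\.\d+)*(?:[-:]\d{2,4})?\b"
)


def matches_with_context(text, n_words=3):
    """Return each candidate match with up to n_words of context on each side."""
    results = []
    for m in CANDIDATE.finditer(text):
        words_before = text[:m.start()].split()
        words_after = text[m.end():].split()
        # Keep only the n_words closest to the match on each side.
        before = words_before[max(len(words_before) - n_words, 0):]
        after = words_after[:n_words]
        results.append(
            {
                "match": m.group(0),
                "before": " ".join(before),
                "after": " ".join(after),
            }
        )
    return results


# A state-and-zip-code false positive: the context makes it easy to reject.
for hit in matches_with_context("Mail to Las Vegas, NV 89154 by Friday."):
    print(hit["before"], "[", hit["match"], "]", hit["after"])
```

The false positive is still reported, but a reviewer scanning the context columns can discard it at a glance.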

Once the pattern was developed, we applied it (using Python and the pandas package) to compiled spreadsheets to identify standards in tabular collections data. We compared these results to an earlier manual search performed on the same data set. We also tested the text processing method on a set of dissertations.
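
Applying a pattern to tabular request data might look like the minimal sketch below. It assumes a hypothetical pattern and hypothetical column names (`request_id`, `title`), and it builds a small made-up table inline; real data would more likely be loaded from a spreadsheet export with `pandas.read_csv` or `pandas.read_excel`.

```python
import re

import pandas as pd

# Hypothetical candidate pattern (an assumption, not the authors' expression).
CANDIDATE = re.compile(
    r"\b[A-Z]{2,6}(?:/[A-Z]{2,6})*[ -]?[A-Z]?\d+(?:\.\d+)*(?:[-:]\d{2,4})?\b"
)


def flag_candidates(df, column="title"):
    """Return the rows whose text column contains at least one candidate match."""
    out = df.copy()
    # Treat missing titles as empty strings so the regex never sees NaN/None.
    out["candidates"] = out[column].fillna("").apply(lambda s: CANDIDATE.findall(s))
    return out[out["candidates"].map(len) > 0]


# Made-up rows for illustration only; they are not data from the study.
requests = pd.DataFrame(
    {
        "request_id": [1, 2, 3, 4],
        "title": [
            "Copy of ANS 10.5-2006 requested",
            "Introduction to Thermodynamics, 3rd edition",
            None,
            "Report mailed to Las Vegas, NV 89154",
        ],
    }
)

flagged = flag_candidates(requests)
print(flagged[["request_id", "candidates"]])
```

Row 4 is flagged even though it is a state-and-zip-code false positive, so the flagged rows still need a quick human review.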

Results The new method required 25% less time to complete, and the outcomes were similar. Although we predicted that the text processing method would locate more standards than a manual search, it missed three standards that had previously been detected and located one standard that had not. The regular expression also successfully detected standards documents mentioned in large blocks of text.

Conclusion We developed and assessed an open-source text processing method to flag potential standards mentioned in text and tabular datasets. This method is a substantial improvement over manual searching, providing similar results in a quarter of the time. The new method requires less than half a standard workday to analyze 10,000 interlibrary loan or document delivery requests. Our pilot test of the method on large blocks of text shows that it can also detect standards used in materials that are not regularly indexed for citations, such as theses and dissertations, technical reports, and other gray literature.

Citation

Burns, H., & Wainscott, S. B. (2021, July). Desperately Seeking Standards: Using Text Processing to Save Your Time. Paper presented at 2021 ASEE Virtual Annual Conference Content Access, Virtual Conference. 10.18260/1-2--36932

TY - CPAPER
AU - Halle Burns
AU - Susan B. Wainscott
CY - Virtual Conference
DA - 2021/07/26
PB - ASEE Conferences
TI - Desperately Seeking Standards: Using Text Processing to Save Your Time
UR - https://peer.asee.org/36932
DO - 10.18260/1-2--36932
ER -