Human vs. Automated Coding Style Grading in Computing Education

James Perretta; Westley Weimer; Andrew DeOrio

Conference: 2019 ASEE Annual Conference & Exposition
Location: Tampa, Florida
Publication Date: June 15, 2019
Start Date: June 15, 2019
End Date: June 19, 2019
Conference Session: Technical Session 11: Topics related to Computer Science
Tagged Division: Computers in Education
Page Count: 13
DOI: 10.18260/1-2--32906
Permanent URL: https://peer.asee.org/32906
Download Count: 488

Paper Authors

James Perretta, University of Michigan

James Perretta is currently pursuing a master's degree in Computer Science at the University of Michigan, where he also develops automated grading systems. His research interests and prior work focus on using automated grading systems and feedback policies to enhance student learning.

Westley Weimer, University of Michigan

Andrew DeOrio, University of Michigan, orcid.org/0000-0001-5653-5109

Andrew DeOrio is a teaching faculty member at the University of Michigan and a consultant for web and machine learning projects. His research interests are in ensuring the correctness of computer systems, including medical and IoT devices and digital hardware, as well as engineering education. In addition to teaching software and hardware courses, he teaches Creative Process and works with students on technology-driven creative projects. His teaching has been recognized with the Provost's Teaching Innovation Prize, and he has twice been named Professor of the Year by the students in his department.

Abstract

Computer programming courses often evaluate student coding style manually. Static analysis tools provide an opportunity to automate this process. In this paper, we explore the tradeoffs of human style graders and general-purpose static analysis tools to evaluate student code. We investigate the following research questions:
 - Are human coding style evaluation scores consistent with static analysis tools?
 - Which style grading criteria are best evaluated with existing static analysis tools and which are more effectively evaluated by human graders?

We analyze data from a second-semester programming course at a large research institution with 943 students enrolled. Hired student graders evaluated student code with rubric criteria such as “Lines are not too long” or “Code is not too deeply nested.” We also ran several static analysis tools on the same student code to evaluate the same criteria. We then analyzed the correlation between the number of static analysis warnings and human style grading score for each criterion.
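As a rough illustration of how criteria such as “Lines are not too long” and “Code is not too deeply nested” can be checked mechanically, the sketch below counts one warning per overlong line and estimates nesting depth from curly braces. It is not one of the tools used in the study; the function names, the 80-column limit, the sample submission, and the brace-counting heuristic (which ignores the fact that braces may appear inside strings and comments) are all assumptions made for illustration.

```python
def count_long_line_warnings(source: str, max_length: int = 80) -> int:
    """Return the number of lines longer than max_length characters.

    max_length=80 is an assumed threshold, not a value from the paper.
    """
    return sum(1 for line in source.splitlines() if len(line) > max_length)


def max_brace_depth(source: str) -> int:
    """Estimate the deepest nesting level by tracking '{' and '}'.

    Crude heuristic for C-like code: braces inside string literals or
    comments are counted too, which a real static analyzer would avoid.
    """
    depth = 0
    deepest = 0
    for ch in source:
        if ch == "{":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == "}":
            depth = max(depth - 1, 0)
    return deepest


# Hypothetical student submission used only to exercise the two checks.
sample = (
    "int main() {\n"
    "  if (ready) {\n"
    "    while (running) { step(); }\n"
    "  }\n"
    "}\n"
)
long_line_warnings = count_long_line_warnings(sample)
nesting_depth = max_brace_depth(sample)
```

A grader built on checks like these would compare `long_line_warnings` and `nesting_depth` against course-specific limits. Because the same input always produces the same counts, criteria of this kind are natural candidates for automation.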

In our preliminary results, static analysis tools tend to be more effective at evaluating objective code style criteria. We expected student code with more static analysis warnings to receive fewer human style grading points, but we found only a weak negative correlation, or no correlation, between the human style grading score and the number of static analysis warnings. For example, comparing the “Lines are not too long” human style grading criterion to a related line-length static analysis inspection yields a Pearson correlation coefficient of r = -0.21. The distributions of human style grading scores also suggest that human graders perform inconsistently: 50% of students who received full human style grading points for the line-length criterion had 3 or more static analysis warnings from a related line-length inspection, and 23% of students who received no points on the same criterion had no static analysis warnings from that inspection.
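The two kinds of comparison in this paragraph can be sketched in a few lines: a Pearson correlation coefficient between per-student human scores and warning counts, and the share of students within one score group who satisfy a warning-count condition. The function names and the `(human_score, warnings)` record layout are assumptions, and the sample records are invented toy data, not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def share_in_group(records, score, condition):
    """Fraction of students with the given human score whose warning
    count satisfies condition (a function from int to bool)."""
    group = [warnings for human_score, warnings in records if human_score == score]
    if not group:
        raise ValueError("no students received that score")
    return sum(1 for w in group if condition(w)) / len(group)


# Invented toy data: (human score for the criterion, number of warnings).
records = [(2, 0), (2, 3), (2, 5), (2, 1), (0, 0), (0, 4)]

r = pearson_r([s for s, _ in records], [w for _, w in records])
# Share of full-score students with 3 or more warnings.
full_score_flagged = share_in_group(records, 2, lambda w: w >= 3)
# Share of zero-score students with no warnings at all.
zero_score_clean = share_in_group(records, 0, lambda w: w == 0)
```

If human and automated graders agreed, `r` would be strongly negative and both shares would be near zero; large shares indicate disagreement between the two grading methods.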

We also found that some code style criteria are not well suited to the general-purpose static analysis tools we investigated. For example, none of the static analysis tools we investigated provide a robust way of evaluating the quality of variable and function names in a program. Some tools provide an inspection for detecting variable names that are shorter than a user-specified length threshold; however, this inspection fails to identify low-quality variable names that happen to be longer than the minimum allowed length. Furthermore, there are some common scenarios where a short variable name is acceptable by convention.
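The limitation of a length-threshold inspection is easy to see in a small sketch. The function below flags names shorter than a minimum length. Its name, the threshold of 3, and the sample identifiers are hypothetical, and the loop index `i` is used here as an assumed example of a short name that is acceptable by convention.

```python
def short_name_warnings(names, min_length=3, allowed_short=frozenset()):
    """Return the names shorter than min_length that are not allow-listed.

    min_length and allowed_short are illustrative parameters, not the
    configuration of any particular static analysis tool.
    """
    return [
        name for name in names
        if len(name) < min_length and name not in allowed_short
    ]


# Hypothetical identifiers collected from a student program.
names = ["i", "x", "aaa", "thing2", "student_count"]

# Without an allow-list, the conventional loop index "i" is flagged.
flagged = short_name_warnings(names)
# With an allow-list, "i" passes, but so do the uninformative "aaa" and "thing2".
flagged_with_allowlist = short_name_warnings(names, allowed_short=frozenset({"i"}))
```

In both runs the uninformative names `aaa` and `thing2` pass because they meet the length threshold, which is the gap a human grader would be expected to catch.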

Static analysis tools can be integrated with an automated grading system, which facilitates faster and more frequent feedback than human grading. The literature suggests that frequent feedback encourages students to actively improve their work (Spacco and Pugh 2006). There is also evidence that increased engagement is most beneficial to students with less experience (Carini et al. 2006). Our results suggest that automated code quality evaluation could be one tool that benefits student learning in introductory CS courses, with the greatest benefit going to students who had the least access to CS training before college.

References
 - Carini, R.M., Kuh, G.D. & Klein, S.P. Res High Educ (2006) 47: 1.
 - Spacco, Jaime and Pugh, William. Helping students appreciate test-driven development (TDD). Proceedings of OOPSLA, pages 907–913, 2006.

Citation

Perretta, J., Weimer, W., & DeOrio, A. (2019, June). Human vs. Automated Coding Style Grading in Computing Education. Paper presented at 2019 ASEE Annual Conference & Exposition, Tampa, Florida. 10.18260/1-2--32906

TY  - CPAPER
AU  - James Perretta
AU  - Westley Weimer
AU  - Andrew Deorio
CY  - Tampa, Florida
DA  - 2019/06/15
PB  - ASEE Conferences
TI  - Human vs. Automated Coding Style Grading in Computing Education 
UR  - https://peer.asee.org/32906
DO  - 10.18260/1-2--32906
ER  -