research-article

Understanding Expert Disagreement in Medical Data Analysis through Structured Adjudication

Authors:
Mike Schaekermann, University of Waterloo, Waterloo, ON, Canada
Graeme Beaton, University of Waterloo, Waterloo, ON, Canada
Minahz Habib, University of Toronto, Toronto, ON, Canada
Andrew Lim, Sunnybrook Health Sciences Centre, Toronto, ON, Canada
Kate Larson, University of Waterloo, Waterloo, ON, Canada
Edith Law, University of Waterloo, Waterloo, ON, Canada

Proceedings of the ACM on Human-Computer Interaction, Volume 3, Issue CSCW, Article No.: 76, pp 1–23. https://doi.org/10.1145/3359178

Published: 07 November 2019

Abstract

Expert disagreement is pervasive in clinical decision making, and collective adjudication is a useful approach for resolving divergent assessments. Prior work shows that expert disagreement can arise from diverse factors, including expert background, the quality and presentation of data, and guideline clarity. In this work, we study how these factors predict initial discrepancies in the context of medical time series analysis, why certain disagreements persist after adjudication, and how adjudication impacts clinical decisions. Results from a case study with 36 experts and 4,543 adjudicated cases in a sleep stage classification task show that these factors contribute to both initial disagreement and resolvability, each in its own way. We provide evidence suggesting that structured adjudication can lead to significant revisions in treatment-relevant clinical parameters. Our work demonstrates how structured adjudication can support consensus and facilitate a deep understanding of expert disagreement in medical data analysis.

Index Terms

1. Human-centered computing
  1. Collaborative and social computing
    1. Collaborative and social computing systems and tools
    2. Empirical studies in collaborative and social computing
  2. Human computer interaction (HCI)
    1. Interaction paradigms
      1. Collaborative interaction

Published in
Proceedings of the ACM on Human-Computer Interaction Volume 3, Issue CSCW
November 2019
5026 pages
EISSN:2573-0142
DOI:10.1145/3371885
Editors:
Airi Lampinen, Stockholm University, Sweden
Darren Gergle, Northwestern University, USA
David A. Shamma, FXPAL, USA
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
adjudication
ambiguity
disagreement
medical time series
Qualifiers
- research-article