Abstract
Expert disagreement is pervasive in clinical decision making, and collective adjudication is a useful approach for resolving divergent assessments. Prior work shows that expert disagreement can arise from diverse factors, including expert background, the quality and presentation of data, and guideline clarity. In this work, we study how these factors predict initial discrepancies in the context of medical time series analysis, why certain disagreements persist after adjudication, and how adjudication impacts clinical decisions. Results from a case study with 36 experts and 4,543 adjudicated cases in a sleep stage classification task show that these factors contribute to both initial disagreement and resolvability, though each in a different way. We provide evidence suggesting that structured adjudication can lead to significant revisions in treatment-relevant clinical parameters. Our work demonstrates how structured adjudication can support consensus and facilitate a deep understanding of expert disagreement in medical data analysis.
Index Terms
- Understanding Expert Disagreement in Medical Data Analysis through Structured Adjudication