research-article

Understanding Expert Disagreement in Medical Data Analysis through Structured Adjudication

Published: 07 November 2019

Abstract

Expert disagreement is pervasive in clinical decision making, and collective adjudication is a useful approach for resolving divergent assessments. Prior work shows that expert disagreement can arise from diverse factors, including expert background, the quality and presentation of data, and guideline clarity. In this work, we study how these factors predict initial discrepancies in the context of medical time series analysis, why certain disagreements persist after adjudication, and how adjudication impacts clinical decisions. Results from a case study with 36 experts and 4,543 adjudicated cases in a sleep stage classification task show that these factors contribute to both initial disagreement and resolvability, each in a distinct way. We provide evidence suggesting that structured adjudication can lead to significant revisions in treatment-relevant clinical parameters. Our work demonstrates how structured adjudication can support consensus and facilitate a deep understanding of expert disagreement in medical data analysis.

Published in

Proceedings of the ACM on Human-Computer Interaction, Volume 3, Issue CSCW
November 2019
5026 pages
EISSN: 2573-0142
DOI: 10.1145/3371885

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States
