DOI: 10.1145/3313831.3376290
research-article
Open Access

Expert Discussions Improve Comprehension of Difficult Cases in Medical Image Assessment

Published: 23 April 2020

ABSTRACT

Medical data labeling workflows critically depend on accurate assessments from human experts. Yet human assessments can vary markedly, even among medical experts. Prior research has demonstrated benefits of labeler training on performance. Here we utilized two types of labeler training feedback: highlighting incorrect labels for difficult cases ("individual performance" feedback), and expert discussions from adjudication of these cases. We presented ten generalist eye care professionals with either individual performance alone, or individual performance and expert discussions from specialists. Compared to performance feedback alone, seeing expert discussions significantly improved generalists' understanding of the rationale behind the correct diagnosis while motivating changes in their own labeling approach; and also significantly improved average accuracy on one of four pathologies in a held-out test set. This work suggests that image adjudication may provide benefits beyond developing trusted consensus labels, and that exposure to specialist discussions can be an effective training intervention for medical diagnosis.

    • Published in

      cover image ACM Conferences
      CHI '20: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems
      April 2020
      10688 pages
      ISBN:9781450367080
      DOI:10.1145/3313831

      Copyright © 2020 Owner/Author

      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 6,199 of 26,314 submissions, 24%
