ABSTRACT
Medical data labeling workflows critically depend on accurate assessments from human experts. Yet human assessments can vary markedly, even among medical experts. Prior research has demonstrated benefits of labeler training on performance. Here we utilized two types of labeler training feedback: highlighting incorrect labels for difficult cases ("individual performance" feedback), and expert discussions from adjudication of these cases. We presented ten generalist eye care professionals with either individual performance feedback alone, or individual performance feedback and expert discussions from specialists. Compared to performance feedback alone, seeing expert discussions significantly improved generalists' understanding of the rationale behind the correct diagnosis while motivating changes in their own labeling approach, and it also significantly improved average accuracy on one of four pathologies in a held-out test set. This work suggests that image adjudication may provide benefits beyond developing trusted consensus labels, and that exposure to specialist discussions can be an effective training intervention for medical diagnosis.
Supplemental Material
The auxiliary material for this publication consists of three data files using the comma-separated values (CSV) format. The CSV files can be opened using off-the-shelf spreadsheet software (e.g., LibreOffice Calc, Microsoft Excel, Google Sheets). Please find a short description of each file below.
Index Terms
- Expert Discussions Improve Comprehension of Difficult Cases in Medical Image Assessment