
Model Cards for Model Reporting

Published: 29 January 2019
DOI: 10.1145/3287560.3287596

ABSTRACT

Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, we recommend that released models be accompanied by documentation detailing their performance characteristics. In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information. While we focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, we provide cards for two supervised models: One trained to detect smiling faces in images, and one trained to detect toxic comments in text. We propose model cards as a step towards the responsible democratization of machine learning and related artificial intelligence technology, increasing transparency into how well artificial intelligence technology works. We hope this work encourages those releasing trained machine learning models to accompany model releases with similar detailed evaluation numbers and other relevant documentation.
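The abstract describes model cards as short structured documents: named sections plus quantitative results disaggregated across demographic and intersectional groups. As a minimal sketch of how such a card could be represented and populated, the Python below encodes the card as a plain data structure and computes a toy metric over every intersection of two factors. Only the section names follow the paper's framework; the ModelCard dataclass, the disaggregated_eval helper, and the toy data and metric are hypothetical illustrations, not code from the paper or any released API.

# A minimal, hypothetical sketch of a model card as a data structure.
# Section names follow the paper's framework; all identifiers below are
# illustrative assumptions, not an official schema or API.
from dataclasses import dataclass, field
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

@dataclass
class ModelCard:
    model_details: str
    intended_use: str
    factors: List[str]        # e.g. ["age_group", "sex", "fitzpatrick_skin_type"]
    metrics: List[str]        # e.g. ["false_positive_rate"]
    evaluation_data: str
    training_data: str
    quantitative_analyses: Dict[Tuple[str, ...], Dict[str, float]] = field(default_factory=dict)
    ethical_considerations: str = ""
    caveats_and_recommendations: str = ""

def disaggregated_eval(examples: Sequence[dict],
                       factors: Sequence[str],
                       metric_fns: Dict[str, Callable[[Sequence[dict]], float]]):
    """Compute each metric on every intersectional subgroup of `factors`."""
    values = {f: sorted({ex[f] for ex in examples}) for f in factors}
    results: Dict[Tuple[str, ...], Dict[str, float]] = {}
    for combo in product(*(values[f] for f in factors)):
        subgroup = [ex for ex in examples
                    if all(ex[f] == v for f, v in zip(factors, combo))]
        if subgroup:
            results[combo] = {name: fn(subgroup) for name, fn in metric_fns.items()}
    return results

def false_positive_rate(subgroup: Sequence[dict]) -> float:
    # Fraction of true negatives (label == 0) that the model flagged (pred == 1).
    negatives = [ex for ex in subgroup if ex["label"] == 0]
    return sum(ex["pred"] for ex in negatives) / len(negatives) if negatives else 0.0

# Toy evaluation set: each example carries its group attributes,
# a ground-truth label, and a (hand-coded) model prediction.
examples = [
    {"age_group": "18-30", "sex": "female", "label": 0, "pred": 1},
    {"age_group": "18-30", "sex": "male",   "label": 0, "pred": 0},
    {"age_group": "31-60", "sex": "female", "label": 1, "pred": 1},
    {"age_group": "31-60", "sex": "male",   "label": 0, "pred": 1},
]

card = ModelCard(
    model_details="Toy smiling-detection classifier (illustrative only)",
    intended_use="Demonstration of disaggregated, intersectional reporting",
    factors=["age_group", "sex"],
    metrics=["false_positive_rate"],
    evaluation_data="The hand-written toy examples above",
    training_data="None (predictions are hand-coded)",
)
card.quantitative_analyses = disaggregated_eval(
    examples, card.factors, {"false_positive_rate": false_positive_rate})
print(card.quantitative_analyses)

Running the sketch prints one false-positive-rate entry per (age group, sex) subgroup; an actual model card would report such disaggregated numbers alongside the intended-use, data, and ethical-considerations sections the paper describes.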

References

  1. Avrio AI. 2018. Avrio AI: AI Talent Platform. (2018). https://www.goavrio.com/
  2. Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine Bias. (2016). https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
  3. Emily M. Bender and Batya Friedman. 2018. Data Statements for NLP: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics (TACL) (2018).
  4. Joy Buolamwini. 2016. How I'm Fighting Bias in Algorithms. (2016). https://www.ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms#t-63664
  5. Joy Buolamwini and Timnit Gebru. 2018. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency (Proceedings of Machine Learning Research), Sorelle A. Friedler and Christo Wilson (Eds.), Vol. 81. PMLR, New York, NY, USA, 77--91. http://proceedings.mlr.press/v81/buolamwini18a.html
  6. Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5, 2 (2017), 153--163.
  7. Federal Trade Commission. 2016. Big Data: A Tool for Inclusion or Exclusion? Understanding the Issues. (2016). https://www.ftc.gov/reports/big-data-tool-inclusion-or-exclusion-understanding-issues-ftc-report
  8. Kimberle Crenshaw. 1989. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. U. Chi. Legal F. (1989), 139.
  9. Black Desi. 2009. HP computers are racist. (2009). https://www.youtube.com/watch?v=t4DT3tQqgRM
  10. William Dieterich, Christina Mendoza, and Tim Brennan. 2016. COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity. (2016). https://www.documentcloud.org/documents/2998391-ProPublica-Commentary-Final-070616.html
  11. Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (2018).
  12. Cynthia Dwork. 2008. Differential Privacy: A Survey of Results. In Theory and Applications of Models of Computation, Manindra Agrawal, Dingzhu Du, Zhenhua Duan, and Angsheng Li (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1--19.
  13. Entelo. 2018. Recruitment Software | Entelo. (2018). https://www.entelo.com/
  14. Daniel Faggella. 2018. Follow the Data: Deep Learning Leads the Transformation of Enterprise - A Conversation with Naveen Rao. (2018).
  15. Thomas B. Fitzpatrick. 1988. The validity and practicality of sun-reactive skin types I through VI. Archives of Dermatology 124, 6 (1988), 869--871.
  16. Food and Drug Administration. 1989. Guidance for the Study of Drugs Likely to Be Used in the Elderly. (1989).
  17. U.S. Food and Drug Administration. 2013. FDA Drug Safety Communication: Risk of next-morning impairment after use of insomnia drugs; FDA requires lower recommended doses for certain drugs containing Zolpidem (Ambien, Ambien CR, Edluar, and Zolpimist). (2013). https://web.archive.org/web/20170428150213/https://www.fda.gov/drugs/drugsafety/ucm352085.htm
  18. IIHS (Insurance Institute for Highway Safety: Highway Loss Data Institute). 2003. Special Issue: Side Impact Crashworthiness. Status Report 38, 7 (2003).
  19. Institute for the Future and Omidyar Network's Tech and Society Solutions Lab. 2018. Ethical OS. (2018). https://ethicalos.org/
  20. Clare Garvie, Alvaro Bedoya, and Jonathan Frankle. 2016. The Perpetual Line-Up. (2016). https://www.perpetuallineup.org/
  21. Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna M. Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for Datasets. CoRR abs/1803.09010 (2018). http://arxiv.org/abs/1803.09010
  22. Google. 2018. Responsible AI Practices. (2018). https://ai.google/education/responsible-ai-practices
  23. Gooru. 2018. Navigator for Teachers. (2018). http://gooru.org/about/teachers
  24. Cyril Goutte and Eric Gaussier. 2005. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European Conference on Information Retrieval. Springer, 345--359.
  25. Gary S. Collins, Johannes B. Reitsma, Douglas G. Altman, and Karel G. M. Moons. 2015. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. Annals of Internal Medicine 162, 1 (2015), 55--63.
  26. Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 3315--3323. http://papers.nips.cc/paper/6374-equality-of-opportunity-in-supervised-learning.pdf
  27. Michael Hind, Sameep Mehta, Aleksandra Mojsilovic, Ravi Nair, Karthikeyan Natesan Ramamurthy, Alexandra Olteanu, and Kush R. Varshney. 2018. Increasing Trust in AI Services through Supplier's Declarations of Conformity. CoRR abs/1808.07261 (2018).
  28. Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2018. The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards. CoRR abs/1805.03677 (2018). http://arxiv.org/abs/1805.03677
  29. Ideal. 2018. AI For Recruiting Software | Talent Intelligence for High-Volume Hiring. (2018). https://ideal.com/
  30. DrivenData Inc. 2018. An Ethics Checklist for Data Scientists. (2018). http://deon.drivendata.org/
  31. Jigsaw. 2017. Conversation AI Research. (2017). https://conversationai.github.io/
  32. Jigsaw. 2017. Perspective API. (2017). https://www.perspectiveapi.com/
  33. Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. 2018. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In ICML (2018).
  34. Brendan F. Klare, Mark J. Burge, Joshua C. Klontz, Richard W. Vorder Bruegge, and Anil K. Jain. 2012. Face recognition performance: Role of demographic information. IEEE Transactions on Information Forensics and Security 7, 6 (2012), 1789--1801.
  35. Der-Chiang Li, Susan C. Hu, Liang-Sian Lin, and Chun-Wu Yeh. 2017. Detecting representative data and generating synthetic samples to improve learning accuracy with imbalanced data sets. PLoS ONE 12, 8 (2017), e0181853.
  36. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In Proceedings of the International Conference on Computer Vision (ICCV).
  37. Shira Mitchell, Eric Potash, and Solon Barocas. 2018. Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions. arXiv:1811.07867 (2018).
  38. Pramod Kaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and Kedar Dhamdhere. 2018. Did the Model Understand the Question? In Proceedings of the Association for Computational Linguistics (2018).
  39. AI Now Institute. 2018. Litigating Algorithms: Challenging Government Use of Algorithmic Decision Systems. AI Now Institute.
  40. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 311--318.
  41. Inioluwa Raji. 2018. Black Panther Face Scorecard: Wakandans Under the Coded Gaze of AI. (2018).
  42. Microsoft Research. 2018. Project InnerEye - Medical Imaging AI to Empower Clinicians. (2018). https://www.microsoft.com/en-us/research/project/medical-image-analysis/
  43. Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. PMLR, Sydney, Australia.
  44. Digital Reasoning Systems. 2018. AI-Enabled Cancer Software | Healthcare AI: Digital Reasoning. (2018). https://digitalreasoning.com/solutions/healthcare/
  45. Turnitin. 2018. Revision Assistant. (2018). http://turnitin.com/en_us/what-we-offer/revision-assistant
  46. Shannon Vallor, Brian Green, and Irina Raicu. 2018. Ethics in Technology Practice: An Overview. (June 22, 2018). https://www.scu.edu/ethics-in-technology-practice/overview-of-ethics-in-tech-practice/
  47. Lucy Vasserman, John Li, CJ Adams, and Lucas Dixon. 2018. Unintended bias and names of frequently targeted groups. Medium (2018). https://medium.com/the-false-positive/unintended-bias-and-names-of-frequently-targeted-groups-8e0b81f80a23
  48. Sahil Verma and Julia Rubin. 2018. Fairness Definitions Explained. (2018).
  49. Joz Wang. 2010. Flickr Image. (2010). https://www.flickr.com/photos/jozjozjoz/3529106844
  50. Amy Westervelt. 2018. The medical research gender gap: How excluding women from clinical trials is hurting our health. (2018).
  51. Mingyuan Zhou, Haiting Lin, S. Susan Young, and Jingyi Yu. 2018. Hybrid sensing face detection and registration for low-light and unconstrained conditions. Applied Optics 57, 1 (2018), 69--78.

Published in

FAT* '19: Proceedings of the Conference on Fairness, Accountability, and Transparency
January 2019, 388 pages
ISBN: 9781450361255
DOI: 10.1145/3287560
Copyright © 2019 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
