
Operationalizing FUTURE-AI Guidelines to Evaluate AI in Healthcare: Where Can Monitoring and Evaluation (M&E) Experts Contribute?

Roxana Salehi, PhD



[Image: A smiling child sitting in a hospital, talking to her doctor with her caregiver present. Caption: AI tools in health care should be FUTURE (Fair, Universal, Traceable, Usable, Robust, Explainable).]

If we believe that artificial intelligence (AI) can revolutionize healthcare for patients, the question becomes: why does clinical adoption of emerging AI solutions remain challenging despite major advances in medical AI research? Many factors contribute to the lack of adoption, ranging from macroeconomic context to technological, organizational, regulatory, and health provider readiness. Among them, the trustworthiness of AI is a crucial issue we need to tackle, and in order to tackle it, you guessed it, we have to measure it. And to measure it, we have to define it and evaluate it.

In this article, I will analyze the FUTURE-AI framework through a program evaluation lens and answer six key questions about it:


  1. What is FUTURE-AI?

  2. Were there any program evaluation experts engaged in this consortium?

  3. What do they mean when they use the word “evaluation”?

  4. What's the value add of this framework for evaluating AI in healthcare?

  5. At which stages and how can program evaluation experts enhance and contribute to these guidelines?

  6. What is the call to action?


The first author, Dr. Karim Lekadir, presented this framework at the Four Years From Now (4YFN) conference during Mobile World Congress in Barcelona on March 4th. 4YFN is the world's largest event for startups, investors, and innovators shaping the future of connectivity, with a large emphasis on health and health tech. Seeing the guidelines presented live provided more context and practical examples of how to operationalize them.



1. What is FUTURE-AI?


The FUTURE-AI framework, published in January 2025, is an international, consensus-based guideline for the development and deployment of trustworthy AI tools in healthcare. Established by a group of 117 interdisciplinary experts from 50 countries, the framework is built on six key principles: Fairness, Universality, Traceability, Usability, Robustness, and Explainability. To operationalize these principles, the group has outlined 30 best practices covering the lifecycle of healthcare AI, from design, development, and validation to deployment and monitoring.


2. Were There Any Program Evaluation Experts Engaged in This Consortium?


The list of consortium members is publicly available and demonstrates broad expertise across data science, medical research, computer engineering, ethics, and related fields. However, a search for terms like “evaluation” or “evaluator” did not yield any explicit representation of program evaluation professionals.


This does not necessarily mean program evaluation expertise was absent, just that such expertise was not prominently labeled or, more likely, was bundled under 'Research'. (If you are still wondering whether research and evaluation are different or the same, do yourself a favour and read Wanzer, D. L. (2021). What Is Evaluation?: Perspectives of How Evaluation Differs (or Not) From Research. American Journal of Evaluation, 42(1), 28-46. Or Google 'Research and Evaluation Hour Glass' by Hallie Preskill, FSG Social Impact Advisors.)


3. What Do They Mean When They Use the Word “Evaluation”?


In the context of AI and evaluation, the word "evaluation" means different things depending on whom you ask. This may be evident to some, but I mention it because I've often seen misunderstandings and people talking at cross purposes:


  • Computer engineers, data scientists, and developers could be talking about model evaluation: we are talking accuracy, sensitivity, and specificity of the model, we are talking F1 score, Dice, etc. (a minimal sketch of these metrics follows this list).


  • Clinicians may be talking about different types of evaluation, including feasibility, safety, health outcomes, patient experience, or efficiency. We could be talking about operations research, process optimization, or quality control.


  • Anyone could be talking about clinical evaluation, a very specific term if AI is considered a medical device in your jurisdiction: the assessment and analysis of clinical data pertaining to a medical device in order to verify its clinical safety and performance.


  • And program evaluators? They are talking about a range of issues: the impact of AI on patients, clinicians, workflow, cost, efficiency, organizational change, population health outcomes, and environmental sustainability. We are talking Theories of Change, the RE-AIM framework, the Consolidated Framework for Implementation Research, participatory evaluation involving patients, etc.
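
To ground that first bullet, here is a minimal sketch of how those model-evaluation metrics are computed for a binary classifier. The toy labels and predictions are invented for illustration; in practice these numbers come from a held-out validation set, typically via a library such as scikit-learn.

    # Minimal sketch (not from the FUTURE-AI paper): the model-evaluation
    # metrics mentioned above, computed from a binary confusion matrix.
    # The labels and predictions below are toy values for illustration only.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # ground truth (1 = condition present)
    y_pred = [1, 0, 0, 1, 0, 1, 1, 1, 1, 0]  # model predictions

    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

    accuracy = (tp + tn) / len(y_true)      # overall agreement with ground truth
    sensitivity = tp / (tp + fn)            # recall: share of real cases detected
    specificity = tn / (tn + fp)            # share of non-cases correctly cleared
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)

    # For binary labels or segmentation masks, Dice = 2TP / (2TP + FP + FN),
    # which is algebraically identical to the F1 score.
    dice = 2 * tp / (2 * tp + fp + fn)

    print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, "
          f"specificity={specificity:.2f}, F1={f1:.2f}, Dice={dice:.2f}")

For program evaluators, the important point is that these numbers describe the model in isolation; none of them says anything about the AI tool's impact on patients, clinicians, or workflow.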


In the FUTURE-AI guidelines, the term “evaluation” is used to refer both to model validation and to evaluation of the impact of the AI model on users and clinicians. I was pleased to see #5 in Dr. Lekadir's "Take-Away Messages" slide.


[Image: Dr. Karim Lekadir's presentation at 4YFN, 2025]

4. What's the Value Add of This Framework for Evaluating AI in Healthcare?


Evaluating AI in healthcare is not straightforward. Currently, there is no universal or standard framework for doing it, and existing theories, models, and frameworks each serve different purposes. The FUTURE-AI guidelines stand out because they:

  • Are specific to healthcare

  • Focus on AI rather than general digital technology

  • Offer a global perspective rather than being confined to a particular jurisdiction

  • Cover the full lifecycle of AI tools, from design and development to validation and deployment

  • Consider the practical implications of implementing AI interventions in low-resource health systems


5. At Which Stages and How Can Program Evaluation Experts Enhance and Contribute to these Guidelines?


There are several logical places within this framework where monitoring and evaluation (M&E) professionals can make significant contributions. The points below highlight key opportunities; the list is not exhaustive.


A. Design Phase of AI Development


G7: Investigating Social and Environmental Issues: Currently, this recommendation is broad. In addition to the factors outlined in the framework, an iterative and flexible M&E plan could be established at the outset to examine the long-term population health, environmental, and societal impacts of “trustworthy AI tools”. Depending on the scope of the AI intervention (micro, meso, macro), not every AI solution is going to have direct implications for population health, but given that population health is one of the quintuple aims of health systems, it deserves an explicit mention in the guidelines.
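
As a purely illustrative sketch (the domains, indicators, and review cadences below are my own invention, not part of FUTURE-AI's G7), such an iterative M&E plan could start as a simple structured document that the team revisits on a schedule:

    # Hypothetical design-phase M&E plan; every indicator and cadence here
    # is invented for illustration and would be co-developed with stakeholders,
    # then revised as the AI tool moves through its lifecycle.
    me_plan = {
        "population_health": {
            "indicator": "screening coverage in the target population (%)",
            "baseline": None,               # measured before deployment
            "review_every_months": 6,
        },
        "environmental": {
            "indicator": "energy use per 1,000 inferences (kWh)",
            "baseline": None,
            "review_every_months": 12,
        },
        "societal": {
            "indicator": "equity of access across sites and subgroups",
            "baseline": None,
            "review_every_months": 6,
        },
    }

    def reviews_due(plan, months_since_launch):
        """Return the domains whose scheduled review falls in this month."""
        return [domain for domain, entry in plan.items()
                if months_since_launch % entry["review_every_months"] == 0]

    print(reviews_due(me_plan, months_since_launch=6))
    # -> ['population_health', 'societal']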



B. Validation Phase

Program evaluation experts are well-positioned to assess usability for patients and clinicians, cost, efficiency, safety, and training for end users. These align particularly well with FUTURE-AI’s Usability (4,5) and Explainability (2) principles.



We need more than surveys and focus groups at the usability testing phase; we need meaningful patient engagement, which could begin at the design phase. Patients could co-design AI solutions, using methods like participatory evaluation and patient research advisory groups (see Banerjee, Alsop, Jones & Cardinal, 2022 for a discussion).

C. Deployment Phase in Healthcare Settings

In this phase, key areas for program evaluators include integration with clinical workflows (Universality 4) and implementing creative, patient-centric methods for obtaining user feedback (Traceability 3).


Moreover, embedding program evaluation expertise into the multidisciplinary oversight team is essential to ensure that evaluation supports both accountability and learning, two foundational pillars of the discipline.



D. Evaluating Unintended and Long-Term Consequences of AI Post-Deployment


This is a tricky one and part of a larger debate. Post-deployment evaluations are vital to our understanding of what works, where, for whom, and under what circumstances. Health systems are meant to be learning systems, but in this context, post-release modifications may be complicated or slow. The authors of the FUTURE-AI guidelines point out that current regulations prevent post-release modifications because they would formally invalidate the manufacturer’s initial validation.


It's a false dichotomy to view regulations as either 'the problem hindering innovation and learning' or 'the thing that is going to protect us'. What we know is that regulation is happening, it's necessary, it's slow, and it's complicated.


[Image: 4YFN 2025. Matching the logo was not planned.]
One of my takeaways from 4YFN was that asking whether we should regulate or not regulate AI is not the right question to ask. A better question? How to craft regulations that can evolve more quickly?




Venues such as the Global Agency for Responsible AI Community of Practice are a good place to engage in these discussions and collaborate on solutions (as an example, in our last meeting we talked about regulatory sandboxes).


6. What Is the Call to Action?


The FUTURE-AI guidelines represent a significant global effort. I was pleased to see that they are intended to be a dynamic document, open to feedback. Looking at them through an M&E lens, I offer the following suggestions:


For FUTURE-AI Network

  • Strengthen the role of patients within the guidelines as potential co-designers of AI solutions.

  • Explicitly mention population health outcomes (one of the quintuple aims of health systems) within the framework.

  • Engage (or continue engaging) with M&E experts for future iterations of these valuable guidelines. M&E professionals can also support building use cases for implementing these guidelines in various contexts.

     

For Program/Project Evaluators

  • Read the paper and explore the consortium website. They have also started a webinar series, which I presume will dive deeper into each concept.


For AI Developers and Deployers

  • Engage program evaluation experts, whose expertise can complement data science, computer engineering, and clinical expertise, and engage them early, at the design phase. They can help interdisciplinary teams navigate AI’s complexities, ensuring evaluation efforts are robust, practical, context-specific, culturally appropriate, and relevant.



For Everyone



  • Learn about the Global Agency for Responsible AI and apply to join our Community of Practice to stay engaged in discussions related to trustworthy AI and regulations.


  • Another space to watch is the European project AHEAD (AI for Health: Evaluation of Applications & Datasets), coordinated by The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS). I will share what I learn about that project in a future post.


  • Join this conversation on LinkedIn

  • Email me at Roxana@vitus.ca if you would like to collaborate to advance these ideas or if you have a good resource to share.


Resources 


Banerjee, S., Alsop, P., Jones, L., & Cardinal, R. N. (2022). Patient and public involvement to build trust in artificial intelligence: A framework, tools, and case studies. Patterns, 3(6). https://doi.org/10.1016/j.patter.2022.100550

 

Brual, J., Rouleau, G., Fleury, C., Strom, M., Koshy, M., Rios, P., Bhattacharyya, O., Abejirinde, I. O. (2022). The Pan-Canadian Digital Health Evaluation Framework and Toolkit: Final Report (Version 1.0). Canadian Network for Digital Health Evaluation.


Collins, G. S., Moons, K. G. M., & Riley, R. D. (2024). TRIPOD+AI statement: Updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ, 385, e078378. https://doi.org/10.1136/bmj-2023-078378

 

Damschroder, L. J., Reardon, C. M., Opra Widerquist, M. A., & Lowery, J. (2022). The updated Consolidated Framework for Implementation Research based on user feedback. Implementation Science, 17, Article 75. https://doi.org/10.1186/s13012-022-01245-0


FUTURE-AI Consortium. (2025, January). FUTURE-AI framework: Guidelines for trustworthy AI in healthcare.

 

Glasgow, R. E., Harden, S. M., Gaglio, B., Rabin, B. A., Smith, M. L., Porter, G. C., Ory, M. G., & Estabrooks, P. A. (2019). RE-AIM Planning and Evaluation Framework: Adapting to New Science and Practice With a 20-Year Review. Frontiers in Public Health, 7, Article 64. https://doi.org/10.3389/fpubh.2019.00064


Health Canada. (2025, January 15). Pan-Canadian AI for Health (AI4H) Guiding Principles. Government of Canada. https://www.canada.ca/en/health-canada/corporate/transparency/health-agreements/pan-canadian-ai-guiding-principles.html

 

HealthAI. (2024, December 10). HealthAI Unveils Community of Practice to Advance Responsible AI in Health. https://www.healthai.agency/news/healthai-launches-community-of-practice-dec2024

 

Lekadir, K., Frangi, A. F., Porras, A. R., Glocker, B., Cintas, C., Langlotz, C. P., et al. (2025). FUTURE-AI: International consensus guideline for trustworthy and deployable artificial intelligence in healthcare. BMJ, 388, e081554. https://doi.org/10.1136/bmj-2024-081554

 

Nundy, S., Cooper, L. A., & Mate, K. S. (2022). The quintuple aim for health care improvement: A new imperative to advance health equity. JAMA, 327(6), 521-522.


Roppelt, J. S., Kanbach, D. K., & Kraus, S. (2024). Artificial intelligence in healthcare institutions: A systematic literature review on influencing factors. Technology in Society, 76, 102443.

 

Vasey, B., Nagendran, M., Campbell, B., Clifton, D. A., Collins, G. S., Denaxas, S., Denniston, A. K., Faes, L., Geerts, B., Ibrahim, M., Liu, X., Mateen, B. A., Mathur, P., McCradden, M. D., Morgan, L., Ordish, J., Rogers, C., Saria, S., Ting, D. S. W., Watkinson, P., Weber, W., Wheatstone, P., & McCulloch, P. (2022). Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nature Medicine, 28, 924–933. https://doi.org/10.1038/s41591-022-01772-9

 

World Health Organization. (2024, January 18). WHO releases AI ethics and governance guidance for large multi-modal models. https://www.who.int/news/item/18-01-2024-who-releases-ai-ethics-and-governance-guidance-for-large-multi-modal-models

 

 
 
 
