The inherent safety and versatility of ultrasound imaging have made it widely accessible in modern clinical settings for disease diagnosis and health management. Artificial intelligence (AI) that can effectively learn ultrasound representations by integrating multi-source data holds significant promise for advancing clinical care. However, the scarcity of large labeled datasets in real-world clinical environments and the limited generalizability of task-specific models have hindered the development of generalizable clinical AI models for ultrasound applications. In this study, we present EchoCare, a novel ultrasound foundation model for generalist clinical use, developed via self-supervised learning on our curated, publicly available, large-scale unlabeled dataset EchoCareData. EchoCareData comprises 4.5 million ultrasound images sourced from over 20 countries across 5 continents and acquired with a diverse range of imaging devices, thus encompassing global cohorts that are multi-center, multi-device, and multi-ethnic. Unlike prior studies that adopt off-the-shelf vision foundation model architectures, we introduce a hierarchical classifier into EchoCare to enable joint learning of pixel-level and representation-level features, capturing both global anatomical contexts and local ultrasound characteristics. With minimal training, EchoCare outperforms state-of-the-art comparison models across 10 representative downstream ultrasound benchmarks of varying diagnostic difficulty, spanning disease diagnosis, lesion segmentation, organ detection, landmark prediction, quantitative regression, image enhancement and report generation. The code and pretrained model are publicly released, making EchoCare accessible for fine-tuning and local adaptation and supporting extensibility to additional applications. EchoCare provides a fully open and generalizable foundation model to boost the development of AI technologies for diverse clinical ultrasound applications.
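As a rough illustration of the joint pixel-level and representation-level learning described above, the following PyTorch sketch pairs a patch-reconstruction branch with a global pseudo-class classifier on top of encoder tokens. All module names, dimensions, and the loss weighting are illustrative assumptions, not EchoCare's actual architecture or training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalHead(nn.Module):
    """Toy two-branch head: a pixel-level decoder for local ultrasound detail
    and a representation-level classifier for global anatomical context.
    Names and sizes are illustrative, not EchoCare's released design."""

    def __init__(self, embed_dim=768, num_pseudo_classes=128, patch=16):
        super().__init__()
        # Pixel-level branch: reconstruct each image patch from its token.
        self.pixel_decoder = nn.Linear(embed_dim, patch * patch)
        # Representation-level branch: classify the pooled image embedding
        # into pseudo-classes (e.g., cluster assignments from a pretext task).
        self.global_classifier = nn.Linear(embed_dim, num_pseudo_classes)

    def forward(self, tokens):
        # tokens: (B, N, D) patch embeddings from the backbone encoder.
        pixel_pred = self.pixel_decoder(tokens)                      # (B, N, patch*patch)
        global_logits = self.global_classifier(tokens.mean(dim=1))   # (B, C)
        return pixel_pred, global_logits


def joint_loss(pixel_pred, target_patches, global_logits, pseudo_labels, w=0.5):
    """Combine a pixel reconstruction loss with a representation-level
    classification loss; the weight w is an arbitrary placeholder."""
    rec = F.mse_loss(pixel_pred, target_patches)
    cls = F.cross_entropy(global_logits, pseudo_labels)
    return rec + w * cls


# Smoke test with random tensors standing in for encoder outputs.
if __name__ == "__main__":
    B, N, D = 2, 196, 768
    head = HierarchicalHead(embed_dim=D)
    pixel_pred, global_logits = head(torch.randn(B, N, D))
    loss = joint_loss(pixel_pred, torch.randn(B, N, 16 * 16),
                      global_logits, torch.randint(0, 128, (B,)))
    loss.backward()
    print(loss.item())
```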
EchoCareData integrates multi-center, multi-region, and multi-device sources, covering 23 hospitals across 5 continents and 20 countries, ensuring diversity in clinical practices, patient demographics, and imaging equipment.
We evaluated different foundation models on three representative ultrasound clinical benchmarks for anatomical segmentation: the DDTI dataset for thyroid nodule segmentation, the Mus-V dataset for arterial-venous vessel segmentation, and an abdominal multi-organ segmentation benchmark.
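As a rough sketch of how a pretrained ultrasound encoder could be fine-tuned for these segmentation benchmarks, the snippet below attaches a lightweight decoder that upsamples patch tokens to a per-pixel mask and reports a foreground Dice score. The encoder interface, token grid size, and class count are assumptions, not EchoCare's released API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationProbe(nn.Module):
    """Minimal fine-tuning setup: a pretrained encoder (stand-in callable here)
    followed by a 1x1-conv head that maps patch tokens to per-pixel logits."""

    def __init__(self, encoder, embed_dim=768, grid=14, num_classes=2):
        super().__init__()
        self.encoder = encoder          # callable: images -> (B, N, D) patch tokens
        self.grid = grid                # sqrt(N) for a square token grid
        self.head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, images):
        tokens = self.encoder(images)                                 # (B, N, D)
        B, N, D = tokens.shape
        feat = tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        logits = self.head(feat)                                      # (B, C, g, g)
        return F.interpolate(logits, size=images.shape[-2:],
                             mode="bilinear", align_corners=False)


def dice_score(pred_mask, gt_mask, eps=1e-6):
    """Foreground Dice, the metric typically reported for such benchmarks."""
    inter = (pred_mask * gt_mask).sum()
    return (2 * inter + eps) / (pred_mask.sum() + gt_mask.sum() + eps)


# Dummy encoder standing in for a pretrained backbone: (B, 3, 224, 224) -> (B, 196, 768).
dummy_encoder = lambda x: torch.randn(x.size(0), 14 * 14, 768)
probe = SegmentationProbe(dummy_encoder)
masks = probe(torch.randn(1, 3, 224, 224))        # (1, 2, 224, 224) logits
```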
We evaluated EchoCare on the low-quality ultrasound image enhancement task using the USenhance benchmark dataset, which encompasses real-world clinical scans from 109 patients across five anatomical regions: thyroid, kidney, liver, breast, and carotid artery.
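Enhancement quality on paired low- and high-quality scans is commonly summarized with PSNR; the helper below is a minimal sketch of that metric, and its use as the USenhance evaluation protocol is an assumption rather than a statement of the benchmark's official setup.

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between an enhanced image and its high-quality
    reference; both tensors are assumed to be scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Example with random tensors standing in for an enhanced/reference pair.
enhanced, reference = torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256)
print(f"PSNR: {psnr(enhanced, reference):.2f} dB")
```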
To evaluate the effectiveness of our developed foundation model in ultrasound report generation, we integrate EchoCare into an existing Transformer-based encoder–decoder report generator, which takes as input the global visual features extracted from ultrasound images. The integrated model is then fine-tuned on the USData Liver dataset, which contains paired ultrasound images and corresponding expert-written reports.
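A minimal sketch of this integration is given below, assuming the pretrained encoder exposes patch-level visual tokens that can serve as memory for a text decoder; the vocabulary size, layer counts, and encoder interface are hypothetical placeholders rather than the actual report-generation pipeline.

```python
import torch
import torch.nn as nn

class ReportGenerator(nn.Module):
    """Sketch of wiring a pretrained ultrasound encoder into a Transformer
    encoder-decoder report generator: visual features act as decoder memory,
    and the decoder predicts report tokens autoregressively."""

    def __init__(self, visual_encoder, vocab_size=8000, d_model=768, n_layers=3):
        super().__init__()
        self.visual_encoder = visual_encoder      # callable: images -> (B, N, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, report_tokens):
        memory = self.visual_encoder(images)                     # visual features
        tgt = self.token_embed(report_tokens)                    # (B, T, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(report_tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                                 # next-token logits


# Dummy encoder standing in for the pretrained backbone.
dummy_encoder = lambda imgs: torch.randn(imgs.size(0), 196, 768)
model = ReportGenerator(dummy_encoder)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 8000, (2, 32)))
print(logits.shape)   # (2, 32, 8000)
```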