Researchers used 3 million days of Apple Watch data to train disease detection AI

December 11, 2025

A new study by researchers at MIT and Empirical Health used 3 million person-days of Apple Watch data to develop a foundation model that predicts medical conditions with surprising accuracy. Here are the details.

A little background

While still Meta’s chief AI scientist, Yann LeCun proposed the Joint-Embedding Predictive Architecture (JEPA), which essentially teaches an AI to infer the meaning of missing data rather than the data itself.

In other words, when dealing with gaps in data, the model learns to predict what the missing parts represent, rather than trying to guess and reconstruct their exact values.

For example, for an image where some parts are masked and others are visible, JEPA embeds both the visible and masked regions in a shared space (hence "Joint-Embedding"), forcing the model to infer the representation of the masked regions from the visible context rather than the exact content that was hidden.
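To make the distinction concrete, here is a minimal Python sketch of a JEPA-style objective. It is a toy illustration, not Meta's implementation: the linear `encode` function, the random weights, and the patch values are all assumptions standing in for trained neural networks and real image patches. The only point is that the loss compares embeddings, not raw values.

```python
import random

def encode(vector, weights):
    """Toy linear layer: multiplies a vector by a weight matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
PATCH, EMB = 4, 2  # hypothetical sizes: 4 raw values per patch, 2-dim embeddings

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Random weights stand in for trained networks (an assumption for illustration).
context_encoder = random_matrix(EMB, PATCH)
target_encoder = random_matrix(EMB, PATCH)
predictor = random_matrix(EMB, EMB)

visible = [0.2, 0.4, 0.1, 0.9]  # the part of the input the model can see
hidden = [0.5, 0.3, 0.8, 0.7]   # the masked part

context_emb = encode(visible, context_encoder)
predicted_emb = encode(context_emb, predictor)  # guess at the hidden part's embedding
target_emb = encode(hidden, target_encoder)     # actual embedding of the hidden part

# JEPA-style loss: compare embeddings (EMB numbers), not raw values (PATCH numbers).
jepa_loss = mse(predicted_emb, target_emb)
print("loss in embedding space:", jepa_loss)
```

A reconstruction-based model would instead be scored on how closely its output matches the raw `hidden` values themselves.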

When the company released a model called I-JEPA in 2023, Meta said:

Last year, Yann LeCun, Chief AI Scientist at Meta, proposed a new architecture aimed at overcoming key limitations of today’s most advanced AI systems. His vision is to create machines that can learn internal models of how the world works, allowing them to learn faster, plan how to perform complex tasks, and quickly adapt to unfamiliar situations.

Since the publication of LeCun’s original JEPA research, this architecture has become a cornerstone of the field exploring “world models,” which move away from the token-prediction focus of LLMs and GPT-based systems.

In fact, LeCun recently left Meta to start a company focused solely on world models. This, he argues, is the true path to AGI.

So, 3 million days’ worth of Apple Watch data?

Okay, let’s get back to the study. A paper published a few months ago, “JETS: A Self-Supervised Joint Embedding Time Series Foundation Model for Behavioral Data in Healthcare,” was recently accepted for a NeurIPS workshop.

It applies JEPA’s joint-embedding approach to irregular multivariate time series, such as long-term wearable data where heart rate, sleep, activity, and other measurements are recorded inconsistently or have large gaps over time.

From the study:

This study utilizes a longitudinal dataset consisting of wearable device data collected from a cohort of 16,522 individuals (approximately 3 million person-days total). For each individual, 63 different time series metrics were recorded at daily or lower resolution. These indicators are categorized into five physiological and behavioral domains: cardiovascular health, respiratory health, sleep, physical activity, and general statistics.

Interestingly, only 15% of participants had labeled medical histories available for evaluation. This means that 85% of the data could not be used with traditional supervised learning approaches. Instead, JETS first learns from the complete dataset through self-supervised pre-training and is then fine-tuned on the labeled subset.

To make the whole thing work, the researchers turned each observation into a triplet of data: day, value, and metric type.

This allowed them to convert each observation into a token, which then went through a masking process, was encoded, and was fed to a predictor (to predict the embeddings of the missing patches).
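The triplet idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's code: the metric names, the readings, the `Observation` type, and the random 25% masking are all hypothetical, since the article does not describe the actual masking strategy.

```python
import random
from typing import NamedTuple

class Observation(NamedTuple):
    day: int
    value: float
    metric: str

# Hypothetical readings: each metric maps day index -> value, with gaps.
readings = {
    "resting_heart_rate": {0: 58.0, 1: 60.0, 3: 57.0},
    "sleep_hours": {0: 7.5, 3: 6.0},
    "step_count": {1: 8200.0, 2: 10400.0, 3: 4300.0},
}

def to_tokens(readings):
    """Flatten per-metric series into (day, value, metric) tokens, ordered by day."""
    tokens = [
        Observation(day, value, metric)
        for metric, series in readings.items()
        for day, value in series.items()
    ]
    return sorted(tokens, key=lambda t: (t.day, t.metric))

def split_masked(tokens, mask_ratio, rng):
    """Randomly hide a share of tokens; the rest stay visible as context."""
    n_masked = max(1, round(len(tokens) * mask_ratio))
    masked_idx = set(rng.sample(range(len(tokens)), n_masked))
    visible = [t for i, t in enumerate(tokens) if i not in masked_idx]
    masked = [t for i, t in enumerate(tokens) if i in masked_idx]
    return visible, masked

tokens = to_tokens(readings)
visible, masked = split_masked(tokens, mask_ratio=0.25, rng=random.Random(0))
print(len(tokens), "tokens:", len(visible), "visible,", len(masked), "masked")
```

Because each token carries its own day and metric type, a day with no reading simply produces no token, so gaps need no padding or imputation.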

Once that was completed, the researchers compared JETS to other baseline models (including previous versions of JETS based on the Transformer architecture) and evaluated it using AUROC and AUPRC, two standard measures of how well an AI distinguishes between positive and negative cases.

JETS achieved an AUROC of 86.8% for hypertension, 70.5% for atrial flutter, 81% for chronic fatigue syndrome, and 86.8% for sick sinus syndrome. It did not come out ahead every time, but the benefits are clear.

It is worth emphasizing that AUROC and AUPRC are not exactly accuracy metrics. They measure how well the model ranks or prioritizes likely cases, rather than how often its predictions are correct.
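A small Python example shows why the two can diverge. The labels and scores below are made up for illustration and have nothing to do with the study's data. AUROC is computed here from its standard definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one.

```python
def auroc(labels, scores):
    """Share of (positive, negative) pairs where the positive is scored higher; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """Share of cases classified correctly when scores at or above the threshold count as positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Made-up example: four healthy cases (0) and two with the condition (1).
labels = [0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40]

print("AUROC:", auroc(labels, scores))                    # 1.0
print("Accuracy:", round(accuracy(labels, scores), 3))    # 0.667
```

The model ranks both positive cases above every negative one, so AUROC is perfect, yet at a 0.5 threshold it flags nobody and accuracy is only about 67%.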

Overall, this study presents an interesting approach to getting the most insight, and potentially life-saving value, out of data that might otherwise be dismissed as incomplete or irregular. Some health metrics were recorded only 0.4% of the time, while others appeared in 99% of daily readings.

This study also supports the idea that new models and training techniques have a lot of potential to make use of data already being collected by everyday wearable devices, such as the Apple Watch, even when they are not worn 100% of the time.

You can read the full study here.