Google's Gemini is only 6 months old, but it has already demonstrated impressive capabilities in safety, coding, debugging and other areas (though it has also shown serious limitations).
Now, the large language model (LLM) has outperformed humans when it comes to sleep and fitness advice.
Researchers from Google have introduced the Personal Health Large Language Model (PH-LLM), a version of Gemini fine-tuned to understand and reason about time-series personal health data from wearable devices such as smartwatches and heart rate monitors. In their experiments, the model answered questions and made predictions noticeably better than experts with years of experience in the health and fitness fields.
“Our work… leverages generative artificial intelligence to expand the model’s utility from merely predicting health states to providing consistent, contextual, and potentially normative results that depend on complex health behaviors,” the researchers write.
Gemini as a sleep and fitness expert
Wearable technology can help people monitor and, ideally, make meaningful changes to their health. As the Google researchers point out, these devices provide a “rich and comprehensive source of data” for monitoring personal health, which is “passively and continuously acquired” from inputs including exercise and diet logs, mood journals, and sometimes even social media activity.
However, the data they collect on sleep, physical activity, cardiometabolic health and stress are rarely incorporated into clinical settings that are “intermittent in nature.” The researchers suggest this is most likely because the data is captured without context and requires a lot of computation to store and analyze. Moreover, it can be difficult to interpret.
Additionally, while LLMs are good at answering medical questions, analyzing electronic health records, making diagnoses from medical images, and conducting psychiatric evaluations, they often lack the ability to draw conclusions and make recommendations from wearable-device data.
However, the Google researchers made a breakthrough by training PH-LLM to make recommendations, answer exam questions, and predict self-reported sleep disorders and their effects. The model was given multiple-choice questions, and the researchers also used chain-of-thought methods (imitating step-by-step human reasoning) and zero-shot methods (handling a task without having been shown examples of it beforehand).
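To make the two prompting styles concrete, here is a minimal sketch of how a zero-shot prompt and a chain-of-thought prompt for a multiple-choice question might be assembled. The function names, the instruction wording, the question and the answer options are all illustrative assumptions; the article does not reproduce the researchers' actual exam prompts.

```python
from string import ascii_uppercase


def format_question(question: str, options: list[str]) -> str:
    """Render a question followed by lettered answer options (A, B, C, ...)."""
    lines = [question]
    for letter, option in zip(ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


def zero_shot_prompt(question: str, options: list[str]) -> str:
    """Zero-shot: no worked examples, just ask for the answer directly."""
    return format_question(question, options) + "\nAnswer with a single letter."


def chain_of_thought_prompt(question: str, options: list[str]) -> str:
    """Chain-of-thought: ask for step-by-step reasoning before the answer."""
    return (
        format_question(question, options)
        + "\nThink through the problem step by step, "
        + "then give the final answer as a single letter."
    )


# Hypothetical question and options, invented purely for illustration.
question = "Which sleep stage is most closely associated with vivid dreaming?"
options = ["Light sleep", "Deep sleep", "REM sleep", "Wakefulness"]

print(zero_shot_prompt(question, options))
print()
print(chain_of_thought_prompt(question, options))
```

The only difference between the two prompts is the final instruction, which is what lets the same question set be scored under both styles.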
Impressively, PH-LLM scored 79% on the sleep exam and 88% on the fitness exam, both of which exceeded the average scores of a sample of human experts, including five professional athletic trainers (with an average of 13.8 years of experience) and five sleep medicine experts (with an average of 25 years of experience). The human experts achieved an average score of 71% in fitness and 76% in sleep.
In one example of coaching recommendations, researchers prompted the model: “You are an expert in sleep medicine. You get the following sleep data. The user is a man, age 50. List the most important observations.”
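As a quick check on those figures, the short sketch below computes the model's lead over the human experts in each domain. The scores are the ones reported above; the dictionary layout and the function name are just one way to organize them.

```python
# Exam scores in percent, as reported in the article.
scores = {
    "sleep": {"ph_llm": 79, "experts": 76},
    "fitness": {"ph_llm": 88, "experts": 71},
}


def margin(domain: str) -> int:
    """Return PH-LLM's lead over the expert average, in percentage points."""
    return scores[domain]["ph_llm"] - scores[domain]["experts"]


for domain in scores:
    print(f"{domain}: PH-LLM leads by {margin(domain)} percentage points")
# sleep: PH-LLM leads by 3 percentage points
# fitness: PH-LLM leads by 17 percentage points
```

The gap is much wider in fitness (17 points) than in sleep (3 points).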
PH-LLM replied, “They have trouble falling asleep…deep enough sleep [is] important for physical recovery.” The model further advised: “Make sure your bedroom is cool and dark…avoid naps and keep a consistent sleep schedule.”
Meanwhile, when asked about the type of contraction of the pectoralis major muscle “during the slow, controlled phase of the downward bench press” and given four answer options, PH-LLM correctly answered “eccentric.”
For patient-reported outcomes, researchers asked the model: “Based on this wearable data, would the user report having difficulty falling asleep?”, to which it replied: “This person will likely report having difficulty falling asleep several times within the last month.”
The researchers note: “While further development and evaluation is needed in the field of personal health where safety is critical, these results demonstrate both the broad knowledge base and capabilities of the Gemini models.”
Gemini can offer personalized insights
To achieve these results, researchers first created and curated three datasets that tested personalized insights and recommendations based on recorded physical activity, sleep patterns, and physiological responses; expert domain knowledge; and predictions of self-reported sleep quality.
Working with domain experts, they created 857 case studies representing real-world sleep and fitness scenarios: 507 for sleep and 350 for fitness. The sleep scenarios use individual metrics to identify potential triggers and provide personalized recommendations to improve sleep quality. The fitness tasks use information from training, sleep, health metrics and user feedback to create recommendations for the intensity of physical activity on a given day.
Both categories of case studies included data from wearable sensors – up to 29 days for sleep and over 30 days for fitness – in addition to demographic information (age and gender) and expert evaluation.
Sensor data included overall sleep score, resting heart rate and changes in heart rate variability, sleep duration (start and end time), minutes awake, restlessness, percentage of REM sleep time, respiratory rate, step count, and fat-burning minutes.
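To illustrate how daily readings like these could be turned into text an LLM can read, here is a rough sketch that stores a subset of the metrics in a record and wraps them in the coaching prompt quoted earlier in the article. The class, the field names, the line format and the sample values are all hypothetical; the article does not describe how the researchers actually encoded sensor data.

```python
from dataclasses import asdict, dataclass


@dataclass
class DailyRecord:
    """One day of wearable readings (a hypothetical subset of the metrics)."""
    sleep_score: int
    resting_heart_rate: float
    minutes_awake: int
    rem_sleep_pct: float
    respiratory_rate: float
    step_count: int
    fat_burn_minutes: int


def record_line(day: int, record: DailyRecord) -> str:
    """Serialize one day's readings as a single 'name=value' text line."""
    fields = ", ".join(f"{name}={value}" for name, value in asdict(record).items())
    return f"Day {day}: {fields}"


def build_coaching_prompt(sex: str, age: int, records: list[DailyRecord]) -> str:
    """Insert the serialized days into the coaching prompt quoted in the article."""
    data = "\n".join(record_line(i, r) for i, r in enumerate(records, start=1))
    return (
        "You are an expert in sleep medicine. You get the following sleep data.\n"
        f"{data}\n"
        f"The user is a {sex}, age {age}. List the most important observations."
    )


# Made-up sample values, for illustration only.
records = [
    DailyRecord(72, 61.0, 48, 18.5, 14.2, 8450, 35),
    DailyRecord(80, 59.5, 31, 21.0, 13.9, 10230, 52),
]
print(build_coaching_prompt("man", 50, records))
```

Flattening each day into one line keeps a multi-week history compact, although a real pipeline would also need to handle missing days and units.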
“Our study demonstrates that PH-LLM is able to integrate passively obtained objective data from wearable devices into personalized insights, potential causes of observed behaviors, and recommendations for improving sleep hygiene and fitness outcomes,” the researchers write.
There is still a lot of work to be done in personal health apps
The researchers acknowledge, nonetheless, that PH-LLM is just a start and, like any emerging technology, has kinks to be worked out. For example, responses generated by the model were not always consistent, there were “marked differences” in confabulations across case studies, and the LLM's responses were sometimes conservative or cautious.
In the fitness case studies, the model was sensitive to overtraining, and in one case experts noted that it failed to identify sleep deprivation as a potential cause of harm. Additionally, the case studies sampled broadly across demographics and drew on relatively active people, so they were likely not fully representative of the population and could not address broader sleep and fitness issues.
“We caution that much work remains to be done to ensure the reliability, safety, and equity of the LLM in personal health applications,” the researchers write. This includes further reducing confabulation, accounting for unique health circumstances that are not captured by sensor information, and ensuring that training data reflects a diverse population.
In conclusion, the researchers note: “The results of this study represent an important step toward LLMs that provide personalized information and recommendations that support individuals in achieving their health goals.”