"By My Eyes" is a novel approach for integrating sensor data into Multimodal Large Language Models (MLLMs) by transforming long sequences of sensor data into visual inputs, such as graphs and plots. This method uses visual prompting to guide MLLMs in performing sensory tasks (e.g., human activity recognition, health monitoring) more efficiently and accurately than text-based methods.
Text-based methods that embed raw sensor readings directly in LLM prompts face challenges such as:

- High token costs, since long numeric sequences consume large parts of the context window
- Difficulty for LLMs in interpreting long streams of raw numbers
- Degraded accuracy as the amount of sensor data in the prompt grows
"By My Eyes" introduces visual prompts to represent sensor data as images (e.g., waveforms, spectrograms), making it easier for MLLMs to interpret. The key innovation is a visualization generator that automatically converts sensor data into optimal visual representations. This reduces token costs and enhances performance across various sensory tasks.
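The idea of turning a sensor window into an image for a visual prompt can be sketched as follows. This is an illustrative Python example using matplotlib, not the paper's actual visualization generator; the function name, figure settings, and sample rate are assumptions:

```python
import base64
import io

import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt


def sensor_window_to_image(window: np.ndarray, sample_rate: float = 50.0) -> bytes:
    """Render a (samples, channels) sensor window as a PNG waveform plot."""
    t = np.arange(window.shape[0]) / sample_rate
    fig, ax = plt.subplots(figsize=(4, 2), dpi=100)
    for ch in range(window.shape[1]):
        ax.plot(t, window[:, ch], linewidth=0.8, label=f"axis {ch}")
    ax.set_xlabel("time (s)")
    ax.set_ylabel("acceleration")
    ax.legend(loc="upper right", fontsize=6)
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()


# Example: a 2-second, 3-axis accelerometer window sampled at 50 Hz.
rng = np.random.default_rng(0)
png_bytes = sensor_window_to_image(rng.standard_normal((100, 3)))

# Most vision APIs accept base64-encoded images in the prompt payload;
# the single image replaces ~300 raw numbers serialized as text.
image_b64 = base64.b64encode(png_bytes).decode("ascii")
```

In a real pipeline, the resulting image would be attached to the MLLM prompt alongside the task instruction (e.g., "classify the activity shown in this accelerometer waveform").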
Steps of the method:

1. Segment the raw sensor stream into task-relevant windows.
2. Use the visualization generator to render each window as a candidate image (e.g., a waveform or spectrogram) and select the representation best suited to the task.
3. Feed the selected image, together with the task instructions, to the MLLM as a visual prompt.
"By My Eyes" was tested on nine sensory tasks across four modalities (accelerometer, ECG, EMG, and respiration sensors). The approach consistently outperformed text-based prompts, showing:
| Dataset | Modality | Task | Text Prompt Accuracy | Visual Prompt Accuracy | Token Reduction |
|---|---|---|---|---|---|
| HHAR | Accelerometer | Human activity recognition | 66% | 67% | 26.2× |
| PTB-XL | ECG | Arrhythmia detection | 73% | 80% | 3.4× |
| WESAD | Respiration | Stress detection | 48% | 61% | 49.8× |
The visual prompts also led to more efficient use of tokens, allowing MLLMs to handle larger datasets and more complex tasks without sacrificing accuracy.
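A back-of-envelope calculation shows where the savings come from; all numbers below are illustrative assumptions, not figures from the paper. Serializing each raw reading as text costs several tokens, while an entire window collapses into a single image with a roughly fixed token cost:

```python
# Illustrative token-cost comparison (assumed numbers, not from the paper).
samples_per_window = 50 * 60 * 3  # 60 s of 3-axis data at 50 Hz = 9,000 readings
tokens_per_number = 4             # assumed: digits, sign, decimal point, separator
text_tokens = samples_per_window * tokens_per_number

image_tokens = 765                # assumed fixed cost of one prompt image

reduction = text_tokens / image_tokens
print(f"text: {text_tokens} tokens, image: {image_tokens} tokens, "
      f"reduction: {reduction:.1f}x")
```

Under these assumptions the text prompt costs 36,000 tokens against a fixed image cost, a reduction of roughly 47×, which is in the same ballpark as the per-dataset reductions reported above.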
The "By My Eyes" method provides a cost-effective and performance-boosting solution for handling sensor data in MLLMs. By transforming raw sensor data into visual prompts, it addresses the limitations of text-based approaches, making it easier for MLLMs to solve real-world sensory tasks in fields like healthcare, environmental monitoring, and human activity recognition.