VLMs can process and reason about visual and textual information, but key limitations remain:
- gaps in contextual understanding
- difficulties with spatial and temporal reasoning
- reliance on large-scale data → may not generalize well to real-world scenarios
1. Lack of robust contextual reasoning
- deeper understanding of context or common-sense knowledge
📷 Image: a person is holding an umbrella
VLM): "A person holding an umbrella!" → ⭕
H): "Why is the person holding an umbrella?"
VLM): "Umm... Because it's raining! (or sunny)" → ❌ (without additional visual cues, e.g. visible rain or shadows)
- misinterpreting cultural or situational context
EX) confusing a traditional ceremonial outfit with everyday clothing
EX) medical imaging analysis
EX) historical artifact documentation
WHY) training data lacks sufficient diversity or domain-specific annotations
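The umbrella exchange above can be sketched as a toy mock (not a real VLM — all names and behavior here are hypothetical): a model that answers only from directly visible evidence can describe the scene, but has no grounded basis for a causal "why" answer when the cue (rain, harsh sun) is absent.

```python
# Toy sketch of the contextual-reasoning gap (hypothetical mock, not a real VLM).
def mock_vlm_answer(visible_objects: set, question: str) -> str:
    """Answer only from directly visible evidence, like a surface-level VLM."""
    if question.startswith("What"):
        return "A person holding an umbrella."
    if question.startswith("Why"):
        # Causal cues are not in the image, so any causal answer is a guess.
        if "rain" in visible_objects:
            return "Because it is raining."
        return "Uncertain: no visual evidence of rain or strong sun."
    return "Unknown question type."

scene = {"person", "umbrella"}  # no rain, no shadows visible
print(mock_vlm_answer(scene, "What is the person doing?"))
print(mock_vlm_answer(scene, "Why is the person holding an umbrella?"))
```

A real VLM, by contrast, often fills this gap with a plausible-sounding but ungrounded guess instead of flagging the missing evidence.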
2. Handling complex spatial and temporal relationships → reasoning
- Spatial reasoning
- misrepresent object positions or interactions in cluttered scenes.
Image): a dog is chasing a ball behind a couch
VLM): may misrepresent the ball's location relative to the dog
- Temporal reasoning
- video-based tasks → predicting the next action in a sequence
EX) whether a person in a kitchen video will… → turn on a stove? open a fridge?
WHY) most VLMs process static images or short video clips (isolated snapshots) rather than dynamic, interconnected events.
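The snapshot-vs-sequence point can be made concrete with a toy sketch (the numbers and scenario are made up): each frame is reduced to a hand's distance from a fridge handle. A per-frame model sees no motion, while a sequence-aware model can read the trend.

```python
# Toy sketch: why isolated snapshots lose temporal information.
# Frames are simplified to a hand's distance from the fridge handle
# (hypothetical values, purely illustrative).

frames = [0.9, 0.6, 0.3]  # distance shrinking over time

def snapshot_prediction(frame: float) -> str:
    """A per-frame model sees one distance; no motion information."""
    return "open fridge" if frame < 0.2 else "unclear"

def sequence_prediction(frames: list) -> str:
    """A sequence-aware model can use the trend across frames."""
    deltas = [b - a for a, b in zip(frames, frames[1:])]
    approaching = all(d < 0 for d in deltas)
    return "open fridge" if approaching else "unclear"

print([snapshot_prediction(f) for f in frames])  # each frame alone: "unclear"
print(sequence_prediction(frames))               # the trend reveals intent
```

The isolated-snapshot model never crosses its evidence threshold, while the sequence model recovers the intent from motion — exactly the information most VLMs discard.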
3. Computationally expensive to train / reliance on biased or incomplete datasets
- Training a VLM (e.g., CLIP or Flamingo) requires massive amounts of paired image-text data
- can introduce biases!
- EX) overrepresenting Western contexts or stereotypical gender roles
- Fine-tuning these models for specialized domains is difficult
- EX) industrial quality control or satellite imagery analysis
- WHY) limited labeled data and the high cost of retraining
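The overrepresentation problem can be sketched with a tiny audit over a toy caption dataset (the captions and region labels below are invented for illustration): simply counting metadata per group already surfaces the skew.

```python
# Toy sketch: auditing a caption dataset for regional overrepresentation.
# The captions and labels are hypothetical, purely illustrative.
from collections import Counter

captions = [
    ("a wedding in a church", "Western"),
    ("a backyard barbecue", "Western"),
    ("a baseball game", "Western"),
    ("a tea ceremony", "East Asian"),
]

counts = Counter(region for _, region in captions)
total = sum(counts.values())
for region, n in counts.items():
    print(f"{region}: {n / total:.0%} of samples")
```

In practice this kind of audit runs over dataset metadata at scale, but the principle is the same: a model trained on such a skewed distribution inherits the skew.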
4. Hallucination
- can be hard to detect without human oversight.
- EX) inventing object names or misattributing actions
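One common automated check for object hallucination can be sketched as a set comparison (both lists below are hypothetical): objects mentioned in a generated caption are compared against objects an independent detector actually found in the image.

```python
# Toy sketch: a simple grounding check for caption hallucinations.
# Both object lists are hypothetical, for illustration only.

detected = {"dog", "couch", "ball"}            # from an object detector
caption_objects = {"dog", "couch", "frisbee"}  # mentioned in the caption

hallucinated = caption_objects - detected
print(sorted(hallucinated))  # objects the caption invented
```

This only catches object-level hallucinations (not misattributed actions or relations), which is why human oversight is still hard to remove entirely.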