What are the limitations of current Vision-Language Models?

VLMs process and reason about visual and textual information, but they have:

gaps in contextual understanding
difficulties with spatial and temporal reasoning
reliance on large-scale data → may not generalize well to real-world scenarios

1. Lack of robust contextual reasoning

VLMs struggle with tasks that need a deeper understanding of context or common-sense knowledge

📷 Image: a person is holding an umbrella

VLM):
"A person holding an umbrella!" → ⭕

H): 
"Why is the person holding an umbrella?"

VLM):
"Umm... Because it's raining! (or sunny)" → ❌
(a guess: without additional visual cues such as visible rain or shadows, the model cannot tell)

misinterpreting cultural or situational context
EX) mistaking a traditional ceremonial outfit for everyday clothing
WHY) training data lacks sufficient diversity or domain-specific annotations

matters most where precise contextual understanding is needed
EX) medical imaging analysis
EX) historical artifact documentation

2. Handling complex spatial and temporal relationships → reasoning

Spatial reasoning

VLMs may misrepresent object positions or interactions in cluttered scenes.

Image): a dog is chasing a ball behind a couch
VLM): may fail to infer the ball's location relative to the dog
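
One way to sanity-check a spatial claim like the one above is to compare it against geometry from an object detector. The following is a minimal sketch, assuming hypothetical bounding boxes in (x_min, y_min, x_max, y_max) pixel coordinates with the origin at the top-left. The `Box` type and `horizontal_relation` helper are illustrative names, not part of any VLM library. It only handles left/right; a relation like "behind" would also need depth information, which 2D boxes do not carry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixels, origin at the top-left corner."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center_x(self) -> float:
        return (self.x_min + self.x_max) / 2


def horizontal_relation(subject: Box, reference: Box, tolerance: float = 5.0) -> str:
    """Return where `subject` sits relative to `reference` along the x-axis."""
    delta = subject.center_x - reference.center_x
    if abs(delta) <= tolerance:
        return "aligned with"
    return "right of" if delta > 0 else "left of"


# Hypothetical detector output for the dog/ball scene (made-up coordinates).
dog = Box(40, 200, 180, 320)
ball = Box(400, 260, 440, 300)

relation = horizontal_relation(ball, dog)
print(f"The ball is {relation} the dog.")
```

If the VLM's answer disagrees with the relation computed from the boxes, the answer can be flagged for review.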

Temporal reasoning
- video-based task → predicting the next action in a sequence
- EX) whether a person in a kitchen video will… → turn on a stove? open a fridge?

WHY) most VLMs process…

static images or short video clips as isolated snapshots, rather than as dynamic, interconnected events.
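
To see why treating frames as isolated snapshots loses temporal information, consider a toy sketch. It assumes each frame has already been reduced to a small feature vector (the numbers and action labels below are made up) and that the clip is summarized by averaging those vectors. Averaging ignores order, so two clips showing opposite sequences end up with the same summary. This is an illustration of the idea, not how any particular VLM is implemented.

```python
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors into a single clip-level vector."""
    n = len(frames)
    return [sum(values) / n for values in zip(*frames)]


# Hypothetical per-frame features for three moments in a kitchen video.
reach_for_fridge = [1.0, 0.0, 0.0]
open_fridge = [0.0, 1.0, 0.0]
take_out_food = [0.0, 0.0, 1.0]

forward_clip = [reach_for_fridge, open_fridge, take_out_food]
reversed_clip = list(reversed(forward_clip))

# Both orderings collapse to the same vector, so the order cannot be recovered.
forward_summary = mean_pool(forward_clip)
reversed_summary = mean_pool(reversed_clip)
print(forward_summary == reversed_summary)
```

Predicting the next action needs exactly the ordering that this kind of summary throws away.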

3. Computationally expensive to train / reliance on biased or incomplete datasets

Training a VLM (e.g., CLIP or Flamingo) requires massive amounts of paired image-text data
- can introduce biases!
- EX) overrepresenting Western contexts or stereotypical gender roles
Fine-tuning these models for specialized domains is often impractical
- EX) industrial quality control or satellite imagery analysis
- WHY) limited labeled data and the high cost of retraining
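
A quick audit can make the bias point above concrete. The following is a minimal sketch, assuming each image-text pair carries a hypothetical `region` metadata field (real datasets often lack such a field, and the records below are made up). It counts how the pairs are distributed and reports each group's share.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def group_shares(records: Iterable[Mapping[str, str]], key: str) -> Dict[str, float]:
    """Return the fraction of records that fall into each value of `key`."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


# Made-up metadata for image-text pairs; `region` is an assumed field.
pairs = [
    {"caption": "a wedding ceremony", "region": "western_europe"},
    {"caption": "a street market", "region": "western_europe"},
    {"caption": "a family dinner", "region": "north_america"},
    {"caption": "a tea ceremony", "region": "east_asia"},
]

shares = group_shares(pairs, "region")
for region, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{region}: {share:.0%}")
```

A heavily skewed distribution is a hint, not proof, that the trained model may generalize poorly to the underrepresented groups.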

4. Hallucination

VLMs can generate plausible but incorrect outputs, which can be hard to detect without human oversight.
EX) inventing object names or misattributing actions
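
One automated aid for human reviewers is to cross-check the objects a caption mentions against objects found by a separate detector. The following is a minimal sketch, assuming a hypothetical fixed vocabulary and a made-up detector result. Simple word matching like this misses synonyms and plurals, and it does not catch misattributed actions.

```python
import re
from typing import Set

# Assumed object vocabulary; a real one would come from the detector's label set.
VOCABULARY = {"person", "umbrella", "dog", "ball", "couch", "car"}


def mentioned_objects(caption: str, vocabulary: Set[str] = VOCABULARY) -> Set[str]:
    """Return vocabulary words that appear as whole words in the caption."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return words & vocabulary


def unsupported_objects(caption: str, detected: Set[str]) -> Set[str]:
    """Return objects named in the caption but not found by the detector."""
    return mentioned_objects(caption) - detected


caption = "A person holding an umbrella next to a car."
detected = {"person", "umbrella"}  # made-up detector output

flagged = unsupported_objects(caption, detected)
print(sorted(flagged))
```

Anything in `flagged` is only a candidate hallucination, since the detector itself can miss objects.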

[Source]

What are the limitations of current Vision-Language Models?
