What are the limitations of current Vision-Language Models?

2025. 10. 5. 02:27·Paper

Vision-Language Models (VLMs) process and reason about visual and textual information, but current models still have several limitations:

  • gaps in contextual understanding
  • difficulties with spatial and temporal reasoning
  • reliance on large-scale data → does not generalize well to real-world scenarios

 

1. Lack of robust contextual reasoning

  • VLMs lack a deeper understanding of context and common-sense knowledge
    📷 Image: a person is holding an umbrella
    
    VLM):
    "A person holding an umbrella!" → ⭕
    
    H): 
    "Why is the person holding an umbrella?"
    
    VLM):
    "Umm... Because it's raining! (or sunny)" → ❌  
    (a guess made without additional visual cues such as visible rain or shadows)
  • misinterpreting cultural or situational context
    EX) mistaking a traditional ceremonial outfit for everyday clothing
    EX) medical imaging analysis
    EX) historical artifact documentation
    WHY) training data lacks sufficient diversity or domain-specific annotations

 

2. Handling complex spatial and temporal relationships → reasoning

  • Spatial reasoning
    • VLMs may misrepresent object positions or interactions in cluttered scenes.
    Image) a dog is chasing a ball behind a couch
    VLM) may describe the ball's location relative to the dog incorrectly
  • Temporal reasoning
    • video-based tasks → predicting the next action in a sequence
      EX) whether a person in a kitchen video will… → turn on a stove? open a fridge?

WHY) most VLMs process static images or short video clips as isolated snapshots rather than as dynamic, interconnected events.
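To make the "isolated snapshots" point concrete, here is a minimal sketch. It assumes a setup where a video model only sees a few evenly spaced frames from a clip; that setup is my assumption for illustration, not something the source specifies. The function names `sample_frame_indices` and `event_is_seen` are hypothetical.

```python
def sample_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick evenly spaced frame indices (the middle frame of each segment)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def event_is_seen(event_start: int, event_end: int, sampled: list[int]) -> bool:
    """True if at least one sampled frame falls inside the event's frame range."""
    return any(event_start <= idx <= event_end for idx in sampled)


# Illustrative numbers only: a 10 s kitchen clip at 30 fps, reduced to 8 frames.
sampled = sample_frame_indices(num_frames=300, num_samples=8)
print(sampled)                            # [18, 56, 93, 131, 168, 206, 243, 281]
print(event_is_seen(100, 120, sampled))   # False: a brief action falls between samples
print(event_is_seen(90, 100, sampled))    # True: frame 93 happens to land inside it
```

With sparse sampling like this, a short action (for example, a hand reaching for the stove knob) may never appear in any frame the model receives, which is one way next-action prediction can fail.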

 

 

3. Computationally expensive to train / reliance on biased or incomplete datasets

  • Training a VLM (e.g., CLIP or Flamingo) requires massive amounts of paired image-text data
    • this data can introduce biases!
    • EX) overrepresenting Western contexts or stereotypical gender roles
  • Fine-tuning these models for specialized domains is challenging
    • EX) industrial quality control or satellite imagery analysis
    • WHY) limited labeled data and the high cost of retraining
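One rough way to spot the kind of dataset skew mentioned above is to count how often captions mention each group. The sketch below is only an illustration: the function name `group_shares`, the keyword lists, and the sample captions are all hypothetical, and whitespace tokenization is a simplification (it does not strip punctuation).

```python
from collections import Counter


def group_shares(captions: list[str], group_keywords: dict[str, set[str]]) -> dict[str, float]:
    """Return each group's share of the group mentions found in the captions."""
    counts = Counter()
    for caption in captions:
        words = set(caption.lower().split())
        for group, keywords in group_keywords.items():
            if words & keywords:  # caption mentions at least one keyword of this group
                counts[group] += 1
    total = sum(counts.values())
    return {g: (counts[g] / total if total else 0.0) for g in group_keywords}


# Made-up captions, used only to show the mechanics.
captions = [
    "A man fixing a car",
    "A man cooking dinner",
    "A woman cooking dinner",
    "A man driving",
]
keywords = {"man": {"man", "men"}, "woman": {"woman", "women"}}
print(group_shares(captions, keywords))  # {'man': 0.75, 'woman': 0.25}
```

A count like this only flags surface-level imbalance; it does not measure how a group is portrayed, so it is a starting point rather than a full bias audit.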

 

4. Hallucination

  • VLMs may generate plausible-sounding but incorrect details ("hallucinations"), which can be hard to detect without human oversight.
  • EX) inventing object names or misattributing actions
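A simple automated check for invented objects compares what a caption mentions against a trusted object list for the same image. This is a minimal sketch under assumptions: the trusted list (from human annotation or a detector) is assumed to exist and to be correct, the objects are assumed to be already extracted from the caption, and the function names are hypothetical. It does not replace human review.

```python
def hallucinated_objects(caption_objects: list[str], reference_objects: list[str]) -> list[str]:
    """Objects mentioned in the caption that are absent from the reference list."""
    reference = {obj.lower() for obj in reference_objects}
    mentioned = {obj.lower() for obj in caption_objects}
    return sorted(mentioned - reference)


def hallucination_rate(caption_objects: list[str], reference_objects: list[str]) -> float:
    """Fraction of distinct mentioned objects that are not in the reference list."""
    mentioned = {obj.lower() for obj in caption_objects}
    if not mentioned:
        return 0.0
    return len(hallucinated_objects(caption_objects, reference_objects)) / len(mentioned)


# Made-up example: the caption mentions a frisbee that is not in the image.
caption_objects = ["dog", "ball", "frisbee"]
reference_objects = ["dog", "ball", "couch"]
print(hallucinated_objects(caption_objects, reference_objects))  # ['frisbee']
```

Exact string matching is the weak point here: synonyms ("sofa" vs. "couch") would be counted as hallucinations, and misattributed actions are not covered at all.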

 

[Source]

What are the limitations of current Vision-Language Models?
