What are the limitations of current Vision-Language Models?
Paper
Vision-Language Models (VLMs) process and reason about visual and textual information. Current limitations:
gaps in contextual understanding
difficulties with spatial and temporal reasoning
reliance on large-scale data
→ They do not generalize well to real-world scenarios.

1. The lack of robust contextual reasoning
deeper understanding of context or common-sense knowledge
📷 Image: A person is holding an umbrella
(VLM): "A person holding an umbrella!" → ⭕
(H): "Why is the person..