VAGI: Vision AI for General Intelligence

IIT Tirupati

VAGI (Vision AI for General Intelligence) aims to extend the capabilities of multimodal large language models (MLLMs) beyond conventional tasks such as image and video summarization and question answering. The focus is on enabling deeper reasoning, tool use, and dynamic adaptation of the visual-understanding strategy to the context, use case, and input. By pushing MLLMs past static interpretation, the project explores pathways toward general vision intelligence.

Image Understanding

Dense Image Captioning

Deep Image Captioning: The image depicts a vibrant urban scene on a sunny day, showcasing a bustling street with a mix of vehicles, pedestrians, and buildings. The street, likely in a European town given the architecture and the presence of a speed limit sign indicating a 30 km/h limit, is lined with a variety of structures. On the left side, there are buildings with distinctive red roofs, while on the right side, a prominent beige building stands out, adorned with white accents and trim. This building appears to be multi-storied and hosts a barber shop, as indicated by the sign reading "BARBER" on its facade. The presence of a leafless tree on the left side of the image adds a touch of seasonal context, suggesting it might be winter or early spring.

The road itself is active, with several cars navigating through the scene. A dark-colored sedan is prominently visible on the left front side of the image, heading towards the left. In the center of the road, a black car faces away from the viewer, indicating it is moving forward. On the left middle side, a bus is also in motion, contributing to the dynamic atmosphere.

Pedestrians are seen walking on the sidewalk to the right, adding to the lively ambiance of the area. A traffic light on the right side of the street regulates the flow of traffic and pedestrians, while a speed limit sign nearby reinforces the 30 km/h limit. The sky above is a clear blue with a white cloud, enhancing the overall sense of a pleasant day in an urban setting.

The detailed description of objects in the scene, including their attributes and additional observations, provides a comprehensive understanding of the image's content. From the dark sedan and black car to the bus, beige building, leafless tree, pedestrians, traffic light, and speed limit sign, each element contributes to a vivid portrayal of urban life. The scene captures the essence of a busy yet orderly urban environment, where various modes of transportation and daily activities coexist under the watchful eyes of regulatory signs and the natural beauty of the sky.


Object Level Reasoning


In complex scenes, traditional multimodal models often miss fine-grained details, for example by miscounting objects or overlooking attributes such as clothing color. Object-level reasoning addresses this by analyzing each object individually and reasoning about its attributes, its relationships, and its interactions with other objects. This lets the model extract richer and more accurate information, such as the correct number of cars, the people wearing specific clothes, or the distinction between drivers and cyclists. Such detailed understanding is essential for applications in surveillance, autonomous driving, medical imaging, and any other domain where precision and context matter.
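As a rough illustration of the idea, the sketch below reasons over a list of per-object records instead of a single whole-image description. It is not the project's implementation: the `DetectedObject` record, the helper names, the overlap threshold, and the sample `scene` are all hypothetical, and the rule that a person whose box lies mostly inside a car box is a "driver" is a deliberately crude stand-in for real relationship reasoning.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = tuple[float, float, float, float]


@dataclass
class DetectedObject:
    """One object in the scene, as an upstream detector might report it (assumed schema)."""
    label: str
    box: Box
    attributes: dict[str, str] = field(default_factory=dict)


def count_by_label(objects):
    """Count objects per label, e.g. how many cars are in the scene."""
    return Counter(obj.label for obj in objects)


def find_with_attribute(objects, label, key, value):
    """Return objects of one label whose attribute matches, e.g. people in red."""
    return [o for o in objects if o.label == label and o.attributes.get(key) == value]


def fraction_inside(inner, outer):
    """Fraction of the inner box's area that lies inside the outer box."""
    width = min(inner[2], outer[2]) - max(inner[0], outer[0])
    height = min(inner[3], outer[3]) - max(inner[1], outer[1])
    if width <= 0 or height <= 0:
        return 0.0
    inner_area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return (width * height) / inner_area if inner_area > 0 else 0.0


def classify_person(person, objects, threshold=0.5):
    """Crude role heuristic: pick the vehicle that covers most of the person's box."""
    roles = {"car": "driver", "bicycle": "cyclist"}
    best_role, best_overlap = "pedestrian", threshold
    for other in objects:
        if other.label in roles:
            overlap = fraction_inside(person.box, other.box)
            if overlap >= best_overlap:
                best_role, best_overlap = roles[other.label], overlap
    return best_role


# Made-up detections for illustration only.
scene = [
    DetectedObject("car", (0, 0, 50, 40), {"color": "dark"}),
    DetectedObject("car", (200, 0, 250, 40), {"color": "black"}),
    DetectedObject("bicycle", (58, 20, 72, 45)),
    DetectedObject("person", (10, 10, 20, 30), {"top_color": "blue"}),
    DetectedObject("person", (60, 10, 70, 40), {"top_color": "red"}),
    DetectedObject("person", (100, 10, 110, 40), {"top_color": "red"}),
]

people = [o for o in scene if o.label == "person"]
print(count_by_label(scene))
print(len(find_with_attribute(scene, "person", "top_color", "red")))
print([classify_person(p, scene) for p in people])
```

Keeping each object as its own record means counting, attribute filtering, and pairwise relationship checks are all simple queries over the same list, which is the property the paragraph above relies on.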


Spatial Reasoning



Add Text behind Object



Text to Image Edit



Focus Me

Coming Soon...


Video Understanding

Coming Soon...