Linqing Zhong1, Chen Gao2, Zihan Ding1, Yue Liao3, Si Liu1
1Beihang University  2National University of Singapore  3MMLab, CUHK
Equal contribution. Corresponding author: Si Liu (liusi@buaa.edu.cn)
Abstract
The Zero-Shot Object Navigation (ZSON) task requires embodied agents to find a previously unseen object by navigating in unfamiliar environments. Such goal-oriented exploration heavily relies on the ability to perceive, understand, and reason based on the spatial information of the environment. However, current LLM-based approaches convert visual observations into language descriptions and reason in the linguistic space, leading to the loss of spatial information. In this paper, we introduce TopV-Nav, an MLLM-based method that reasons directly on the top-view map with complete spatial information. To fully unlock the MLLM's spatial reasoning potential in the top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method to adaptively construct a semantically rich top-view map, enabling the agent to directly exploit the spatial information contained in the map for thorough reasoning. Besides, we design a Dynamic Map Scaling (DMS) mechanism to dynamically zoom the top-view map at preferred scales, enhancing local fine-grained reasoning. Additionally, we devise a Target-Guided Navigation (TGN) mechanism to predict and utilize target locations, facilitating global and human-like exploration. Experiments on the MP3D and HM3D benchmarks demonstrate the superiority of our TopV-Nav, which achieves absolute SR and SPL improvements on HM3D.
1 Introduction
In the realm of embodied AI, Zero-Shot Object Navigation (ZSON) is a fundamental challenge, requiring an agent to traverse continuous environments to locate a previously unseen object specified by category (e.g., fireplace). Such a zero-shot setting discards category-specific training and supports an open-category manner, emphasizing reasoning and exploration ability with minimized movement costs.
Recently, emerging works [34, 49, 48] have started to integrate Large Language Models (LLMs) into ZSON agents, aiming to improve reasoning ability by harnessing the extensive knowledge embedded in LLMs. For instance, object co-occurrence knowledge, e.g., a mouse often being found near a computer, can guide the agent's exploration.
As is well known, spatial information is vital for navigation decision-making, as it includes essential aspects such as room layouts and positional relationships among objects. Typically, navigation agents translate egocentric observations into a structured map to encode spatial information, i.e., a top-view map, also called a bird's-eye-view map. This map serves as the core representation, facilitating essential functionalities like obstacle avoidance and path planning.
However, current methods face notable limitations. As illustrated in Fig.1(a), these methods fall into two paradigms, i.e., frontier-based exploration (FBE) and waypoint-based exploration. Typically, LLM-based methods need to convert the top-view map into natural language, e.g., surrounding descriptions, and use the LLM to conduct reasoning in the linguistic domain. This map-to-text conversion leads to the loss of vital spatial information, such as the layout of the living room. In contrast, if the agent knew the spatial layout of the living room and understood that the fireplace is generally positioned opposite the sofa and table, the spatial location of the fireplace could be directly inferred from the room layout. Therefore, considering that the top-view map contains useful spatial information, and that MLLMs have demonstrated capabilities in grasping spatial relationships within images in the field of image understanding [5, 25], an interesting question arises: "can we leverage an MLLM to reason directly on the image of the top-view map to produce executable moving decisions?"
Besides, as shown in Fig.1(a), FBE methods select a point from frontier regions to move toward for exploration, restricting the LLM's action space to the frontier boundaries. Waypoint-based methods use a waypoint predictor to generate navigable waypoints and select one to move toward, restricting the LLM's action space to a predefined set of points. Moreover, since the waypoint predictor is trained offline using depth information, its predicted waypoints focus solely on traversability without semantics, which also leads to a weakly semantic action space. Both FBE and waypoint-based paradigms therefore suffer from constrained local action spaces. Thus, the question is: "can we construct a global and semantically rich action space?"
Furthermore, when humans explore unfamiliar environments, they can use current observations to infer the layout of unseen areas and predict where the target object might be located [15]. This predicted target location strategically guides their movement decisions, even in the absence of complete information. However, the action spaces of FBE and waypoint-based methods are confined strictly to navigable regions, which prevents them from possessing this capability. Thus, the question is: "can we leverage known observations to infer the probable location of the target object in unexplored areas to guide the current decision?"
Therefore, to address the limitations and questions mentioned above, we make multi-fold innovations. First, we propose an insightful method called TopV-Nav to fully unlock the top-view spatial reasoning potential of MLLMs for the ZSON task. Specifically, the current LLM-driven paradigm requires a map-to-language process for LLM reasoning in linguistic space, where the conversion may lose crucial spatial information such as object and room layouts. Instead, we propose a novel paradigm that leverages an MLLM to reason directly on the top-view map, discarding the map-to-language process and maximizing the utilization of all the spatial information. Second, we introduce an Adaptive Visual Prompt Generation (AVPG) method to enhance the MLLM's understanding of the top-view map. AVPG adaptively generates visual prompts directly on the map, in which various elements are spatially arranged to reflect their spatial relationships within the environment. Therefore, the MLLM can grasp and comprehend crucial information directly from the map, facilitating effective spatial reasoning. For instance, in Fig.1(b), our method can interpret the room's layout and infer the fireplace's location. Additionally, the moving location is predicted directly on the top-view map, resulting in a global and more flexible action space. Third, for environments with numerous objects, the top-view map may not be able to visually represent all elements. Thus, we propose a Dynamic Map Scaling (DMS) mechanism to dynamically adjust the map's scale via zooming operations. DMS further enhances the agent's local spatial reasoning and fine-grained exploration in local regions. Last but not least, we propose a Target-Guided Navigation (TGN) mechanism to first predict the final coordinate of the target object, which can even lie in unexplored areas. The target coordinate then guides the moving location within the current navigable regions, mirroring human-like predictive reasoning and exploratory behavior.
Experiments are conducted on the MP3D and HM3D benchmarks, demonstrating that our TopV-Nav achieves promising performance with absolute SR and SPL improvements on HM3D.
2 Related Works
2.1 Object-goal Navigation
Object-goal navigation has been a fundamental challenge in embodied AI [28, 29, 11, 45, 46, 44, 4, 12, 40, 6, 50, 23, 39, 38, 22, 13, 47]. Early methods leverage RL to train policies and explore visual representations [24], meta-learning [33], and semantic priors [37, 35], etc., to enhance performance. Modular approaches [3, 27, 43] leverage perception models [14, 42] to construct episodic maps, based on which long-term goals are generated to guide the local policy.
To overcome the closed-world assumption and achieve the zero-shot object navigation (ZSON) task, EmbCLIP [16] and ZSON [21] leverage the multi-modal alignment ability of CLIP [26] to enable cross-domain zero-shot object navigation. Furthermore, CoWs [10] accelerates progress on the ZSON task, where no simulation training is required and a single model can be applied across multiple environments. Recent methods [34, 49] extract semantic information using powerful off-the-shelf detectors [19, 41, 20], based on which they employ LLMs to determine the next frontier [49] or waypoint [34] for exploration. However, spatial layout information is lost during the map-to-language conversion. To address this limitation, we propose direct reasoning on the top-view map with an MLLM, fully leveraging the complete spatial information.
2.2 Spatial Reasoning with MLLMs
Developing the spatial reasoning capabilities of MLLMs has become increasingly popular. KAGI [17] proposes generating a coarse robot movement trajectory as dense reward supervision through keypoint inference. SCAFFOLD [18] leverages scaffolding coordinates to promote vision-language coordination. PIVOT [25] iteratively prompts an MLLM with images annotated with visual representations of proposals and can be applied to a wide range of embodied tasks. In the domain of vision-language navigation, AO-Planner [7] proposes visual affordance prompting to enable the MLLM to select candidate waypoints from front-view images. However, previous works focus on exploring MLLM spatial reasoning from egocentric perspectives, while the investigation from the top-view perspective remains limited, even though the top-view map is the core representation for robots. Our work explores the top-view spatial reasoning potential of MLLMs for the ZSON task.
3 Method
3.1 Problem Definition
The ZSON task requires an agent, which is randomly placed in a continuous environment at initialization, to navigate to an instance of a user-specified object category $c_g$. At time step $t$, the agent receives egocentric observations $O_t$, which contain RGB-D images and its pose $p_t$. The agent is expected to adopt a low-level action $a_t$, i.e., move_forward, turn_left, turn_right, look_up, look_down, and stop. The task is considered successful if the agent stops within a distance threshold $d_s$ of the target in fewer than $T_{\max}$ steps and the target is visible in the egocentric observation.
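For concreteness, the following minimal sketch encodes the discrete action space and the success criterion described above. The distance threshold and step budget shown are the common ObjectNav defaults and are assumptions, not values taken from this paper.

```python
from enum import Enum

class Action(Enum):
    MOVE_FORWARD = "move_forward"
    TURN_LEFT = "turn_left"
    TURN_RIGHT = "turn_right"
    LOOK_UP = "look_up"
    LOOK_DOWN = "look_down"
    STOP = "stop"

def episode_success(dist_to_target: float, target_visible: bool, steps_used: int,
                    d_s: float = 1.0, t_max: int = 500) -> bool:
    """Success = stopped within d_s of the target, target visible, within t_max steps.
    d_s and t_max here are typical ObjectNav defaults (assumed, not from the paper)."""
    return dist_to_target <= d_s and target_visible and steps_used <= t_max
```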
3.2 Overview
We propose a novel MLLM-driven method, termed TopV-Nav, designed to fully unlock MLLM’s top-view perception and spatial reasoning capabilities for the ZSON task.
Specifically, as shown in Fig.2, the agent observes egocentric RGB-D images and its pose $p_t$ at time step $t$. Then, the proposed Adaptive Visual Prompt Generation (AVPG) module converts these egocentric observations onto a top-view map $M_t$, as shown in (a). Note that $M_t$ adopts different colors and text boxes as visual prompts to represent various elements, which are spatially arranged on the map to provide spatial layout information. Next, in the proposed Dynamic Map Scaling (DMS) module, we take $M_t$ to query the MLLM to conduct spatial reasoning and predict a ratio $r$ for zooming the map, as shown in (b)(c). Note that the zooming region is the agent's surroundings, supporting local fine-grained exploration. Subsequently, we propose Target-Guided Navigation (TGN), as shown in Fig.2. The scaled map $M_t^{s}$ is taken as input to query the MLLM to conduct spatial reasoning and produce the target location and the moving location. The target location refers to the potential location of the target object, so it may lie in unknown areas of the map. This target location is used to guide the prediction of the moving location, which is the agent's next actual destination within navigable areas. Note that any location within navigable areas can be chosen as the moving location, resulting in a global action space that is not restricted to frontiers or waypoints. Finally, we utilize the fast marching method [31] as the local policy to perform a series of low-level actions, by which the agent gradually moves toward the moving location.
3.3 Adaptive Visual Prompt Generation
To adaptively construct top-view maps on which MLLM can perform spatial layout reasoning, we propose the Adaptive Visual Prompt Generation (AVPG) method.
Technically, at each time step $t$, we transform the egocentric depth image into a 3D point cloud utilizing the agent's pose $p_t$ and then convert it to the global space. Points near the floor are regarded as navigable areas, while points above a height threshold are considered obstacle areas. Moreover, we leverage Grounding-DINO [20] to identify objects and their bounding boxes from egocentric images. Each box is projected into 3D using the corresponding depth information and further projected onto the top-view map by compressing the height dimension. To build a map that the MLLM can understand, we leverage different colors and text as visual prompts on the map to distinguish different areas, such as historically traveled areas, navigable areas, obstacle areas, frontiers, and objects, as shown in Fig.2(a). For detected objects, we devise a simple but efficient sampling strategy to select goal-relevant objects to display on the map. We assign each bounding box $b_i$ of category $c_i$ a priority score $s_i$, which is formulated as:
$s_i = \mathrm{conf}_i + \lambda \cdot \mathrm{sim}(c_i, c_g)$    (1)
where $\mathrm{conf}_i$ denotes the detection confidence. Following [49], $\mathrm{sim}(c_i, c_g)$ indicates the similarity score between object category $c_i$ and goal category $c_g$, and $\lambda$ is a hyper-parameter. Objects with $s_i$ above a threshold are annotated with their corresponding bounding box and category on the top-view map and serve as visual prompts.
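Below is a minimal sketch of this goal-relevant object filtering, assuming the weighted-sum form of Eq. (1) as reconstructed above; the `similarity` callable and the values of `lam` and `threshold` are illustrative stand-ins rather than the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    category: str
    confidence: float  # detector confidence, e.g., from Grounding-DINO
    bbox: tuple        # (x1, y1, x2, y2) in the egocentric image

def select_goal_relevant(detections, goal, similarity, lam=0.5, threshold=0.6):
    """Keep detections whose priority score s_i = conf_i + lam * sim(c_i, goal)
    exceeds the threshold; lam and threshold values here are illustrative only."""
    kept = []
    for det in detections:
        score = det.confidence + lam * similarity(det.category, goal)
        if score > threshold:
            kept.append((det, score))
    return kept

# usage: select_goal_relevant(dets, "fireplace",
#                             similarity=lambda c, g: 0.8 if c == "sofa" else 0.2)
```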
Key Area Markers Generation. We aim to further generate markers as visual prompts on the map to refer to key areas that contain rich semantics, helping the MLLM better interpret the map. Technically, we apply DBSCAN [9], a density-based spatial clustering algorithm. We cluster two types of elements: frontiers and the locations of observed objects. On one hand, by identifying the boundary points of the navigable areas and the locations of obstacles, we obtain the locations of frontiers not occupied by obstacles, denoted $\mathcal{P}^{f}$. On the other hand, we leverage the detection model and depth information to collect the observed objects' locations $\mathcal{P}^{o}$. Eventually, we merge these two parts into a candidate point set $\mathcal{P} = \mathcal{P}^{f} \cup \mathcal{P}^{o}$ and apply DBSCAN to it.
Concretely, we first randomly sample a point $p$ that has not yet been clustered from the candidate point set $\mathcal{P}$. With this point as the center, we consider its $\epsilon$-neighborhood, which is formulated as:
$N_{\epsilon}(p) = \{\, q \in \mathcal{P} \mid \mathrm{dist}(p, q) \le \epsilon \,\}$    (2)
where $\epsilon$ is a hyper-parameter that denotes the clustering radius, and $p$ is the center point of this $\epsilon$-neighborhood. To avoid the influence of noise, only if the number of points in the $\epsilon$-neighborhood reaches a threshold $N_{\min}$ do we treat $p$ and the other points within its $\epsilon$-neighborhood as a new cluster; otherwise, these points are considered noise. In other words, $N_{\min}$ specifies the minimum number of points contained in a cluster.
Taking a step further, after a novel qualified cluster is determined, we merge it with the existing clusters according to the following formula:
$C_i \leftarrow C_i \cup C_j, \quad \text{if } C_i \cap C_j \neq \emptyset$    (3)
That is, if any two clusters intersect, we merge them and generate a new neighborhood with a larger clustering radius; the center point of the new cluster is updated accordingly. We repeat the process of random sampling and merging clusters until all points in $\mathcal{P}$ are processed (i.e., included in a cluster or regarded as noise). In this way, a series of cluster centers is generated, which serve as markers referring to key areas, as shown in Fig.2(a). Moreover, this clustering method enables the number and distribution of key areas to be dynamically adjusted during the agent's exploration.
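A minimal sketch of key-area marker generation is given below, using scikit-learn's off-the-shelf DBSCAN as a stand-in for the incremental clustering-and-merging procedure described above; `eps` and `min_points` mirror the clustering radius $\epsilon$ and the minimum cluster size $N_{\min}$, and their values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def key_area_markers(frontier_points: np.ndarray, object_points: np.ndarray,
                     eps: float = 1.0, min_points: int = 5) -> np.ndarray:
    """Cluster frontier and observed-object locations (N x 2 map coordinates)
    and return one marker (cluster centroid) per key area.
    eps ~ clustering radius, min_points ~ minimum points per cluster (illustrative values)."""
    candidates = np.concatenate([frontier_points, object_points], axis=0)
    labels = DBSCAN(eps=eps, min_samples=min_points).fit(candidates).labels_
    # Points labeled -1 are treated as noise and dropped.
    markers = [candidates[labels == k].mean(axis=0) for k in set(labels) if k != -1]
    return np.array(markers)
```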
3.4 Dynamic Map Scaling
A top-view map with fixed size and resolution may not effectively represent the spatial layout. Therefore, we devise the Dynamic Map Scaling (DMS) mechanism to dynamically adjust the map scale during exploration, further supporting local fine-grained exploration. For instance, in the original top-view map (Fig.2(b)), some representative layouts are obscured by the textual annotations. In the zoomed map (c), however, not only is the spatial layout clearly visible, but more objects are also displayed for the MLLM to perform spatial reasoning. Such a zooming operation allows the MLLM to capture more spatial clues and to perform fine-grained exploration in local key areas.
Specifically, we query the MLLM to predict a scaling ratio $r$ according to the current top-view map, as shown in Fig.2(b)(c). The MLLM is asked to reason on the original top-view map $M_t$, considering the spatial layout, the agent's current position, etc. We then require the MLLM to select a map scaling ratio $r$ from a predefined set of candidate scales. According to the scaling ratio chosen by the MLLM, the top-view map is cropped with the agent as the center. Then we apply AVPG once again to obtain the scaled top-view map $M_t^{s}$ with visual prompts. Note that if $r = 1$, which means the top-view map remains unchanged, the original top-view map $M_t$ is directly used by TGN to produce the moving location.
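The zooming operation can be sketched as a simple agent-centered crop, as below; the function name and the assumption that AVPG redraws the visual prompts on the cropped map afterwards are illustrative, not the authors' exact implementation.

```python
import numpy as np

def zoom_map(top_view: np.ndarray, agent_rc: tuple, ratio: float) -> np.ndarray:
    """Crop a square window of side ratio * map_size centered on the agent.
    top_view: (H, W, C) map image; agent_rc: (row, col) of the agent on the map.
    Visual prompts are assumed to be redrawn by AVPG on the cropped map afterwards."""
    if ratio >= 1.0:
        return top_view  # ratio == 1 keeps the original map
    h, w = top_view.shape[:2]
    half = int(min(h, w) * ratio) // 2
    # Clamp the crop center so the window stays inside the map.
    r = int(np.clip(agent_rc[0], half, h - half))
    c = int(np.clip(agent_rc[1], half, w - half))
    return top_view[r - half:r + half, c - half:c + half]
```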
3.5 Target-Guided Navigation
We design the Target-Guided Navigation (TGN) mechanism to guide the decision process by predicting the target location. Note that the location can be in unknown areas, which means the MLLM can infer the potential location according to the room layout as shown in Fig.3.
Concretely, the textual prompts $X_t$ are based on templates and visual observations, specifying the object-goal category and some key coordinate information (e.g., the coordinate system). Along with $M_t^{s}$, we leverage the MLLM to perform layout reasoning. The MLLM predicts and justifies the potential location of the goal $p_g$ directly on $M_t^{s}$. For instance, in Fig.3, the MLLM observes "sofas", "tables", and other representative objects of a living room. However, the goal "fireplace" is unseen in the top-view map $M_t^{s}$. Through layout reasoning, the MLLM determines that the "fireplace" is also in the living room and is most likely located "opposite the sofa". Therefore, it can roughly predict the potential target location $p_g$ based on the layout.
Since the agent can only navigate within the navigable areas, we propose an effective soft scoring method to convert the target location in unknown areas into an actual moving location within the navigable areas. Specifically, we encourage the MLLM to assign a layout score $s_k^{\mathrm{lay}}$ to each key area marker $m_k \in \mathcal{M}$, where $\mathcal{M}$ denotes the set of key area markers. Note that the markers cover the entire map, and the agent can move toward every location on it. The layout score explicitly represents the spatial layout differences among these markers: the higher $s_k^{\mathrm{lay}}$, the greater the likelihood that the agent will find the goal by moving toward $m_k$.
Moreover, focusing only on the layout information while ignoring the distance between each candidate location $m_k$ and the potential target location $p_g$ may reduce the agent's navigation efficiency. Therefore, we calculate a position score $s_k^{\mathrm{pos}}$ for each location by utilizing a Gaussian probability distribution centered at the estimated target object position:
$s_k^{\mathrm{pos}} = \exp\!\left( -\dfrac{\lVert m_k - p_g \rVert^2}{2\sigma^2} \right)$    (4)
where $\sigma$ indicates the standard deviation. $s_k^{\mathrm{pos}}$ decays as the distance between $m_k$ and $p_g$ increases, which dissuades potentially ineffective long-distance exploration. Finally, the decision location $p^{*}$ is predicted by fusing the layout and position scores, which is formulated as:
$p^{*} = \arg\max_{m_k \in \mathcal{M}} \; s_k^{\mathrm{lay}} \cdot s_k^{\mathrm{pos}}$    (5)
Such a soft scoring method produces the final decision location by considering both the spatial layout and the semantic information.
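A minimal sketch of the soft scoring step follows, assuming the Gaussian position score of Eq. (4) and the product fusion of Eq. (5) as reconstructed above; the value of `sigma` and the product-based fusion are assumptions.

```python
import numpy as np

def choose_moving_location(markers: np.ndarray, layout_scores: np.ndarray,
                           target_xy: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Fuse the MLLM's layout scores with a Gaussian position score centered at the
    predicted target location and return the best-scoring marker as the moving location.
    markers: (N, 2) map coordinates; layout_scores: (N,); target_xy: (2,)."""
    dists = np.linalg.norm(markers - target_xy, axis=1)
    position_scores = np.exp(-dists ** 2 / (2 * sigma ** 2))  # Eq. (4)
    fused = layout_scores * position_scores                    # Eq. (5), product fusion assumed
    return markers[int(np.argmax(fused))]
```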
3.6 Local Policy
Once the agent's decision location $p^{*}$ is obtained, we calculate a path from the current location to $p^{*}$ following previous works [27, 49] (the fast marching method [31]). Then a series of low-level actions is generated according to the path by utilizing a deterministic policy. During navigation, if any instance of the goal category is observed, the agent directly navigates to it. Otherwise, the agent keeps exploring by following the MLLM's reasoning results until the maximum number of action steps is reached.
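The paper's local policy follows [27, 49]; as an illustration only, the sketch below uses the scikit-fmm package as a stand-in fast-marching solver and takes a greedy step toward the moving location on the resulting distance field. The names and the construction of `phi` are illustrative assumptions.

```python
import numpy as np
import skfmm  # scikit-fmm, a generic fast-marching solver (stand-in, not the authors' planner)

def fmm_next_step(traversable: np.ndarray, agent_rc: tuple, goal_rc: tuple) -> tuple:
    """Return the neighboring cell with the smallest fast-marching distance to the goal.
    traversable: boolean grid (True = navigable); goal_rc is assumed to be navigable."""
    phi = np.ma.MaskedArray(np.ones_like(traversable, dtype=float), mask=~traversable)
    phi[goal_rc] = -1.0                      # zero level set placed around the goal cell
    dist = np.ma.filled(skfmm.distance(phi), np.inf)  # obstacles become unreachable (inf)
    r, c = agent_rc
    neighbors = [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    neighbors = [(nr, nc) for nr, nc in neighbors
                 if 0 <= nr < dist.shape[0] and 0 <= nc < dist.shape[1]]
    return min(neighbors, key=lambda rc: dist[rc])
```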
4 Experiments
Table 1: Comparison with previous methods on the MP3D and HM3D benchmarks.

| Methods | Zero-Shot | Training-Free | Reasoning Domain | MP3D SR | MP3D SPL | HM3D SR | HM3D SPL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SemEXP[3] [NeurIPS2020] | | | Latent Map | 36.0 | 14.4 | - | - |
| PONI[27] [CVPR2022] | | | Latent Map | 31.8 | 12.1 | - | - |
| ProcTHOR[8] [NeurIPS2022] | | | CLIP Embeddings | - | - | 54.4 | 31.8 |
| ProcTHOR-ZS[8] [NeurIPS2022] | ✓ | | CLIP Embeddings | - | - | 13.2 | 7.7 |
| ZSON[21] [NeurIPS2022] | ✓ | | CLIP Embeddings | 15.3 | 4.8 | 25.5 | 12.6 |
| PSL[32] [ECCV2024] | ✓ | | CLIP Embeddings | - | - | 42.4 | 19.2 |
| Pixel-Nav[1] [ICRA2024] | ✓ | | Linguistic | - | - | 37.9 | 20.5 |
| SGM[46] [CVPR2024] | ✓ | | Linguistic | 37.7 | 14.7 | 60.2 | 30.8 |
| CoW[10] [CVPR2023] | ✓ | ✓ | CLIP Embeddings | 7.4 | 3.7 | - | - |
| ESC[49] [ICML2023] | ✓ | ✓ | Linguistic | 28.7 | 14.2 | 39.2 | 22.3 |
| VoroNav[34] [ICML2024] | ✓ | ✓ | Linguistic | - | - | 42.0 | 26.0 |
| TopV-Nav (Ours) | ✓ | ✓ | Top-view Map | 31.9 | 16.1 | 45.9 | 28.0 |
4.1 Experimental Setup
Dataset. To evaluate our TopV-Nav, we conduct extensive experiments on two representative object navigation benchmarks, Matterport3D (MP3D) [2] and Habitat-Matterport3D (HM3D) [36], both built upon the Habitat simulator [30]. MP3D contains 11 high-fidelity validation scenes and 2,195 validation episodes with 21 goal categories. As for HM3D, the standard dataset partition [36] incorporates 2,000 validation episodes in 20 buildings with 6 goal categories. Since our work focuses on zero-shot object navigation, none of the episodes is used for training.
Evaluation Metrics. We adopt Success Rate (SR) and Success weighted by Path Length (SPL) to evaluate object navigation performance. SR indicates the ratio of episodes that are considered successful, and SPL is the primary metric measuring the agent's navigation efficiency.
Implementation Details. Following the standard benchmark configuration, we set 500 as the agent's maximal number of navigation steps, the agent's rotation angle to 30 degrees, and the move_forward distance to 0.25m. The top-view map we construct is a pixel-level map with a fixed metric resolution per cell. The hyper-parameters described in our method, namely the weighting factor $\lambda$ and the score threshold for selecting goal-relevant objects as visual prompts, the clustering radius $\epsilon$, and the minimum number of points per cluster $N_{\min}$, are set to fixed values. For the map scaling ratios detailed in the DMS method, we predefine the candidate ratios $\{1, 1/2, 1/3, 1/5\}$. In the TGN mechanism, the standard deviation $\sigma$ of the Gaussian probability distribution is fixed, while the center point is dynamically estimated from the MLLM's predicted target object position. We leverage the publicly available Grounding-DINO [20] for object detection and adopt GPT-4o as the spatial reasoning MLLM in both DMS and TGN.
4.2 Comparison with Previous Methods
In this section, we compare our TopV-Nav with state-of-the-art object navigation methods on the MP3D and HM3D benchmarks. Since requesting OpenAI's service is costly and time-consuming, we sample a subset of the dataset to conduct experiments efficiently. Specifically, we sample a subset of episodes from HM3D and MP3D, respectively. These sampled episodes cover all scenes and target object categories in the validation split, so the subset remains highly representative. We mainly compare with ESC [49] and VoroNav [34], as shown in Tab.1. Note that ESC applies the frontier-based exploration method and VoroNav is a waypoint-based approach. This analysis explicitly compares the performance of the three paradigms (i.e., FBE-based, waypoint-based, and our top-view spatial reasoning) on the object navigation task.
As shown in Tab.1, our TopV-Nav significantly improves navigation performance. Specifically, our approach outperforms ESC on the validation split of MP3D, improving SR and SPL by 3.2 and 1.9, respectively. Moreover, on the HM3D benchmark, SR is increased from 39.2 to 45.9 and SPL is raised from 22.3 to 28.0. Such an improvement reveals that the MLLM's direct layout reasoning on the top-view map leverages more spatial clues than the map-to-language manner followed by both FBE and waypoint-based paradigms, demonstrating the superiority of our proposed approach.
4.3 Ablation Studies
To analyze the contribution of each module, we conduct ablation experiments, which are shown in Tab.2, Tab.3, and Tab.4. To save time and cost, an indicative subset of the HM3D dataset is sampled to carry out the ablation studies; the sampled episodes cover all validation scenes and target object categories of HM3D. In all ablation experiments, we utilize the same episodes for evaluation to ensure stability.
Adaptive Visual Prompt Generation. The Adaptive Visual Prompt Generation (AVPG) method is a core component of the MLLM's layout reasoning. As shown in Tab.2, AVPG effectively supports and releases the MLLM's spatial reasoning abilities, achieving 40.5 SR and 22.7 SPL on the HM3D benchmark. The result suggests that the MLLM has substantial potential to guide the agent's exploration by grasping visual prompts and performing layout reasoning on the top-view map.
Dynamic Map Scaling. As shown in Tab.2, we observe that applying the DMS mechanism promotes SR from 40.5 to 42.0, which implies that the agent's zooming operation benefits object navigation. Besides, SPL is improved from 22.7 to 23.6. This improvement in navigation efficiency reveals that grasping more visual clues and layout information benefits fine-grained layout reasoning, preventing the agent from potentially long-distance exploration.
Target-Guided Navigation. Comparing "#2" and "#3" in Tab.2, we observe that our proposed TGN module boosts SR from 42.0 to 43.5 and improves SPL from 23.6 to 24.7, which indicates that the MLLM's spatial layout inference estimates the potential goal location, and this predicted location further guides the agent's exploration effectively. This significant gain also confirms the importance of human-like predictive reasoning in unfamiliar environments.
Table 2: Ablation of the proposed modules on HM3D.

| Name | AVPG | DMS | TGN | SR | SPL |
| --- | --- | --- | --- | --- | --- |
| #1 | ✓ | | | 40.5 | 22.7 |
| #2 | ✓ | ✓ | | 42.0 | 23.6 |
| #3 | ✓ | ✓ | ✓ | 43.5 | 24.7 |
Visual Prompt Components. The results in Tab.3 indicate the effects of different visual prompt components on task performance. The comparison between "#2" and "#1" confirms that explicitly marking the spatial locations of objects on the top-view map improves navigation performance: SR is lifted from 39.5 to 42.0 and SPL is improved by 1.7. Besides, integrating the MLLM's predicted target as a prompt further gains a significant improvement in SR (42.0 to 43.5) and SPL (23.8 to 24.7). This illustrates that rich visual prompts embedded in the top-view map facilitate the MLLM's layout reasoning and therefore enhance the agent's navigation ability.
Map Scaling Ratios. We investigate how the ratios of the agent's zooming operation affect navigation performance. As shown in Tab.4, both SR and SPL increase as the number of candidate map scaling ratios increases. When the top-view map remains unchanged (i.e., "#1" in Tab.4), object navigation performance is relatively poor. This limitation stems from the fact that the MLLM cannot perform zooming for fine-grained layout reasoning, especially in object-dense scenes. By adding 1/2 to the map scaling ratios, SR rises from 39.0 to 41.5, while SPL is also increased by 1.5. Moreover, introducing more map scales (1/3 and 1/5) further improves navigation performance, as shown in "#3", lifting SR and SPL from 41.5 and 23.1 to 43.5 and 24.7.
Table 3: Ablation of visual prompt components on HM3D.

| Name | History | Object | Target | SR | SPL |
| --- | --- | --- | --- | --- | --- |
| #1 | ✓ | | | 39.5 | 22.1 |
| #2 | ✓ | ✓ | | 42.0 | 23.8 |
| #3 | ✓ | ✓ | ✓ | 43.5 | 24.7 |
Table 4: Ablation of map scaling ratios on HM3D.

| Name | 1 | 1/2 | 1/3 and 1/5 | SR | SPL |
| --- | --- | --- | --- | --- | --- |
| #1 | ✓ | | | 39.0 | 21.6 |
| #2 | ✓ | ✓ | | 41.5 | 23.1 |
| #3 | ✓ | ✓ | ✓ | 43.5 | 24.7 |
4.4 Qualitative Analysis
We visualize the navigation process along with the MLLM’s spatial reasoning to give a more intuitive view.
Fig.4 displays several examples of the MLLM's spatial reasoning. Note that we utilize different colors to annotate specific areas (e.g., obstacles and navigable areas). Object categories are also directly marked on the top-view map according to their spatial locations, facilitating the MLLM's layout reasoning.
In Fig.4(a), the agent is tasked with searching for a bed in an indoor environment. At this time step, representative objects of a living room, such as chairs and cabinets, are located in this area. The MLLM observes these objects and the scene layout, which is likely a living room. Based on its embedded prior knowledge, the MLLM performs layout reasoning and infers that key area 5 is likely the location of a bedroom. Besides, by leveraging the commonsense that "a bed is generally located in the bedroom", the MLLM utilizes the provided coordinate system information and estimates the potential location of the goal to be "(2.5, 7.5)". This prediction further guides the local action policy, allowing the agent to gradually approach the target object.
In Fig.4(b), considering the goal is "bed", the MLLM interprets the top-view map and notices that the agent is in the kitchen area and near the living room according to the spatial layout displayed on the map. Considering that the living room and bedroom are usually connected by a corridor, the MLLM indicates the bedroom may be located at "(5.0, 8.0)" and guides the agent to navigate there in search of the goal "bed". In Fig.4(c), the MLLM observes the spatial layout and notices the detected bed. Considering that the bed is located in the bedroom, which also contains the goal "toilet", the MLLM predicts the target location "(2.5, 3.5)".
The examples above reveal that our proposed method effectively unlocks the MLLM's top-view spatial potential and completes zero-shot object navigation tasks. Benefiting from the MLLM's spatial inference over the top-view map, this paradigm allows the agent to act in a global action space and to achieve highly flexible exploration.
5 Conclusion
In this paper, we tackle the Zero-Shot Object Navigation (ZSON) task, where spatial information plays a critical role in goal-oriented exploration. Previous LLM-based methods must convert the top-view map into language descriptions and conduct exploration reasoning in the linguistic domain; such a conversion loses spatial information such as object and room layouts. Therefore, we study how to directly adopt the top-view map for reasoning by using the MLLM's ability to understand images. Specifically, we propose several insightful methods to fully unlock the top-view spatial reasoning potential of MLLMs. The proposed Adaptive Visual Prompt Generation (AVPG) method draws a semantically rich map with visual prompts. The Dynamic Map Scaling (DMS) mechanism adjusts the map to appropriate scales for the MLLM to interpret the layout and make decisions. The Target-Guided Navigation (TGN) mechanism imitates human behavior by predicting the target's potential location to guide the current action. Experiments on the MP3D and HM3D benchmarks demonstrate the effectiveness of our work.
References
- Cai et al. [2024] Wenzhe Cai, Siyuan Huang, Guangran Cheng, Yuxing Long, Peng Gao, Changyin Sun, and Hao Dong. Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill. In ICRA, 2024.
- Chang et al. [2017] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. 3DV, 2017.
- Chaplot et al. [2020a] Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R. Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. In NeurIPS, 2020a.
- Chaplot et al. [2020b] Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, and Saurabh Gupta. Neural topological SLAM for visual navigation. In CVPR, 2020b.
- Chen et al. [2024a] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024a.
- Chen et al. [2022] Jinyu Chen, Chen Gao, Erli Meng, Qiong Zhang, and Si Liu. Reinforced structured state-evolution for vision-language navigation. In CVPR, 2022.
- Chen et al. [2024b] Jiaqi Chen, Bingqian Lin, Xinmin Liu, Xiaodan Liang, and Kwan-Yee K. Wong. Affordances-oriented planning using foundation models for continuous vision-language navigation. In CoRR, 2024b.
- Deitke et al. [2022] Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. ProcTHOR: Large-scale embodied AI using procedural generation. In NeurIPS, 2022.
- Ester et al. [1996] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, 1996.
- Gadre et al. [2023] Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. CoWs on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In CVPR, 2023.
- Gao et al. [2021] Chen Gao, Jinyu Chen, Si Liu, Luting Wang, Qiong Zhang, and Qi Wu. Room-and-object aware knowledge reasoning for remote embodied referring expression. In CVPR, 2021.
- Gao et al. [2023a] Chen Gao, Si Liu, Jinyu Chen, Luting Wang, Qi Wu, Bo Li, and Qi Tian. Room-object entity prompting and reasoning for embodied referring expression. TPAMI, 2023a.
- Gao et al. [2023b] Chen Gao, Xingyu Peng, Mi Yan, He Wang, Lirong Yang, Haibing Ren, Hongsheng Li, and Si Liu. Adaptive zone-aware hierarchical planner for vision-language navigation. In CVPR, 2023b.
- He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
- Kessler et al. [2024] Fabian Kessler, Julia Frankenstein, and Constantin A. Rothkopf. Human navigation strategies and their errors result from dynamic interactions of spatial uncertainties. Nature Communications, 2024.
- Khandelwal et al. [2022] Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, and Aniruddha Kembhavi. Simple but effective: CLIP embeddings for embodied AI. In CVPR, 2022.
- Lee et al. [2024] Olivia Y. Lee, Annie Xie, Kuan Fang, Karl Pertsch, and Chelsea Finn. Affordance-guided reinforcement learning via visual prompting. RSS, 2024.
- Lei et al. [2024] Xuanyu Lei, Zonghan Yang, Xinrui Chen, Peng Li, and Yang Liu. Scaffolding coordinates to promote vision-language coordination in large multi-modal models. arXiv preprint arXiv:2402.12058, 2024.
- Li et al. [2022] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In CVPR, 2022.
- Liu et al. [2024] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In ECCV, 2024.
- Majumdar et al. [2022] Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. ZSON: Zero-shot object-goal navigation using multimodal goal embeddings. NeurIPS, 2022.
- Maksymets et al. [2021] Oleksandr Maksymets, Vincent Cartillier, Aaron Gokaslan, Erik Wijmans, Wojciech Galuba, Stefan Lee, and Dhruv Batra. THDA: Treasure hunt data augmentation for semantic navigation. In ICCV, 2021.
- Mayo et al. [2021] Bar Mayo, Tamir Hazan, and Ayellet Tal. Visual navigation with spatial attention. In CVPR, 2021.
- Mousavian et al. [2019] Arsalan Mousavian, Alexander Toshev, Marek Fišer, Jana Košecká, Ayzaan Wahid, and James Davidson. Visual representations for semantic target driven navigation. In ICRA, 2019.
- Nasiriany et al. [2024] Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, et al. PIVOT: Iterative visual prompting elicits actionable knowledge for VLMs. In ICML, 2024.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
- Ramakrishnan et al. [2022] Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman. PONI: Potential functions for objectgoal navigation with interaction-free learning. In CVPR, 2022.
- Ramrakhya et al. [2022] Ram Ramrakhya, Eric Undersander, Dhruv Batra, and Abhishek Das. Habitat-Web: Learning embodied object-search strategies from human demonstrations at scale. In CVPR, 2022.
- Ramrakhya et al. [2023] Ram Ramrakhya, Dhruv Batra, Erik Wijmans, and Abhishek Das. PIRLNav: Pretraining with imitation and RL finetuning for ObjectNav. In CVPR, 2023.
- Savva et al. [2019] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied AI research. In ICCV, 2019.
- Sethian [1999] James A. Sethian. Fast marching methods. SIAM Review, 1999.
- Sun et al. [2024] Xinyu Sun, Lizhao Liu, Hongyan Zhi, Ronghe Qiu, and Junwei Liang. Prioritized semantic learning for zero-shot instance navigation. In ECCV, 2024.
- Wortsman et al. [2019] Mitchell Wortsman, Kiana Ehsani, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In CVPR, 2019.
- Wu et al. [2024] Pengying Wu, Yao Mu, Bingxian Wu, Yi Hou, Ji Ma, Shanghang Zhang, and Chang Liu. VoroNav: Voronoi-based zero-shot object navigation with large language model. ICML, 2024.
- Wu et al. [2018] Yi Wu, Yuxin Wu, Aviv Tamar, Stuart Russell, Georgia Gkioxari, and Yuandong Tian. Learning and planning with a semantic model. arXiv preprint arXiv:1809.10842, 2018.
- Yadav et al. [2023] Karmesh Yadav, Ram Ramrakhya, Santhosh Kumar Ramakrishnan, Theo Gervet, John Turner, Aaron Gokaslan, Noah Maestre, Angel Xuan Chang, Dhruv Batra, Manolis Savva, et al. Habitat-Matterport 3D semantics dataset. In CVPR, 2023.
- Yang et al. [2018] Wei Yang, Xiaolong Wang, Ali Farhadi, Abhinav Gupta, and Roozbeh Mottaghi. Visual semantic navigation using scene priors. arXiv preprint arXiv:1810.06543, 2018.
- Ye et al. [2021] Joel Ye, Dhruv Batra, Abhishek Das, and Erik Wijmans. Auxiliary tasks and exploration enable ObjectNav. In ICCV, 2021.
- Ye and Yang [2021] Xin Ye and Yezhou Yang. Hierarchical and partially observable goal-driven policy learning with goals relational graph. In CVPR, 2021.
- Zhai and Wang [2023] Albert J. Zhai and Shenlong Wang. PEANUT: Predicting and navigating to unseen targets. In ICCV, 2023.
- Zhang et al. [2022a] Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. GLIPv2: Unifying localization and vision-language understanding. In NeurIPS, 2022a.
- Zhang et al. [2020] Jiazhao Zhang, Chenyang Zhu, Lintao Zheng, and Kai Xu. Fusion-aware point convolution for online semantic 3D scene segmentation. In CVPR, 2020.
- Zhang et al. [2023a] Jiazhao Zhang, Liu Dai, Fanpeng Meng, Qingnan Fan, Xuelin Chen, Kai Xu, and He Wang. 3D-aware object goal navigation via simultaneous exploration and identification. In CVPR, 2023a.
- Zhang et al. [2022b] Sixian Zhang, Weijie Li, Xinhang Song, Yubing Bai, and Shuqiang Jiang. Generative meta-adversarial network for unseen object navigation. In ECCV, 2022b.
- Zhang et al. [2023b] Sixian Zhang, Xinhang Song, Weijie Li, Yubing Bai, Xinyao Yu, and Shuqiang Jiang. Layout-based causal inference for object navigation. In CVPR, 2023b.
- Zhang et al. [2024] Sixian Zhang, Xinyao Yu, Xinhang Song, Xiaohan Wang, and Shuqiang Jiang. Imagine before go: Self-supervised generative map for object goal navigation. In CVPR, 2024.
- Zhao et al. [2022] Yusheng Zhao, Jinyu Chen, Chen Gao, Wenguan Wang, Lirong Yang, Haibing Ren, Huaxia Xia, and Si Liu. Target-driven structured transformer planner for vision-language navigation. In ACM MM, 2022.
- Zhou et al. [2024] Gengze Zhou, Yicong Hong, and Qi Wu. NavGPT: Explicit reasoning in vision-and-language navigation with large language models. In AAAI, 2024.
- Zhou et al. [2023] Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, and Xin Eric Wang. ESC: Exploration with soft commonsense constraints for zero-shot object navigation. In ICML, 2023.
- Zhu et al. [2022] Minzhao Zhu, Binglei Zhao, and Tao Kong. Navigating to objects in unseen environments by distance prediction. In IROS, 2022.