Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation

Linqing Zhong1, Chen Gao2, Zihan Ding1, Yue Liao3, Si Liu1
1Beihang University  2National University of Singapore  3MMLab, CUHK
Equal contribution. Corresponding author: Si Liu (liusi@buaa.edu.cn)

Abstract

The Zero-Shot Object Navigation (ZSON) task requires embodied agents to find a previously unseen object by navigating in unfamiliar environments. Such goal-oriented exploration heavily relies on the ability to perceive, understand, and reason over the spatial information of the environment. However, current LLM-based approaches convert visual observations into language descriptions and reason in the linguistic space, leading to the loss of spatial information. In this paper, we introduce TopV-Nav, an MLLM-based method that reasons directly on the top-view map with complete spatial information. To fully unlock the MLLM's spatial reasoning potential in the top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method to adaptively construct a semantically-rich top-view map. It enables the agent to directly utilize the spatial information contained in the top-view map to conduct thorough reasoning. Besides, we design a Dynamic Map Scaling (DMS) mechanism to dynamically zoom the top-view map at preferred scales, enhancing local fine-grained reasoning. Additionally, we devise a Target-Guided Navigation (TGN) mechanism to predict and utilize target locations, facilitating global and human-like exploration. Experiments on the MP3D and HM3D benchmarks demonstrate the superiority of our TopV-Nav, e.g., +3.9% SR and +2.0% SPL absolute improvements on HM3D.

1 Introduction

[Figure 1]

In the realm of embodied AI, Zero-Shot Object Navigation (ZSON) is a fundamental challenge, requiring an agent to traverse continuous environments to locate a previously unseen object specified by category (e.g., fireplace). Such a zero-shot setting discards category-specific training and supports an open-category manner, emphasizing reasoning and exploration ability with minimized movement costs.

Recently, emerging works [34, 49, 48] have started to integrate Large Language Models (LLMs) into ZSON agents, aiming to improve reasoning ability by harnessing the extensive knowledge embedded in LLMs. For instance, object co-occurrence knowledge, e.g., a mouse often being found near a computer, can guide the agent's exploration.

As is well known, spatial information is vital for navigation decision-making, as it includes essential aspects such as room layouts and positional relationships among objects. Typically, navigation agents translate egocentric observations into a structured map to encode spatial information, i.e., a top-view map, also called a bird's-eye-view map. This map serves as the core representation, facilitating essential functionalities such as obstacle avoidance and path planning.

However, current methods face notable limitations. As illustrated in Fig. 1(a), these methods fall into two paradigms, i.e., frontier-based exploration (FBE) methods and waypoint-based exploration methods. Typically, LLM-based methods need to convert the top-view map into natural language, e.g., surrounding descriptions, and use an LLM to conduct reasoning in the linguistic domain. This map-to-text conversion leads to the loss of vital spatial information, such as the layout of the living room. Alternatively, if the agent knows the spatial layout of the living room and understands that a fireplace is generally positioned opposite the sofa and table, the spatial location of the fireplace can be directly inferred from the room layout. Therefore, considering that the top-view map contains useful spatial information, and that MLLMs have demonstrated capabilities in grasping spatial relationships within images in the field of image understanding [5, 25], an interesting question arises: "Can we leverage an MLLM to reason directly on the image of the top-view map to produce executable moving decisions?"

Besides, as shown in Fig. 1(a), FBE methods select a point from frontier regions to move toward for exploration, restricting the LLM's action space to only the frontier boundaries. Waypoint-based methods use a waypoint predictor to generate navigable waypoints and select one to move toward, restricting the LLM's action space to a predefined set of points. Moreover, since the waypoint predictor is trained offline using depth information, its predicted waypoints focus solely on traversability without semantics, also leading to a weakly-semantic action space. Both the FBE and waypoint-based paradigms thus suffer from constrained local action spaces, raising the question: "Can we construct a global and semantically-rich action space?"

Furthermore, when humans explore unfamiliar environments, they can use current observations to infer the layout of unseen areas and predict where the target object might be located [15]. This predicted target location guides their movement decisions strategically, even in the absence of complete information. However, the action spaces of FBE and waypoint-based methods are confined strictly to navigable regions, which prevents them from possessing this capability. Thus, the question is: "Can we leverage known observations to infer the probable location of the target object in unexplored areas to guide the current decision?"

Therefore, to address the limitations and questions mentioned above, we make multi-fold innovations. First, we propose an insightful method called TopV-Nav to fully unlock the top-view spatial reasoning potential of MLLM for the ZSON task. Specifically, the current LLM-driven paradigm requires a map-to-language process so that the LLM can reason in linguistic space, and this conversion may lose crucial spatial information such as object and room layouts. Instead, we propose a novel paradigm that leverages an MLLM to reason directly on the top-view map, discarding the map-to-language process and maximizing the utilization of all spatial information. Second, we introduce an Adaptive Visual Prompt Generation (AVPG) method to enhance the MLLM's understanding of the top-view map. AVPG adaptively generates visual prompts directly onto the map, in which various elements are spatially arranged to reflect their spatial relationships within the environment. Therefore, the MLLM can grasp crucial information directly from the map, facilitating effective spatial reasoning. For instance, in Fig. 1(b), our method can interpret the room's layout and infer the fireplace's location. Additionally, the moving location is predicted directly on the top-view map, resulting in a global and more flexible action space. Third, for environments with numerous objects, the top-view map may not be able to visually represent all elements. Thus, we propose a Dynamic Map Scaling (DMS) mechanism to dynamically adjust the map's scale via zooming operations. DMS further enhances the agent's local spatial reasoning and fine-grained exploration in local regions. Last but not least, we propose a Target-Guided Navigation (TGN) mechanism to first predict the final coordinate of the target object, which may even lie in unexplored areas. The target coordinate then guides the choice of moving location within currently navigable regions, mirroring human-like predictive reasoning and exploratory behavior.

Experiments are conducted on the MP3D and HM3D benchmarks, demonstrating that our TopV-Nav achieves promising performance, e.g., +3.9% SR and +2.0% SPL absolute improvements on HM3D.

2 Related Works

[Figure 2]

2.1 Object-goal Navigation

Object-goal navigation has been a fundamental challenge in embodied AI [28, 29, 11, 45, 46, 44, 4, 12, 40, 6, 50, 23, 39, 38, 22, 13, 47]. Early methods leverage RL to train policies, exploring visual representations [24], meta-learning [33], semantic priors [37, 35], etc., to enhance performance. Modular approaches [3, 27, 43] leverage perception models [14, 42] to construct episodic maps, based on which long-term goals are generated to guide the local policy.

To overcome the closed-world assumption and achieve zero-shot object navigation (ZSON), EmbCLIP [16] and ZSON [21] leverage the multi-modal alignment ability of CLIP [26] to enable cross-domain zero-shot object navigation. Furthermore, CoWs [10] accelerates progress on the ZSON task, where no simulation training is required and a single model can be applied across multiple environments. Recent methods [34, 49] extract semantic information using powerful off-the-shelf detectors [19, 41, 20], based on which they employ LLMs to determine the next frontier [49] or waypoint [34] for exploration. However, spatial layout information is lost during the map-to-language conversion. To address this limitation, we propose direct reasoning on the top-view map with an MLLM, fully leveraging the complete spatial information.

2.2 Spatial Reasoning with MLLMs

Developing the spatial reasoning capabilities of MLLMs has become increasingly popular recently. KAGI [17] proposes generating a coarse robot movement trajectory as dense reward supervision through keypoint inference. SCAFFOLD [18] leverages scaffolding coordinates to promote vision-language coordination. PIVOT [25] iteratively prompts an MLLM with images annotated with visual representations of proposals and can be applied to a wide range of embodied tasks. In the domain of vision-language navigation, AO-Planner [7] proposes visual affordance prompting to enable an MLLM to select candidate waypoints from front-view images. However, previous works focus on exploring MLLMs' spatial reasoning from egocentric perspectives, while the investigation from the top-view perspective remains limited, even though the top-view map is a core representation for robots. Our work explores the top-view spatial reasoning potential of MLLMs for the ZSON task.

3 Method

3.1 Problem Definition

The ZSON task requires an agent, randomly placed in a continuous environment at initialization, to navigate to an instance of a user-specified object category $\mathcal{G}$. At time step $t$, the agent receives egocentric observations $\mathcal{O}_t$, which contain RGB-D images and its pose $p_t$. The agent is expected to adopt a low-level action $a_t$, i.e., move_forward, turn_left, turn_right, look_up, look_down, or stop. The task is considered successful if the agent stops within a distance threshold of the target in fewer than 500 steps and the target is visible in the egocentric observation.

3.2 Overview

We propose a novel MLLM-driven method, termed TopV-Nav, designed to fully unlock MLLM’s top-view perception and spatial reasoning capabilities for the ZSON task.

Specifically, as shown in Fig. 2, the agent observes egocentric RGB-D images $I_t$ and its pose $p_t$ at time step $t$. Then, the proposed Adaptive Visual Prompt Generation (AVPG) module converts these egocentric observations onto a top-view map $\mathcal{VM}_t$, as shown in (a). Note that $\mathcal{VM}_t$ adopts different colors and text boxes as visual prompts to represent various elements, which are spatially arranged on the map to provide spatial layout information. Next, in the proposed Dynamic Map Scaling (DMS) module, we take $\mathcal{VM}_t$ to query the MLLM to conduct spatial reasoning and predict a ratio $r \in [0,1]$ for zooming in the map, as shown in (b)(c). Note that the zoomed region is the agent's surroundings, supporting local fine-grained exploration. Subsequently, we apply Target-Guided Navigation (TGN) as shown in Fig. 2. The scaled map is taken as input to query the MLLM to conduct spatial reasoning and produce the target location and the moving location. The target location refers to the potential location of the target object, and thus may lie in unknown areas of the map. This target location is used to guide the prediction of the moving location, which is the agent's next actual destination within navigable areas. Note that any location within navigable areas can be chosen as the moving location, resulting in a global action space that is not restricted to frontiers or waypoints. Finally, we utilize the fast marching method [31] as the local policy to perform a series of low-level actions, by which the agent gradually moves toward the moving location.

3.3 Adaptive Visual Prompt Generation

To adaptively construct top-view maps on which MLLM can perform spatial layout reasoning, we propose the Adaptive Visual Prompt Generation (AVPG) method.

Technically, at each time step $t$, we transform the egocentric depth image $d_t$ into a 3D point cloud utilizing the agent's pose $p_t$ and then convert it to the global space. Points near the floor are regarded as navigable areas, while points above a height threshold $h_\theta$ are considered obstacle areas. Moreover, we leverage Grounding-DINO [20] to identify objects and their bounding boxes from egocentric images. Each box is projected into 3D using the corresponding depth information and further projected onto the top-view map by compressing the height dimension. To build a map that the MLLM can understand, we use different colors and text as visual prompts on the map to distinguish different areas, such as previously traveled areas, navigable areas, obstacle areas, frontiers, and objects, as shown in Fig. 2(a). For detected objects, we devise a simple but efficient sampling strategy to select goal-relevant objects to display on the map. We assign each bounding box $j$ of category $i$ a priority score $score_{i,j}$, which is formulated as:

$$score_{i,j} = \alpha \cdot conf_{i,j} + (1 - \alpha) \cdot s_{i,\mathcal{G}}, \qquad (1)$$

where $conf_{i,j}$ denotes the detection confidence. Following [49], $s_{i,\mathcal{G}}$ indicates the similarity score between object category $i$ and goal category $\mathcal{G}$, and $\alpha$ is a hyper-parameter. Objects with $score_{i,j}$ above a threshold $score_\theta$ are annotated with their bounding boxes and categories on the top-view map $\mathcal{VM}_t$ and serve as visual prompts.
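To make Eq. (1) concrete, the following minimal sketch (not the released implementation) shows how the priority score could be used to filter detections; the `detections` and `similarity` inputs are illustrative assumptions standing in for Grounding-DINO outputs and the goal-similarity scores of [49].

```python
# Minimal sketch of the AVPG object-filtering step in Eq. (1).
ALPHA = 0.4             # weight between detection confidence and goal similarity
SCORE_THRESHOLD = 0.6   # only objects above this score are drawn on the map

def select_goal_relevant_objects(detections, similarity):
    """detections: list of dicts {"category": str, "conf": float, "box": ...};
    similarity: dict mapping object category -> similarity to the goal category."""
    selected = []
    for det in detections:
        s_goal = similarity.get(det["category"], 0.0)
        score = ALPHA * det["conf"] + (1.0 - ALPHA) * s_goal
        if score > SCORE_THRESHOLD:
            selected.append(det)  # annotated on the top-view map as a visual prompt
    return selected
```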

Key Area Markers Generation. We aim to further generate markers as visual prompts on the map to indicate key areas that contain rich semantics, helping the MLLM better interpret the map. Technically, we apply DBSCAN [9], a density-based spatial clustering algorithm. We cluster two types of elements: frontiers and the locations of observed objects. On one hand, by identifying the boundary points of the navigable areas and the locations of obstacles, we obtain the locations of frontiers not occupied by obstacles, $\bigcup_{i\in F}(x_i,y_i)$. On the other hand, we leverage the detection model and depth information to collect the observed objects' locations $\bigcup_{i\in O}(x_i,y_i)$. Eventually, we merge these two parts into a candidate point set $\bigcup_{i\in O\cup F}(x_i,y_i)$ and apply DBSCAN to it.

Concretely, we first randomly sample a point $(x_c,y_c)$ that has not yet been clustered from the candidate point set $\bigcup_{i\in O\cup F}(x_i,y_i)$. With this point as the center, we consider its $\epsilon$-neighborhood, which is formulated as:

$$N_\epsilon((x_c,y_c)) = \{(x_j,y_j) \mid dist((x_j,y_j),(x_c,y_c)) \leq \epsilon\}, \qquad (2)$$

where $\epsilon$ is a hyper-parameter that denotes the clustering radius, and $(x_c,y_c)$ is the center point of this $\epsilon$-neighborhood. To avoid the influence of noise, only if the number of points in the $\epsilon$-neighborhood reaches a threshold $\mu$ do we treat $(x_c,y_c)$ and the other points within its $\epsilon$-neighborhood as a new cluster. Otherwise, these points are considered noise. In other words, $\mu$ specifies the minimum number of points contained in a cluster.

Taking a step further, after a new qualified cluster is determined, we merge it with the existing clusters according to the following formula:

$$\{N_\epsilon((x_i,y_i))\} = \{N_\epsilon((x_i,y_i))\} \cup N_\epsilon((x_c,y_c)). \qquad (3)$$

Note that if any two clusters intersect, we merge them and generate a new neighborhood with a greater clustering radius. The center point of this new cluster is also updated. We repeat the process of random sampling and merging clusters until all points in $\bigcup_{i\in O\cup F}(x_i,y_i)$ are processed (i.e., included in a cluster or regarded as noise). In this way, a series of cluster centers $\bigcup_{i\in I}c_i$ are generated, which serve as markers indicating key areas, as shown in Fig. 2(a). Moreover, this clustering method enables the number and distribution of key areas to be dynamically adjusted during the agent's exploration.
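For reference, the sketch below approximates the key-area marker generation with the off-the-shelf scikit-learn DBSCAN rather than the incremental sampling-and-merging procedure described above; it is an assumption-laden illustration, using the paper's hyper-parameters $\epsilon = 7$ and $\mu = 5$.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def key_area_markers(frontier_pts, object_pts, eps=7.0, min_samples=5):
    """Cluster frontier and object locations on the top-view map and return one
    marker (cluster center) per key area. Noise points (label -1) are dropped.
    This uses scikit-learn's DBSCAN as a stand-in for the incremental
    clustering-and-merging procedure in Sec. 3.3."""
    points = np.concatenate([frontier_pts, object_pts], axis=0)  # (N, 2) map coords
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    markers = []
    for label in set(labels) - {-1}:
        markers.append(points[labels == label].mean(axis=0))  # cluster center
    return np.array(markers)
```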

3.4 Dynamic Map Scaling

The top-view map $\mathcal{VM}_t$, with fixed size and resolution, may not effectively represent the spatial layout. Therefore, we devise the Dynamic Map Scaling (DMS) mechanism to dynamically adjust the map scale during exploration, further supporting local fine-grained exploration. For instance, as shown in the original top-view map (Fig. 2(b)), some representative layouts are obscured by the textual annotations. However, in the zoomed map (c), not only is the spatial layout clearly visible, but more objects are also displayed for the MLLM to perform spatial reasoning. Such a zooming operation allows the MLLM to capture more spatial clues and to perform fine-grained exploration in local key areas.

Specifically, we query the MLLM to predict a scaling ratio according to the current top-view map, as shown in Fig. 2(b)(c). The MLLM is asked to reason on the original top-view map $\mathcal{VM}_t$, considering the spatial layout, the agent's current position, etc. We then require the MLLM to select a map scaling ratio $r \in [0,1]$ from a predefined set of candidate scales. According to the scaling ratio chosen by the MLLM, the top-view map $\mathcal{VM}_t$ is cropped with the agent as the center. Then we apply AVPG once again to obtain the scaled top-view map with visual prompts. Note that if $r = 1$, which means the top-view map remains unchanged, the original top-view map $\mathcal{VM}_t$ is directly used by TGN to produce the moving location.
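A minimal sketch of the cropping step is given below, assuming the map is stored as a NumPy array and the agent's position is given in pixel coordinates; the function name and interface are illustrative, not the authors' code.

```python
import numpy as np

def scale_top_view_map(top_view_map, agent_rc, ratio):
    """Crop the top-view map around the agent according to the MLLM-chosen
    ratio r (e.g., 1, 1/2, 1/3, 1/5). A ratio of 1 returns the map unchanged;
    smaller ratios keep a smaller window (zoom-in), which is then re-annotated
    by AVPG. `agent_rc` is the agent's (row, col) position on the map."""
    if ratio >= 1.0:
        return top_view_map
    h, w = top_view_map.shape[:2]
    half_h, half_w = int(h * ratio) // 2, int(w * ratio) // 2
    r, c = agent_rc
    top = int(np.clip(r - half_h, 0, h - 2 * half_h))
    left = int(np.clip(c - half_w, 0, w - 2 * half_w))
    return top_view_map[top:top + 2 * half_h, left:left + 2 * half_w]
```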

[Figure 3]

3.5 Target-Guided Navigation

We design the Target-Guided Navigation (TGN) mechanism to guide the decision process by predicting the target location. Note that the location can be in unknown areas, which means the MLLM can infer the potential location according to the room layout, as shown in Fig. 3.

Concretely, the textual prompts are based on templates and visual observations, specifying the object-goal category and key coordinate information (e.g., the coordinate system). Along with $\mathcal{VM}_t$, we leverage the MLLM to perform layout reasoning. The MLLM predicts and justifies the potential location of the goal $(x_g, y_g)$ directly on $\mathcal{VM}_t$. For instance, in Fig. 3, the MLLM observes "sofas", "tables", and other representative objects of a living room. However, the goal "fireplace" is unseen in the top-view map $\mathcal{VM}_t$. Through layout reasoning, the MLLM determines that the "fireplace" is also in the living room and is most likely located "opposite the sofa". Therefore, it can roughly predict the potential target location $(x_g, y_g)$ based on the layout.

Since the agent can only navigate within the navigable areas, we propose an effective soft scoring method to convert a target location in unknown areas into an actual moving location within the navigable areas. Specifically, we ask the MLLM to assign a layout score $n_{i,layout} \in [0,1]$ to each key-area marker $(x_i,y_i) \in \bigcup_{i\in mark}(x_i,y_i)$, where $mark$ denotes the set of key-area markers. Note that markers cover the entire map $\mathcal{VM}_t$ and the agent can move toward every location in it. The layout score $n_{i,layout}$ explicitly represents the spatial layout differences among these markers: the higher $n_{i,layout}$, the greater the likelihood that the agent will find the goal by moving toward $(x_i,y_i)$.

Moreover, focusing only on the layout information while ignoring the distance between each decision location $(x_i,y_i)$ and the potential target location $(x_g,y_g)$ may reduce the agent's navigation efficiency. Therefore, we calculate a position score $n_{i,pos}$ for each location $(x_i,y_i)$ using a Gaussian probability distribution centered at the estimated target object position:

$$n_{i,pos} = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(x_i-x_g)^2+(y_i-y_g)^2}{2\sigma^2}\right), \qquad (4)$$

where $\sigma$ denotes the standard deviation. $n_{i,pos}$ decays as the distance between $(x_i,y_i)$ and $(x_g,y_g)$ increases, which discourages potentially ineffective long-distance exploration. Finally, the decision location $(x^*,y^*)$ is predicted by fusing the layout and position scores, which is formulated as:

$$(x^*,y^*) = \operatorname*{arg\,max}_{i \in mark}\,(n_{i,layout}+n_{i,pos}). \qquad (5)$$

Such a soft scoring method produces the final decision location $(x^*,y^*)$ by considering both spatial layout and semantic information.
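As a sketch of Eqs. (4) and (5), the function below fuses MLLM-assigned layout scores with the Gaussian position scores and returns the selected marker; the argument names are illustrative assumptions, and the layout scores themselves would come from the MLLM query described above.

```python
import numpy as np

def select_moving_location(markers, layout_scores, target_xy, sigma=100.0):
    """markers: (M, 2) key-area marker coordinates on the top-view map;
    layout_scores: (M,) scores in [0, 1] assigned by the MLLM;
    target_xy: predicted goal location (x_g, y_g).
    Returns the marker maximizing n_layout + n_pos, following Eq. (5)."""
    markers = np.asarray(markers, dtype=float)
    sq_dist = ((markers - np.asarray(target_xy, dtype=float)) ** 2).sum(axis=1)
    # Gaussian position score of Eq. (4), centered at the predicted target.
    pos_scores = np.exp(-sq_dist / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    best = int(np.argmax(np.asarray(layout_scores, dtype=float) + pos_scores))
    return tuple(markers[best])
```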

3.6 Local Policy

Once the agent's decision location $(x^*,y^*)$ is obtained, we calculate a path from the current location to $(x^*,y^*)$ following previous works [27, 49] (the fast marching method [31]). Then a series of low-level actions are generated according to the path by utilizing a deterministic policy. During navigation, if any instance of the goal category is observed, the agent will directly navigate to it. Otherwise, the agent keeps exploring by following the MLLM's reasoning results until the maximum number of action steps is reached.
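The deterministic low-level policy is not specified in detail; a plausible minimal sketch is shown below, assuming 30-degree turns and 0.25 m forward steps (Sec. 4.1) and a waypoint supplied by the fast marching planner. It is an illustration, not the authors' exact controller.

```python
import math

TURN_ANGLE = math.radians(30)   # agent's rotation step
FORWARD_DIST = 0.25             # meters per move_forward

def next_low_level_action(agent_xy, agent_heading, waypoint_xy, goal_reached):
    """Rotate until roughly facing the next path waypoint, then step forward.
    Returns one of the discrete actions defined in Sec. 3.1."""
    if goal_reached:
        return "stop"
    dx, dy = waypoint_xy[0] - agent_xy[0], waypoint_xy[1] - agent_xy[1]
    heading_error = math.atan2(dy, dx) - agent_heading
    # Wrap the error to [-pi, pi] so the shorter turning direction is chosen.
    heading_error = math.atan2(math.sin(heading_error), math.cos(heading_error))
    if heading_error > TURN_ANGLE / 2:
        return "turn_left"
    if heading_error < -TURN_ANGLE / 2:
        return "turn_right"
    return "move_forward"
```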

4 Experiments

Table 1: Comparison with previous methods on the MP3D and HM3D benchmarks. "✓"/"✗" indicate whether a method is zero-shot and training-free.

| Methods | Zero-Shot | Training-Free | Reasoning Domain | MP3D SR↑ | MP3D SPL↑ | HM3D SR↑ | HM3D SPL↑ |
|---|---|---|---|---|---|---|---|
| SemEXP [3] [NeurIPS 2020] | ✗ | ✗ | Latent Map | 36.0 | 14.4 | - | - |
| PONI [27] [CVPR 2022] | ✗ | ✗ | Latent Map | 31.8 | 12.1 | - | - |
| ProcTHOR [8] [NeurIPS 2022] | ✗ | ✗ | CLIP Embeddings | - | - | 54.4 | 31.8 |
| ProcTHOR-ZS [8] [NeurIPS 2022] | ✓ | ✗ | CLIP Embeddings | - | - | 13.2 | 7.7 |
| ZSON [21] [NeurIPS 2022] | ✓ | ✗ | CLIP Embeddings | 15.3 | 4.8 | 25.5 | 12.6 |
| PSL [32] [ECCV 2024] | ✓ | ✗ | CLIP Embeddings | - | - | 42.4 | 19.2 |
| Pixel-Nav [1] [ICRA 2024] | ✓ | ✗ | Linguistic | - | - | 37.9 | 20.5 |
| SGM [46] [CVPR 2024] | ✓ | ✗ | Linguistic | 37.7 | 14.7 | 60.2 | 30.8 |
| CoW [10] [CVPR 2023] | ✓ | ✓ | CLIP Embeddings | 7.4 | 3.7 | - | - |
| ESC [49] [ICML 2023] | ✓ | ✓ | Linguistic | 28.7 | 14.2 | 39.2 | 22.3 |
| VoroNav [34] [ICML 2024] | ✓ | ✓ | Linguistic | - | - | 42.0 | 26.0 |
| TopV-Nav (Ours) | ✓ | ✓ | Top-view Map | 31.9 | 16.1 | 45.9 | 28.0 |

4.1 Experimental Setup

Dataset. In order to evaluate our TopV-Nav, we conduct extensive experiments on two representative object navigation benchmarks: Matterport3D (MP3D) [2] and Habitat-Matterport3D (HM3D) [36], which are built upon the Habitat simulator [30]. MP3D contains 11 high-fidelity scenes and 2195 episodes for validation, with 21 object goal categories. As for HM3D, the standard dataset partition [36] incorporates 2000 validation episodes in 20 buildings with 6 goal categories. Since our work focuses on zero-shot object navigation, none of the episodes is utilized for training.

Evaluation Metrics. We adopt Success Rate (SR) and Success weighted by Path Length (SPL) to evaluate object navigation performance. SR indicates the ratio of episodes that are considered successful, and SPL is the primary metric for measuring the agent's navigation efficiency.
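For clarity, the two metrics can be computed as in the sketch below, which follows the commonly used definition of SPL (success weighted by the ratio of shortest-path length to traveled path length); the function signature is illustrative rather than the benchmark's evaluation code.

```python
def success_rate_and_spl(episodes):
    """episodes: list of (success: bool, shortest_len: float, traveled_len: float).
    SR averages the success indicator; SPL averages S_i * l_i / max(p_i, l_i),
    where l_i is the shortest-path length and p_i the path length traveled."""
    n = len(episodes)
    sr = sum(s for s, _, _ in episodes) / n
    spl = sum(s * l / max(p, l) for s, l, p in episodes) / n
    return sr, spl
```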

Implementation Details. We set 500 as the agent's maximum number of navigation steps. The agent's rotation angle is set to 30 degrees and the distance for move_forward is 0.25 meters. The top-view map we construct is an 800×800 pixel map with a resolution of 0.05 m. For the hyper-parameters described in our method, we set $\alpha = 0.4$ and the score threshold $score_\theta = 0.6$ for selecting goal-relevant objects as visual prompts. The clustering radius $\epsilon$ is set to 7 and the minimum number of points contained in a cluster (i.e., $\mu$) is set to 5. As for the map scaling ratios detailed in the DMS method, we predefine candidate ratios $r \in \{1, 1/2, 1/3, 1/5\}$. In the TGN mechanism, the standard deviation of the Gaussian probability distribution is $\sigma = 100$, while the center point is dynamically estimated from the MLLM's predicted target object position. We leverage the publicly available Grounding-DINO [20] for object detection and adopt GPT-4o as the spatial reasoning MLLM in both DMS and TGN.
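For convenience, the hyper-parameters above can be gathered into a single configuration; the dictionary below is a hypothetical summary for reproduction purposes and is not part of a released codebase.

```python
# Hypothetical configuration collecting the hyper-parameters of Sec. 4.1.
CONFIG = {
    "max_steps": 500,             # episode budget of low-level actions
    "turn_angle_deg": 30,         # rotation per turn action
    "forward_dist_m": 0.25,       # translation per move_forward
    "map_size_px": 800,           # top-view map is 800 x 800 pixels
    "map_resolution_m": 0.05,     # meters per pixel
    "alpha": 0.4,                 # Eq. (1) weighting
    "score_threshold": 0.6,       # object prompt selection threshold
    "dbscan_eps": 7,              # clustering radius for key-area markers
    "dbscan_min_points": 5,       # minimum points per cluster
    "map_scaling_ratios": [1, 1/2, 1/3, 1/5],  # DMS candidates
    "gaussian_sigma": 100,        # TGN position-score standard deviation
    "detector": "Grounding-DINO",
    "mllm": "GPT-4o",
}
```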

4.2 Comparison with Previous Methods

In this section, we compare our TopV-Nav with state-of-the-art object navigation methods on the MP3D and HM3D benchmarks. Since requesting OpenAI's service is costly and time-consuming, we sample a subset of the dataset to conduct experiments efficiently. Specifically, we sample 1,500 episodes from HM3D and MP3D, respectively. These sampled episodes cover all scenes and target object categories in the validation split, so the subset remains highly representative. We mainly compare with ESC [49] and VoroNav [34], as shown in Tab. 1. Note that ESC applies the frontier-based exploration method and VoroNav is a waypoint-based approach. This analysis explicitly compares the performance of the three paradigms (i.e., FBE-based, waypoint-based, and our top-view spatial reasoning) on the object navigation task.

As shown in Tab. 1, our TopV-Nav significantly improves navigation performance. Specifically, our approach outperforms ESC on the validation split of MP3D, improving SR and SPL by 3.2% and 1.9%, respectively. Moreover, on the HM3D benchmark, SR increases from 42.0% to 45.9% and SPL rises from 26.0% to 28.0%. Such an improvement reveals that the MLLM's direct layout reasoning on the top-view map leverages more spatial clues than the map-to-language manner followed by both the FBE and waypoint-based paradigms, demonstrating the superiority of our proposed approach.

4.3 Ablation Studies

To analyze the contribution of each module, we conduct ablation experiments, which are shown in Tab. 2, Tab. 3, and Tab. 4. To save time and cost, an indicative subset of the HM3D dataset is sampled to carry out the ablation studies. We sample 200 episodes, which cover all validation scenes and target object categories of HM3D. In all ablation experiments, we utilize the same episodes for evaluation to ensure stability.

Adaptive Visual Prompt Generation. The Adaptive Visual Prompt Generation (AVPG) method is a core component of the MLLM's layout reasoning. As shown in Tab. 2, AVPG effectively supports and unleashes the MLLM's spatial reasoning abilities, achieving a significant 40.5% SR and 22.7% SPL on the HM3D benchmark. The result suggests that the MLLM has substantial potential to guide the agent's exploration by grasping visual prompts and performing layout reasoning on the top-view map.

Dynamic Map Scaling. As shown in Tab. 2, we observe that applying the DMS mechanism promotes SR from 40.5% to 42.0%, implying that the agent's zooming operation benefits object navigation. Besides, SPL is improved from 22.7% to 23.6%. This improvement in navigation efficiency reveals that grasping more visual clues and layout information benefits fine-grained layout reasoning, preventing the agent from potentially ineffective long-distance exploration.

Target-Guided Navigation. Comparing "#2" and "#3" in Tab. 2, we observe that our proposed TGN module boosts SR from 42.0% to 43.5% and improves SPL from 23.6% to 24.7%, which indicates that the MLLM's spatial layout inference estimates the potential goal location, and that this predicted location further effectively guides the agent's exploration. This significant improvement also confirms the importance of human-like predictive reasoning in unfamiliar environments.

Table 2: Ablation of the proposed components (AVPG, DMS, TGN) on HM3D.

| Name | AVPG | DMS | TGN | SR↑ | SPL↑ |
|---|---|---|---|---|---|
| #1 | ✓ | | | 40.5 | 22.7 |
| #2 | ✓ | ✓ | | 42.0 | 23.6 |
| #3 | ✓ | ✓ | ✓ | 43.5 | 24.7 |

Visual Prompt Components. The results in Tab. 3 indicate the effects of different visual prompt components on task performance. The comparison between "#2" and "#1" confirms that explicitly marking the spatial locations of objects on the top-view map improves navigation performance: SR is lifted from 39.5% to 42.0% and SPL is improved by 1.7%. Besides, integrating the MLLM's predicted target as a prompt gains a further significant improvement in SR (42.0% to 43.5%) and SPL (23.8% to 24.7%). This illustrates that rich visual prompts embedded in the top-view map facilitate the MLLM's layout reasoning and therefore enhance the agent's navigation ability.

Map Scaling Ratios. We investigate how the ratios of the agent's zooming operation affect navigation performance. As shown in Tab. 4, both SR and SPL increase as the number of candidate map scaling ratios increases. When the top-view map remains unchanged (i.e., "#1" in Tab. 4), the object navigation performance is relatively poor. This limitation stems from the fact that the MLLM cannot perform zooming for fine-grained layout reasoning, especially in object-dense scenes. Adding 1/2 to the map scaling ratios raises SR from 39.0% to 41.5%, while SPL also increases by 1.5%. Moreover, introducing more map scales (1/3, 1/5) further improves navigation performance, as shown in "#3", lifting SR and SPL from 41.5% and 23.1% to 43.5% and 24.7%.

Table 3: Ablation of visual prompt components on HM3D.

| Name | History | Object | Target | SR↑ | SPL↑ |
|---|---|---|---|---|---|
| #1 | ✓ | | | 39.5 | 22.1 |
| #2 | ✓ | ✓ | | 42.0 | 23.8 |
| #3 | ✓ | ✓ | ✓ | 43.5 | 24.7 |

Table 4: Ablation of candidate map scaling ratios on HM3D.

| Name | 1 | 1/2 | 1/3 and 1/5 | SR↑ | SPL↑ |
|---|---|---|---|---|---|
| #1 | ✓ | | | 39.0 | 21.6 |
| #2 | ✓ | ✓ | | 41.5 | 23.1 |
| #3 | ✓ | ✓ | ✓ | 43.5 | 24.7 |

[Figure 4]

4.4 Qualitative Analysis

We visualize the navigation process along with the MLLM’s spatial reasoning to give a more intuitive view.

Fig. 4 displays several examples of the MLLM's spatial reasoning. Note that we utilize different colors to annotate specific areas (e.g., obstacles and navigable areas). Object categories are also directly marked on the top-view map according to their spatial locations, facilitating the MLLM's layout reasoning.

In Fig. 4(a), the agent is tasked with searching for a bed in an indoor environment. At this time step, representative objects of a living room, such as chairs and cabinets, are located in this area. The MLLM observes these objects and the scene layout, which is likely a living room. Based on its embedded prior knowledge, the MLLM performs layout reasoning and infers that key area 5 is likely the location of a bedroom. Besides, by leveraging the commonsense that "a bed is generally located in the bedroom", the MLLM utilizes the provided coordinate system information and estimates the potential location of the goal to be "(2.5, 7.5)". This prediction further guides the local action policy, allowing the agent to gradually approach the target object.

In Fig. 4(b), considering that the goal is "bed", the MLLM interprets the top-view map and notices that the agent is in the kitchen area and near the living room, according to the spatial layout displayed on the map. Considering that the living room and bedroom are usually connected by a corridor, the MLLM indicates that the bedroom may be located at "(5.0, 8.0)" and guides the agent to navigate there in search of the goal "bed". In the example of Fig. 4(c), the MLLM observes the spatial layout and notices the detected bed. Considering that the bed is located in the bedroom, which also relates to the goal "toilet", the MLLM predicts the target location "(2.5, 3.5)".

The examples above reveal that our proposed method effectively unlocks the MLLM's top-view spatial reasoning potential and completes zero-shot object navigation tasks. Benefiting from the MLLM's spatial inference over the top-view map, this paradigm allows the agent to act in a global action space and to achieve highly flexible exploration.

5 Conclusion

In this paper, we tackle the Zero-Shot Object Navigation (ZSON) task, where spatial information plays a critical role in goal-oriented exploration. However, previous LLM-based methods must convert the top-view map into language descriptions and conduct exploration reasoning in the linguistic domain. Such a conversion process loses spatial information such as object and room layouts. Therefore, we study how to directly adopt the top-view map for reasoning by using the MLLM's ability to understand images. Specifically, we propose several insightful methods to fully unlock the top-view spatial reasoning potential of MLLM. The proposed Adaptive Visual Prompt Generation (AVPG) method draws a semantically-rich map with visual prompts. The Dynamic Map Scaling (DMS) mechanism adjusts the map to appropriate scales for the MLLM to interpret the layout and make decisions. The Target-Guided Navigation (TGN) mechanism imitates human behavior by predicting the target's potential location to guide the current action. Experiments on the MP3D and HM3D benchmarks demonstrate the effectiveness of our work.

References

  • Cai et al. [2024] Wenzhe Cai, Siyuan Huang, Guangran Cheng, Yuxing Long, Peng Gao, Changyin Sun, and Hao Dong. Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill. In ICRA, 2024.
  • Chang et al. [2017] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. 3DV, 2017.
  • Chaplot et al. [2020a] Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R. Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. In NeurIPS, 2020a.
  • Chaplot et al. [2020b] Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, and Saurabh Gupta. Neural topological SLAM for visual navigation. In CVPR, 2020b.
  • Chen et al. [2024a] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024a.
  • Chen et al. [2022] Jinyu Chen, Chen Gao, Erli Meng, Qiong Zhang, and Si Liu. Reinforced structured state-evolution for vision-language navigation. In CVPR, 2022.
  • Chen et al. [2024b] Jiaqi Chen, Bingqian Lin, Xinmin Liu, Xiaodan Liang, and Kwan-Yee K. Wong. Affordances-oriented planning using foundation models for continuous vision-language navigation. In CoRR, 2024b.
  • Deitke et al. [2022] Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. ProcTHOR: Large-scale embodied AI using procedural generation. In NeurIPS, 2022.
  • Ester et al. [1996] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, 1996.
  • Gadre et al. [2023] Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. CoWs on Pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In CVPR, 2023.
  • Gao et al. [2021] Chen Gao, Jinyu Chen, Si Liu, Luting Wang, Qiong Zhang, and Qi Wu. Room-and-object aware knowledge reasoning for remote embodied referring expression. In CVPR, 2021.
  • Gao et al. [2023a] Chen Gao, Si Liu, Jinyu Chen, Luting Wang, Qi Wu, Bo Li, and Qi Tian. Room-object entity prompting and reasoning for embodied referring expression. TPAMI, 2023a.
  • Gao et al. [2023b] Chen Gao, Xingyu Peng, Mi Yan, He Wang, Lirong Yang, Haibing Ren, Hongsheng Li, and Si Liu. Adaptive zone-aware hierarchical planner for vision-language navigation. In CVPR, 2023b.
  • He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • Kessler et al. [2024] Fabian Kessler, Julia Frankenstein, and Constantin A. Rothkopf. Human navigation strategies and their errors result from dynamic interactions of spatial uncertainties. Nature Communications, 2024.
  • Khandelwal et al. [2022] Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, and Aniruddha Kembhavi. Simple but effective: CLIP embeddings for embodied AI. In CVPR, 2022.
  • Lee et al. [2024] Olivia Y. Lee, Annie Xie, Kuan Fang, Karl Pertsch, and Chelsea Finn. Affordance-guided reinforcement learning via visual prompting. RSS, 2024.
  • Lei et al. [2024] Xuanyu Lei, Zonghan Yang, Xinrui Chen, Peng Li, and Yang Liu. Scaffolding coordinates to promote vision-language coordination in large multi-modal models. arXiv preprint arXiv:2402.12058, 2024.
  • Li et al. [2022] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In CVPR, 2022.
  • Liu et al. [2024] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In ECCV, 2024.
  • Majumdar et al. [2022] Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, and Dhruv Batra. ZSON: Zero-shot object-goal navigation using multimodal goal embeddings. NeurIPS, 2022.
  • Maksymets et al. [2021] Oleksandr Maksymets, Vincent Cartillier, Aaron Gokaslan, Erik Wijmans, Wojciech Galuba, Stefan Lee, and Dhruv Batra. THDA: Treasure hunt data augmentation for semantic navigation. In ICCV, 2021.
  • Mayo et al. [2021] Bar Mayo, Tamir Hazan, and Ayellet Tal. Visual navigation with spatial attention. In CVPR, 2021.
  • Mousavian et al. [2019] Arsalan Mousavian, Alexander Toshev, Marek Fišer, Jana Košecká, Ayzaan Wahid, and James Davidson. Visual representations for semantic target driven navigation. In ICRA, 2019.
  • Nasiriany et al. [2024] Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, et al. PIVOT: Iterative visual prompting elicits actionable knowledge for VLMs. In ICML, 2024.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  • Ramakrishnan et al. [2022] Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman. PONI: Potential functions for objectgoal navigation with interaction-free learning. In CVPR, 2022.
  • Ramrakhya et al. [2022] Ram Ramrakhya, Eric Undersander, Dhruv Batra, and Abhishek Das. Habitat-Web: Learning embodied object-search strategies from human demonstrations at scale. In CVPR, 2022.
  • Ramrakhya et al. [2023] Ram Ramrakhya, Dhruv Batra, Erik Wijmans, and Abhishek Das. PIRLNav: Pretraining with imitation and RL finetuning for ObjectNav. In CVPR, 2023.
  • Savva et al. [2019] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied AI research. In ICCV, 2019.
  • Sethian [1999] James A. Sethian. Fast marching methods. SIAM Review, 1999.
  • Sun et al. [2024] Xinyu Sun, Lizhao Liu, Hongyan Zhi, Ronghe Qiu, and Junwei Liang. Prioritized semantic learning for zero-shot instance navigation. In ECCV, 2024.
  • Wortsman et al. [2019] Mitchell Wortsman, Kiana Ehsani, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In CVPR, 2019.
  • Wu et al. [2024] Pengying Wu, Yao Mu, Bingxian Wu, Yi Hou, Ji Ma, Shanghang Zhang, and Chang Liu. VoroNav: Voronoi-based zero-shot object navigation with large language model. ICML, 2024.
  • Wu et al. [2018] Yi Wu, Yuxin Wu, Aviv Tamar, Stuart Russell, Georgia Gkioxari, and Yuandong Tian. Learning and planning with a semantic model. arXiv preprint arXiv:1809.10842, 2018.
  • Yadav et al. [2023] Karmesh Yadav, Ram Ramrakhya, Santhosh Kumar Ramakrishnan, Theo Gervet, John Turner, Aaron Gokaslan, Noah Maestre, Angel Xuan Chang, Dhruv Batra, Manolis Savva, et al. Habitat-Matterport 3D semantics dataset. In CVPR, 2023.
  • Yang et al. [2018] Wei Yang, Xiaolong Wang, Ali Farhadi, Abhinav Gupta, and Roozbeh Mottaghi. Visual semantic navigation using scene priors. arXiv preprint arXiv:1810.06543, 2018.
  • Ye et al. [2021] Joel Ye, Dhruv Batra, Abhishek Das, and Erik Wijmans. Auxiliary tasks and exploration enable ObjectNav. In ICCV, 2021.
  • Ye and Yang [2021] Xin Ye and Yezhou Yang. Hierarchical and partially observable goal-driven policy learning with goals relational graph. In CVPR, 2021.
  • Zhai and Wang [2023] Albert J. Zhai and Shenlong Wang. PEANUT: Predicting and navigating to unseen targets. In ICCV, 2023.
  • Zhang et al. [2022a] Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. GLIPv2: Unifying localization and vision-language understanding. In NeurIPS, 2022a.
  • Zhang et al. [2020] Jiazhao Zhang, Chenyang Zhu, Lintao Zheng, and Kai Xu. Fusion-aware point convolution for online semantic 3D scene segmentation. In CVPR, 2020.
  • Zhang et al. [2023a] Jiazhao Zhang, Liu Dai, Fanpeng Meng, Qingnan Fan, Xuelin Chen, Kai Xu, and He Wang. 3D-aware object goal navigation via simultaneous exploration and identification. In CVPR, 2023a.
  • Zhang et al. [2022b] Sixian Zhang, Weijie Li, Xinhang Song, Yubing Bai, and Shuqiang Jiang. Generative meta-adversarial network for unseen object navigation. In ECCV, 2022b.
  • Zhang et al. [2023b] Sixian Zhang, Xinhang Song, Weijie Li, Yubing Bai, Xinyao Yu, and Shuqiang Jiang. Layout-based causal inference for object navigation. In CVPR, 2023b.
  • Zhang et al. [2024] Sixian Zhang, Xinyao Yu, Xinhang Song, Xiaohan Wang, and Shuqiang Jiang. Imagine before go: Self-supervised generative map for object goal navigation. In CVPR, 2024.
  • Zhao et al. [2022] Yusheng Zhao, Jinyu Chen, Chen Gao, Wenguan Wang, Lirong Yang, Haibing Ren, Huaxia Xia, and Si Liu. Target-driven structured transformer planner for vision-language navigation. In ACM MM, 2022.
  • Zhou et al. [2024] Gengze Zhou, Yicong Hong, and Qi Wu. NavGPT: Explicit reasoning in vision-and-language navigation with large language models. In AAAI, 2024.
  • Zhou et al. [2023] Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, and Xin Eric Wang. ESC: Exploration with soft commonsense constraints for zero-shot object navigation. In ICML, 2023.
  • Zhu et al. [2022] Minzhao Zhu, Binglei Zhao, and Tao Kong. Navigating to objects in unseen environments by distance prediction. In IROS, 2022.