Embodied intelligence, which focuses on the ability to perceive first-view data from the world and make decisions adaptively based on the feedback received, is considered one of the possible approaches toward artificial general intelligence (AGI). We first review advances in LLM agent-based embodied intelligence and provide resources and future directions for research and development.
Then, we propose a comprehensive benchmark platform and agents powered by large pre-trained models in complex urban environments. The platform includes a simulator and datasets for evaluating five representative tasks of embodied intelligence in urban environments, focusing on scene understanding, reasoning, and decision-making.
Within this benchmark, we delve into two critical embodied tasks. The first task is vision-and-language navigation (VLN), where an agent navigates following human language instructions. Traditional VLN agents, which perform well in indoor environments using navigation graphs, struggle in 3D continuous urban spaces. To address this, we introduce CityNav, a city-level VLN agent that employs semantic map-based spatial reasoning and a topological memory graph for robust navigation in urban environments. CityNav constructs a 3D local semantic map, utilizes LLMs for common-sense reasoning, and records navigable locations in a memory graph, significantly enhancing navigation performance.
The second key task is location-goal embodied navigation, particularly relevant to drone delivery systems in urban settings. We present DeliverGPT, an agent powered by large pre-trained models, featuring modules for perception, planning, motion, and memory. DeliverGPT builds a semantic graph for spatial understanding and uses a memory mechanism to improve delivery efficiency. Experimental results demonstrate a substantial performance improvement over existing methods, showcasing the effectiveness of integrating semantic graphs and memory mechanisms in drone-based navigation tasks.
Chen Gao, Xiaochong Lan, Jinzhu Mao, Jun Zhang, Zhiheng Zheng,
Sibo Li, Jiaao Tang, Zihan Huang, Yuwei Du, Jie Feng, Yong Li
Embodied intelligence is considered one of the promising approaches toward artificial general intelligence (AGI), focusing on the ability to perceive first-view data from the world and make decisions adaptively based on feedback received. Alan Turing's concept of "situated AI," which aims to build embodied intelligences situated in the real world, underscores the necessity of embodiment in AI. However, despite advancements in machine learning and deep learning, achieving true embodied intelligence has remained challenging. Traditional machine learning methods, reliant on offline data, struggle with generalization, indicating a significant gap before reaching genuine embodied intelligence. Recently, large language models (LLMs) have redefined the boundaries of artificial intelligence with their impressive abilities in language-related tasks, reasoning, and decision-making. While LLMs excel in understanding and generating human language and possess rich knowledge and reasoning abilities, they also have notable limitations, such as susceptibility to errors and hallucinations, and challenges in logical reasoning and fine-grained visual understanding. To address these shortcomings, researchers have proposed LLM-empowered agents, which integrate advanced reasoning, learning mechanisms, and interactive capabilities. These agents represent a promising direction for achieving human-like intelligence by combining the strengths of LLMs with additional components to simulate more holistic and dynamic cognitive processes. This paper reviews recent advances in LLM agent-based embodied intelligence, discusses the necessity and potential of LLM agents in this area, and provides resources and future directions for research and development.
In this paper, we take a pioneering step toward systematically reviewing the recent advances of large language model agents in the research area of embodied intelligence. Since the tasks of embodied intelligence are diverse, we first present a basic taxonomy. We then elaborate on the recent advances with large language model agents, following this taxonomy. We further discuss the remaining open problems and promising research directions. We believe this survey can help readers quickly grasp the recent advances and inspire follow-up research.
Chen Gao, Baining Zhao, Weichen Zhang, Jinzhu Mao, Fanhang Man, Jianjie Fang,
Zile Zhou, Jinqiang Cui, Xinlei Chen, Yong Li
We first constructed a city-scale embodied environment simulator. This platform is developed based on a city simulator, providing 3D environments and interactions. The basic environment of the simulator covers a large business district in Beijing, one of the biggest cities in China, in which we build 3D models for buildings, streets, and other elements, hosted by Unreal Engine 4.17. We further build the interface for embodied agents to ensure that agents can indeed embody themselves in the system. To implement it, we use the AirSim plugin provided by Microsoft. Specifically, AirSim was originally designed for aerial drones, for which observations are obtained in a first-person view, and the control of the drones covers motion, velocity, acceleration, etc.
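As a rough illustration of how an embodied agent is attached to the simulator through AirSim, the following Python sketch (a minimal example assuming a standard AirSim installation; the waypoint coordinates and camera settings are illustrative, not taken from our platform) captures a first-person-view observation and issues motion commands:

```python
import airsim

# Connect to the AirSim-backed simulator (assumes the Unreal Engine
# environment with the AirSim plugin is already running).
client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)

# Take off and fly to an illustrative waypoint (NED frame: negative z is up).
client.takeoffAsync().join()
client.moveToPositionAsync(x=50, y=-20, z=-30, velocity=5).join()

# Capture a first-person-view RGB image as the embodied observation.
responses = client.simGetImages([
    airsim.ImageRequest("0", airsim.ImageType.Scene)
])
rgb_bytes = responses[0].image_data_uint8  # compressed PNG bytes by default

# Query the drone state (position, velocity, orientation) as additional input.
state = client.getMultirotorState()
print(state.kinematics_estimated.position)
```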
We defined a system of five tasks: embodied scene description, embodied question answering, embodied dialogue, embodied visual language navigation, and embodied task planning. For each task, we carefully and manually set up the input/output and construct the ground-truth data by combining large language models with human labor. We also provide an interface to the platform through which agents can obtain embodied observations and take actions in real-time simulation, after which the agents can be evaluated. Moreover, we deploy the most widely used large language models to construct embodied agents, whose intelligence level is evaluated on the five tasks.
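To make the evaluation protocol concrete, the sketch below outlines a unified task interface in the spirit described above; all class and method names are hypothetical placeholders rather than the platform's actual API:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Observation:
    rgb_image: bytes              # first-person-view image from the simulator
    position: Tuple[float, float, float]  # (x, y, z) of the embodied agent
    instruction: str              # task-specific text, e.g., a question or instruction

class EmbodiedTask:
    """Hypothetical protocol shared by the five tasks (scene description,
    question answering, dialogue, visual language navigation, task planning)."""

    def reset(self) -> Observation:
        """Spawn the agent and return the initial observation."""
        raise NotImplementedError

    def step(self, action: str) -> Observation:
        """Apply a text or motion action and return the next observation."""
        raise NotImplementedError

    def evaluate(self, outputs: List[str]) -> dict:
        """Compare agent outputs with ground truth (e.g., BLEU, ROUGE, SR)."""
        raise NotImplementedError
```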
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE | METEOR | CIDEr |
---|---|---|---|---|---|---|---|
fuyu-8B | 40.25 | 20.26 | 8.40 | 1.57 | 17.29 | 15.80 | 21.55 |
Qwen-VL | 40.57 | 17.59 | 5.90 | 0.98 | 14.61 | 19.13 | 18.40 |
Claude 3 | 57.38 | 31.73 | 16.83 | 7.19 | 21.60 | 29.00 | 29.20 |
GPT-4 Turbo | 54.01 | 27.63 | 12.73 | 4.53 | 21.99 | 28.48 | 22.39 |
Type | Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE | METEOR | CIDEr |
---|---|---|---|---|---|---|---|---|
Counting | fuyu-8B | 12.00 | 7.15 | 1.07 | 0.40 | 16.45 | 15.41 | 8.87 |
Counting | Qwen-VL | 5.49 | 1.19 | 0.10 | 0.00 | 11.46 | 17.89 | 3.58 |
Counting | Claude 3 | 6.08 | 4.33 | 2.79 | 2.13 | 10.54 | 16.82 | 7.95 |
Counting | GPT-4 Turbo | 12.84 | 8.81 | 4.33 | 2.78 | 19.26 | 20.18 | 11.56 |
Property | fuyu-8B | 20.19 | 18.36 | 16.39 | 14.64 | 31.55 | 20.34 | 22.56 |
Property | Qwen-VL | 55.77 | 48.43 | 40.90 | 31.94 | 65.33 | 61.73 | 33.30 |
Property | Claude 3 | 49.34 | 41.88 | 34.10 | 23.44 | 60.51 | 55.29 | 29.84 |
Property | GPT-4 Turbo | 76.63 | 72.17 | 68.57 | 65.51 | 80.16 | 77.10 | 61.44 |
Position | fuyu-8B | 7.46 | 0.15 | 0.00 | 0.00 | 18.94 | 4.40 | 12.86 |
Position | Qwen-VL | 7.88 | 4.63 | 3.81 | 0.83 | 18.03 | 22.00 | 16.62 |
Position | Claude 3 | 7.57 | 5.85 | 4.37 | 1.56 | 19.04 | 34.28 | 18.82 |
Position | GPT-4 Turbo | 64.54 | 61.85 | 59.44 | 55.31 | 70.72 | 68.87 | 58.45 |
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE | METEOR | CIDEr |
---|---|---|---|---|---|---|---|
fuyu-8B | 29.05 | 16.73 | 8.24 | 4.30 | 28.53 | 30.12 | 14.47 |
Qwen-VL | 17.91 | 9.54 | 3.90 | 2.03 | 19.33 | 19.65 | 10.30 |
Claude 3 | 24.86 | 18.02 | 13.14 | 9.70 | 29.06 | 38.56 | 28.62 |
GPT-4 Turbo | 41.77 | 34.27 | 27.82 | 23.26 | 42.29 | 51.72 | 35.64 |
Model | SR/% (Short) | SPL/% (Short) | NE/m (Short) | SR/% (Long) | SPL/% (Long) | NE/m (Long) | SR/% (Mean) | SPL/% (Mean) | NE/m (Mean) |
---|---|---|---|---|---|---|---|---|---|
Qwen-VL | 33.33 | 29.60 | 67.30 | 8.33 | 6.67 | 145.3 | 22.22 | 19.33 | 120.44 |
Claude 3 | 76.92 | 75.60 | 139.11 | 20.00 | 19.65 | 185.48 | 34.90 | 34.25 | 162.35 |
GPT-4 Turbo | 60.90 | 55.21 | 95.93 | 15.62 | 14.16 | 127.87 | 27.71 | 25.12 | 111.92 |
GPT-4o | 76.92 | 75.60 | 77.23 | 20.00 | 19.65 | 102.98 | 34.90 | 34.25 | 90.11 |
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE | METEOR | CIDEr |
---|---|---|---|---|---|---|---|
fuyu-8B | 15.11 | 6.37 | 1.71 | 0.45 | 14.72 | 19.11 | 16.84 |
Qwen-VL | 20.28 | 9.10 | 3.75 | 1.44 | 19.42 | 17.90 | 11.36 |
Claude 3 | 29.21 | 16.22 | 9.17 | 4.40 | 22.85 | 31.58 | 21.78 |
GPT-4 Turbo | 28.23 | 13.72 | 6.26 | 2.82 | 21.61 | 28.47 | 16.41 |
Weichen Zhang, Chen Gao, Shiquan Yu, Baining Zhao, Han Li,
Qian Zhang, Susu Xu, Jinqiang Cui, Xinlei Chen, Yong Li
We propose CityNav, an LLM-empowered aerial agent for urban VLN. The agent needs the ability to understand the navigation instruction, decompose the navigation task into a sequence of subgoals, perceive the spatial information of its surroundings, and plan the next waypoint. Therefore, a model with vision and reasoning capacity is required for such a complex navigation task. Besides, to improve navigation efficiency and stability, the agent also needs to leverage its historical trajectories. In our work, we leverage GroundingDINO and GPT-4 for visual perception and reasoning, respectively. CityNav is composed of four key modules: 1) an LLM task planner that decomposes the original navigation task into a sequence of subgoals, 2) a 3D scene understanding module that extracts structured spatial information, 3) an LLM waypoint planner that predicts the next navigable waypoint, and 4) a memory for the storage and retrieval of historical trajectories.
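For intuition, the following pseudo-Python sketch shows how these four modules might interact in one decision step; the function and method names are illustrative stand-ins (the `llm` object plays the role of GPT-4 and `detector` the role of GroundingDINO), not CityNav's actual implementation:

```python
from typing import Any, Dict, List

def citynav_step(instruction: str,
                 observation: Dict[str, Any],
                 memory_graph: Dict[str, Any],
                 llm,
                 detector) -> Any:
    """One illustrative decision step of a CityNav-style agent.
    `llm` and `detector` are assumed to expose the methods called below."""
    # 1) LLM task planner: decompose the instruction into ordered subgoals.
    subgoals: List[str] = llm.plan_subgoals(instruction)
    current_subgoal = subgoals[0]

    # 2) 3D scene understanding: detect landmark-related objects in the
    #    first-person view and lift them into a local 3D semantic map.
    detections = detector.detect(observation["rgb"], prompt=current_subgoal)
    semantic_map = [
        {"label": d["label"], "position": d["position_3d"]} for d in detections
    ]

    # 3) LLM waypoint planner: reason over the semantic map and the memory
    #    graph to pick the next navigable waypoint toward the subgoal.
    waypoint = llm.plan_waypoint(current_subgoal, semantic_map, memory_graph)

    # 4) Memory: record the chosen waypoint so it can be reused or revisited.
    memory_graph.setdefault("nodes", []).append(
        {"waypoint": waypoint, "subgoal": current_subgoal}
    )
    return waypoint
```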
Method | SR/% ↑ (Easy) | SPL/% ↑ (Easy) | NE/m ↓ (Easy) | SR/% ↑ (Normal) | SPL/% ↑ (Normal) | NE/m ↓ (Normal) | SR/% ↑ (Hard) | SPL/% ↑ (Hard) | NE/m ↓ (Hard) | SR/% ↑ (Mean) | SPL/% ↑ (Mean) | NE/m ↓ (Mean) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Random | 0.0 | 0.0 | 85.6 | 0.0 | 0.0 | 127.9 | 0.0 | 0.0 | 164.9 | 0.0 | 0.0 | 129.6 |
AC | 0.0 | 0.0 | 242.2 | 0.0 | 0.0 | 315.6 | 0.0 | 0.0 | 263.2 | 0.0 | 0.0 | 290.4 |
VELMA | 0.0 | 0.0 | 76.5 | 0.0 | 0.0 | 141.7 | 0.0 | 0.0 | 192.4 | 0.0 | 0.0 | 138.0 |
LM-Nav | 15.4 | 13.7 | 123.1 | 18.1 | 14.1 | 124.3 | 33.3 | 28.1 | 114.2 | 23.6 | 19.2 | 119.4 |
CityNav (Ours) | 25.0 | 21.3 | 74.7 | 27.8 | 23.3 | 93.4 | 33.3 | 26.3 | 121.5 | 28.3 | 23.5 | 95.1 |
Our CityNav method significantly outperforms previous state-of-the-art methods on all metrics. Specifically, CityNav improves SR by approximately 28.3% and 4.7% over VELMA and LM-Nav, respectively. This demonstrates the critical importance and navigation efficiency of semantic map-based waypoint planning for continuous outdoor navigation. Furthermore, in terms of SPL, our approach achieves improvements of 23.5% and 2.5% compared to VELMA and LM-Nav, respectively, indicating that CityNav predicts navigation paths that are closer to the ground truth. The lowest navigation error shows that even in failed cases, our method still stops relatively close to the target. For easy and normal tasks, our method consistently surpasses the best baseline by at least 5% in SR and SPL. For hard tasks, our method still performs comparably to the best baseline.
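For reference, SR, SPL, and NE follow the standard embodied-navigation formulation (a commonly used definition; the benchmark's exact success radius is not restated here):

$$\mathrm{SR} = \frac{1}{N}\sum_{i=1}^{N} S_i, \qquad \mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i\,\frac{\ell_i}{\max(p_i,\ \ell_i)}, \qquad \mathrm{NE} = \frac{1}{N}\sum_{i=1}^{N} \bigl\lVert \mathbf{x}^{\mathrm{stop}}_i - \mathbf{x}^{\mathrm{goal}}_i \bigr\rVert_2,$$

where $S_i \in \{0,1\}$ indicates whether episode $i$ succeeds, $\ell_i$ is the shortest-path distance to the goal, $p_i$ is the length of the path actually traveled, and $\mathbf{x}^{\mathrm{stop}}_i$, $\mathbf{x}^{\mathrm{goal}}_i$ are the agent's stop position and the goal position.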
Case Study: The agent is spawned at a random location with a navigation instruction. The agent has to explore the ordered landmarks in the instruction based on its visual observations. Thanks to its reasoning capabilities, the agent infers objects in its field of view (FOV) that are semantically related to landmarks, even when those landmarks are not visually observed. In the given example, the agent tries to find the obelisk in the park, which is currently invisible. Hinted by the instruction that the obelisk is in the park, CityNav reasons that trees probably appear in the park and decides to explore the areas near the trees, while LM-Nav gradually gets lost due to its lack of exploration ability.
In our work, we approach the problem of zero-shot vision-language navigation by proposing an embodied aerial agent, CityNav, which leverages the pre-trained knowledge in large models in three-dimensional urban spaces. The agent is composed of two modules: the semantic map-based exploration module and the memory graph-based exploitation module. The former enables the agent to explore unseen targets or landmarks with the object reasoning ability of LLM. The latter exploits historical experience to reduce the risk of long-distance exploration and improve navigation stability. The experimental results illustrate the efficacy and robustness of our method from different perspectives.
Baining Zhao, Chen Gao, Zile Zhou, Yanggang Xu, Weichen Zhang,
Qian Zhang, Susu Xu, Jinqiang Cui, Xinlei Chen, Yong Li
To address the challenges mentioned above, we propose an embodied agent named DeliverGPT. We first construct the agent's comprehensive perception of the environment. We then develop spatial planning capabilities for the agent and convert the planning results into drone motion. Additionally, we design a memory module to assist the agent in spatial perception and planning for similar delivery tasks.
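To illustrate the kind of data structure the planning module can operate on, the sketch below shows a minimal semantic graph built with networkx; the class and method names are illustrative, not DeliverGPT's actual implementation. Nodes carry landmark positions and descriptions, and coarse delivery routes are planned over distance-weighted edges:

```python
import networkx as nx
import numpy as np

class SemanticGraph:
    """Illustrative semantic graph for spatial understanding: nodes are
    perceived landmarks/regions with 3D positions and text labels, edges
    encode spatial reachability."""

    def __init__(self):
        self.graph = nx.Graph()

    def add_landmark(self, name: str, position, description: str):
        self.graph.add_node(name, position=position, description=description)

    def connect(self, a: str, b: str):
        # Edge weight = Euclidean distance, used later for route planning.
        pa = np.asarray(self.graph.nodes[a]["position"], dtype=float)
        pb = np.asarray(self.graph.nodes[b]["position"], dtype=float)
        self.graph.add_edge(a, b, weight=float(np.linalg.norm(pa - pb)))

    def plan_route(self, start: str, goal: str):
        # Shortest path over the semantic graph as a coarse delivery route.
        return nx.shortest_path(self.graph, start, goal, weight="weight")
```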
Method | SR/% ↑ (Easy) | SPL/% ↑ (Easy) | DTG/m ↓ (Easy) | SR/% ↑ (Normal) | SPL/% ↑ (Normal) | DTG/m ↓ (Normal) | SR/% ↑ (Hard) | SPL/% ↑ (Hard) | DTG/m ↓ (Hard) |
---|---|---|---|---|---|---|---|---|---|
Random | 0 | 0 | 80.2 | 0 | 0 | 146.1 | 0 | 0 | 227.9 |
Action Sampling | 0.2 | 0.1 | 209.7 | 0.1 | 0 | 305.5 | 0 | 0 | 389.4 |
AG-GPT4 | 5.5 | 3.2 | 112.1 | 2.6 | 1.7 | 180.4 | 1.4 | 0.8 | 297.1 |
NavGPT | 17.8 | 10.6 | 57.4 | 12.3 | 7.4 | 109.9 | 7.0 | 5.8 | 231.5 |
CoW | 9.0 | 5.9 | 72.6 | 5.8 | 3.1 | 132.8 | 3.5 | 2.4 | 207.7 |
SayNav | 25.9 | 19.7 | 55.0 | 19.7 | 15.3 | 90.2 | 10.8 | 7.4 | 183.5 |
DeliverGPT | 41.4 | 33.7 | 45.2 | 35.9 | 30.6 | 72.3 | 23.1 | 16.8 | 130.3 |
All delivery cases were categorized into three difficulty levels: easy, normal, and hard, corresponding to navigation distances of 0-100 m, 100-200 m, and >200 m, respectively. The results lead us to the following observations. DeliverGPT outperforms the other methods across all difficulty scenarios, especially in the hard scenario, where it achieves SR and SPL scores more than twice those of the other methods. Additionally, its DTG is significantly lower than that of the other methods, indicating that the drone's final position is close to the goal location. This highlights the importance of the agent's spatial perception and spatial planning abilities in outdoor urban conditions, capabilities that matter even more here than in the indoor settings targeted by comparable embodied methods.
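A minimal sketch of the evaluation bookkeeping implied by this protocol is given below; the distance thresholds follow the text, while the success radius and the episode field names are assumptions for illustration:

```python
import numpy as np

def difficulty(distance_m: float) -> str:
    # Difficulty buckets as described in the text.
    if distance_m <= 100:
        return "easy"
    if distance_m <= 200:
        return "normal"
    return "hard"

def summarize(episodes, success_radius_m=20.0):
    """Compute SR, SPL, and DTG per difficulty bucket.
    Each episode is a dict with shortest_path, traveled_path, and
    final_dist_to_goal (all in meters). The success radius is an
    assumption, not the benchmark's exact value."""
    buckets = {}
    for ep in episodes:
        buckets.setdefault(difficulty(ep["shortest_path"]), []).append(ep)
    summary = {}
    for level, eps in buckets.items():
        succ = [ep["final_dist_to_goal"] <= success_radius_m for ep in eps]
        spl = [s * ep["shortest_path"] / max(ep["traveled_path"], ep["shortest_path"])
               for s, ep in zip(succ, eps)]
        summary[level] = {
            "SR": 100 * float(np.mean(succ)),
            "SPL": 100 * float(np.mean(spl)),
            "DTG": float(np.mean([ep["final_dist_to_goal"] for ep in eps])),
        }
    return summary
```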
We introduce the location-goal embodied task for drone delivery and construct a benchmark including the simulator, dataset, and platform. Furthermore, we propose a large pre-trained model-empowered agent that builds its spatial perception and planning capabilities around a semantic graph tailored for large-scale urban environments. A long-term memory mechanism simulates the human skill-acquisition process, emphasizing that practice leads to proficiency. The experimental results illustrate the effectiveness of our simulator and method from different perspectives.