
Behavioral decision learning reward mechanism of unmanned swarm system

ZHANG Tingting, LAN Yushi, SONG Aiguo

Citation: ZHANG Tingting, LAN Yushi, SONG Aiguo, et al. Behavioral decision learning reward mechanism of unmanned swarm system[J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(12): 2442-2451. doi: 10.13700/j.bh.1001-5965.2020.0600 (in Chinese)

doi: 10.13700/j.bh.1001-5965.2020.0600

    Corresponding author:

    ZHANG Tingting. E-mail: 101101964@seu.edu.cn

  • CLC number: TP181


Funds: 

National Natural Science Foundation of China 61802428

China Postdoctoral Science Foundation 2019M651991

National Defense Science and Technology Fund of the Science and Technology Commission of the Central Military Commission 2019-JCJQJJ-014

  • Abstract:

    The future of combat lies in unmanned swarm systems, built as multi-agent systems, accomplishing missions through autonomous coordination among agents. Because each agent takes actions and changes state autonomously, training swarm behavior policies is unstable. This work strengthens the immediacy of the reward signal through prior constraints and the homogeneity of the agents, improving training efficiency and learning stability. It adopts penalties for collisions with action-space boundaries and rewards for the degree to which spatiotemporal distance constraints between agents are satisfied; exploiting the relational structure of agents within the swarm, it adds experience sharing among agents to further improve learning efficiency. In experiments, the prior-enhanced reward mechanism and experience sharing are applied to the multi-agent deep deterministic policy gradient (MADDPG) algorithm to verify their effectiveness. The results show that learning convergence and stability improve substantially, raising the behavior learning efficiency of the unmanned swarm system.
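
    As a concrete reading of the reward design described above, the following Python sketch combines an action-space boundary-collision penalty with a graded reward for satisfying inter-agent distance constraints. All names and magnitudes here (pos, bounds, d_min, d_max, the -10/+1 scales) are illustrative assumptions, not values from the paper:

```python
import numpy as np

def prior_reward(pos, others, bounds=1.0, d_min=0.1, d_max=0.5):
    """Prior-enhanced shaping reward (illustrative sketch): penalize
    leaving the action-space boundary, reward satisfied inter-agent
    spatiotemporal distance constraints."""
    r = 0.0
    # Boundary-collision penalty: the agent strayed outside the arena.
    if np.any(np.abs(pos) > bounds):
        r -= 10.0
    # Distance-constraint satisfaction reward, graded by violation size.
    for q in others:
        d = np.linalg.norm(pos - q)
        if d_min <= d <= d_max:
            r += 1.0
        else:
            r -= min(abs(d - d_min), abs(d - d_max))
    return r
```

    Because both terms are computable at every step from the agents' positions alone, this kind of shaping gives a dense, real-time signal instead of a sparse terminal reward.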

     

  • Figure 1.  Geometric predation-escape model

    Figure 2.  Training execution of MADDPG algorithm

    Figure 3.  Training framework of MADDPG algorithm (a minimal update sketch follows this list)

    Figure 4.  Curves of Predator 1 reward function

    Figure 5.  Curves of Predator 2 reward function

    Figure 6.  Curves of Predator 3 reward function

    Figure 7.  Curves of escaper reward function

    Figure 8.  Sum of reward function curves

    Figure 9.  Reward function convergence comparison among MADDPG, PD-MADDPG and PES-MADDPG algorithms

    Figure 10.  Tracks of both parties under confrontation mission

    Figure 11.  Predator UAV tracks

    Figure 12.  Escaper UAV tracks

    Figure 13.  Result of agent 3V1 roundup

    Figure 14.  Result of agent 20V6 roundup
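
    Figures 2 and 3 depict MADDPG's centralized-training, decentralized-execution scheme: each agent's critic is trained on the joint observations and actions of all agents, while each actor acts on local observations only. The sketch below shows one such centralized critic update in PyTorch; all names, shapes, and hyperparameters (e.g. gamma=0.95) are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def maddpg_critic_update(critic_i, target_critic_i, target_actors,
                         optimizer, batch, gamma=0.95):
    """One centralized critic update for agent i (illustrative sketch).

    batch holds per-agent lists of tensors: observations, actions,
    agent i's rewards, next observations, and done flags."""
    obs, acts, rew_i, next_obs, done = batch
    with torch.no_grad():
        # Target actors pick next actions from their own observations only.
        next_acts = [pi(o) for pi, o in zip(target_actors, next_obs)]
        # The centralized target Q sees everyone's observations and actions.
        target_q = target_critic_i(torch.cat(next_obs + next_acts, dim=-1))
        y = rew_i + gamma * (1.0 - done) * target_q
    q = critic_i(torch.cat(obs + acts, dim=-1))
    loss = F.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```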

    Table 1.  Reward mechanism setting

    Agent       Collision   No collision
    Predator    +10         -1
    Escaper     -10         +1
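
    Table 1's collision payoffs translate directly into a per-step reward. A minimal sketch, where the boolean flags are assumed inputs (e.g. from the simulator's collision check):

```python
def collision_reward(is_predator: bool, collided: bool) -> float:
    """Per-step reward from Table 1: predators gain from collisions
    (captures), escapers lose from them."""
    if is_predator:
        return 10.0 if collided else -1.0
    return -10.0 if collided else 1.0
```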

    Table 2.  Experimental scenario of 3 versus 1 confrontation

    Scenario     Adversarial   Agents            Winning conditions
    Simple_tag   Yes           3 red vs 1 blue   Blue: the blue agent avoids collisions with the 3 red agents as far as possible
                                                 Red: the 3 red agents coordinate as much as possible to collide with the blue agent
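
    The Simple_tag scenario in Table 2 comes from the multi-agent particle environments. Below is a minimal rollout sketch using the PettingZoo port of that scenario; the paper's exact environment version and wrappers are assumptions here:

```python
from pettingzoo.mpe import simple_tag_v3

# 3 red predators (adversaries) vs 1 blue escaper (good agent), per Table 2.
env = simple_tag_v3.parallel_env(num_good=1, num_adversaries=3,
                                 continuous_actions=True)
observations, infos = env.reset(seed=0)

while env.agents:
    # Random placeholder actions; trained MADDPG actors would go here.
    actions = {a: env.action_space(a).sample() for a in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.close()
```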

    Table 3.  Average number of collisions per step

    Algorithm   Reward mechanism improved   Average collisions per step
    MADDPG                                  0.538
    MADDPG                                  0.521
    DDPG                                    0.532
    DDPG                                    0.523
Figures (14) / Tables (3)
Publication history
  • Received: 2020-10-23
  • Accepted: 2021-04-23
  • Published online: 2021-12-20
