Large Language Model Integration for Achieving Dynamic Drone Flight
Our budding Defence Engineers, Serene Zhang, Zeaus Koh, and Benedict Lee, developed a novel Multi-Agent Adaptive Retrieval Augmented Generation (RAG) system that integrates Large Language Models (LLMs) to let drone operators, regardless of coding expertise, use simple natural language commands to achieve dynamic manoeuvres. Their system was pitted against, and outperformed, the widely used LLM GPT-4–1106 in a series of dynamic drone flight challenges, achieving a 15-fold improvement in task success. The team incorporated compatibility with AirSim and DroneKit to pave the way for field validation in defence and security applications. The team was mentored by DSTA Engineers Lim Gang Le, Perrie Lim, Jeremy Wong, and David Wong.
Motivation
The effectiveness of current teleoperated drones relies either on the proficiency of human pilots or the drone’s capacity to carry out a limited set of automated tasks designated by humans. This constraint restricts the deployment of autonomous drones to non-critical missions and diminishes their utility in challenging and contested environments like disaster zones or conflict areas. In domains such as drone-based search and rescue or security missions, significant challenges persist in deploying drones capable of adapting to dynamically complex situations with minimal or no human intervention [1]. In particular, there is currently no dependable way to program drones to execute unforeseen critical tasks in such environments without prior training. This gap motivates our exploration of robotic control and recent advances in Large Language Models (LLMs) in this project.
Although LLMs have recently found diverse applications, their limitations are becoming apparent. These models are inherently generalists and struggle with accuracy in specialized fields [2]. They often provide outdated information, and even those claiming to offer up-to-date data, such as Bard AI, face challenges related to relevance and reliability [3]. This is due to imbalances in online information, where reliable, specialized data is overshadowed by erroneous responses from internet users. In our project, we tackle these issues by introducing a novel, context-specific, adaptive Retrieval Augmented Generation (RAG) system. Through self-training and other carefully curated modifications, our approach aims to enhance adaptability, accuracy, user-friendliness, and speed, while also being mindful of storage efficiency for field deployment.
Our hypothesis posits that our multi-agent adaptive RAG model will surpass existing LLMs like ChatGPT in providing reliable responses to dynamic scenarios using simple natural language prompts. This addresses the need for a more precise and context-relevant solution. To substantiate our claims and objectively evaluate our system, we conduct a direct comparison with a very capable LLM, GPT-4–1106, to establish a proof of concept through empirical benchmarking.
Under the Hood
Our system consists of one main segment and four assisting segments: the adaptive RAG system, two error-correcting modules, an inference module, and a simulation segment.
It is important to note that our system (Fig 1) is modular in nature: individual modules can be swapped out, for example replacing internet search with an intranet equivalent for more secure work. We use SerpAPI to query Google because testing over the internet follows the same logical pathway as an intranet deployment, while allowing these research results to remain unclassified. The base LLM used and tested in this system is GPT-4–1106. It, too, is modular and can easily be swapped for other available LLMs, including local LLMs if data sensitivity is a concern; as better LLMs emerge, the pipeline improves with them, extending its period of relevance. Because we benchmark against the same base LLM (GPT-4–1106), any improvement can be attributed to the pipeline itself and measured as a difference in percentage success rates.
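As a rough illustration of this modularity, the sketch below shows one way the base LLM could sit behind a swappable interface. The class names, the Pipeline example in the closing comment, and the choice of the OpenAI Python client are our own assumptions for illustration, not the project’s actual code.

```python
# Minimal sketch of a swappable LLM backend (illustrative names only).
# Any client exposing complete() can be dropped in, e.g. an OpenAI-hosted
# model or a local model served behind the same interface.
from abc import ABC, abstractmethod


class LLMBackend(ABC):
    @abstractmethod
    def complete(self, system_prompt: str, user_prompt: str) -> str:
        ...


class OpenAIBackend(LLMBackend):
    def __init__(self, model: str = "gpt-4-1106-preview"):
        from openai import OpenAI  # assumes the openai Python package
        self.client = OpenAI()
        self.model = model

    def complete(self, system_prompt: str, user_prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
        )
        return resp.choices[0].message.content


# Swapping the base LLM then becomes a one-line change wherever the pipeline
# is instantiated, e.g. pipeline = Pipeline(llm=OpenAIBackend())  (hypothetical).
```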
Including “Common Sense Code”
To enhance user-friendliness and tackle the high-specificity instructions required for LLM-produced code [4], this system employs a two-part solution. First, an LLM agent interprets the input into a more precise prompt for the coding LLM. Second, we integrate hidden prompts for common drone operations such as crash avoidance, ensuring essential functions are included even if the user omits them. Conflicts between the two are avoided by positioning the hidden prompt so that disproportionate weight is given to the user’s input.
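A minimal sketch of this two-part prompt handling is shown below, reusing the illustrative complete() interface from the earlier sketch. The hidden safety prompt, the system messages, and the placement strategy are assumptions for illustration only, not the system’s actual prompts.

```python
# Minimal sketch of the two-part prompt handling described above
# (illustrative only; the actual system messages are not reproduced here).
HIDDEN_SAFETY_PROMPT = (
    "Always arm the drone and confirm takeoff before any manoeuvre, "
    "and include basic crash-avoidance checks."
)

def build_coding_prompt(user_command: str, interpreter_llm) -> str:
    # Step 1: an interpreter agent rewrites the loose natural-language
    # command into a precise specification for the coding LLM.
    precise_spec = interpreter_llm.complete(
        system_prompt="Rewrite the drone command as a precise specification.",
        user_prompt=user_command,
    )
    # Step 2: attach the hidden prompt in a position that leaves the user's
    # specification dominant, so the user's intent wins if the two conflict
    # (one possible placement, assumed here).
    return f"{precise_spec}\n\nAdditional requirements: {HIDDEN_SAFETY_PROMPT}"
```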
Adaptive RAG System
Our adaptive RAG system (Fig 2) is tailored to address the relevance issue in traditional RAG databases [5]. If a data set is prepopulated with an extensive set of files, the abundance of irrelevant files reduces response accuracy [6]; if it is prepopulated with only a specially curated set of files, its adaptability becomes limited and its performance drops on anything outside its designed scope. Our model gets the best of both worlds: it starts with a high-quality core dataset and actively downloads relevant GitHub repositories based on user queries to add to it.
First, an LLM agent takes the processed user input and outputs a string of repositories relevant to the query, which is then split into individual queries sent to the Google search API SerpAPI [7]. The algorithm then obtains the GitHub repositories appearing in the top 10 search results and downloads them. Two filters are applied: the first checks for duplicates among cloned repositories to avoid storage issues, and the second checks whether a repository with a similar function, such as infrared sensor integration, already exists in the database, using a cosine similarity test between the candidate repository’s README and the descriptions of repositories in our database. A repository is cloned only if it passes both filters.
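The sketch below illustrates this retrieval-and-filtering step under stated assumptions: the embed() helper, the similarity threshold, and the exact SerpAPI parameters are placeholders rather than the system’s actual values.

```python
# Minimal sketch of the repository-retrieval step (API key, embed() helper,
# and threshold are illustrative assumptions, not the exact implementation).
import numpy as np
from serpapi import GoogleSearch  # google-search-results package [7]

def github_links(query: str, api_key: str) -> list[str]:
    # Query Google through SerpAPI and keep GitHub repositories
    # appearing in the top 10 organic results.
    results = GoogleSearch({"q": query, "api_key": api_key, "num": 10}).get_dict()
    return [r["link"] for r in results.get("organic_results", [])
            if "github.com" in r.get("link", "")]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def should_clone(url: str, readme_text: str, cloned_urls: set[str],
                 existing_descriptions: list[str], embed,
                 threshold: float = 0.85) -> bool:
    # Filter 1: skip repositories that are already in the database.
    if url in cloned_urls:
        return False
    # Filter 2: skip repositories whose README is too similar to the
    # description of one already stored (e.g. another infrared-sensor repo).
    readme_vec = embed(readme_text)
    return all(cosine(readme_vec, embed(d)) < threshold
               for d in existing_descriptions)
```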
From here, the model works like a traditional RAG model [8], using OpenAI’s text embedding model for vector database embedding. AutoLLM [9] was piloted as the core RAG model because it lets us easily optimise the model’s parameters for this coding use case.
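For illustration, the sketch below shows one possible definition of the embed() helper assumed above, together with a naive top-k retrieval step. It is a generic RAG illustration, not AutoLLM’s internals, and the specific embedding model name is an assumption.

```python
# Generic sketch of the retrieval step in a traditional RAG flow
# (not AutoLLM's internals; the embedding model name is an assumption).
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    # OpenAI text embedding used to populate and query the vector database.
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def retrieve(query: str, chunks: list[str], k: int = 4) -> list[str]:
    # Rank code/documentation chunks by cosine similarity to the query and
    # return the top-k as additional context for the coding LLM.
    q = embed(query)
    def score(chunk: str) -> float:
        v = embed(chunk)
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(chunks, key=score, reverse=True)[:k]
```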
Error Correction
Error correction is critical for code generated by LLMs. We mitigate errors through a two-tiered approach: one LLM agent corrects syntax errors, while another comments on logical issues, ensuring the output is both syntactically sound and logically consistent. If the first agent’s output is already accurate, it is used as the final code. Otherwise, the second agent (the commenter) provides comments in the form of an appended prompt fed back into the RAG to refine the script. This is designed to reduce logical discontinuities within the code.
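A minimal sketch of this two-tiered loop is given below. The agent prompts, the “OK” convention, and the round limit are illustrative assumptions rather than the system’s actual messages.

```python
# Minimal sketch of the two-tiered error-correction loop (agent prompts and
# the max_rounds limit are illustrative assumptions).
def correct_code(draft_code: str, syntax_agent, commenter_agent, rag_generate,
                 max_rounds: int = 3) -> str:
    code = draft_code
    for _ in range(max_rounds):
        # Tier 1: a syntax agent repairs syntax errors in the script.
        code = syntax_agent.complete(
            system_prompt="Fix any syntax errors in this drone script.",
            user_prompt=code,
        )
        # Tier 2: a commenter agent reviews the logic; if it finds no issues,
        # the current script is accepted as the final code.
        comments = commenter_agent.complete(
            system_prompt="List logical issues in this drone script, or reply OK.",
            user_prompt=code,
        )
        if comments.strip().upper() == "OK":
            return code
        # Otherwise the comments are appended as a prompt and fed back into
        # the RAG pipeline to regenerate a refined script.
        code = rag_generate(f"{code}\n\n# Reviewer comments:\n{comments}")
    return code
```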
Simulation
For simulation compatibility, our system works with both AirSim and DroneKit code, and features a converter between the two for seamless integration. Each specialised LLM agent within our network is fine-tuned through prompt engineering, using unique system messages and hidden prompts to guide operations effectively.
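To show the kind of gap such a converter bridges, the snippets below express the same takeoff in the AirSim and DroneKit Python APIs; the connection string and altitude are example values, not the project’s configuration.

```python
# Illustrative comparison of one takeoff in the two APIs (example values).
import airsim

client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()        # AirSim: asynchronous call, joined here

# Equivalent intent expressed in DroneKit:
from dronekit import connect, VehicleMode

vehicle = connect("udp:127.0.0.1:14550", wait_ready=True)
vehicle.mode = VehicleMode("GUIDED")
vehicle.armed = True
vehicle.simple_takeoff(10)          # DroneKit: climb to 10 m
```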
Testing & Results
We focused our tests primarily on the AirSim simulator, which is widely recognised for its realistic environments and drone trust tests [10].
However, as LLM integration with unmanned aerial vehicles (UAVs) is an emerging domain, there are no widely accepted benchmarking frameworks for evaluating such a pipeline. To generate a proof of concept, we developed a testing framework that demonstrates its reliability and evaluates its task accuracy while closely mirroring real UAV mission requirements. The framework is organised into three stages: first, a set of nine tests of basic UAV capabilities (covering basic manoeuvres such as determining local North, takeoff and landing); second, a set of 11 tests involving operational requirements within contested environments (which require computation and path planning, such as flying in an equilateral triangle formation); and third, a set of five advanced tests involving technical requirements for challenging operations (specifically obstacle detection and avoidance). The first and second stages are aligned with the Singapore Unmanned Aircraft Pilot Licence practical test requirements [11]. For the third stage, we manually placed blocks directly in the drone’s flight path, requiring obstacle detection and avoidance.
During testing, we benchmarked our system against GPT-4–1106, a leading LLM exhibiting near-general intelligence, surpassing prior models in difficult tasks and showcasing near-human-level performance [12]. As GPT-4–1106 is a part of our pipeline, this comparison is essential in testing our hypothesis that an adaptive RAG system and multiple integrated agents bring significant improvement in performance. All tests were designed to be adaptive, meaning our system was not pre-trained or fine-tuned.
For each test, we crafted natural language commands that specified parameters such as the altitude and speed of the drone. We repeated each test three times for both set-ups. Results are classified as “Complete Success”, “Partial Success”, or “Fail”, and are analysed graphically for clear comparison. The list of test objectives and accompanying natural language commands can be found in Appendix 1.
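A minimal sketch of this scoring protocol is shown below; run_mission() is a placeholder standing in for a full pipeline-plus-AirSim run and is not part of the actual test code.

```python
# Minimal sketch of the scoring loop (run_mission() and the test list are
# illustrative placeholders for our pipeline and the AirSim runs).
from collections import Counter

def evaluate(tests: list[str], run_mission, repeats: int = 3) -> Counter:
    # Each natural-language command is run three times; every run is graded
    # "complete", "partial" or "fail" based on the flight observed in AirSim.
    tally = Counter()
    for command in tests:
        for _ in range(repeats):
            tally[run_mission(command)] += 1
    return tally
```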
Results
In this section, we cover the results of applying the testing framework to both our system and GPT-4–1106. We converted each set of results into percentages on a stacked bar chart, with categories for Complete Success, Partial Success and Fail. Complete Success is recorded when the drone in AirSim finishes the assigned mission correctly. Partial Success is recorded when the drone in AirSim completes part of the assigned mission but deviates or stops. Fail is recorded when the drone in AirSim either does not take off, or takes off but does not complete any part of the assigned mission.
Fig 3 represents the results by Stages of our testing. Stage 1, or basic capabilities, consisted of 9 tests run thrice each, separately for test and control. As shown, across 27 tests of basic capabilities, our pipeline partially succeeds approximately 56% of the time and completely succeeds 44% of the time. In comparison, GPT-4–1106 fails 22% of the time, partially succeeds 67% of the time and completely succeeds 11% of the time.
Stage 2, or operational requirements, consisted of 11 tests run thrice each, separately for test and control. Across 33 tests, our pipeline fails approximately 18% of the time, partially succeeds 9% of the time, and completely succeeds 73% of the time. In contrast, GPT-4–1106 fails 91% of the time and only partially succeeds 9% of the time.
Stage 3, advanced technical requirements, consisted of 5 tests run thrice each, separately for test and control. As depicted, across 15 tests, our system partially succeeds 40% of the time and completely succeeds 60% of the time. In contrast, the drone following the code generated by GPT-4–1106 failed in completing the mission for all 15 runs.
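For readers who wish to reproduce a chart in the style of Fig 3 from the stage-wise percentages above, a minimal plotting sketch might look like the following; the styling and labels are illustrative.

```python
# Sketch of a stacked bar chart built from the reported stage-wise percentages.
import numpy as np
import matplotlib.pyplot as plt

labels = ["Ours S1", "GPT-4 S1", "Ours S2", "GPT-4 S2", "Ours S3", "GPT-4 S3"]
complete = np.array([44, 11, 73, 0, 60, 0])
partial = np.array([56, 67, 9, 9, 40, 0])
fail = np.array([0, 22, 18, 91, 0, 100])

x = np.arange(len(labels))
plt.bar(x, complete, label="Complete Success")
plt.bar(x, partial, bottom=complete, label="Partial Success")
plt.bar(x, fail, bottom=complete + partial, label="Fail")
plt.xticks(x, labels, rotation=45, ha="right")
plt.ylabel("Percentage of runs")
plt.legend()
plt.tight_layout()
plt.show()
```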
Fig 4 displays the total successes, partial successes and failures across all stages of testing. It helps us visualise the overall performance differences between our pipeline and GPT-4–1106. Notably, our pipeline completely succeeded 60% of the time, whereas GPT-4–1106 only completely succeeded 4% of the time. Our pipeline partially succeeded 32% of the time, while GPT-4–1106 partially succeeded 28% of the time.
The proof is in the pudding — a sample of comparative test trajectories
An example of the multirotor flight path for both test and control set-ups for Test 15 (flying in a circle). Our pipeline completely succeeded while GPT-4–1106 failed.
Another example of the flight path for the multirotor operating under GPT-4–1106 code versus our pipeline’s code in Test 17 (flying in a triangle). Our pipeline completely succeeded while GPT-4–1106 had a partial success.
Another example of the flight path for the multirotor operating under GPT-4–1106 code versus our pipeline’s code in Test 19 (flying in a figure of 8). Our pipeline completely succeeded while GPT-4–1106 failed.
Discussion
Our testing framework categorised UAV requirements, which serve as success metrics for drone operations, into three stages: basic manoeuvrability, operational requirements, and advanced technical capabilities. Across these three stages, our pipeline significantly outperformed GPT-4–1106: in terms of complete success, our system is 15 times as effective. This is a strong indicator of the accuracy of the code our pipeline generates compared with a powerful LLM like GPT-4–1106, supporting our hypothesis that the RAG system and multi-agent setup are highly effective.
In Stage 1, our pipeline had more partial successes than complete successes; in tests such as flying L-shape and U-shape formations, the drone did not fully complete the formation and deviated off course. As our pipeline allows administrative control over the vector database, some of these errors can be eliminated by prepopulating the database with repositories for manoeuvring a drone through basic formations. Nevertheless, it still performed better than GPT-4–1106, which failed 22% of the time compared with 0% for our pipeline. The code generated by GPT-4–1106 lacked the preflight commands needed to arm the drone, a critical flaw given that Stage 1 consists only of basic capabilities a UAV needs in order to fly.
In Stage 2, which focuses on operational requirements such as path planning, our pipeline outperformed GPT-4–1106 by a large margin, reducing the failure rate by 73 percentage points. Further, our pipeline successfully completed its assigned task in 73% of all tests, another positive sign that it can generate accurate code for operational requirements. For instance, Test 17 commanded the drone to fly an equilateral triangle with sides of 5 metres; our system accurately calculated the turning angles and completed a perfect triangle. The flight path can be found in Appendix 2. Applied to real-world environments, this means non-technical drone operators can use simple natural language prompts to direct drones accurately along whichever path they specify.
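As an illustration of the geometry involved in Test 17, the sketch below computes the triangle waypoints and flies them in AirSim. The speed and altitude are example values, and the code is a simplified stand-in for the pipeline’s generated script, not the script itself.

```python
# Sketch of Test 17's geometry: fly an equilateral triangle with 5 m sides.
# Speed and altitude are example values; AirSim uses NED, so up is negative z.
import math
import airsim

SIDE, SPEED, ALT = 5.0, 2.0, -10.0

client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()
client.moveToZAsync(ALT, SPEED).join()

x, y, heading = 0.0, 0.0, 0.0
for _ in range(3):
    # Advance one 5 m side, then turn by the 120-degree exterior angle.
    x += SIDE * math.cos(heading)
    y += SIDE * math.sin(heading)
    client.moveToPositionAsync(x, y, ALT, SPEED).join()
    heading += math.radians(120)
```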
In Stage 3, which involved more advanced technical skills such as obstacle avoidance and complex navigation, our pipeline again fared much better: GPT-4–1106 failed every single run, while our system completely succeeded in 9 out of 15 runs with a 0% failure rate. The system intelligently searched for relevant repositories, such as an implementation of the Bug2 algorithm, an efficient path-planning and obstacle-avoidance method [13], to use as additional context for generating drone code, heightening its success rate. This demonstrates that an adaptive system combined with a multi-agent setup is effective in enhancing code accuracy for complex tasks.
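For context, a high-level sketch of the Bug2 logic is shown below. The sensing and motion helpers are placeholders, and this is a simplified illustration of the algorithm rather than the retrieved repository’s implementation [13].

```python
# High-level sketch of Bug2: follow the straight "m-line" from start to goal,
# wall-follow around any obstacle, and rejoin the m-line only at a point
# closer to the goal. obstacle_ahead(), step_towards(), follow_boundary_step()
# and position() are placeholders for sensor reads and motion commands.
import math

def bug2(start, goal, obstacle_ahead, step_towards, follow_boundary_step,
         position, tol=0.5):
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def on_m_line(p):
        # Perpendicular distance from p to the start-goal line (cross product).
        (sx, sy), (gx, gy), (px, py) = start, goal, p
        d = abs((gx - sx) * (sy - py) - (gy - sy) * (sx - px)) / dist(start, goal)
        return d < tol

    while dist(position(), goal) > tol:
        if not obstacle_ahead():
            step_towards(goal)               # motion-to-goal along the m-line
        else:
            hit = position()                 # remember where the obstacle was hit
            while True:
                follow_boundary_step()       # wall-follow around the obstacle
                p = position()
                if on_m_line(p) and dist(p, goal) < dist(hit, goal):
                    break                    # rejoin the m-line closer to the goal
```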
Future Work
Our algorithm is at a Technology Readiness Level (TRL) [14] of 4. To push this technology further towards real-world use, physical rather than virtual testing needs to be conducted. Such testing was intended but could not be carried out within the timeline of this report; experimental verification to bring our algorithm to TRL 5–7 is therefore left as future work. A user interface also needs to be developed, with a checkpoint after the simulation stage at which the user can verify the code’s accuracy and functionality against the desired flight plan before it is uploaded to the drone. Most importantly, the code processing duration needs to be shortened. Our current pipeline takes approximately 30 minutes per run. Although this is shorter than the time generally taken to program and deploy a semi-autonomous drone for flight, and hence already viable, improving the time efficiency of our algorithm would future-proof it as drone systems trend towards ever shorter set-up durations.
Game for a similar challenge? Step into your future
Excited by what you’ve read? There could be a thinker and tinkerer in you that seeks a greater challenge. Learning never stops — chart your next adventure, and push the envelope in defence tech with us through the Young Defence Scientist Programme.
Slide into our DMs here to fuel your passion for science and technology and be mentored by Singapore’s top engineers and developers.
Appendix — Test Cases
References
[1] K. Okada and E. Hayakawa, “Flow-based ROS2 programming environment for control drone,” in Communications in Computer and Information Science, Cham: Springer International Publishing, 2020, pp. 449–453 (accessed Jan. 1, 2024).
[2] M. Bilan, “Hallucinations in LLMS: What you need to know before integration,” Master of Code Global, https://masterofcode.com/blog/hallucinations-in-llms-what-you-need-to-know-before-integration (accessed Jan. 1, 2024).
[3] T. Lacoma and S. Winkelman, “Google Bard explained: What this AI-powered ChatGPT competitor can do,” Android Police, 08-Feb-2023. [Online]. Available: https://www.androidpolice.com/google-bard-explained/. (accessed Jan. 1, 2024).
[4] R. Carter, “How to talk to an LLM: Prompt engineering for beginners,” UC Today, 24-Nov-2023. [Online]. Available: https://www.uctoday.com/unified-communications/how-to-talk-to-an-llm-llm-prompt-engineering-for-beginners/. (accessed: Jan. 1, 2024).
[5] “lancedb/README.md at main · lancedb/lancedb,” GitHub. https://github.com/lancedb/lancedb/blob/main/README.md (accessed Jan. 1, 2024).
[6] “Retrieval Augmented Generation (RAG),” Cohere AI. [Online]. Available: https://docs.cohere.com/docs/retrieval-augmented-generation-rag. (accessed Jan. 1, 2024).
[7] “Google search API,” SerpApi, https://serpapi.com/ (accessed Jan. 1, 2024).
[8] “What is retrieval-augmented generation (RAG)?,” Oracle.com, 19-Sep-2023. [Online]. Available: https://www.oracle.com/sg/artificial-intelligence/generative-ai/retrieval-augmented-generation-rag/. (accessed Jan. 1, 2024).
[9] Safevideo, “Safevideo/autollm: Ship rag based LLM Web Apps in seconds.,” GitHub, https://github.com/safevideo/autollm (accessed Jan. 1, 2024).
[10] “Google Search Results in Python,” GitHub, Jan. 02, 2024. https://github.com/serpapi/google-search-results-python (accessed Jan. 2, 2024).
[11] “UA Pilot Licence,” CAAS — CWP. https://www.caas.gov.sg/public-passengers/unmanned-aircraft/ua-regulatory-requirements/ua-pilot-licence (accessed Jan. 1, 2024).
[12] S. Bubeck et al., “Sparks of Artificial General Intelligence: Early experiments with GPT-4,” arXiv [cs.CL], 2023 (accessed Jan. 1, 2024).
[13] MehdiShahbazi, “Mehdishahbazi/AirSim-multirotor-bug2-algorithm: Python implementation of Bug2 algorithm to navigate a quadcopter/multirotor in the AirSim simulator.,” GitHub, https://github.com/MehdiShahbazi/AirSim-Multirotor-Bug2-Algorithm (accessed Jan. 2, 2024).
[14] “What are Technology Readiness Levels (TRL)?,” Twi-global.com. [Online]. Available: https://www.twi-global.com/technical-knowledge/faqs/technology-readiness-levels. (accessed Jan. 1, 2024).