Multi-Agent Arena: Insights from London Great Agent Hack 2025

People are going to use more and more AI. Acceleration is going to be the path forward for computing. These fundamental trends, I completely believe in them.

Jensen Huang. Nvidia CEO

Some days ago I had the amazing opportunity to participate in the Great Agent Hack 2025, hosted by Holistic AI at UCL[2, 3]. The hackathon was structured around three big challenges: Agent Iron Man, Agent Glass Box, and Dear Grandma, each representing a different philosophy of agentic AI. These weren’t just creative names for convenient categories; they reflected three pillars of how we think about agents today: robustness, transparency, and user safety (of anyone, including your grandma 😄). Being immersed in that environment for a weekend was a kind of reset button for me: it was energising, it reminded me why I enjoy working in this field, and it left me genuinely inspired to keep learning and building, even if there’s never enough time to explore everything that’s happening around AI.

In this hackathon, more than 50 projects were developed across three tracks. The focus of this article will be on key moments from the event and a handful of projects that stood out to me personally, while recognizing that every team contributed something valuable to the broader conversation on building robust and trustworthy agents. For readers who want to explore the full range of ideas, the complete gallery of 51 submissions is available here: https://hai-great-agent-hack-2025.devpost.com/project-gallery?page=1 [4].

Figure 1. Official leaflet and my T-shirt from The Great Agent Hack 2025. Image by the author.

Hosted by the UCL Centre for Digital Innovation (CDI), we spent the weekend in some truly unique spaces in East London, the kind of place where you walk past the Orbit Tower (the red sculpture from the 2012 Olympics) and then code under a rotating floating Earth inside the building (Figure 2). London was already covered in Christmas lights everywhere you walked, so moving between the hackathon and the city felt like stepping between a research lab and a holiday postcard.

**Figure 2.** East London views: UCL East campus and the ArcelorMittal Orbit (also called Orbit Tower) (left), and the floating Earth installation inside the UCL Centre for Digital Innovation (right). Photos by the author.

In total, the hackathon brought together more than 200 participants and roughly 25 different awards across all kinds of categories. Teams weren’t dropped in cold: before the weekend they had access to tutorials, example notebooks, and other resources that helped them prepare [5], choose a track, and hit the ground running once the clock started. As deliverables, each team was expected to submit a public GitHub repository, record a short demo, and create a poster or slide deck to present their solution to the jury, which made it much easier to understand the full workflow and real-world potential of every project.

The jury came from a surprisingly diverse mix of organisations: Holistic AI (the organiser), the UCL Centre for Digital Innovation (CDI), AWS, Valyu, NVIDIA, Entrepreneurs First, and others, including companies interested in the talent and ideas on display. They selected the winners for each of the three main tracks, but also handed out a whole constellation of mystery and special awards that celebrated much more than just the most technically advanced solution.

Among these special awards there was a Brave Soldier-style prize for the team that showed true resilience and kept going even when their teammates started disappearing, literally leaving one soldier standing; a Best Pitch award, because selling your idea is also part of getting the job done (especially since technical professionals tend to struggle a bit with this); and a Highest Resource Usage prize for the teams that really leaned into AWS and squeezed every last spark out of the cloud. These and other award categories are summarised on the hackathon website [2].

One of the most curious things about the weekend was the chance to see NVIDIA’s ultra‑compact AI supercomputer up close and even take a photo with the iconic leather‑jacket setup to recreate the famous Elon Musk × Jensen Huang “leather jacket moment” [6] shown on the big screen (Figure 3). To make it even better, some of the agents we were trying to break in the Dear Grandma challenge were actually running on similar NVIDIA GPU hardware, so this tiny supercomputer was literally the brain behind the agents that competitors were attacking.

**Figure 3.** The full NVIDIA experience: the leather-jacket photo setup with the DGX Spark (left) and a close-up of the ultra-compact DGX Spark (right). Images by the author.

The Agentic Arena

As mentioned at the beginning of this article, the heart of the weekend was structured around three tracks (Figure 4). Each one explored a different question about modern AI agents: how to build them so they work, how to make them transparent, and how to make sure they don’t go rogue.

Teams could pick whichever track best fit their use case, but in practice many projects naturally crossed track boundaries; a sign of how eager people were to learn, connect, and bring together different aspects of the agent lifecycle (yes, the idea that the more tracks you join the greater your chances of winning was floating around too, but we’ll skip that for now 😉).

**Figure 4.** The three tracks of the Great Agent Hack 2025: *Agent Iron Man* (build agents that don’t break), *Agent Glass Box* (understand agent behaviour), and *Dear Grandma* (attack like a red team, defend like a guardian). Image by Author.

Track A. Agent Iron Man: Agents that work, and last

This was the engineering reality check track. The goal was to build a high-performing, production-ready multi-agent architecture with clear agent roles, tools, and memory wired together in a way that could actually survive outside a hackathon.

Evaluation focused on things that usually only hurt you in production: performance (speed, latency, cost), robustness (how the agent handles tool failures, bad inputs, and edge cases), architecture quality (clean separation between agents, safe tool orchestration, sensible fallbacks), and monitoring (observability, structured outputs, basic health checks). Teams were also expected to account for carbon footprint by favouring smaller or cheaper models where possible and measuring energy and token usage, so the agent remains a conservative, responsible use of compute.

This track is also a small taste of what is coming as agents become more widely used and systems grow more complex, with many services talking to each other while still needing to meet tight latency and cost targets.

Between the projects, one that caught my eye was FairQuote [4]: an intelligent car‑insurance underwriting system that uses an orchestrator agent plus specialised intake, pricing, and policy agents that coordinate to collect data, assess risk, calculate premiums, and generate explainable policies in a single conversation; architecturally, it points toward the next wave of multi‑agent enterprise workflows, where robustness, clear responsibilities, and strong observability matter just as much as the underlying models.

Underwriting is a good example because it’s one of the hardest and most business-critical problems in insurance. It sits at the intersection of regulation, actuarial science, and customer experience: every decision about accepting a risk, pricing it, or applying exclusions passes through this process. When underwriting is slow or opaque, customers get frustrated, partners lose trust, and insurers risk mispriced portfolios and regulatory scrutiny. When it works well, it quietly keeps the system stable, allocating capital efficiently, protecting the balance sheet, and supporting fair pricing across segments.

So, in this track, it was great to see not only solid engineering, but also the real problems teams tackled: underwriting, end-to-end claims handling, fraud investigation, and even emergency-services dispatch, where multi-agent systems coordinated triage and decision support in real time. Even if the weekend outputs were still demos, they pointed toward the multi-agent patterns, safeguards, and monitoring that will matter as similar architectures move from hackathon tables into live enterprise environments.

Team tool choices lined up closely with the hackathon’s recommended stack: AWS AgentCore with the Strands Agents SDK for orchestration, Amazon Nova and other Bedrock-hosted models (smaller SLMs to stay frugal), and evaluation frameworks like AgentHarm [7]. The latter lets you test whether an LLM agent can correctly sequence synthetic tools such as dark-web search, web scrapers, email senders, payment or bank-transfer functions, and code or shell tools; so you can measure both its robustness to jailbreaks and how capable it remains at executing multi-step harmful workflows once safety barriers are bypassed.

Track B. Agent Glass Box: Agents you can see, and trust

The transparency track focused on making agentic systems explainable, auditable, and interpretable for humans and organisations. Teams were asked to build agents whose reasoning, memory updates, and actions could be traced and inspected in real time, instead of remaining opaque black boxes. In practice, the projects fell into several families: observability pipelines, explainability tools, governance and safety layers and expert‑discovery or traceability tools.

For me, one of the projects that best captured the idea of a “glass box” was GenAI Explainer. We all know text-to-image diffusion models can be powerful but risky: traditional diffusion systems have already been shown to reproduce societal biases [8], and even newer models like FLUX.1 can still reflect patterns in their training data [9] while offering almost no insight into why a particular image appears the way it does. At the hackathon, the GenAI Explainer team tackled this by wrapping FLUX.1 with an explainability layer that lets you see how each word or segment of a prompt influences the generated image, audit outputs for brand, legal, or safety compliance, and iteratively refine prompts while watching the impact live, with every generation step tracked. In practice, they turned diffusion from a black box into something much closer to a glass-box, auditable workflow.

In the end, Track B was a reminder that algorithmic transparency is no longer optional: legal and risk teams increasingly need to show that automated decisions are explainable and not biased, and the kind of ‘glass‑box’ thinking behind projects like GenAI Explainer is something we should carry into every agentic application we build.

In this track, team tool choices combined tracing platforms such as LangSmith or LangFuse, AWS observability services like CloudWatch, X‑Ray, or Bedrock monitoring, and research tools like AgentGraph [10] (converting traces into interactive knowledge graphs), AgentSeer [11] (building action graphs and doing failure/vulnerability analysis), and the Who_and_When failure‑attribution [12] dataset to analyse and visualise agent traces in depth, to mention just a few.

Track C. Dear Grandma: Agents that stay safe, and behave

In this track, teams were given seven secret LLM agents 🐺🦊🦅🐻🐜🐘🦎, each represented by an animal, and the mission was to break them, understand them, and identify them. These seven hidden “stealth agents” symbolised different behaviours, strengths, and attack surfaces that teams needed to uncover. The challenge was to build a red‑teaming framework that could attack any of the seven live animal‑agent endpoints using the API provided by the event organisers, backed by NVIDIA powered infrastructure.

In the hackathon, each “animal” agent was a live AI system exposed through a single API service, with different routes for each animal. Teams could send prompts to these animal‑specific routes and observe how the agents behaved in real time, each with its own personality and capabilities, which helped red‑teamers design targeted tests and attacks.

Figure 5. Example of a jailbreak test against some of the “animal” agents: in front of a DAN‑style prompt, each model responds with a playful refusal and a consistent safety message, revealing both their shared guardrails and their distinct personalities.

Track C wasn’t limited to the seven “animal” agents behind the API; attacking commercial systems like ChatGPT, Claude, or Gemini was also allowed as long as teams treated it as part of a systematic security assessment.

In this way, the solution should analyse, attack, and explain AI agent vulnerabilities, perform behavioural forensics, and understand why the attack works.

The jailbreaking lab team use a two‑step process where they first built an attack library of proven jailbreak prompts, based on techniques reported in the literature such as Base64 obfuscation, CSS/HTML injection, and other prompt‑level tricks. Second, they applied a genetic algorithm to mutate and improve these prompts: whenever an attack from step one partially succeeded, the algorithm would tweak it (changing wording, adding context, combining two prompts, or further obfuscating instructions) so that successful variants were kept and weak ones were discarded. Over time, this evolutionary search produced stronger and stronger adversarial prompts and even uncovered entirely new ways to break the agents.

HSIA was another standout project that pushed these ideas into the robotics world. Instead of attacking the animal agents, they targeted a Visual–Language–Action (VLA) robotic system and showed how its perception could be corrupted at the semantic level. The pixels in the image stayed exactly the same; what changed was the internal caption generated by the model. With subtle, carefully crafted perturbations, the VLA system could flip from “I see a bottle in the image” to “I see a knife in the image,” even though no knife was present, leading the robot to act on a false belief about its environment. Their work highlights that multimodal systems can be compromised without touching the raw image, exposing a critical vulnerability for next-generation robotic AI.

Lessons Learned

If I had to summarise what this hackathon taught me, it would be:

Be a Brave Soldier. Perseverance matters more than competition. It’s not about beating others; it’s about staying resilient, adapting when things break (because they will), and delivering the best version of your idea. Events like this aren’t just technical challenges; they’re opportunities to showcase your talent and the kind of determination companies genuinely value.

Prepare ahead of time. The teams that did well weren’t necessarily the most senior, they were the ones who arrived already knowing the format, the expectations, the evaluation criteria, and had gone through the tutorials and resources shared in advance.

Master the 5-minute pitch. This is critical. Evaluators and judges move fast. You might spend several days building something, but you only get a few minutes to make them care. So, have a pitch ready that explains the value of your project clearly, quickly, and in a way that sparks curiosity. If those 5 minutes are great, the judges will ask for more. This applies equally to junior profiles and senior engineers (storytelling is part of the job). I struggle with this too; in real life we usually don’t have much time to prove our ideas.

These Events Are Becoming More Meaningful Than Ever. These events are gaining more interest every year, and the organisers even doubled the number of spots this year, which shows how valuable the experience is. That’s why it’s so important to participate only if you truly want to be there and can commit your time and energy.

Study the sponsors. Before the event, look up the companies involved and think about which ones might be most interested in your approach. Tailor your pitch accordingly. Sponsors are not just judges they’re potential collaborators, mentors, or even future teammates.

Strong Fundamentals Beat Shiny Models. One key takeaway from the hackathon is that winning wasn’t about using the newest or most hyped models. The top teams didn’t succeed because they relied on the largest or flashiest architectures, they excelled because they built strong solutions on top of solid, well-understood techniques: genetic algorithms, robust diffusion models, between other. The real differentiator was how creatively they combined these foundations with agentic methodologies, clever evaluation setups, and smart engineering to tackle persistent challenges.

Collaborative Innovation Accelerates Progress. The event highlighted how cross-disciplinary collaboration between academia, industry, and AI governance experts can significantly strengthen both AI development and governance frameworks. Even participants who weren’t in technical roles contributed valuable ideas grounded in real problems from their own domains, bringing perspectives that pure engineering alone can’t provide. It’s also a great opportunity to connect with people outside your usual technical bubble, expanding not just your network, but the way you think about the impact and applications of AI.

Finally, a bigger reflection: agents are evolving fast, and with that comes new architectural challenges, safety concerns, and responsibilities. These are not hypothetical problems of the future, they are happening right now. Being responsible with AI applications is not a hype-driven slogan; it’s part of the daily job of any AI or data science professional.

Conclusions

These events are quietly shaping how we think about AI governance. When you put powerful agentic systems under time pressure and in messy, realistic scenarios, you’re forced to confront unpredictable behaviour head-on. That’s where the real learning happens: how do we balance rapid innovation with trust and safety? How do we design evaluation frameworks and guardrails that let us move fast without losing control? This hackathon didn’t just reward clever models, it rewarded thoughtful governance.

And while there are plenty of AI events popping up everywhere, this is one of the few you should really keep an eye on, the kind that genuinely helps you grow, exposes you to real-world challenges, and reminds you why it’s worth staying curious and keeping your skills sharp.

References

References in order of appearance:

[1] “NVIDIA CEO Jensen Huang kicks off CES 2025. The Future is Here!” SupplyChainToday, 2025. Link.

[2] Great Agent Hack 2025: Holistic AI x UCL. Available at: https://hackathon.holisticai.com/ (accessed November 22, 2025).

[3] Valyu AI. (2025). The Great Agent Hack 2025: Agent Performance, Reliability and Valyu-Powered Retrieval. Retrieved from https://www.valyu.ai/blogs/the-great-agent-hack-2025-agent-performance-reliability-and-valyu-powered-retrieval

[4] Great Agent Hack 2025. “Project gallery — Great Agent Hack 2025: Build and test transparent, robust, and safe AI agents for real‑world impact.” Devpost. Available at: https://hai-great-agent-hack-2025.devpost.com/project-gallery?page=1.

[5] Holistic AI. (2025). Hackathon 2025 [Source code]. GitHub. https://github.com/holistic-ai/hackathon-2025 (Last accessed: November 30, 2025)

[6] Elon Musk Stunned by Jensen Huang’s DGX Spark Gift. (n.d.). YouTube Shorts. https://www.youtube.com/shorts/l7x_Tfrbubs

[7] Andriushchenko, M., Souly, A., Dziemian, M., Duenas, D., Lin, M., Wang, J., Hendrycks, D., Zou, A., Kolter, Z., Fredrikson, M., Winsor, E., Wynne, J., Gal, Y., & Davies, X. (2024). AgentHarm: A benchmark for measuring harmfulness of LLM agents. arXiv. https://arxiv.org/abs/2410.09024

[8] Tiku N., Schaul K. and Chen S. (2023, November 01). This is how AI image generators see the world. Washington Post. https://www.washingtonpost.com/technology/interactive/2023/ai-generated-images-bias-racism-sexism-stereotypes/ (last accessed Aug 20, 2025).

[9] Porikli, S., & Porikli, V. (2025). Hidden Bias in the Machine: Stereotypes in Text-to-Image Models. Available at: https://openreview.net/pdf?id=u4KsKVp53s

[10] Wu, Z., Cho, S., Munoz, C., King, T., Mohammed, U., Kazimi, E., Pérez-Ortiz, M., Bulathwela, S., & Koshiyama, A. (2025). AgentGraph: Trace-to-Graph platform for interactive analysis and robustness testing in agentic AI systems. Holistic AI & University College London.

[11] Wicaksono, I., Wu, Z., Patel, R., King, T., Koshiyama, A., & Treleaven, P. (2025). Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

[12] Zhang, S., Yin, M., Zhang, J., Liu, J., Han, Z., Zhang, J., Li, B., Wang, C., Wang, H., Chen, Y., & Wu, Q. (2025). Which agent causes task failures and when? On automated failure attribution of LLM multi-agent systems (arXiv Preprint No. 2505.00212).