<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Shane Weisz</title>
        <link>https://shaneweisz.com</link>
        <description>PhD Student in Computer Science at the University of Cambridge. Researching applications of AI and machine learning to help nature and biodiversity conservation.</description>
        <lastBuildDate>Sun, 12 Apr 2026 12:55:12 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>All rights reserved 2026, Shane Weisz</copyright>
        <atom:link href="https://shaneweisz.com/feed.xml" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[Q1 2026: Dashboard Developments and Presenting in Peru]]></title>
            <link>https://shaneweisz.com/blog/dashboard-developments</link>
            <guid isPermaLink="false">https://shaneweisz.com/blog/dashboard-developments</guid>
            <pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Q1 2026: Dashboard Developments and Presenting in Peru]]></description>
            <content:encoded><![CDATA[<p>With the start of April nearly upon us, I'm now back in Cambridge after a whirlwind trip to the Peruvian rainforest! This post outlines what I've been up to this year thus far, including updates about the <a href="https://red.cst.cam.ac.uk/">Red List Dashboard</a>, a description of my current agentic coding workflow, some initial thoughts for what's coming next, and finally reflections from several talks about my early PhD work (including presenting at the inaugural <a href="https://ictc2026.com/">ICTC</a> in Lima). If you make it to the end, you'll be rewarded with colourful pics of birds and frogs and butterflies from the Peruvian Amazon :)</p>
<p><strong>Contents:</strong></p>
<ul>
<li><a href="https://shaneweisz.com/blog/dashboard-developments#dashboard">Dashboard Developments</a></li>
<li><a href="https://shaneweisz.com/blog/dashboard-developments#agentic-coding-workflow">My Agentic Coding Workflow</a></li>
<li><a href="https://shaneweisz.com/blog/dashboard-developments#ai-assisted-assessments">Looking Forward: AI-Assisted Assessments</a></li>
<li><a href="https://shaneweisz.com/blog/dashboard-developments#talks">Talks, Talks and More Talks</a></li>
<li><a href="https://shaneweisz.com/blog/dashboard-developments#peru">Presenting in Peru and Birding in the Amazon</a></li>
<li><a href="https://shaneweisz.com/blog/dashboard-developments#miscellaneous">Miscellaneous</a></li>
<li><a href="https://shaneweisz.com/blog/dashboard-developments#final-reflections">Final Reflections</a></li>
</ul>
<h2>Dashboard Developments: Towards A Living IUCN Red List of the World's Species {#dashboard}</h2>
<p>My main focus thus far this year has been continued development of the <a href="https://red.cst.cam.ac.uk/">Red List Dashboard</a>. When I first started on the dashboard back in December, my goal was simply to answer my own questions about the biggest data gaps and maintenance challenges facing the <a href="https://www.iucnredlist.org/">IUCN Red List of Threatened Species</a>, given its status as one of the world's most important conservation resources. However, the dashboard has rapidly evolved since then, and my hope is that bringing disparate biodiversity data sources together into a single user-friendly view will prove a highly useful resource for various Red List stakeholders.</p>
<p><a href="https://red.cst.cam.ac.uk/"><img src="https://shaneweisz.com/images/blog/Screenshot%202026-03-24%20at%204.00.33%20pm.png" alt="|500"></a></p>
<p><em>A screenshot from the new-look dashboard, now supporting both reassessments (172,620 species) and new assessments (550k+ species), with more filter charts and more data sources</em></p>
<p>For now, the dashboard provides an easy way to navigate the 2+ million described species (yes, that's a lot of species!), using filters on country, taxonomy, conservation status, assessors, and years since assessment, all the way down to individual species data from citizen science records (GBIF and iNaturalist), scientific literature (OpenAlex), use and trade data (CITES), and species experts' conservation status assessments (IUCN Red List). The idea is to bring the key information about a species of interest, from all the most important biodiversity datasets, into one place.</p>
<p>Here's a video where I show the dashboard in action, for an example use case of helping prioritise reassessments of dragonflies and damselflies in India:</p>
<p><img src="https://shaneweisz.com/images/blog/Screen%20Recording%202026-03-25%20at%203.24.33%20pm.mp4" alt="|672"></p>
<p><em>Video showing how the dashboard could be used to find data-deficient damselflies and dragonflies in India that could be re-assessed in light of new GBIF records.</em></p>
<p>Part of our long-term vision here is that this could become part of a platform for open-source collaboration between assessors and agents (implementation-wise, Anil has suggested a great idea of linking to a Zulip server as a 'database' for hosting such communications). I've been thinking a lot recently about what differentiates humans from AI as agents become increasingly capable. One of the key things is <em>accountability</em> – if we choose to use AI to help with some work, at the end of the day we are still the ones who need to take responsibility for the output and put a rubber stamp on it (and, crucially, be the ones the finger points at if something goes wrong). As a result, whenever we are accountable for a piece of work, it's crucial that we still understand the data and reasoning the AI uses well enough to verify it. To this end, a traceable, transparent evidence-base view, serving as a shared reference for both assessors and agents, seems an extremely useful prospect.</p>
<p>Check out the dashboard for yourself at <a href="https://red.cst.cam.ac.uk">red.cst.cam.ac.uk</a>, and let me know what you think!</p>
<h2>My Agentic Coding Workflow: Managing parallel Claude Code agents {#agentic-coding-workflow}</h2>
<p>I've narrowed in on an AI coding workflow that I've found very productive for my work on the dashboard, so I thought I'd talk through an example of it here. This workflow was heavily influenced by Boris Cherny, the creator of Claude Code, <a href="https://x.com/bcherny/status/2007179832300581177">describing his own setup</a>. I used this workflow to merge over 20 <a href="https://github.com/shaneweisz/redlist-dashboard/pulls?q=is%3Apr+is%3Aclosed">PRs</a> last week, some of which comprised significant architectural improvements (e.g. redesigning the entire taxonomic system underlying the dashboard). I'd estimate this all would have taken me well over 3 months (at least 10x longer) before the age of AI coding agents – and consequently I probably wouldn't have even attempted it... It's a crazy time we live in. Anyway, here's a quick intro to my setup.</p>
<p>I work on a single <a href="https://github.com/shaneweisz/redlist-dashboard">repo</a> for the dashboard, and often have a backlog of several features I want to add, which I try to keep track of via <a href="https://github.com/shaneweisz/redlist-dashboard/issues">GitHub Issues</a>. When developing, I open up to 4 terminals at a time, and in each terminal instruct a separate Claude Code instance to work in a git worktree (which means each agent can work on an independent feature without conflicting with the others). What's my mental model for this? I think about how I would delegate this work to four different engineers in a team, each of whom would independently check out the repo (in fact, this is effectively what happens with git worktrees: a new working copy of the repo gets created on your filesystem, sharing the same underlying history – shoutout to my friend <a href="https://www.linkedin.com/in/ryan-anderson-9388b3193/">Ryan</a> for showing me this).</p>
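<p>For anyone who hasn't used git worktrees before, here's a minimal command-line sketch of one-worktree-per-agent (the demo repo path and branch names here are illustrative, not my actual setup):</p>
<pre><code>```shell
# Throwaway demo repo (paths and branch names are hypothetical)
rm -rf /tmp/worktree-demo
mkdir -p /tmp/worktree-demo && cd /tmp/worktree-demo
git init -q main-repo && cd main-repo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# One worktree per agent: each is a full working copy on its own branch,
# sharing a single object store, so agents can commit without conflicting.
git worktree add -b feat/agent-1 ../agent-1
git worktree add -b feat/agent-2 ../agent-2

git worktree list   # shows the main checkout plus the two agent copies
```</code></pre>
<p>Each agent then runs inside its own directory; once a branch is merged, <code>git worktree remove ../agent-1</code> cleans up that copy.</p>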
<p><img src="https://shaneweisz.com/images/blog/Screenshot%202026-03-23%20at%204.24.39%20pm.png" alt="|600"></p>
<p><em>Three Claude Code agents working on independent features using git worktrees. For reference, here were the respective resulting PRs: <a href="https://github.com/shaneweisz/redlist-dashboard/pull/163">#163</a>, <a href="https://github.com/shaneweisz/redlist-dashboard/pull/141">#141</a>, <a href="https://github.com/shaneweisz/redlist-dashboard/pull/164">#164</a>.</em></p>
<p>In each terminal, I then start to tackle one of my backlog issues. I start by working with Claude to come up with a plan. If it's well spec'd, I'll sometimes submit it to <a href="https://code.claude.com/docs/en/claude-code-on-the-web">Claude Code web</a>, so it can run in a sandboxed cloud container and doesn't need to check in with me for permissions. When I'm happy, I ask the agent to create a PR. For local changes that need back-and-forth collaboration, I'll often ask the agent to start a dev server, and occasionally use Playwright so the agent can take screenshots of the app in action.</p>
<p>To avoid polluting the context window of the current chat (and likewise avoid having the main chat bias the context window for the investigation), I'll often ask the orchestrator agent to dispatch a subagent to investigate something (much like assigning a task to an intern). For big structural changes, I'll ask the orchestrator agent to dispatch a couple of agents for independent code reviews.</p>
<p><img src="https://shaneweisz.com/images/blog/Screenshot%202026-03-23%20at%204.16.33%20pm.png" alt="Just another day dispatching agents to solve the mystery of the missing fish...|500"></p>
<p>On the whole, the agents have succeeded with <em>almost</em> every task I've thrown at them. The one interesting stumbling block I ran into was with database design, where I lost a bit of time thanks to classic AI sycophancy on full display. It started with my deciding that it was time to upgrade to a <a href="https://supabase.com/">Supabase</a> database for the app's backend, to set us up for scale in the months to come. Naturally, Claude Code fully agreed! But by the end of the week, after trying a lot of different approaches and schema designs, I concluded this was all premature optimisation – a database would just slow down iteration, compared to the current approach of simply serving static CSV and JSON files over the network. Getting some experience setting up a Supabase database was not a waste though – we will probably still want a database at some point, and it was all valuable learning.</p>
<p>One other observation was that database schema design felt like something I still had better 'taste' for than the agents (for now, at least). Part of the reason, I think, is that good schema design relies heavily on domain expertise and on deeply understanding the user's problems, for which I have context that the agent does not automatically have. So of course it's now my job to provide this context clearly.</p>
<p>As a final note, I wanted to mention my information diet for keeping up to date with AI developments. I subscribe to a few high-quality feeds: <a href="https://simonwillison.net/">Simon Willison</a> (co-creator of <a href="https://www.djangoproject.com/">Django</a>), <a href="https://lucumr.pocoo.org/">Armin Ronacher</a> (creator of <a href="https://flask.palletsprojects.com/en/stable/">Flask</a>) and <a href="https://registerspill.thorstenball.com/about">Thorsten Ball</a> (developer at Sourcegraph) – their writing is all highly recommended!</p>
<h2>Looking Forward: AI-Assisted Red List Assessments {#ai-assisted-assessments}</h2>
<p>It's been widely felt that there was <em>another</em> step-function change in AI coding agent abilities since November (<a href="https://www.anthropic.com/news/claude-opus-4-5">Opus 4.5</a> and <a href="https://openai.com/index/introducing-gpt-5-2-codex/">Codex 5.2</a>), and again in February with <a href="https://www.anthropic.com/news/claude-opus-4-6">Opus 4.6</a> and <a href="https://openai.com/index/introducing-gpt-5-3-codex/">Codex 5.3</a>. Consequently, I've become increasingly convinced we should all be raising our ambition about the tasks we attempt with AI coding agents. In the Red List case, this means using AI to draft some SIS fields and even estimate criteria parameters (single-shot, with all tools allowed), and benchmarking how it does and where it fails. The agents will certainly make mistakes, since much of the most important information that goes into assessments is experts' knowledge that isn't encoded on the web. But benchmarking will help us pin down exactly where this boundary lies. In the battle to scale out the Red List, we should be helping experts focus exclusively on their unique insights into species' biology and ecology, and leaving all the mechanical work to AI. Red Listing is not a <em>creative</em> endeavour for humans; in fact it's the opposite – we want it to be as standardised, automated and consistent as possible.</p>
<p><img src="https://shaneweisz.com/images/blog/Screenshot%202026-03-25%20at%201.53.25%20pm.png" alt="|500"></p>
<p><em>A reminder of the results from my work showing <a href="https://www.shaneweisz.com/blog/yes-ai-passes-the-red-list-exam-what-next">AI can pass the Red List exam</a>. The next steps are to extend this to real-world Red List workflows.</em></p>
<p>Moreover, from chatting with assessors during my time in Peru (more on that later), I'm increasingly realising that even just starting an assessment is a daunting undertaking. Many assessors are time-strained volunteers, so anything that saves time and makes the process easier would be highly valued. In the future, these AI-assisted draft assessments could use <a href="https://connect.iucnredlist.org/">SIS Connect</a> to feed into the exciting new <a href="https://www.iucnredlist.org/assessment/sis">SIS</a> 3.0 platform (hopefully released by the end of this year) that stores the actual assessments.</p>
<p>For these benchmarking exercises, the batch of assessments received between consecutive Red List releases (e.g. between the November 2025 and June 2026 releases) provides a perfect window for running experiments with the LLMs in 'full-power mode'. This is because we can give full access to the web and still know the models can't 'cheat': assessments from this window are restricted to the private SIS database, not accessible through the web, and so won't be in the models' training data. For example, one possible methodology is to use the Oct, Nov and Dec submissions for validation and tuning the system, then use the Jan, Feb and March submissions as a held-out test set. Evaluation could happen on multiple levels – firstly on the risk category for each criterion, but also on completion of SIS fields (habitat and ecology, use and trade, etc.), and perhaps even the range maps (using skills and multi-modal models). I'm excited to see where this goes and to provide an update here in the months to come.</p>
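<p>To make the split concrete, here's a toy sketch of partitioning per-month submission files into validation and held-out test sets (the file layout and names are entirely hypothetical – SIS data won't actually look like this):</p>
<pre><code>```shell
# Hypothetical layout: one file per submitted assessment, tagged by month
rm -rf /tmp/redlist-split
mkdir -p /tmp/redlist-split/submissions /tmp/redlist-split/val /tmp/redlist-split/test
cd /tmp/redlist-split
for m in 2025-10 2025-11 2025-12 2026-01 2026-02 2026-03; do
  echo "{\"month\": \"$m\"}" > "submissions/assessment-$m.json"
done

# Oct-Dec 2025 -> validation (tune the system); Jan-Mar 2026 -> held-out test
cp submissions/assessment-2025-1[012].json val/
cp submissions/assessment-2026-0[123].json test/

ls val    # three validation files
ls test   # three held-out test files
```</code></pre>
<p>The key property is temporal: the system is tuned only on the earlier window, and the later window is touched once, at final evaluation.</p>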
<p>I also remain convinced that using AI for consistency checking of draft assessments, to reduce the workload on the Red List Unit, is a very feasible and useful win. A lot of the work here will be in the data-wrangling and eval creation using SIS data. With a lot of high-impact opportunities competing for my attention, it hasn't yet felt like the right time for me to work on this. I hope to give it more attention in the coming months, or it could be a project I co-supervise at some point (this could be a great setup – I could give guidance but leave the actual implementation to a Masters student for whom it's a well-sized project; I have experience in this from my time leading our ML team at Aerobotics, so I think I could be well-suited to such a role). I've realised that as much as I'd love to participate in several important initiatives at the DAB (IUCN, KBA, CITES, habitat mapping, etc.), I have only so much time, and need to be very careful where I invest my primary attention – remembering, too, that some of the highest-leverage work one can do is to empower others.</p>
<p>Finally, a note on the great chats I've had with others who have developed useful tools for the Red List over the years, including <a href="https://conservara.fr/">Victor Cazalis</a>, who built <a href="https://www.iucnredlist.org/resources/sredlist">sRedList</a>, and <a href="https://www.kew.org/science/our-science/people/steven-p-bachman">Steve Bachman</a>, who has created several impactful tools including <a href="https://www.kew.org/science/our-science/projects/geocat-geospatial-conservation-assessment-tool">GeoCAT</a> and <a href="https://github.com/stevenpbachman/shinyRapidLeastConcern">Rapid Least Concern</a>. Steve has been extremely friendly and helpful, and I'm hoping to spend a lot of time with him and his team in the coming months. His team at Kew submits many of the plant assessments worldwide, and with over 2000 newly described plant species each year on top of an existing backlog, any AI assistance would be very valuable. Steve has encouraged me to try submitting an assessment or two myself, and possibly to shadow an assessor at Kew for a couple of days – both fantastic ideas that I hope to take him up on.</p>
<h2>Talks, talks and more talks: from the EEG to Peru {#talks}</h2>
<p>Over the past two months I've now given several talks about my initial PhD work, of varying lengths and to various audiences.</p>
<p>It started with a 3-minute thesis talk at a college dinner in early January, followed by a 45-minute presentation to computer scientists at the EEG in early February, a 7-minute talk to conservationists at ICTC in Lima in late February, a 1-minute elevator pitch to my PhD cohort last week, and a 10-minute talk to a general audience at the Jesus MCR conference. This has all been excellent experience for me, as it had been many years since I'd last done serious public speaking.</p>
<p><img src="https://shaneweisz.com/images/blog/DEE312BE-92E5-468A-957C-41C29553F39D_1_201_a.jpeg" alt="|500"></p>
<p><em>Presenting at the Jesus College MCR Conference. The audience was really engaged and asked some great questions. Continuing my track record of serendipitous encounters with keynote speakers at conferences, I met <a href="https://www.linkedin.com/in/jon-hutton/">Jon Hutton</a>, who it turns out, along with leadership roles at WWF and UNEP-WCMC, was also Chair of the IUCN Red List for 10 years! He was very enthusiastic and supportive.</em></p>
<p>You can watch a YouTube recording of my talk at the EEG seminar here:</p>
<h2>Presenting at ICTC in Peru and Birding in the Amazon {#peru}</h2>
<p>Finally, a note on my incredible trip to Peru at the end of February! I spent 3 nights at the inaugural <a href="https://ictc2026.com/">International Conservation Technology Conference</a> in Lima, followed by 6 nights in the Peruvian Amazon!</p>
<p><img src="https://shaneweisz.com/images/blog/1DD2E581-4B6A-4F93-95D1-643102FF67A3_1_102_o.jpeg" alt="Attending the inaugural International Conservation Technology Conference in Lima, Peru|500"></p>
<p>On the morning of the final day, I presented my work on using AI to support the Red List, as part of a great session on LLMs for Conservation, alongside <a href="https://www.linkedin.com/in/ali-swanson-771807125">Ali Swanson</a> from Conservation International and <a href="https://www.linkedin.com/in/sarah-huebner/">Sarah Huebner</a> from the Smithsonian. The talk was well-received (lots of photos snapped from the audience), and afterwards we had good discussions sharing opportunities to use LLMs in conservation (including <a href="https://www.linkedin.com/in/hmmurray/">Hannah Murray</a> hoping to use agentic LLMs to scan social media to detect illegal pangolin trade...).</p>
<p><img src="https://shaneweisz.com/images/blog/Pasted%20image%2020260324161626.png" alt="I presented my work on AI to support the Red List in a session titled &#x22;LLMs for Conservation: From Local Knowledge to Global Guidance&#x22;"></p>
<p>I also had a great chat with <a href="https://www.linkedin.com/in/mikelmaron/">Mikel Maron</a> from <a href="https://www.earthgenome.org/">Earth Genome</a> who approached me with questions after my talk. I'm a big fan of the work they do at Earth Genome. We discussed our shared recent experience of the huge impact one can have using agentic coding to build internal tools to transform operations at small orgs. <a href="https://www.earthgenome.org/blog/devlog-small-tools-for-small-orgs">His post about this</a> is a great read.</p>
<p>Overall, I found ICTC very inspiring. I came away feeling like we have all the technological solutions ready for wide-scale biodiversity monitoring, on both land and sea, across taxa – and that it's now just a matter of driving down costs. For marine species, we have amazing technologies for monitoring at sea, with both on-board computer vision (e.g. <a href="https://tryolabs.com/">Tryolabs</a>) and real-time satellite monitoring via solutions like Ai2's <a href="https://allenai.org/skylight">Skylight</a>. For land species, there are incredible innovations in drones, camera traps, bioacoustics and eDNA. Each of these has different taxonomic strengths: bioacoustics for birds, camera traps with lights to attract insects, GPS collars for mammals, drones for trees, eDNA for freshwater species, etc.</p>
<p>Some noteworthy projects I enjoyed learning about:</p>
<ul>
<li><a href="https://www.youtube.com/watch?v=zPu7HYLHQTU">Limelight Rainforest</a>'s $5M XPRIZE winning solution (crazy cool tech, insect monitoring camera traps, drone collecting eDNA, connected bioacoustic sensors etc.)</li>
<li><a href="https://www.arm.com/">ARM</a>'s 0.5 W camera, for £80</li>
<li><a href="https://digital-naturalism-laboratories.github.io/Mothbox/">MothBox</a> for low-cost insect monitoring</li>
<li><a href="https://tryolabs.com/">Tryolabs</a> for marine ship monitoring</li>
<li><a href="https://allenai.org/">Ai2</a>'s <a href="https://allenai.org/skylight">Skylight</a> and <a href="https://www.earthranger.com/">EarthRanger</a> – both free. Ai2's model is incredibly powerful: just a couple of philanthropists funding top people. An interesting thought is how much good we could do for the world if we followed their model and convinced one or two billionaires to distribute Starlink and solar panels worldwide. A lot of the time, money does solve problems...</li>
</ul>
<p>After ICTC, I then spent an incredible week at <a href="https://www.amazonconservation.org/what-we-do/put-science-and-tech-to-work/research-stations/los-amigos-conservation-hub/">Los Amigos Biological Station</a>, going on daily walks (and night walks!) through the Amazon. You can check out my eBird checklists <a href="https://ebird.org/profile/MTU5OTM5NQ/world">here</a> and my iNaturalist contributions <a href="https://www.inaturalist.org/observations?created_d1=2026-02-04&#x26;quality_grade=research&#x26;user_id=shaneweisz&#x26;verifiable=any">here</a>.</p>
<p><img src="https://shaneweisz.com/images/blog/Screenshot%202026-03-24%20at%2010.38.21%20am.png" alt="|400"></p>
<p><em>I submitted over 140 iNaturalist observations, with plenty of butterflies, frogs, lizards, spiders, snakes, monkeys, plants and birds. The Bearded Emperor Tamarin was a highlight, check out its <a href="https://en.wikipedia.org/wiki/Bearded_emperor_tamarin">mustache</a>!</em></p>
<p>I was absolutely blown away by the diversity of birds, butterflies, insects, frogs, trees, sights and sounds... Here are some of my favourite colourful photos as promised:</p>
<h2>Miscellaneous personal asides {#miscellaneous}</h2>
<ul>
<li>I'm a huge fan of <a href="https://hannahritchie.com/">Hannah Ritchie</a>'s work and <a href="https://ourworldindata.org/">Our World in Data</a> more generally. After she recently released an <a href="https://hannahritchie.github.io/energy-use-comparisons/">awesome energy usage visualisation tool</a>, I had a feature idea for shareable URLs, like I have in place for the dashboard. So I thought, with the ease of modern coding agents, why not help by just adding the feature myself rather than pestering her with a feature request... I submitted a <a href="https://github.com/HannahRitchie/energy-use-comparisons/pull/1">PR</a> and sent her an email explaining what I'd done and how I'd tested it (along with a pointer to my dashboard if she were interested – given she wrote the great and highly-relevant Our World in Data article on extinction risk <a href="https://ourworldindata.org/extinction-risk-definition">here</a>). Although she ended up adding the feature herself rather than merging my PR, she took the time to write me a lovely warm response, which was very kind of her. My dashboard work has definitely been inspired by Our World in Data, so I was grateful to make this connection.</li>
<li>One of my themes for the year was to slow down and simplify, after last term's busyness. Whilst this has reluctantly meant no more rowing or ballroom dancing, I've continued playing lots of sport (hockey, football, squash, tennis), ran the <a href="https://cambridgehalfmarathon.com/">Cambridge Half Marathon</a>, spent lots of time visiting close friends in London, and have continued to frequent the <a href="https://www.cambridgebuddhistcentre.com/">Cambridge Buddhist Centre</a> (along with a fantastic <a href="https://adhisthana.org/retreat-calendar/events/sub30-weekend-the-shepherds-search-for-mind/">meditation retreat</a> in Hertfordshire).</li>
<li>I now have a busy few months ahead, with a fair amount of travel. I'm a groomsman for the wedding of one of my closest friends: we'll be celebrating Jono's bachelor party up in the Isle of Skye in mid-April, followed by some hiking adventures, and then I'll be going back to South Africa for a few weeks for the actual wedding in early May (including a family trip to the Kruger, of course!). In between, I'll be working on my <a href="https://www.cst.cam.ac.uk/local/phd/year1-report">first-year report</a>, and getting going with the AI-Assisted Assessments experiments for the next phase of this project. I've also applied for a spot on the <a href="https://cfe.columbia.edu/content/mlss2">2026 Machine Learning Summer School</a> at Columbia University in New York at the end of June – another potentially exciting opportunity on the horizon.</li>
</ul>
<h2>Final Reflections</h2>
<p>It's symbolic to now have my project idea up on <a href="https://anil.recoil.org/ideas/living-iucn-redlist">Anil's website</a>! He has patiently let me explore various directions in my early months, but it's nice to be converging on a focus area with my work on AI to support the Red List. I'm a big fan of Paul Graham's advice on the best ideas arising through evolution, rather than over-planning – I'm grateful Anil thinks similarly and has supported my ideas and work evolving over time:</p>
<p><img src="https://shaneweisz.com/images/blog/Screenshot%202026-03-24%20at%2010.21.23%20am.png" alt="|400"></p>
<p><em>From Paul Graham's fantastic essay on <a href="https://paulgraham.com/greatwork.html">How To Do Great Work</a>.</em></p>
<p>Overall, it's been a fast-paced start to the year and time has flown. It also feels like a crazy time in the world more generally, filled with uncertainty for the future: AI exponentials, geopolitical tensions, and continuing alarming trends in climate and biodiversity. But in the face of uncertainty, the truths of mindfulness are more important than ever: all we can do is pay close attention to the present moment, and respond to its demands as best we can. Happy Easter everyone!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[2025 Wrapped: Presenting to the IUCN and Other Highlights]]></title>
            <link>https://shaneweisz.com/blog/presenting-ai-for-the-red-list-to-iucn</link>
            <guid isPermaLink="false">https://shaneweisz.com/blog/presenting-ai-for-the-red-list-to-iucn</guid>
            <pubDate>Mon, 15 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[2025 Wrapped: Presenting to the IUCN and Other Highlights]]></description>
            <content:encoded><![CDATA[<p>It's nearing the end of the year, and as things are slowing down ahead of the holidays, I thought I'd write a short wrap-up post to reflect on the past few weeks, and the term as a whole.</p>
<p><strong>Contents:</strong></p>
<ul>
<li><a href="https://shaneweisz.com/blog/presenting-ai-for-the-red-list-to-iucn#ai-red-list-exam">AI for the Red List Exam: Presenting to the IUCN</a></li>
<li><a href="https://shaneweisz.com/blog/presenting-ai-for-the-red-list-to-iucn#flora-explorer">Flora Explorer: Can We Identify Plants From Space?</a></li>
<li><a href="https://shaneweisz.com/blog/presenting-ai-for-the-red-list-to-iucn#other-highlights">Other Highlights</a></li>
<li><a href="https://shaneweisz.com/blog/presenting-ai-for-the-red-list-to-iucn#looking-forward">Looking Forward</a></li>
</ul>
<h2>AI for the Red List Assessor Exam: Presenting to the IUCN {#ai-red-list-exam}</h2>
<p>One of the projects I've been most excited about this past term has been whether we can use <a href="https://shaneweisz.com/blog/elevator-pitch-ai-red-list">AI to Accelerate Assessments for the IUCN Red List</a>. There was an exciting recent development on this last week – I presented my work on <a href="https://shaneweisz.com/blog/can-claude-code-pass-the-red-list-exam">AI for the Red List Assessor Exam</a> to the IUCN last Thursday, and they're keen to see if we can collaborate in the new year!</p>
<p>How did this meeting even come about? Anil shared on his <a href="https://anil.recoil.org/notes/foundational-ecosystem-workshop">blog</a> a link to my post on <a href="https://shaneweisz.com/blog/can-claude-code-pass-the-red-list-exam">Can Claude Code Pass the Red List Assessor Exam?</a>.</p>
<p><img src="https://shaneweisz.com/images/blog/presenting-ai-for-the-red-list-to-iucn/anil-blog-share.png" alt="Screenshot of Anil&#x27;s blog post sharing my work"></p>
<p>A friend of Anil's, <a href="https://www.linkedin.com/in/carlybatist/">Carly Batist</a> from Conservation International, came across this and forwarded it to <a href="https://iucncongress2025.org/speakers/neil-cox">Neil Cox</a>, who leads the Biodiversity Assessment Unit at the IUCN. Neil reached out to Anil saying these are really interesting posts, and asked if we could discuss it in the new year. I'm very grateful to Anil, who proactively asked to meet sooner if possible, to give us time to reflect on research ideas ahead of the break. Suddenly we had a meeting lined up with many other important people at the IUCN, including <a href="https://www.linkedin.com/in/craig-hilton-taylor-b717524/?originalSubdomain=uk">Craig Hilton-Taylor</a> (Head of the Red List Unit), <a href="https://www.linkedin.com/in/caroline-pollock-caroline-pollock-23376460/">Caroline Pollock</a> (Senior Programme Coordinator for the Red List Unit), <a href="https://www.linkedin.com/in/simon-tarr-22069b209/?originalSubdomain=uk">Simon Tarr</a> (Red List Data Manager), <a href="https://www.linkedin.com/in/migtorres/?originalSubdomain=uk">Miguel Torres</a> (Red List Systems Manager) and <a href="https://www.linkedin.com/in/richard-jenkins-261a1693/?originalSubdomain=uk">Richard Jenkins</a> (Head of Biodiversity Assessment and Knowledge Team).</p>
<p>On the day it went really well, and it looks like the IUCN would be very open to working together to take this forward. We had some great discussions, and everyone was super enthusiastic on the whole (maybe with the exception of Caroline, who designed the exam and training course 10 years ago – it turns out she'd been stumped as to how someone was solving the exams so quickly, and messing up all her reporting! We all had a good laugh about that). <a href="https://www.notion.so/IUCN-Demo-AI-for-the-Red-List-Assessor-Exam-2c5e7c511467808dadc3cb0c6d6b1563?source=copy_link">Here</a>'s a rough script for what I presented, and here's a <a href="https://drive.google.com/drive/folders/1mHvJRdAYe7FLD0P1RRlU9UgMos5Or8Rf">video recording</a> of me doing a re-run of the presentation to my parents later that evening (so they now finally understand what I've been working on!).</p>
<p>In the meeting, I also presented the <a href="https://red-list-dashboard.vercel.app/">dashboard</a> I built recently to help visualise assessment coverage and prioritisation, which they were also really supportive about.</p>
<p><img src="https://shaneweisz.com/images/blog/presenting-ai-for-the-red-list-to-iucn/karoo-rock-elephant-shrew.png" alt="The dashboard helps uncover cases like that of this Karoo Rock Elephant-Shrew. This species was listed as data-deficient when last assessed over 12 years ago, but has since had 8 new iNaturalist observations that would be used in a reassessment!|600"></p>
<p>We identified three useful ways AI assistance could fit in: (1) using AI to help validate new draft assessments and re-assessments, (2) maintaining a 'living Red List' with a dynamic evidence base for each species that is automatically kept up-to-date, and (3) producing a prioritisation order for expert assessors for new assessments and re-assessments. It turns out that many of these ideas had been discussed in an <a href="https://iucn.org/story/202507/exploring-new-frontiers-cambridge-workshop-delves-use-new-technologies-red-list">IUCN workshop in Cambridge</a> in June earlier this year, and so there is strong appetite to continue this momentum.</p>
<p>In terms of next steps, there are two key directions I'll be looking to explore in the new year, towards the ultimate vision of an AI-assisted 'living' Red List evidence base:</p>
<ol>
<li>
<p><strong>Benchmarks</strong> – this seems like the key starting point, defining the success criteria for AI-assistance in various aspects of the Red List workflow. It could build on top of the Red List Assessor Exam framework (bolstered by synthetic questions perhaps), and we could progressively add complexity to better reflect the real-world Red List assessment process. Such complexity will make this a valuable benchmark for agentic capabilities more generally – since Red List assessments involve making important real-world judgments via complex reasoning amidst ambiguity, uncertainty, diverse modalities and incomplete information.</p>
<p>We're in a unique position to work closely with actual Red List assessors to design these collaboratively (Richard and Simon both strongly encouraged me to come spend lots of time chatting to assessors in the DAB, which I'd love!). Moreover, due to the Red List's biannual updates, we have a periodically recurring held-out test set that will fall outside the training cut-off dates as new frontier models get trained (e.g. the 2026-1 Red List update will be released early next year). I'm eager to look more into Ai2's <a href="https://allenai.org/asta/bench">AstaBench</a> for inspiration here, as well as OpenAI's new <a href="https://openai.com/index/frontierscience/">FrontierScience</a> release. Benchmarks are also an amazing way to get the machine learning community working together to tackle a problem.</p>
</li>
<li>
<p><strong>Agentic search with literature</strong> – this seems like the logical and super valuable next addition to the dashboard, highlighting relevant literature sources that went unused in previous assessments. Building for the future definitely involves connecting AI agents, via skills, to large databases like OpenAlex hosted in ClickHouse, and we're still figuring out how best to set this up.</p>
</li>
</ol>
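<p>The held-out test-set idea in (1) can be made concrete with a tiny sketch. This is purely illustrative – the field names below are hypothetical, not the actual SIS schema:</p>

```python
# Illustrative only: 'species' and 'published' are made-up field names,
# not the real IUCN SIS schema.
from datetime import date

def held_out_assessments(assessments, training_cutoff):
    """Keep only assessments published after a model's training cut-off."""
    return [a for a in assessments if a["published"] > training_cutoff]

updates = [
    {"species": "A", "published": date(2025, 6, 1)},   # likely seen in training
    {"species": "B", "published": date(2026, 3, 1)},   # e.g. from the 2026-1 update
]
test_set = held_out_assessments(updates, training_cutoff=date(2025, 12, 31))
print([a["species"] for a in test_set])  # ['B']
```

<p>Each biannual Red List release after a model's training cut-off then yields a fresh, contamination-free test set.</p>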
<p>An important thing to keep in mind here, which I'm still wrapping my head around, is the balance between what is <em>research</em> and what is <em>product-engineering</em> in my work, and being very intentional about this.</p>
<p>But on the whole I'm feeling highly motivated – collaborating with the IUCN to improve the Red List feels like a huge opportunity to have a positive impact on the planet through my research. It's an incredible position to be in – doing cutting-edge AI research at Cambridge, in collaboration with the world's largest environmental organisation, to support one of the most important information sources in global biodiversity conservation.</p>
<p>I'm looking forward to reflecting more on all of this over the break, and then taking this forward in the new year.</p>
<h2>Flora Explorer: Can We Identify Plants From Space? {#flora-explorer}</h2>
<p>A brief word here on another project I looked into, but have parked for now. The main idea is to benchmark where we're at for using <a href="https://github.com/ucam-eo/tessera">Tessera</a> for fine-grained plant species identification from space. Having a dynamic 'tree species heatmap' is a tool I've wanted myself whilst wandering around Cambridge – even having a soft prior on what the trees around me are would be super interesting as I'm trying to learn more about trees.</p>
<p><img src="https://shaneweisz.com/images/blog/presenting-ai-for-the-red-list-to-iucn/heatmap.png" alt="I&#x27;d love to have a heatmap like this for plant species around me whilst walking around Cambridge"></p>
<p>The research questions here can be expressed as: are there some plant species we can accurately identify from space using foundation model embeddings? Which ones, if so? What about fungi? (The Red List is aiming to assess 20k fungi species by 2030, but <a href="https://red-list-dashboard.vercel.app/">has only assessed around 1k so far</a>.) How does the GPS uncertainty in the underlying GBIF labels fit in? Moreover, how many training samples would we need per species to attain a given level of accuracy, and does this number depend on species' taxonomy or trait data? Could techniques like <a href="https://github.com/sgvaze/generalized-category-discovery">generalized category discovery</a> help with tackling the 'long-tail' problem with plant data? And how will all this change once we get <a href="https://www.esa.int/Applications/Observing_the_Earth/FutureEO/Biomass/The_instrument">p-band radar</a>?</p>
<p>If you're interested in more details, you can check out a <a href="https://www.notion.so/Can-Tessera-identify-plants-from-space-2bee7c5114678013a742c6e0bd7428ed?source=copy_link">Claude-generated write-up</a> of some of the things I've tried so far. My key takeaway thus far is: honest reporting of accuracy results with geospatial machine learning is not as straightforward as I'd first thought! I learned that techniques like spatial thinning and spatial blocking are essential to protect against spatial auto-correlation effects.</p>
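<p>To illustrate what spatial blocking means in practice (my own minimal sketch, not the pipeline from the write-up): observations are binned into coarse lat/lon grid cells, and whole cells are held out together, so that spatially auto-correlated neighbours never end up on both sides of a train/test split:</p>

```python
import random

def spatial_block_split(points, block_size_deg=1.0, test_frac=0.2, seed=0):
    """Split (lat, lon, label) points so whole grid cells go to train or test."""
    def block_id(lat, lon):
        # Coarse grid cell containing the point
        return (int(lat // block_size_deg), int(lon // block_size_deg))

    blocks = {}
    for p in points:
        blocks.setdefault(block_id(p[0], p[1]), []).append(p)

    # Hold out a random subset of whole blocks, never individual points
    ids = sorted(blocks)
    random.Random(seed).shuffle(ids)
    n_test = max(1, int(len(ids) * test_frac))
    test = [p for b in ids[:n_test] for p in blocks[b]]
    train = [p for b in ids[n_test:] for p in blocks[b]]
    return train, test

pts = [(0.1, 0.1, "oak"), (0.2, 0.2, "oak"), (5.5, 5.5, "ash"), (9.9, 9.9, "elm")]
train, test = spatial_block_split(pts)
# The two nearby oak points share a grid cell, so they always land in the same split
```

<p>A naive per-point random split could put one of those two nearby oak observations in train and the other in test, inflating accuracy through spatial auto-correlation.</p>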
<p>I'm definitely keen to continue to look into this in the new year, as I still want answers to the questions spelled out above. However, it's probably wise for this to just be a side project for me, to make sure I'm not spread too thin and protect my primary focus on the AI for the Red List project for now.</p>
<h2>Other Highlights {#other-highlights}</h2>
<p>I'll end by sharing some pics from various notable and enjoyable extra-curricular items during the past term.</p>
<p><strong>1. Weekly EEG Seminar Meetings and the EEG Christmas Dinner</strong></p>
<p><img src="https://shaneweisz.com/images/blog/presenting-ai-for-the-red-list-to-iucn/eeg-christmas-dinner.jpeg" alt="EEG Christmas Dinner"></p>
<p>It's been so great meeting the wonderful people in the EEG and learning the interesting things that each person is working on. The EEG seminar also provided a safe space for me to demo my AI for the Red List Exam work for the first time, and gave me great initial feedback (many thanks to Keshav, Robin and Jon for thought-provoking questions!).</p>
<p><strong>2. EurIPS in Copenhagen with Sadiq, Frank and Yihang, attending the AI for Conservation conference.</strong></p>
<p><img src="https://shaneweisz.com/images/blog/presenting-ai-for-the-red-list-to-iucn/neurips-copenhagen.jpeg" alt="EurIPS in Copenhagen"></p>
<p>I had a bunch of new ideas sparked by the conference, and really enjoyed meeting other people passionate about this space. In a fantastic coincidence, on the flight down I was sat next to <a href="https://www.linkedin.com/in/drew-purves-5781a759/?originalSubdomain=uk">Drew Purves</a> (Nature Lead at Google DeepMind) who was attending as one of the keynote speakers – and we had a great 2-hour conversation excitedly sharing ideas about how AI could help nature. I also loved spending time with Yihang and Frank exploring Copenhagen, learning about life back in China, and being treated to a delicious meal at a Cantonese restaurant!</p>
<p><img src="https://shaneweisz.com/images/blog/presenting-ai-for-the-red-list-to-iucn/cantonese-restaurant.jpeg" alt="Cantonese restaurant in Copenhagen"></p>
<p><strong>3. ARIA workshops: Foundational AI to forecast ecosystem resilience and Next Generation Ecosystem Models.</strong></p>
<p><img src="https://shaneweisz.com/images/blog/presenting-ai-for-the-red-list-to-iucn/aria-workshop.jpeg" alt="ARIA Workshop on Foundational AI to forecast ecosystem resilience"></p>
<p>There were many fantastic forward-looking ideas and discussions flowing, and I met many impressive people, including <a href="https://www.linkedin.com/in/julia-p-g-jones-85294215/">Julia Jones</a>, <a href="https://www.zsl.org/about-zsl/our-people/dr-robin-freeman">Robin Freeman</a>, <a href="https://www.cambridgeconservation.org/about/people/dr-mike-harfoot/">Mike Harfoot</a>, <a href="https://homepages.inf.ed.ac.uk/omacaod/">Oisin Mac Aodha</a>, <a href="https://www.linkedin.com/in/calum-maney-316a9116b/">Calum Maney</a> and <a href="https://www.linkedin.com/in/drew-purves-5781a759/?originalSubdomain=uk">Drew Purves</a>, amongst many others.</p>
<p><strong>4. Football for the JCFC 1s</strong></p>
<p><img src="https://shaneweisz.com/images/blog/presenting-ai-for-the-red-list-to-iucn/jcfc-football.png" alt="JCFC 1s Football"></p>
<p>Whilst we sadly weren't able to retain our Cuppers title, it's been great fun.</p>
<p><strong>5. Rowing for JCBC NM3s</strong></p>
<p><img src="https://shaneweisz.com/images/blog/presenting-ai-for-the-red-list-to-iucn/rowing.jpeg" alt="Rowing for Jesus College Boat Club"></p>
<p>I'm certainly not a natural rower (and am repeatedly told I need to stretch more), but it's been good to try something new!</p>
<p><strong>6. Re-uniting with South African friends in Cambridge</strong></p>
<p><img src="https://shaneweisz.com/images/blog/presenting-ai-for-the-red-list-to-iucn/south-african-friends.jpeg" alt="South African friends reunion"></p>
<p>I'm grateful to have had some close friends from home already here in Cambridge when I arrived – it made settling back into Cambridge a much easier transition.</p>
<p><em>(As an aside: it's an incredible time to be a South African sports fan, with the <a href="https://www.planetrugby.com/news/state-of-the-nation-rassie-erasmus-left-with-plenty-to-smile-about-as-springboks-cement-status-as-worlds-best-team">Springboks rugby team's continued success</a>, our <a href="https://www.dailymaverick.co.za/article/2025-12-07-south-african-cricket-is-enjoying-its-golden-era/">cricket teams entering a new era</a>, and <a href="https://www.fifa.com/en/tournaments/mens/worldcup/canadamexicousa2026/articles/south-africa-qualify">Bafana Bafana qualifying for the 2026 FIFA World Cup</a> in the US – our first time qualifying since my very fond memories of us hosting it back in 2010.)</em></p>
<p><strong>7. Meditation Nights at the <a href="https://www.cambridgebuddhistcentre.com/">Cambridge Buddhist Centre</a></strong></p>
<p><img src="https://shaneweisz.com/images/blog/presenting-ai-for-the-red-list-to-iucn/buddhist-centre-1.png" alt="Cambridge Buddhist Centre"></p>
<p><img src="https://shaneweisz.com/images/blog/presenting-ai-for-the-red-list-to-iucn/buddhist-centre-2.jpeg" alt="Life With Full Attention course"></p>
<p>There's a lovely, warm Buddhist community here in Cambridge. The regular community-centred Thursday meditation nights have provided a welcome counter-balance to the very strict and intense 10-day <a href="https://bhumi.dhamma.org/index.html">Vipassana retreat</a> I undertook in the Blue Mountains outside Sydney at the start of the year.</p>
<p><strong>8. Lastly, I took Ballroom Dancing classes for the first time.</strong></p>
<p><img src="https://shaneweisz.com/images/blog/dancing-cha-cha.jpg" alt="Teaching the Cha-Cha to my friend Tess in France"></p>
<p><em>Teaching the Cha-Cha to my friend Tess in France</em></p>
<p>As a famously bad dancer, it was great to learn some set moves from the Waltz, Cha-Cha, Jive, Tango and Quickstep.</p>
<h2>Looking Forward {#looking-forward}</h2>
<p>I'm feeling very grateful to be here doing my PhD in Cambridge. I recently took a look at the career goals I wrote in my PhD application over a year ago, and it's very satisfying to reflect on living true to those aspirations.</p>
<p><img src="https://shaneweisz.com/images/blog/presenting-ai-for-the-red-list-to-iucn/career-goals.png" alt="Career Goals section from my PhD application back in November 2024|600"></p>
<p>It's a real privilege to be here, with the freedom to think deeply about solving important global problems, researching ground-breaking AI technology, with a supportive, patient and inspiring supervisor who I'm learning so much from (I still don't quite believe Anil has the same 24 hours in a day as the rest of us), surrounded by brilliant minds in the EEG and CCI who care deeply about making a difference in the world. I'm left just feeling incredibly excited for 2026.</p>
<p>See you in the new year, and happy holidays everyone!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Yes, AI can pass the Red List Assessor Exam. What's next?]]></title>
            <link>https://shaneweisz.com/blog/yes-ai-passes-the-red-list-exam-what-next</link>
            <guid isPermaLink="false">https://shaneweisz.com/blog/yes-ai-passes-the-red-list-exam-what-next</guid>
            <pubDate>Tue, 02 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Yes, AI can pass the Red List Assessor Exam. What's next?]]></description>
            <content:encoded><![CDATA[<h1>TL;DR</h1>
<ul>
<li>After some prompt engineering, Claude Code now consistently passes the IUCN Red List Assessor exam, averaging 86%.</li>
<li>Without the Red List Assessor skill, or with a smaller model (Haiku), the grade drops to ~60%.</li>
<li>Next steps: chat to the IUCN to get access to richer full Red List assessments data.</li>
</ul>
<h1>Full Post</h1>
<p>Since my <a href="https://www.shaneweisz.com/blog/can-claude-code-pass-the-red-list-exam">last update</a>, I built a full web app to show Claude Code live in action taking the exam, and demo-ed the app to various interested parties. Here’s a screenshot from the web app:</p>
<p><img src="https://shaneweisz.com/images/blog/yes-ai-passes-red-list-exam-what-next/image-1.png" alt="A screenshot of the web app showing Claude Code’s attempts at the IUCN Red List Assessor Exam|800"></p>
<p>Moreover, with some small prompt engineering, Claude Code Sonnet now consistently passes 8 of 8 exams, averaging 86% – comfortably attaining the pass mark of 75%, and far better than I can do!</p>
<p>Here are some closing thoughts to wrap up this project.</p>
<ul>
<li>
<p>The demo landed very well with various audiences, including:</p>
<ul>
<li><a href="https://www.cst.cam.ac.uk/research/eeg">EEG</a> seminar attendees</li>
<li><a href="https://www.bangor.ac.uk/staff/sens/julia-patricia-gordon-jones-010356/en">Prof Julia Jones</a> and <a href="https://www.zsl.org/about-zsl/our-people/dr-robin-freeman">Dr Rob Freeman</a>, both of whom have contributed Red List Assessments</li>
<li><a href="https://www.cambridgeconservation.org/about/people/professor-neil-burgess/">Neil Burgess</a>, Chief Scientist at UNEP-WCMC.</li>
</ul>
<p>At the very least, I hope this project can give the IUCN a glimpse of what’s possible with AI here, particularly in addressing their challenges in validating, maintaining and scaling the Red List assessments.</p>
</li>
<li>
<p>At the EEG demo, I received some audience questions about (1) whether a large model (i.e. Sonnet) is necessary, and (2) how the models do with their parametric knowledge only (i.e. without access to the IUCN guidelines docs). I ran the experiments, and the results were conclusive: Claude Code Sonnet (86% average) significantly outperforms Claude Code Haiku (61%), and without access to the guidelines the average grade drops to 59%, showing the model cannot pass on parametric knowledge alone.
<img src="https://shaneweisz.com/images/blog/yes-ai-passes-red-list-exam-what-next/image-2.png" alt="Two takeaways: (1) Sonnet significantly outperforms Haiku (86% vs 61%), (2) the model’s parametric knowledge is insufficient to pass the exams (only 59% without the guidance docs).|800"></p>
</li>
<li>
<p>Anil pointed out a valuable use case for a tool like this that I hadn’t considered – the pedagogical value for trainee assessors.</p>
</li>
<li>
<p>Neil Burgess, Chief Scientist at UNEP-WCMC, suggested an agentic workflow like this could also be super useful in their work with <a href="https://trade.cites.org/">CITES</a> to protect endangered species from illegal trade. I can definitely see that being very valuable – my only worry here is that I’m being pulled in many directions at the moment, and Anil has wisely advised that I need to protect my focus!</p>
</li>
<li>
<p>I have learned so much from this process. Mainly, it’s been eye-opening how fast we can move from idea to action using agentic AI. The scope of possibility for rapidly testing research ideas is incredible.</p>
</li>
</ul>
<p>Another important question on my mind, though, is getting clear on what <em>research</em> is involved in a given project, as opposed to just AI-driven software engineering – even if accelerating Red List assessments would be extremely impactful work for conservation. On the other hand, working with AI agents is new for all of us – and so designing effective workflows for this is certainly novel terrain. For example, it’s not yet clear how best to connect these agents to <a href="https://anil.recoil.org/papers/2025-evidence-tap">Anil’s corpus of scientific literature</a>, an exciting future direction. But it’s definitely still important to be mindful of this <em>research</em> vs <em>application</em> trade-off.</p>
<p>Next steps for taking this further will be to approach the IUCN to (a) gauge their interest and (b) see if we could access their <a href="https://www.iucnredlist.org/assessment/sis">SIS</a> data with its full history of Red List assessments. Michael Dales has mentioned he can put me in touch with the head of the Red List Unit, <a href="https://www.linkedin.com/in/craig-hilton-taylor-b717524/?originalSubdomain=uk">Craig Hilton-Taylor</a>, who definitely seems an appropriate person to talk to about this. So we’ll see where this goes from here – but for now, I am hopeful this project at least serves as a valuable PoC and example of how AI can contribute positively in the conservation domain.</p>
        </item>
        <item>
            <title><![CDATA[Can Claude Code pass the IUCN Red List Assessor Exam?]]></title>
            <link>https://shaneweisz.com/blog/can-claude-code-pass-the-red-list-exam</link>
            <guid isPermaLink="false">https://shaneweisz.com/blog/can-claude-code-pass-the-red-list-exam</guid>
            <pubDate>Tue, 18 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Can Claude Code pass the IUCN Red List Assessor Exam?]]></description>
            <content:encoded><![CDATA[<h1>TL;DR</h1>
<p>Yes!</p>
<h1>Summary</h1>
<p>I recently completed the <a href="https://www.conservationtraining.org/course/index.php?categoryid=23">IUCN Red List Assessor Training course</a>, achieving 80% in the <a href="https://www.conservationtraining.org/enrol/index.php?id=168">final exam</a> to receive my official <a href="https://www.linkedin.com/in/shaneweisz/details/certifications/1762857663806/single-media-viewer/?profileId=ACoAABl9CVMBVVGnJH64PR2DmbEHbCqXOeb9rSw">certification</a> (you need >75% to pass the exam). Upon completing it, I was curious about how Claude Code would do, so I decided to put it to the test.</p>
<p>So, how does Claude Code do? Pretty well! It passed four out of five exam runs, averaging 80%.</p>
<p><img src="https://shaneweisz.com/images/blog/can-claude-code-pass-red-list-exam/image-1.png" alt="Claude Code’s exam results. The top row is my own personal attempt, the bottom 5 are Claude Code’s. Claude Code got the highest grade of 88%.|600"></p>
<p>Humans are allowed to repeat the exam as many times as needed until they pass, so four out of five is a very good result. Moreover, I am confident the incorrect answers are not due to an innate limitation, but rather just require more careful context engineering.</p>
<p>I remain confident that AI can significantly help the IUCN <em>scale up</em> Red List Assessments.</p>
<h1>Background:</h1>
<ul>
<li>
<p>Last week I became a certified IUCN Red List Assessor, after completing the <a href="https://www.conservationtraining.org/course/index.php?categoryid=23">IUCN Red List Assessor Training course</a>. <em>(Note that this certification does not mean I can now add my own Red List assessments – one still needs to be a species expert or be part of an IUCN SSC specialist group to contribute one.)</em></p>
</li>
<li>
<p>To obtain the certification at the end of the course, I had to complete a difficult 3-hour, 25-question final exam. An example question looks like:
<img src="https://shaneweisz.com/images/blog/can-claude-code-pass-red-list-exam/image-2.png" alt=""></p>
<p>Even with ChatGPT’s help, I found the exam questions pretty tough (it was an open-book exam, so AI and any online resources were allowed).</p>
</li>
<li>
<p>While doing the exam, I suspected AI would do really well at this, provided we do some careful context engineering first. This intuition is informed by the fact that the top AI reasoning models are now <a href="https://www.nature.com/articles/d41586-025-02343-x">competing with the world’s best mathematicians and computer scientists</a> in global olympiads. Given my interest in how <a href="https://www.notion.so/Research-Idea-IUCN-Red-List-AI-to-Accelerate-Red-List-Assessments-293e7c51146780c498dccac5273ed89b">AI could help with accelerating Red List assessments</a>, I thought as a first step we should see whether it can pass the exam that human assessors are required to pass.</p>
</li>
<li>
<p>Why does this matter? The IUCN’s biggest bottleneck towards achieving their Red List targets is <em>scale</em>: they are limited by the number of trained assessors available to do assessments and re-assessments. So any way AI can help accelerate the process would be extremely valuable.</p>
</li>
<li>
<p>To this end, I decided to put Claude Code to the test. I made sure it has access to the same resources I had during the exam – namely:</p>
<ol>
<li>The <a href="https://www.iucnredlist.org/resources/categories-and-criteria">IUCN Red List Categories and Criteria</a> (38 pages)</li>
<li>The <a href="https://www.iucnredlist.org/resources/redlistguidelines">IUCN Red List User Guidelines</a> (122 pages)</li>
<li>The <a href="https://www.iucnredlist.org/resources/mappingstandards">Mapping Standards and Data Quality for the IUCN Red List Spatial Data</a>  (32 pages)</li>
<li>The <a href="https://www.iucnredlist.org/resources/supporting-information-guidelines">Supporting Information Guidelines</a> (68 pages)</li>
<li>The <a href="https://www.iucnredlist.org/resources/regionalguidelines">Regional and National Assessment Guidelines</a> (46 pages)</li>
</ol>
</li>
</ul>
<h1>What I did:</h1>
<ul>
<li>
<p>First, I needed a way to get the exam questions into a format the AI could easily parse. To do this, I used Claude Code to create the following scripts:</p>
<ul>
<li>A script <code>extract_exam.py</code> to parse <code>questions.md</code> from the exam page’s raw HTML.</li>
<li>A script <code>extract_memo.py</code> to extract a <code>memo.txt</code> from the HTML of a submitted exam attempt.</li>
<li>A script <code>grade_attempt.py</code> to grade a set of AI-generated <code>answers.txt</code> against the <code>memo.txt</code>.</li>
</ul>
</li>
<li>
<p>Next I needed to get the IUCN Red List guideline PDFs into a text format that the AI could read.</p>
<ul>
<li>For this, I used the Claude Code <a href="https://github.com/anthropics/skills/blob/main/document-skills/pdf/SKILL.md">PDF Skill</a> to create a script <code>parse_pdf.py</code> that takes a PDF and outputs a corresponding markdown file along with its associated images and diagrams. The resulting directory looks like: <img src="https://shaneweisz.com/images/blog/can-claude-code-pass-red-list-exam/image-3.png" alt="The resulting directory of parsed markdown and images"></li>
</ul>
</li>
<li>
<p>I then used Claude Code to design a <code>red-list-assessor-skill</code>  that points the AI to the relevant official guidance docs.</p>
</li>
</ul>
<pre><code class="language-markdown">---
name: iucn-red-list
description: Assist with IUCN Red List species threat assessments. Use when the user asks about Red List categories, criteria thresholds, EOO/AOO calculations, assessment documentation, range mapping standards, generation length, population decline analysis, or preparing species assessments for threatened species.
license: Public Domain (official IUCN documents)
---

# IUCN Red List Assessor

Expert assistance with IUCN Red List assessments using official IUCN documents as the authoritative source.

## Source Documents

Official IUCN guidelines documents are embedded in this skill at `docs/`:

1. **user-guidelines/** - Guidelines for Using the IUCN Red List Categories and Criteria (v16, March 2024, 122 pages)
   - How to apply criteria, calculation methods, edge cases, examples
   - `user-guidelines-v16.md` - Searchable markdown with tables and image references
   - `images/` - Extracted diagrams, flowcharts, and decision trees

2. **categories-and-criteria/** - IUCN Red List Categories and Criteria (v3.1, 38 pages)
   - Categories (EX, EW, CR, EN, VU, NT, LC, DD), criteria (A-E) with thresholds, definitions
   - `categories-and-criteria-v3.1.md` - Searchable markdown

3. **supporting-info/** - Supporting Information Guidelines (68 pages)
   - Documentation requirements, IUCN Threats Classification Scheme v3.2
   - `supporting-info-guidelines.md` - Searchable markdown

4. **mapping-standards/** - Mapping Standards and Data Quality (v1.20, 32 pages)
   - EOO/AOO calculation, GIS requirements (WGS84)
   - `mapping-standards-v1.20.md` - Searchable markdown

5. **regional-guidelines/** - Guidelines for Application of IUCN Red List Criteria at Regional and National Levels (v4.0, 41 pages)
   - Regional assessment protocol, rescue effect, endemism, inclusion thresholds
   - `RL-2012-002.md` - Searchable markdown
   - `images/` - Flowcharts and decision diagrams

## Instructions

When assessing species or answering questions:

### 1. Document Research
- **Search the relevant document** - Use the Read or Grep tool to find exact guidance in the appropriate markdown file (e.g., `user-guidelines-v16.md`)
- **View diagrams when needed** - When decision trees, flowcharts, or criteria tables are referenced, use the Read tool on the image files in the `images/` directory (e.g., `docs/user-guidelines/images/page_10_img_1.png` for the Red List categories diagram)
- **Cite sources** - Always reference document name, section, and page number (e.g., "According to User Guidelines Section 4.4, page 45...")
- **Quote exactly** - Use exact definitions and thresholds from the source documents
- **Exhaustive checking** - For documentation/standards questions, check BOTH general principles AND specific examples; don't rely solely on bulleted lists

### 2. Systematic Criterion Evaluation

**CRITICAL: Always evaluate ALL criteria and ALL sub-criteria to find the HIGHEST qualifying category.**

Use this systematic checklist:

**Criterion A (Population Reduction):**
- □ A1 (past reduction, causes understood/reversible/ceased)
- □ A2 (past reduction, causes may not be understood/reversible/ceased)
- □ A3 (future reduction, projected)
- □ A4 (past+future reduction)

**Criterion B (Geographic Range):**
- □ B1 (Extent of Occurrence - EOO)
- □ B2 (Area of Occupancy - AOO)
- For each, check sub-criteria:
  - □ a. Severely fragmented OR number of locations (≤1, ≤5, ≤10)
  - □ b. Continuing decline in: (i) EOO, (ii) AOO, (iii) habitat, (iv) locations/subpops, (v) mature individuals
  - □ c. Extreme fluctuations

**Criterion C (Small Population + Decline):**
- □ C1 (population &#x26;lt;250/2500/10000 AND decline % within timeframe)
  - Check timeframes: CR=3yrs/1gen, EN=5yrs/2gen, VU=10yrs/3gen
- □ C2 (population &#x26;lt;250/2500/10000 AND continuing decline AND either):
  - □ C2a(i) (each subpopulation ≤50/250/1000 mature individuals)
  - □ C2a(ii) (≥90%/95%/100% of mature individuals in one subpopulation)
  - □ C2b (extreme fluctuations)

**Criterion D (Very Small or Restricted):**
- □ D (population &#x26;lt;50/250/1000)
- □ D2 (restricted area/locations with plausible threat)

**Criterion E (Quantitative Analysis):**
- □ E (extinction probability from PVA)

**After evaluation:**
- Identify ALL qualifying criteria and sub-criteria
- Select the HIGHEST category among all qualifying criteria
- Don't stop at the first qualifying criterion
- Combine all qualifying criteria in final code (e.g., "EN C1+2a(ii); D")

### 3. Regional/National Assessment Protocol

For regional or national red list assessments (see **regional-guidelines/RL-2012-002.md** for full protocol):

**Step 1: Check Inclusion Threshold (if specified)**
- ALWAYS calculate actual percentages when thresholds are mentioned
- Extract global population size from global IUCN criteria code
  - VU C: &#x26;lt;10,000 individuals
  - EN C: &#x26;lt;2,500 individuals
  - EN D: &#x26;lt;250 individuals
  - VU D1: &#x26;lt;1,000 individuals
- Calculate: (regional population / global population) × 100%
- Decision:
  - IF percentage &#x26;lt; threshold → Category: **NA (Not Applicable)**
  - IF percentage ≥ threshold → Proceed to full regional assessment

**Step 2: Preliminary Regional Assessment**
- Apply IUCN criteria using ONLY the regional population data
- Determine preliminary category based on regional population size, decline, range, etc.

**Step 3: Consider Rescue Effect**
- Can individuals migrate into the region from outside populations?
- Is suitable habitat available for immigrants to establish?
- Does the regional population rely on immigration?
- If YES to all: Consider downlisting (rescue effect possible)
- If NO (population isolated): Keep preliminary category

**Step 4: Final Regional Category**
- Apply adjustments based on rescue effect analysis
- Document both preliminary and final categories with reasoning

### 4. Near Threatened (NT) Assessment

Check for NT when species doesn't qualify for threatened categories:

**Apply NT if:**
- Species meets a geographic/population threshold BUT doesn't meet the required number of sub-criteria
- Species is "close to" qualifying:
  - Near ≤10 locations threshold (e.g., 11-15 locations)
  - Just below percentage thresholds (e.g., 25-29% for VU A)
  - Meets 1 of 2 required sub-criteria under Criterion B
- Species is likely to qualify in near future

**Example:** EOO &#x26;lt;20,000 km² (meets VU B1 threshold) with 12 locations and declining, but only 1 of 2 required sub-criteria met → Consider NT

### 5. Calculation Methods and Precision

**For all calculations:**
- Show ALL calculation steps explicitly with formulas
- State which method/formula is being used
- For population reduction, determine if decline is linear or exponential first
- Double-check arithmetic before final rounding
- When multiple valid methods exist (e.g., empirical vs. formula-based generation length):
  - Consider both methods
  - Provide both values if question context suggests it
  - Note which method is being used for each value

**Generation Length:**
- Check if empirical data available (average age of breeding individuals)
- Check if formula-based calculation needed: (age at first breeding + age at last breeding) / 2
- For exploited populations: Use PRE-DISTURBANCE generation length
- If both empirical and formula values exist, consider providing both

**Population Reduction:**
- Determine timeframe: max(10 yrs, 3 generations) for A2, etc.
- Assess pattern: linear vs. exponential
- For exponential: Use formula (1 - (N_final/N_initial)^(target_period/observed_period))
- Round to integer only at final step

### 6. Combine Multiple Criteria
- If multiple criteria are met, combine them (e.g., "EN C1+2a(ii); D")
- Use highest qualifying category as primary
- List all qualifying criteria in standard IUCN format
</code></pre>
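<p>The two formulas quoted in the skill above can be sketched in Python. This is purely illustrative: the function names and example numbers are mine, not part of the skill itself.</p>

```python
def generation_length(age_first_breeding, age_last_breeding):
    # Formula-based generation length: midpoint of first and last breeding ages.
    return (age_first_breeding + age_last_breeding) / 2

def exponential_reduction(n_initial, n_final, observed_years, target_years):
    # Scale an observed exponential decline to the assessment window,
    # e.g. max(10 years, 3 generations) for Criterion A2.
    return 1 - (n_final / n_initial) ** (target_years / observed_years)

# Hypothetical example: a decline from 1000 to 700 individuals observed over
# 10 years, scaled to 3 generations of 5 years each; round only at the end.
gl = generation_length(3, 7)                          # 5.0 years
window = max(10, 3 * gl)                              # 15 years
reduction_pct = round(100 * exponential_reduction(1000, 700, 10, window))
print(reduction_pct)  # 41
```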
<ul>
<li>Next I added a Claude Code <code>/attempt-exam-question</code> slash command to answer a given question using the <code>red-list-assessor-skill</code>, outputting the final answer and clear, concise reasoning for it.</li>
</ul>
<pre><code class="language-markdown"># Attempt Exam Question Command

You are tasked with answering a single IUCN Red List Global and Regional Assessor Exam question.

## Important Information

**Before you start, please read the following information.**

- This is an open-book exam. Use all of the resources available to help you answer the questions.
- Refer to the IUCN Red List Categories and Criteria. Version 3.1 and the associated IUCN guidelines documents and Red List assessment tools.
- Some questions will require you to calculate parameters such as reduction, generation length, continuing decline, etc. Use whatever tools you need to help you with this (e.g., the current version of the Guidelines for Using the IUCN Red List Categories and Criteria, calculator, internet, etc).

**Question Format Guidelines:**
- **Short Answer Questions**: Read each question carefully before giving your answer. Some answers must be entered as an integer number without decimals (e.g., if "9" is the correct answer, writing "9.13", "nine", or "9 locations" will be marked as incorrect). Some questions require you to enter the appropriate two-letter IUCN Red List Category, and some require the appropriate IUCN Red List Category and Criteria code. Ensure you use the appropriate format when entering these codes (see Annex 2 of the IUCN Red List Categories and Criteria. Version 3.1).
- **Multiple-choice questions**: All multiple-choice questions allow you to select one or more answer. At least one of the answers provided is correct, but do not assume that there will always be more than one correct answer. If any of your selected answers are incorrect your overall score for the question will be zero (even if one of your selected answers is the correct one).

## Input

The user will provide:
1. `exam_name` - The exam identifier (e.g., "exam1", "exam2")
2. `question_number` - The question number (1-25)
3. `solution_folder` - The path to the solution folder where the answer file should be written

## Workflow

### 1. Load the IUCN Red List Assessor Skill

Activate the `iucn-red-list-assessor` skill to access specialized knowledge and tools for answering IUCN Red List assessment questions.

### 2. Read the Question

Read the question from: `exams/questions/{exam_name}/{exam_name}_q{question_number}.md`

Format the question number with leading zeros (e.g., q01, q02, ..., q25)

### 3. Answer the Question

Use the IUCN Red List Assessor skill to:
- Analyze the question carefully
- Apply relevant IUCN criteria and guidelines
- Perform any necessary calculations
- Determine the correct answer in the exact format required

### 4. Write the Answer File

Create a file `q{question_number}.md` (with leading zeros) in the solution folder containing:

```markdown
## Question \{N\}

[Copy the full question text here]

## Answer

[Your exact answer in the required format]

## Explanation

[Clear, concise explanation showing how the answer was derived, including:
- Key information from the question
- Relevant IUCN criteria or guidelines applied
- Any calculations performed
- Reasoning for the final answer]
</code></pre>
<h2>Important Notes</h2>
<ul>
<li>Ensure answers follow the exact format specified in each question (e.g., integer only, two-letter code, comma-separated list)</li>
<li>For multiple-choice questions with checkboxes, use comma-separated lowercase letters (e.g., "a, b", "b, c, f")</li>
<li>For IUCN categories, use exact format from Annex 2 of the Categories and Criteria document</li>
<li>Be extremely careful with formatting - incorrect format = zero marks even if conceptually correct</li>
<li>Round numbers as instructed (e.g., to nearest integer) and follow the exact format requested</li>
<li>The answer in the <code>## Answer</code> section will be extracted by the aggregation script (markdown formatting like bold/italic will be automatically stripped)</li>
</ul>
<pre><code>
&#x3C;/details>

- I then added a `/attempt-exam` slash command that instructs Claude Code to spin out 25 parallel `Task` agents, one for each question, and run `/attempt-exam-question` on each.
&#x3C;details>
&#x3C;summary>Here’s the prompt. Note that the exam instructions are the same ones given to human trainers.&#x3C;/summary>

```markdown
# Attempt Exam Command

You are tasked with attempting the IUCN Red List Global and Regional Assessor Exam.

## Important Information

**Before you start, please read the following information.**

- This is an open-book exam. Use all of the resources available to help you answer the questions.
- Refer to the IUCN Red List Categories and Criteria. Version 3.1 and the associated IUCN guidelines documents and Red List assessment tools.
- Some questions will require you to calculate parameters such as reduction, generation length, continuing decline, etc. Use whatever tools you need to help you with this (e.g., the current version of the Guidelines for Using the IUCN Red List Categories and Criteria, calculator, internet, etc).

The exam contains a range of question types, including multiple-choice and short answer questions:
- **Short Answer Questions**: It is very important to read each question carefully before giving your answer. For example: Some answers must be entered as an integer number without decimals (e.g., if "9" is the correct answer, writing "9.13", "nine", or "9 locations" will be marked as incorrect). Some questions require you to enter the appropriate two-letter IUCN Red List Category, and some require the appropriate IUCN Red List Category and Criteria code. Ensure you use the appropriate format when entering these codes (see Annex 2 of the IUCN Red List Categories and Criteria. Version 3.1).
- **Multiple-choice questions**: All multiple-choice questions allow you to select one or more answer. At least one of the answers provided is correct, but do not assume that there will always be more than one correct answer. If any of your selected answers are incorrect your overall score for the question will be zero (even if one of your selected answers is the correct one).

## Input

The user will provide an exam name (e.g., "exam2"). This corresponds to a folder in `/home/sw984/ai-red-list-assessor/exams/questions/{exam_name}/`.

## Workflow

### 1. Setup Phase

- Read the pre-parsed exam questions from `exams/questions/{exam_name}/`
- Each question is in a separate file: `{exam_name}_q01.md` through `{exam_name}_q25.md`
- Create a timestamped solution folder: `exams/attempts/{exam_name}/{YYYYMMDD_HHMMSS}_{claude|codex|gemini|cursor}`
- Use format like: `20251107_153045_claude`

### 2. Parallel Question Processing

Launch 25 Task agents **in parallel** (one message with 25 Task tool calls), one for each question. Each Task agent should:
- Invoke the `/attempt-exam-question` slash command with the appropriate parameters
- Pass the exam_name, question_number (1-25), and solution_folder path
- The slash command will handle reading the question, using the IUCN skill, and writing the answer file

Example Task agent prompt format:
</code></pre>
<p>Please invoke the /attempt-exam-question command to answer question {N} for {exam_name}.</p>
<p>The solution folder is: {solution_folder_path}</p>
<p>Use this invocation:
/attempt-exam-question {exam_name} {N} {solution_folder_path}</p>
<pre><code>
### 3. Aggregation Phase

After all 25 Task agents complete:
- Run the aggregation script: `python3 scripts/aggregate_answers.py {solution_folder_path}`
- This script will:
  - Read all `q01.md` through `q25.md` files
  - Extract the answer from each file's `## Answer` section
  - Create `answers.txt` with exactly 25 lines (one answer per line)
  - Report any missing or failed answer files

### 4. Completion

- Inform the user where the solutions are saved
- Provide the path to `answers.txt`
- Report the summary from the aggregation script

## Important Notes

- This command orchestrates 25 parallel Task agents, each invoking the `/attempt-exam-question` slash command
- The `/attempt-exam-question` command handles individual question processing, including activating the IUCN Red List Assessor skill
- You can test individual questions independently by running: `/attempt-exam-question &#x3C;exam_name> &#x3C;question_number> &#x3C;solution_folder>`
- Ensure answers follow the exact format specified in each question (e.g., integer only, two-letter code, comma-separated list)
- For multiple-choice questions with checkboxes, use comma-separated lowercase letters (e.g., "a, b", "b, c, f")
- For IUCN categories, use exact format from Annex 2 of the Categories and Criteria document
- Be extremely careful with formatting - incorrect format = zero marks even if conceptually correct

</code></pre>
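<p>For reference, here is a minimal Python sketch of what an aggregation step like this could look like. This is my illustrative version under assumed file layouts, not the actual <code>scripts/aggregate_answers.py</code>:</p>

```python
import re
from pathlib import Path

def aggregate_answers(solution_folder, n_questions=25):
    """Collect the '## Answer' section from q01.md..q25.md into answers.txt."""
    folder = Path(solution_folder)
    answers, missing = [], []
    for n in range(1, n_questions + 1):
        path = folder / f"q{n:02d}.md"
        if not path.exists():
            missing.append(path.name)
            answers.append("")  # keep answers.txt at exactly n_questions lines
            continue
        text = path.read_text()
        # Grab everything between '## Answer' and the next '## ' heading.
        match = re.search(r"## Answer\s+(.*?)(?=\n## |\Z)", text, re.DOTALL)
        answer = match.group(1).strip() if match else ""
        # Strip basic markdown emphasis so formatting never costs marks.
        answer = re.sub(r"[*_`]", "", answer)
        answers.append(answer)
    (folder / "answers.txt").write_text("\n".join(answers) + "\n")
    return missing  # report any missing or failed answer files
```

<p>Keeping this step as a deterministic script, rather than asking the model to aggregate its own answers, is in the spirit of the modularity lesson in the Appendix.</p>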
<ul>
<li>I then extracted 5 exams of 25 questions each from the official webpage, and ran Claude Code on each. Here are some screenshots showing what the Claude Code workflow looks like in action.</li>
</ul>
<p><img src="https://shaneweisz.com/images/blog/can-claude-code-pass-red-list-exam/image-4.png" alt=""></p>
<p><img src="https://shaneweisz.com/images/blog/can-claude-code-pass-red-list-exam/image-5.png" alt=""></p>
<p><img src="https://shaneweisz.com/images/blog/can-claude-code-pass-red-list-exam/image-6.png" alt=""></p>
<ul>
<li>Note that Q3 took 7m 3s since it had a weird format:</li>
</ul>
<p><img src="https://shaneweisz.com/images/blog/can-claude-code-pass-red-list-exam/image-7.png" alt=""></p>
<p><img src="https://shaneweisz.com/images/blog/can-claude-code-pass-red-list-exam/image-8.png" alt=""></p>
<p><img src="https://shaneweisz.com/images/blog/can-claude-code-pass-red-list-exam/image-9.png" alt=""></p>
<p><img src="https://shaneweisz.com/images/blog/can-claude-code-pass-red-list-exam/image-10.png" alt=""></p>
<h1>Results</h1>
<p>So, how did it actually do?</p>
<ul>
<li>
<p>Across the 5 exam runs, Claude Code’s results averaged 81%:
<img src="https://shaneweisz.com/images/blog/can-claude-code-pass-red-list-exam/image-11.png" alt="Claude Code’s exam results. The top row is my own personal attempt, the bottom 5 are Claude Code’s. Claude Code got the highest grade of 88%. "></p>
</li>
<li>
<p>Not bad! The questions it got wrong tended to be ambiguous or borderline cases, and with some further context engineering I’m confident it can score even higher.</p>
</li>
</ul>
<h1>Next Steps</h1>
<ul>
<li>
<p>Longer-term, the vision is just for this framework to serve as a test-bed for a practical AI tool to help accelerate assessments. This vision is spelled out in more depth in <a href="https://www.notion.so/293e7c51146780c498dccac5273ed89b">Untitled</a>.</p>
<ul>
<li>To that end, we would benefit significantly from access to the IUCN’s <a href="https://www.iucnredlist.org/assessment/sis">SIS</a>, which stores a history of all assessments, including the succession of rejected drafts until the published draft was ready.</li>
<li>This would be an extremely rich dataset. As just one example, this could be used to design an evaluation framework for an AI validation system for catching errors in Red List assessment first drafts.</li>
</ul>
</li>
<li>
<p>However, to get the IUCN’s buy-in, it may be helpful to present this to them in a more accessible way than this technical report.</p>
</li>
<li>
<p>Some of Anil’s suggestions include:</p>
<ol>
<li>A neat UI to visualize it in action. Try Tailwind and daisyUI.</li>
<li>Add a click-through from answers to the source material in the PDF. Get the source map from <code>pdfplumber</code>, then draw a polygon in the original PDF.</li>
<li>Visualize tool use in action. Try the Claude Code Python SDK to stream this live.</li>
</ol>
</li>
</ul>
<ul>
<li>Cleanly store all prompts, tool calls, tasks, results etc. launched by Claude Code. This will be important for improvements going forward.</li>
<li>Try injecting the full guidelines doc into the context window (using prompt caching to reduce costs), and use the model’s in-built attention mechanisms rather than relying on agentic search.</li>
<li>Try RAG approaches for semantic retrieval.</li>
<li>Design a Claude Code framework for iteratively refining its own prompts until it achieves full marks on the exam questions.</li>
</ul>
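<p>The feedback signal for such a refinement loop could be as simple as an exact-match scorer over <code>answers.txt</code>. A hypothetical sketch (the answer-key file and its format are my assumptions):</p>

```python
def score_exam(answers_path, key_path):
    # Exact-match scoring, mirroring the real exam: a wrongly formatted
    # answer scores zero even if it is conceptually correct.
    with open(answers_path) as f:
        submitted = [line.strip().lower() for line in f]
    with open(key_path) as f:
        key = [line.strip().lower() for line in f]
    wrong = [i + 1 for i, (a, k) in enumerate(zip(submitted, key)) if a != k]
    pct = 100 * (len(key) - len(wrong)) / len(key)
    return pct, wrong  # the wrong question numbers seed the next refinement pass
```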
<h1>Appendix</h1>
<p>What else did I learn during this sprint?</p>
<ul>
<li>With all the parallel agents, I run out of Claude Code Pro credits quite quickly, which slows experimentation while I wait a few hours for my session credits to refresh. It might make sense to upgrade to Claude Code Max, or try directly using the API to leverage prompt caching and have finer-grained control over credit usage.</li>
<li>This sprint has made clear to me the real tradeoffs between (a) task performance, (b) token usage &#x26; cost, (c) model size, and (d) level of reasoning. The ideal combination is to answer the question correctly, using as few tokens as possible, with as small and cheap a model as possible. This has made the importance of <em>context engineering</em> extra clear.</li>
<li>Modularity is still important in AI workflow design. Originally, I had one mega-prompt. Splitting this up into subtasks and scripts was extremely helpful. Using AI to explore a solution and then encoding this in a deterministic script is very useful.</li>
<li>I’ve also learned it’s important not to overfit a solution to the problem – i.e. agentic workflows, skills, subagents etc. are not always the right answer. To this end, I got clearer in my own head about when AI agents and tool use are useful and when they’re not: primarily when we want the ability to dynamically run scripts or perform calculations, such as computing generation length or population reduction rates.</li>
</ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A North Star for my PhD: LIFE and STAR]]></title>
            <link>https://shaneweisz.com/blog/phd-north-star</link>
            <guid isPermaLink="false">https://shaneweisz.com/blog/phd-north-star</guid>
            <pubDate>Wed, 22 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[A North Star for my PhD: LIFE and STAR]]></description>
            <content:encoded><![CDATA[<p>Over the past couple weeks, I’ve been <a href="https://www.shaneweisz.com/blog/early-explorations">navigating</a> between various exciting paths I could work on during my PhD. Whilst free-form exploration is valuable, a great aid for staying on track whilst navigating is to choose a North Star. This post provides an argument for initially choosing LIFE and STAR as a North Star for my PhD.</p>
<p>For IUCN's <a href="https://iucn.org/resources/conservation-tool/species-threat-abatement-and-restoration-star-metric">STAR</a>, here’s my argument for why it matters:</p>
<ul>
<li>We are facing urgent biodiversity and ecological crises.</li>
<li>We want our actions to tackle these to be as leveraged as possible.</li>
<li>The IUCN has global influence and authority on the state of nature and conservation. They advise the UN, governments, NGOs and large corporations. Their scope for impact is thus large.</li>
<li>STAR is the IUCN’s flagship metric – they have recently <a href="https://iucn.org/press-release/202510/new-star-guidance-supports-governments-and-civil-society-accelerating-nature">publicly committed </a>to it as their key tool for quantifying and directing nature-positive action.</li>
<li>STAR will thus be increasingly relied upon by influential decision makers looking to achieve their biodiversity goals.</li>
<li>To do so faithfully, and to preserve public trust, it’s important that STAR accurately reflects what it’s designed to measure.</li>
</ul>
<p>For <a href="https://anil.recoil.org/papers/2024-life/">LIFE</a>, see Dr Alyson Eyres's fantastic <a href="https://anil.recoil.org/papers/2025-life-uses">paper</a> showing use cases for applying LIFE to inform conservation action. One clear example is how it can help us understand how the <a href="https://anil.recoil.org/notes/exploring-food-impacts">biodiversity impacts of our dietary choices</a> differ depending on the country we live in.</p>
<p>If you agree with those arguments, you can see why I care about LIFE and STAR, and why one could justify them serving initially as a North Star for my PhD.</p>
<p>The next question follows: as a computer science PhD student, what are the best ways I can contribute to making these metrics better?</p>
<p>The conceptual diagram below is my attempt at capturing what I see as the biggest gaps and opportunities.</p>
<p><img src="https://shaneweisz.com/images/blog/north-star-for-my-phd/image-1.png" alt="PhD North Star diagram|800"></p>
<p>The opportunities can be summarised as:</p>
<ol>
<li>Improve the accuracy of the existing metrics (better habitat maps, more accurate and up-to-date Red List data)</li>
<li>Extend their coverage to other taxa (plants, fungi, marine species), and</li>
<li>Capture richer insights (include species interaction effects, simulations e.g. Madingley).</li>
</ol>
<p>This lends itself to a series of research proposals to explore:</p>
<ol>
<li><a href="https://www.notion.so/295e7c51146780d8942bd19a2c097fe2">Research Idea: Validating AOHs: Can we do better than Dahal et al?</a></li>
<li><a href="https://www.notion.so/294e7c5114678074b2a0d861c0dc4d67">Research Idea: Towards better habitat maps using geospatial foundation models</a></li>
<li><a href="https://www.notion.so/293e7c51146780c498dccac5273ed89b">Research Idea: IUCN Red List – AI to Accelerate Red List Assessments</a></li>
</ol>
<p>But where to start? It’s clear to see in the diagram that the AOH maps are the funnel – if the AOHs are wrong, LIFE and STAR are wrong. So, as <a href="https://digitalflapjack.com/tags/validation/2025-09-22/">Michael Dales argues</a>, to be confident in LIFE and STAR being correct, we first need to validate the constituent AOHs are correct. The current validation standard was set by <a href="https://gmd.copernicus.org/articles/15/5093/2022/">Dahal et al</a>, but we think we can do better.</p>
<p>So, we’ll now move on to…  <a href="https://www.notion.so/295e7c51146780d8942bd19a2c097fe2">Research Idea: Validating AOHs: Can we do better than Dahal et al?</a></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Early Explorations: LIFE, STAR and Habitat Mapping]]></title>
            <link>https://shaneweisz.com/blog/early-explorations</link>
            <guid isPermaLink="false">https://shaneweisz.com/blog/early-explorations</guid>
            <pubDate>Tue, 14 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Early Explorations: LIFE, STAR and Habitat Mapping]]></description>
            <content:encoded><![CDATA[<p>Following on from <a href="https://www.shaneweisz.com/blog/hello-world">Hello World: My Arrival in Cambridge</a>, I have explored a number of interesting things over the past couple weeks, including:</p>
<ul>
<li>I had some great meetings with Michael Dales, discussing LIFE, STAR, the need for better habitat maps (and validation thereof), and how I could help.</li>
<li>I read lots of papers:
<ul>
<li><a href="https://royalsocietypublishing.org/doi/10.1098/rstb.2023.0327">LIFE</a> and <a href="https://www.nature.com/articles/s41559-021-01432-0">STAR</a></li>
<li>Creating habitat maps: <a href="https://www.nature.com/articles/s41597-020-00599-8">Jung et al</a> and <a href="https://www.nature.com/articles/s41597-022-01838-w">Lumbierres et al</a></li>
<li>Measuring AoH: <a href="https://www.sciencedirect.com/science/article/pii/S0169534719301892">Brooks et al</a></li>
<li>Validating AoHs: <a href="https://gmd.copernicus.org/articles/15/5093/2022/">Dahal et al</a></li>
<li>Madingley model: <a href="https://journals.plos.org/plosbiology/article?id=10.1371%2Fjournal.pbio.1001841">Harfoot et al.</a></li>
</ul>
</li>
<li>I had a great chat with Calum Maney and Professor Andrea Manica, discussing the UNEP-WCMC’s new project looking at geoindexing literature to try extract more data on data-deficient species.</li>
<li>I virtually attended some of the sessions of the <a href="https://iucncongress2025.org/"><strong>IUCN World Conservation Congress</strong></a>.</li>
<li>I attended lectures by Professor Carl Henrik Ek on <a href="https://mlatcl.github.io/mlphysical/">Machine Learning and the Physical World</a>. The course content is highly relevant to biodiversity data and ecology, where data is scarce and expensive, but we have a lot of expert knowledge.</li>
<li>I attended our Computer Science PhD Induction day and met our first-year PhD cohort.</li>
</ul>
<p>Working with Michael to improve the habitat maps underlying LIFE and STAR seems a great place to start my PhD. It’s very valuable work – these metrics are likely to be increasingly relied upon by governments, NGOs and businesses who will look to IUCN’s authority about what nature metrics to include towards nature-positive decision making (for example, see IUCN’s <a href="https://iucn.org/press-release/202510/new-star-guidance-supports-governments-and-civil-society-accelerating-nature">STAR press release</a> from last week). So improving the accuracy of these metrics could be very impactful (for example, see <a href="https://iucn.org/sites/default/files/2024-10/tri-star-layout-v03_0.pdf">this IUCN report </a>about STAR being applied in Kenya and Cameroon). We’re uniquely placed here at the CCI to collaborate with the IUCN on this. I also think I have a lot to offer here with my stats and ML background.</p>
<p>The first step is to answer the key question Michael has been facing: how do we know if a habitat map is good enough? This will involve building validation infrastructure that improves upon <a href="https://gmd.copernicus.org/articles/15/5093/2022/">Dahal et al</a>’s validation standard. We can then use this infrastructure to see if we can use ML and TESSERA embeddings to produce significantly better quality habitat maps.</p>
<p>Other advantages of this work include getting me up to speed with working with citizen science data (GBIF) and remote sensing data (via TESSERA embeddings) – data sources it’s important I get hands-on experience with, as I’ll certainly be using these a lot over the next few years. It also will get me up to speed with running large-scale geospatial processing pipelines on our computing cluster.</p>
<p>Particular thanks go out to Michael Dales who has patiently given me a lot of his time to contextualise and walk me through the LIFE and STAR pipelines and future directions. I’m excited to continue to learn from and work with him here!</p>
<p>I’d also be very excited to work with James Ball and David Coomes in their projects on vegetation mapping using TESSERA (starting with the Cairngorms, and building towards the entire UK), so I’m hoping this will all tie together – especially given including plant data in LIFE and STAR is probably the biggest opportunity for improving them.</p>
<p>Finally, taking a step back with a bigger-picture lens for my PhD, I’ve sketched out three broad  problem domains I’m interested in working on, as illustrated in the diagram below:</p>
<p><img src="https://shaneweisz.com/images/blog/early-explorations/image-1.png" alt="Diagram of possible problem domains to tackle in my PhD|800"></p>
<p>This simplifies to:</p>
<ol>
<li>Better habitat maps for STAR and LIFE (including better validation techniques)</li>
<li>Addressing data-deficiency in plant, fungi and marine data (multi-modal approaches combining literature, remote sensing, and citizen science data; and active learning approaches to guide the search for rare plant data)</li>
<li>Simulating entire ecosystems (revisiting Madingley for 2025 in the age of AI+data+compute).</li>
</ol>
<p>Each of these problems tie together and could all be interesting domains to explore during the course of my PhD. But we’ll see how things play out over the coming weeks – I’m aware that research tends to be an iterative journey that can take unexpected turns.</p>
<p>That’s all from me this week. Excited to be moving forward. Onwards and upwards!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hello World: My Arrival in Cambridge]]></title>
            <link>https://shaneweisz.com/blog/hello-world</link>
            <guid isPermaLink="false">https://shaneweisz.com/blog/hello-world</guid>
            <pubDate>Sun, 05 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Hello World: My Arrival in Cambridge]]></description>
            <content:encoded><![CDATA[<p>Hello world! This is the first of many updates I plan to post to track my PhD journey. My goal with these is two-fold: (1) to document and keep track of the work I’ve done over time, and (2) to make my thoughts publicly available to help receive early feedback. We’ll see how it goes, but my aim is to post an update roughly every two weeks.</p>
<p><img src="https://shaneweisz.com/images/blog/hello-world/image-1.png" alt=""></p>
<p>It’s been fantastic to finally be back here in Cambridge! To set the scene – after I finished my <a href="https://www.postgraduate.study.cam.ac.uk/courses/directory/egegmpmsl">Masters in Machine Learning and Machine Intelligence</a> here back in 2022, I spent a few years in industry as a Machine Learning Engineer at <a href="https://aerobotics.com/">Aerobotics</a> in Cape Town, and am now starting my <a href="https://www.cst.cam.ac.uk/admissions/phd">PhD in Computer Science</a> under the supervision of <a href="https://anil.recoil.org/">Professor Anil Madhavapeddy</a> in the <a href="https://www.cst.cam.ac.uk/research/eeg">Energy and Environment Group</a> (EEG).</p>
<p>My goal for my first few weeks has just been to start exploring initial research directions I could take for my PhD. You may be wondering – did I not need to already have decided on a topic when I applied? Kind of – whilst yes I did have to submit a proposal for my application (see my proposal on <a href="https://www.shaneweisz.com/docs/PhDResearchProposal_ShaneWeisz.pdf">automated bird call classification</a>), this was primarily just to demonstrate my ability to (a) review the current literature in a domain of interest, (b) identify research gaps and (c) present a carefully thought-out research plan. My plan for my actual PhD, however, was always to <em>iterate</em> towards a topic for impactful research once I arrive after chatting with various people in the space – with the umbrella goal of combining my background in computer science and machine learning, with my passion for nature and sustainability, to contribute towards tackling the global biodiversity crisis. And what better place to do this than as part of the <a href="https://cambridgeconservation.org/about/">Cambridge Conservation Institute</a> (CCI) (housed at the <a href="https://cambridgeconservation.org/about/the-david-attenborough-building/">David Attenborough Building</a>), home to world-leading interdisciplinary Cambridge researchers, alongside globally important conservation organisations including the United Nations Environmental Programme World Conservation Monitoring Centre (UNEP-WCMC), the International Union for Conservation of Nature (IUCN) and Birdlife International.</p>
<p><img src="https://shaneweisz.com/images/blog/hello-world/image-3.jpeg" alt=""></p>
<p>With that in mind, over the past couple weeks I’ve had meetings with various people at the EEG and CCI. It started with a fantastic first meeting with Anil in his office at Pembroke ahead of welcome drinks at the pub with a couple of his PhD students. After giving me some initial PhD advice, and some discussions about advancements in AI, we discussed initial directions I could take. Firstly, Anil mentioned there’s potential collaboration with the IUCN that he thought I could be a good fit for. They’ve recently received funding from <a href="http://google.org/">Google.org</a> to help with mapping data-deficient plant species. If I understand correctly, the IUCN guys want to see if there’s data hidden in the depths of scientific literature that could be retrieved at scale using AI. This would involve multi-modal LLMs to scan both text and graphs. There may be a particular focus on the Fynbos biome in South Africa’s Western Cape, which of course is close to my heart having studied and lived in Cape Town for several years. I still need to find out more details here but hope to have a meeting with some of the IUCN guys next week. To get an initial feel for what scanning through this literature would entail though, I downloaded the <a href="https://github.com/PLOS/allofplos">PLOS</a> corpus and got a feel for the XML and how one would access figures.</p>
<p>The other direction I could take is to build on the group’s incredibly exciting new geospatial foundation model: <a href="https://github.com/ucam-eo/tessera">TESSERA</a>. One potential exciting application for this is in habitat mapping (validated by our trip to search for hedgehog habitats that <a href="https://arstechnica.com/ai/2025/09/can-ai-detect-hedgehogs-from-space-maybe-if-you-find-brambles-first/">went viral</a>). In this vein, my next meeting involved going to the David Attenborough Building (DAB) for the first time to meet with Professor David Coomes and Dr James Ball (who was funnily enough my cricket captain at Magdalene back in 2022!) from the Plant Sciences department. They’re leading the habitat mapping work with TESSERA and are working towards a vision of a global habitat map of the earth at varying levels of granularity enabled by geospatial foundation models. David outlined their access to a great dataset of field data and aircraft-captured LIDAR scans in the Cairngorms that he’d love to test out leveraging TESSERA to analyse. I think this could be a great first project for me.
(Side note: the DAB is an extremely inspiring place, with a David Attenborough quote beneath the 17m-high living wall as you walk in, saying “There are few things more important in the world today than what you are doing here”)</p>
<p><img src="https://shaneweisz.com/images/blog/hello-world/image-2.png" alt=""></p>
<p>I also joined in on a call between 3rd year PhD student Jovana Knezevic and David Ball, where Jovana took us through the exploratory work she’s been doing comparing TESSERA and AlphaEarth for change detection.
Lastly, Sadiq invited me to meet with the Conservation Evidence (CE) team, where I met some of the amazing people working in this space including Sam Reynolds, Alec Christie and Bill Sutherland. They’ve been building AI pipelines to help scale out the CE’s hugely impactful work. This could also be an interesting area to get involved in at some point, and really impactful. I do think initially for my PhD though it’ll be useful to stay within the EEG group to get more of a lab feel, but I could perhaps pivot here at a later stage in the PhD.</p>
<p>Coming away from these meetings, I’m thinking of looking at applications of TESSERA as my starting point, for a few reasons. One, it looks like it’ll be an enjoyable, collaborative environment to be part of – with Frank (2nd year PhD), Jovana (3rd year PhD) and James (2nd year post-doc) all doing research in this space, in contrast to the Conservation Evidence space where there aren’t other PhD students working currently. Two, I like the direct real-world problems this work could contribute towards. The IUCN’s Red List is one of the world’s most important resources for conservation, and they are severely bottlenecked on their ability to do assessments by time and resources. TESSERA could be an incredibly powerful tool to help scale out mapping of habitats and plant distributions.</p>
<p>For this next week or two, my plan is to get my hands dirty with actually playing around with TESSERA, and to that end I’ll start with asking David and James for more details about the Scottish Cairngorms dataset. I’ll also be meeting with Dr Michael Dales, who has done really amazing work with habitat maps for LIFE, built fantastic python libraries in <code>yirgacheffe</code> and <code>aoh-calculator</code>, and worked closely with the IUCN, so I am very excited to pick his brain. I might also look at doing some experiments on tree species distribution mapping in Cambridge using iNaturalist data. Finally, it’s also the big IUCN conference this week which I’m considering attending virtually…</p>
<p>Lots of exciting things to explore. Onwards and upwards!</p>]]></content:encoded>
        </item>
    </channel>
</rss>