
⁉️Can Claude Code pass the IUCN Red List Assessor Exam?

TL;DR

Yes!

Summary

I recently completed the IUCN Red List Assessor Training course, achieving 80% in the final exam and receiving my official certification (you need >75% to pass the exam). Upon completing it, I was curious how Claude Code would do, so I decided to put it to the test.

So, how does Claude Code do? Pretty good! It passed four out of five exam runs, averaging 80%.

Claude Code’s exam results. The top row is my own personal attempt, the bottom 5 are Claude Code’s. Claude Code got the highest grade of 88%.

Humans are allowed to repeat the exam as many times as needed until they pass, so four out of five is a very good result. Moreover, I am confident the incorrect answers are not due to an innate limitation, but rather that they just require more careful context engineering.

I remain confident that AI can significantly help the IUCN scale up Red List Assessments.

Background:

What I did:

  • First, I needed a way to get the exam questions into a format the AI could easily parse. To do this, I used Claude Code to create the following scripts:

    • A script `extract_exam.py` to parse `questions.md` out of the exam page’s raw HTML.
    • A script `extract_memo.py` to extract a `memo.txt` from the HTML of a submitted exam attempt.
    • A script `grade_attempt.py` to grade a set of AI-generated answers (`answers.txt`) against the `memo.txt` (a minimal sketch of this grading step is shown below).
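
Here’s a minimal sketch of the kind of grading logic `grade_attempt.py` might use; the one-answer-per-line file format and case-insensitive exact matching are assumptions based on how the files are described above, not the actual script:

```python
# Minimal grading sketch (assumed format: one answer per line, answers.txt and
# memo.txt aligned by question number). Real grading is exact-format matching.
from pathlib import Path
import sys

def grade(answers_path: str, memo_path: str) -> None:
    answers = Path(answers_path).read_text().splitlines()
    memo = Path(memo_path).read_text().splitlines()
    correct = 0
    for i, (given, expected) in enumerate(zip(answers, memo), start=1):
        ok = given.strip().lower() == expected.strip().lower()
        correct += ok
        print(f"Q{i:02d}: {'✓' if ok else '✗'}  given={given.strip()!r}  expected={expected.strip()!r}")
    print(f"\nScore: {correct}/{len(memo)} ({100 * correct / len(memo):.0f}%)")

if __name__ == "__main__":
    grade(sys.argv[1], sys.argv[2])
```
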
  • Next I needed to get the IUCN Red List guideline PDFs into a text format that the AI could read.

    • For this, I used the Claude Code PDF Skill to create a script `parse_pdf.py` that takes a PDF and outputs a corresponding markdown file along with its associated images and diagrams. The resulting directory looks roughly like this (illustrative layout; the actual filenames are described in the SKILL.md below):
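
```
docs/
├── user-guidelines/
│   ├── user-guidelines-v16.md        (searchable markdown of the PDF)
│   └── images/
│       ├── page_10_img_1.png         (extracted diagrams and flowcharts)
│       └── ...
├── categories-and-criteria/
│   └── categories-and-criteria-v3.1.md
├── supporting-info/
│   └── supporting-info-guidelines.md
├── mapping-standards/
│   └── mapping-standards-v1.20.md
└── regional-guidelines/
    ├── RL-2012-002.md
    └── images/
```
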
  • I then used Claude Code to design a red-list-assessor-skill that points the AI to the relevant official guidance docs.

Here’s the SKILL.md:

```markdown
---
name: iucn-red-list
description: Assist with IUCN Red List species threat assessments. Use when the user asks about Red List categories, criteria thresholds, EOO/AOO calculations, assessment documentation, range mapping standards, generation length, population decline analysis, or preparing species assessments for threatened species.
license: Public Domain (official IUCN documents)
---

# IUCN Red List Assessor

Expert assistance with IUCN Red List assessments using official IUCN documents as the authoritative source.

## Source Documents

Official IUCN guidelines documents are embedded in this skill at `docs/`:

1. **user-guidelines/** - Guidelines for Using the IUCN Red List Categories and Criteria (v16, March 2024, 122 pages)
   - How to apply criteria, calculation methods, edge cases, examples
   - `user-guidelines-v16.md` - Searchable markdown with tables and image references
   - `images/` - Extracted diagrams, flowcharts, and decision trees

2. **categories-and-criteria/** - IUCN Red List Categories and Criteria (v3.1, 38 pages)
   - Categories (EX, EW, CR, EN, VU, NT, LC, DD), criteria (A-E) with thresholds, definitions
   - `categories-and-criteria-v3.1.md` - Searchable markdown

3. **supporting-info/** - Supporting Information Guidelines (68 pages)
   - Documentation requirements, IUCN Threats Classification Scheme v3.2
   - `supporting-info-guidelines.md` - Searchable markdown

4. **mapping-standards/** - Mapping Standards and Data Quality (v1.20, 32 pages)
   - EOO/AOO calculation, GIS requirements (WGS84)
   - `mapping-standards-v1.20.md` - Searchable markdown

5. **regional-guidelines/** - Guidelines for Application of IUCN Red List Criteria at Regional and National Levels (v4.0, 41 pages)
   - Regional assessment protocol, rescue effect, endemism, inclusion thresholds
   - `RL-2012-002.md` - Searchable markdown
   - `images/` - Flowcharts and decision diagrams

## Instructions

When assessing species or answering questions:

### 1. Document Research
- **Search the relevant document** - Use the Read or Grep tool to find exact guidance in the appropriate markdown file (e.g., `user-guidelines-v16.md`)
- **View diagrams when needed** - When decision trees, flowcharts, or criteria tables are referenced, use the Read tool on the image files in the `images/` directory (e.g., `docs/user-guidelines/images/page_10_img_1.png` for the Red List categories diagram)
- **Cite sources** - Always reference document name, section, and page number (e.g., "According to User Guidelines Section 4.4, page 45...")
- **Quote exactly** - Use exact definitions and thresholds from the source documents
- **Exhaustive checking** - For documentation/standards questions, check BOTH general principles AND specific examples; don't rely solely on bulleted lists

### 2. Systematic Criterion Evaluation

**CRITICAL: Always evaluate ALL criteria and ALL sub-criteria to find the HIGHEST qualifying category.**

Use this systematic checklist:

**Criterion A (Population Reduction):**
- □ A1 (past reduction, causes understood/reversible/ceased)
- □ A2 (past reduction, causes may not be understood/reversible/ceased)
- □ A3 (future reduction, projected)
- □ A4 (past+future reduction)

**Criterion B (Geographic Range):**
- □ B1 (Extent of Occurrence - EOO)
- □ B2 (Area of Occupancy - AOO)
- For each, check sub-criteria:
  - □ a. Severely fragmented OR number of locations (≤1, ≤5, ≤10)
  - □ b. Continuing decline in: (i) EOO, (ii) AOO, (iii) habitat, (iv) locations/subpops, (v) mature individuals
  - □ c. Extreme fluctuations

**Criterion C (Small Population + Decline):**
- □ C1 (population <2500/10000 AND decline % within timeframe)
  - Check timeframes: CR=3yrs/1gen, EN=5yrs/2gen, VU=10yrs/3gen
- □ C2 (population <2500/10000 AND continuing decline AND either):
  - □ C2a(i) (≥90%/95% decline in 3/5 years OR 2/3 generations)
  - □ C2a(ii) (≥90%/95%/100% of mature individuals in one subpopulation)
  - □ C2b (extreme fluctuations)

**Criterion D (Very Small or Restricted):**
- □ D (population <50/250/1000)
- □ D2 (restricted area/locations with plausible threat)

**Criterion E (Quantitative Analysis):**
- □ E (extinction probability from PVA)

**After evaluation:**
- Identify ALL qualifying criteria and sub-criteria
- Select the HIGHEST category among all qualifying criteria
- Don't stop at the first qualifying criterion
- Combine all qualifying criteria in final code (e.g., "EN C1+2a(ii); D")

### 3. Regional/National Assessment Protocol

For regional or national red list assessments (see **regional-guidelines/RL-2012-002.md** for full protocol):

**Step 1: Check Inclusion Threshold (if specified)**
- ALWAYS calculate actual percentages when thresholds are mentioned
- Extract global population size from global IUCN criteria code
  - VU C: <10,000 individuals
  - EN C: <2,500 individuals
  - EN D: <250 individuals
  - VU D1: <1,000 individuals
- Calculate: (regional population / global population) × 100%
- Decision:
  - IF percentage < threshold → Category: **NA (Not Applicable)**
  - IF percentage ≥ threshold → Proceed to full regional assessment

**Step 2: Preliminary Regional Assessment**
- Apply IUCN criteria using ONLY the regional population data
- Determine preliminary category based on regional population size, decline, range, etc.

**Step 3: Consider Rescue Effect**
- Can individuals migrate into the region from outside populations?
- Is suitable habitat available for immigrants to establish?
- Does the regional population rely on immigration?
- If YES to all: Consider downlisting (rescue effect possible)
- If NO (population isolated): Keep preliminary category

**Step 4: Final Regional Category**
- Apply adjustments based on rescue effect analysis
- Document both preliminary and final categories with reasoning

### 4. Near Threatened (NT) Assessment

Check for NT when species doesn't qualify for threatened categories:

**Apply NT if:**
- Species meets a geographic/population threshold BUT doesn't meet the required number of sub-criteria
- Species is "close to" qualifying:
  - Near ≤10 locations threshold (e.g., 11-15 locations)
  - Just below percentage thresholds (e.g., 25-29% for VU A)
  - Meets 1 of 2 required sub-criteria under Criterion B
- Species is likely to qualify in near future

**Example:** EOO <20,000 km² (meets VU B1 threshold) with 12 locations and declining, but only 1 of 2 required sub-criteria met → Consider NT

### 5. Calculation Methods and Precision

**For all calculations:**
- Show ALL calculation steps explicitly with formulas
- State which method/formula is being used
- For population reduction, determine if decline is linear or exponential first
- Double-check arithmetic before final rounding
- When multiple valid methods exist (e.g., empirical vs. formula-based generation length):
  - Consider both methods
  - Provide both values if question context suggests it
  - Note which method is being used for each value

**Generation Length:**
- Check if empirical data available (average age of breeding individuals)
- Check if formula-based calculation needed: (age at first breeding + age at last breeding) / 2
- For exploited populations: Use PRE-DISTURBANCE generation length
- If both empirical and formula values exist, consider providing both

**Population Reduction:**
- Determine timeframe: max(10 yrs, 3 generations) for A2, etc.
- Assess pattern: linear vs. exponential
- For exponential: Use formula (1 - (N_final/N_initial)^(target_period/observed_period))
- Round to integer only at final step

### 6. Combine Multiple Criteria
- If multiple criteria are met, combine them (e.g., "EN C1+2a(ii); D")
- Use highest qualifying category as primary
- List all qualifying criteria in standard IUCN format
```
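
To make the calculation guidance above concrete, here’s a small worked example with hypothetical numbers, applying the generation-length formula and the exponential-reduction formula quoted in the skill:

```python
# Worked example with hypothetical numbers, using the formulas quoted in the skill above.

# Generation length (formula-based): (age at first breeding + age at last breeding) / 2
age_first_breeding = 4
age_last_breeding = 14
generation_length = (age_first_breeding + age_last_breeding) / 2  # = 9 years

# Criterion A timeframe: the longer of 10 years or 3 generations
timeframe = max(10, 3 * generation_length)  # = 27 years

# Population reduction, assuming an exponential decline observed over a shorter window:
# reduction = 1 - (N_final / N_initial) ** (target_period / observed_period)
n_initial, n_final = 10_000, 7_000          # counts taken 12 years apart (hypothetical)
observed_period, target_period = 12, timeframe
reduction = 1 - (n_final / n_initial) ** (target_period / observed_period)
print(f"Generation length: {generation_length} yrs, 3-generation window: {timeframe} yrs")
print(f"Projected reduction over {target_period} yrs: {reduction:.1%}")  # ≈ 55.2%
```

In this hypothetical, a projected reduction of roughly 55% over three generations would fall in the Endangered band (≥50%) of criterion A2.
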
  • Next I added a Claude Code `/attempt-exam-question` slash command to answer a given question using the `red-list-assessor-skill`, outputting the final answer along with clear, concise reasoning.
<details>
<summary>Here’s the attempt-exam-question.md:</summary>

````markdown
# Attempt Exam Question Command

You are tasked with answering a single IUCN Red List Global and Regional Assessor Exam question.

## Important Information

**Before you start, please read the following information.**

- This is an open-book exam. Use all of the resources available to help you answer the questions.
- Refer to the IUCN Red List Categories and Criteria. Version 3.1 and the associated IUCN guidelines documents and Red List assessment tools.
- Some questions will require you to calculate parameters such as reduction, generation length, continuing decline, etc. Use whatever tools you need to help you with this (e.g., the current version of the Guidelines for Using the IUCN Red List Categories and Criteria, calculator, internet, etc).

**Question Format Guidelines:**
- **Short Answer Questions**: Read each question carefully before giving your answer. Some answers must be entered as an integer number without decimals (e.g., if "9" is the correct answer, writing "9.13", "nine", or "9 locations" will be marked as incorrect). Some questions require you to enter the appropriate two-letter IUCN Red List Category, and some require the appropriate IUCN Red List Category and Criteria code. Ensure you use the appropriate format when entering these codes (see Annex 2 of the IUCN Red List Categories and Criteria. Version 3.1).
- **Multiple-choice questions**: All multiple-choice questions allow you to select one or more answer. At least one of the answers provided is correct, but do not assume that there will always be more than one correct answer. If any of your selected answers are incorrect your overall score for the question will be zero (even if one of your selected answers is the correct one).

## Input

The user will provide:
1. `exam_name` - The exam identifier (e.g., "exam1", "exam2")
2. `question_number` - The question number (1-25)
3. `solution_folder` - The path to the solution folder where the answer file should be written

## Workflow

### 1. Load the IUCN Red List Assessor Skill

Activate the `iucn-red-list-assessor` skill to access specialized knowledge and tools for answering IUCN Red List assessment questions.

### 2. Read the Question

Read the question from: `exams/questions/{exam_name}/{exam_name}_q{question_number}.md`

Format the question number with leading zeros (e.g., q01, q02, ..., q25)

### 3. Answer the Question

Use the IUCN Red List Assessor skill to:
- Analyze the question carefully
- Apply relevant IUCN criteria and guidelines
- Perform any necessary calculations
- Determine the correct answer in the exact format required

### 4. Write the Answer File

Create a file `q{question_number}.md` (with leading zeros) in the solution folder containing:

```markdown
## Question {N}

[Copy the full question text here]

## Answer

[Your exact answer in the required format]

## Explanation

[Clear, concise explanation showing how the answer was derived, including:
- Key information from the question
- Relevant IUCN criteria or guidelines applied
- Any calculations performed
- Reasoning for the final answer]
```

## Important Notes

- Ensure answers follow the exact format specified in each question (e.g., integer only, two-letter code, comma-separated list)
- For multiple-choice questions with checkboxes, use comma-separated lowercase letters (e.g., "a, b", "b, c, f")
- For IUCN categories, use exact format from Annex 2 of the Categories and Criteria document
- Be extremely careful with formatting - incorrect format = zero marks even if conceptually correct
- Round numbers as instructed (e.g., to nearest integer) and follow the exact format requested
- The answer in the ## Answer section will be extracted by the aggregation script (markdown formatting like bold/italic will be automatically stripped)

````

</details>

  • I then added an `/attempt-exam` slash command that instructs Claude Code to spin out 25 parallel `Task` agents, one for each question, and run `/attempt-exam-question` on each.
<details>
<summary>Here’s the prompt. Note that the exam instructions are the same ones given to human trainees.</summary>

```markdown
# Attempt Exam Command

You are tasked with attempting the IUCN Red List Global and Regional Assessor Exam.

## Important Information

**Before you start, please read the following information.**

- This is an open-book exam. Use all of the resources available to help you answer the questions.
- Refer to the IUCN Red List Categories and Criteria. Version 3.1 and the associated IUCN guidelines documents and Red List assessment tools.
- Some questions will require you to calculate parameters such as reduction, generation length, continuing decline, etc. Use whatever tools you need to help you with this (e.g., the current version of the Guidelines for Using the IUCN Red List Categories and Criteria, calculator, internet, etc).

The exam contains a range of question types, including multiple-choice and short answer questions:
- **Short Answer Questions**: It is very important to read each question carefully before giving your answer. For example: Some answers must be entered as an integer number without decimals (e.g., if "9" is the correct answer, writing "9.13", "nine", or "9 locations" will be marked as incorrect). Some questions require you to enter the appropriate two-letter IUCN Red List Category, and some require the appropriate IUCN Red List Category and Criteria code. Ensure you use the appropriate format when entering these codes (see Annex 2 of the IUCN Red List Categories and Criteria. Version 3.1).
- **Multiple-choice questions**: All multiple-choice questions allow you to select one or more answer. At least one of the answers provided is correct, but do not assume that there will always be more than one correct answer. If any of your selected answers are incorrect your overall score for the question will be zero (even if one of your selected answers is the correct one).

## Input

The user will provide an exam name (e.g., "exam2"). This corresponds to a folder in `/home/sw984/ai-red-list-assessor/exams/questions/{exam_name}/`.

## Workflow

### 1. Setup Phase

- Read the pre-parsed exam questions from `exams/questions/{exam_name}/`
- Each question is in a separate file: `{exam_name}_q01.md` through `{exam_name}_q25.md`
- Create a timestamped solution folder: `exams/attempts/{exam_name}/{YYYYMMDD_HHMMSS}_{claude|codex|gemini|cursor}`
- Use format like: `20251107_153045_claude`

### 2. Parallel Question Processing

Launch 25 Task agents **in parallel** (one message with 25 Task tool calls), one for each question. Each Task agent should:
- Invoke the `/attempt-exam-question` slash command with the appropriate parameters
- Pass the exam_name, question_number (1-25), and solution_folder path
- The slash command will handle reading the question, using the IUCN skill, and writing the answer file

Example Task agent prompt format:

Please invoke the /attempt-exam-question command to answer question {N} for {exam_name}.

The solution folder is: {solution_folder_path}

Use this invocation: /attempt-exam-question {exam_name} {N} {solution_folder_path}


### 3. Aggregation Phase

After all 25 Task agents complete:
- Run the aggregation script: `python3 scripts/aggregate_answers.py {solution_folder_path}`
- This script will:
  - Read all `q01.md` through `q25.md` files
  - Extract the answer from each file's `## Answer` section
  - Create `answers.txt` with exactly 25 lines (one answer per line)
  - Report any missing or failed answer files

### 4. Completion

- Inform the user where the solutions are saved
- Provide the path to `answers.txt`
- Report the summary from the aggregation script

## Important Notes

- This command orchestrates 25 parallel Task agents, each invoking the `/attempt-exam-question` slash command
- The `/attempt-exam-question` command handles individual question processing, including activating the IUCN Red List Assessor skill
- You can test individual questions independently by running: `/attempt-exam-question <exam_name> <question_number> <solution_folder>`
- Ensure answers follow the exact format specified in each question (e.g., integer only, two-letter code, comma-separated list)
- For multiple-choice questions with checkboxes, use comma-separated lowercase letters (e.g., "a, b", "b, c, f")
- For IUCN categories, use exact format from Annex 2 of the Categories and Criteria document
- Be extremely careful with formatting - incorrect format = zero marks even if conceptually correct
```

</details>
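
For reference, here’s roughly what `scripts/aggregate_answers.py` does, based on the behaviour described in the prompt above; the parsing details are my assumptions rather than the exact script:

```python
# Sketch of the aggregation step: pull the "## Answer" section out of q01.md..q25.md
# and write one answer per line to answers.txt. Parsing details are assumptions.
import re
import sys
from pathlib import Path

def extract_answer(md_text: str) -> str | None:
    match = re.search(r"^## Answer\s*\n(.*?)(?=^## |\Z)", md_text, re.S | re.M)
    if not match:
        return None
    answer = " ".join(match.group(1).split())      # collapse whitespace
    return re.sub(r"[*_`]", "", answer).strip()    # strip markdown emphasis

def aggregate(solution_folder: str, n_questions: int = 25) -> None:
    folder = Path(solution_folder)
    answers, missing = [], []
    for n in range(1, n_questions + 1):
        path = folder / f"q{n:02d}.md"
        answer = extract_answer(path.read_text()) if path.exists() else None
        if answer is None:
            missing.append(path.name)
            answer = ""
        answers.append(answer)
    (folder / "answers.txt").write_text("\n".join(answers) + "\n")
    print(f"Wrote {len(answers)} answers; missing/failed: {missing or 'none'}")

if __name__ == "__main__":
    aggregate(sys.argv[1])
```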

  • I then extracted five exams of 25 questions each from the official webpage and ran Claude Code on each one. Here are some screenshots showing what the Claude Code workflow looks like in action:
1: Claude Code starting the exam.
2: Claude Code in action, answering the questions in parallel and referencing the User Guidelines.
3: Claude Code with the completed answers, showing token usage and time taken per question. (Q3 took 7m 3s since it was a weird format.)
4: Claude Code completing the exam and aggregating the answers.
5: Where Claude Code outputs its answers.
6: An example of Claude Code’s answer to `q01.md`.

Results

So, how did it actually do?

  • Across the 5 exam runs, Claude Code’s results averaged 81%:
    Claude Code’s exam results. The top row is my own personal attempt, the bottom 5 are Claude Code’s. Claude Code got the highest grade of 88%.
  • Not bad! And the questions it got wrong tended to be ambiguous or borderline cases. With some further context engineering, I’m confident it can score even higher.

Next Steps

Polish up a demo for the IUCN:
  • Longer-term, the vision is for this framework to serve as a test-bed for a practical AI tool to help accelerate assessments. This vision is spelled out in more depth in Untitled.

    • To that end, we would benefit significantly from access to the IUCN’s SIS (Species Information Service), which stores the history of all assessments, including the succession of rejected drafts before the published version was ready.
    • This would be an extremely rich dataset. As just one example, this could be used to design an evaluation framework for an AI validation system for catching errors in Red List assessment first drafts.
  • However, to get the IUCN’s buy-in, it may be helpful to present this to them in a more accessible way than this technical report.

  • Some of Anil’s suggestions include:

    1. A neat UI to visualize it in action. Try Tailwind and daisyUI.
    2. Add a click-through from answers to the source material in the PDF. Get the source coordinates from pdfplumber, then draw a polygon over the cited passage in the original PDF (see the sketch after this list).
    3. Visualize tool use in action. Try the Claude Code Python SDK to stream this live.
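
For suggestion 2, here’s a rough sketch of the idea using pdfplumber; the file path, page number, and search term are placeholders, and a real click-through would map each cited answer back to its exact source span:

```python
# Rough idea for suggestion 2: find the cited words on a page and highlight them.
# Path, page number, and search term are placeholders.
import pdfplumber

with pdfplumber.open("docs/user-guidelines/user-guidelines-v16.pdf") as pdf:
    page = pdf.pages[44]                      # page cited in the answer (0-indexed)
    words = page.extract_words()              # each word carries x0, x1, top, bottom
    hits = [w for w in words if w["text"].lower().startswith("generation")]
    image = page.to_image(resolution=150)
    image.draw_rects([(w["x0"], w["top"], w["x1"], w["bottom"]) for w in hits])
    image.save("cited_passage.png")
```
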
Better logging and monitoring:
  • Cleanly store all prompts, tool calls, tasks, results etc. launched by Claude Code. This will be important for improvements going forward.
Experiment with context engineering:
  • Try injecting the full guidelines doc into the context window (using prompt caching to reduce costs), and use the model’s built-in attention mechanism rather than relying on agentic search (a rough sketch of this is shown below).
  • Try RAG approaches for semantic retrieval.
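
As a rough sketch of the first idea, assuming direct API access via the Anthropic Python SDK rather than a Claude Code subscription, the full guidelines text could be sent as a cached system block so that repeated questions don’t pay for it each time (the model name and paths are placeholders):

```python
# Sketch: load the full guidelines into a cached system block so that all 25 questions
# reuse the same prefix. Assumes the Anthropic Python SDK and direct API access.
import anthropic
from pathlib import Path

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
guidelines = Path("docs/user-guidelines/user-guidelines-v16.md").read_text()

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        system=[
            {"type": "text",
             "text": "You are an IUCN Red List assessor. Answer exam questions using the guidelines below."},
            {"type": "text", "text": guidelines,
             "cache_control": {"type": "ephemeral"}},  # cache the large guidelines prefix
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```
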
Recursive prompt improvement:
  • Design a Claude Code framework that iteratively refines its own prompts until it achieves full marks on the exam questions (sketched below).
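
One possible shape for that loop, reusing `grade_attempt.py` (very schematic; the headless `claude -p` invocation, file paths, and the prompt-revision step are all assumptions rather than a working implementation):

```python
# Very schematic: attempt the exam, grade it, then ask Claude to rewrite its own prompt
# based on the graded results. CLI invocation and paths are assumptions.
import subprocess
from pathlib import Path

def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

prompt_path = Path(".claude/commands/attempt-exam-question.md")
for _ in range(5):
    run(["claude", "-p", "/attempt-exam exam1"])  # produces answers.txt in the attempt folder
    report = run(["python3", "scripts/grade_attempt.py", "answers.txt", "memo.txt"])
    if "25/25" in report:
        break
    revised = run(["claude", "-p",
                   "Here is the current exam-question prompt:\n\n" + prompt_path.read_text()
                   + "\n\nHere are the graded results:\n\n" + report
                   + "\n\nRewrite the prompt to avoid the mistakes. Output only the new prompt."])
    prompt_path.write_text(revised)
```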

Appendix

What else did I learn during this sprint?

  • With all the parallel agents, I ran out of Claude Code Pro credits quite quickly, which slowed experimentation while I waited a few hours for my session credits to refresh. It might make sense to upgrade to Claude Code Max, or to use the API directly to leverage prompt caching and get finer-grained control over credit usage.
  • This sprint has made clear to me the real tradeoffs between (a) task performance, (b) token usage & cost, (c) model size, and (d) level of reasoning. The ideal combination is to answer the question correctly, using as few tokens as possible, with as small and cheap a model as possible. This has made the importance of context engineering extra clear.
  • Modularity is still important in AI workflow design. Originally, I had one mega-prompt. Splitting this up into subtasks and scripts was extremely helpful. Using AI to explore a solution and then encoding this in a deterministic script is very useful.
  • I’ve also learned it’s important not to overfit a solution to the problem – i.e. agentic workflows, skills, subagents, etc. are not always the right answer. Working through this made it clearer in my own head where AI agents and tool use are useful and where they’re not: they are primarily helpful when we need to dynamically run scripts or perform calculations, such as computing generation length or population reduction rates.