Instinct-Driven Dinosaur Agents · Path of Titans

Team Members
Alexus Aguirre Arias · Laura Wetherhold
Team Name
World's Finest
ResNet-18 · Snapshot Serengeti · Letta
Spring 2026
Gallery

Path of Titans mods

Mods in pages/paleo-presentation/mods/ — click a tile to open full size.

Problem Statement

How do you give an AI agent animal instinct?

stadia_controller
Task Definition
Control a dinosaur in Path of Titans with biologically-grounded survival decisions: flee predators, find prey, manage hunger/thirst/stamina — without game API access
public
Why It Matters
Instinct-level AI is an open problem: rule-based bots lack adaptation; RL requires simulators. Wildlife ML + vision offers a novel grounded alternative
photo_camera
Data Source
Snapshot Serengeti / Dryad camera-trap dataset
2.65M sequences, 7.1M images, real predator/prey ecology
bolt
Key Output
Real-time companion HUD + optional keyboard control loop: observe screen → classify threat → decide action → actuate keys, all at 4–10 FPS
Main claim: Wildlife camera-trap data can bootstrap game-agent threat detection. Players benefit from safer assistance; developers benefit from interpretable behavior-testing tools.
Key Technical Challenges

What makes this hard

balance
~98% Class Imbalance
Non-predator images dominate Serengeti, so predicting non-predator for every image scores ~98% accuracy while catching zero predators. Macro F1 + predator recall expose true model quality
🎮→📸
Domain Shift
Training on savanna camera-traps; inference on rendered 3D game frames. Only raw screen pixels via mss screen capture at runtime
hub
No Game API
Path of Titans has no bot/API hooks. All input comes from simulated keystrokes (keyboard lib) with rate-limiting + emergency-stop (F12) safety guard
neurology
Memory & Personality
Decisions must be species-consistent over time. Primal Mind stores personality + goal + recent-experience blocks, routing through Letta tool surface
inventory_2
Multi-Source Data
Serengeti is the supervised source because species labels map cleanly into predator vs non-predator.
bolt
Real-Time Constraint
Observe→decide→act loop must run at 4–10 FPS. ResNet-18 inference on CPU must complete in <80ms per tick; rate-limited actions every ≥350ms
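The imbalance point above can be made concrete with a toy example. This sketch uses an illustrative 98/2 class mix (not the real Serengeti counts) to show why accuracy is a trivial metric here while macro F1 collapses:

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Illustrative 98/2 split: 980 non-predator (0), 20 predator (1).
y_true = [0] * 980 + [1] * 20
all_negative = [0] * 1000        # "predict non-predator everywhere"

accuracy = sum(t == p for t, p in zip(y_true, all_negative)) / len(y_true)
# accuracy is 0.98, but predator F1 is 0, so macro F1 drops to ~0.495
```

A model that never predicts "predator" looks excellent on accuracy and terrible on macro F1 and predator recall, which is exactly why those are the headline metrics.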
Proposed Solution

System Pipeline Overview

Perceive → Think → Decide → Act: grab the frame and HUD cues → refresh Primal Mind → choose an action → send keys/mouse safely.
📸 mss
Screen Capture
🔬 ResNet-18
Classifier
🌡️ frame_to
_observation()
🧠 Instinct
Agent
🗺️ Action
Mapper
⌨️ Safe Input
Controller
PALEOOverlay.exe — tiny on-game HUD: debug readout, live frame preview, demos, start/stop loop.
Letta is the goal — that’s where the agent gets real power (memory, tools, deeper decisions). Right now we run a simple offline brain in the loop — good for demos and iteration, but still faulty. HUD bars/icons are parsed in code; wiki RAG lives locally for mechanics lookup.
visibility Vision Layer (src/pot.py)
  • ScreenCaptureWorker → BGRA array via mss
  • classify_frame_predator_probability() → ResNet-18 softmax class-1
  • Replaces pixel heuristic (brightness/motion) when checkpoint provided
  • CaptureFrame stores raw frame_bgr for inference
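The no-checkpoint fallback (the brightness/motion heuristic the classifier replaces) can be sketched like this. Function name, weights, and the scoring formula are illustrative assumptions, not the repo's exact `frame_to_observation()` code:

```python
import numpy as np

def heuristic_threat_score(frame_bgr, prev_gray=None):
    """Illustrative stand-in for the no-checkpoint path: combine mean
    brightness with frame-to-frame motion into a threat score in [0, 1].
    Weights (0.5 / 2.0) are made up for the sketch."""
    gray = frame_bgr.mean(axis=2)                      # cheap grayscale
    brightness = gray.mean() / 255.0                   # 0..1
    motion = 0.0
    if prev_gray is not None:
        motion = np.abs(gray - prev_gray).mean() / 255.0
    score = min(1.0, 0.5 * brightness + 2.0 * motion)
    return score, gray

# Any bright, fast-changing region drives the score up -- which is
# exactly why this heuristic fires on day/night cycles and weather.
```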
neurology Decision Layer (src/agent.py)
  • Primal Mind: personality + goals + 5-event experience ring
  • Instinct Agent: species-specific thresholds per action
  • Letta tool surface: update_memory, log_thought, get_status
  • Wiki/RAG search hook planned for game-mechanics context
  • Outputs: flee / hunt / forage / idle / drink
analytics Training (src/image_training.py)
  • ResNet-18 + ImageNet transfer, frozen backbone option
  • 12 Serengeti evaluations across LR, augmentation, and epoch settings
  • PoT fine-tune scripts add 300-game-screenshot adaptation runs
  • evaluation outputs include accuracy, predator recall, F1, and confusion matrices
dataset Data (src/data.py)
  • Deterministic JSONL manifest; stable SHA-256 split seeding
  • Serengeti: Dryad consensus_data.csv + all_images.csv
  • Kaggle review: one dataset was accelerometer CSV; the other was unlabeled videos
  • DatasetRecord: sample_id, image_path, species, predator_label, split, source
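The stable-seeding idea behind the manifest split can be sketched as follows. Hashing a seed string plus `sample_id` gives every record a reproducible pseudo-uniform value, so train/val assignment survives row reordering and reruns; the function name and seed string are illustrative, not data.py's exact code:

```python
import hashlib

def assign_split(sample_id: str, val_fraction: float = 0.25,
                 seed: str = "paleo-v1") -> str:
    """Deterministically map a sample id to 'train' or 'val'.
    SHA-256 of (seed + id) yields a stable value in [0, 1]."""
    digest = hashlib.sha256(f"{seed}:{sample_id}".encode()).hexdigest()
    u = int(digest[:8], 16) / 0xFFFFFFFF
    return "val" if u < val_fraction else "train"

# Same id, same split -- no matter the machine or the row order.
```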
In-game

Path of Titans — gameplay & UI

Screens from pages/paleo-presentation/game/ — click a tile to open full size.

Method Overview

Model architecture + agent design

🔬 ResNet-18 Classifier
→ ImageNet pretrained backbone (frozen early layers)
→ fc replaced: 512 → 2 (non-predator / predator)
→ Input: 224×224, ImageNet normalize [0.485, 0.456, 0.406]
→ Augment: RandomHorizontalFlip + ColorJitter(0.3)
→ Inference: softmax(logits)[1] = predator probability
→ Runtime: BGR→RGB→PIL→transforms→unsqueeze→model
🛡️ Safe Control Architecture
→ mode: advice (no keys) | dry_run | control (live)
→ min_action_interval: 350ms rate limiter
→ F12 emergency stop — poll every tick
→ keyboard.press() → sleep(80ms) → keyboard.release()
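The mode gate plus rate limiter can be sketched as a small guard class. The class name is hypothetical and the key send is injected as a callback so the sketch runs headless; the real controller wraps the `keyboard` library and additionally polls F12 every tick:

```python
import time
from typing import Callable

class SafeController:
    """Illustrative guard around key presses: mode gate + 350 ms rate limit."""

    def __init__(self, mode: str = "advice", min_interval: float = 0.35,
                 clock: Callable[[], float] = time.monotonic):
        assert mode in ("advice", "dry_run", "control")
        self.mode, self.min_interval, self.clock = mode, min_interval, clock
        self._last = float("-inf")

    def press(self, key: str, send: Callable[[str], None]) -> bool:
        now = self.clock()
        if self.mode != "control":        # advice/dry_run never touch keys
            return False
        if now - self._last < self.min_interval:
            return False                  # rate-limited
        self._last = now
        send(key)                         # e.g. a keyboard press/release pair
        return True
```

Injecting the clock makes the rate limiter unit-testable without real sleeps.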
🧠 Primal Mind State
personality: aggression, curiosity, fear, sociability
goals: [survive, find_water, grow, hunt_small_prey]
recent_events: ring buffer, last 5 events
vitals: health, stamina, hunger, thirst
species: determines threshold table
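The state block above maps naturally onto a dataclass with a fixed-size ring buffer. Field names follow the deck; defaults are illustrative:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class PrimalMind:
    """Sketch of the Primal Mind state; defaults are illustrative."""
    species: str = "raptor"
    personality: dict = field(default_factory=lambda: dict(
        aggression=0.5, curiosity=0.5, fear=0.5, sociability=0.5))
    goals: list = field(default_factory=lambda: [
        "survive", "find_water", "grow", "hunt_small_prey"])
    vitals: dict = field(default_factory=lambda: dict(
        health=1.0, stamina=1.0, hunger=0.0, thirst=0.0))
    recent_events: deque = field(default_factory=lambda: deque(maxlen=5))

    def remember(self, event: str) -> None:
        self.recent_events.append(event)   # 6th event evicts the oldest

mind = PrimalMind()
for e in ["spawn", "saw_water", "heard_roar", "ate", "drank", "fled"]:
    mind.remember(e)
# the ring buffer keeps only the last 5 events
```

`deque(maxlen=5)` gives the 5-event experience ring for free: appends past capacity silently drop the oldest entry.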
🎯 Decision Logic
if predator_prob > 0.6 → flee
elif prey_density > 0.5 and hunger > 0.6 → hunt
elif hunger > 0.7 → forage
elif thirst > 0.7 → drink
else → idle / explore
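The threshold cascade above, written out as a minimal function (species-specific threshold tables and the explore branch omitted for brevity):

```python
def decide(predator_prob: float, prey_density: float,
           hunger: float, thirst: float) -> str:
    """Direct transcription of the deck's decision thresholds."""
    if predator_prob > 0.6:
        return "flee"                       # safety always wins
    elif prey_density > 0.5 and hunger > 0.6:
        return "hunt"
    elif hunger > 0.7:
        return "forage"
    elif thirst > 0.7:
        return "drink"
    return "idle"
```

Ordering matters: flee is checked first so a visible predator overrides hunger and thirst.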
On the site

Agent architecture & demo

Right after the architecture overview on this deck — open the live HTML tabs (new tab keeps the presentation).

Experimental Setup

Dataset · Baselines · Metrics · Protocol

📦 Dataset
Source: Snapshot Serengeti / Dryad
Modality: images + CSV metadata
Manifest: 50,000 labeled rows
Balanced images: nearly 10k local JPEGs
Class mix: predator / non-predator balanced
Split: 6,529 train / 2,142 validation
PoT fine-tune: 300 screenshots → 240 train / 60 validation
PoT holdout: 10 newest labeled game screenshots
Annotation: species/file names mapped to predator_label
Ethics: public wildlife data; game screenshots only from our test set
Kaggle: deferred; no labeled image set
📐 Baselines
Heuristic: majority class on balanced val set

OpenCV Rule: brightness + motion threshold (deterministic, no training)

Serengeti sweep: 12 evaluated ResNet-18 runs

PoT safety tests: 15-epoch fine-tunes with LR, class weight, and threshold sweeps
📊 Metrics & Protocol
Primary: validation accuracy + predator recall
Run mode: balanced training/evaluation sweep + PoT fine-tune
Training subset: balanced predator + non-predator images
Output: failure_analysis_comparison.json + PoT eval JSONs
Figures: validation accuracy + confusion matrices
12
Serengeti eval runs
8,671
Serengeti split images
300
PoT fine-tune labels
10
PoT holdout images
Major Results

Latest Serengeti results: at 15 epochs, LR=1e-4 was the best-accuracy run

Latest Serengeti validation accuracy by experiment using the 2142-image validation split
0.9118
Best validation accuracy
ResNet-18, LR=1e-4, augmentation, 15 epochs
0.9206
Predator recall
939 predators caught, 81 missed on validation
0.5238
Heuristic baseline
Simple baseline before ResNet learning
2,142
Validation images
Same holdout split across all Serengeti experiments
Experiment | Val Accuracy | Pred Recall | Pred F1 | LR | Epochs
ResNet-18 LR=1e-4 + aug | 0.9118 | 0.9206 | 0.9086 | 1e-4 | 15
ResNet-18 LR=1e-3 + aug | 0.9080 | 0.9020 | 0.9033 | 1e-3 | 10
ResNet-18 LR=1e-3 + aug | 0.9066 | 0.9167 | 0.9034 | 1e-3 | 15
ResNet-18 LR=1e-4 + aug | 0.9062 | 0.8990 | 0.9012 | 1e-4 | 10
ResNet-18 LR=5e-5 + aug | 0.8987 | 0.8569 | 0.8896 | 5e-5 | 15
Convergence across epoch settings
Convergence curves comparing training and validation loss across epoch and learning-rate experiment runs
Model-selection result: the Serengeti sweep used a 6,529 train / 2,142 validation split. On that stable validation set, 1e-4 + augmentation + 15 epochs was the strongest run, so we used it as the best real-image checkpoint before adapting to Path of Titans.
Ablation Studies

Augmentation dominated the strongest Serengeti runs

Validation accuracy by augmentation setting
Color-coded bar and scatter plot showing augmented Serengeti runs outperforming most non-augmented runs
Blue = with augmentation
Augmented runs used RandomHorizontalFlip and ColorJitter(0.3).
Pattern
The best run used augmentation: 0.9118 accuracy, 0.9206 predator recall, 0.9086 predator F1.
Decision
Because augmented models dominated the upper ranks, the transfer checkpoint kept augmentation.
12
runs compared
2,142
validation images
6/6
top-half augmented
1e-4
best LR + aug
Domain Shift + Safety Tuning

Why the selected model changed

300-image PoT fine-tune validation
Confusion matrix comparison on the 60-image Path of Titans validation split
Agent safety holdout
Confusion matrix comparison for baseline and safety-tuned operating point
1e-4 is still the accuracy pick: it was strongest on Serengeti and on the 300-screenshot PoT validation split. But on the 10-image game holdout, class weighting alone did not fix false negatives at the default threshold. We switched from "best accuracy checkpoint" to a safety operating point: lr=3e-5, predator class weight 3.0, and threshold 0.20, because the live agent should over-warn rather than miss predators.
Label source matters
Kaggle videos could be converted into frames, but without predator/non-predator labels they cannot support supervised evaluation. Serengeti species labels keep the experiment measurable.
🎮 Domain Shift: Savanna → 3D Game
Camera-trap JPEG textures differ fundamentally from rendered 3D polygon meshes. In-game predator appearance (e.g., T-Rex) has no Serengeti analogue. The current classifier degrades on game-engine imagery
📊 Pixel Heuristic Fallback Is Blind
When no checkpoint is loaded, frame_to_observation() uses brightness + motion score as a proxy for threat. This fires on any moving bright object (day/night cycle, weather) — high false positive rate
⏱️ Latency Budget Under Pressure
ResNet-18 on CPU runs ~120ms/frame on mid-range hardware, exceeding the 80ms target for 10 FPS and dropping the effective rate to ~6 FPS. GPU inference or model distillation (MobileNet) is required for smooth real-time operation
🧠 Primal Mind Has No Episodic Context
Recent-events memory is still a short runtime window, so the agent cannot reason far back in time (for example, repeated attacks from one direction over longer sequences). Longer-horizon memory or Letta RAG integration is still needed.
Threshold beat retraining alone
The weighted fine-tune still missed too many predators at default threshold. Lowering the predator threshold to 0.20 on the weighted run moved recall from 0.571 to 0.714 on the tiny holdout, trading one extra false alarm for fewer missed threats.
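The reported holdout numbers are mutually consistent with a split of 7 predators and 3 non-predators, which is an inference from the metrics, not something stated in the deck. Under that assumption, the threshold-0.20 operating point corresponds to TP=5, FP=1, FN=2, TN=2, which this sketch verifies:

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    return dict(
        accuracy=(tp + tn) / total,
        precision=tp / (tp + fp) if tp + fp else 0.0,
        recall=tp / (tp + fn) if tp + fn else 0.0,
    )

# Confusion counts consistent with the 10-image holdout at threshold 0.20
# (the 7-predator / 3-non-predator split is inferred, not stated):
m = metrics(tp=5, fp=1, fn=2, tn=2)
# accuracy 0.70, precision 5/6 ~ 0.833, recall 5/7 ~ 0.714
```

At these counts, the recall move from 0.571 (4/7) to 0.714 (5/7) costs exactly one extra false alarm, matching the trade described above.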
Conclusion & Future Work

Final model choices across real images, game screenshots, and safety holdout

task_alt Contributions
  • Built an end-to-end screen-capture, classify, think, and act pipeline with no game API.
  • Trained and compared 12 Serengeti ResNet-18 experiments on the same 2,142-image validation split.
  • Selected the best real-image checkpoint: 1e-4 + aug + 15 epochs, 0.9118 validation accuracy and 0.9206 predator recall.
  • Fine-tuned that checkpoint on 300 labeled Path of Titans screenshots; 1e-4 remained the best model on the 60-image validation split at 0.7667 accuracy.
  • Chose a separate live-agent safety operating point for the 10-image holdout: lr=3e-5, predator weight 3.0, threshold 0.20.
  • Moved holdout predator recall from 0.571 to 0.714 by prioritizing fewer missed predators over raw accuracy.
  • Added Primal Mind state, explainable thoughts, guarded keyboard control, rate limits, and F12 emergency stop.
Future Work
  • Add more labeled Path of Titans holdout images, especially predator cases.
  • Run the full focused in-game smoke test with capture, advice mode, and simple movement.
  • Wire live Letta memory/RAG into the middle of the decision loop.
  • Try MobileNetV3 or distillation for faster inference.
  • Keep optimizing around predator recall, then report accuracy as the secondary metric.
Best real-image checkpoint
0.9118
Serengeti val accuracy
ResNet-18 1e-4 + aug + 15 epochs; predator recall 0.9206, predator F1 0.9086.
Best 300-screenshot PoT model
0.7667
60-image val accuracy
Fine-tuned from Serengeti checkpoint with lr=1e-4, 15 epochs; split was 240 train / 60 validation.
Best 10-image safety operating point
0.714
predator recall
lr=3e-5, predator weight 3.0, threshold 0.20; accuracy 0.70, precision 0.833.
"A dinosaur that knows it's in danger, even when no one tells it."
github.com/PALEO-AI-System/PALEO
Demo

PALEO in action

Screen recording from pages/paleo-presentation/demo/. Use the player controls to play or scrub.

Explore

Profiles, HUD & skins

Opens in a new tab. Paths are relative to the site pages/ root.