Instinct-Driven Dinosaur Agents · Path of Titans

Team Members
Alexus Aguirre Arias · Laura Wetherhold
Team Name
World's Finest
ResNet-18 · Snapshot Serengeti · Letta
Spring 2026
Gallery

Path of Titans mods

Mods in pages/paleo-presentation/mods/ — click a tile to open full size.

Problem Statement

How do you give an AI agent animal instinct?

stadia_controller
Task Definition
Control a dinosaur in Path of Titans with biologically-grounded survival decisions: flee predators, find prey, manage hunger/thirst/stamina — without game API access
public
Why It Matters
Instinct-level AI is an open problem: rule-based bots lack adaptation; RL requires simulators. Wildlife ML + vision offers a novel grounded alternative
photo_camera
Data Source
Snapshot Serengeti / Dryad camera-trap dataset
2.65M sequences, 7.1M images, real predator/prey ecology
bolt
Key Output
Real-time companion HUD + optional keyboard control loop: observe screen → classify threat → decide action → actuate keys, all at 4–10 FPS
Main claim: Wildlife camera-trap data can bootstrap game-agent threat detection. Players benefit from safer assistance; developers benefit from interpretable behavior-testing tools.
Key Technical Challenges

What makes this hard

balance
~98% Class Imbalance
Non-predator images dominate Serengeti, so predicting non-predator for every image scores ~98% accuracy while catching zero predators. Macro F1 + predator recall expose true model quality
🎮→📸
Domain Shift
Training on savanna camera-traps; inference on rendered 3D game frames. Only raw screen pixels via mss screen capture at runtime
hub
No Game API
Path of Titans has no bot/API hooks. All input comes from simulated keystrokes (keyboard lib) with rate-limiting + emergency-stop (F12) safety guard
neurology
Memory & Personality
Decisions must be species-consistent over time. Primal Mind stores personality + goal + recent-experience blocks, routing through Letta tool surface
inventory_2
Multi-Source Data
Serengeti is the supervised source because species labels map cleanly into predator vs non-predator.
bolt
Real-Time Constraint
Observe→decide→act loop must run at 4–10 FPS. ResNet-18 inference on CPU must complete in <80ms per tick; rate-limited actions every ≥350ms
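The imbalance point above can be made concrete with a toy example. This sketch uses an illustrative 98/2 class mix (not the real Serengeti counts) to show why accuracy is a trivial metric here while macro F1 collapses:

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Illustrative 98/2 split: 980 non-predator (0), 20 predator (1).
y_true = [0] * 980 + [1] * 20
all_negative = [0] * 1000        # "predict non-predator everywhere"

accuracy = sum(t == p for t, p in zip(y_true, all_negative)) / len(y_true)
# accuracy is 0.98, but predator F1 is 0, so macro F1 drops to ~0.495
```

A model that never predicts "predator" looks excellent on accuracy and terrible on macro F1 and predator recall, which is exactly why those are the headline metrics.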
Proposed Solution

System Pipeline Overview

Perceive → Think → Decide → Act: grab the frame and HUD cues → refresh Primal Mind → choose an action → send keys/mouse safely.
📸 mss
Screen Capture
🔬 ResNet-18
Classifier
🌡️ frame_to
_observation()
🧠 Instinct
Agent
🗺️ Action
Mapper
⌨️ Safe Input
Controller
PALEOOverlay.exe — tiny on-game HUD: debug readout, live frame preview, demos, start/stop loop.
Letta is the goal — that’s where the agent gets real power (memory, tools, deeper decisions). Right now we run a simple offline brain in the loop — good for demos and iteration, but still faulty. HUD bars/icons are parsed in code; wiki RAG lives locally for mechanics lookup.
visibility Vision Layer (src/pot.py)
  • ScreenCaptureWorker → BGRA array via mss
  • classify_frame_predator_probability() → ResNet-18 softmax class-1
  • Replaces pixel heuristic (brightness/motion) when checkpoint provided
  • CaptureFrame stores raw frame_bgr for inference
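The no-checkpoint fallback (the brightness/motion heuristic the classifier replaces) can be sketched like this. Function name, weights, and the scoring formula are illustrative assumptions, not the repo's exact `frame_to_observation()` code:

```python
import numpy as np

def heuristic_threat_score(frame_bgr, prev_gray=None):
    """Illustrative stand-in for the no-checkpoint path: combine mean
    brightness with frame-to-frame motion into a threat score in [0, 1].
    Weights (0.5 / 2.0) are made up for the sketch."""
    gray = frame_bgr.mean(axis=2)                      # cheap grayscale
    brightness = gray.mean() / 255.0                   # 0..1
    motion = 0.0
    if prev_gray is not None:
        motion = np.abs(gray - prev_gray).mean() / 255.0
    score = min(1.0, 0.5 * brightness + 2.0 * motion)
    return score, gray

# Any bright, fast-changing region drives the score up -- which is
# exactly why this heuristic fires on day/night cycles and weather.
```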
neurology Decision Layer (src/agent.py)
  • Primal Mind: personality + goals + 5-event experience ring
  • Instinct Agent: species-specific thresholds per action
  • Letta tool surface: update_memory, log_thought, get_status
  • Wiki/RAG search hook planned for game-mechanics context
  • Outputs: flee / hunt / forage / idle / drink
analytics Training (src/image_training.py)
  • ResNet-18 + ImageNet transfer, frozen backbone option
  • 12 Serengeti evaluations across LR, augmentation, and epoch settings
  • PoT fine-tune scripts add 300-game-screenshot adaptation runs
  • evaluation outputs include accuracy, predator recall, F1, and confusion matrices
dataset Data (src/data.py)
  • Deterministic JSONL manifest; stable SHA-256 split seeding
  • Serengeti: Dryad consensus_data.csv + all_images.csv
  • Kaggle review: one dataset was accelerometer CSV; the other was unlabeled videos
  • DatasetRecord: sample_id, image_path, species, predator_label, split, source
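The stable-seeding idea behind the manifest split can be sketched as follows. Hashing a seed string plus `sample_id` gives every record a reproducible pseudo-uniform value, so train/val assignment survives row reordering and reruns; the function name and seed string are illustrative, not data.py's exact code:

```python
import hashlib

def assign_split(sample_id: str, val_fraction: float = 0.25,
                 seed: str = "paleo-v1") -> str:
    """Deterministically map a sample id to 'train' or 'val'.
    SHA-256 of (seed + id) yields a stable value in [0, 1]."""
    digest = hashlib.sha256(f"{seed}:{sample_id}".encode()).hexdigest()
    u = int(digest[:8], 16) / 0xFFFFFFFF
    return "val" if u < val_fraction else "train"

# Same id, same split -- no matter the machine or the row order.
```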
In-game

Path of Titans — gameplay & UI

Screens from pages/paleo-presentation/game/ — click a tile to open full size.

Method Overview

Model architecture + agent design

🔬 ResNet-18 Classifier
→ ImageNet pretrained backbone (frozen early layers)
→ fc replaced: 512 → 2 (non-predator / predator)
→ Input: 224×224, ImageNet normalize [0.485, 0.456, 0.406]
→ Augment: RandomHorizontalFlip + ColorJitter(0.3)
→ Inference: softmax(logits)[1] = predator probability
→ Runtime: BGR→RGB→PIL→transforms→unsqueeze→model
🛡️ Safe Control Architecture
→ mode: advice (no keys) | dry_run | control (live)
→ min_action_interval: 350ms rate limiter
→ F12 emergency stop — poll every tick
→ keyboard.press() → sleep(80ms) → keyboard.release()
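The mode gate plus rate limiter can be sketched as a small guard class. The class name is hypothetical and the key send is injected as a callback so the sketch runs headless; the real controller wraps the `keyboard` library and additionally polls F12 every tick:

```python
import time
from typing import Callable

class SafeController:
    """Illustrative guard around key presses: mode gate + 350 ms rate limit."""

    def __init__(self, mode: str = "advice", min_interval: float = 0.35,
                 clock: Callable[[], float] = time.monotonic):
        assert mode in ("advice", "dry_run", "control")
        self.mode, self.min_interval, self.clock = mode, min_interval, clock
        self._last = float("-inf")

    def press(self, key: str, send: Callable[[str], None]) -> bool:
        now = self.clock()
        if self.mode != "control":        # advice/dry_run never touch keys
            return False
        if now - self._last < self.min_interval:
            return False                  # rate-limited
        self._last = now
        send(key)                         # e.g. a keyboard press/release pair
        return True
```

Injecting the clock makes the rate limiter unit-testable without real sleeps.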
🧠 Primal Mind State
personality: aggression, curiosity, fear, sociability
goals: [survive, find_water, grow, hunt_small_prey]
recent_events: ring buffer, last 5 events
vitals: health, stamina, hunger, thirst
species: determines threshold table
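The state block above maps naturally onto a dataclass with a fixed-size ring buffer. Field names follow the deck; defaults are illustrative:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class PrimalMind:
    """Sketch of the Primal Mind state; defaults are illustrative."""
    species: str = "raptor"
    personality: dict = field(default_factory=lambda: dict(
        aggression=0.5, curiosity=0.5, fear=0.5, sociability=0.5))
    goals: list = field(default_factory=lambda: [
        "survive", "find_water", "grow", "hunt_small_prey"])
    vitals: dict = field(default_factory=lambda: dict(
        health=1.0, stamina=1.0, hunger=0.0, thirst=0.0))
    recent_events: deque = field(default_factory=lambda: deque(maxlen=5))

    def remember(self, event: str) -> None:
        self.recent_events.append(event)   # 6th event evicts the oldest

mind = PrimalMind()
for e in ["spawn", "saw_water", "heard_roar", "ate", "drank", "fled"]:
    mind.remember(e)
# the ring buffer keeps only the last 5 events
```

`deque(maxlen=5)` gives the 5-event experience ring for free: appends past capacity silently drop the oldest entry.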
🎯 Decision Logic
if predator_prob > 0.6 → flee
elif prey_density > 0.5 and hunger > 0.6 → hunt
elif hunger > 0.7 → forage
elif thirst > 0.7 → drink
else → idle / explore
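The threshold cascade above, written out as a minimal function (species-specific threshold tables and the explore branch omitted for brevity):

```python
def decide(predator_prob: float, prey_density: float,
           hunger: float, thirst: float) -> str:
    """Direct transcription of the deck's decision thresholds."""
    if predator_prob > 0.6:
        return "flee"                       # safety always wins
    elif prey_density > 0.5 and hunger > 0.6:
        return "hunt"
    elif hunger > 0.7:
        return "forage"
    elif thirst > 0.7:
        return "drink"
    return "idle"
```

Ordering matters: flee is checked first so a visible predator overrides hunger and thirst.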
On the site

Agent architecture & demo

Right after the architecture overview on this deck — open the live HTML tabs (new tab keeps the presentation).

Experimental Setup

Dataset · Baselines · Metrics · Protocol

📦 Dataset
Source: Snapshot Serengeti / Dryad
Modality: images + CSV metadata
Manifest: 50,000 labeled rows
Balanced images: nearly 10k local JPEGs
Class mix: predator / non-predator balanced
Split: 6,529 train / 2,142 validation
PoT fine-tune: 300 screenshots → 240 train / 60 validation
PoT holdout: 10 newest labeled game screenshots
Annotation: species/file names mapped to predator_label
Ethics: public wildlife data; game screenshots only from our test set
Kaggle: deferred; no labeled image set
📐 Baselines
Heuristic: majority class on balanced val set

OpenCV Rule: brightness + motion threshold (deterministic, no training)

Serengeti sweep: 12 evaluated ResNet-18 runs

PoT safety tests: 15-epoch fine-tunes with LR, class weight, and threshold sweeps
📊 Metrics & Protocol
Primary: validation accuracy + predator recall
Run mode: balanced training/evaluation sweep + PoT fine-tune
Training subset: balanced predator + non-predator images
Output: failure_analysis_comparison.json + PoT eval JSONs
Figures: validation accuracy + confusion matrices
12
Serengeti eval runs
8,671
Serengeti split images
300
PoT fine-tune labels
10
PoT holdout images
Major Results

Latest Serengeti results: at 15 epochs, LR=1e-4 was the best-accuracy run

Latest Serengeti validation accuracy by experiment using the 2142-image validation split
0.9118
Best validation accuracy
ResNet-18, LR=1e-4, augmentation, 15 epochs
0.9206
Predator recall
939 predators caught, 81 missed on validation
0.5238
Heuristic baseline
Simple baseline before ResNet learning
2,142
Validation images
Same holdout split across all Serengeti experiments
Experiment | Val Accuracy | Pred Recall | Pred F1 | LR | Epochs
ResNet-18 LR=1e-4 + aug | 0.9118 | 0.9206 | 0.9086 | 1e-4 | 15
ResNet-18 LR=1e-3 + aug | 0.9080 | 0.9020 | 0.9033 | 1e-3 | 10
ResNet-18 LR=1e-3 + aug | 0.9066 | 0.9167 | 0.9034 | 1e-3 | 15
ResNet-18 LR=1e-4 + aug | 0.9062 | 0.8990 | 0.9012 | 1e-4 | 10
ResNet-18 LR=5e-5 + aug | 0.8987 | 0.8569 | 0.8896 | 5e-5 | 15
Convergence across epoch settings
Convergence curves comparing training and validation loss across epoch and learning-rate experiment runs
Model-selection result: the Serengeti sweep used a 6,529 train / 2,142 validation split. On that stable validation set, 1e-4 + augmentation + 15 epochs was the strongest run, so we used it as the best real-image checkpoint before adapting to Path of Titans.
Ablation Studies

Augmentation dominated the strongest Serengeti runs

Validation accuracy by augmentation setting
Color-coded bar and scatter plot showing augmented Serengeti runs outperforming most non-augmented runs
Blue = with augmentation
Augmented runs used RandomHorizontalFlip and ColorJitter(0.3).
Pattern
The best run used augmentation: 0.9118 accuracy, 0.9206 predator recall, 0.9086 predator F1.
Decision
Because augmented models dominated the upper ranks, the transfer checkpoint kept augmentation.
12
runs compared
2,142
validation images
6/6
top-half augmented
1e-4
best LR + aug
Domain Shift + Safety Tuning

Why the selected model changed

300-image PoT fine-tune validation
Confusion matrix comparison on the 60-image Path of Titans validation split
Agent safety holdout
Confusion matrix comparison for baseline and safety-tuned operating point
1e-4 is still the accuracy pick: it was strongest on Serengeti and on the 300-screenshot PoT validation split. But on the 10-image game holdout, class weighting alone did not fix false negatives at the default threshold. We switched from "best accuracy checkpoint" to a safety operating point: lr=3e-5, predator class weight 3.0, and threshold 0.20, because the live agent should over-warn rather than miss predators.
Label source matters
Kaggle videos could be converted into frames, but without predator/non-predator labels they cannot support supervised evaluation. Serengeti species labels keep the experiment measurable.
🎮 Domain Shift: Savanna → 3D Game
Camera-trap JPEG textures differ fundamentally from rendered 3D polygon meshes. In-game predator appearance (e.g., T-Rex) has no Serengeti analogue. The current classifier degrades on game-engine imagery
📊 Pixel Heuristic Fallback Is Blind
When no checkpoint is loaded, frame_to_observation() uses brightness + motion score as a proxy for threat. This fires on any moving bright object (day/night cycle, weather) — high false positive rate
⏱️ Latency Budget Under Pressure
ResNet-18 on CPU runs ~120ms/frame on mid-range hardware, exceeding the 80ms target for 10 FPS and dropping the effective rate to ~6 FPS. GPU inference or model distillation (MobileNet) is required for smooth real-time operation
🧠 Primal Mind Has No Episodic Context
Recent-events memory is still a short runtime window, so the agent cannot reason far back in time (for example, repeated attacks from one direction over longer sequences). Longer-horizon memory or Letta RAG integration is still needed.
Threshold beat retraining alone
The weighted fine-tune still missed too many predators at default threshold. Lowering the predator threshold to 0.20 on the weighted run moved recall from 0.571 to 0.714 on the tiny holdout, trading one extra false alarm for fewer missed threats.
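The reported holdout numbers are mutually consistent with a split of 7 predators and 3 non-predators, which is an inference from the metrics, not something stated in the deck. Under that assumption, the threshold-0.20 operating point corresponds to TP=5, FP=1, FN=2, TN=2, which this sketch verifies:

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    return dict(
        accuracy=(tp + tn) / total,
        precision=tp / (tp + fp) if tp + fp else 0.0,
        recall=tp / (tp + fn) if tp + fn else 0.0,
    )

# Confusion counts consistent with the 10-image holdout at threshold 0.20
# (the 7-predator / 3-non-predator split is inferred, not stated):
m = metrics(tp=5, fp=1, fn=2, tn=2)
# accuracy 0.70, precision 5/6 ~ 0.833, recall 5/7 ~ 0.714
```

At these counts, the recall move from 0.571 (4/7) to 0.714 (5/7) costs exactly one extra false alarm, matching the trade described above.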
Conclusion & Future Work

Final model choices across real images, game screenshots, and safety holdout

task_alt Contributions
  • Built an end-to-end screen-capture, classify, think, and act pipeline with no game API.
  • Trained and compared 12 Serengeti ResNet-18 experiments on the same 2,142-image validation split.
  • Selected the best real-image checkpoint: 1e-4 + aug + 15 epochs, 0.9118 validation accuracy and 0.9206 predator recall.
  • Fine-tuned that checkpoint on 300 labeled Path of Titans screenshots; 1e-4 remained the best model on the 60-image validation split at 0.7667 accuracy.
  • Chose a separate live-agent safety operating point for the 10-image holdout: lr=3e-5, predator weight 3.0, threshold 0.20.
  • Moved holdout predator recall from 0.571 to 0.714 by prioritizing fewer missed predators over raw accuracy.
  • Added Primal Mind state, explainable thoughts, guarded keyboard control, rate limits, and F12 emergency stop.
Future Work
  • Add more labeled Path of Titans holdout images, especially predator cases.
  • Run the full focused in-game smoke test with capture, advice mode, and simple movement.
  • Wire live Letta memory/RAG into the middle of the decision loop.
  • Try MobileNetV3 or distillation for faster inference.
  • Keep optimizing around predator recall, then report accuracy as the secondary metric.
Best real-image checkpoint
0.9118
Serengeti val accuracy
ResNet-18 1e-4 + aug + 15 epochs; predator recall 0.9206, predator F1 0.9086.
Best 300-screenshot PoT model
0.7667
60-image val accuracy
Fine-tuned from Serengeti checkpoint with lr=1e-4, 15 epochs; split was 240 train / 60 validation.
Best 10-image safety operating point
0.714
predator recall
lr=3e-5, predator weight 3.0, threshold 0.20; accuracy 0.70, precision 0.833.
"A dinosaur that knows it's in danger, even when no one tells it."
github.com/PALEO-AI-System/PALEO
Demo

PALEO in action

Screen recording from pages/paleo-presentation/demo/. Use the player controls to play or scrub.

Explore

Profiles, HUD & skins

Opens in a new tab. Paths are relative to the site pages/ root.