Datasets & models — BirdLense Hub
Formats, scripts, sources, and training hardware. End-to-end training: TRAINING.
Library operational flow (Hub)
Critical daily operator happy-path in Library:
- Import from disk (Scan and import).
- Regenerate for the period (Spectrograms -> Tracks).
- Export dataset ZIP (optional: only manually corrected).
- Maintenance: use retro-export for backfill and clean dataset for cleanup.
The “All time” range
Library now includes an “All time” preset. It does not guess from the calendar; it derives the range from recordings actually present on disk (storage/stats), so it can safely target the whole archive without manual date hunting.
Practical guidance:
- start with the last 7 or 30 days if you want to estimate runtime first;
- use “All time” when the device is idle and not busy with live capture;
- on very large archives, track regeneration is usually the heaviest operation, then spectrogram regeneration; dataset ZIP export is usually lighter when crops already exist.
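The idea behind the “All time” preset (derive the range from what is on disk, not from the calendar) can be sketched as follows. This is a minimal illustration, not the Hub's implementation: the function names, the `*.mp4` pattern, and the use of file modification time as the recording timestamp are all assumptions.

```python
from datetime import date, datetime
from pathlib import Path
from typing import Iterable, Iterator, Optional, Tuple


def all_time_range(timestamps: Iterable[datetime]) -> Optional[Tuple[date, date]]:
    """Return (first_day, last_day) covered by the recordings, or None if there are none."""
    days = [ts.date() for ts in timestamps]
    if not days:
        return None
    return min(days), max(days)


def recording_timestamps(root: Path, pattern: str = "*.mp4") -> Iterator[datetime]:
    """Yield one timestamp per recording file (assumption: mtime stands in for capture time)."""
    for path in root.rglob(pattern):
        if path.is_file():
            yield datetime.fromtimestamp(path.stat().st_mtime)


# Usage (hypothetical recordings directory):
# print(all_time_range(recording_timestamps(Path("app/data/recordings"))))
```

Returning `None` for an empty archive lets the caller disable the preset instead of falling back to a guessed date range.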
The system metric "Unique visitors" is the number of SpeciesVisit sessions in the selected period: it counts visit sessions, not unique individual birds.
Train-ready export
In Library -> Export dataset, enable "Train-ready (auto train/val split, no post-script)".
Optionally enable "Add test split (~10%)" to include test/<class>/... (hold-out).
The ZIP will include:
- train/<class>/..., val/<class>/..., and optionally test/<class>/...
- classes.txt
- dataset_info.json — export passport (manifest.schema=birdlense_dataset_export_v2, filters, split_seed, fingerprint_sha256_16) and a quality block: duplicate (video_id, track_id) rows and cross-split video_id leakage.
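The two checks reported in the quality block can be reproduced on your own data with a few lines. The sketch below assumes each exported image is described by a `(split, video_id, track_id)` row; that row shape and the keys of the returned dictionary are hypothetical and do not mirror the actual layout of dataset_info.json.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

Row = Tuple[str, int, int]  # (split, video_id, track_id): assumed row shape


def quality_report(rows: Iterable[Row]) -> Dict[str, list]:
    """Find duplicate (video_id, track_id) rows and video_ids that appear in more than one split."""
    rows = list(rows)

    # A track exported more than once shows up as a repeated (video_id, track_id) pair.
    track_counts = Counter((video_id, track_id) for _, video_id, track_id in rows)
    duplicates = sorted(key for key, count in track_counts.items() if count > 1)

    # Leakage: frames from the same video landing in different splits.
    splits_by_video: Dict[int, Set[str]] = defaultdict(set)
    for split, video_id, _ in rows:
        splits_by_video[video_id].add(split)
    leaked = sorted(v for v, splits in splits_by_video.items() if len(splits) > 1)

    return {"duplicate_tracks": duplicates, "leaked_video_ids": leaked}


rows = [("train", 1, 1), ("train", 1, 2), ("val", 1, 3), ("val", 2, 1), ("val", 2, 1)]
print(quality_report(rows))
# {'duplicate_tracks': [(2, 1)], 'leaked_video_ids': [1]}
```

Leakage is checked per video rather than per track because frames from one video tend to be near-duplicates, so splitting them across train and val inflates validation accuracy.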
API: GET /api/ui/dataset/export supports test_ratio and strict_quality=1 (abort on duplicate tracks, cross-split video leakage, or — with ready_for_train — any class below min_images_per_class).
This removes the mandatory intermediate scripts/datasets/export_birdlense_to_yolo.py step for the basic finetuning path.
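For scripted exports, the request URL can be assembled as below. Only the path and the test_ratio and strict_quality parameters come from this document; the base URL, the helper name, and the assumption that test_ratio is a fraction such as 0.1 are illustrative.

```python
from urllib.parse import urlencode


def export_url(base_url: str, test_ratio: float = 0.1, strict_quality: bool = True) -> str:
    """Build the GET URL for a dataset export with an optional strict quality gate."""
    params = {"test_ratio": test_ratio}  # assumed to be a fraction (0.1 for ~10%)
    if strict_quality:
        params["strict_quality"] = 1
    return f"{base_url.rstrip('/')}/api/ui/dataset/export?{urlencode(params)}"


# Hypothetical host; replace with the address of your Hub.
print(export_url("http://birdlense.local:8080"))
# http://birdlense.local:8080/api/ui/dataset/export?test_ratio=0.1&strict_quality=1
```

The resulting URL can be fetched with any HTTP client, for example `urllib.request.urlretrieve(url, "dataset.zip")`. With strict_quality=1 the export aborts on quality problems, so handle an error response instead of assuming a ZIP was written.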
1. Models
| Component | Version | Trained on |
|---|---|---|
| Detector | YOLOv8n | NABirds + COCO birds + OIDv4 squirrel (binary bird/squirrel) |
| EU classifier | YOLO11n-cls | birds-525 + iNaturalist (~491 species) — active best.pt |
| US classifier | YOLOv8n-cls | NABirds (~400 species) — best_US.pt |
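Switch to US: in app/processor/models/classification/weights/, run cp best_US.pt best.pt. This overwrites the active EU weights, so keep a copy of best.pt if you want to switch back.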
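The same switch can be scripted so that the previous weights are not lost. This is a sketch only: the helper name and the best.pt.bak backup file are assumptions, not part of BirdLense.

```python
import shutil
from pathlib import Path


def activate_weights(weights_dir: Path, source_name: str = "best_US.pt") -> Path:
    """Copy source_name over best.pt, keeping the previous best.pt as best.pt.bak."""
    source = weights_dir / source_name
    active = weights_dir / "best.pt"
    if not source.is_file():
        raise FileNotFoundError(source)
    if active.exists():
        # Backup file name is an assumption made for this example.
        shutil.copy2(active, weights_dir / "best.pt.bak")
    shutil.copy2(source, active)
    return active


# Usage:
# activate_weights(Path("app/processor/models/classification/weights"))
```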
2. Name format: Scientific (Common)
Shared convention for merge, Frigate, BirdNET, YOLO:
| Source | Raw | Normalized |
|---|---|---|
| Frigate | Cardinalis cardinalis (Northern Cardinal) | as-is |
| iNaturalist | Columba palumbus | Columba palumbus (Common Wood Pigeon) |
| birds-525 | GOLDEN_EAGLE | Aquila chrysaetos (Golden Eagle) |
YOLO cls folders: train/Parus major (Great Tit)/img.jpg, same class names under val/.
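The three normalization cases in the table above can be expressed as one function. This is a hedged sketch and not the logic of the project's scripts: the function name and the scientific-to-common lookup table are assumptions, and a real pipeline needs a full taxonomy mapping.

```python
import re
from typing import Dict, Optional

# Matches names already in "Genus species (Common Name)" form.
_NORMALIZED = re.compile(r"^[A-Z][a-z-]+(?: [a-z-]+)+ \(.+\)$")


def normalize_name(raw: str, sci_to_common: Dict[str, str]) -> Optional[str]:
    """Normalize a label to "Scientific (Common)"; return None if it cannot be resolved."""
    raw = raw.strip()
    if _NORMALIZED.match(raw):
        return raw  # Frigate-style labels are kept as-is
    if raw in sci_to_common:
        return f"{raw} ({sci_to_common[raw]})"  # iNaturalist: scientific name only
    # birds-525 style label, e.g. GOLDEN_EAGLE -> "Golden Eagle"
    common = raw.replace("_", " ").title()
    for scientific, name in sci_to_common.items():
        if name.lower() == common.lower():
            return f"{scientific} ({name})"
    return None


lookup = {"Columba palumbus": "Common Wood Pigeon", "Aquila chrysaetos": "Golden Eagle"}
print(normalize_name("Columba palumbus", lookup))  # Columba palumbus (Common Wood Pigeon)
print(normalize_name("GOLDEN_EAGLE", lookup))      # Aquila chrysaetos (Golden Eagle)
```

Returning `None` for unresolved labels makes it easy to log and review them before merging, instead of silently creating a new class folder.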
3. Scripts (scripts/datasets/)
EU classifier (birds-525 + iNaturalist)
| Script | Role |
|---|---|
| export_birdlense_to_yolo.py | BirdLense local crops (app/data/dataset/train) → YOLO cls train/val |
| download_hf_birds.py | Hugging Face → YOLO cls (--format scientific_common) |
| download_inaturalist.py | iNaturalist Europe → YOLO cls |
| merge_classification_datasets.py | Merge splits |
| download_and_merge_all.sh | Full pipeline → merged_cls |
Detector (legacy)
| Script | Role |
|---|---|
| convert_nabirds_to_yolo.py | NABirds → YOLO |
| download_coco_birds.py | COCO birds for binary |
| merge_datasets_binary.py | NABirds + COCO → binary |
Weights (app/processor/models/)
| Path | Role |
|---|---|
| classification/weights/best.pt | EU classifier (default) |
| classification/weights/best_US.pt | US backup |
| detection/weights/best.pt | Binary detector |
4. Public datasets
EU (primary)
| Dataset | Species | Link |
|---|---|---|
| 34data/birds-525-species | 525 | Hugging Face |
| iNaturalist Europe | many | API, e.g. place_id=96372 |
North America (weak signal for EU accuracy)
| Dataset | Species |
|---|---|
| NABirds | ~400 |
| sasha/birdsnap | 500 |
| randall-lab/cub200 | 200 |
5. Hardware for training
| Platform | GPU | Cost |
|---|---|---|
| Google Colab | T4 (15 GB) | Free tier |
| RunPod | RTX 4090, A100 | ~$0.40–0.80/h |
| Local | Your GPU | — |
Practical default: Colab Free (T4) — see TRAINING.
6. Pipeline: collect → train
BirdLense recordings → export_birdlense_to_yolo.py → YOLO dataset
↓
birds-525 + iNaturalist → merge_classification_datasets.py → merged_cls
↓
TRAINING.md (Colab) → best.pt
7. Publishing artifacts
| Platform | Use |
|---|---|
| Hugging Face | gfermoto/birds-eu-merged, gfermoto/birdlense-birds-eu — see TRAINING |
| Zenodo | DOI snapshots for papers |