Voxel51

News, tutorials, tips, and big ideas in computer vision and data-centric machine learning, from the company behind open source FiftyOne. Learn more at https://voxel51.com

Beyond the Microscope: Diving into BIOSCAN-5M, a New Dataset for Insect Biodiversity Research

9 min read · Feb 13, 2025

BIOSCAN in FiftyOne

The Power of Multi-Modal Data

Key features of the BIOSCAN-5M dataset

Building a Robust Dataset

Exploring BIOSCAN-5M in FiftyOne

!pip install fiftyone open-clip-torch umap-learn transformers

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the BIOSCAN-30k dataset from the Hugging Face Hub
dataset = load_from_hub(
    "Voxel51/BIOSCAN-30k",
    name="bioscan30k",
    overwrite=True,
    persistent=True,
)

import os
from getpass import getpass

# Prompt for the Mapbox token without echoing it to the screen
os.environ["MAPBOX_TOKEN"] = getpass("Input your Mapbox token:")

!fiftyone plugins download \
    https://github.com/voxel51/fiftyone-plugins \
    --plugin-names @voxel51/dashboard

# Launch the App from a notebook or script
fo.launch_app(dataset)

# Or launch it from a terminal
fiftyone app launch

🐛 Warning: You’re about to see some creepy crawly insects.

Initial exploration of the dataset in FiftyOne

Deeper analysis with FiftyOne

import fiftyone.zoo as foz

bio_clip_model = foz.load_zoo_model(
    "open-clip-torch",
    pretrained="",
    clip_model="hf-hub:imageomics/bioclip",
)

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # use GPU if available

dataset.compute_embeddings(
    model=bio_clip_model,
    embeddings_field="bio_clip_embeddings",
    batch_size=128,  # use whatever batch size your GPU can handle
    device=device,
)

import torch

from transformers import AutoTokenizer, AutoModel, BertConfig

# First load the configuration
barcode_bert_config = BertConfig.from_pretrained(
    "bioscan-ml/BarcodeBERT",
    trust_remote_code=True,
)

# Load the tokenizer
barcode_bert_tokenizer = AutoTokenizer.from_pretrained(
    "bioscan-ml/BarcodeBERT",
    trust_remote_code=True,
)

# Load the model
barcode_bert_model = AutoModel.from_pretrained(
    "bioscan-ml/BarcodeBERT",
    device_map=device,
    trust_remote_code=True,
    config=barcode_bert_config,
)

# Embed each DNA barcode and store the result on its sample
with torch.no_grad():
    for sample in dataset:
        dna_sequence = sample["dna_barcode"]["value"]
        inputs = barcode_bert_tokenizer(dna_sequence, return_tensors="pt")["input_ids"]
        inputs = inputs.to(device)
        # Take the last layer's hidden states, then average over the token dimension
        outputs = barcode_bert_model(inputs.unsqueeze(0))["hidden_states"][-1]
        embs = outputs.mean(1).squeeze().cpu().numpy()
        sample["barcode_bert_embeddings"] = embs
        sample.save()
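
The `outputs.mean(1)` step above averages the per-token hidden states into one fixed-length vector per barcode (global average pooling). Below is a plain-Python sketch of that reduction for a single sequence. The `mean_pool` helper and the toy values are illustrative assumptions, not part of BarcodeBERT or FiftyOne.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    seq_len = len(token_vectors)
    hidden_dim = len(token_vectors[0])
    return [
        sum(vec[i] for vec in token_vectors) / seq_len
        for i in range(hidden_dim)
    ]


# Toy "hidden states": 3 tokens, hidden size 2 (made-up values)
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(hidden_states)
print(pooled)  # [3.0, 4.0]
```

Averaging keeps the embedding size fixed regardless of how many tokens a barcode produces, so every sample can store a vector of the same length in `barcode_bert_embeddings`.
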
import fiftyone.brain as fob

embedding_fields = ["bio_clip_embeddings", "barcode_bert_embeddings"]

for field in embedding_fields:
    _fname = field.split("_embeddings")[0]
    results = fob.compute_visualization(
        dataset,
        embeddings=field,
        method="umap",
        brain_key=f"{_fname}_viz",
        num_dims=2,
    )
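
Each UMAP run above is stored under its own brain key, which is the name you pick in the App's Embeddings panel. Here is a minimal sketch of how those names are derived from the field names. `brain_key_for` is a hypothetical helper that only mirrors the string handling in the loop and does not call FiftyOne.

```python
embedding_fields = ["bio_clip_embeddings", "barcode_bert_embeddings"]


def brain_key_for(field: str) -> str:
    """Hypothetical helper: strip the '_embeddings' suffix and append '_viz'."""
    return f"{field.split('_embeddings')[0]}_viz"


brain_keys = [brain_key_for(field) for field in embedding_fields]
print(brain_keys)  # ['bio_clip_viz', 'barcode_bert_viz']
```
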
import fiftyone as fo

fo.launch_app(dataset)

Visualizing embeddings in FiftyOne

Using embeddings to gain deeper insights

import fiftyone.brain as fob

fob.compute_uniqueness(
    samples=dataset,
    uniqueness_field="bio_clip_uniqueness",
    embeddings="bio_clip_embeddings",
)

fob.compute_representativeness(
    samples=dataset,
    representativeness_field="bio_clip_representativeness",
    embeddings="bio_clip_embeddings",
)

Filtering by uniqueness and representativeness

Conclusion

Next Steps

Published in Voxel51

Written by Harpreet Sahota

🤖 Generative AI Hacker | 👨🏽‍💻 AI Engineer | Hacker-in-Residence at Voxel51
