API Reference
- class bioscan_dataset.BIOSCAN1M(root, split='train', partitioning_version='large_diptera_family', modality=('image', 'dna'), reduce_repeated_barcodes=False, max_nucleotides=660, target_type='family', transform=None, dna_transform=None, target_transform=None, download=False)
Bases: VisionDataset
BIOSCAN-1M Dataset.
- Parameters:
root (str) – The root directory, which will contain the downloaded tarball file and the image directory, BIOSCAN-1M.
split (str, default="train") – The dataset partition. One of: "train", "val", "test", "no_split".
partitioning_version (str, default="large_diptera_family") – The dataset partitioning version. One of: "large_diptera_family", "medium_diptera_family", "small_diptera_family", "large_insect_order", "medium_insect_order", "small_insect_order".
modality (str or Iterable[str], default=("image", "dna")) – Which data modalities to use. One of, or a list of: "image", "dna".
reduce_repeated_barcodes (bool, default=False) – Whether to reduce the dataset to only one sample per barcode.
max_nucleotides (int, default=660) – Maximum number of nucleotides to keep in the DNA barcode. Set to None to keep the original data without truncation. Note that the barcode should only be 660 base pairs long; characters beyond this length are unlikely to be accurate.
target_type (str, default="family") – Type of target to use. One of: "phylum", "class", "order", "family", "subfamily", "tribe", "genus", "species", "uri", where "uri" corresponds to the BIN cluster label.
transform (Callable, default=None) – Image transformation pipeline.
dna_transform (Callable, default=None) – DNA barcode transformation pipeline.
target_transform (Callable, default=None) – Label transformation pipeline.
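The dna_transform argument accepts any callable. As a minimal sketch, assuming the callable receives the barcode as a plain string of nucleotide characters (an assumption that this reference does not confirm), a character-level tokenizer could look like the following. The names NUCLEOTIDE_TO_INDEX, UNKNOWN_INDEX, and tokenize_barcode are hypothetical and are not part of bioscan_dataset.

```python
# Hypothetical helper, not part of bioscan_dataset.
# Assumption: dna_transform is called with the barcode as a str.
NUCLEOTIDE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}
UNKNOWN_INDEX = 4  # used for N and any other non-ACGT character


def tokenize_barcode(barcode):
    """Map each nucleotide character to an integer index."""
    return [NUCLEOTIDE_TO_INDEX.get(base, UNKNOWN_INDEX) for base in barcode.upper()]


print(tokenize_barcode("ACGTN"))  # [0, 1, 2, 3, 4]
```

Under that assumption, the function would be supplied as BIOSCAN1M(root, dna_transform=tokenize_barcode), leaving the other arguments at their defaults.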
- class bioscan_dataset.BIOSCAN5M(root, split='train', modality=('image', 'dna'), image_package='cropped_256', reduce_repeated_barcodes=False, max_nucleotides=660, target_type='species', transform=None, dna_transform=None, target_transform=None, download=False)
Bases: VisionDataset
BIOSCAN-5M Dataset.
- Parameters:
root (str) – The root directory, which will contain the downloaded tarball files and the data directory.
split (str, default="train") – The dataset partition. One of: "pretrain", "train", "val", "test", "key_unseen", "val_unseen", "test_unseen", "other_heldout", "all" (the union of all splits), "seen" (the union of {train, val, test}), "unseen" (the union of {key_unseen, val_unseen, test_unseen}). Set to "all" to include all splits.
modality (str or Iterable[str], default=("image", "dna")) – Which data modalities to use. One of, or a list of: "image", "dna".
image_package (str, default="cropped_256") – The package to load images from. One of: "original_full", "cropped", "original_256", "cropped_256".
reduce_repeated_barcodes (bool, default=False) – Whether to reduce the dataset to only one sample per barcode.
max_nucleotides (int, default=660) – Maximum number of nucleotides to keep in the DNA barcode. Set to None to keep the original data without truncation. Note that the barcode should only be 660 base pairs long; characters beyond this length are unlikely to be accurate.
target_type (str or Iterable[str], default="species") – Type of target to use. One of, or a list of: "phylum", "class", "order", "family", "subfamily", "genus", "species", "dna_bin".
transform (Callable, default=None) – Image transformation pipeline.
dna_transform (Callable, default=None) – DNA barcode transformation pipeline.
target_transform (Callable, default=None) – Label transformation pipeline.
download (bool, default=False) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again. Images are only downloaded if the "image" modality is requested. Note that only image_package="cropped_256" is supported for automatic image download.
- bioscan_dataset.load_bioscan1m_metadata(metadata_path, max_nucleotides=660, reduce_repeated_barcodes=False, split=None, partitioning_version='large_diptera_family', dtype=MetadataDtype.DEFAULT, **kwargs) → DataFrame
Load BIOSCAN-1M metadata from its TSV file, and prepare it for training.
- Parameters:
metadata_path (str) – Path to metadata file.
max_nucleotides (int, default=660) – Maximum nucleotide sequence length to keep for the DNA barcodes. Set to None to keep the original data without truncation. Note that the barcode should only be 660 base pairs long; characters beyond this length are unlikely to be accurate.
reduce_repeated_barcodes (str or bool, default=False) – Whether to reduce the dataset to only one sample per barcode. If True, duplicated barcodes are removed after truncating the barcodes to the length specified by max_nucleotides and stripping trailing Ns. If False (default), no reduction is performed.
split (str, optional) – The dataset partition. One of: "train", "val", "test", "no_split", "all". If split is None (default) or "all", the data is not filtered by partition and the dataframe will contain every sample in the dataset.
partitioning_version (str, default="large_diptera_family") – The dataset partitioning version. One of: "large_diptera_family", "medium_diptera_family", "small_diptera_family", "large_insect_order", "medium_insect_order", "small_insect_order".
**kwargs – Additional keyword arguments to pass to pandas.read_csv().
- Returns:
df – The metadata DataFrame.
- Return type:
pd.DataFrame
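The interaction between max_nucleotides and reduce_repeated_barcodes described above (truncate first, strip trailing Ns, then drop duplicates) can be illustrated on plain strings. This is a sketch of the described behaviour only, not the library's implementation: normalize_barcode and reduce_repeated are hypothetical names, and keeping the first occurrence of each barcode is an assumption.

```python
# Illustrative sketch only; not the implementation used by bioscan_dataset.
def normalize_barcode(barcode, max_nucleotides=660):
    """Truncate a barcode to max_nucleotides, then strip trailing N characters."""
    if max_nucleotides is not None:
        barcode = barcode[:max_nucleotides]
    return barcode.rstrip("N")


def reduce_repeated(barcodes, max_nucleotides=660):
    """Keep one entry per distinct normalized barcode.

    Assumption: the first occurrence is the one that is kept.
    """
    seen = set()
    kept = []
    for barcode in barcodes:
        key = normalize_barcode(barcode, max_nucleotides)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


samples = ["ACGTNN", "ACGT", "ACGTAC", "TTGA"]
print(reduce_repeated(samples, max_nucleotides=4))  # ['ACGT', 'TTGA']
```

With max_nucleotides=4 the first three samples collapse to the same barcode, so only two entries remain. With max_nucleotides=None, "ACGTAC" stays distinct.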
- bioscan_dataset.load_bioscan5m_metadata(metadata_path, max_nucleotides=660, reduce_repeated_barcodes=False, split=None, dtype=MetadataDtype.DEFAULT, **kwargs) → DataFrame
Load BIOSCAN-5M metadata from its CSV file and prepare it for training.
- Parameters:
metadata_path (str) – Path to the metadata CSV file.
max_nucleotides (int, default=660) – Maximum nucleotide sequence length to keep for the DNA barcodes. Set to None to keep the original data without truncation. Note that the barcode should only be 660 base pairs long; characters beyond this length are unlikely to be accurate.
reduce_repeated_barcodes (bool, default=False) – Whether to reduce the dataset to only one sample per barcode. If True, duplicated barcodes are removed after truncating the barcodes to the length specified by max_nucleotides and stripping trailing Ns. If False (default), no reduction is performed.
split (str, optional) – The dataset partition to return. One of: "pretrain", "train", "val", "test", "key_unseen", "val_unseen", "test_unseen", "other_heldout", "all" (the union of all splits), "seen" (the union of {train, val, test}), "unseen" (the union of {key_unseen, val_unseen, test_unseen}). If split is None (default) or "all", the data is not filtered by partition and the dataframe will contain every sample in the dataset.
**kwargs – Additional keyword arguments to pass to pandas.read_csv().
- Returns:
The metadata DataFrame.
- Return type: