API Reference

class bioscan_dataset.BIOSCAN1M(root, split='train', partitioning_version='large_diptera_family', modality=('image', 'dna'), reduce_repeated_barcodes=False, max_nucleotides=660, target_type='family', transform=None, dna_transform=None, target_transform=None, download=False)

Bases: VisionDataset

BIOSCAN-1M Dataset.

Parameters:
  • root (str) – The root directory, which will contain the downloaded tarball file and the image directory, BIOSCAN-1M.

  • split (str, default="train") –

    The dataset partition, one of:

    • "train"

    • "val"

    • "test"

    • "no_split"

  • partitioning_version (str, default="large_diptera_family") –

    The dataset partitioning version, one of:

    • "large_diptera_family"

    • "medium_diptera_family"

    • "small_diptera_family"

    • "large_insect_order"

    • "medium_insect_order"

    • "small_insect_order"

  • modality (str or Iterable[str], default=("image", "dna")) – Which data modalities to use. One of, or a list of: "image", "dna".

  • reduce_repeated_barcodes (bool, default=False) – Whether to reduce the dataset to only one sample per barcode.

  • max_nucleotides (int, default=660) – Maximum number of nucleotides to keep in the DNA barcode. Set to None to keep the original data without truncation. Note that the barcode should only be 660 base pairs long. Characters beyond this length are unlikely to be accurate.

  • target_type (str, default="family") –

    Type of target to use. One of:

    • "phylum"

    • "class"

    • "order"

    • "family"

    • "subfamily"

    • "tribe"

    • "genus"

    • "species"

    • "uri"

    Where "uri" corresponds to the BIN cluster label.

  • transform (Callable, default=None) – Image transformation pipeline.

  • dna_transform (Callable, default=None) – DNA barcode transformation pipeline.

  • target_transform (Callable, default=None) – Label transformation pipeline.
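
The parameters above can be combined as in the following minimal sketch. The helper name build_bioscan1m, the BIOSCAN1M_DEFAULTS dictionary, and the root path in the commented example are illustrative assumptions and not part of the package; only the keyword names and values come from the signature above. The import is deferred so the sketch can be read without the package installed.

```python
from typing import Any, Dict

# Keyword arguments mirroring the documented defaults of BIOSCAN1M.
# (This dictionary is a convenience for the sketch, not a package object.)
BIOSCAN1M_DEFAULTS: Dict[str, Any] = {
    "split": "train",
    "partitioning_version": "large_diptera_family",
    "modality": ("image", "dna"),
    "reduce_repeated_barcodes": False,
    "max_nucleotides": 660,
    "target_type": "family",
    "download": False,
}


def build_bioscan1m(root: str, **overrides: Any):
    """Construct a BIOSCAN1M dataset, overriding selected documented defaults."""
    # Deferred import: requires the bioscan_dataset package to be installed.
    from bioscan_dataset import BIOSCAN1M

    kwargs = {**BIOSCAN1M_DEFAULTS, **overrides}
    return BIOSCAN1M(root, **kwargs)


# Example call (hypothetical root path): DNA barcodes only, order-level labels,
# validation split of the "large_insect_order" partitioning.
# dataset = build_bioscan1m(
#     "~/Datasets/bioscan/",
#     split="val",
#     partitioning_version="large_insect_order",
#     modality="dna",
#     target_type="order",
# )
```

Because the class derives from VisionDataset, the resulting object can be passed to a torch.utils.data.DataLoader like any other PyTorch dataset.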

class bioscan_dataset.BIOSCAN5M(root, split='train', modality=('image', 'dna'), image_package='cropped_256', reduce_repeated_barcodes=False, max_nucleotides=660, target_type='species', transform=None, dna_transform=None, target_transform=None, download=False)

Bases: VisionDataset

BIOSCAN-5M Dataset.

Parameters:
  • root (str) – The root directory, which will contain the downloaded tarball files and the data directory.

  • split (str, default="train") –

    The dataset partition. One of:

    • "pretrain"

    • "train"

    • "val"

    • "test"

    • "key_unseen"

    • "val_unseen"

    • "test_unseen"

    • "other_heldout"

    • "all", which is the union of all splits

    • "seen", which is the union of {train, val, test}

    • "unseen", which is the union of {key_unseen, val_unseen, test_unseen}

    Set to "all" to include all splits.

  • modality (str or Iterable[str], default=("image", "dna")) – Which data modalities to use. One of, or a list of: "image", "dna".

  • image_package (str, default="cropped_256") – The package to load images from. One of: "original_full", "cropped", "original_256", "cropped_256".

  • reduce_repeated_barcodes (bool, default=False) – Whether to reduce the dataset to only one sample per barcode.

  • max_nucleotides (int, default=660) – Maximum number of nucleotides to keep in the DNA barcode. Set to None to keep the original data without truncation. Note that the barcode should only be 660 base pairs long. Characters beyond this length are unlikely to be accurate.

  • target_type (str or Iterable[str], default="species") –

    Type of target to use. One of, or a list of:

    • "phylum"

    • "class"

    • "order"

    • "family"

    • "subfamily"

    • "genus"

    • "species"

    • "dna_bin"

  • transform (Callable, default=None) – Image transformation pipeline.

  • dna_transform (Callable, default=None) – DNA barcode transformation pipeline.

  • target_transform (Callable, default=None) – Label transformation pipeline.

  • download (bool, default=False) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again. Images are only downloaded if the "image" modality is requested. Note that only image_package="cropped_256" is supported for automatic image download.
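
The relationship between the base partitions and the "all", "seen", and "unseen" values of split can be sketched in plain Python. This only illustrates the unions described above and is not the package's implementation. The names BASE_SPLITS, SPLIT_ALIASES, and expand_split are hypothetical, and the sketch assumes the eight named partitions are the complete set covered by "all".

```python
from typing import Tuple

# The eight base partitions listed for BIOSCAN-5M.
BASE_SPLITS: Tuple[str, ...] = (
    "pretrain",
    "train",
    "val",
    "test",
    "key_unseen",
    "val_unseen",
    "test_unseen",
    "other_heldout",
)

# Union-style split values and the base partitions they cover.
SPLIT_ALIASES = {
    "all": BASE_SPLITS,
    "seen": ("train", "val", "test"),
    "unseen": ("key_unseen", "val_unseen", "test_unseen"),
}


def expand_split(split: str) -> Tuple[str, ...]:
    """Return the base partitions covered by a documented split value."""
    if split in SPLIT_ALIASES:
        return SPLIT_ALIASES[split]
    if split in BASE_SPLITS:
        return (split,)
    raise ValueError(f"Unknown split: {split!r}")


print(expand_split("seen"))
print(expand_split("pretrain"))
```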

download() → None

Download and extract the data.
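
A download is usually triggered by passing download=True to the constructor. The sketch below wraps that call for BIOSCAN5M and adds an up-front check reflecting the note that automatic image download is only supported for the "cropped_256" package. The wrapper fetch_bioscan5m and its check are assumptions made for illustration; the package's own behaviour for other image packages is not described here.

```python
from typing import Iterable, Union

# The only image package documented as supporting automatic download.
AUTO_DOWNLOAD_IMAGE_PACKAGE = "cropped_256"


def fetch_bioscan5m(
    root: str,
    modality: Union[str, Iterable[str]] = ("image", "dna"),
    image_package: str = "cropped_256",
):
    """Construct BIOSCAN5M with download=True after a local sanity check."""
    modalities = [modality] if isinstance(modality, str) else list(modality)
    # Images are only downloaded when the "image" modality is requested,
    # so the package restriction only matters in that case.
    if "image" in modalities and image_package != AUTO_DOWNLOAD_IMAGE_PACKAGE:
        raise ValueError(
            f"Automatic image download is documented only for "
            f"image_package={AUTO_DOWNLOAD_IMAGE_PACKAGE!r}, got {image_package!r}"
        )
    # Deferred import: requires the bioscan_dataset package to be installed.
    from bioscan_dataset import BIOSCAN5M

    return BIOSCAN5M(
        root,
        modality=modalities,
        image_package=image_package,
        download=True,
    )


# Example call (hypothetical root path):
# dataset = fetch_bioscan5m("~/Datasets/bioscan/", modality="dna")
```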

bioscan_dataset.load_bioscan1m_metadata(metadata_path, max_nucleotides=660, reduce_repeated_barcodes=False, split=None, partitioning_version='large_diptera_family', dtype=MetadataDtype.DEFAULT, **kwargs) → DataFrame

Load BIOSCAN-1M metadata from its TSV file and prepare it for training.

Parameters:
  • metadata_path (str) – Path to metadata file.

  • max_nucleotides (int, default=660) – Maximum nucleotide sequence length to keep for the DNA barcodes. Set to None to keep the original data without truncation. Note that the barcode should only be 660 base pairs long. Characters beyond this length are unlikely to be accurate.

  • reduce_repeated_barcodes (str or bool, default=False) – Whether to reduce the dataset to only one sample per barcode. If True, duplicated barcodes are removed after truncating the barcodes to the length specified by max_nucleotides and stripping trailing Ns. If False (default) no reduction is performed.

  • split (str, optional) –

    The dataset partition, one of:

    • "train"

    • "val"

    • "test"

    • "no_split"

    • "all"

    If split is None (default) or "all", the data is not filtered by partition and the DataFrame will contain every sample in the dataset.

  • partitioning_version (str, default="large_diptera_family") –

    The dataset partitioning version, one of:

    • "large_diptera_family"

    • "medium_diptera_family"

    • "small_diptera_family"

    • "large_insect_order"

    • "medium_insect_order"

    • "small_insect_order"

  • **kwargs – Additional keyword arguments to pass to pandas.read_csv().

Returns:

df – The metadata DataFrame.

Return type:

pandas.DataFrame
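
The interaction between max_nucleotides and reduce_repeated_barcodes can be illustrated with a small standard-library sketch: each barcode is truncated, trailing Ns are stripped, and samples whose resulting barcode has already been seen are dropped. This is not the package's code. The function names are hypothetical, and keeping the first occurrence of each barcode is an assumption, since the description above does not say which duplicate is retained.

```python
from typing import List, Optional, Sequence


def normalise_barcode(barcode: str, max_nucleotides: Optional[int] = 660) -> str:
    """Truncate a barcode to max_nucleotides, then strip trailing Ns."""
    if max_nucleotides is not None:
        barcode = barcode[:max_nucleotides]
    return barcode.rstrip("N")


def reduce_repeated(
    barcodes: Sequence[str], max_nucleotides: Optional[int] = 660
) -> List[int]:
    """Return indices of samples kept after removing repeated barcodes.

    Assumption: the first sample with a given normalised barcode is kept.
    """
    seen = set()
    kept = []
    for index, barcode in enumerate(barcodes):
        key = normalise_barcode(barcode, max_nucleotides)
        if key not in seen:
            seen.add(key)
            kept.append(index)
    return kept


# With max_nucleotides=4, the first two barcodes both reduce to "ACGT" and
# the last two both reduce to "AC", so one sample of each pair is kept.
print(reduce_repeated(["ACGTAA", "ACGTCC", "ACNN", "AC"], max_nucleotides=4))
```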

bioscan_dataset.load_bioscan5m_metadata(metadata_path, max_nucleotides=660, reduce_repeated_barcodes=False, split=None, dtype=MetadataDtype.DEFAULT, **kwargs) → DataFrame

Load BIOSCAN-5M metadata from its CSV file and prepare it for training.

Parameters:
  • metadata_path (str) – Path to the metadata CSV file.

  • max_nucleotides (int, default=660) – Maximum nucleotide sequence length to keep for the DNA barcodes. Set to None to keep the original data without truncation. Note that the barcode should only be 660 base pairs long. Characters beyond this length are unlikely to be accurate.

  • reduce_repeated_barcodes (bool, default=False) – Whether to reduce the dataset to only one sample per barcode. If True, duplicated barcodes are removed after truncating the barcodes to the length specified by max_nucleotides and stripping trailing Ns. If False (default) no reduction is performed.

  • split (str, optional) –

    The dataset partition to return. One of:

    • "pretrain"

    • "train"

    • "val"

    • "test"

    • "key_unseen"

    • "val_unseen"

    • "test_unseen"

    • "other_heldout"

    • "all", which is the union of all splits

    • "seen", which is the union of {train, val, test}

    • "unseen", which is the union of {key_unseen, val_unseen, test_unseen}

    If split is None (default) or "all", the data is not filtered by partition and the DataFrame will contain every sample in the dataset.

  • **kwargs – Additional keyword arguments to pass to pandas.read_csv().

Returns:

The metadata DataFrame.

Return type:

pandas.DataFrame
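
A call to load_bioscan5m_metadata might look like the following sketch, which loads the "seen" partitions with repeated barcodes removed and forwards any extra keyword arguments to pandas.read_csv(). The wrapper name load_seen_metadata and the path in the commented example are hypothetical. The example assumes that nrows, a standard pandas.read_csv() argument, does not conflict with arguments the loader sets itself.

```python
from typing import Any


def load_seen_metadata(metadata_path: str, **read_csv_kwargs: Any):
    """Load BIOSCAN-5M metadata for the train/val/test ("seen") partitions."""
    # Deferred import: requires the bioscan_dataset package to be installed.
    from bioscan_dataset import load_bioscan5m_metadata

    return load_bioscan5m_metadata(
        metadata_path,
        max_nucleotides=660,            # documented default, stated explicitly
        reduce_repeated_barcodes=True,  # keep one sample per barcode
        split="seen",                   # union of train, val, and test
        **read_csv_kwargs,              # forwarded to pandas.read_csv()
    )


# Example call (hypothetical path), reading only the first 1000 rows:
# df = load_seen_metadata("path/to/bioscan5m_metadata.csv", nrows=1000)
```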