API Reference

We provide BIOSCAN1M, BIOSCAN5M, and CanadianInvertebrates classes to load the respective BIOSCAN-1M, BIOSCAN-5M, and Canadian Invertebrates datasets for use within PyTorch. These classes are subclasses of torch.utils.data.Dataset and are designed to be used with PyTorch’s DataLoader for batching and model training.

General usage instructions for BIOSCAN1M, BIOSCAN5M, and CanadianInvertebrates are provided in our usage guide.

Tip

For new projects, we recommend using BIOSCAN5M instead of BIOSCAN1M, since the newer dataset has cleaner labels and images. For larger-scale projects, BIOSCAN5M is a superset of BIOSCAN1M and provides five times more samples to train on. On the other hand, if 5 million samples are too many to handle, you can ignore the "pretrain" partition (training on the "train" partition only), which reduces the dataset to fewer than 400k samples.

The accompanying functions load_bioscan1m_metadata(), load_bioscan5m_metadata(), and load_canadian_invertebrates_metadata() can be used to load the metadata from the metadata files (TSV for BIOSCAN-1M, CSV for BIOSCAN-5M and Canadian Invertebrates). Each produces a DataFrame in the same format as is used for model training. You do not need to call these functions manually when you are using BIOSCAN1M, BIOSCAN5M, or CanadianInvertebrates to work with the datasets.
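As a rough sketch of calling one of these loaders directly, the helper below wraps load_bioscan5m_metadata() using only arguments from its documented signature. The helper name, its include_pretrain flag, and the example path are hypothetical placeholders. The import is deferred so that defining the helper does not require the package to be installed.

```python
def load_training_metadata(metadata_path, include_pretrain=False):
    """Return the BIOSCAN-5M metadata for the training partition(s).

    `metadata_path` must point at the BIOSCAN-5M metadata CSV file; where that
    file lives on disk depends on your setup and is not assumed here.
    """
    # Deferred import: the package is only needed when the helper is called.
    from bioscan_dataset import load_bioscan5m_metadata

    # Partitions can be combined by joining their names with "+".
    split = "pretrain+train" if include_pretrain else "train"
    return load_bioscan5m_metadata(metadata_path, split=split)


# Hypothetical usage (the path is a placeholder, not a real location):
# df = load_training_metadata("path/to/bioscan5m_metadata.csv")
# print(len(df), "samples")
```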

BIOSCAN-1M Dataset

class bioscan_dataset.BIOSCAN1M(root, split: str = 'train', partitioning_version: str = 'large_diptera_family', modality: str | Iterable[str] = ('image', 'dna'), image_package: str = 'cropped_256', reduce_repeated_barcodes: bool = False, max_nucleotides: int | None = 660, target_type: str | Iterable[str] = 'family', target_format: str = 'index', output_format: str = 'tuple', transform: Callable | None = None, dna_transform: Callable | None = None, target_transform: Callable | None = None, download: bool = False)[source]

Bases: VisionDataset

BIOSCAN-1M Dataset.

Parameters:

root (str) – The root directory, to contain the downloaded tarball files and bioscan1m data directory.
split (str, default="train") –
The dataset partition. For the BIOSCAN-1M partitioning versions ({large/medium/small}_{diptera_family/insect_order}), this should be one of:
- "train"
- "validation"
- "test"
- "no_split" (unused by experiments in BIOSCAN-1M paper)
For the CLIBD partitioning version, this should be one of:
- "all_keys" (the keys are used as a reference set for retrieval tasks)
- "no_split" (equivalent to "pretrain" in BIOSCAN-5M; these samples are not labelled to species level)
- "no_split_and_seen_train" (used for CLIBD model training; equivalent to using "pretrain+train" in BIOSCAN-5M)
- "seen_keys"
- "single_species"
- "test_seen" (similar to "test" in BIOSCAN-5M)
- "test_unseen"
- "test_unseen_keys" (similar to "key_unseen" in BIOSCAN-5M)
- "train_seen" (similar to "train" in BIOSCAN-5M)
- "val_seen" (similar to "val" in BIOSCAN-5M)
- "val_unseen"
- "val_unseen_keys"
- Additionally, BIOSCAN5M split names are accepted as aliases for the corresponding CLIBD partitions.
If split is None or "all", the data is not filtered by partition and the dataframe will contain every sample in the dataset.

The split parameter can also be specified as a collection of partitions joined by "+". For example, split="train+validation+test" will return a dataset comprising the samples in the training, validation, and test partitions.

Warning

The contents of a split depend on the value of partitioning_version. If partitioning_version is changed, the same split value will yield completely different records.
partitioning_version (str, default="large_diptera_family") –
The dataset partitioning version, one of:
- "large_diptera_family"
- "medium_diptera_family"
- "small_diptera_family"
- "large_insect_order"
- "medium_insect_order"
- "small_insect_order"
- "clibd"
The "clibd" partitioning version was introduced by the paper CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale, whilst the other partitions were introduced in the BIOSCAN-1M paper.

To use the CLIBD partitioning, download and extract the partition files from here into the "{root}/bioscan1m/" directory. These files are automatically downloaded if download=True.

Attention

The original BIOSCAN-1M partitioning versions only support target_type up to family and order level, respectively. For more fine-grained taxonomic labels, we recommend using the CLIBD partitioning, which supports target_type up to species level.

Changed in version 1.2.0: Added support for CLIBD partitioning.
modality (str or Iterable[str], default=("image", "dna")) –
Which data modalities to use. One of, or a list of: "image", "dna", or any column name in the metadata TSV file.

Changed in version 1.1.0: Added support for arbitrary modalities.
image_package (str, default="cropped_256") –
The package to load images from. One of: "original_full", "cropped", "original_256", "cropped_256".

Added in version 1.1.0.
reduce_repeated_barcodes (bool, default=False) – Whether to reduce the dataset to only one sample per barcode.
max_nucleotides (int, default=660) –
Maximum number of nucleotides to keep in the DNA barcode. Set to None to keep the original data without truncation.

Note

COI DNA barcodes are typically 658 base pairs long for insects (Elbrecht et al., 2019), and an additional two base pairs are included as a buffer for the primer sequence. Although the BIOSCAN-1M dataset itself contains longer sequences, characters after the first 660 base pairs are likely to be inaccurate reads, and not part of the DNA barcode. Hence we recommend limiting the DNA barcode to the first 660 nucleotides. If you don’t know much about DNA barcodes, you probably shouldn’t change this parameter.
target_type (str or Iterable[str], default="family") –
Type of target to use. One of, or a list of:
- "phylum"
- "class"
- "order"
- "family"
- "subfamily"
- "tribe"
- "genus"
- "species"
- "uri" (equivalent to "dna_bin"; a species-level label derived from DNA barcode clustering by BOLD)
The "uri" target corresponds to the BIN cluster label.
target_format (str, default="index") –
Format in which the targets will be returned. One of: "index", "text". If this is set to "index" (default), target(s) will each be returned as integer indices, each of which corresponds to a value for that taxonomic rank in a look-up-table. Missing values will be filled with -1. This format is appropriate for use in classification tasks. If this is set to "text", the target(s) will each be returned as a string, appropriate for processing with language models.

Added in version 1.1.0.
output_format (str, default="tuple") –
Format in which the output of __getitem__() will be returned. One of: "tuple", "dict". If this is set to "tuple" (default), all modalities and targets will be returned together as a single tuple. If this is set to "dict", the output will be returned as a dictionary containing the modalities and targets as separate keys.

Added in version 1.3.0.
transform (Callable, optional) – Image transformation pipeline.
dna_transform (Callable, optional) – DNA barcode transformation pipeline.
target_transform (Callable, optional) – Label transformation pipeline.
download (bool, default=False) –
If True, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again. Images are only downloaded if the "image" modality is requested. Note that only image_package values "cropped_256" and "original_256" are currently supported for automatic image download.

Added in version 1.2.0.
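The "+"-joined split syntax described for the split parameter can be pictured with the following pure-Python sketch. It is an illustration of the documented behaviour only, not the library's implementation; the function names, the record layout, and the "split" key are assumptions made for the example.

```python
def parse_split(split):
    """Break a split specification into its component partition names.

    Returns None when no partition filtering is requested (None or "all").
    """
    if split is None or split == "all":
        return None
    # "train+validation+test" -> ["train", "validation", "test"]
    return [part.strip() for part in split.split("+")]


def select_records(records, split, key="split"):
    """Keep the records whose partition is named in `split`."""
    wanted = parse_split(split)
    if wanted is None:
        return list(records)
    return [record for record in records if record[key] in wanted]


# Toy records standing in for metadata rows (hypothetical layout).
records = [
    {"id": 1, "split": "train"},
    {"id": 2, "split": "validation"},
    {"id": 3, "split": "test"},
    {"id": 4, "split": "no_split"},
]
print([r["id"] for r in select_records(records, "train+validation+test")])  # [1, 2, 3]
print([r["id"] for r in select_records(records, "all")])  # [1, 2, 3, 4]
```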

metadata

The metadata associated with the samples in the selected split, loaded using load_bioscan1m_metadata().

Type: pandas.DataFrame

__getitem__(index: int) → Tuple[Any, ...][source]

Get a sample from the dataset.

Parameters:

index (int) – Index of the sample to retrieve.

Returns:

If output_format="tuple", the output will be a tuple containing:

image : PIL.Image.Image or Any
The image, if the "image" modality is requested, optionally transformed by the transform pipeline.
dna : str or Any
The DNA barcode, if the "dna" modality is requested, optionally transformed by the dna_transform pipeline.
*modalities : Any
Any other modalities requested, as specified in the modality parameter. The data is extracted from the appropriate column in the metadata TSV file, without any transformations. Missing values will be filled with NaN.
target : int or Tuple[int, …] or str or Tuple[str, …] or None
The target(s), optionally transformed by the target_transform pipeline. If target_format="index", the target(s) will be returned as integer indices, with missing values filled with -1. If target_format="text", the target(s) will be returned as a string. If there are multiple targets, they will be returned as a tuple. If target_type is an empty list, the output target will be None.

If output_format="dict", the output will be a dictionary with keys and values as follows:

keys for each of the modalities specified in the modality parameter, with corresponding values as described above. The values for the image and DNA barcode modalities are transformed by their respective pipelines if specified.
keys for each of the targets specified in target_type, with corresponding value equal to that target’s label (e.g. out["family"] == "Gelechiidae")
for each of the keys in target_type, the corresponding index column ({target}_index), with value equal to that target’s index (e.g. out["family_index"] == 206)
the key "target", whose contents are as described above

Changed in version 1.3.0: Added support for output_format="dict".

Return type:

tuple or dict
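To make the two output layouts concrete, here is a small pure-Python sketch that arranges one hand-written sample in the "tuple" and "dict" formats described above. It mimics the documented layout only and is not the library's code; format_sample and its arguments are hypothetical, and placing None at the end of the tuple when no targets are requested is an assumption.

```python
def format_sample(modalities, labels, indices,
                  target_format="index", output_format="tuple"):
    """Arrange one sample in the documented "tuple" or "dict" layout.

    modalities: dict of modality name -> value, in the requested order
    labels:     dict of target_type -> text label
    indices:    dict of target_type -> integer index (-1 when missing)
    """
    source = indices if target_format == "index" else labels
    targets = tuple(source[name] for name in labels)
    if len(targets) == 0:
        target = None  # assumption: None still occupies the last tuple slot
    elif len(targets) == 1:
        target = targets[0]
    else:
        target = targets

    if output_format == "tuple":
        return (*modalities.values(), target)

    out = dict(modalities)
    for name in labels:
        out[name] = labels[name]
        out[f"{name}_index"] = indices[name]
    out["target"] = target
    return out


modalities = {"image": "<image>", "dna": "ACGT"}
labels = {"family": "Gelechiidae"}
indices = {"family": 206}

print(format_sample(modalities, labels, indices))
# ('<image>', 'ACGT', 206)
out = format_sample(modalities, labels, indices, output_format="dict")
print(out["family"], out["family_index"], out["target"])
# Gelechiidae 206 206
```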

download() → None[source]: Download and extract the data.

Added in version 1.2.0.

index2label(index: int | List[int] | ndarray[tuple[int, ...], dtype[int64]], column: str | None = None) → str | ndarray[tuple[int, ...], dtype[str_]][source]

Convert target’s integer index to text label.

Added in version 1.1.0.

Parameters:

index (int or array_like[int]) – The integer index or indices to map to labels.
column (str, default=same as self.target_type) – The dataset column name to map. This should be one of the possible values for target_type. By default, the column name is the target_type used for the class, provided it is a single value.

Returns:

The text label or labels corresponding to the integer index or indices in the specified column. Entries containing missing values, indicated by negative indices, are mapped to an empty string.

Return type:

str or numpy.array[str]

label2index(label: str | Iterable[str], column: str | None = None) → int | ndarray[tuple[int, ...], dtype[int64]][source]

Convert target’s text label to integer index.

Added in version 1.1.0.

Parameters:

label (str or Iterable[str]) – The text label or labels to map to integer indices.
column (str, default=same as self.target_type) – The dataset column name to map. This should be one of the possible values for target_type. By default, the column name is the target_type used for the class, provided it is a single value.

Returns:

The integer index or indices corresponding to the text label or labels in the specified column. Entries containing missing values, indicated by empty strings or NaN values, are mapped to -1.

Return type:

int or numpy.array[int]
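The relationship between index2label() and label2index() can be illustrated with a toy look-up table. This is a sketch under stated assumptions, not the library's implementation: the class name is hypothetical, indices are assumed to follow the sorted unique labels, unknown labels are mapped to -1 for simplicity, and plain lists are returned where the real methods return NumPy arrays.

```python
class LabelLookup:
    """Toy look-up table between text labels and integer indices."""

    def __init__(self, labels):
        # Assumption for illustration: indices follow the sorted unique labels.
        self.labels = sorted({label for label in labels if label})
        self.index_of = {label: i for i, label in enumerate(self.labels)}

    def label2index(self, label):
        if label is None or isinstance(label, float):
            return -1  # None or NaN marks a missing value
        if isinstance(label, str):
            # The empty string (missing) is never in the table, so it maps to -1.
            return self.index_of.get(label, -1)
        return [self.label2index(item) for item in label]

    def index2label(self, index):
        if isinstance(index, int):
            # Negative indices mark missing values and map to "".
            return self.labels[index] if index >= 0 else ""
        return [self.index2label(item) for item in index]


lookup = LabelLookup(["Diptera", "Lepidoptera", "", "Diptera", "Coleoptera"])
idx = lookup.label2index(["Diptera", "Lepidoptera", "", "Diptera"])
print(idx)                      # [1, 2, -1, 1]
print(lookup.index2label(idx))  # ['Diptera', 'Lepidoptera', '', 'Diptera']
```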

bioscan_dataset.load_bioscan1m_metadata(metadata_path, max_nucleotides: int | None = 660, reduce_repeated_barcodes: bool = False, split: str | None = None, partitioning_version: str = 'large_diptera_family', clibd_partitioning_path: str | None = None, dtype: str | dict | None = MetadataDtype.DEFAULT, **kwargs) → DataFrame[source]

Load BIOSCAN-1M metadata from its TSV file, and prepare it for training.

Parameters:

metadata_path (str) – Path to metadata file.
max_nucleotides (int, default=660) –
Maximum nucleotide sequence length to keep for the DNA barcodes. Set to None to keep the original data without truncation.

Note

COI DNA barcodes are typically 658 base pairs long for insects (Elbrecht et al., 2019), and an additional two base pairs are included as a buffer for the primer sequence. Although the BIOSCAN-1M dataset itself contains longer sequences, characters after the first 660 base pairs are likely to be inaccurate reads, and not part of the DNA barcode. Hence we recommend limiting the DNA barcode to the first 660 nucleotides. If you don’t know much about DNA barcodes, you probably shouldn’t change this parameter.
reduce_repeated_barcodes (bool, default=False) – Whether to reduce the dataset to only one sample per barcode. If True, duplicated barcodes are removed after truncating the barcodes to the length specified by max_nucleotides and stripping trailing Ns. If False (default), no reduction is performed.
split (str, optional) –
The dataset partition. For the BIOSCAN-1M partitioning versions ({large/medium/small}_{diptera_family/insect_order}), this should be one of:
- "train"
- "validation"
- "test"
- "no_split" (unused by experiments in BIOSCAN-1M paper)
For the CLIBD partitioning version, this should be one of:
- "all_keys" (the keys are used as a reference set for retrieval tasks)
- "no_split" (equivalent to "pretrain" in BIOSCAN-5M; these samples are not labelled to species level)
- "no_split_and_seen_train" (used for CLIBD model training; equivalent to using "pretrain+train" in BIOSCAN-5M)
- "seen_keys"
- "single_species"
- "test_seen" (similar to "test" in BIOSCAN-5M)
- "test_unseen"
- "test_unseen_keys" (similar to "key_unseen" in BIOSCAN-5M)
- "train_seen" (similar to "train" in BIOSCAN-5M)
- "val_seen" (similar to "val" in BIOSCAN-5M)
- "val_unseen"
- "val_unseen_keys"
- Additionally, BIOSCAN5M split names are accepted as aliases for the corresponding CLIBD partitions.
If split is None or "all" (default), the data is not filtered by partition and the dataframe will contain every sample in the dataset.

The split parameter can also be specified as a collection of partitions joined by "+". For example, "train+validation+test" will filter the metadata to samples in the training, validation, and test partitions.

Warning

The contents of a split depend on the value of partitioning_version. If partitioning_version is changed, the same split value will yield completely different records.
partitioning_version (str, default="large_diptera_family") –
The dataset partitioning version, one of:
- "large_diptera_family"
- "medium_diptera_family"
- "small_diptera_family"
- "large_insect_order"
- "medium_insect_order"
- "small_insect_order"
- "clibd"
The "clibd" partitioning version was introduced by the paper CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale, whilst the other partitions were introduced in the BIOSCAN-1M paper.

To use the CLIBD partitioning, download and extract the partition files from here into the same directory as the metadata TSV file.

Changed in version 1.2.0: Added support for CLIBD partitioning.
clibd_partitioning_path (str, optional) – Path to the CLIBD_partitioning directory. By default, this is a subdirectory named "CLIBD_partitioning" in the directory containing metadata_path.
**kwargs – Additional keyword arguments to pass to pandas.read_csv().

Returns:

df – The metadata DataFrame. If the CLIBD partitioning files are present, the DataFrame will contain an additional column named "clibd_split" which indicates the CLIBD split for each sample.

Return type:

pandas.DataFrame
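The interaction between max_nucleotides and reduce_repeated_barcodes (truncate, strip trailing Ns, then drop duplicates) can be sketched in plain Python as follows. This is an illustration of the documented order of operations, not the library's code; the function names are hypothetical, and keeping the first occurrence of each duplicate is an assumption.

```python
def normalise_barcode(barcode, max_nucleotides=660):
    """Truncate a barcode to max_nucleotides, then strip trailing N characters."""
    if max_nucleotides is not None:
        barcode = barcode[:max_nucleotides]
    return barcode.rstrip("N")


def reduced_indices(barcodes, max_nucleotides=660):
    """Return the positions of the samples kept after removing repeated barcodes.

    Assumption: the first sample with each normalised barcode is the one kept.
    """
    seen = set()
    kept = []
    for position, barcode in enumerate(barcodes):
        key = normalise_barcode(barcode, max_nucleotides)
        if key not in seen:
            seen.add(key)
            kept.append(position)
    return kept


# A tiny max_nucleotides keeps the example readable.
barcodes = ["ACGTAA", "ACGTCC", "ACGN", "ACG", "TTTT"]
print(reduced_indices(barcodes, max_nucleotides=4))     # [0, 2, 4]
print(reduced_indices(barcodes, max_nucleotides=None))  # [0, 1, 2, 4]
```

With truncation to 4 nucleotides the first two barcodes collapse to "ACGT", and "ACGN" collapses onto "ACG" once its trailing N is stripped. Without truncation only the latter pair are duplicates.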

BIOSCAN-5M Dataset

class bioscan_dataset.BIOSCAN5M(root, split: str = 'train', modality: str | Iterable[str] = ('image', 'dna'), image_package: str = 'cropped_256', reduce_repeated_barcodes: bool = False, max_nucleotides: int | None = 660, target_type: str | Iterable[str] = 'species', target_format: str = 'index', output_format: str = 'tuple', transform: Callable | None = None, dna_transform: Callable | None = None, target_transform: Callable | None = None, download: bool = False)[source]

Bases: VisionDataset

BIOSCAN-5M Dataset.

Parameters:

root (str) – The root directory, to contain the downloaded tarball files and bioscan5m data directory.
split (str, default="train") –
The dataset partition. One of:
- "pretrain"
- "train"
- "val"
- "test"
- "key_unseen"
- "val_unseen"
- "test_unseen"
- "other_heldout"
- "all", which is the union of all splits
- "seen", which is the union of {train, val, test}
- "unseen", which is the union of {key_unseen, val_unseen, test_unseen}
Set to "all" to include all splits.

The split parameter can also be specified as a collection of partitions joined by "+". For example, split="pretrain+train" will return a dataset comprising the pretraining and training partitions.

Note

There is a distributional shift between the partitions, which means the validation and test accuracies are not directly comparable, and the accuracies for the seen and unseen splits are not directly comparable. For more details, please see Appendix R.3 of the BIOSCAN-5M paper.
modality (str or Iterable[str], default=("image", "dna")) –
Which data modalities to use. One of, or a list of: "image", "dna", or any column name in the metadata CSV file. Examples of column names which may be of interest are "coord-lat" (latitude of collection location), "coord-lon" (longitude of collection location), and "image_measurement_value" (specimen size, in pixels).

Changed in version 1.1.0: Added support for arbitrary modalities.
image_package (str, default="cropped_256") – The package to load images from. One of: "original_full", "cropped", "original_256", "cropped_256".
reduce_repeated_barcodes (bool, default=False) – Whether to reduce the dataset to only one sample per barcode.
max_nucleotides (int, default=660) –
Maximum number of nucleotides to keep in the DNA barcode. Set to None to keep the original data without truncation.

Note

COI DNA barcodes are typically 658 base pairs long for insects (Elbrecht et al., 2019), and an additional two base pairs are included as a buffer for the primer sequence. Although the BIOSCAN-5M dataset itself contains longer sequences, characters after the first 660 base pairs are likely to be inaccurate reads, and not part of the DNA barcode. Hence we recommend limiting the DNA barcode to the first 660 nucleotides. If you don’t know much about DNA barcodes, you probably shouldn’t change this parameter.
target_type (str or Iterable[str], default="species") –
Type of target to use. One of, or a list of:
- "phylum"
- "class"
- "order"
- "family"
- "subfamily"
- "genus"
- "species"
- "dna_bin" (a species-level label derived from DNA barcode clustering by BOLD)
target_format (str, default="index") –
Format in which the targets will be returned. One of: "index", "text". If this is set to "index" (default), target(s) will each be returned as integer indices, each of which corresponds to a value for that taxonomic rank in a look-up-table. Missing values will be filled with -1. This format is appropriate for use in classification tasks. If this is set to "text", the target(s) will each be returned as a string, appropriate for processing with language models.

Added in version 1.1.0.
output_format (str, default="tuple") –
Format in which the output of __getitem__() will be returned. One of: "tuple", "dict". If this is set to "tuple" (default), all modalities and targets will be returned together as a single tuple. If this is set to "dict", the output will be returned as a dictionary containing the modalities and targets as separate keys.

Added in version 1.3.0.
transform (Callable, optional) – Image transformation pipeline.
dna_transform (Callable, optional) – DNA barcode transformation pipeline.
target_transform (Callable, optional) – Label transformation pipeline.
download (bool, default=False) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again. Images are only downloaded if the "image" modality is requested. Note that only image_package="cropped_256" is supported for automatic image download.
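A minimal sketch of how these parameters might be combined with PyTorch's DataLoader is given below. It uses only arguments from the signature above; the helper name, the batch size, and the choice of resizing and centre-cropping images to 256×256 so that they stack into a batch are assumptions for illustration. Imports are deferred so that defining the helper does not require torch, torchvision, or bioscan_dataset to be installed.

```python
def build_bioscan5m_loader(root, batch_size=64, download=False):
    """Build a DataLoader over the BIOSCAN-5M "train" partition (sketch)."""
    # Deferred imports: only needed when the function is actually called.
    from torch.utils.data import DataLoader
    from torchvision import transforms
    from bioscan_dataset import BIOSCAN5M

    # Resize and crop so every image tensor in a batch has the same shape
    # (an assumption about what the downstream model expects).
    image_transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(256),
        transforms.ToTensor(),
    ])
    dataset = BIOSCAN5M(
        root,
        split="train",
        modality=("image", "dna"),
        target_type="species",
        transform=image_transform,
        download=download,
    )
    # With the default tuple output, each batch is (images, barcodes, targets);
    # the default collate function leaves the DNA strings as a list.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)


# Hypothetical usage (the root directory is a placeholder):
# loader = build_bioscan5m_loader("path/to/data", download=True)
# images, barcodes, targets = next(iter(loader))
```

Because missing targets are encoded as -1 when target_format="index", a classification loss will typically need to skip those entries.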

metadata

The metadata associated with the samples in the selected split, loaded using load_bioscan5m_metadata().

Type: pandas.DataFrame

__getitem__(index: int) → Tuple[Any, ...][source]

Get a sample from the dataset.

Parameters:

index (int) – Index of the sample to retrieve.

Returns:

If output_format="tuple", the output will be a tuple containing:

image : PIL.Image.Image or Any
The image, if the "image" modality is requested, optionally transformed by the transform pipeline.
dna : str or Any
The DNA barcode, if the "dna" modality is requested, optionally transformed by the dna_transform pipeline.
*modalities : Any
Any other modalities requested, as specified in the modality parameter. The data is extracted from the appropriate column in the metadata CSV file, without any transformations. Missing values will be filled with NaN.
target : int or Tuple[int, …] or str or Tuple[str, …] or None
The target(s), optionally transformed by the target_transform pipeline. If target_format="index", the target(s) will be returned as integer indices, with missing values filled with -1. If target_format="text", the target(s) will be returned as a string. If there are multiple targets, they will be returned as a tuple. If target_type is an empty list, the output target will be None.

If output_format="dict", the output will be a dictionary with keys and values as follows:

keys for each of the modalities specified in the modality parameter, with corresponding values as described above. The values for the image and DNA barcode modalities are transformed by their respective pipelines if specified.
keys for each of the targets specified in target_type, with corresponding value equal to that target’s label (e.g. out["species"] == "Gnamptogenys sulcata")
for each of the keys in target_type, the corresponding index column ({target}_index), with value equal to that target’s index (e.g. out["species_index"] == 240)
the key "target", whose contents are as described above

Changed in version 1.3.0: Added support for output_format="dict".

Return type:

tuple or dict

download() → None[source]: Download and extract the data.

index2label(index: int | List[int] | ndarray[tuple[int, ...], dtype[int64]], column: str | None = None) → str | ndarray[tuple[int, ...], dtype[str_]][source]

Convert target’s integer index to text label.

Added in version 1.1.0.

Parameters:

index (int or array_like[int]) – The integer index or indices to map to labels.
column (str, default=same as self.target_type) – The dataset column name to map. This should be one of the possible values for target_type. By default, the column name is the target_type used for the class, provided it is a single value.

Returns:

The text label or labels corresponding to the integer index or indices in the specified column. Entries containing missing values, indicated by negative indices, are mapped to an empty string.

Return type:

str or numpy.array[str]

Examples

>>> dataset.index2label(4, "order")
'Diptera'
>>> dataset.index2label([4, 9, -1, 4], "order")
array(['Diptera', 'Lepidoptera', '', 'Diptera'], dtype=object)

label2index(label: str | Iterable[str], column: str | None = None) → int | ndarray[tuple[int, ...], dtype[int64]][source]

Convert target’s text label to integer index.

Added in version 1.1.0.

Parameters:

label (str or Iterable[str]) – The text label or labels to map to integer indices.
column (str, default=same as self.target_type) – The dataset column name to map. This should be one of the possible values for target_type. By default, the column name is the target_type used for the class, provided it is a single value.

Returns:

The integer index or indices corresponding to the text label or labels in the specified column. Entries containing missing values, indicated by empty strings or NaN values, are mapped to -1.

Return type:

int or numpy.array[int]

Examples

>>> dataset.label2index("Diptera", "order")
4
>>> dataset.label2index(["Diptera", "Lepidoptera", "", "Diptera"], "order")
array([4, 9, -1, 4])

bioscan_dataset.load_bioscan5m_metadata(metadata_path, max_nucleotides: int | None = 660, reduce_repeated_barcodes: bool = False, split: str | None = None, dtype: str | dict | None = MetadataDtype.DEFAULT, **kwargs) → DataFrame[source]

Load BIOSCAN-5M metadata from its CSV file and prepare it for training.

Parameters:

metadata_path (str) – Path to the metadata CSV file.
max_nucleotides (int, default=660) –
Maximum nucleotide sequence length to keep for the DNA barcodes. Set to None to keep the original data without truncation.

Note

COI DNA barcodes are typically 658 base pairs long for insects (Elbrecht et al., 2019), and an additional two base pairs are included as a buffer for the primer sequence. Although the BIOSCAN-5M dataset itself contains longer sequences, characters after the first 660 base pairs are likely to be inaccurate reads, and not part of the DNA barcode. Hence we recommend limiting the DNA barcode to the first 660 nucleotides. If you don’t know much about DNA barcodes, you probably shouldn’t change this parameter.
reduce_repeated_barcodes (bool, default=False) – Whether to reduce the dataset to only one sample per barcode. If True, duplicated barcodes are removed after truncating the barcodes to the length specified by max_nucleotides and stripping trailing Ns. If False (default) no reduction is performed.
split (str, optional) –
The dataset partition to return. One of:
- "pretrain"
- "train"
- "val"
- "test"
- "key_unseen"
- "val_unseen"
- "test_unseen"
- "other_heldout"
- "all", which is the union of all splits
- "seen", which is the union of {train, val, test}
- "unseen", which is the union of {key_unseen, val_unseen, test_unseen}
If split is None or "all" (default), the data is not filtered by partition and the dataframe will contain every sample in the dataset.

The split parameter can also be specified as a collection of partitions joined by "+". For example, "pretrain+train" will filter the metadata to samples in either the pretraining or training partitions.
**kwargs – Additional keyword arguments to pass to pandas.read_csv().

Returns:

The metadata DataFrame.

Return type:

pandas.DataFrame
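The union aliases listed for the split parameter ("seen" and "unseen") can be pictured as a simple expansion into the underlying partitions. The sketch below is illustrative only and not taken from the library; the names are hypothetical, and whether the library accepts aliases inside a "+"-joined specification is an assumption of this example.

```python
# Union aliases as documented for the BIOSCAN-5M split parameter.
SPLIT_ALIASES = {
    "seen": ["train", "val", "test"],
    "unseen": ["key_unseen", "val_unseen", "test_unseen"],
}


def expand_split(split):
    """Expand a split specification into a list of concrete partition names.

    Returns None when no partition filtering is requested (None or "all").
    """
    if split is None or split == "all":
        return None
    partitions = []
    for name in split.split("+"):
        for partition in SPLIT_ALIASES.get(name, [name]):
            if partition not in partitions:  # keep order, drop repeats
                partitions.append(partition)
    return partitions


print(expand_split("unseen"))         # ['key_unseen', 'val_unseen', 'test_unseen']
print(expand_split("pretrain+seen"))  # ['pretrain', 'train', 'val', 'test']
print(expand_split("train+seen"))     # ['train', 'val', 'test']
```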

Canadian Invertebrates Dataset

class bioscan_dataset.CanadianInvertebrates(root, split: str = 'train', modality: str | Iterable[str] = 'dna', reduce_repeated_barcodes: bool = False, max_nucleotides: int | None = 660, target_type: str | Iterable[str] = 'species', target_format: str = 'index', output_format: str = 'tuple', dna_transform: Callable | None = None, target_transform: Callable | None = None, download: bool = False)[source]

Bases: Dataset

Canadian Invertebrates Dataset.

Added in version 1.4.0.

Parameters:

root (str) – The root directory, to contain the downloaded tarball files and CanadianInvertebrates data directory.
split (str, default="train") –
The dataset partition. One of:
- "pretrain"
- "train"
- "val"
- "test"
- "test_unseen"
- "all", which is the union of all splits
- "seen", which is the union of {train, val, test}
- "unseen", which is an alias for "test_unseen"
Set to "all" to include all splits.

The split parameter can also be specified as a collection of partitions joined by "+". For example, split="pretrain+train" will return a dataset comprising the pretraining and training partitions.
modality (str or Iterable[str], default="dna") – Which data modalities to use. This dataset only has one modality: the DNA barcode. The modality parameter is only present to provide a consistent interface with the other dataset classes in this package.
reduce_repeated_barcodes (bool, default=False) –
Whether to reduce the dataset to only one sample per barcode. If True, duplicated barcodes are removed after truncating the barcodes to the length specified by max_nucleotides and stripping trailing Ns. If False (default) no additional reduction of repeated barcodes is performed.

Note

This version of the dataset is already reduced to one sample per full-length barcode compared to the original Canadian Invertebrates 1.5M dataset (deWaard et al., 2019). Even with the parameter reduce_repeated_barcodes=False, the dataset will have at most 965,289 samples, and not 1.5M. For more details on the dataset preprocessing steps, see Millan Arias et al. (2024).
max_nucleotides (int, default=660) –
Maximum number of nucleotides to keep in the DNA barcode. Set to None to keep the original data without truncation.

Note

COI DNA barcodes are typically 658 base pairs long for insects (Elbrecht et al., 2019), and an additional two base pairs are included as a buffer for the primer sequence. Hence we recommend limiting the DNA barcode to the first 660 nucleotides. If you don’t know much about DNA barcodes, you probably shouldn’t change this parameter.
target_type (str or Iterable[str], default="species") –
Type of target to use. One of, or a list of:
- "phylum"
- "class"
- "order"
- "family"
- "subfamily"
- "genus"
- "species"
- "dna_bin" (a species-level label derived from DNA barcode clustering by BOLD)
target_format (str, default="index") – Format in which the targets will be returned. One of: "index", "text". If this is set to "index" (default), target(s) will each be returned as integer indices, each of which corresponds to a value for that taxonomic rank in a look-up-table. Missing values will be filled with -1. This format is appropriate for use in classification tasks. If this is set to "text", the target(s) will each be returned as a string, appropriate for processing with language models.
output_format (str, default="tuple") – Format in which the output of __getitem__() will be returned. One of: "tuple", "dict". If this is set to "tuple" (default), all modalities and targets will be returned together as a single tuple. If this is set to "dict", the output will be returned as a dictionary containing the modalities and targets as separate keys.
dna_transform (Callable, optional) – DNA barcode transformation pipeline.
target_transform (Callable, optional) – Label transformation pipeline.
download (bool, default=False) – If True, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
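Since dna_transform accepts any callable, one possible transform is a function that turns the barcode string into a fixed-length list of integer tokens. The sketch below is an example of such a callable and not part of the package; the vocabulary, the padding value, and the function name are assumptions.

```python
# Hypothetical vocabulary: anything other than A/C/G/T is treated as N.
NUCLEOTIDE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
PAD_TOKEN = 5


def encode_barcode(barcode, length=660):
    """Map a DNA string to a list of `length` integer tokens, padding the end."""
    tokens = [
        NUCLEOTIDE_TO_INT.get(base, NUCLEOTIDE_TO_INT["N"])
        for base in barcode.upper()[:length]
    ]
    return tokens + [PAD_TOKEN] * (length - len(tokens))


print(encode_barcode("ACGTN-", length=8))  # [0, 1, 2, 3, 4, 4, 5, 5]

# Hypothetical usage with the dataset class (root is a placeholder):
# from bioscan_dataset import CanadianInvertebrates
# dataset = CanadianInvertebrates("path/to/data", dna_transform=encode_barcode)
```

Padding to the same length as max_nucleotides keeps every encoded barcode the same size, which makes batching straightforward.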

metadata

The metadata associated with the samples in the selected split, loaded using load_canadian_invertebrates_metadata().

Type: pandas.DataFrame

__getitem__(index: int) → Tuple[Any, ...][source]

Get a sample from the dataset.

Parameters:

index (int) – Index of the sample to retrieve.

Returns:

If output_format="tuple", the output will be a tuple containing:

dna : str or Any
The DNA barcode, if the "dna" modality is requested, optionally transformed by the dna_transform pipeline.
*modalities : Any
Any other modalities requested, as specified in the modality parameter. The data is extracted from the appropriate column in the metadata CSV file, without any transformations. Missing values will be filled with NaN.
target : int or Tuple[int, …] or str or Tuple[str, …] or None
The target(s), optionally transformed by the target_transform pipeline. If target_format="index", the target(s) will be returned as integer indices, with missing values filled with -1. If target_format="text", the target(s) will be returned as a string. If there are multiple targets, they will be returned as a tuple. If target_type is an empty list, the output target will be None.

If output_format="dict", the output will be a dictionary with keys and values as follows:

keys for each of the modalities specified in the modality parameter, with corresponding values as described above.
keys for each of the targets specified in target_type, with corresponding value equal to that target’s label (e.g. out["species"] == "Gnamptogenys sulcata")
for each of the keys in target_type, the corresponding index column ({target}_index), with value equal to that target’s index (e.g. out["species_index"] == 240)
the key "target", whose contents are as described above

Return type:

tuple or dict

download() → None[source]: Download and extract the data.

index2label(index: int | List[int] | ndarray[tuple[int, ...], dtype[int64]], column: str | None = None) → str | ndarray[tuple[int, ...], dtype[str_]][source]

Convert target’s integer index to text label.

Parameters:

index (int or array_like[int]) – The integer index or indices to map to labels.
column (str, default=same as self.target_type) – The dataset column name to map. This should be one of the possible values for target_type. By default, the column name is the target_type used for the class, provided it is a single value.

Returns:

The text label or labels corresponding to the integer index or indices in the specified column. Entries containing missing values, indicated by negative indices, are mapped to an empty string.

Return type:

str or numpy.ndarray[str]

Examples

>>> dataset.index2label(29, "order")
'Diptera'
>>> dataset.index2label([4, 9, -1, 4], "order")
array(['Anomopoda', 'Araneae', '', 'Anomopoda'], dtype=object)

label2index(label: str | Iterable[str], column: str | None = None) → int | ndarray[tuple[int, ...], dtype[int64]][source]

Convert target’s text label to integer index.

Parameters:

label (str or Iterable[str]) – The text label or labels to map to integer indices.
column (str, default=same as self.target_type) – The dataset column name to map. This should be one of the possible values for target_type. By default, the column name is the target_type used for the class, provided it is a single value.

Returns:

The integer index or indices corresponding to the text label or labels in the specified column. Entries containing missing values, indicated by empty strings or NaN values, are mapped to -1.

Return type:

int or numpy.ndarray[int]

Examples

>>> dataset.label2index("Diptera", "order")
29
>>> dataset.label2index(["Diptera", "Lepidoptera", "", "Diptera"], "order")
array([29, 45, -1, 29])
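
The missing-value conventions of index2label() and label2index() can be sketched with a toy vocabulary. This is not the library's implementation: the three-entry vocabulary and the helpers sketch_index2label and sketch_label2index are hypothetical, and the real dataset defines its own index order. The sketch also maps unrecognised labels to -1, which is an assumption and not something stated above.

```python
import numpy as np

# Hypothetical label vocabulary; the real dataset defines its own ordering.
labels = np.array(["Araneae", "Diptera", "Lepidoptera"], dtype=object)
label_to_idx = {name: i for i, name in enumerate(labels)}


def sketch_index2label(index):
    """Map integer indices to labels; negative (missing) indices become ''."""
    idx = np.atleast_1d(np.asarray(index))
    out = np.full(idx.shape, "", dtype=object)
    known = idx >= 0
    out[known] = labels[idx[known]]
    # Return a plain string for scalar input, an array otherwise.
    return out[0] if np.ndim(index) == 0 else out


def sketch_label2index(label):
    """Map labels to integer indices; '' (and anything unknown) becomes -1."""
    if isinstance(label, str):
        return label_to_idx.get(label, -1)
    return np.array([label_to_idx.get(x, -1) for x in label])


# Round trip: missing entries survive as '' and -1 respectively.
round_trip = sketch_index2label(sketch_label2index(["Diptera", "", "Araneae"]))
print(round_trip)
```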

bioscan_dataset.load_canadian_invertebrates_metadata(metadata_path, max_nucleotides: int | None = 660, reduce_repeated_barcodes: bool = False, split: str | None = None, dtype: str | dict | None = MetadataDtype.DEFAULT, **kwargs) → DataFrame[source]

Load Canadian Invertebrates dataset metadata from its CSV file and prepare it for training.

Added in version 1.4.0.

Parameters:

metadata_path (str) – Path to the metadata CSV file.
max_nucleotides (int, default=660) –
Maximum nucleotide sequence length to keep for the DNA barcodes. Set to None to keep the original data without truncation.

Note

COI DNA barcodes are typically 658 base pairs long for insects (Elbrecht et al., 2019), and an additional two base pairs are included as a buffer for the primer sequence. Although the Canadian Invertebrates dataset itself contains longer sequences, characters after the first 660 base pairs are likely to be inaccurate reads, and not part of the DNA barcode. Hence we recommend limiting the DNA barcode to the first 660 nucleotides. If you don’t know much about DNA barcodes, you probably shouldn’t change this parameter.
reduce_repeated_barcodes (bool, default=False) –
Whether to reduce the dataset to only one sample per barcode. If True, duplicated barcodes are removed after truncating the barcodes to the length specified by max_nucleotides and stripping trailing Ns. If False (default) no reduction is performed.

Note

This version of the dataset is already reduced to one sample per full-length barcode compared to the original Canadian Invertebrates 1.5M dataset (deWaard et al., 2019). Even with the parameter reduce_repeated_barcodes=False, the dataset will have at most 965,289 samples, and not 1.5M. For more details on the dataset preprocessing steps, see Millan Arias et al. (2024).
split (str, optional) –
The dataset partition to return. One of:
- "pretrain"
- "train"
- "val"
- "test"
- "test_unseen"
- "all", which is the union of all splits
- "seen", which is the union of {train, val, test}
- "unseen", which is an alias for "test_unseen"
If split is None (default) or "all", the data is not filtered by partition and the dataframe will contain every sample in the dataset.

The split parameter can also be specified as a collection of partitions joined by "+". For example, "pretrain+train" will filter the metadata to samples in either the pretraining or training partitions.
**kwargs – Additional keyword arguments to pass to pandas.read_csv().

Returns:

The metadata DataFrame.

Return type:

pandas.DataFrame
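
To make the max_nucleotides, reduce_repeated_barcodes, and split parameters more concrete, here is a minimal sketch on a toy DataFrame. It is not the library's implementation: the function sketch_prepare, the column names "dna_barcode" and "split", and the order of the steps are assumptions made for illustration. In particular, the sketch strips trailing Ns only when comparing barcodes for duplicates and leaves the stored barcodes otherwise untouched.

```python
import pandas as pd

# Toy metadata; the column names "dna_barcode" and "split" are assumptions.
df = pd.DataFrame(
    {
        "dna_barcode": ["ACGTANNN", "ACGTA", "ACGTTTTT", "GGGGCCCC"],
        "split": ["pretrain", "train", "val", "test_unseen"],
    }
)


def sketch_prepare(df, max_nucleotides=None, reduce_repeated_barcodes=False, split=None):
    """Truncate barcodes, optionally drop repeats, then filter by partition."""
    out = df.copy()
    if max_nucleotides is not None:
        # Keep only the first max_nucleotides characters of each barcode.
        out["dna_barcode"] = out["dna_barcode"].str[:max_nucleotides]
    if reduce_repeated_barcodes:
        # Compare barcodes with trailing Ns stripped; keep the first of each.
        key = out["dna_barcode"].str.rstrip("N")
        out = out[~key.duplicated()]
    if split is not None and split != "all":
        wanted = set()
        for part in split.split("+"):
            if part == "seen":
                wanted |= {"train", "val", "test"}
            elif part == "unseen":
                wanted.add("test_unseen")
            else:
                wanted.add(part)
        out = out[out["split"].isin(wanted)]
    return out.reset_index(drop=True)


reduced = sketch_prepare(df, max_nucleotides=6, reduce_repeated_barcodes=True)
print(reduced["dna_barcode"].tolist())
print(sketch_prepare(df, split="pretrain+train")["split"].tolist())
```

With the toy data, truncating to 6 characters turns the first barcode into "ACGTAN", which matches "ACGTA" once the trailing N is stripped, so the second row is dropped as a repeat.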