API Reference
We provide BIOSCAN1M, BIOSCAN5M, and CanadianInvertebrates classes to load the respective BIOSCAN-1M, BIOSCAN-5M, and Canadian Invertebrates datasets for use within PyTorch.
These classes are subclasses of torch.utils.data.Dataset and are designed to be used with PyTorch’s DataLoader for batching and model training.
General usage instructions for BIOSCAN1M, BIOSCAN5M, and CanadianInvertebrates are provided in our usage guide.
Tip
For new projects, we recommend using BIOSCAN5M instead of BIOSCAN1M since the newer dataset has cleaner labels and images.
For larger scale projects, BIOSCAN5M is a superset of BIOSCAN1M and will provide five times more samples to train on.
On the other hand, if 5 million samples are too many to handle, you can ignore the "pretrain" partition and train on the "train" partition only, which reduces the dataset to fewer than 400k samples.
The accompanying functions load_bioscan1m_metadata(), load_bioscan5m_metadata(), and load_canadian_invertebrates_metadata() can be used to load the metadata from the CSV files.
This produces a DataFrame in the same format as is used for model training.
These functions do not need to be manually called when you are using BIOSCAN1M, BIOSCAN5M, and CanadianInvertebrates to work with the datasets.
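As a rough sketch of how the constructor arguments documented below fit together, the following collects keyword arguments for BIOSCAN5M and defers the import until a dataset is actually built. The helper name build_bioscan5m and the BIOSCAN5M_KWARGS dictionary are illustrative and not part of the package; the argument names and values are taken from the class signature in this reference.

```python
# Keyword arguments taken from the BIOSCAN5M signature documented below.
BIOSCAN5M_KWARGS = {
    "split": "train",              # skip the large "pretrain" partition
    "modality": ("image", "dna"),
    "target_type": "species",
    "target_format": "index",      # integer class indices, -1 when missing
    "output_format": "dict",       # dictionary output instead of a tuple
    "download": False,             # set to True to fetch the data on first use
}


def build_bioscan5m(root, **overrides):
    """Hypothetical helper: construct a BIOSCAN5M dataset rooted at `root`."""
    # Imported lazily so that this sketch can be read and checked without
    # the package or the dataset files being available.
    from bioscan_dataset import BIOSCAN5M

    kwargs = {**BIOSCAN5M_KWARGS, **overrides}
    return BIOSCAN5M(root, **kwargs)
```

The returned object can be wrapped in a DataLoader like any other PyTorch dataset. Images are returned as PIL objects unless a transform is given, so a transform that converts them to tensors of a common size is generally needed before default batching.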
BIOSCAN-1M Dataset
- class bioscan_dataset.BIOSCAN1M(root, split: str = 'train', partitioning_version: str = 'large_diptera_family', modality: str | Iterable[str] = ('image', 'dna'), image_package: str = 'cropped_256', reduce_repeated_barcodes: bool = False, max_nucleotides: int | None = 660, target_type: str | Iterable[str] = 'family', target_format: str = 'index', output_format: str = 'tuple', transform: Callable | None = None, dna_transform: Callable | None = None, target_transform: Callable | None = None, download: bool = False)[source]
Bases: VisionDataset
BIOSCAN-1M Dataset.
- Parameters:
root (str) – The root directory, to contain the downloaded tarball files and bioscan1m data directory.
split (str, default="train") –
The dataset partition. For the BIOSCAN-1M partitioning versions ({large/medium/small}_{diptera_family/insect_order}), this should be one of:
- "train"
- "validation"
- "test"
- "no_split" (unused by experiments in the BIOSCAN-1M paper)
For the CLIBD partitioning version, this should be one of:
- "all_keys" (the keys are used as a reference set for retrieval tasks)
- "no_split" (equivalent to "pretrain" in BIOSCAN-5M; these samples are not labelled to species level)
- "no_split_and_seen_train" (used for CLIBD model training; equivalent to using "pretrain+train" in BIOSCAN-5M)
- "seen_keys"
- "single_species"
- "test_seen" (similar to "test" in BIOSCAN-5M)
- "test_unseen"
- "test_unseen_keys" (similar to "key_unseen" in BIOSCAN-5M)
- "train_seen" (similar to "train" in BIOSCAN-5M)
- "val_seen" (similar to "val" in BIOSCAN-5M)
- "val_unseen"
- "val_unseen_keys"
Additionally, BIOSCAN5M split names are accepted as aliases for the corresponding CLIBD partitions.
If split is None or "all", the data is not filtered by partition and the dataframe will contain every sample in the dataset.
The split parameter can also be specified as a collection of partitions joined by "+". For example, split="train+validation+test" will return a dataset comprising the samples in the training, validation, and test partitions.
Warning
The contents of the split depend on the value of partitioning_version. If partitioning_version is changed, the same split value will yield completely different records.
partitioning_version (str, default="large_diptera_family") –
The dataset partitioning version, one of:
- "large_diptera_family"
- "medium_diptera_family"
- "small_diptera_family"
- "large_insect_order"
- "medium_insect_order"
- "small_insect_order"
- "clibd"
The "clibd" partitioning version was introduced by the paper CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale, whilst the other partitions were introduced in the BIOSCAN-1M paper.
To use the CLIBD partitioning, download and extract the partition files from here into the "{root}/bioscan1m/" directory. These files are automatically downloaded if download=True.
Attention
The original BIOSCAN-1M partitioning versions only support target_type up to family and order level, respectively. For more fine-grained taxonomic labels, we recommend using the CLIBD partitioning, which supports target_type up to species level.
Changed in version 1.2.0: Added support for CLIBD partitioning.
modality (str or Iterable[str], default=("image", "dna")) –
Which data modalities to use. One of, or a list of: "image", "dna", or any column name in the metadata TSV file.
Changed in version 1.1.0: Added support for arbitrary modalities.
image_package (str, default="cropped_256") –
The package to load images from. One of: "original_full", "cropped", "original_256", "cropped_256".
Added in version 1.1.0.
reduce_repeated_barcodes (bool, default=False) – Whether to reduce the dataset to only one sample per barcode.
max_nucleotides (int, default=660) –
Maximum number of nucleotides to keep in the DNA barcode. Set to None to keep the original data without truncation.
Note
COI DNA barcodes are typically 658 base pairs long for insects (Elbrecht et al., 2019), and an additional two base pairs are included as a buffer for the primer sequence. Although the BIOSCAN-1M dataset itself contains longer sequences, characters after the first 660 base pairs are likely to be inaccurate reads, and not part of the DNA barcode. Hence we recommend limiting the DNA barcode to the first 660 nucleotides. If you don’t know much about DNA barcodes, you probably shouldn’t change this parameter.
target_type (str or Iterable[str], default="family") –
Type of target to use. One of, or a list of:
- "phylum"
- "class"
- "order"
- "family"
- "subfamily"
- "tribe"
- "genus"
- "species"
- "uri" (equivalent to "dna_bin"; a species-level label derived from DNA barcode clustering by BOLD)
Where "uri" corresponds to the BIN cluster label.
target_format (str, default="index") –
Format in which the targets will be returned. One of: "index", "text". If this is set to "index" (default), target(s) will each be returned as integer indices, each of which corresponds to a value for that taxonomic rank in a look-up-table. Missing values will be filled with -1. This format is appropriate for use in classification tasks. If this is set to "text", the target(s) will each be returned as a string, appropriate for processing with language models.
Added in version 1.1.0.
output_format (str, default="tuple") –
Format in which the output of __getitem__() will be returned. One of: "tuple", "dict". If this is set to "tuple" (default), all modalities and targets will be returned together as a single tuple. If this is set to "dict", the output will be returned as a dictionary containing the modalities and targets as separate keys.
Added in version 1.3.0.
transform (Callable, optional) – Image transformation pipeline.
dna_transform (Callable, optional) – DNA barcode transformation pipeline.
target_transform (Callable, optional) – Label transformation pipeline.
download (bool, default=False) –
If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again. Images are only downloaded if the "image" modality is requested. Note that only image_package values "cropped_256" and "original_256" are currently supported for automatic image download.
Added in version 1.2.0.
- metadata
The metadata associated with the samples in the selected split, loaded using load_bioscan1m_metadata().
- Type: pandas.DataFrame
- __getitem__(index: int) Tuple[Any, ...][source]
Get a sample from the dataset.
- Parameters:
index (int) – Index of the sample to retrieve.
- Returns:
If output_format="tuple", the output will be a tuple containing:
- image : PIL.Image.Image or Any
The image, if the "image" modality is requested, optionally transformed by the transform pipeline.
- dna : str or Any
The DNA barcode, if the "dna" modality is requested, optionally transformed by the dna_transform pipeline.
- *modalities : Any
Any other modalities requested, as specified in the modality parameter. The data is extracted from the appropriate column in the metadata TSV file, without any transformations. Missing values will be filled with NaN.
- target : int or Tuple[int, …] or str or Tuple[str, …] or None
The target(s), optionally transformed by the target_transform pipeline. If target_format="index", the target(s) will be returned as integer indices, with missing values filled with -1. If target_format="text", the target(s) will be returned as a string. If there are multiple targets, they will be returned as a tuple. If target_type is an empty list, the output target will be None.
If output_format="dict", the output will be a dictionary with keys and values as follows:
- keys for each of the modalities specified in the modality parameter, with corresponding values as described above. The values for the image and DNA barcode modalities are transformed by their respective pipelines if specified.
- keys for each of the targets specified in target_type, with corresponding value equal to that target’s label (e.g. out["family"] == "Gelechiidae")
- for each of the keys in target_type, the corresponding index column ({target}_index), with value equal to that target’s index (e.g. out["family_index"] == 206)
- the key "target", whose contents are as described above
Changed in version 1.3.0: Added support for output_format="dict".
- Return type: Tuple[Any, ...] or dict
- index2label(index: int | List[int] | ndarray[tuple[int, ...], dtype[int64]], column: str | None = None) str | ndarray[tuple[int, ...], dtype[str_]][source]
Convert target’s integer index to text label.
Added in version 1.1.0.
- Parameters:
index (int or array_like[int]) – The integer index or indices to map to labels.
column (str, default=same as self.target_type) – The dataset column name to map. This should be one of the possible values for target_type. By default, the column name is the target_type used for the class, provided it is a single value.
- Returns:
The text label or labels corresponding to the integer index or indices in the specified column. Entries containing missing values, indicated by negative indices, are mapped to an empty string.
- Return type: str or numpy.ndarray
- label2index(label: str | Iterable[str], column: str | None = None) int | ndarray[tuple[int, ...], dtype[int64]][source]
Convert target’s text label to integer index.
Added in version 1.1.0.
- Parameters:
label (str or Iterable[str]) – The text label or labels to map to integer indices.
column (str, default=same as self.target_type) – The dataset column name to map. This should be one of the possible values for target_type. By default, the column name is the target_type used for the class, provided it is a single value.
- Returns:
The integer index or indices corresponding to the text label or labels in the specified column. Entries containing missing values, indicated by empty strings or NaN values, are mapped to -1.
- Return type: int or numpy.ndarray
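The mapping performed by index2label() and label2index() can be pictured with a toy look-up table. The sketch below is not the package's implementation: the LabelLookup class and is_missing helper are hypothetical, indices are assumed to follow sorted label order, and plain lists are returned where the real methods return NumPy arrays. It only mirrors the documented conventions that missing labels map to -1 and negative indices map to an empty string.

```python
def is_missing(label):
    # None, empty strings, and NaN (the only value unequal to itself) are missing.
    return label is None or label == "" or label != label


class LabelLookup:
    """Toy look-up table for one taxonomic rank (illustrative only)."""

    def __init__(self, labels):
        # Assumption: indices are assigned in sorted order of the distinct labels.
        self.labels = sorted({label for label in labels if not is_missing(label)})
        self.indices = {label: i for i, label in enumerate(self.labels)}

    def label2index(self, label):
        if isinstance(label, str) or is_missing(label):
            return -1 if is_missing(label) else self.indices[label]
        return [self.label2index(item) for item in label]

    def index2label(self, index):
        if isinstance(index, int):
            return "" if index < 0 else self.labels[index]
        return [self.index2label(item) for item in index]


lookup = LabelLookup(["Diptera", "Lepidoptera", "", "Coleoptera", "Diptera"])
print(lookup.label2index(["Diptera", "", "Lepidoptera"]))  # [1, -1, 2]
print(lookup.index2label([1, -1, 2]))  # ['Diptera', '', 'Lepidoptera']
```

The indices printed here depend on the toy label set and will not match the indices used by the actual datasets.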
- bioscan_dataset.load_bioscan1m_metadata(metadata_path, max_nucleotides: int | None = 660, reduce_repeated_barcodes: bool = False, split: str | None = None, partitioning_version: str = 'large_diptera_family', clibd_partitioning_path: str | None = None, dtype: str | dict | None = MetadataDtype.DEFAULT, **kwargs) DataFrame[source]
Load BIOSCAN-1M metadata from its TSV file, and prepare it for training.
- Parameters:
metadata_path (str) – Path to metadata file.
max_nucleotides (int, default=660) –
Maximum nucleotide sequence length to keep for the DNA barcodes. Set to None to keep the original data without truncation.
Note
COI DNA barcodes are typically 658 base pairs long for insects (Elbrecht et al., 2019), and an additional two base pairs are included as a buffer for the primer sequence. Although the BIOSCAN-1M dataset itself contains longer sequences, characters after the first 660 base pairs are likely to be inaccurate reads, and not part of the DNA barcode. Hence we recommend limiting the DNA barcode to the first 660 nucleotides. If you don’t know much about DNA barcodes, you probably shouldn’t change this parameter.
reduce_repeated_barcodes (str or bool, default=False) – Whether to reduce the dataset to only one sample per barcode. If True, duplicated barcodes are removed after truncating the barcodes to the length specified by max_nucleotides and stripping trailing Ns. If False (default), no reduction is performed.
split (str, optional) –
The dataset partition. For the BIOSCAN-1M partitioning versions ({large/medium/small}_{diptera_family/insect_order}), this should be one of:
- "train"
- "validation"
- "test"
- "no_split" (unused by experiments in the BIOSCAN-1M paper)
For the CLIBD partitioning version, this should be one of:
- "all_keys" (the keys are used as a reference set for retrieval tasks)
- "no_split" (equivalent to "pretrain" in BIOSCAN-5M; these samples are not labelled to species level)
- "no_split_and_seen_train" (used for CLIBD model training; equivalent to using "pretrain+train" in BIOSCAN-5M)
- "seen_keys"
- "single_species"
- "test_seen" (similar to "test" in BIOSCAN-5M)
- "test_unseen"
- "test_unseen_keys" (similar to "key_unseen" in BIOSCAN-5M)
- "train_seen" (similar to "train" in BIOSCAN-5M)
- "val_seen" (similar to "val" in BIOSCAN-5M)
- "val_unseen"
- "val_unseen_keys"
Additionally, BIOSCAN5M split names are accepted as aliases for the corresponding CLIBD partitions.
If split is None or "all" (default), the data is not filtered by partition and the dataframe will contain every sample in the dataset.
The split parameter can also be specified as a collection of partitions joined by "+". For example, "train+validation+test" will filter the metadata to samples in the training, validation, and test partitions.
Warning
The contents of the split depend on the value of partitioning_version. If partitioning_version is changed, the same split value will yield completely different records.
partitioning_version (str, default="large_diptera_family") –
The dataset partitioning version, one of:
- "large_diptera_family"
- "medium_diptera_family"
- "small_diptera_family"
- "large_insect_order"
- "medium_insect_order"
- "small_insect_order"
- "clibd"
The "clibd" partitioning version was introduced by the paper CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale, whilst the other partitions were introduced in the BIOSCAN-1M paper.
To use the CLIBD partitioning, download and extract the partition files from here into the same directory as the metadata TSV file.
Changed in version 1.2.0: Added support for CLIBD partitioning.
clibd_partitioning_path (str, optional) – Path to the CLIBD_partitioning directory. By default, this is a subdirectory named "CLIBD_partitioning" in the directory containing metadata_path.
**kwargs – Additional keyword arguments to pass to pandas.read_csv().
- Returns:
df – The metadata DataFrame. If the CLIBD partitioning files are present, the DataFrame will contain an additional column named "clibd_split" which indicates the CLIBD split for each sample.
- Return type: pandas.DataFrame
BIOSCAN-5M Dataset
- class bioscan_dataset.BIOSCAN5M(root, split: str = 'train', modality: str | Iterable[str] = ('image', 'dna'), image_package: str = 'cropped_256', reduce_repeated_barcodes: bool = False, max_nucleotides: int | None = 660, target_type: str | Iterable[str] = 'species', target_format: str = 'index', output_format: str = 'tuple', transform: Callable | None = None, dna_transform: Callable | None = None, target_transform: Callable | None = None, download: bool = False)[source]
Bases: VisionDataset
BIOSCAN-5M Dataset.
- Parameters:
root (str) – The root directory, to contain the downloaded tarball files and bioscan5m data directory.
split (str, default="train") –
The dataset partition. One of:
- "pretrain"
- "train"
- "val"
- "test"
- "key_unseen"
- "val_unseen"
- "test_unseen"
- "other_heldout"
- "all", which is the union of all splits
- "seen", which is the union of {train, val, test}
- "unseen", which is the union of {key_unseen, val_unseen, test_unseen}
Set to "all" to include all splits.
The split parameter can also be specified as a collection of partitions joined by "+". For example, split="pretrain+train" will return a dataset comprising the pretraining and training partitions.
Note
There is distributional shift between the partitions, which means the validation and test accuracy are not directly comparable, and the accuracy for the seen and unseen splits are not directly comparable. For more details, please see Appendix R.3 of the BIOSCAN-5M paper.
modality (str or Iterable[str], default=("image", "dna")) –
Which data modalities to use. One of, or a list of: "image", "dna", or any column name in the metadata CSV file. Examples of column names which may be of interest are "coord-lat" (latitude of collection location), "coord-lon" (longitude of collection location), and "image_measurement_value" (specimen size, in pixels).
Changed in version 1.1.0: Added support for arbitrary modalities.
image_package (str, default="cropped_256") – The package to load images from. One of: "original_full", "cropped", "original_256", "cropped_256".
reduce_repeated_barcodes (bool, default=False) – Whether to reduce the dataset to only one sample per barcode.
max_nucleotides (int, default=660) –
Maximum number of nucleotides to keep in the DNA barcode. Set to None to keep the original data without truncation.
Note
COI DNA barcodes are typically 658 base pairs long for insects (Elbrecht et al., 2019), and an additional two base pairs are included as a buffer for the primer sequence. Although the BIOSCAN-5M dataset itself contains longer sequences, characters after the first 660 base pairs are likely to be inaccurate reads, and not part of the DNA barcode. Hence we recommend limiting the DNA barcode to the first 660 nucleotides. If you don’t know much about DNA barcodes, you probably shouldn’t change this parameter.
target_type (str or Iterable[str], default="species") –
Type of target to use. One of, or a list of:
- "phylum"
- "class"
- "order"
- "family"
- "subfamily"
- "genus"
- "species"
- "dna_bin" (a species-level label derived from DNA barcode clustering by BOLD)
target_format (str, default="index") –
Format in which the targets will be returned. One of: "index", "text". If this is set to "index" (default), target(s) will each be returned as integer indices, each of which corresponds to a value for that taxonomic rank in a look-up-table. Missing values will be filled with -1. This format is appropriate for use in classification tasks. If this is set to "text", the target(s) will each be returned as a string, appropriate for processing with language models.
Added in version 1.1.0.
output_format (str, default="tuple") –
Format in which the output of __getitem__() will be returned. One of: "tuple", "dict". If this is set to "tuple" (default), all modalities and targets will be returned together as a single tuple. If this is set to "dict", the output will be returned as a dictionary containing the modalities and targets as separate keys.
Added in version 1.3.0.
transform (Callable, optional) – Image transformation pipeline.
dna_transform (Callable, optional) – DNA barcode transformation pipeline.
target_transform (Callable, optional) – Label transformation pipeline.
download (bool, default=False) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again. Images are only downloaded if the "image" modality is requested. Note that only image_package="cropped_256" is supported for automatic image download.
- metadata
The metadata associated with the samples in the selected split, loaded using load_bioscan5m_metadata().
- Type: pandas.DataFrame
- __getitem__(index: int) Tuple[Any, ...][source]
Get a sample from the dataset.
- Parameters:
index (int) – Index of the sample to retrieve.
- Returns:
If output_format="tuple", the output will be a tuple containing:
- image : PIL.Image.Image or Any
The image, if the "image" modality is requested, optionally transformed by the transform pipeline.
- dna : str or Any
The DNA barcode, if the "dna" modality is requested, optionally transformed by the dna_transform pipeline.
- *modalities : Any
Any other modalities requested, as specified in the modality parameter. The data is extracted from the appropriate column in the metadata CSV file, without any transformations. Missing values will be filled with NaN.
- target : int or Tuple[int, …] or str or Tuple[str, …] or None
The target(s), optionally transformed by the target_transform pipeline. If target_format="index", the target(s) will be returned as integer indices, with missing values filled with -1. If target_format="text", the target(s) will be returned as a string. If there are multiple targets, they will be returned as a tuple. If target_type is an empty list, the output target will be None.
If output_format="dict", the output will be a dictionary with keys and values as follows:
- keys for each of the modalities specified in the modality parameter, with corresponding values as described above. The values for the image and DNA barcode modalities are transformed by their respective pipelines if specified.
- keys for each of the targets specified in target_type, with corresponding value equal to that target’s label (e.g. out["species"] == "Gnamptogenys sulcata")
- for each of the keys in target_type, the corresponding index column ({target}_index), with value equal to that target’s index (e.g. out["species_index"] == 240)
- the key "target", whose contents are as described above
Changed in version 1.3.0: Added support for output_format="dict".
- Return type: Tuple[Any, ...] or dict
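To make the two output layouts concrete, the sketch below arranges one made-up sample in both formats. It is an illustration of the description above rather than the package's code: the format_sample function is hypothetical, the image is a placeholder string, and placing the target last in the tuple is an assumption based on the order in which the fields are listed.

```python
def format_sample(modalities, labels, indices,
                  target_format="index", output_format="tuple"):
    """Arrange one sample as described for output_format (illustrative only)."""
    source = indices if target_format == "index" else labels
    values = tuple(source[name] for name in labels)
    if not values:
        target = None          # empty target_type
    elif len(values) == 1:
        target = values[0]     # a single target is returned on its own
    else:
        target = values        # multiple targets are returned as a tuple

    if output_format == "tuple":
        # Assumption: modalities come first, followed by the target.
        return (*modalities.values(), target)

    out = dict(modalities)
    for name in labels:
        out[name] = labels[name]                # text label, e.g. out["species"]
        out[f"{name}_index"] = indices[name]    # e.g. out["species_index"]
    out["target"] = target
    return out


modalities = {"image": "<PIL image>", "dna": "ACGT"}
labels = {"species": "Gnamptogenys sulcata"}
indices = {"species": 240}

as_tuple = format_sample(modalities, labels, indices)
as_dict = format_sample(modalities, labels, indices, output_format="dict")
print(as_tuple)            # ('<PIL image>', 'ACGT', 240)
print(sorted(as_dict))     # ['dna', 'image', 'species', 'species_index', 'target']
```

The dictionary layout carries both the text label and its index regardless of target_format, which is convenient when a training loop needs indices but logging needs readable names.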
- index2label(index: int | List[int] | ndarray[tuple[int, ...], dtype[int64]], column: str | None = None) str | ndarray[tuple[int, ...], dtype[str_]][source]
Convert target’s integer index to text label.
Added in version 1.1.0.
- Parameters:
index (int or array_like[int]) – The integer index or indices to map to labels.
column (str, default=same as self.target_type) – The dataset column name to map. This should be one of the possible values for target_type. By default, the column name is the target_type used for the class, provided it is a single value.
- Returns:
The text label or labels corresponding to the integer index or indices in the specified column. Entries containing missing values, indicated by negative indices, are mapped to an empty string.
- Return type: str or numpy.ndarray
Examples
>>> dataset.index2label([4], "order")
'Diptera'
>>> dataset.index2label([4, 9, -1, 4], "order")
array(['Diptera', 'Lepidoptera', '', 'Diptera'], dtype=object)
- label2index(label: str | Iterable[str], column: str | None = None) int | ndarray[tuple[int, ...], dtype[int64]][source]
Convert target’s text label to integer index.
Added in version 1.1.0.
- Parameters:
label (str or Iterable[str]) – The text label or labels to map to integer indices.
column (str, default=same as self.target_type) – The dataset column name to map. This should be one of the possible values for target_type. By default, the column name is the target_type used for the class, provided it is a single value.
- Returns:
The integer index or indices corresponding to the text label or labels in the specified column. Entries containing missing values, indicated by empty strings or NaN values, are mapped to -1.
- Return type: int or numpy.ndarray
Examples
>>> dataset.label2index("Diptera", "order")
4
>>> dataset.label2index(["Diptera", "Lepidoptera", "", "Diptera"], "order")
array([4, 9, -1, 4])
- bioscan_dataset.load_bioscan5m_metadata(metadata_path, max_nucleotides: int | None = 660, reduce_repeated_barcodes: bool = False, split: str | None = None, dtype: str | dict | None = MetadataDtype.DEFAULT, **kwargs) DataFrame[source]
Load BIOSCAN-5M metadata from its CSV file and prepare it for training.
- Parameters:
metadata_path (str) – Path to the metadata CSV file.
max_nucleotides (int, default=660) –
Maximum nucleotide sequence length to keep for the DNA barcodes. Set to None to keep the original data without truncation.
Note
COI DNA barcodes are typically 658 base pairs long for insects (Elbrecht et al., 2019), and an additional two base pairs are included as a buffer for the primer sequence. Although the BIOSCAN-5M dataset itself contains longer sequences, characters after the first 660 base pairs are likely to be inaccurate reads, and not part of the DNA barcode. Hence we recommend limiting the DNA barcode to the first 660 nucleotides. If you don’t know much about DNA barcodes, you probably shouldn’t change this parameter.
reduce_repeated_barcodes (bool, default=False) – Whether to reduce the dataset to only one sample per barcode. If True, duplicated barcodes are removed after truncating the barcodes to the length specified by max_nucleotides and stripping trailing Ns. If False (default), no reduction is performed.
split (str, optional) –
The dataset partition to return. One of:
- "pretrain"
- "train"
- "val"
- "test"
- "key_unseen"
- "val_unseen"
- "test_unseen"
- "other_heldout"
- "all", which is the union of all splits
- "seen", which is the union of {train, val, test}
- "unseen", which is the union of {key_unseen, val_unseen, test_unseen}
If split is None or "all" (default), the data is not filtered by partition and the dataframe will contain every sample in the dataset.
The split parameter can also be specified as a collection of partitions joined by "+". For example, "pretrain+train" will filter the metadata to samples in either the pretraining or training partitions.
**kwargs – Additional keyword arguments to pass to pandas.read_csv().
- Returns:
The metadata DataFrame.
- Return type: pandas.DataFrame
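The way a split expression such as "pretrain+train" resolves to a set of partitions can be sketched as follows. This is a simplified illustration using the BIOSCAN-5M partition names listed above; the expand_split function is hypothetical, and whether the real loader accepts union names such as "seen" inside a "+" expression is an assumption made here for illustration.

```python
BASE_SPLITS = (
    "pretrain", "train", "val", "test",
    "key_unseen", "val_unseen", "test_unseen", "other_heldout",
)
UNIONS = {
    "all": BASE_SPLITS,
    "seen": ("train", "val", "test"),
    "unseen": ("key_unseen", "val_unseen", "test_unseen"),
}


def expand_split(split):
    """Resolve a split expression to an ordered list of base partitions."""
    if split is None:
        split = "all"  # None and "all" both mean no filtering
    selected = []
    for part in split.split("+"):
        part = part.strip()
        for name in UNIONS.get(part, (part,)):
            if name not in BASE_SPLITS:
                raise ValueError(f"Unknown split: {name!r}")
            if name not in selected:  # drop duplicates, keep first-seen order
                selected.append(name)
    return selected


print(expand_split("pretrain+train"))   # ['pretrain', 'train']
print(expand_split("seen+key_unseen"))  # ['train', 'val', 'test', 'key_unseen']
```

With a metadata DataFrame in hand, the resulting list could then be used to select rows whose partition is in the list, which matches the "either the pretraining or training partitions" wording above.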
Canadian Invertebrates Dataset
- class bioscan_dataset.CanadianInvertebrates(root, split: str = 'train', modality: str | Iterable[str] = 'dna', reduce_repeated_barcodes: bool = False, max_nucleotides: int | None = 660, target_type: str | Iterable[str] = 'species', target_format: str = 'index', output_format: str = 'tuple', dna_transform: Callable | None = None, target_transform: Callable | None = None, download: bool = False)[source]
Bases: Dataset
Canadian Invertebrates Dataset.
Added in version 1.4.0.
- Parameters:
root (str) – The root directory, to contain the downloaded tarball files and CanadianInvertebrates data directory.
split (str, default="train") –
The dataset partition. One of:
- "pretrain"
- "train"
- "val"
- "test"
- "test_unseen"
- "all", which is the union of all splits
- "seen", which is the union of {train, val, test}
- "unseen", which is an alias for "test_unseen"
Set to "all" to include all splits.
The split parameter can also be specified as a collection of partitions joined by "+". For example, split="pretrain+train" will return a dataset comprising the pretraining and training partitions.
modality (str or Iterable[str], default="dna") – Which data modalities to use. This dataset only has one modality: the DNA barcode. The modality parameter is only present to provide a consistent interface with the other dataset classes in this package.
reduce_repeated_barcodes (bool, default=False) –
Whether to reduce the dataset to only one sample per barcode. If True, duplicated barcodes are removed after truncating the barcodes to the length specified by max_nucleotides and stripping trailing Ns. If False (default), no additional reduction of repeated barcodes is performed.
Note
This version of the dataset is already reduced to one sample per full-length barcode compared to the original Canadian Invertebrates 1.5M dataset (deWaard et al., 2019). Even with the parameter reduce_repeated_barcodes=False, the dataset will have at most 965,289 samples, and not 1.5M. For more details on the dataset preprocessing steps, see Millan Arias et al. (2024).
max_nucleotides (int, default=660) –
Maximum number of nucleotides to keep in the DNA barcode. Set to None to keep the original data without truncation.
Note
COI DNA barcodes are typically 658 base pairs long for insects (Elbrecht et al., 2019), and an additional two base pairs are included as a buffer for the primer sequence. Hence we recommend limiting the DNA barcode to the first 660 nucleotides. If you don’t know much about DNA barcodes, you probably shouldn’t change this parameter.
target_type (str or Iterable[str], default="species") –
Type of target to use. One of, or a list of:
- "phylum"
- "class"
- "order"
- "family"
- "subfamily"
- "genus"
- "species"
- "dna_bin" (a species-level label derived from DNA barcode clustering by BOLD)
target_format (str, default="index") – Format in which the targets will be returned. One of: "index", "text". If this is set to "index" (default), target(s) will each be returned as integer indices, each of which corresponds to a value for that taxonomic rank in a look-up-table. Missing values will be filled with -1. This format is appropriate for use in classification tasks. If this is set to "text", the target(s) will each be returned as a string, appropriate for processing with language models.
output_format (str, default="tuple") – Format in which the output of __getitem__() will be returned. One of: "tuple", "dict". If this is set to "tuple" (default), all modalities and targets will be returned together as a single tuple. If this is set to "dict", the output will be returned as a dictionary containing the modalities and targets as separate keys.
dna_transform (Callable, optional) – DNA barcode transformation pipeline.
target_transform (Callable, optional) – Label transformation pipeline.
download (bool, default=False) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
- metadata
The metadata associated with the samples in the selected split, loaded using load_canadian_invertebrates_metadata().
- Type: pandas.DataFrame
- __getitem__(index: int) Tuple[Any, ...][source]
Get a sample from the dataset.
- Parameters:
index (int) – Index of the sample to retrieve.
- Returns:
If output_format="tuple", the output will be a tuple containing:
- dna : str or Any
The DNA barcode, if the "dna" modality is requested, optionally transformed by the dna_transform pipeline.
- *modalities : Any
Any other modalities requested, as specified in the modality parameter. The data is extracted from the appropriate column in the metadata CSV file, without any transformations. Missing values will be filled with NaN.
- target : int or Tuple[int, …] or str or Tuple[str, …] or None
The target(s), optionally transformed by the target_transform pipeline. If target_format="index", the target(s) will be returned as integer indices, with missing values filled with -1. If target_format="text", the target(s) will be returned as a string. If there are multiple targets, they will be returned as a tuple. If target_type is an empty list, the output target will be None.
If output_format="dict", the output will be a dictionary with keys and values as follows:
- keys for each of the modalities specified in the modality parameter, with corresponding values as described above.
- keys for each of the targets specified in target_type, with corresponding value equal to that target’s label (e.g. out["species"] == "Gnamptogenys sulcata")
- for each of the keys in target_type, the corresponding index column ({target}_index), with value equal to that target’s index (e.g. out["species_index"] == 240)
- the key "target", whose contents are as described above
- Return type: Tuple[Any, ...] or dict
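Since the DNA barcode is returned as a string unless dna_transform is given, one common choice is a transform that turns it into a fixed-length sequence of integer codes. The following is a minimal sketch of such a callable; the encode_barcode function, the vocabulary, and the padding scheme are all assumptions chosen for illustration and are not provided by the package.

```python
NUCLEOTIDE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}
UNKNOWN_INDEX = 4  # used for N and any other ambiguity code
PAD_INDEX = 5      # fills the sequence up to a fixed length


def encode_barcode(barcode, length=660):
    """Map a barcode string to a list of `length` integer codes."""
    codes = [
        NUCLEOTIDE_TO_INDEX.get(base, UNKNOWN_INDEX)
        for base in barcode.upper()[:length]
    ]
    codes.extend([PAD_INDEX] * (length - len(codes)))
    return codes


print(encode_barcode("ACGTN", length=8))  # [0, 1, 2, 3, 4, 5, 5, 5]
```

A function like this could be passed as dna_transform=encode_barcode when constructing a dataset; wrapping the result in a tensor is left out here to keep the sketch free of dependencies.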
- index2label(index: int | List[int] | ndarray[tuple[int, ...], dtype[int64]], column: str | None = None) str | ndarray[tuple[int, ...], dtype[str_]][source]
Convert target’s integer index to text label.
- Parameters:
index (int or array_like[int]) – The integer index or indices to map to labels.
column (str, default=same as self.target_type) – The dataset column name to map. This should be one of the possible values for target_type. By default, the column name is the target_type used for the class, provided it is a single value.
- Returns:
The text label or labels corresponding to the integer index or indices in the specified column. Entries containing missing values, indicated by negative indices, are mapped to an empty string.
- Return type: str or numpy.ndarray
Examples
>>> dataset.index2label(29, "order")
'Diptera'
>>> dataset.index2label([4, 9, -1, 4], "order")
array(['Anomopoda', 'Araneae', '', 'Anomopoda'], dtype=object)
- label2index(label: str | Iterable[str], column: str | None = None) int | ndarray[tuple[int, ...], dtype[int64]][source]
Convert target’s text label to integer index.
- Parameters:
label (str or Iterable[str]) – The text label or labels to map to integer indices.
column (str, default=same as self.target_type) – The dataset column name to map. This should be one of the possible values for target_type. By default, the column name is the target_type used for the class, provided it is a single value.
- Returns:
The integer index or indices corresponding to the text label or labels in the specified column. Entries containing missing values, indicated by empty strings or NaN values, are mapped to -1.
- Return type: int or numpy.ndarray
Examples
>>> dataset.label2index("Diptera", "order")
29
>>> dataset.label2index(["Diptera", "Lepidoptera", "", "Diptera"], "order")
array([29, 45, -1, 29])
- bioscan_dataset.load_canadian_invertebrates_metadata(metadata_path, max_nucleotides: int | None = 660, reduce_repeated_barcodes: bool = False, split: str | None = None, dtype: str | dict | None = MetadataDtype.DEFAULT, **kwargs) DataFrame[source]
Load Canadian Invertebrates dataset metadata from its CSV file and prepare it for training.
Added in version 1.4.0.
- Parameters:
metadata_path (str) – Path to the metadata CSV file.
max_nucleotides (int, default=660) –
Maximum nucleotide sequence length to keep for the DNA barcodes. Set to None to keep the original data without truncation.
Note
COI DNA barcodes are typically 658 base pairs long for insects (Elbrecht et al., 2019), and an additional two base pairs are included as a buffer for the primer sequence. Although the Canadian Invertebrates dataset itself contains longer sequences, characters after the first 660 base pairs are likely to be inaccurate reads, and not part of the DNA barcode. Hence we recommend limiting the DNA barcode to the first 660 nucleotides. If you don’t know much about DNA barcodes, you probably shouldn’t change this parameter.
reduce_repeated_barcodes (bool, default=False) –
Whether to reduce the dataset to only one sample per barcode. If True, duplicated barcodes are removed after truncating the barcodes to the length specified by max_nucleotides and stripping trailing Ns. If False (default), no reduction is performed.
Note
This version of the dataset is already reduced to one sample per full-length barcode compared to the original Canadian Invertebrates 1.5M dataset (deWaard et al., 2019). Even with the parameter reduce_repeated_barcodes=False, the dataset will have at most 965,289 samples, and not 1.5M. For more details on the dataset preprocessing steps, see Millan Arias et al. (2024).
split (str, optional) –
The dataset partition to return. One of:
- "pretrain"
- "train"
- "val"
- "test"
- "test_unseen"
- "all", which is the union of all splits
- "seen", which is the union of {train, val, test}
- "unseen", which is an alias for "test_unseen"
If split is None or "all" (default), the data is not filtered by partition and the dataframe will contain every sample in the dataset.
The split parameter can also be specified as a collection of partitions joined by "+". For example, "pretrain+train" will filter the metadata to samples in either the pretraining or training partitions.
**kwargs – Additional keyword arguments to pass to pandas.read_csv().
- Returns:
The metadata DataFrame.
- Return type: pandas.DataFrame
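The interaction between max_nucleotides and reduce_repeated_barcodes described above (truncate, strip trailing Ns, then drop duplicates) can be sketched in a few lines. This is an illustration of the documented order of operations, not the package's implementation: the function names are hypothetical, and keeping the first occurrence of each barcode is an assumption, since the reference does not say which sample is retained.

```python
def truncate_barcode(barcode, max_nucleotides=660):
    """Keep at most `max_nucleotides` characters; None disables truncation."""
    if max_nucleotides is None:
        return barcode
    return barcode[:max_nucleotides]


def reduce_repeated(barcodes, max_nucleotides=660):
    """Return the indices of the first sample for each distinct barcode."""
    seen = set()
    keep = []
    for i, barcode in enumerate(barcodes):
        # Compare barcodes after truncation and after stripping trailing Ns.
        key = truncate_barcode(barcode, max_nucleotides).rstrip("N")
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return keep


barcodes = ["ACGTNN", "ACGT", "ACGTAC", "ACGTACGG"]
print(reduce_repeated(barcodes, max_nucleotides=6))     # [0, 2]
print(reduce_repeated(barcodes, max_nucleotides=None))  # [0, 2, 3]
```

The example shows why the two parameters interact: the last two barcodes only count as duplicates once both are truncated to six nucleotides.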