• How does unsupervised or supervised machine learning aid your discovery efforts (e.g., clustering, classification, inference)?
o If unsupervised, what approaches do you take (e.g., k-means, PCA, other)?
o If supervised, do shallow learning approaches (e.g., logistic regression, naïve Bayes) suffice, or do you require deep learning (e.g., recurrent neural networks, 1D convnets)?
• What experimental and/or bioinformatics processing steps do you employ to ensure you have established an accurate ground truth for selected populations (e.g., binders / non-binders from FACS), and what classification strategies do you employ as they pertain to antibody discovery?
• What minimum/maximum read depth and/or fold-coverage thresholds for the region of interest (HCDR3) do you apply to the datasets used to train your machine learning algorithms in order to avoid under- or overfitting? Is this sufficient based on your assessment of the accuracy/loss curves? Do you utilize data augmentation or regularization techniques?
• How do you typically allocate your training, validation, and test sets for NGS datasets from discovery campaigns?
• What data encoding methods do you employ (e.g., one-hot encoding, tokenization) to represent your sequence data? Does 3D coordinate information enhance your dataset?
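
For reference on the encoding question above, the following is a minimal sketch of what one-hot encoding and integer tokenization of an HCDR3 amino acid sequence might look like. The function names, the 20-letter alphabet ordering, the fixed padding length, and the example sequence are all illustrative assumptions, not a description of any particular pipeline.

```python
# Illustrative sketch only: names, alphabet order, and padding scheme are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues, alphabetical by one-letter code
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
PAD_TOKEN = 0  # token 0 is reserved for padding and unknown residues


def tokenize(seq, max_len):
    """Map each residue to an integer token (1..20), padded or truncated to max_len."""
    tokens = [AA_INDEX.get(aa, -1) + 1 for aa in seq.upper()[:max_len]]
    return tokens + [PAD_TOKEN] * (max_len - len(tokens))


def one_hot_encode(seq, max_len):
    """Return a max_len x 20 matrix; padded or unknown positions stay all-zero."""
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(seq.upper()[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


# Hypothetical HCDR3 fragment, padded to a fixed length of 8 positions.
example_tokens = tokenize("ARDYW", max_len=8)
example_matrix = one_hot_encode("ARDYW", max_len=8)
```

One-hot matrices are a common input for 1D convnets, while integer tokens are typically fed to an embedding layer; which representation is preferable depends on the model and dataset.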