October 14th, 2023
Objective: Embryo evaluation is a critical step of in vitro fertilization (IVF). Here, we sought to develop an AI model that can automate the Gardner scale morphology grading that is performed routinely in labs, including degree of expansion (3,4,5,6), inner cell mass (ICM) grade (A,B,C), and trophectoderm (TE) grade (A,B,C).
Materials and Methods: Historical, de-identified images of blastocyst-stage embryos and manual morphology grades were collected from multiple IVF clinics in the US for cycles between 2015 and 2020. Images were captured on day 5, 6, or 7 using an inverted microscope prior to biopsy or freeze. The dataset contains 9,478 images. A separate test dataset of 50 images was collected from an independent IVF clinic, with manual morphology grades given by 6-10 embryologists each year for 4 years.
Convolutional neural networks (CNNs) were trained independently for each morphological component. First, the images were sorted into 3 ICM grades (A, B, or C), and an ensemble of 2 CNNs (ResNet and EfficientNet) was trained to predict the ICM grade. This process was then repeated independently for TE and expansion. The final model for predicting the morphological grade therefore consisted of 6 CNNs (2 per component). After training and validation, the model was evaluated on the independent test dataset.
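As an illustration of how per-component ensembles could be combined into a single Gardner grade, the following is a minimal sketch. It assumes soft voting (averaging the class probabilities of the two networks), which the abstract does not specify; the function names and the example probabilities are hypothetical and stand in for the outputs of the trained ResNet and EfficientNet models.

```python
from typing import Dict, List, Sequence

# Class labels for each morphological component, as listed in the Objective.
LABELS: Dict[str, List[str]] = {
    "expansion": ["3", "4", "5", "6"],
    "icm": ["A", "B", "C"],
    "te": ["A", "B", "C"],
}


def soft_vote(prob_sets: Sequence[Sequence[float]], labels: Sequence[str]) -> str:
    """Average class probabilities across ensemble members; return the top label.

    Assumption: soft voting. The source only states that two CNNs form an ensemble.
    """
    n_models = len(prob_sets)
    mean = [sum(p[i] for p in prob_sets) / n_models for i in range(len(labels))]
    return labels[mean.index(max(mean))]


def gardner_grade(outputs: Dict[str, Sequence[Sequence[float]]]) -> str:
    """Concatenate expansion, ICM, and TE predictions into a grade such as '4AB'."""
    order = ("expansion", "icm", "te")
    return "".join(soft_vote(outputs[c], LABELS[c]) for c in order)


# Hypothetical softmax outputs from the two networks for one image.
example = {
    "expansion": [[0.1, 0.6, 0.2, 0.1], [0.2, 0.5, 0.2, 0.1]],
    "icm": [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]],
    "te": [[0.2, 0.5, 0.3], [0.1, 0.6, 0.3]],
}
grade = gardner_grade(example)
print(grade)
```

In this sketch the two networks disagree on the ICM grade (A versus B), and the averaged probabilities resolve the disagreement in favor of A.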
Results: The ICM, TE, and expansion deep learning models reached training and validation accuracies of approximately 80%. Visual inspection of images with prediction errors revealed issues with image quality and inconsistent labeling between embryologists. The independent test dataset was used to evaluate consensus agreement between a group of embryologists and the AI model. For expansion, the embryologists agreed unanimously on the grade 12% of the time and showed majority (>50%) consensus 100% of the time, and the AI model agreed with the embryologist consensus 88% of the time. For ICM, the embryologists agreed unanimously on the grade 4% of the time and showed majority (>50%) consensus 94% of the time, and the AI model agreed with the embryologist consensus 60% of the time. For TE, the embryologists agreed unanimously on the grade 0% of the time and showed majority (>50%) consensus 98% of the time, and the AI model agreed with the embryologist consensus 84% of the time. The most common AI prediction errors were A-to-B or B-to-A; A-to-C and C-to-A errors never occurred. After combining all three categories (expansion, ICM, and TE), the average rate at which individual embryologists agreed with the common consensus was 41%, while the rate for the AI model was 48%.
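The agreement statistics above can be sketched in a few lines of code. This is an illustration under stated assumptions, not the analysis used in the study: consensus is taken to be a grade given by strictly more than 50% of the embryologists, and AI agreement is counted over all images, with images lacking a majority counted as disagreement (the abstract does not state the denominator). The function names and the toy grades are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def consensus(grades: Sequence[str]) -> Optional[str]:
    """Return the grade given by more than 50% of raters, or None if no majority."""
    label, count = Counter(grades).most_common(1)[0]
    return label if count > len(grades) / 2 else None


def summarize(panel: List[List[str]], ai: List[str]) -> Dict[str, float]:
    """Compute unanimity, majority-consensus, and AI-vs-consensus rates.

    panel[i] holds the embryologists' grades for image i; ai[i] is the model's grade.
    Assumption: images without a majority count as AI disagreement.
    """
    n = len(panel)
    cons = [consensus(g) for g in panel]
    return {
        "unanimous": sum(len(set(g)) == 1 for g in panel) / n,
        "majority": sum(c is not None for c in cons) / n,
        "ai_agrees": sum(c is not None and c == a for c, a in zip(cons, ai)) / n,
    }


# Toy data: four images graded by three embryologists each (not study data).
panel = [["A", "A", "A"], ["A", "B", "B"], ["A", "B", "C"], ["B", "B", "C"]]
ai = ["A", "A", "B", "B"]
stats = summarize(panel, ai)
print(stats)
```

With the toy data, one of four images is graded unanimously, three of four reach a majority, and the model matches the consensus on two of four.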
Conclusions: While the subjectivity of ground-truth labels poses a challenge, automated morphology grading of blastocyst-stage embryos can be achieved with deep learning at human-level accuracy.
Impact Statement: Our results demonstrate that it is possible to develop a deep-learning model for automated morphology grading that often agrees with the consensus of embryologist grades. Performance can be improved with additional quality control. Such a model could be useful for training embryologists, assessing performance, and standardizing morphology grading.