Introduction

Disease pathotyping with histopathology, the discovery of disease subtypes using target organ histology, is a critical step in understanding etiology and response to therapy in heterogeneous diseases such as rheumatoid arthritis (RA). Our understanding of RA, a chronic inflammatory joint disease, has greatly benefited from histopathology subtyping because the disease has distinct and disparate etiologies with largely stable pathotypes1,2 that show differential response to therapy3,4,5,6,7. However, pathotyping a patient can be resource intensive: it involves both basic and immune stains, requires a high level of expertise from pathologists to interpret tissue and cellular histologic features, and is prone to inter- and intra-observer variation8,9. More cost-effective and efficient procedures need to be developed in order to incorporate these types of data into a precision medicine decision-making process.

Recent work describing RA pathotypes has uncovered three distinct synovial pathotypes: (1) cell dense and lymphocyte rich (lymphoid), (2) myeloid rich with few lymphocytes (diffuse/myeloid), and (3) fibroblast rich (pauci-immune). These are identifiable through distinct cellular and tissue level changes within synovial joint biopsies4,10,11,12,13. These pathotypes also correlate with antibody positivity (i.e. anti-citrullinated peptide antibodies), with the lymphoid type enriched in antibody positive patients whereas both the diffuse/myeloid and pauci-immune types have equal contributions of antibody positive and negative patients. This aligns with preclinical models that rely on antibody dependent (e.g. Collagen Induced Arthritis and Serum Induced Arthritis) and independent mechanisms (e.g. humanized TNF Transgenic and Zymosan Induced Arthritis) to study disease and that are phenotypically similar to these pathotypes14. Thus, tools that allow us to study both murine and human pathology would improve our overall understanding of this heterogeneous disease.

Computational tools to study histological changes have been shown to augment pathologist workflows and allow for the identification of disease specific patterns15. In particular, machine learning, and specifically deep learning, is a data-driven framework that has recently had success in the automated analysis of musculoskeletal imaging data16. Additionally, computational tools can automate and provide a more holistic analysis of a wide variety of histopathologic tissue and cell-level changes to enable a more detailed understanding of disease subtypes. However, an automated and comprehensive tool to study both tissue and cell type specific changes in arthritis that can quantify therapeutic or clinically meaningful differences has not yet been developed16.

In this work, we developed an automated multi-scale computational pathotyping (AMSCP) model to analyze tissue and cell-level changes during the progression of inflammatory arthritis. This model can pathotype both mouse and human inflammatory arthritis in therapeutic intervention studies and clinically meaningful scenarios. We leveraged transfer learning and active learning techniques to improve model performance and workload efficiency. Our modeling framework consists of two distinct components that can be utilized together or independently: (1) a method to segment different tissue types to characterize tissue-level changes and (2) a method to classify cell types within each tissue compartment to study how these change across disease states. We utilized two mouse models of inflammatory arthritis to train and validate our computational models, with subsequent implementation on additional datasets to measure therapeutic efficacy, confirm known biologic differences and discover novel pathologic changes. Then, we utilized a human synovial biopsy data set from the Accelerating Medicines Partnership Rheumatoid Arthritis (AMP-RA) research consortium to demonstrate our model’s utility in a clinical setting by distinguishing the lymphoid pathotype from the diffuse/myeloid pathotype and identifying cell types associated with the pauci-immune pathotype while preserving spatial cell-level data.

Results

Deep learning segmentation can identify major tissue types within mouse knee histology and measure therapeutic response

Several model training choices, including patch overlap, training strategy, and the amount of augmentation used during training, were empirically derived from initial experiments to inform the final training of the deep learning segmentation model (UNET++)17,18. First, we tested whether 0%, 50% or 66% patch overlap performed best and determined that 66% overlap did (0% Overlap fwIOU: 0.72 ± 0.04; 50% Overlap fwIOU: 0.93 ± 0.02; 66% Overlap fwIOU: 0.95 ± 0.01; Supplemental Fig. 4A). Qualitatively, there were fewer tiling edge artifacts in the 66% overlap results than in the 50% overlap results (50%: arrows, Supplemental Fig. 4B vs Fig. 1). Second, a mixed training strategy was shown to overcome the large staining batch effect (Supplemental Fig. 5A) commonly seen in histology datasets, with all levels of augmentation performing equally well (Supplemental Fig. 5B). However, if the data is restricted to a single batch and data augmentation is introduced, model performance for the single batch training strategy becomes comparable to model performance using the mixed strategy (High Augmentation UNET++: 0.81 ± 0.02 vs 0.80 ± 0.06 mIOU for mixed and single batch respectively), demonstrating the need for image augmentation during training to generalize across batches with this level of heterogeneity in staining and other imaging variations (Supplemental Fig. 5B and C). Thus, we chose to employ a mixed training strategy with 66% patch overlap and high augmentation to optimize model performance and generalizability.
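For readers less familiar with the two segmentation metrics reported above, the following is a minimal sketch of how mIOU and frequency-weighted IOU (fwIOU) are commonly computed from a ground-truth and a predicted label map. It assumes the standard definitions (per-class intersection over union, averaged either uniformly or weighted by ground-truth class frequency); the function name `iou_metrics` is hypothetical and this is not taken from the authors' code.

```python
import numpy as np

def iou_metrics(y_true, y_pred, n_classes):
    """Return (mIOU, fwIOU) for two integer label maps of equal shape.

    Assumes the usual definitions; labels must lie in [0, n_classes).
    """
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Confusion matrix: rows are ground truth, columns are predictions.
    cm = np.bincount(
        y_true * n_classes + y_pred, minlength=n_classes ** 2
    ).reshape(n_classes, n_classes)
    intersection = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    present = union > 0  # ignore classes absent from both maps
    iou = np.zeros(n_classes)
    iou[present] = intersection[present] / union[present]
    miou = iou[present].mean()
    # Weight each class IOU by its share of ground-truth pixels.
    freq = cm.sum(axis=1) / cm.sum()
    fwiou = float((freq * iou).sum())
    return float(miou), fwiou

miou, fwiou = iou_metrics([0, 0, 0, 1], [0, 0, 1, 1], n_classes=2)
```

Because fwIOU weights classes by pixel frequency, it is dominated by large tissue compartments, whereas mIOU treats a rare class such as the meniscus the same as a common one.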

Once the training strategy was established, model performance was benchmarked across segmentation tasks at multiple tissue granularities and compared with a standard RF model built into QuPath (Fig. 1). As expected, as the number of different tissue types increases, model performance decreases for both the RF and UNET++ models, with the DL model outperforming the RF at all levels. Interestingly, the magnitude of the decrease is smaller for the UNET++ model than for the RF model, suggesting that it is more robust to increased complexity. When testing the UNET++ model using the ten-class granularity, model performance drops from 0.88 ± 0.06 mIOU for the combined cartilage and meniscus class to 0.83 ± 0.05 mIOU for the cartilage class and 0.0 ± 0.0 mIOU for the meniscus class, indicating a complete loss of meniscus identification (Supplemental Fig. 6). Because defining the amount of cartilage and meniscus is a very important pathologic readout in inflammatory joint diseases (e.g. pannus invasion at end stage arthritis), we developed a fine-tuned two-class model and placed it sequentially after the 9-class model. Predictions from both were incorporated during the majority voting process to create a composite 10 class model. Performance for this fine-tuned model increased from 0.72 ± 0.01 mIOU to 0.82 ± 0.02 mIOU (Fig. 1A, B). Specifically, performance for the cartilage increased from 0.83 ± 0.05 to 0.90 ± 0.04 and for the meniscus from 0.00 ± 0.00 to 0.92 ± 0.05, a dramatic improvement (Supplemental Fig. 6C). We additionally observe that the worst performing classes are the artifact class and bone marrow fat class, the two most infrequent classes, suggesting that fine tuning may work to improve performance for one or both (Supplemental Fig. 6C, D). Thus, for all subsequent work we used either the 9-class model (termed Original model) for ease of computation or the Fine-Tuned 10 class model for meniscus segmentation, because these provided the best predictive performance while encompassing the largest number of tissue types.

Fig. 1: A fine tuned 10 class model can segment relevant tissue in inflammatory arthritis.

A Mean intersection over union (mIOU) and class frequency weighted mIOU statistics from the held-out test set for the RF and DL segmentation models at 4 different tissue granularities. Box and Whisker plots are constructed by showing the Min, 25th percentile, Median, 75th percentile and Max; each dot is one slide, n = 16. B–F Representative images of the H&E image (B), with Ground Truth (GT, C), RF (D), 9 class UNET++ (E) and Fine Tuned UNET++ (F) tissue overlays from the Test set. Bi–Fi 2x magnification of the whole joint. Bii–Eii 10x magnification of the anterior femoral condyle depicting synovial pannus encroachment reaching the articular cartilage. Fii 10x magnification of the posterior articular cartilage and meniscus. Biii–Eiii 10x magnification of the trabecular bone and bone marrow proximal to the femoral growth plate depicting an area that was difficult to predict for all models. Fiii 10x magnification of the posterior articular cartilage and meniscus with the Fine Tuned UNET++ tissue prediction overlays.

After Test set validation, we sought to externally validate the Fine-Tuned 10 class model (i.e., completely independent of the training, testing and validation process above) in two additional ways. (1) We first validated our Fine-Tuned 10 class model directly on previously published data by comparing hand drawn histomorphometry outlines of the synovial tissue published in Bell et al.19 (n = 9) to the synovial segmentation predictions that were in the Test set from the TNF-Tg cohort (Batch A). There was a significant positive correlation between the DL segmentation area and the hand drawn area (r² = 0.96, Fig. 2A), which demonstrates the agreement of our method with hand drawn histomorphometry. However, linear regression and RMSE analysis suggest that the fit is not perfect, with over-segmentation at low areas (healthy samples) and under-segmentation at high areas (severely diseased samples). (2) We then validated our model in a real-world setting by collecting 171 slides (64 knees, 2 or 3 histologic levels per knee) from 9.5 month-old male TNF-Tg mice either treated with anti-TNF therapy or placebo for 6 weeks and from 6 month-old TNF-Tg and WT (treatment naïve) mice used as controls20,21 (experimental design schematic in Fig. 2B). Male TNF-Tg mice display a robust inflammatory arthritis with synovial hyperplasia and pannus invasion of the distal femur and femoral articular cartilage (Fig. 2C, WT vs TNF-Tg, p < 0.0001; Fig. 2D, black arrow 2nd panel). Anti-TNF therapy is known to reduce synovitis (Fig. 2D, blue outlined tissue and * right panel) yet does not alter trabecular bone loss (Fig. 2D, dark teal outlined tissue, red arrows right panels) in mice with established disease (>6 months old)22,23. Our DL segmentation appropriately modeled these well-established structural changes autonomously (Fig. 2C, D), and it uncovered that cartilage degradation is also moderately reduced when Anti-TNF therapy is provided at 8 months of age for 6 weeks, which is an expected but novel result. Interestingly, trabecular bone area is already decreased in 6 month-old TNF-Tg mice compared to WT counterparts while cartilage area is not, suggesting that trabecular bone loss occurs before cartilage loss. Additionally, multiple other tissue structures can be studied simultaneously (Supplemental Fig. 7), showing the versatility of H&E segmentation models for assessing tissue structural changes within the context of mouse disease models. Taken together, these analyses suggest that our model accurately detects relevant and meaningful biologic treatment effects, with the potential to discover novel structural changes. It is important to note that some of these slides were stained with a variation of traditional H&E, H&E-Orange-G (see Methods), demonstrating that our training strategy choices produced a model that is robust even when introducing additional stain variations.

Fig. 2: The fine tuned UNET++ model measures treatment response in the TNF-Tg with Anti-TNF therapy.

A Inferred synovial area in the held-out test set compared to historical hand drawn synovial histomorphometry area (n = 9, both Tissue inferred Area and Hand Drawn Histomorphometry Area are normally distributed, Pearson’s Correlation). B Anti-TNF therapy Study design. Six-month-old WT and TNF-Tg mice were used as controls. Eight-month-old WT or TNF-Tg mice were treated with Anti-TNF therapy or Placebo control for 6 weeks (weeks post treatment, wpt). Both left and right knees were collected and 2–3 histologic levels per knee were analyzed (n = Slides / Knees). C Using the fine tuned UNET++ model we inferred tissue area of the Synovium (Left), Trabecular Bone (Middle) and Cartilage (Right) for 6mo WT (n = 17) and TNF Controls (n = 14) as well as Placebo (Irrelevant IgG) treated WT (n = 15), Placebo treated TNF-Tg (n = 10) and Anti-TNF treated TNF-Tg (n = 8). Each dot represents one knee (average of 2–3 histologic levels); Box and Whisker plots are constructed by showing the Min, 25th percentile, Median, 75th percentile and Max. Left Panel: TNF-Tg group was not normally distributed, data was log transformed and a One-Way ANOVA with Tukey’s Post-hoc Test was performed. Middle Panel: All data was normally distributed, and a One-Way ANOVA with Tukey’s Post-hoc Test was performed. Right Panel: WT (Placebo) and TNF-Tg (Placebo) groups are not normally distributed, data was log transformed and a One-Way ANOVA with Tukey’s Post-hoc Test was performed. D Representative images (2x magnification) of 6mo WT and TNF Controls as well as 9.5 mo Placebo treated WT, Placebo treated TNF-Tg and Anti-TNF treated TNF-Tg with predicted tissue overlay. Note: black arrows denote pannus invasion of the femoral articular cartilage, the red arrows denote trabecular bone loss, and * denotes reduction in synovial area.

Tissue-Specific and Arthritis Effector Cell Types can be Identified with ML in TNF-Tg Mice

We annotated a total of 4,712 cells across three stages of murine inflammatory arthritis in the TNF-Tg mice (n = 6 healthy, n = 8 mild disease, and n = 5 severe, Supplemental Fig. 8) to build a cell type classification model that could recognize cells from various disease stages. Cell nuclei were first segmented with HoverNet24, and nuclei boundaries were passed to ImageJ, where stain deconvoluted color features of both the nuclei and cytoplasm as well as nuclei shape parameters were calculated. We next passed these data to our custom feature extraction pipeline, leveraging our novel insight that cell type predictive modeling dramatically improves with neighborhood features (Supplemental Table 3), inspired by adjacent work tackling a different classification task25. These neighborhood features include standard statistical measures of neighboring cells within a radius of the parent cell, such as hematoxylin intensity mean or kurtosis, whether the parent cell was located in a dense region of other cells, and the shape (convex hull) of the cells within a radius of the parent cell (Supplemental Table 2). The fidelity of these methods was demonstrated in 2D UMAP space, as most cell types are clearly distinguished (Fig. 3A). We next built a gradient boosted decision tree (GBDT) classification model (Xgboost) using nested stratified 5-fold cross validation training and tested our models’ predictions using three methods. (1) We calculated the average (± SD of folds) F1 of each cell class in the test sets (Fig. 3B). Each cell class demonstrated a good F1, between 0.66 for vessel cells and 0.94 for adipo-stromal cells, with synovial associated cell classifications among the best (synovial lining cells = 0.90 ± 0.03; synovial fibroblasts = 0.82 ± 0.05; and synovial lymphocytes = 0.94 ± 0.05). Our next two validation strategies utilized tissue and disease context. (2) Tissue segmentation using the Original model was performed on these training slides, and cell types were inferred for the remaining >300,000 cells. Tissues with known homogenous cell types, fat tissue and cartilage/meniscus, were investigated and cell types among these tissues were plotted. Within adipose tissue and cartilage/meniscus, the most frequently predicted cell types were adipo-stromal cells (67%) and chondrocytes (61%), respectively (Fig. 3C). (3) To assess predictions in the context of disease, we utilized the synovial tissue predictions only (as determined by the tissue segmentation model, Fig. 1) and stratified by disease severity. As shown in Fig. 3D, E, synovial-specific increases in synovial fibroblasts, synovial lining cells and lymphocytes are seen with increasing disease severity. These results suggest that our cell type model can produce high quality predictions that are sensitive to disease stage.
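To illustrate the kind of neighborhood feature described above, the sketch below computes, for each cell, the number of neighboring cells within a radius (a simple density measure) and the mean of one per-cell value (for example hematoxylin intensity) over those neighbors. This is only an illustration under stated assumptions: the function name `neighborhood_features`, the brute-force distance computation, and the default radius of 150 px are hypothetical choices and do not reproduce the authors' pipeline or its full feature set.

```python
import numpy as np

def neighborhood_features(centroids, values, radius=150.0):
    """Per-cell neighbor count and neighbor mean of `values` within `radius`.

    centroids: (n, 2) array of nucleus centroid coordinates in pixels.
    values:    (n,) array of one intrinsic feature per cell.
    Returns (counts, means); means is NaN for cells with no neighbors.
    """
    centroids = np.asarray(centroids, dtype=float)
    values = np.asarray(values, dtype=float)
    # Brute-force pairwise distances; fine for a sketch, a spatial index
    # would be preferable for hundreds of thousands of cells.
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    neighbors = dist <= radius
    np.fill_diagonal(neighbors, False)  # a cell is not its own neighbor
    counts = neighbors.sum(axis=1)
    sums = neighbors.astype(float) @ values
    means = np.full(len(values), np.nan)
    has_neighbors = counts > 0
    means[has_neighbors] = sums[has_neighbors] / counts[has_neighbors]
    return counts, means

counts, means = neighborhood_features(
    [[0, 0], [100, 0], [1000, 0]], [1.0, 3.0, 5.0]
)
```

Other summary statistics (for example kurtosis) could be computed over the same neighbor mask and appended to each cell's intrinsic features before classification.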

Fig. 3: Cell type classification model successfully identifies important cell types in inflammatory arthritis.

A Uniform Manifold Approximation and Projection (UMAP) plot after principal component analysis dimensional reduction on 856 cell features of the annotated cells (colored by cell type, n = 4,712). B A gradient boosted decision tree was trained using a parameter grid search with a nested, stratified, 5-fold cross-validation training strategy. The F1 scores (M ± SD) of the five folds for each cell class are presented with the overall weighted F1 of 0.88 ± 0.03 (M ± SD). C–E On the remaining (not-annotated) cells (~300,000) on the 9 training slides, the cell class was predicted using the most performant model. Tissue class was also predicted using the Original segmentation model. C Predicted cell class is plotted as a percent of total cells within the Fat tissue (Top) and Cartilage and Meniscus tissue (Bottom). D Representative images of Synovial tissue cell class predictions within an inflamed synovium. E Cell counts from the synovial tissue on the Healthy, Mild disease and Severe disease training slides of lymphocytes (Left), Synovial Lining Cells (Middle) and Fibroblast (Right). Each dot represents one slide, M ± SD.

Given the promising intra-test set performance and the tissue- and disease-state specificity of the cell type modeling, we aimed to further validate our model with a larger dataset with more biologic variation. To do this, we utilized the previously described sexually dimorphic synovial pathology in TNF-Tg mice19 and collected slides from 3–5.5 month-old male and female WT and TNF-Tg mice. Confirming our previous observation using traditional histologic scoring, we found a significant increase of lymphocytes in female TNF-Tg synovium at 3 months of age with a concomitant significant increase in synovial lining cells (Fig. 4A, B). Interestingly, sexual dimorphism was not observed when assessing synovial fibroblasts, which is a novel finding (Fig. 4C). Finally, we found excellent correlations between our computationally derived lymphocyte counts and the expert derived synovial inflammatory score (rho = 0.81, p < 0.0001), between synovial lining cell counts and pannus score (rho = 0.66, p < 0.001), and between total cell counts and total cell area (r² = 0.96, p < 0.0001) (Fig. 4D–F). These data suggest that our mouse cell typing model is sensitive to both subtle and dramatic tissue changes, and importantly recapitulates expert scored data.

Fig. 4: Computational pathology modeling recapitulates the sexual dimorphism of TNF-Tg inflammatory arthritis.

A set of slides independent of the training slides was used to validate the cell type prediction model (3 months-old: WT Male n = 5, WT Female n = 4, TNF-Tg Male n = 6, TNF-Tg Female n = 5; 4 months-old: WT Male n = 6, WT Female n = 4, TNF-Tg Male n = 5, TNF-Tg Female n = 5; 5.5 months-old: WT Male n = 4, WT Female n = 5, TNF-Tg Male n = 6, TNF-Tg Female n = 6). Tissue segmentation was first performed to segment the synovium and then cell type predictions were performed within the synovium. A Lymphocyte prediction counts. Each dot is one mouse; Box and Whisker plots are constructed by showing the Min, 25th percentile, Median, 75th percentile and Max. Please note the log scale. Lymphocyte counts were found to be lognormal, a log transformation was performed on the data and a Two-Way ANOVA with Tukey’s Post-hoc test was conducted. B Synovial Lining Cell prediction counts. Each dot is one mouse; Box and Whisker plots are constructed by showing the Min, 25th percentile, Median, 75th percentile and Max. Please note the log scale. A Two-Way ANOVA with Tukey’s Post-hoc test was conducted. Differences only shown between the female and male TNF-Tg mice. C Fibroblast prediction counts. Each dot is one mouse; Box and Whisker plots are constructed by showing the Min, 25th percentile, Median, 75th percentile and Max. A Two-Way ANOVA with Tukey’s Post-hoc test was conducted. Differences only shown between the female and male TNF-Tg mice. D Lymphocyte predictions compared to the synovial inflammatory score as previously quantified in Figure 2F of Bell et al.19. Lymphocyte counts were not normally distributed. Spearman’s correlation, TNF-Tg mice only (n = 28). Please note the log scale on the x-axis. E Synovial Lining cell counts compared to the pannus invasion score as previously quantified in Supplemental Figure 3I in Bell et al.19. Synovial Lining cell counts were not normally distributed. Spearman’s correlation, TNF-Tg mice only (n = 28). Please note the log scale on the x-axis. F Total cell counts in the synovium compared to the cell area as previously quantified in Figure 2D of Bell et al.19. Total cell counts and cell area in the synovium were normally distributed. Pearson’s correlation, TNF-Tg mice only (n = 28).

To reduce the annotation time, we explored various active learning approaches retrospectively in the murine cell type data set. Model performance using all the active learning strategies with 45% of the total training size was comparable to using the complete dataset with random sampling (0.8188 (0.8157–0.8219) vs 0.8213 (0.8184–0.8243), F1 ± SD). Additionally, with 45% of the data, the model performance using active learning was higher than using a randomly sampled set of examples (0.8188 (0.8157–0.8219) vs 0.8082(0.8053–0.8111) F1 ± SD) (Supplemental Fig. 10). Additionally, the mean 5-fold CV macro F1-score was most different at the 10%–25% of annotated data range, indicating that active learning can drastically improve model performance with fewer examples.

Cell type modeling on human RA synovial biopsies predicts pathotypes and correlates with clinical outcomes

After validating the active learning strategy in murine tissue, we next applied this approach to generate cell type annotations on human synovial tissue sections, aiming to reduce the overall histopathological evaluation time for our pathologist. We collected a small subset of initial annotations, predicted cell types on new cells and calculated the entropy-based uncertainty, ranking the most uncertain cells for future annotation by the pathologist. After multiple rounds of active learning, a total of 2,341 cells were annotated (Supplemental Fig. 10). Using a GBDT (xgboost) with a nested stratified 5-fold cross-validation training strategy, we achieved good model performance, with average per-class F1-scores ranging from 0.79 to 0.91 and an overall weighted F1 of 0.85 ± 0.01 (M ± SD, Fig. 5A). The confusion matrix from the best performing fold (Fig. 5B) demonstrates that undifferentiated stromal-connective cells are confused with vascular endothelial cells and synovial lining cells at the highest frequency. Also, plasma cells are confused with lymphocytes. These results suggest there is very little misclassification between high-level cell types (stromal vs lymphoreticular cells) but some misclassification within more specific cell classes. To further validate our human RA cell typing model, we acquired adjacent immunostained sections (Lymphoid, n = 7; Diffuse, n = 6; Pauci-Immune, n = 2) for lymphoid (CD3, CD20 and CD138) and stromal-immune (CD3, CD68, CLIC5 and CD34) markers to qualitatively and quantitatively assess cell type validity. Our qualitative analysis revealed a remarkable spatial alignment of machine learning predicted plasma cells with CD138+ cells (Fig. 5Ci, green outline middle panel vs green IF right panel), lymphocytes with CD3+ or CD20+ cells (Fig. 5Cii, blue outline middle panel vs red/white IF right panel), synovial lining cells with CLIC5+ cells (Fig. 5Ciii, light blue outline middle panel vs yellow IF right panel) and vascular endothelial cells with CD34+ cells (Fig. 5Civ, grey outline middle panel vs red IF right panel) in many of the sections. We also observed some alignment of macrophage-histiocyte predictions with CD68+ cells (Fig. 5Ciii, purple outline middle panel vs green IF right panel), but these observations were much less consistent, suggesting that either the model fails to predict these cells outside of the training data or H&E defined macrophages-histiocytes are not well marked with CD68 in our data. We were unable to acquire immunostains to validate our fibroblast and stromal-connective cell classes. Lymphocyte, and in particular B-cell, infiltration into the synovial tissue is an important pathologic finding to discriminate RA pathotypes2,11,26 and classify disease severity27. Therefore, to quantitatively validate our lymphocyte and plasma cell machine learning predictions against fiducial protein cell markers on adjacent slides, we performed thresholding histomorphometry analysis to count the number of DAPI+CD138+, DAPI+CD3+ or DAPI+CD20+ cells within regions of interest. There is an excellent correlation between our machine learning predictions of plasma cells and DAPI+CD138+ cells (rho = 0.94, p < 0.002, Fig. 5D, top), and between predicted lymphocytes and DAPI+CD3+ or DAPI+CD20+ cells (rho = 0.95, p < 0.0001, Fig. 5D, bottom). The Krenn inflammation score is an important, expert derived histopathology score driven mainly by the number of lymphocytes that are present in the whole synovial biopsy specimen. Thus, we calculated the percent of machine learning predicted lymphocytes to total cells and correlated it with the Krenn inflammation score in all samples (n = 60). This revealed a high degree of correlation between the predictions and pathologist scores (rho = 0.88, p < 0.0001, Fig. 5E). Taken together, this analysis demonstrates we have an excellent synovial cell type prediction model that faithfully predicts clinically relevant lymphoid cells.
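The entropy-based ranking step used to choose cells for the next annotation round can be sketched as follows. This is a minimal illustration that assumes the uncertainty is the Shannon entropy of the classifier's predicted class probabilities for each unlabeled cell; the function name `rank_by_entropy` and the batch size argument `top_k` are hypothetical and not taken from the authors' code.

```python
import numpy as np

def rank_by_entropy(probs, top_k):
    """Return indices of the top_k most uncertain cells and all entropies.

    probs: (n_cells, n_classes) predicted class probabilities, rows sum to 1.
    """
    probs = np.asarray(probs, dtype=float)
    # Clip to avoid log(0); clipped terms contribute ~0 to the sum.
    p = np.clip(probs, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    # Highest entropy first: these cells are queued for the pathologist.
    order = np.argsort(-entropy, kind="stable")
    return order[:top_k], entropy

predicted = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction, high entropy
    [0.60, 0.30, 0.10],
]
to_annotate, entropy = rank_by_entropy(predicted, top_k=2)
```

In each round, the selected cells would be annotated, added to the training set, and the classifier retrained before the remaining cells are re-ranked.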

Fig. 5: Cell type modeling correctly classifies synovial stromal and immune cells in RA synovial biopsies.

A A subset of cells from 13 RA synovial biopsies were annotated (n = 2,341) using an active learning strategy. After nuclei detection and custom feature extraction from each cell, a gradient boosted decision tree was trained using a parameter grid search with a nested, stratified, 5-fold cross-validation training strategy. The F1 scores (M ± SD) of the five folds for each cell class are presented with the overall weighted F1 of 0.85 ± 0.01 (M ± SD). B The confusion matrix from the most performant model demonstrates the typical misclassification in this dataset (data is cell counts). Stromal cells can be mistaken for other stromal cells (vascular endothelial cells, synovial lining cells, and fibroblasts) and lymphocytes can be mistaken for plasma cells. F: Fibroblast, L: Lymphoid, M/H: Macrophage/Histiocyte, PC: Plasma Cell, S/C: Stromal/Connective Cell, SLC: Synovial Lining Cell, VEC: Vascular Endothelial Cell. C, D The most performant model was used to predict the cell type of the remaining cells from all RA synovial biopsies (n = 60). C Adjacent sections to the H&E-stained slides were stained with either CD3 (T-Cells), CD20 (B-Cells) and CD138 (Plasma Cells) or CLIC5 (Synovial Lining), CD3 (T-Cells), CD68 (Macrophages), and CD34 (Vascular Endothelial Cells) (n = 15). Representative images of plasma cells (Ci), lymphocytes (Cii), synovial lining cells (Ciii) and vascular endothelial cells (Civ) with the original H&E in the left column, prediction overlays in the middle, and adjacent slide IF in the right column. Immunostains and magnification are denoted within the image. D Correlation of machine learning predictions with quantitative histomorphometry of IF+ cells from adjacent sections of n = 15 RA synovial biopsy pieces. Top: CD138+ cells vs ML Predictions of Plasma cells; Bottom: CD3+ and CD20+ cells vs ML Predictions of Lymphocytes; Spearman’s Correlations (n = 15). Please note the log scale. E Correlation of machine learning predictions of lymphocytes (as a percent of total cells) vs the Krenn Inflammation Score, Spearman’s Correlations (n = 60). Please note the log scale on the x-axis.

To further validate model performance, cell type counts were enumerated and proportions of total cells were calculated for the entire synovial biopsy dataset and grouped by pathotype. Importantly, our cell type predictions were consistent with the previously described cellular distribution within each pathotype4,10,11,13. Specifically, synovial fibroblast enrichment is found in the pauci-immune pathotype and some diffuse cases, while lymphocytes and plasma cells are found primarily in the lymphoid pathotype (Fig. 6A). Efficient pathotype prediction is a clinically relevant task that we propose will reduce time and cost in RA clinical trials. Area under the receiver operating characteristic curve (AUROC) analysis demonstrates that percent plasma cells has a high predictive capability for discriminating between diffuse and lymphoid cases (AUROC = 0.82 ± 0.06, p < 0.0001, Fig. 6B). The optimal simple threshold, 0.82% plasma cells of total cells, correctly classifies 19 out of 24 lymphoid cases and 25 out of 29 diffuse cases. Representative images of pauci-immune, diffuse and lymphoid cases with their respective predictions demonstrate the cellular distributions of these pathotypes (Fig. 6C). These data show that our model can be used in a clinically meaningful scenario and support the development of such tools for clinical trials.
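As a worked illustration of the single-threshold rule, the sketch below applies the 0.82% cutoff and derives the classification rates implied by the counts reported above (19 of 24 lymphoid and 25 of 29 diffuse cases correct). Two assumptions are made that the text does not state explicitly: cases at or above the threshold are called lymphoid, and lymphoid is treated as the positive class. The function names are hypothetical.

```python
THRESHOLD_PCT = 0.82  # percent plasma cells of total cells, from the text

def predict_pathotype(plasma_pct, threshold=THRESHOLD_PCT):
    """Assumed rule: at or above the threshold is lymphoid, else diffuse."""
    return "lymphoid" if plasma_pct >= threshold else "diffuse"

def rates(lymphoid_correct, lymphoid_total, diffuse_correct, diffuse_total):
    """Sensitivity, specificity and accuracy with lymphoid as positive."""
    sensitivity = lymphoid_correct / lymphoid_total
    specificity = diffuse_correct / diffuse_total
    accuracy = (lymphoid_correct + diffuse_correct) / (
        lymphoid_total + diffuse_total
    )
    return sensitivity, specificity, accuracy

# Counts reported in the text for the optimal threshold.
sensitivity, specificity, accuracy = rates(19, 24, 25, 29)
print(f"{sensitivity:.3f} {specificity:.3f} {accuracy:.3f}")
```

Under these assumptions the reported counts correspond to roughly 79% of lymphoid cases and 86% of diffuse cases classified correctly, or 44 of 53 cases (about 83%) overall.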

Fig. 6: Cell type modeling can differentiate diffuse vs lymphoid cases with plasma cell counts alone.

A Cell type predictions were made on 58 RA biopsy specimens (n = 5 Pauci-Immune, n = 27 Diffuse, n = 26 Lymphoid) and plots of the Synovial Fibroblasts, Lymphocytes, and Plasma Cells percent of total cells demonstrate the known clinical differences among these pathotypes. Each dot is one human biopsy specimen; Box and Whisker plots are constructed by showing the Min, 25th percentile, Median, 75th percentile and Max. The data were not normally distributed. Kruskal–Wallis tests with Dunn’s post hoc were performed. B Using plasma cell counts alone, we can discriminate between diffuse and lymphoid cases with a ROC-AUC of 0.82 ± 0.06 (n = 53). The optimal threshold is 0.82% plasma cells of total cells. C Representative H&E and cell type prediction overlays with low magnification and high magnification images of a pauci-immune case, diffuse case and lymphoid case.

Discussion

Here, we demonstrate that multi-scale modeling of synovial histopathology can pathotype RA and inflammatory arthritis in clinically meaningful settings, such as treatment response. Our model will reduce the analytical bottleneck associated with histopathology assessment in both the clinical and preclinical settings allowing quicker times to intervention or hypothesis resolution. Furthermore, it will reduce the amount of accessory immunostaining required for pathotyping by using H&E stains to infer cell types, like lymphocytes and plasma cells, which would otherwise need an immunohistochemical (IHC) stain to confirm specific cell types. This could be very impactful as the diffuse pathotype (plasma cell poor with enrichment in myeloid cells) shows improved response to tocilizumab (anti-IL-6R antibody)4 while the fibroblast rich pauci-immune pathotype shows inadequate response to anti-TNF therapy11. While we did not have enough specimens to build a model to classify the pauci-immune pathotype in the AMP-RA data set, we were able to classify a diffuse vs lymphoid pathotype utilizing a simple threshold on plasma cell percent with an AUROC of 0.82 ± 0.06.

Computational approaches to understanding tissue and cellular information from histology slides have greatly improved in recent years. Specifically, tools to segment or classify malignant tissues and cells from biopsy specimens28, and cell type classification on cytology blood smears29 have seen the largest amount of development with many applications acquiring FDA approval30. These tools have dramatically increased the throughput of clinical histologic analysis by, for example, red flagging potential malignant cases with >0.95 AUC for further review31, classifying and counting cells on cytology smears or slides for detection of malignant cells and other pathologies24,29. However, there is a dearth of computational pathology tools outside of cancer or outside of diagnostic settings and within the space of musculoskeletal pathologies32. Our work represents the first set of comprehensive tissue and cellular analysis tools for both pre-clinical and clinical phenotyping (in this case pathotyping) in inflammatory arthritis. In adjacent work in the oncology field, Pati and colleagues utilized hierarchical graph convolutional neural networks to integrate information from both the cellular and tissue levels on 2048 px x 1536 px sized images (0.42 uM2 pixels)33. This model was largely effective at detecting cancerous images with an F1 of 84.9% ± 0.8%, however did not perform well for non-cancerous, pre-cancerous or normal images with F1s of 56.6 ± 1.7%, 66.1 ± 3.7%, and 66.2 ± 1.7%, respectively. In other work, HoVer-Net is a state-of-the-art nucleus segmentation and cell type classification deep learning model for H&E-stained tissues24. HoVer-Net outperformed all other models in nuclei segmentation, however, only achieved cell classification F1s from 0.30–0.68. Thus, we utilized the nuclear segmentation portion of the model and transferred this knowledge into our custom feature extraction pipeline for classification of cells relevant in RA synovial pathology. 
In designing our feature extraction pipeline, we were inspired by Wang et al.25 who used cell intrinsic features, contextual features about surrounding nuclei, density features and spatial arrangement features generated from a H&E tumor biopsy in a graph neural net framework to predict if patients would benefit from check-point inhibitor therapy. We extend this feature modeling framework to include both hematoxylin and eosin specific features, as well as cytoplasm features with statistical calculations of neighborhood features that empirically show improved cell classification performance (Cell Intrinsic Features: 0.75 ± 0.01 vs Cell Intrinsic Features plus 150 px distance features: 0.86 ± 0.02 vs All Features: 0.88 ± 0.03, Supplemental Table 3 and Fig. 3). To the authors’ knowledge, our work is the first to utilize this type of framework in a cell type classification task. Independent validation of our human cell typing model is needed to formalize general applicability. However, we were able to utilize an independent cohort of mouse inflammatory arthritis slides (Fig. 4) which formally demonstrated the utility of such a model to recapitulate historical scoring measures and discover novel phenotypes.

Some computational approaches have already been successfully applied to H&E slides of synovial biopsies of RA patients to quantify cellular changes, such as nuclei density and its association with clinical inflammatory measures34, and simple counting of CD3+ T-cells or CD68+ macrophages on IHC35,36,37. Further, pathologist scores of specific cell types have been associated with quantitative inflammatory gene expression changes in the RA synovium38. Our approach aimed to incorporate the important cell types from this previous work while producing models that only require H&E-stained tissues. However, RA has been shown to be a complex polygenic autoimmune disorder with various environmental risk factors contributing to multiple etiologies4,39,40. As a consequence of this complexity, many RA patients are refractory to existing approved therapies41,42,43,44. This highlights the need for a personalized medicine approach to improve clinical trial design and treatment allocation, as done in recent trials like the R4RA6 and PEAC26 that utilize ultrasound guided synovial biopsies and pathotype evaluation to stratify patients. Our models may be able to improve the workflow of these types of clinical trials and reduce the overall cost. To realize the full potential of our human models and understand unbiased performance, a formal independent validation cohort needs to be developed.

One major benefit of our model is the fact that we used H&E-stained tissues. These tissues are routinely and easily collected and represent a large proportion of historical datasets allowing for larger retrospective studies. While some of the computational pathology tools utilize H&E-stained slides, many have utilized special stains geared towards tissue- or cell type- specific classification32. This limits the overall throughput and utility of such pipelines by adding additional steps and costs that may be prohibitive. In addition, using a common stain facilitates transfer learning applications to other musculoskeletal pathologies, like osteoarthritis, bone fracture or disc degeneration, with different etiologies but similar tissue involvement and histopathology requirements. However, this fundamentally limits the specificity of some cell classes. For example, not all fibroblasts are long, thin and spindly; not all macrophages are plump, granulated cells; and not all plasma cells have a red/pinkish cytoplasm or clockface appearance; all H&E histologic features which our pathologist utilized during cell annotation (Supplemental Fig. 10). We address some of these concerns by staining adjacent slides with fiducial protein markers (CD3, CD20, and CD138) that largely validate our lymphocyte and plasma cell predictions. However, we found less consistency with macrophage identification, likely due to the heterogeneity of CD68+ cells. Until high dimensional biochemical methods for cell labeling that also allow subsequent high-quality H&E staining and imaging are developed, like spatial transcriptomics, we will be fundamentally limited when reaching for multi-class cell typing.

Consensus scoring that utilizes Likert scales45 is the gold standard analysis method for histopathology. For RA, various types of assessment, including Krenn lining and inflammation scores, rely on a consensus grading system to summarize high level pathologic features of the tissue (e.g. percent of area affected) and cells (e.g. ranges of quantity within a region)46,47. This reduces the challenge of quantifying complex and heterogeneous disease states. This approach also has an added benefit of often measuring large differences which usually correspond to clinically meaningful differences. However, this reduction in complexity may remove vital biologic information about treatment response or disease heterogeneity. For example, in our previous work we used a Likert scale to quantify synovial inflammatory infiltrate in which a score of 3 corresponded to “>30 inflammatory cells thick”. If we had utilized this system in an interventional study which aimed to achieve a 50% reduction in synovial infiltrates but the number of cells across the synovium went from 100 cells thick to 50 cells thick, there would be no difference in the histology score despite a quite large treatment effect.

In the current work, we attempt to quantify this complexity with tissue segmentation and cell typing computational approaches. To demonstrate the benefit of these approaches, we can calculate the effect size (Glass’s delta) between male and female TNF-Tg, 3 month-old synovium comparing the historical synovial inflammatory infiltrate scoring system (Bell et al., Fig. 2F19) with our machine learning predictions of lymphocyte counts within the synovium (Fig. 4A) of the exact same slides. This analysis reveals that our computational methods are more than an order of magnitude more sensitive to biologic differences than histologic scoring (Histologic Scoring: 2.79 Glass’s delta vs Lymphocyte Counts: 74.82 Glass’s delta). These differences in quantification method represent a ~25-fold increase in measured effect size using the computational approach. This increased data sensitivity does, however, place an additional burden on the investigator to determine whether a measured difference is clinically meaningful. Previously, this burden was partly shouldered by the scoring mechanism. In addition to more granular quantification of pathology, these analyses are also more efficient. For example, to generate the annotations to build our segmentation model we spent 200+ hours drawing annotations on the 94 slides. However, to infer the tissue segments on the 174 slides in Fig. 2 the model took ~30 h of hands-off compute time with only 2–3 h of labor to visualize the results, representing a ~120-fold increase in efficiency.
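For readers unfamiliar with the effect size used above, Glass’s delta divides the difference in group means by the standard deviation of a single reference group. The sketch below is a minimal illustration; which sex serves as the reference group in the comparison above is not stated here, so the argument names and the example values are assumptions.

```python
from statistics import mean, stdev


def glass_delta(group, reference):
    """Glass's delta: mean difference scaled by the reference group's sample SD."""
    return (mean(group) - mean(reference)) / stdev(reference)


# Made-up measurements, for illustration only.
reference_counts = [1, 2, 3]
comparison_counts = [5, 6, 7]
print(glass_delta(comparison_counts, reference_counts))
```

Because only the reference group’s variability enters the denominator, a measurement with low within-group spread, such as a per-slide cell count, can yield a much larger delta than a coarse ordinal score for the same biological difference.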

Lastly, utilizing computational tools that quantify multiple tissues and cell types improves the ability to find novel phenomena. For example, while our primary focus is on the synovial pathology in inflammatory arthritis, having a model that measures cartilage, meniscus, and bone pathology provides a more comprehensive picture of disease. This allows easier detection of off-target or unexpected therapeutic effects with a singular methodology. However, reliance solely on computational modeling may increase false positives and expert-level quality control is advised for high impact results.

In conclusion, we have developed a set of models that can characterize tissue and cellular pathology in pre-clinical and clinical inflammatory arthritis settings. These models can be leveraged to better understand disease mechanisms in pre-clinical settings and be used in a precision medicine pipeline to improve patients’ health.

Methods

Dataset description

All mouse work was approved by the University Committee on Animal Resources at the University of Rochester Medical Center and the Institutional Animal Care and Use Committee at the Hospital for Special Surgery. Whole slide images (WSIs) of sagittal mouse knee sections were taken from two different mouse models of inflammatory arthritis and the accompanying controls for segmentation experiments (Supplemental Fig. 1A). Batch A consisted of male and female TNF Transgenic mice (TNF-Tg, n = 47) and wild-type littermates (WT, n = 15) used in previous publications19,20. Batch B consisted of previously unpublished male and female knees that received intra-articular injections of 180 μg of Zymosan to induce Zymosan Induced Arthritis (ZIA, n = 24) and control contralateral limbs (Control, n = 8)48 that were euthanized on Day 7 after injection. Different batches were used to test model generalizability across different biological mechanisms of arthritis development, differences in H&E staining protocols and slide scanners used to digitally capture slides at 40x magnification (Batch A: VS120 Olympus, 0.173 μm per pixel; Batch B: CS2 Aperio Leica, 0.253 μm per pixel).

To further test model generalizability, 2 different independent holdout datasets were used to validate the model: (1) the remaining H&E-stained sagittal knee WSIs from Bell et al.19,20 that were not annotated or used in model training and (2) Orange G-H&E stained sagittal knee WSIs from Kenney et al.21. Slides from Bell et al. were ensured to not have been used in the initial model training, internal validation, or testing. These included slides from 6 month-old male TNF-Tg (n = 33 slides from 14 knees) and WT littermates (n = 43 slides from 17 knees), and slides from 9.5 month old male TNF-Tg mice (anti-TNF: n = 24 slides from 8 knees; Placebo: n = 29 slides from 10 knees) and WT littermates (Placebo: n = 42 slides from 15 knees) either treated with a 6 week course of anti-TNF antibodies or placebo control (irrelevant IgG). To generate downstream measurements of tissues of interest, a region of interest (ROI) was drawn from the tibial growth plate to femoral growth plate including the anterior and posterior extra articular tissue.

WSIs of H&E-stained human synovial biopsies were collected from the Accelerating Medicines Partnership Rheumatoid Arthritis (AMP-RA) Phase II consortium47. In short, synovial tissue biopsies were acquired from RA patients at 13 different clinical sites in the United States and 2 in the United Kingdom from October 2016 to February 2020. The study was performed in accordance with protocols approved by the institutional review board at each site. The tissue was paraffin embedded, stained with H&E and imaged on a VS120 Olympus. Three pathologists independently determined Krenn lining and inflammatory infiltrate scores (0–3 each) for each tissue sample27, and the mode of the three scores was used for further analysis. To classify the cases into H&E based pathotypes, the UK Birmingham group developed consensus semiquantitative four point scores for infiltrate density and aggregate radial size on a per fragment basis with a custom atlas using a test set of tissues from the Birmingham Early Arthritis Cohort49, scored by three pathologists. Aggregate grade was derived as follows: Grade 3, high (≥ 20 radial cell count); Grade 2, medium (10–19 radial cell count); Grade 1, low (6–9 radial cell count); Grade 0, no aggregates. This approach was validated by scoring tissues from the first AMP RA cohort46 and original data is presented from the second AMP RA cohort47. These semiquantitative scores were then used to classify the cases into three pathotypes, either lymphoid (n = 27), diffuse (n = 26) or pauci-immune (n = 5) according to the following rules. Lymphoid: the presence of ≥1 grade 1 aggregate in at least two fragments, or any grade 2 aggregate, or any grade 3 aggregate. Diffuse: does not meet lymphoid criteria but has a mean fragment infiltrate density score ≥ 1. Pauci-immune: does not meet lymphoid criteria and has a mean fragment infiltrate density score < 1.
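The grading and pathotype rules above can be written as a short decision procedure. The following is a minimal sketch, not the consortium’s scoring software: the data layout (a list of fragments, each with its aggregate radial cell counts and an infiltrate density score) and all names are hypothetical, and treating radial counts below 6 as grade 0 is an assumption.

```python
def aggregate_grade(radial_cell_count):
    """Map an aggregate's radial cell count to a grade of 0-3."""
    if radial_cell_count >= 20:
        return 3
    if radial_cell_count >= 10:
        return 2
    if radial_cell_count >= 6:
        return 1
    return 0  # assumption: counts below 6 are not graded as aggregates


def classify_pathotype(fragments):
    """fragments: list of dicts with 'radial_counts' (list) and 'density' (0-3)."""
    grades_per_fragment = [
        [aggregate_grade(count) for count in frag["radial_counts"]]
        for frag in fragments
    ]
    any_grade_2_or_3 = any(g >= 2 for grades in grades_per_fragment for g in grades)
    fragments_with_grade_1 = sum(1 for grades in grades_per_fragment if 1 in grades)
    if any_grade_2_or_3 or fragments_with_grade_1 >= 2:
        return "lymphoid"
    mean_density = sum(frag["density"] for frag in fragments) / len(fragments)
    return "diffuse" if mean_density >= 1 else "pauci-immune"


example = [
    {"radial_counts": [7], "density": 2},
    {"radial_counts": [], "density": 1},
]
print(classify_pathotype(example))
```

In this example only one fragment contains a grade 1 aggregate, so the lymphoid criteria are not met and the mean density of 1.5 places the case in the diffuse group.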

Semantic segmentation annotation and preprocessing

Manual annotations were performed within QuPath50 to assign labels for WSIs. To test model performance across tissue types at different granularity, multiple different class structures were tested (Supplemental Fig. 2). Eleven different classes were manually annotated, including synovium, muscle and tendon, growth plate, bone marrow, cortical bone, trabecular bone, articular cartilage, meniscus, fat, bone marrow fat, and histology artifact (i.e., out of focus). A seven-class, nine-class, and ten-class segmentation task was generated by merging histologically similar tissues, such as merging cartilage and meniscus into the same class (Supplemental Fig. 2). Overall, we estimate about 250 hours were spent annotating the 11 tissue classes on 94 WSIs.

Due to the gigapixel nature of WSIs, the entire slide cannot be fed directly into a deep learning model. Previous work has shown that WSIs can be broken into patches, in this case 512 pixel x 512 pixel, to perform downstream learning tasks31. For semantic segmentation, a custom QuPath script was used to export patches at a 4x downsample while filtering out regions of the scanned slide that lacked annotations or without tissue. Additionally, images were normalized to mean 0 and standard deviation 1 by sampling a subset of patches to get mean and standard deviation RGB statistics.

Semantic segmentation models and training strategy

Initially, a stratified random sampling method was used to randomize the 94 annotated WSIs into a Training, Validation, and Test split, using 70%, 15%, and 15% for each split, respectively (3 splits total, Supplemental Fig. 1B). We stratified by batch (e.g. staining and site differences) and disease type (e.g. healthy and disease) to ensure even allocation of data variation into each set. Randomization occurred at the slide level, as opposed to the patch level, to ensure no data leakage across splits. During our initial qualitative explorations of models and hyperparameters (detailed below), we only utilized Training and Validation sets; and calculated the Dice score on the validation set to measure performance. Dice score was calculated as

$${Dice}=\frac{2 \times {intersection}}{{union}+{intersection}}$$

These initial experiments included variations in SLIC feature extraction, model selection (UNET vs UNET++), EfficientNet backbone size (B0, B2, B5), loss weighting (class-weighted vs unweighted loss), and learning rate parameters as described below. Once these items were tuned, quantitative experiments (details below) were performed varying the number of tissue segmentation classes (7, 9, 10, or 11), image augmentation (None, Low, Medium, or High), or patch overlap percentage (0%, 50%, 66%). These experiments were performed by training the models with the Training and Validation sets, freezing the models’ weights, and then inferring segmentation on the Test set. These inferences were then compared to the ground truth labels to calculate the mean Intersection over Union (mIOU) or the frequency weighted mIOU (fwIOU). The fwIOU is the sum of the per-class IOUs weighted by class frequency.
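The segmentation metrics used here can be illustrated with a minimal sketch operating on flattened label arrays. The function name and toy labels are hypothetical, and skipping classes that are absent from both the ground truth and the prediction is an assumption about edge-case handling rather than a description of the evaluation code used in this work.

```python
def segmentation_metrics(y_true, y_pred, n_classes):
    """Return per-class Dice and IOU plus mIOU and fwIOU for flat label lists."""
    intersection = [0] * n_classes
    true_count = [0] * n_classes
    pred_count = [0] * n_classes
    for t, p in zip(y_true, y_pred):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            intersection[t] += 1

    total = len(y_true)
    dice, iou, fwiou = {}, {}, 0.0
    for c in range(n_classes):
        union = true_count[c] + pred_count[c] - intersection[c]
        if union == 0:
            continue  # assumption: classes absent everywhere are skipped
        iou[c] = intersection[c] / union
        dice[c] = 2 * intersection[c] / (union + intersection[c])
        fwiou += (true_count[c] / total) * iou[c]  # weight by class frequency
    miou = sum(iou.values()) / len(iou)
    return {"dice": dice, "iou": iou, "miou": miou, "fwiou": fwiou}


truth = [0, 0, 0, 1]
prediction = [0, 0, 1, 1]
print(segmentation_metrics(truth, prediction, n_classes=2))
```

Because the fwIOU weights each class by its share of ground-truth pixels, a rare class such as meniscus contributes little to it, whereas the unweighted mIOU treats all classes equally.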

In addition, to assess how site-specific differences in histology slides may impact model performance51, we performed a single batch training method while varying the image augmentation style. In this training strategy, Batch A (n = 62) was used for the Training (45%) and Validation (20%) sets, and Batch B (n = 32, 35%) was used as the held-out Test set. These experiments were performed to assess if image augmentation could overcome batch related staining differences that are seen in the real world.

Data augmentation strategies were also tested to assess their impact across the different training strategies (Supplemental Fig. 3). Four levels of augmentation were tested: (1) None, (2) Low, (3) Medium, and (4) High. The Python package imgaug52 was used to implement augmentation. Augmentation was applied in the following way: (1) None had zero augmentation, (2) Low had 2–4 augmentation processes applied 25% of the time, (3) Medium had 3–7 augmentation processes applied 50% of the time and (4) High had 5–11 augmentation processes applied 50% of the time during training. Augmentations were randomly selected from 11 different types, including horizontal flip (p = 0.5), coarse dropout or pixel dropout from 0.2x the original image resolution (p = 0.1), one of three different rotation types at 90°, 180°, and 270°, additive gaussian noise sampled from a normal distribution with mean 0 and variance 0.2*255, blur using gaussian kernel with sigma of 1.5, hue modification using addition (−30,10) and saturation modification using multiplication (0.5,1.5) and linear contrast (0.5,2), brightness adjustment both add(−30,30) and multiply (0.5,1.5), and color change adjustment (3000, 8000). Data augmentation was performed only during the training of the models, not during testing or inference.
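The augmentation levels can be thought of as a sampling policy over the pool of augmentation types. The sketch below uses only the Python standard library to illustrate that policy; it does not call imgaug, the augmentation names are hypothetical labels, the grouping of the listed operations into 11 entries is one reading of the description above, and interpreting "applied 25% (or 50%) of the time" as a per-patch probability is an assumption.

```python
import random

# Hypothetical labels for the 11 augmentation types described above.
AUGMENTATIONS = [
    "horizontal_flip", "dropout", "rotate_90_180_270", "gaussian_noise",
    "gaussian_blur", "hue_add", "saturation_multiply", "linear_contrast",
    "brightness_add", "brightness_multiply", "color_change",
]

# (minimum count, maximum count, probability of augmenting a patch at all)
POLICIES = {
    "none": None,
    "low": (2, 4, 0.25),
    "medium": (3, 7, 0.50),
    "high": (5, 11, 0.50),
}


def sample_augmentations(level, rng):
    """Return the list of augmentation names to apply to one training patch."""
    policy = POLICIES[level]
    if policy is None:
        return []
    low, high, probability = policy
    if rng.random() >= probability:
        return []  # this patch is left unaugmented
    n_selected = rng.randint(low, high)
    return rng.sample(AUGMENTATIONS, n_selected)


rng = random.Random(0)
for level in POLICIES:
    print(level, sample_augmentations(level, rng))
```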

To compare segmentation performance using both conventional machine learning and deep learning methods, we tested two different model architectures: a Random Forest (RF) model implemented in QuPath with OpenCV53 and a U-Net++54 deep learning model implemented in PyTorch (1.8.0)55. We qualitatively assessed both pixel level segmentation and super-pixel level segmentation in QuPath and determined super-pixel segmentation performed better. To generate super-pixels, we applied a Simple Linear Iterative Clustering (SLIC) algorithm (σ = 5, spacing = 20 μm, Max Iterations = 1, Regularization = 0.01) in which each over segmented area was considered a super-pixel. We qualitatively assessed SLIC feature extraction variations and chose to extract RGB, estimated Hematoxylin, Eosin and residual stain means, standard deviations, min, max, and median values from each SLIC super-pixel. We also calculated the Haralick value using a distance of 1 and bin of 32. We next calculated the average features of super-pixels within 40 and 80 μm. These features were then used in the RF, which was implemented fully in QuPath using their “Train Object Classifier” GUI with max depth = 20, min sample count = 10, Active variable count = 0, maximum trees = 50 and termination epsilon = 0. Four models were built, one for each of the 7 class, 9 class, 10 class and 11 class segmentation tasks. After the models were trained, they were used to infer on the Test set and performance metrics were calculated.

For our deep learning pipeline, we utilized the segmentation_models_pytorch56 package for the UNET++ implementation with an ImageNet pre-trained EfficientNet-B5 backbone for the encoder to improve training time and computational efficiency57. The decoder was left unchanged from the native UNET++ architecture except for the final layer which was changed to match the number of tissue types being segmented. We also initially explored using the UNET architecture and other EfficientNet backbones, however UNET++ with a B5 backbone was found to be most performant. We initially explored both class frequency weighted loss and un-weighted loss and the un-weighted loss demonstrated improved performance. We utilized a combo loss calculation, which was the arithmetic average of the dice loss and binary cross entropy (BCE) loss during model training58. We explored a few hyperparameter variations for the learning rate scheduler including step size = [1, 2], gamma = [0.25, 0.5] and learning rate start = [0.01, 0.02, 0.025]. From initial explorations we found that a step size of 2, a gamma of 0.5 and a learning rate start of 0.025 resulted in the best performance. We used a stochastic gradient descent (SGD) optimizer with momentum (0.9) and regularization of 1 × 10⁻⁴ and the models were trained for 10 epochs at a batch size of 40.
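To make the training configuration concrete, the sketch below works through the learning rate schedule and the combo loss in plain Python. It assumes the scheduler is a standard step decay (the learning rate is multiplied by gamma every step-size epochs) and that the loss is evaluated on predicted probabilities for a single binary mask; both are simplifying assumptions, and the function names are hypothetical rather than taken from the training code.

```python
import math


def step_lr(epoch, lr_start=0.025, step_size=2, gamma=0.5):
    """Learning rate at a given zero-based epoch under step decay (assumed)."""
    return lr_start * gamma ** (epoch // step_size)


def dice_loss(probs, targets, eps=1e-7):
    """One minus the soft Dice coefficient of predicted probabilities."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def combo_loss(probs, targets):
    """Arithmetic average of the Dice loss and the BCE loss."""
    return 0.5 * (dice_loss(probs, targets) + bce_loss(probs, targets))


print([step_lr(epoch) for epoch in range(10)])
print(combo_loss([0.9, 0.2, 0.7], [1, 0, 1]))
```

Under these assumptions the learning rate halves four times over the 10 epochs, ending at one sixteenth of its starting value.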

When jointly training across 10 different classes, this model provided no prediction for the meniscus class resulting in a mean Intersection Over Union (mIOU) of 0 ± 0 (Supplemental Fig. 6), likely due to insufficient training examples (Supplemental Fig. 2; 0.9% frequency overall). As the meniscus was important for downstream biological analyses, we developed a strategy to improve predictions for this class. A second UNET++ model was fine-tuned from the nine-class model (i.e. the model that has cartilage and meniscus merged into one class) by changing the prediction (final) layer specifically to predict between cartilage and meniscus. We then re-trained the full model only on images and Ground Truth annotations (GTs) that contained either cartilage or meniscus within the mixed training set.

Previous work had suggested that UNET based semantic segmentation can have image boundary level artifacts17. Therefore, we assessed whether overlapping patches at prediction time can improve model performance. We included experiments using no overlap, 50% overlap and 66% overlap between adjacent patches, with images from both batches (Supplemental Fig. 4). To analyze the results, we looped through all the predictions (N) for the entire WSI and calculated a majority vote for each pixel after thresholding to remove low confidence predictions (pixel value of 75). N varies depending upon the region, for example when a pixel lies on a border, but is typically between 4 and 9.
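The per-pixel voting step can be sketched as follows. This is a minimal illustration for a single pixel, assuming each overlapping patch contributes a class label together with a confidence value on the same scale as the threshold of 75; the function name, labels, and values are hypothetical.

```python
from collections import Counter


def vote_pixel(predictions, min_confidence=75):
    """Majority vote over (label, confidence) pairs from overlapping patches.

    Predictions below the confidence threshold are discarded first. Returns
    None when no prediction survives the threshold.
    """
    kept = [label for label, confidence in predictions if confidence >= min_confidence]
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]


# Four overlapping patches cover this pixel; one prediction is low confidence.
pixel_predictions = [("synovium", 200), ("synovium", 180), ("fat", 90), ("fat", 40)]
print(vote_pixel(pixel_predictions))
```

Discarding low-confidence predictions before voting means that patches in which the pixel falls near an edge, where artifacts are expected, have less influence on the final label.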

Semantic segmentation model evaluation, inference on external validation set and statistical evaluation

To evaluate the semantic segmentation model performance for the Validation and Test set slides with ground-truth labels, we calculated the mean Intersection over Union (mIOU) and frequency-weighted mIOU (fwIOU) to prevent very rare classes from drastically impacting overall model performance59. All model hyperparameter optimization was performed on the Validation set, and once the above parameters were chosen the models were used to infer segmentation on the Test set and the mIOUs were calculated.

We used the optimal settings from the training/validation process to evaluate the model performance on the held-out datasets. Pearson’s correlation was used to compare the hand drawn synovial tissue area reported in Bell et al.19 with model classified synovial tissue area. The fine-tuned 10-class model was used to infer tissue segmentation on the held-out data21,60. Specifically, the UNET++ 9 class model was used to predict tissue classes on each patch, and if the nine-class model had a prediction output for the combined Cartilage-Meniscus class on a patch, then the patch was passed through the fine-tuned 2 class model to assign to either the meniscus or cartilage class. Predictions were merged by only allowing the fine-tuned predictions to be within the predictions from the combined Cartilage-Meniscus class. Once the inference was complete, a ROI was drawn on each slide from femoral growth plate to tibial growth plate around the joint to restrict the downstream analysis to the joint space, subchondral bone, and synovial adjacent tissue. Tissue area was calculated for each slide and averaged for each knee, then a One-Way ANOVA with Tukey’s post-hoc adjustment was used to detect significant differences.
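The merging rule for the two models can be summarized in a few lines. The sketch below operates on flattened per-pixel labels and is illustrative only; the label strings and function name are hypothetical.

```python
def merge_predictions(nine_class_labels, fine_tuned_labels,
                      combined_label="cartilage_meniscus"):
    """Refine the combined class with the two-class model, pixel by pixel.

    The fine-tuned prediction is used only where the nine-class model predicted
    the combined Cartilage-Meniscus class; all other pixels keep their label.
    """
    merged = []
    for coarse, fine in zip(nine_class_labels, fine_tuned_labels):
        merged.append(fine if coarse == combined_label else coarse)
    return merged


coarse = ["synovium", "cartilage_meniscus", "cartilage_meniscus", "fat"]
fine = ["meniscus", "meniscus", "cartilage", "cartilage"]
print(merge_predictions(coarse, fine))
```

Restricting the two-class output in this way prevents the fine-tuned model, which only knows cartilage and meniscus, from relabeling tissue that the nine-class model assigned elsewhere.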

Cell type classification framework and preprocessing

For cell type classification, a combination of transfer learning and active learning was used to identify several different cell types that exist within the joint tissue. Cell type classification can be broken into a two-step process, (1) segmentation and (2) classification. For cell segmentation, transfer learning was used by leveraging a deep learning model, HoVer-Net24, pretrained on the PanNuke dataset61, to extract nuclei regions. Image patches (1024 × 1024 pixels) at 40x magnification were given as inputs, and ROI contours of nuclei were obtained to perform feature extraction upon. These nuclei with their features and labels (detailed below) were then leveraged in a gradient boosted decision tree (GBDT) model to classify cells, using XGBoost (https://xgboost.readthedocs.io/en/stable/) together with the Scikit-learn package (https://scikit-learn.org/stable/).

The input for the classifier was derived from features extracted from each ROI generated by HoVer-Net. Specifically, every nucleus from the json file output of HoVer-Net was converted into a ROI (.roi) file to be read into FIJI/ImageJ62 for feature extraction using a custom script. The workflow for the ImageJ/FIJI analysis is described as follows. Each image was split using the built-in color deconvolution63 algorithm in FIJI into hematoxylin and eosin color channels. For each channel the nuclei were measured for several different parameters, including morphological quantities (area, perimeter, circularity, Feret’s diameter, Feret angle, aspect ratio, roundness, and solidity) and staining color quantities (mean, mode, min, max, standard deviation, skewness, median, and kurtosis). The nuclei ROI was then enlarged by 20 px, the original nuclei masked out, and the data from this surrounding 20 px ring was used to calculate cell specific cytoplasmic H&E color information for each cell. These features are called cell intrinsic features (Supplemental Table 1). Neighborhood characteristics of cells at several different distance ranges (150 px and 300 px) were used to include local cell and tissue level context into cell type classification (Supplemental Table 2)25. Simply, a neighborhood is determined as all the cells within a certain distance to the parent cell. These neighborhood characteristics included the average, standard deviation, skew, kurtosis, Z-score, interquartile range, standard error of the mean and entropy of the cell intrinsic features of cells in the neighborhood. We also calculated shape characteristics of the neighborhood including the average distance of the neighboring cells, the linear correlation coefficient of the cells within the neighborhood, the linear correlation coefficient of a string of up to 30 cells, the straight-line distance of a string of up to 30 cells, a scored measure of density of the cells and the number of cells within the distance measure. A final total of 854 features were extracted for the downstream analysis.
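The neighborhood features can be illustrated with a minimal sketch for one intrinsic feature and one radius. It is not the custom FIJI script used in this work: the cell records, field names, and function name are hypothetical, and excluding the parent cell from its own neighborhood, using the population standard deviation, and defining the Z-score as the parent cell’s value relative to its neighbors are all assumptions.

```python
import math
from statistics import mean, pstdev


def neighborhood_stats(cells, index, feature, radius_px=150):
    """Summarize one feature over all cells within radius_px of a parent cell."""
    parent = cells[index]
    values = [
        cell[feature]
        for i, cell in enumerate(cells)
        if i != index
        and math.hypot(cell["x"] - parent["x"], cell["y"] - parent["y"]) <= radius_px
    ]
    if not values:
        return {"n_neighbors": 0, "mean": None, "std": None, "z_score": None}
    mu = mean(values)
    sd = pstdev(values)
    z_score = (parent[feature] - mu) / sd if sd > 0 else 0.0
    return {"n_neighbors": len(values), "mean": mu, "std": sd, "z_score": z_score}


cells = [
    {"x": 0, "y": 0, "area": 10},
    {"x": 100, "y": 0, "area": 20},
    {"x": 0, "y": 100, "area": 40},
    {"x": 1000, "y": 1000, "area": 99},  # too far away to be a neighbor
]
print(neighborhood_stats(cells, index=0, feature="area"))
```

Repeating such summaries for every intrinsic feature and for both radii is what expands a modest set of per-cell measurements into several hundred features per cell.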

Known healthy and pathologic cell types that contribute to inflammatory arthritis were then annotated on both mouse and human tissues (details below, Supplemental Figs. 8 and 10). The mouse proof-of-concept classification task consisted of bone-embedded cells, blood vessel cells, adipo-stromal cells (both adipose and stromal cells within fatty tissue), synovial fibroblasts (healthy and pathologic), chondrocytes, lymphocytes, and other synovial lining cells (healthy and pathologic) as detailed in Supplemental Fig. 8; these were annotated by subject-matter experts familiar with histologic analysis of these cell types. These cells were annotated on six healthy, eight mild disease and five severely diseased TNF-Tg knee sections. For human samples, a clinically meaningful set of cell types was labeled by a senior pathologist, following a standard cell type hierarchy (Supplemental Fig. 10). These included stromal/connective tissue cells, synovial lining cells, synovial fibroblasts, vascular endothelial cells, tissue macrophages/histiocytes, lymphocytes, and plasma cells. These cells were labeled on five lymphoid, five diffuse and three pauci-immune cases. Segmented nuclei were then mapped to manually annotated nuclei by checking if a nucleus’s centroid (as determined by HoVer-Net) was within an annotation mask.

Mouse cell type classification model

A total of 4,712 cells were annotated for mouse cell type classification from seven different classes (Supplemental Fig. 8). Cells were labeled from a total of 19 different slides. A GBDT model was trained for cell type classification (XGBoost (https://xgboost.readthedocs.io/en/stable/) used together with the Scikit-learn package (https://scikit-learn.org/stable/)), using stratified nested 5-fold cross validation with grid search to select the best models. In order to minimize the influence of annotations from any one slide, we enforced an even sampling method to ensure approximately equal numbers of cells from each slide appeared in all folds (sklearn.model_selection.StratifiedKFold). To tune the parameters of the GBDT, we performed a grid search (sklearn.model_selection.GridSearchCV) on the inner CV for learning rate = [0.05, 0.1, 0.2], colsample_bytree = [0.6, 0.8, 1.0], subsample = [0.25, 0.5], max_depth = [6, 12], n_estimators = [10, 100, 200, 400], gamma = [0, 0.1, 0.3], and min_child_weight = [1, 5, 10]. To evaluate model performance, F1 statistics were calculated as the average of the 5 external folds and then the best performing model was used to infer cell type in two different biological settings, (1) to identify cell composition changes across different disease severities on the remaining cells (>300,000) on the 19 slides, and (2) to identify differences between male and female mice in the context of disease progression in a held-out dataset19. Finally, average synovial inflammatory infiltrate scores and average pannus invasion scores were correlated with lymphocyte and synovial lining cell counts respectively (Spearman’s Correlation).
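To give a sense of the size of this search, the sketch below enumerates the hyperparameter grid with the standard library. The dictionary uses the parameter names as they would be passed to an XGBoost classifier through a grid search; the fold counts used to estimate the total number of model fits assume that both the inner and outer loops are 5-fold, which is an assumption about the inner loop.

```python
from itertools import product

PARAM_GRID = {
    "learning_rate": [0.05, 0.1, 0.2],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "subsample": [0.25, 0.5],
    "max_depth": [6, 12],
    "n_estimators": [10, 100, 200, 400],
    "gamma": [0, 0.1, 0.3],
    "min_child_weight": [1, 5, 10],
}


def iter_grid(grid):
    """Yield every hyperparameter combination as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


n_candidates = sum(1 for _ in iter_grid(PARAM_GRID))
outer_folds, inner_folds = 5, 5  # inner fold count is an assumption
total_fits = n_candidates * outer_folds * inner_folds
print(n_candidates, total_fits)
```

A dictionary of lists in this form is also the shape that scikit-learn's GridSearchCV accepts as its parameter grid.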

Feature ablation studies

To demonstrate the performance improvements of the distance features, we performed a feature ablation study within the mouse cohort in which three feature sets were compared in our modeling framework (detailed above): cell intrinsic features only (no distance features), cell intrinsic features plus 150 px distance features, and all features (cell intrinsic features, 150 px features and 300 px features). To evaluate model performance, F1 statistics were calculated as the average of the 5 external folds.

Active learning implementation

Human annotation was the most time-consuming step of the cell type classification pipeline. Therefore, we applied an active learning strategy to improve the efficiency of cell annotation for human samples. To develop this strategy, we tested a proof-of-concept active learning strategy using labeled data from the mouse H&E slides (Supplemental Fig. 9). Active learning is an iterative process that consists of three main steps, (1) annotation, (2) model training, and (3) sample selection for further annotations. Its goal is to select the samples that lead to the largest model performance improvement when added to the training data after annotation. To validate the strategy, 100 different rounds of 5-fold cross-validation were performed. Average F1-scores were reported for each class and a macro-F1 score was additionally reported. 25 runs of 5-fold cross validation were removed due to cells from a single class not being present in both the training and testing sets. For the training dataset for each split, 5% of cells were first randomly selected as the initial set of annotated cells. Subsequently, the GBDT classifier was trained using this randomly selected data. Several different metrics for selecting cells for annotation and subsequent model fine-tuning were assessed, including smallest margin uncertainty64, least confidence uncertainty65 and entropy-based uncertainty64. The top 5% of cells were added to the training dataset and the cycle of model training, evaluation and new cell annotation continued until the entire training dataset was used. A random selection of cells after shuffling was also tested to compare model performance to the various active learning strategies. The package modAL66 was leveraged in our implementation. Means and 95% confidence intervals are reported for each subset across the 75 different runs of 5-fold cross validation.
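The three uncertainty measures compared here can be written directly from a classifier’s predicted class probabilities. The sketch below is a standalone illustration and does not use the modAL package; the function names and example probabilities are hypothetical, and the margin is converted to an uncertainty score (one minus the gap between the two most likely classes) so that larger values always mean more uncertain.

```python
import math


def least_confidence(probs):
    """Uncertainty as one minus the probability of the most likely class."""
    return 1.0 - max(probs)


def margin_uncertainty(probs):
    """One minus the gap between the two most likely classes."""
    top_two = sorted(probs, reverse=True)[:2]
    return 1.0 - (top_two[0] - top_two[1])


def entropy_uncertainty(probs):
    """Shannon entropy of the predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, score_fn, n_select):
    """Indices of the n_select unlabeled cells with the highest uncertainty."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score_fn(pool_probs[i]), reverse=True)
    return ranked[:n_select]


# Predicted class probabilities for three unlabeled cells (made-up values).
pool = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.60, 0.30, 0.10]]
for score_fn in (least_confidence, margin_uncertainty, entropy_uncertainty):
    print(score_fn.__name__, select_most_uncertain(pool, score_fn, n_select=1))
```

In each active learning round, the selected cells would be sent for annotation, added to the training set, and the classifier retrained before the next selection.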

Cell classification model evaluations

Confusion matrices were generated for model predictions, along with F1-scores calculated as \(2\cdot \frac{{precision}\cdot {recall}}{{precision}+{recall}}\), where \({precision}=\frac{{TP}}{{TP}+{FP}}\) and \({recall}=\frac{{TP}}{{TP}+{FN}}\), and TP, FP, and FN denote true positives, false positives, and false negatives, respectively. Models were also evaluated qualitatively by testing them on known cell types within specific tissue types.
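As a worked illustration of these definitions, the sketch below derives per-class F1-scores from a confusion matrix and averages them into a macro-F1 (the unweighted mean over classes). It assumes rows index the true class and columns the predicted class, and it assigns an F1 of 0 to a class with an undefined precision or recall; both conventions, like the function names, are assumptions rather than details taken from the study.

```python
import numpy as np


def per_class_f1(confusion):
    """Per-class F1 from a square confusion matrix (rows: true, columns: predicted)."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp  # predicted as the class but truly another class
    fn = confusion.sum(axis=1) - tp  # truly the class but predicted as another class
    with np.errstate(divide="ignore", invalid="ignore"):
        precision = np.where(tp + fp > 0, tp / (tp + fp), 0.0)
        recall = np.where(tp + fn > 0, tp / (tp + fn), 0.0)
        denom = precision + recall
        f1 = np.where(denom > 0, 2 * precision * recall / denom, 0.0)
    return f1


def macro_f1(confusion):
    """Unweighted mean of the per-class F1-scores."""
    return float(per_class_f1(confusion).mean())


if __name__ == "__main__":
    # Hypothetical two-class confusion matrix.
    cm = [[8, 2],
          [1, 9]]
    print(per_class_f1(cm), macro_f1(cm))
```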

Human synovial biopsy cell type modeling

Active learning was then leveraged for human cell type classification using H&E-stained slides of human synovial biopsy tissue from RA patients in the AMP consortium, as described above. A subset of slides was selected for annotation (Lymphoid, n = 5; Diffuse, n = 5; Pauci-Immune, n = 3) to represent the diversity of specimens within this cohort. Multiple rounds of cell type labeling were performed with the assistance of active learning to obtain a total of 2,639 cells grouped into seven different cell types, detailed in Supplemental Fig. 10 (stromal/connective tissue cells n = 597, synovial lining cells n = 309, synovial fibroblasts n = 189, vascular endothelial cells n = 486, tissue macrophages/histiocytes n = 201, lymphocytes n = 826, and plasma cells n = 310). A GBDT cell type classification model was trained using stratified nested 5-fold cross-validation with a grid search strategy (as described above) to select the best models. F1 statistics were calculated as the average of the 5 external folds. The best performing model was used to infer the cell types of all cells on the slides within this patient cohort (n = 58 subjects; 2,976,535 total cells). Summary cell type quantification (total cell counts and percent of total) was then assessed for each patient. Two analyses were performed using the cell types derived from the cell classification model. First, cell type counts and proportions were correlated with either immunofluorescence-stained adjacent sections (described below) or pathologist-derived, clinically relevant Krenn inflammation scores. As these data were non-normal, we utilized Spearman’s correlation. Second, we assessed the frequency of cell types across pathotypes. Specifically, the slide-level proportions of lymphocytes, plasma cells, and fibroblasts were tested for statistically significant differences across pathotypes. Additionally, we performed a receiver operating characteristic (ROC) curve analysis of plasma cell frequency (as a percentage of total cells) to predict whether a biopsy was a lymphoid or diffuse case (n = 53).
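For readers who want to reproduce the two statistics outside of Prism, the following is a minimal NumPy sketch of Spearman's correlation (Pearson correlation on tie-averaged ranks) and of the area under the ROC curve computed through the rank-sum identity. It is an illustration under stated assumptions and not the analysis procedure used in the study; the function names and the example values are hypothetical.

```python
import numpy as np


def average_ranks(values):
    """1-based ranks in which tied values share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


def roc_auc(scores, is_positive):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    n_pos = int(is_positive.sum())
    n_neg = len(scores) - n_pos
    u = average_ranks(scores)[is_positive].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))


if __name__ == "__main__":
    # Hypothetical plasma cell fractions; True marks a lymphoid case, False a diffuse case.
    plasma_fraction = [0.10, 0.40, 0.35, 0.80]
    is_lymphoid = [False, False, True, True]
    print(roc_auc(plasma_fraction, is_lymphoid))
```

The rank-sum form of the AUC treats a tie between a positive and a negative case as half a correct ordering, which matches the trapezoidal area under the empirical ROC curve.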

Immunofluorescence (IF) and histomorphometry

Adjacent sections from 15 of the RA synovial biopsies were stained in batches with either CD3 (T-Cells), CD20 (B-Cells) and CD138 (Plasma Cells) or CLIC5 (Synovial Lining), CD3 (T-Cells), CD68 (Macrophages), and CD34 (Vascular Endothelial Cells) antibodies, and counterstained with DAPI. In-depth staining procedures are described in the original work47. All IF images were imported into QuPath to perform histomorphometry. All biopsies were evaluated for similarity of tissue morphology to the adjacent H&E section to ensure that the physical distance between sections was as small as possible. To count IF+ cells, DAPI+ cells were first segmented with a watershed algorithm in QuPath (cell detection), and the mean CD138, CD3, and CD20 IF intensity was then calculated for each cell. Staining batch-specific thresholds for each channel were used to count positive cells.
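The final counting step can be sketched as follows, assuming the per-cell mean intensities for one channel have been exported from QuPath into an array. The function name, the batch labels, the threshold values, and the use of a strict greater-than comparison are all hypothetical; the study does not state how its thresholds were applied.

```python
import numpy as np


def count_positive_cells(mean_intensity, batch_ids, thresholds):
    """Count cells whose mean channel intensity exceeds their batch's threshold.

    mean_intensity : per-cell mean IF intensity for a single channel
    batch_ids      : per-cell staining batch label
    thresholds     : dict mapping batch label to intensity cutoff (hypothetical values)
    """
    mean_intensity = np.asarray(mean_intensity, dtype=float)
    # Look up the cutoff that applies to each cell's staining batch.
    cutoffs = np.array([thresholds[b] for b in batch_ids], dtype=float)
    positive = mean_intensity > cutoffs  # strict comparison is an assumption
    return int(positive.sum()), positive


if __name__ == "__main__":
    intensities = [10.0, 50.0, 30.0, 80.0]
    batches = ["A", "A", "B", "B"]
    n_pos, mask = count_positive_cells(intensities, batches, {"A": 20.0, "B": 60.0})
    print(n_pos, mask)
```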

Visualization of data

Uniform Manifold Approximation and Projection (UMAP)67 was used to visualize feature representations across batches and the features of the cell type framework. Both tissue segmentation masks and cell type masks for each class were reimported into QuPath50 for visualization purposes.

Statistical approach and implementation

All graphing and hypothesis testing were performed in Prism (10.0, GraphPad, Boston, MA). For all continuous variables, a Shapiro-Wilk test was performed to assess normality. If the test determined a distribution to be non-normal, the equivalent non-parametric test was used to test for significance or correlation. If an ordinal variable was being associated with a continuous variable, a non-parametric Spearman’s correlation was chosen. Otherwise, One-Way, Two-Way and Two-Way Repeated Measures ANOVAs with Tukey’s Post-Hoc tests were used to test for significant main effects, interaction effects and post-hoc pairwise differences. All pairwise tests are two-tailed. Specific test information, including the sample size for each figure, is provided in the “Supplemental Statistical Information Pertaining to Data Presented in Figures” document.

Software and hardware

QuPath (0.3.2 or later) was used to visualize WSIs and annotate tissues or cells, as well as to perform some image processing (detailed above). All other machine learning or deep learning techniques were performed in Python (3.8.1) as described above. Primary machine learning libraries include PyTorch (1.8.0), segmentation_models_pytorch (0.1.3)56, scikit-learn (1.3.2), and xgboost (2.0.2). Deep learning segmentation training was performed in parallel on four Nvidia V100 GPUs with 16 GB of RAM each, using CUDA (11.6.2). Segmentation inference was performed on either an Nvidia 3070 or 3090. Cell type classification was performed on an Nvidia 3070.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.