Reliability of Histologic Scoring for Lupus Nephritis: A Community-based Evaluation

  1. Richard M. Wernick, MD;
  2. David L. Smith, MD;
  3. Donald C. Houghton, MD;
  4. David S. Phillips, PhD;
  5. James L. Booth, MD;
  6. Douglas N. Runckel, MD;
  7. David S. Johnson, MD;
  8. Kevin K. Brown, MD; and
  9. Cynthia L. Gaboury, MD
  From Providence Medical Center, Portland Veterans Affairs Medical Center, Oregon Health Sciences University, St. Vincent Medical Center, and Good Samaritan Hospital, Portland, Oregon. Request for Reprints: Richard Wernick, MD, Providence Medical Center, 4805 Northeast Glisan Street, Portland, OR 97213. Acknowledgments: The authors thank Chris Siegenthaler for her expert assistance in manuscript preparation; P. Slopoko and Debra Miles for inspiration and support; and Howard A. Austin III for helpful discussion.

    Abstract

    Objective: To determine the reliability of the National Institutes of Health (NIH)-modified semiquantitative histologic scoring system for lupus nephritis.

    Design: Cross-sectional study, repeated after 8 to 9 months.

    Setting: Four community hospitals and one university medical center.

    Participants: Five pathologists, all experienced in reading renal biopsy specimens, assessed 25 specimens that had been obtained from patients with a clinical diagnosis of systemic lupus erythematosus and showed diffuse proliferative glomerulonephritis.

    Measurements: Biopsy specimens were scored independently and blindly by pathologists for components of nephritis chronicity and activity. Reliability was measured by percentage agreement, intraclass correlation coefficient or κ statistic, and individual reader effect on the group arithmetic mean.

    Results: As scored by the readers, the mean chronicity index score varied from 2.3 to 4.8 on a 12-point scale (P = 0.001) and the mean activity index score varied from 5.8 to 11.4 on a 24-point scale (P = 0.0001). Pairs of readers gave scores within 1 point for the chronicity index and within 2 points for the activity index in 50% of cases, and risk group assignments based on chronicity index (three strata) and activity index (two strata) were concordant in 59% and 76% of cases, respectively. Intraclass correlation coefficients for inter-reader agreement were 0.58 for the chronicity index (P < 0.01) and 0.52 for the activity index (P < 0.01). Intrareader agreement was uniformly higher than inter-reader agreement, but mean intraclass correlation coefficients exceeded 0.70 for only 1 of the 10 index components. Repeated readings yielded chronicity index scores that were more than 1 point discordant in 45% of cases and activity index scores that were more than 2 points discordant in 43% of cases. Risk group assignment changed on the basis of chronicity index and activity index in 36% and 21% of cases, respectively.

    Conclusions: In a nonreferral setting, the NIH-modified scoring system for lupus nephritis is only moderately reproducible and, if used to prognosticate renal outcome, may result in erroneous predictions of risk for renal failure and response to therapy.

    Diffuse proliferative glomerulonephritis is the most severe form of lupus nephritis [1, 2], but the renal outcome in patients with this form of nephritis varies greatly, and much histologic heterogeneity is found within this subgroup. Efforts have been made to define whether specific histologic features may allow a more accurate prediction of renal outcome, which might in turn facilitate decision making regarding intensity of therapy for an individual patient. In particular, a semiquantitative scoring system developed by Pirani and Salinas-Madrigal [3, 4] and modified by Austin and colleagues [5] has been used widely to assess the potential reversibility of lesions and to predict therapeutic response [5-20]. In this system, an activity index measures six histologic components of activity of lupus nephritis and a chronicity index measures four histologic components of chronic irreversible lupus nephritis. Most studies in which this system has been applied to patients treated for lupus nephritis have found that one or both of the indices has predictive value for progressive renal failure [5-9, 14, 16, 17, 19]. In particular, a high chronicity index portends refractoriness to aggressive therapy, whereas active lesions are potentially reversible [5, 11, 14, 21]. Based on its apparent validity, the system has been recommended for use in guiding intensity of therapy [6, 8, 21-23].

    However, before a test is used widely, reliability, or reproducibility, must be shown. To date, little evidence for the reliability of this system exists. Several investigators have noted parenthetically in their reports that agreement between two readers was almost perfect [5, 6, 8]. Gamba and colleagues [23] concluded that agreement among three experienced academic pathologists, working in the same institution, was excellent. However, no attempt has been made to study the reliability of the system in a community setting or among readers from different institutions. Because small differences in scoring imply vastly different renal prognoses [11], poor reproducibility would produce misleading information and could lead to inappropriate management. Our aim was to assess the reliability of activity and chronicity index scoring by pathologists who routinely read renal biopsy specimens in a community setting.

    Methods

    Renal Biopsy Specimens

    By searching pathology files and querying community nephrologists, we identified 26 renal biopsy specimens that had been obtained from patients with a clinical diagnosis of systemic lupus erythematosus and showed diffuse proliferative glomerulonephritis. Two pathologists believed the quality of 1 specimen was unacceptable, and it was omitted from the analysis. Of the remaining 25 specimens, 16 were taken from the files of a university medical center and 9 from those of three community hospitals. Four biopsy specimens had been embedded in paraffin and the rest in methacrylate plastic. For each case, pathologists reviewed two slides: one containing sections stained with hematoxylin-eosin or periodic acid-Schiff stain, and the other containing sections stained with methenamine silver. The mean number of glomerular profiles present in each section was 15 (range, 6 to 36 profiles).

    Pathologist-Readers

    Five pathologists, all responsible for the light microscopic interpretation of renal biopsy specimens at their respective institutions, were asked to participate in the study. Four of the pathologists were based at community hospitals (three in Portland, Oregon, and one in Salem, Oregon), and one was based at Oregon Health Sciences University. None routinely used a scoring system, but all were experienced with individual components of the chronicity and activity indices.

    Scoring System

    Readers were asked to score biopsy specimens based on the system developed by Pirani and colleagues [4] and modified by Austin and colleagues [5] (Table 1). Briefly, each biopsy specimen was assessed for 10 components, of which 6 make up an activity index and 4 a chronicity index. Each individual component is scored 0 (normal), 1, 2, or 3 (severe abnormality). In calculating the activity index, fibrinoid necrosis and cellular crescents are weighted by a factor of 2. The maximum activity index is 24 points and the maximum chronicity index is 12 points.
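
    The arithmetic of the two indices can be sketched in a few lines of Python. This is only an illustration of the weighting described above, not code used in the study. The dictionary keys are hypothetical labels; in particular, `other_a`, `other_b`, `other_c`, `other_x`, and `other_y` are placeholders for the components listed in Table 1 and are not taken from it.

```python
# Components doubled in the activity index, as stated in the text.
# The key names themselves are illustrative labels, not taken from Table 1.
DOUBLED_ACTIVITY_COMPONENTS = {"fibrinoid_necrosis", "cellular_crescents"}


def _check(scores, expected_count):
    """Validate the number of components and the 0-3 range of each score."""
    if len(scores) != expected_count:
        raise ValueError(f"expected {expected_count} components, got {len(scores)}")
    for name, value in scores.items():
        if value not in (0, 1, 2, 3):
            raise ValueError(f"{name}: score must be 0, 1, 2, or 3")


def activity_index(scores):
    """Sum six component scores, weighting the two doubled components by 2."""
    _check(scores, 6)
    missing = DOUBLED_ACTIVITY_COMPONENTS - scores.keys()
    if missing:
        raise ValueError(f"missing doubled components: {sorted(missing)}")
    return sum(
        value * (2 if name in DOUBLED_ACTIVITY_COMPONENTS else 1)
        for name, value in scores.items()
    )


def chronicity_index(scores):
    """Sum four unweighted component scores."""
    _check(scores, 4)
    return sum(scores.values())


# Hypothetical specimen; the placeholder keys stand in for Table 1 components.
example_activity = {
    "cellular_proliferation": 2,
    "fibrinoid_necrosis": 1,
    "cellular_crescents": 3,
    "other_a": 1,
    "other_b": 0,
    "other_c": 2,
}
example_chronicity = {
    "glomerular_sclerosis": 1,
    "fibrous_crescents": 0,
    "other_x": 2,
    "other_y": 1,
}

print(activity_index(example_activity))      # 13
print(chronicity_index(example_chronicity))  # 4
```

    With every component scored 3, the functions return the stated maxima of 24 and 12 points.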

    Table 1. Scoring System for Histologic Features of Renal Biopsy Specimens*

    Reader Scoring

    Biopsy specimens were labeled nonconsecutively and were blindly and independently read in random order by each pathologist. Each reader was provided with a standard score sheet for grading individual components and with detailed descriptions defining each component and scoring gradation [5]. Four of the five pathologists participated in a second blind reading of relabeled specimens 8 to 9 months after the initial reading. Results of the study were neither discussed with nor distributed to any of the readers until completion.

    Statistical Analysis

    Reliability was assessed by three methods. First, the percent agreement for the 10 possible pairs of readers in the first round and the 6 pairs of readers in the second round was calculated. Scores were considered concordant if they were within 1 and 2 points for the chronicity index and the activity index, respectively. For all other features, identical responses were required for concordance. Second, the intraclass correlation coefficient was calculated for inter-reader and intrareader reliability regarding noncategorical results [24, 25], and the κ statistic was calculated for reliability regarding dichotomous risk group assignments [26]. These statistics reflect the agreement between observers while taking into account the contribution of chance. Results may range from −1.0, indicating complete disagreement, to 0 for chance agreement alone, to 1.0 for perfect agreement. Third, the effect of individual readers on the group arithmetic mean was calculated [27]. The reader's effect is the arithmetic mean of all scores assigned by that reader minus the arithmetic mean of all scores assigned by the group. This measure shows the contribution of each reader to the mean score.
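
    As a rough illustration of the first and third methods, and of a chance-corrected statistic for categorical assignments, the following Python sketch may help. It is not the study's analysis code. The function names and the example scores are hypothetical, and Cohen's kappa for a pair of readers is assumed here as the form of the kappa statistic; the intraclass correlation coefficient is not shown because the text does not state which form was used.

```python
from itertools import combinations
from statistics import mean


def pairwise_percent_agreement(scores_by_reader, tolerance):
    """Percentage of (reader pair, specimen) comparisons within `tolerance` points."""
    concordant = total = 0
    for a, b in combinations(scores_by_reader.values(), 2):
        for x, y in zip(a, b):
            total += 1
            concordant += abs(x - y) <= tolerance
    return 100.0 * concordant / total


def reader_effects(scores_by_reader):
    """Each reader's mean score minus the mean of all scores from all readers."""
    grand_mean = mean(s for scores in scores_by_reader.values() for s in scores)
    return {r: mean(scores) - grand_mean for r, scores in scores_by_reader.items()}


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two readers on categorical labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 100%")
    return (observed - expected) / (1 - expected)


# Hypothetical chronicity index scores: three readers, four specimens.
chronicity = {
    "reader_A": [2, 4, 0, 6],
    "reader_B": [3, 4, 2, 5],
    "reader_C": [2, 7, 1, 6],
}

print(pairwise_percent_agreement(chronicity, tolerance=1))  # 75.0
print(reader_effects(chronicity))  # reader_A: -0.5, reader_B: 0.0, reader_C: 0.5

# Hypothetical dichotomous risk group assignments by two readers.
risk_a = ["high", "high", "low", "low"]
risk_b = ["high", "low", "low", "low"]
print(cohen_kappa(risk_a, risk_b))  # 0.5
```

    Passing `tolerance=2` corresponds to the concordance rule for the activity index, and `tolerance=0` to the rule of identical responses for individual components.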

    The t-test was used to compare means. P values were not corrected for multiple comparisons.

    Results

    Chronicity and Activity Index Scoring

    The mean score (± SD) for the chronicity index was 3.3 ± 2.2, and the mean score for the activity index was 8.2 ± 3.1. Ranges for the chronicity and activity indices were 0 to 11 and 1 to 21, respectively. For the chronicity index, 36% of scores were in the low-risk [5] range of 0 to 1, 30% were in the intermediate risk range of 2 to 3, and 34% were in the high-risk category of 4 or greater. For the activity index, 22% were scored in the high-risk [5] range of 12 or greater.
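
    The risk strata quoted above can be written as a small Python sketch. The function names are hypothetical; the cutoffs are those given in the text (chronicity index 0 to 1, 2 to 3, and 4 or greater; activity index 12 or greater).

```python
def chronicity_risk_group(chronicity_index):
    """Map a chronicity index (0-12) to the three strata given in the text."""
    if not 0 <= chronicity_index <= 12:
        raise ValueError("chronicity index must be between 0 and 12")
    if chronicity_index <= 1:
        return "low"
    if chronicity_index <= 3:
        return "intermediate"
    return "high"


def activity_high_risk(activity_index, cutoff=12):
    """Return True when the activity index (0-24) is at or above the cutoff."""
    if not 0 <= activity_index <= 24:
        raise ValueError("activity index must be between 0 and 24")
    return activity_index >= cutoff


# Two readings of the same hypothetical specimen that differ by a single point
# fall into different chronicity strata.
print(chronicity_risk_group(3))  # intermediate
print(chronicity_risk_group(4))  # high
print(activity_high_risk(11))    # False
print(activity_high_risk(12))    # True
```

    The `cutoff` parameter is left adjustable so that an alternative threshold for the activity index can be examined with the same function.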

    Reader Effect on Chronicity and Activity Index Scoring

    The effect of individual readers on the mean score for chronicity index in each reading is shown in Figure 1. There was substantial divergence among pathologists. The effect of individual readers on activity index scoring is shown in Figure 2. Again, both intrareader and inter-reader divergence are shown. Reader 2, neutral for chronicity index scoring, had a strongly positive effect on the activity index, whereas reader 3, who had a positive effect on chronicity index scoring, was neutral for grading of the activity index. When analyzed by reader, the mean chronicity index varied from 2.3 (pathologist 2) to 4.8 (pathologist 3) (P = 0.001) and the activity index ranged from 5.8 (pathologist 4) to 11.4 (pathologist 2) (P = 0.0001).

    Figure 1. Reader effect on chronicity index scoring. Each bar represents the mean chronicity index as scored by an individual reader minus the mean chronicity index of the group of readers. Hatched bars = first reading; open bars = second reading.

    Figure 2. Reader effect on activity index scoring. Each bar represents the mean activity index as scored by an individual reader minus the mean activity index of the group of readers. Hatched bars = first reading; open bars = second reading.

        Inter-reader Reliability

        Pairs of readers were within 1 point for chronicity index scoring in 50% of cases (Table 2). Risk group assignment on the basis of the chronicity index was concordant in 59% of cases. By random chance, concordance for risk group assignment would be expected to occur in 33% of cases. The chronicity index varied among the readers by at least 4 points for 40% of the biopsy specimens in the first set of readings and 56% in the second set of readings. Scores assigned by readers for individual biopsy specimens varied from low to high risk in 28% of cases. The κ statistic ranged from 0.26 (P > 0.05) to 1.0 (P = 0.001) (mean, 0.51) for dichotomous risk stratification by 16 reader-pairs (10 from the initial reading and 6 from the second reading). Pair-wise agreement for individual components of the chronicity index varied from 40% for glomerular sclerosis to 71% for fibrous crescents (Table 2). The intraclass correlation coefficient for scoring by the group was 0.58 (P < 0.01); for individual components, intraclass correlation coefficients ranged from 0.32 for glomerular sclerosis (mean of first and second readings) to 0.60 for fibrous crescents (Table 2). When analyzed by pairs of readers, intraclass correlation coefficients were predictably higher: The mean intraclass correlation coefficient for 16 reader-pairs was 0.70 (range, 0.48 to 0.91) (P < 0.001 for 14 reader-pairs and P < 0.01 for 2 reader-pairs when compared with chance).

        Table 2. Inter-reader Reliability of Chronicity Index Scoring

        Scores of reader-pairs for the activity index were within 2 points in 50% of cases (Table 3). The variance between low and high scores for individual specimens was at least 6 points for 64% of the specimens in the first reading and for 52% in the second reading. Concordance regarding dichotomous risk group assignment (low or intermediate versus high) was present in 76% of cases (mean κ statistic for 16 reader-pairs, 0.30; range, 0.07 [P > 0.05] to 0.66 [P = 0.001]). At a lower cutoff point of 9 or greater (above which were 42% of activity index scores), agreement occurred in 75% of cases. Pair-wise agreement for individual components of the activity index varied from 36% for cellular proliferation to 54% for cellular crescents. The intraclass correlation coefficient for scoring by the group was 0.52 (P < 0.01); for individual components, intraclass correlation coefficients varied from 0.16 for cellular proliferation (mean for first and second readings) to 0.50 for cellular crescents (Table 3). Pair-wise analysis yielded predictably higher intraclass correlation coefficients: The mean intraclass correlation coefficient for 16 reader-pairs was 0.66 (range, 0.49 to 0.80 [P < 0.001 for 12 reader-pairs and P < 0.01 for 4 reader-pairs]).

        Table 3. Inter-reader Reliability of Activity Index Scoring

        To examine whether an extreme outlier may have disproportionately affected results for inter-reader reliability, data for both indices and for risk group stratification were re-analyzed by comparing reliability for all pairs with each individual reader to all pairs without each individual reader. Only trivial differences were found (data not shown).

        Intrareader Reliability

        For each individual component and index, intrareader reliability was slightly better than inter-reader reliability (Tables 4 and 5). In 55% of cases, repeated readings resulted in chronicity index scores within 1 point; in 64% of cases, risk group assignments were concordant (Table 4). Activity index scores varied by 2 points or less in 57% of cases (Table 5). Dichotomous risk group assignment based on the activity index was concordant in 79% and 76% of cases for cutoff points of 12 or greater and 9 or greater, respectively. However, no reader attained better than 80% agreement between first and second readings for either index or for any individual component. Mean intraclass correlation coefficients for chronicity and activity indices were both 0.76 (chronicity index, P = 0.001 for readers 1, 3, and 4 and P = 0.01 for reader 2 when compared with chance; activity index, P = 0.001 for all readers when compared with chance). For individual components, intraclass correlation coefficients varied from 0.30 for cellular proliferation (Table 5) to 0.74 for fibrous crescents (Table 4). The κ statistic for risk stratification by chronicity index varied from 0.44 (P = 0.05) to 0.78 (P = 0.001) (Table 4) and by activity index from 0.34 (P = 0.05) to 0.65 (P = 0.001) (Table 5). The reliability of the university-based pathologist (reader 3) exceeded that of the other readers for 7 of 10 individual components (P < 0.01) and for both indices as assessed by intraclass correlation coefficient.

        Table 4. Intrareader Reliability of Chronicity Index Scoring
        Table 5. Intrareader Reliability of Activity Index Scoring

        Discussion

        Although several studies failed to show that the chronicity index predicts renal outcome [9, 18, 20, 28], most studies of the NIH-modified scoring system for renal biopsy specimens from patients with lupus nephritis have found that a higher chronicity index confers greater risk for progression of renal disease despite therapy and therefore is a valid predictor of renal outcome [5-8, 14, 16, 17, 19]. Individual histologic components have shown predictive value, but results have varied among studies [5, 7, 9, 14]. The validity of the activity index as a predictor of renal outcome is less clear; approximately half the reported studies have shown that a higher activity index is associated with an increased risk [5-7, 9, 17], but the remainder have not shown such an association [14, 18-20, 28, 29].

        In a study of patients enrolled in treatment trials with a mean follow-up period of 85 months, only 3 of 21 patients with a low chronicity index (0 or 1) experienced at least a twofold elevation in serum creatinine level, whereas 9 of 10 patients with a high chronicity index (at least 4) showed such an increase [16]. Immunosuppression was most beneficial for patients with an intermediate index. In another study of the NIH cohort of patients with lupus nephritis, Austin and colleagues [5] found that after 5 years no patient (0 of 29) with a low index had developed renal failure, whereas 40% of patients with a high index did develop renal failure. Laitman and colleagues [8] concluded that scoring was important for therapeutic decision making, having found that patients most likely to benefit from immunosuppression-induced normalization of serum complement were those with a chronicity index of less than 3. Several investigators have recommended the use of this scoring system to guide prognosis and aggressiveness of therapy [6, 8, 21-23]. However, our results strongly suggest that the reliability of the system is not sufficient for use outside of academic centers.

        In contrast to validity, the reliability of a test is the extent to which the results of the test are reproducible. Although not sufficient for ensuring accuracy, because repeated readings may be concordant yet wrong, reliability is necessary for a test to be accurate and is a prerequisite for widespread use. In 1964, Pirani and colleagues [4] reported that two nephropathology colleagues, using a scoring system later modified by Austin and coworkers [5], differed in a minority of cases and usually in a consistent direction. In a treatment study of lupus nephritis, Donadio and colleagues [30] noted that two investigators, using an activity score modified from that of Pirani and Salinas-Madrigal [3], differed minimally in the grading of biopsy specimens from patients with lupus. Laitman and colleagues [8] reported that a nephropathologist and a nephrologist never differed by more than 1 point for any individual feature. The two readers differed regarding composite activity index by more than 2 points for only 3 of 39 biopsy specimens, and regarding the chronicity index, they never differed by more than 2 points. Austin and colleagues [5] commented that the independently graded chronicity and activity indices of a nephropathologist and a nephrologist at NIH rarely differed by more than 1 point. Nossent and colleagues [6] noted that two readers differed by no more than 1 point for 95% of biopsy specimens scored. In the only study of reproducibility, Gamba and colleagues [23] recently reported that interobserver agreement (intraclass correlation coefficient) was 0.81 and 0.86 for activity and chronicity index scoring, respectively, and that intraobserver agreement ranged from 0.89 to 0.95 for the activity index and from 0.55 to 0.82 for the chronicity index. Nevertheless, risk group assignment by pairs of readers for the 15 biopsy specimens showing diffuse proliferative glomerulonephritis was discordant in 12 of 45 possible cases for the activity index (low or intermediate versus high risk) and in 14 of 45 cases for the chronicity index (low versus intermediate versus high risk). Scores for the chronicity index varied by 2 points or more in 11 of 45 possible pairs of readers, and scores for the activity index varied 3 points or more in 14 of 45 cases.

        Our study was the first to measure the reliability of the NIH-modified semiquantitative scoring system for lupus nephritis in a nonacademic setting. We believe the reliability of the system when studied by Gamba and colleagues [31] and briefly noted by others [4-6, 8, 30] was higher than in our study because of the involvement of experienced readers from referral institutions who may have benefited from mutual education, which would lead to a more uniform standard of grading. In our study, pair-wise disagreement for individual components of either the activity or chronicity index of at least 1 point occurred in approximately 50% of cases. Disagreement of greater than 1 point for chronicity index or 2 points for activity index also occurred in 50% of cases. Even for the pair of readers with the best concordance, disagreement of such magnitude still occurred in one third of cases. Risk assignment for renal failure into two strata for activity index and into three strata for chronicity index was discordant between pairs of pathologists in 24% and 41% of cases, respectively. As expected [32], readers agreed with themselves more often than with other readers for each of 10 individual components and for chronicity and activity index composite scores. However, the incremental reliability was slight.

        Our results indicate that readers differed in the setting of their cutoff points for presence and extent of abnormality. Variability among readings may be the result of the subjective nature of and non-uniform standard for biopsy specimen scoring. Interpretive shifts between readings might occur even for persons who use these scoring criteria consistently but who infrequently see biopsy specimens from patients with lupus nephritis. Small differences in the scoring of many individual components may result in large differences for activity or chronicity indices.

        One remaining issue is whether our results can be extrapolated to other settings. As noted by Koran [33], a small sample of study physicians is not necessarily representative of a large sample from a well-defined population. Nonetheless, these results are the best available estimate of what a larger-scale study might find in a nonreferral setting. To ascertain whether the scoring system should be used widely, we attempted to simulate a typical community setting by choosing five pathologists who were experienced in reading individual components of the indices and were responsible for reading kidney biopsy specimens at their respective institutions. However, none of the pathologists had significant experience with the application of the scoring criteria. Furthermore, no attempt was made to develop scoring consistency among the participants before the study. On the other hand, reliability may be inflated in an experimental setting by conditions superior to those seen in clinical practice (a quiet atmosphere, absence of interruptions, and a more standardized protocol to test the system) [34]. We believe both inter-reader and intrareader reliability might improve with education to standardize scoring and with regular use of the system. This hypothesis is supported by the observation that our most experienced pathologist attained the highest intrareader reliability. Nevertheless, our results support the likely hypothesis that reliability of histologic scoring in a community setting, in which relatively few renal biopsy specimens from patients with lupus nephritis are seen, would be lower than in a highly specialized academic center. It should be noted that substantial differences have also been found in observer interpretation of other types of histologic material [31, 35-38].

        Our findings have several implications. First, it is possible that inherent differences in reader interpretation may explain some aspects of the conflicting findings regarding the ability of the indices to predict renal outcome. Second, apparent improvement in chronicity index scoring for serial biopsy specimens [14] may instead be a result of inconsistent scoring. Third, if index scoring is used to stratify patients with lupus nephritis for a treatment trial, the reliability of scorers should be shown. Fourth and most important, the incorporation of index scoring into patient management may lead to erroneous decision making. Small differences in chronicity index scoring may profoundly alter risk group assignment and subsequent estimates of a patient's likelihood of developing renal failure and of responding to aggressive immunosuppressive therapy. In our study, discordance between pathologists regarding risk group assignment was too frequent to yield reliable guidance for patient management. As McGuire [39] has suggested regarding the histologic grading in breast cancer, we suggest that for the semiquantitative histologic scoring system to best enhance the management of lupus nephritis, biopsy specimens should be scored by nephropathologists working in a quality-controlled reference center and for whom scoring reliability and, ideally, validity have been shown.

        References

        1. 1.
        2. 2.
        3. 3.
        4. 4.
        5. 5.
        6. 6.
        7. 7.
        8. 8.
        9. 9.
        10. 10.
        11. 11.
        12. 12.
        13. 13.
        14. 14.
        15. 15.
        16. 16.
        17. 17.
        18. 18.
        19. 19.
        20. 20.
        21. 21.
        22. 22.
        23. 23.
        24. 24.
        25. 25.
        26. 26.
        27. 27.
        28. 28.
        29. 29.
        30. 30.
        31. 31.
        32. 32.
        33. 33.
        34. 34.
        35. 35.
        36. 36.
        37. 37.
        38. 38.
        39. 39.