1 Department of Pediatric Pulmonology and Allergology, Erasmus Medical Center-Sophia Children's Hospital, Rotterdam, The Netherlands; 2 Department of Radiology, Meander Medical Center, Amersfoort, The Netherlands; and 3 Department of Radiology, University Medical Center, Utrecht, The Netherlands
Correspondence and requests for reprints should be addressed to Pim A. de Jong, M.D., Ph.D., Meander Medical Center, Department of Radiology, Postbus 1502, 3800 BM, Amersfoort, The Netherlands. E-mail: pimdejong{at}gmail.com
ABSTRACT
This article presents a review and discussion of the current knowledge regarding cystic fibrosis (CF)–specific scoring of chest computed tomography (CT) scans. First, the basic principles of CT scoring systems in CF are described. Second, between- and within-observer variability of a composite CT score and of component CT scores are reviewed, and issues regarding training of CT scan readers discussed. Third, arguments regarding whether CT scoring systems are ready to be used in clinical studies as a surrogate endpoint are summarized. The between- and within-observer variability of the present CT composite scoring systems is low enough to be useful for clinical studies, although the variability for some of the component scores is larger than for others. Scoring systems fulfill the requirements for surrogate endpoints for CF lung disease, but this role could be further strengthened by including CT scans in large trials and demonstrating the correlation with true endpoints. The conclusion presented is that, given the experience of the variety of published scoring systems, it is important to develop a consensus CT scoring system for future studies in CF. Such a scoring system should evaluate all lung lobes individually and include all relevant CT findings in CF. Development of reference images for the components of this system will be important in reducing the variability between observers and to train new readers.
Key Words: between- and within-observer variability; cystic fibrosis; high-resolution computed tomography; surrogate endpoint
BASIC PRINCIPLES OF COMPUTED TOMOGRAPHY SCORING IN CYSTIC FIBROSIS
A computed tomography (CT) scoring system of the chest is a tool to describe the abnormalities that can be observed on the slices obtained from a single CT investigation of the chest in a semiquantitative way. Bhalla and colleagues published the first high-resolution computed tomography (HRCT) scoring system for cystic fibrosis (CF) in 1991 (1). Since then, several other, mainly modified, scoring systems have been published (2–10). For all these systems, the reader identifies various abnormalities on the CT scans and assesses their severity (Figure 1). Important abnormalities that are included in most of the scoring systems are bronchiectasis, mucous plugging, airway wall thickening, and parenchymal opacities. The components of the scoring systems are summarized in Table 1. Other abnormalities, such as small nodules, mosaic attenuation, sacculations, and air trapping on expiratory images, are included only in some of the systems. Especially in more recent CT studies, expiratory images are included, since they are thought to reflect small airway disease that occurs early in the course of CF lung disease (3, 10–12).
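The general idea of a semiquantitative score can be sketched in a few lines of Python. This is a minimal illustration only, not the scheme of any published system: the component names, the six scored regions, the 0-3 severity scale, and the unweighted sum are all assumptions made for the example, and published systems differ in components, scales, and weighting.

```python
# Hypothetical components and regions; real scoring systems differ.
COMPONENTS = (
    "bronchiectasis",
    "mucous_plugging",
    "airway_wall_thickening",
    "parenchymal_opacities",
)
LOBES = ("RUL", "RML", "RLL", "LUL", "lingula", "LLL")
MAX_SEVERITY = 3  # assumed 0-3 severity scale per component per lobe


def component_totals(reading):
    """Sum each component's severity over all lobes.

    reading: dict mapping lobe -> dict mapping component -> severity.
    Missing entries are treated as 0 (abnormality absent).
    """
    totals = {component: 0 for component in COMPONENTS}
    for lobe in LOBES:
        for component in COMPONENTS:
            severity = reading.get(lobe, {}).get(component, 0)
            if not 0 <= severity <= MAX_SEVERITY:
                raise ValueError(f"severity out of range for {lobe}/{component}")
            totals[component] += severity
    return totals


def composite_score(reading):
    """Unweighted composite: the sum of all component totals."""
    return sum(component_totals(reading).values())


def composite_percent(reading):
    """Composite score as a percentage of the maximum possible score."""
    maximum = MAX_SEVERITY * len(LOBES) * len(COMPONENTS)
    return 100.0 * composite_score(reading) / maximum


# Hypothetical reading of one scan by one observer.
example_reading = {
    "RUL": {"bronchiectasis": 2, "mucous_plugging": 1},
    "RLL": {"airway_wall_thickening": 1},
}
print(component_totals(example_reading))
print(composite_score(example_reading), composite_percent(example_reading))
```

Keeping the per-lobe, per-component values rather than only the total makes it possible to report both the composite and the component scores from the same reading.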
Scoring systems are sensitive in detecting (early) CF-related lung disease and track its progression well. In cross-sectional studies, CT scores have been shown to correlate well with chest radiograph scores, pulmonary function tests (PFTs), and clinical findings (1, 15, 17); hence, a subject with a high CT score is likely to have impaired PFTs. In addition, it has been shown that CT scores improve after treatment of pulmonary exacerbations (2, 9, 18). Recently, Brody and colleagues demonstrated that the CT score can predict the risk of exacerbations within 2 years after the CT scan (19), and Davis and colleagues found that the severity of lung inflammation determined from bronchoalveolar lavage fluid is correlated with the severity of the CT score for a given lobe (20). Finally, longitudinal studies have shown that CT scores and PFTs are dissociated in around half of the patients (14, 16, 21); thus, the CT score can worsen over time despite stable or even improved PFTs (14, 21). These longitudinal studies show that scoring CT scans adds important information on the progression of lung disease that is not detected well by PFTs.
Scoring systems were developed to score sequential HRCT scans with 10-mm gaps between subsequent images (approximately 25 images). With the introduction of multidetector row CT scanners, sequential CT scans may be replaced by volumetric HRCT scans of the whole lung with contiguous images (approximately 250 images). The advantages are that anatomical structures on these whole-lung scans can be identified more correctly relative to conventional sequential CT scans, and that more abnormalities may be detected (22, 23). The downside may be that scoring of a large number of volumetric images might be more time consuming relative to the scoring of a limited number of sequential images.
In the last decade (semi-)automated image analysis systems have been developed to analyze important features of CF lung disease, such as airway wall thickness and parenchymal density distribution (12, 13, 24–29). The pros and cons of these systems are discussed in detail elsewhere in this issue of the Journal. Scoring systems have the important advantage over automated systems that they are less dependent on (subtle) differences in scanning techniques. In contrast to automated systems, scoring systems are able to recognize and quantify structural abnormalities, such as mucous plugging and atelectasis. In addition, they are able to handle the great diversity and inhomogeneity of CF lung disease. For these reasons it is likely that scoring systems will continue to be used in parallel with automated systems for years to come.
BETWEEN- AND WITHIN-OBSERVER VARIABILITY
To evaluate the variability of a scoring system, one should evaluate the between- and within-observer variability of both the composite CT score and its component scores. In a cross-sectional study, the between- and within-observer variability of the following five scoring systems was evaluated: Castile and colleagues (unpublished), Brody and colleagues (2), Helbich and colleagues (4, 5), Santamaria and colleagues (8), and Bhalla and colleagues (1). A total of 25 CT scans from children with CF, ages 5–18 years, were scored by three observers and rescored after an interval of 1–2 weeks to 1–2 months. Between- and within-observer variability was low, with intraclass correlation coefficients generally greater than 0.8 (15). After this validation, the five systems were used in a 2-year longitudinal study of 48 children with CF. Again, there was no difference between the five scoring systems, this time in their ability to track disease progression (14).
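For readers unfamiliar with the intraclass correlation coefficient, the sketch below computes one common form from a table of composite scores with one row per scan and one column per observer. The cited studies are not stated here to have used this particular form; ICC(2,1) (two-way random effects, absolute agreement, single rater) is an assumption chosen for illustration, and the example scores are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per scan, one column per observer.
    Assumes the scores are not all identical (otherwise the ratio is undefined).
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a full table with at least 2 scans and 2 observers")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: scans (rows), observers (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical composite scores: 5 scans (rows) read by 3 observers (columns).
example_scores = [
    [12, 14, 13],
    [30, 28, 31],
    [5, 6, 4],
    [22, 25, 23],
    [41, 39, 40],
]
print(round(icc_2_1(example_scores), 3))
```

Because this form penalizes systematic offsets between observers as well as random disagreement, it is a conservative choice when scores from different readers are to be pooled. The same function can be applied to one observer's first and second readings to describe within-observer variability.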
Recently, Brody and colleagues published an improved version of their scoring system (Brody II) (3, 11). de Jong and colleagues employed this Brody II system in a large clinical follow-up study of children (n = 72) and adults (n = 47) with CF who had two or three CT scans with a 3-year interval between scans (16). As for the other scoring systems, the Brody II score showed a low within- and between-observer variability (16). Brody and colleagues showed (11) that, between three observers, there were small, systematic, and statistically significant differences in two of the three comparisons; however, the between-observer variability was much smaller than the between-patient variability. After more than 11 months, the observers rescored the same scans. Statistically, there were no differences between the first and second scores (within-observer variability was low). The findings of these and other studies, with respect to between- and within-observer variability of the composite CT scores, are summarized in Table 2 and have recently been reviewed by Aziz and colleagues (30).
κ coefficients between 0.61 and 0.40. For mosaic perfusion, acinar nodules, and airspace disease, there was fair to poor variability, with κ coefficients below 0.40. Within-observer variability for the different component scores was good, with all κ coefficients greater than 0.61 (15). Brody and colleagues initially reported between-observer agreement of 74% for bronchiectasis, 89% for mucous plugging, and 61% for air trapping (3). In a recent and more detailed statistical analysis, they found statistically significant differences between three observers for all components. Differences were, however, small compared with the differences one can expect to find between patients. Within-observer variability did not show statistically significant differences, except for peribronchial thickening (11).
In conclusion, overall scoring systems have good within- and between-observer variability for the composite CT scores. The within-observer variability is reasonable for component CT scores, but between-observer variability can be poor for some components. We think that the high variability of some component scores is related to the lack of unambiguous definitions and of reference images. It is likely that between- and within-observer variability of the scoring systems can be reduced by improving definitions of components, using reference images, and standardizing training of observers. There is no literature regarding training in CF CT scoring for new readers.
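Agreement on a categorical component score can be summarized either as raw percent agreement or as a chance-corrected coefficient. The sketch below computes both for two observers, using unweighted Cohen's kappa as the chance-corrected measure. The exact statistic used in the cited studies (for example, weighted versus unweighted) is not specified in this text, so the choice here is an assumption, and the readings are hypothetical.

```python
from collections import Counter


def percent_agreement(a, b):
    """Percentage of scans on which two observers gave the same category."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty readings of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Unweighted Cohen's kappa for two observers scoring the same scans."""
    n = len(a)
    observed = percent_agreement(a, b) / 100.0
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each observer's marginal frequencies.
    expected = sum(
        counts_a[c] * counts_b[c] for c in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both observers use one category")
    return (observed - expected) / (1.0 - expected)


# Hypothetical bronchiectasis severity (0-3) for 10 scans, two observers.
observer_1 = [0, 1, 1, 2, 3, 0, 2, 1, 0, 3]
observer_2 = [0, 1, 2, 2, 3, 0, 1, 1, 0, 2]
print(percent_agreement(observer_1, observer_2))
print(round(cohens_kappa(observer_1, observer_2), 3))
```

Percent agreement can look high simply because one category dominates, which is why a chance-corrected coefficient is usually reported alongside it.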
CAN SCORING SYSTEMS BE USED FOR CLINICAL STUDIES?
To use CT scoring systems as surrogate endpoints, they must be biologically plausible, reflect clinical severity, improve with effective therapy, and correlate with true outcomes (31). Furthermore, changes in CT scores must be closely linked to changes in the true endpoint, the presence of the surrogate should be closely linked to the presence of disease, and detection of the surrogate must be accurate, reproducible, and feasible over time (32). In this section, the evidence that CT scores can be used as surrogate endpoints in CF-related studies is discussed (33, 34).
Abnormalities detected by CT are biologically plausible and linked to the presence of disease. Chest CT is considered the gold standard for diagnosing bronchiectasis, the cardinal feature of CF lung disease and an important component of CT scores. Bronchiectasis, as observed on CT, has been shown to correlate closely with findings in pathologic specimens (35).
CT scores meet the second requirement for use as a surrogate endpoint, as they reflect the clinical severity of the disease. CT scores correlate with other surrogate markers of disease severity, such as PFTs and exacerbations (15, 19, 36, 37). In addition, it has been demonstrated that CT scores improve in response to treatment. CT scores improve in patients who are treated with antibiotics for their exacerbations (2, 9, 18). Some small, placebo-controlled trials have shown improvement of CT (component) scores in patients treated with Rh-DNase (10, 38). CT scans have not yet been included in large clinical trials of long duration. Such studies are needed to provide further insight into the sensitivity of CT scores for detecting relevant changes in response to therapy. Clearly, stopping progression or even reversing CT score components, such as bronchiectasis, would be an important treatment goal. Finally, CT scores demonstrate the expected worsening in CF lung disease in longitudinal studies (14, 16, 21, 27).
The third important requirement for CT scoring systems to be used as a surrogate endpoint is that they should correlate with true outcomes, such as mortality, quality of life, and the number of exacerbations. Recently, it has been shown that a composite CT score can predict the number of exacerbations within 2 years after the CT scan (19). To date, no studies are available to link CT scores to mortality and quality of life.
The fourth requirement is that the use of CT scores should be feasible, accurate, and reproducible over time. There are numerous studies that have now shown that this is the case in single-center or small multicenter studies (9–11, 14–16). The accuracy and reproducibility of CT scores in the large multicenter trial setting are untested and unknown. CT scores have been shown to detect disease progression in cohorts with stable PFTs (14, 16, 21). The effect of clinical interventions on CT score changes remains a point for further investigation.
DISCUSSION AND FUTURE STUDIES
A consensus on the optimal CT scoring system for patients with CF is urgently needed. In the authors' opinion, a CT scoring system, such as the one published by Brody and colleagues (3), is currently the most attractive. This system scores all the cardinal features of CF lung disease, including gas trapping, for each lobe of the lung and the lingula. Scores derived with this system are reproducible both between and within expert readers (3, 16). Recently, such a new CF CT score was designed by consensus, based on the Brody II scoring system, in conjunction with a web-based training system (de Jong and coworkers, unpublished data). The reproducibility of this score in the hands of trainees is under investigation. This system also needs to be tested on volumetric CT scans obtained with multidetector row CT scanners.
To further establish the role of CT as a surrogate endpoint for CF lung disease, it is important to establish the relationship between CT scores and mortality. Furthermore, the relationship with other endpoints, such as quality of life, should be investigated. In addition, consideration should be given to including CT scans and scoring in large clinical trials, especially of agents that aim to stop or slow progression of structural lung disease.
In conclusion, CT scores show low variability between and within observers, and much experience has been gained with these systems in recent years in clinical studies and some small therapeutic trials. Recently, a CF CT score was designed by consensus, based on the Brody II scoring system, in conjunction with a web-based training system. Incorporation of CT scans in future clinical trials will further establish the role of CT scans as a surrogate endpoint in CF.
FOOTNOTES
Conflict of Interest Statement: P.A.d.J. does not have a financial relationship with a commercial entity that has an interest in the subject of this manuscript. H.A.W.M.T. acted as a member of two ad hoc advisory boards for Novartis and, within the last 3 years, received honoraria and travel expenses for lectures and workshops from Hoffmann-La Roche and Genentech. In the last 3 years, the BV Kindergeneeskunde of Erasmus MC-Sophia Children's Hospital has received research grants from Hoffmann-La Roche.
(Received in original form November 29, 2006; accepted in final form March 24, 2007)