Divisions of 1 Nuclear Medicine/Radiology and 2 Pediatric Pulmonary Medicine/Pediatrics, Stanford University, Stanford, California
Correspondence and requests for reprints should be addressed to M. L. Goris, M.D., Ph.D., Division of Nuclear Medicine, H0101, Stanford University School of Medicine, Stanford, CA 94305-5281. E-mail: mlgoris{at}stanford.edu
ABSTRACT
Medical imaging has increasingly provided surrogate endpoints in therapeutic trials. This use assumes that the interpretation of the images can be unbiased and reproducible and that the image attributes included in the interpretation are relevant to the mechanism of the trial. The principal motivation for computer analysis is to evaluate an attribute of the image as a metric in an algorithmic manner, independent of observer bias or variability. The metric is expected to reflect change in rough proportion with at least one aspect of the degree of disease or the effectiveness of the therapeutic intervention. If either condition is satisfied, the measure is quantitative. Visual interpretation explicitly or implicitly tends to be based on multiple image attributes. Explicit combination of multiple attributes yields composite scores. Such scores are useful for evaluating the risk or probability of disease, but their components can be combined only if they are mathematically isomorphic. For the evaluation of interventions, they are less useful because the effect on one component may be obscured by the lack of effect on other components. This article reviews quantification of air trapping in cystic fibrosis and quantification in general. Validation of any computer analysis can rely on agreement with visual interpreters (on average), on derivation from first principles, or on agreement with an alternative method that measures the pathophysiological mechanism directly (e.g., xenon washout for air trapping). However, in the context of trials, the validation may come from a superior ability to detect objective change and to discriminate between affected and unaffected individuals.
Key Words: computer analysis; medical images; composite scores
Regional or focal air trapping is part of the pathophysiology of a number of pulmonary disorders, including bronchiolitis obliterans (1, 2), reactive airway disease (3), chronic bronchitis (4), atypical pneumonia (5), bronchiectasis (6), emphysema (7), sarcoidosis (8), eosinophilic granuloma (9), and cystic fibrosis (CF) (10–20).
If we can accept that air trapping can easily be quantified on the basis of density distributions in pulmonary computed tomography (CT) scans, we also have to consider what it means, in this context and in general, to quantify. Close analysis will demonstrate that air trapping is reflected in densities, but densities do not always reflect air trapping. The simple metric of air trapping does not define the patient but is useful to measure specific interventions. Validation is not based on a defining test but on the ability to discriminate and measure small effects. The measure is simple and incomplete. Why? In this article, we define the value and limitations of quantitative measures, in the particular context of CF, with examples from cardiology.
DISCUSSION
Quantitative Metric versus Physiological Attribute
A quantitative measure or metric should ideally be singly and directly related to the attribute it tracks, but that is not necessarily the case. Sometimes, the metric is a derived value.
A typical example is the assumption that, with CT, we measure lung density, and lung density in the expiratory scan is the basis for the metric that estimates the degree of air trapping, an important component of the physiopathology of CF. However, this lung density is an artifact of the low resolution of the CT in relation to the pulmonary structures. (More precisely, it is an artifact related to the degree of granularity of the analysis.) Schematically, the lung is composed of structures with tissue density (alveoli, blood vessels, blood) and air. In expiration, none of the components changes in density, but, per cubic centimeter, the structures with tissue density become more prevalent. On average, at the resolution level of the CT, the density increases. In some other cases, the metric is indirect. If we imagine a CT scan with a much higher resolution, voxel values could not be used individually to define lung density (i.e., air trapping) but averaging n voxel values would be necessary.
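The point about granularity can be illustrated with a minimal sketch. It assumes a one-dimensional row of fine-grained voxels that are either air or tissue, with tissue treated as water-equivalent (0 HU); the sample values and the helper name `block_average` are hypothetical and are not taken from the method discussed in this article. No individual voxel changes density between inspiration and expiration, yet the averaged density increases.

```python
AIR_HU = -1000   # Hounsfield value of air
TISSUE_HU = 0    # assumption: tissue treated as water-equivalent (0 HU)


def block_average(voxels, n):
    """Average consecutive groups of n voxel values.

    A 1-D stand-in for viewing the same lung at a coarser resolution.
    Trailing voxels that do not fill a complete group are dropped.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [sum(voxels[i:i + n]) / n
            for i in range(0, len(voxels) - n + 1, n)]


# Hypothetical fine-grained samples of one lung region (illustrative only).
# In expiration, tissue voxels are more prevalent per unit volume.
inspiration = [AIR_HU] * 9 + [TISSUE_HU] * 1
expiration = [AIR_HU] * 7 + [TISSUE_HU] * 3

print(block_average(inspiration, 10))  # [-900.0]
print(block_average(expiration, 10))   # [-700.0]
```

At the fine scale the only values present are -1000 and 0 in both states; the apparent "lung density" exists only after averaging over n voxels.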
In a recent article, infarcted myocardium was repopulated with marrow stem cells (21). The metric was the ejection fraction. That may have been a mistake, because the infarct is regional, and the ejection fraction is global. There is no evidence that a decrease of ejection fraction is proportional, let alone related in a linear fashion, to the size of the infarct. This analysis failed to show any benefit (in terms of the ejection fraction). A change in the infarct size, the immediate effect of repopulation, was not measured. In terms of endpoints, using the ejection fraction (as a predictor or surrogate for patient benefit) may have been correct, but as a proof of concept, nothing was contributed.
Algorithmic Analysis Is Reproducible but Not Necessarily Data Acquisition
If the quantification is algorithmically defined, the reproducibility (precision) is high (if not perfect), but the measure is apt to be influenced by unrelated factors: for example, variations in methodology (imaging protocols and technical parameters); interpatient variation in lung density, even in the inspiratory images; the degree of expiration at the time of imaging; and the selection of the density threshold defining air trapping. Elsewhere in this symposium (pp. 310–315), Robinson discusses the influence of the degree of the inspiratory and expiratory effort on the estimation of air trapping from densities.
Myocardial perfusion studies have been quantified for ages (22–25). Yet, at a recent unpublished panel discussion, three originators of those methods admitted to disbelieving the results in 15% of the cases. The assumption is that there is reproducibility (for any dataset), but that unrelated influences (patient movement, body habitus, imaging settings, or intervention variability) may disturb any of the datasets.
Heuristic Analysis Is More Resistant to Unrelated Variability Factors
Heuristic (e.g., visual) quantification is less reproducible (the precision is lower) but more resistant to the influence of unrelated attributes because the observer can more easily modulate his or her response by context (e.g., a decreased expiratory effort in the case of air trapping in CF), for example by relying more on contrast than on absolute values. Some nonrelated attributes that would influence the metric can be predicted (like the effect of expiratory effort on lung densities), and can be taken into account to some extent. In the following formula:
[Formula not recovered from the source.]
In cardiology, a strange multiplicative score has been introduced by Hachamovitch and colleagues (27). In myocardial perfusion studies, one observes two attributes of perfusion abnormalities: the local degree of hypoperfusion and the extent of the hypoperfused area. Hachamovitch and colleagues evaluate the degree of hypoperfusion (on a scale from 0 to 4) in 17 myocardial segments. The score is the sum of the segmental values. Four segments with a value of 4 have the same score as eight segments with a value of 2. Size and gravity, or degree and frequency, are thus conflated. Nevertheless, this score seems very predictive of distant cardiac events.
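The conflation can be made concrete with a short sketch based only on the description above (17 segments, each graded 0 to 4, summed). The function names and the two example patterns are illustrative assumptions, not the published implementation.

```python
N_SEGMENTS = 17
MAX_GRADE = 4


def summed_score(grades):
    """Sum of per-segment hypoperfusion grades (0 = normal, 4 = worst)."""
    if len(grades) != N_SEGMENTS:
        raise ValueError("expected one grade per segment")
    if any(not 0 <= g <= MAX_GRADE for g in grades):
        raise ValueError("grades must lie between 0 and 4")
    return sum(grades)


def extent_and_degree(grades):
    """Report the two attributes separately: number of abnormal
    segments (extent) and the worst grade observed (degree)."""
    abnormal = [g for g in grades if g > 0]
    return len(abnormal), max(abnormal, default=0)


# Hypothetical patterns matching the example in the text.
small_severe = [4] * 4 + [0] * 13   # four segments graded 4
large_mild = [2] * 8 + [0] * 9      # eight segments graded 2

print(summed_score(small_severe), extent_and_degree(small_severe))  # 16 (4, 4)
print(summed_score(large_mild), extent_and_degree(large_mild))      # 16 (8, 2)
```

Both patterns yield a score of 16, although one is a small, severe defect and the other a large, mild one; keeping extent and degree as separate values preserves the distinction that the sum discards.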
One potentially fruitful approach is histogram analysis, in which the x axis is the degree and the y axis represents frequency. Existing approaches based on the histogram end up with a threshold, hence conflating degree and frequency. A universal comparison between frequency distributions has not been proposed.
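As a hedged illustration of how a threshold collapses a histogram, the sketch below builds two hypothetical density samples that differ in degree but share the same fraction of voxels below a cutoff. The cutoff of -950 HU, the bin width, and the sample values are assumptions chosen for the example only.

```python
from collections import Counter

THRESHOLD_HU = -950  # hypothetical cutoff, for illustration only


def histogram(values, bin_width):
    """Relative frequency per bin, keyed by the lower edge of each bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    total = len(values)
    return {edge: counts[edge] / total for edge in sorted(counts)}


def fraction_below(values, threshold):
    """Fraction of values strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)


# Two hypothetical regions: the same frequency of low-density voxels,
# but a different degree of density reduction.
slightly_low = [-960] * 20 + [-850] * 80
very_low = [-1000] * 20 + [-850] * 80

print(fraction_below(slightly_low, THRESHOLD_HU))  # 0.2
print(fraction_below(very_low, THRESHOLD_HU))      # 0.2
print(histogram(slightly_low, 10) == histogram(very_low, 10))  # False
```

The thresholded metric is identical for both samples, whereas the histograms remain distinguishable; the sketch does not propose a way to compare the histograms themselves.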
Spatial Distribution versus Global Attributes
If the attribute that is tracked is heterogeneous in space, sentinel lesions must be identified and tracked. In oncology, the effectiveness of a drug is judged by the response of the tumors that can be measured. The decrease in volume of one lesion (tumor), with an increase in another location, would indicate (at least partial) failure. For radiotherapy, which is local, that is not the case. The inference is that, for local intervention, the metric should be a local metric. In the case of the infarct treatment with bone marrow stem cells, that rule was broken. This brings us to the distinction between real endpoints (quality of life, decrease in the number of hospitalizations, life expectancy increase) and surrogate endpoints, which is discussed elsewhere in this symposium and is beyond the scope of this presentation. However, for proof of concept and early effect evaluation, the best metrics are noncomplex, not composite, and regional where they need to be.
Validation
Finally, how to validate? Comparisons with previous methods, in which validation means "works the same as," do not provide progress, except that the new method may be a time saver. In the absence of a defining (nosological) test (e.g., histopathology for tumors, angiography for pulmonary emboli), we looked at the automated quantification of air trapping by asking if the metric discriminated between healthy subjects and those with minimal disease, and if treatment effects could be followed. We have applied this analytical method to a group of 25 patients with mild CF lung disease and 10 age-matched control subjects using six anatomically matched high-resolution CT spirometry-triggered slice pairs to evaluate the discriminating power (26). This method was also applied to data from a randomized, double-blind, placebo-controlled 1-year trial of dornase alfa in 25 children and adolescents with mild CF disease. With the data from this study, the sensitivity for small changes was tested (28). The validation in both cases was that our method was better than all the other ones applied to the same cases, both in discrimination between "normal" and "abnormal," and in tracking progression under treatment. The validation is sometimes purely functional (e.g., predicting outcome) and not necessarily based on ground truth.
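One common way to express discriminating power between two groups is the area under the ROC curve, computed as the probability that a randomly chosen patient has a higher metric value than a randomly chosen control. The sketch below is an assumption for illustration: the source does not state which statistic was used in the cited studies, and the numbers are toy values, not study data.

```python
def discrimination_auc(patients, controls):
    """Probability that a patient value exceeds a control value.

    Ties count as one half. A value of 0.5 means no discrimination;
    1.0 means the two groups are completely separated.
    """
    if not patients or not controls:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in patients:
        for c in controls:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(patients) * len(controls))


# Toy metric values (hypothetical, chosen only to exercise the function).
toy_controls = [1, 2, 3]
toy_patients = [2, 4, 5]

print(discrimination_auc(toy_patients, toy_controls))
```

Under this framing, comparing two candidate metrics on the same cases amounts to asking which one gives the larger value; the same function could be applied to each metric in turn.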
CONCLUSIONS
Analysis based on images has a putative advantage in diseases that have heterogeneous or localized expressions in organs. Global organ function measurements will not necessarily reflect significant local changes.
Visual image analysis is complex and often includes a heuristic element, making it subject to intra- and interobserver variability. In addition, if the evaluation is global (implicitly or explicitly by scoring), the result does not necessarily accurately reflect significant changes in the progression of the disease. One of the great advantages of visual analysis is its greater tolerance for methodological variations in the image acquisition or display.
Algorithmic analysis of single image attributes is reproducible by definition, but more dependent on data integrity (expected acquisition parameters). Finally, whereas single attribute metrics do not necessarily predict the ultimate desired study endpoint, they are very useful for the testing of treatment mechanisms (proof of concept).
FOOTNOTES
Conflict of Interest Statement: M.L.G. is a co-investigator in a grant sponsored by Novartis and the Cystic Fibrosis Foundation (CFF). H.J.Z. does not have a financial relationship with a commercial entity that has an interest in the subject of this manuscript. T.E.R. is currently the principal investigator on a Novartis and CFF Therapeutic Development Network grant.
(Received in original form January 10, 2007; accepted in final form March 4, 2007)
REFERENCES