breast imaging
Digital mammography faces off with film
Advances in technology may leapfrog results in major clinical trials

BY JOHN M. LEWIN, M.D.

The introduction of full-field digital mammography (FFDM) has long been anticipated as a way to improve mammographic detection of cancers. Although FFDM has definite technical advantages over screen-film mammography (SFM) in terms of detector contrast resolution, SFM has an advantage in high-contrast spatial resolution. SFM also has the advantage of familiarity to the radiologist and technologist and decades of accumulated experience to optimize technique. There is no way to predict how the radiologist interpreting the study will interact with the technology, particularly when soft copy, as opposed to film, is used for interpretation.

For these reasons, clinical trials are needed to compare FFDM with SFM for cancer detection. Several issues in designing clinical trials for mammographic breast cancer detection must be addressed: enrollment bias, verification bias, and power.

COMPARISON OF THREE DIGITAL VS FILM MAMMOGRAPHY TRIALS

Study                   Population type  No. of subjects          Primary limitation
GE noninferiority       Diagnostic       625                      Enrollment bias
Colorado/Massachusetts  Screening        6768 exams (4521 women)  Limited statistical power
ACRIN                   Screening        49,500                   Very expensive

  • Enrollment bias. Studies conducted by companies for use in the FDA approval process were based on a population of women who had either an abnormal mammogram or a suspicious finding on physical examination. These studies were originally conceived as agreement studies but were modified to be equivalency studies based on truth (presence of cancer) once it became clear that agreement with SFM was affected by too much reader variability and was likely a poor predictor of clinical efficacy.

Studies dependent on abnormalities detected prior to enrollment have inherent limitations and biases. The most obvious limitation is that the abnormality that led to the woman's enrollment in the study has already been detected by SFM or physical exam and thus cannot be used to prove the supposition that FFDM can detect nonpalpable cancers that are undetectable on SFM. The only chance for proving this advantage is to detect a second, incidental, finding in that patient and then prove that it is malignant. The chance of doing this is proportional to the number of patients in the study and is not any greater than for a screening population.

Example of enrollment bias. Left: Screen-film screening mammogram demonstrates apparent mass. Patient is enrolled in the study because of this positive exam. Middle: Digital mammogram obtained in the study shows no abnormality. Right: Spot compression view demonstrates that the apparent mass is only overlapping normal tissue. Screen-film exam is therefore a false positive while the digital exam is a true negative.

A worse problem is that the use of a positive screen-film exam for enrollment biases the results toward SFM in terms of sensitivity for cancer detection and toward FFDM in terms of specificity. Even with careful attention to detail, tissue will not overlap the same way on two otherwise identical mammograms. This small difference in overlap can cause apparent densities, masses, and distortion on one image to disappear on another. Additionally, it can result in a cancer being obscured on one image but obvious on another. By loading the test set with both the true- and false-positive screen-film mammograms, the researchers greatly overrepresent both the true-positive and false-positive fractions of screen-film as compared with digital, thus artificially raising the sensitivity and lowering the specificity of SFM. This effect was clearly not appreciated by those designing the first set of FDA trials but was demonstrated using preliminary data from the University of Colorado/University of Massachusetts digital mammography screening study (see table).1
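The direction of this bias can be seen in a minimal sketch. It assumes, purely for illustration, that the two modalities have identical true sensitivity and false-positive rates, that the two readings are independent, and that every subject is enrolled on a positive SFM exam, with that enrolling exam counted as the SFM reading. The function name and the input values (0.80 and 0.10) are hypothetical and do not come from any of the trials.

```python
def apparent_performance(true_sensitivity, false_positive_rate):
    """Performance measured in a cohort enrolled only on a positive SFM exam.

    Hypothetical model: both modalities share the same true sensitivity and
    false-positive rate, and the SFM and FFDM readings are independent.
    """
    return {
        # Every enrolled cancer was, by construction, seen on SFM.
        "sfm_sensitivity": 1.0,
        # FFDM gets an independent look, so it finds only its usual fraction.
        "ffdm_sensitivity": true_sensitivity,
        # Every enrolled non-cancer was, by construction, an SFM false positive.
        "sfm_specificity": 0.0,
        # FFDM repeats the false positive only at its usual rate.
        "ffdm_specificity": 1.0 - false_positive_rate,
    }


# Illustrative inputs only (assumed, not measured values).
result = apparent_performance(true_sensitivity=0.80, false_positive_rate=0.10)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Even though the two modalities are identical in this model, the enrolled cohort makes SFM look more sensitive and FFDM look more specific, which is the pattern described above.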

  • Verification bias. Studies that do not use the FFDM reading for clinical management, such as the FDA noninferiority studies, are subject to verification bias. This type of bias results when truth is preferentially obtained on the findings from the standard test, in this case SFM, as compared with the experimental test, in this case FFDM. Because a finding detected only on FFDM would not be worked up, it would remain a false positive when it could actually represent a true positive.

Assuming that SFM has a true sensitivity of at least 66%, the probability of a previously unsuspected cancer being detected only by FFDM will be at most about one-third of the screening cancer incidence, or about one per 1000. A study using a cancer-enriched set of 1000 images, therefore, will result on average in one incorrectly classified true positive. For a prospective screening trial, which will use many times that number of cases, the effect may be significant.
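The arithmetic behind this estimate can be sketched as follows. The incidence of 3 per 1000 is an assumption chosen to match the "about one per 1000" figure in the text, and the function name is hypothetical. The result is an upper bound, since it supposes FFDM finds every cancer that SFM misses.

```python
def max_ffdm_only_cancers(n_exams, incidence_per_1000, sfm_sensitivity):
    """Upper bound on cancers detectable only by FFDM among n_exams.

    Assumes FFDM could find every cancer that SFM misses.
    """
    missed_by_sfm = 1.0 - sfm_sensitivity
    return n_exams * incidence_per_1000 / 1000.0 * missed_by_sfm


# Assumed incidence of 3 per 1000; SFM sensitivity of 66% as in the text.
for n_exams in (1000, 6768, 49500):
    bound = max_ffdm_only_cancers(n_exams, incidence_per_1000=3.0,
                                  sfm_sensitivity=0.66)
    print(f"{n_exams} exams: up to {bound:.1f} unverified FFDM-only cancers")
```

The bound grows linearly with the number of exams, which is why an effect that is negligible in a small reader study may matter in a large prospective trial.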

  • Power. A screening study in which enrollment is not predicated on a positive mammogram is the only type of study that can objectively compare SFM and FFDM. Including palpable abnormalities in the cohort would not result in the above biases, but it would not necessarily be applicable to screening, in which the detection of nonpalpable abnormalities is what truly matters. Ultrasound, for example, is known to be much better at demonstrating palpable cancers than mammography, but that does not mean it is a better screening modality.

The cancer detection rate in a mammography practice can range from the incidence rate (about 2.5 per 1000) for a practice with all repeat screeners to the prevalence rate (six to 10 per 1000) for a practice enrolling only women who have never had a mammogram (very unusual these days). A typical practice, with a mixture of groups, will have a cancer detection rate of four to five per 1000. The number of cancers determines the power of a screening study; thus, a very large number of subjects is needed to achieve sufficient power to detect a difference between modalities. Using the numbers above, achieving 100 screening cancers would require 20,000 to 25,000 subjects. For a paired screening study with a 0.5 correlation, it is estimated that this number of cancers would be needed to detect a 0.1 difference in receiver operating characteristic (ROC) curves between FFDM and SFM.
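The enrollment figures follow directly from the detection rates quoted above. A short sketch (the function name is hypothetical) makes the dependence on the mix of first-time and repeat screeners explicit:

```python
import math


def subjects_needed(target_cancers, detection_rate_per_1000):
    """Subjects required to expect target_cancers screen-detected cancers."""
    return math.ceil(target_cancers * 1000 / detection_rate_per_1000)


# Detection rates per 1000 taken from the text: repeat screeners (2.5),
# a typical mixed practice (4 to 5), and first-time screeners (6 to 10).
for rate in (2.5, 4, 5, 6, 10):
    print(f"{rate} per 1000 -> {subjects_needed(100, rate)} subjects")
```

At four to five cancers per 1000 this reproduces the 20,000 to 25,000 subjects cited; a practice made up entirely of repeat screeners would need considerably more.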

GE NONINFERIORITY STUDY

This study using an enriched diagnostic cohort was submitted as part of the FDA approval process for the GE FFDM unit, the Senographe 2000D, and is typical of study designs used for FDA approval. In the study, 625 diagnostic patients underwent screening views on FFDM in addition to their clinically mandated SFM views.2 There were 44 cancers in the image set. Five readers unconnected to the patients' clinical care independently and blindly interpreted the SFM and FFDM cases, using an ROC scale and a BIRADS (Breast Imaging Reporting and Data System) assessment. Each reader interpreted all the cases, with at least 30 days separating the film and digital interpretation to minimize the reader's recall of the case.

FFDM had a slightly lower sensitivity than SFM and a higher specificity, as would be predicted by the enrollment bias. The difference in sensitivity was less than that considered allowable at the time of study design, and FFDM was therefore determined to be noninferior to SFM in sensitivity. FDA approval was supported.

COLORADO/MASSACHUSETTS FFDM TRIAL

Between July 1997 and March 2000, 6768 paired FFDM/SFM exams were performed on 4521 women presenting for bilateral mammography, usually for screening, at either the University of Colorado Health Sciences Center or the University of Massachusetts Medical Center.3,4 The examinations were generally performed at the same visit by the same technologist.

SFM was acquired on a GE DMR using Kodak Min-R-2000 screens and film. FFDM was acquired on a prototype FFDM unit made by GE.

FFDM images were interpreted in soft copy using a prototype clinical workstation made by GE, and SFM was interpreted on a standard mammography viewbox. Prior studies were available for both interpretations. For each finding, the interpreting radiologist gave a BIRADS assessment and recommendation, as well as a probability of malignancy rating from 0 to 100 for use in free-response ROC analysis. Each finding was worked up with additional imaging and, if warranted, biopsy, without regard to the modality on which it was detected.

A total of 2048 findings were recommended for recall on at least one modality. Of these, 690 were called on FFDM only, 1060 on SFM only, and 298 on both. The greater number of findings on SFM is statistically significant by McNemar's chi-square test at p<0.001. Note that only 15% of the findings in the study were called on both the SFM and FFDM readings.
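McNemar's test uses only the discordant counts, here the 690 findings called on FFDM only and the 1060 called on SFM only. The sketch below assumes the uncorrected form of the statistic; whether the study applied a continuity correction is not stated, and the function name is hypothetical.

```python
import math


def mcnemar_chi2(b, c):
    """McNemar chi-square (no continuity correction) and two-sided p-value.

    b and c are the two discordant counts of a paired comparison.
    """
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Findings called on FFDM only versus SFM only, as reported above.
chi2, p_value = mcnemar_chi2(690, 1060)
print(f"chi-square = {chi2:.1f}, p < 0.001: {p_value < 0.001}")
```

The same function applies to any pair of discordant counts from a paired design.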

Recall rate was calculated based on exam, rather than finding, so as to be consistent with the established definition. Of the 1481 exams recalled by at least one reader, 469 were called on FFDM only, 679 were called on SFM only, and 333 were called on both.

The recall rate for SFM was 1012/6768 = 15.0%; the recall rate for FFDM was 802/6768 = 11.9%. The higher recall rate for SFM is statistically significant (p<0.001).
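These rates can be rebuilt from the exam-level counts: an exam counts toward a modality's recall rate if it was recalled on that modality alone or on both. A brief sketch, with a hypothetical function name:

```python
def recall_rates(ffdm_only, sfm_only, both, total_exams):
    """Per-modality and combined recall rates from paired exam counts."""
    sfm_rate = (sfm_only + both) / total_exams
    ffdm_rate = (ffdm_only + both) / total_exams
    either_rate = (ffdm_only + sfm_only + both) / total_exams
    return sfm_rate, ffdm_rate, either_rate


# Exam counts reported in the study.
sfm_rate, ffdm_rate, either_rate = recall_rates(
    ffdm_only=469, sfm_only=679, both=333, total_exams=6768
)
print(f"SFM {sfm_rate:.1%}, FFDM {ffdm_rate:.1%}, either {either_rate:.1%}")
```

The combined rate, about 22%, is a consequence of the paired design, in which a recall on either reading triggered a workup; it would not be expected of either modality used alone.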

One hundred eighty-three biopsies were performed for findings in the study: 88 for findings originally detected on SFM only, 38 for findings originally detected on FFDM only, and 57 for findings originally detected on both modalities. As would be expected from the fact that the SFM reading led to more recalls, SFM led to more biopsies than FFDM (145 versus 95). The difference in the proportion of biopsies recommended on each modality is statistically significant (p<0.001).

Nine cancers were detected on the FFDM reading only, 16 cancers were detected on the SFM reading only, and 18 cancers were detected on both readings. The difference was not statistically significant (p>0.1).
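With so few discordant cancers (nine versus 16), an exact binomial form of McNemar's test is a reasonable check. The sketch below is an illustration under that assumption; the article does not state which test produced its p-value, and the function name is hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis each discordant case is equally likely to
    favor either modality, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)


# Cancers detected on FFDM only versus SFM only, as reported above.
p_value = mcnemar_exact(9, 16)
print(f"p > 0.1: {p_value > 0.1}")
```

A split of nine versus 16 is well within what chance alone would produce, consistent with the reported lack of significance.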

Areas under the alternative free-response ROC (AFROC) curves were 0.80 for SFM and 0.74 for FFDM. The difference was not statistically significant (p>0.1).

In summary, the study did not demonstrate a significant difference between the modalities in terms of either cancer detection or area under the ROC curve. There was a significant difference in recall rate, however. It remains to be seen whether that difference will persist in clinical practice or was an artifact of the experimental situation.

Limitations of the study relate to the use of prototype equipment to try to predict the performance of a final clinical system. Although the detector and x-ray platform used in the prototype device were very similar to those used in the commercial product, the prototype interpretation workstation was technically inferior to the commercial workstation in several ways, including monitor brightness, monitor resolution, and ease of use.

ACRIN TRIAL

Because of the lack of power in the Colorado/Massachusetts trial, and because that trial tested only one manufacturer's device, a much larger trial was proposed under the auspices of the American College of Radiology Imaging Network (ACRIN). This trial, funded by the National Cancer Institute, is similar in design to the earlier study but hopes to accrue 49,500 cases, a greater than seven-fold increase. The accrual will take place at 19 sites using equipment from four manufacturers: Fischer, Fuji, GE, and Lorad. Accrual was scheduled to start in September 2001.

Even with the larger numbers, the power of the study is marginal for detecting a difference between film and digital of the magnitude expected, and there is no power to evaluate each manufacturer's machine separately. To make up for the lack of power in the prospective phase of the trial, a number of reader studies are planned using enriched image sets constructed from the cases collected during the prospective phase. Each reader study will use multiple readers to compensate for interobserver variability, and each will be designed to answer a specific question. Additional studies are planned as part of the trial to evaluate economic and quality of life issues related to the use of digital mammography.

There are many possible designs for testing a new screening technology, but all suffer from the trade-off between selection bias and statistical power. In the case of full-field digital mammography, approaches range from small but biased reader studies on selected images to the large prospective ACRIN trial. Large trials take time, however. During the span of the ACRIN trial, the technology will continue to evolve, just as it did during the Colorado/Massachusetts trial, raising questions as to the applicability of the results. It will be three to five years before the results of this very expensive trial are published, and even then we may not know whether or not it was money well spent.

DR. LEWIN is an assistant professor of radiology at the University of Colorado Health Sciences Center in Denver.

References

  1. Lewin JM. Digital mammography, a candid assessment. Diagnostic Imaging, September 1999:40-45.
  2. Hendrick RE, Lewin JM, D'Orsi CJ, et al. Non-inferiority study of FFDM in an enriched diagnostic cohort: comparison with screen-film mammography in 625 women. In: Yaffe MD, ed. IWDM-2000 5th International Workshop on Digital Mammography. Madison: Medical Physics Publishing, 2001.
  3. Lewin JM, Hendrick RE, D'Orsi CJ, et al. Comparison of full-field digital mammography to screen-film mammography for cancer detection: results of 4945 paired examinations. Radiology 2001;218:873-880.
  4. Lewin JM, Hendrick RE, D'Orsi CJ, et al. Clinical evaluation of full-field digital mammography in a screening population. Presented at the 2000 RSNA Scientific Meeting. Radiology 2000;217(P):199.