Medical records are an important source for clinical researchers, but text records used outside the hospital must first be de-identified.
Several automatic de-identification schemes recently emerged from a challenge issued by the American Medical Informatics Association (J Am Med Inform Assoc 2007;14[5]:550-580).
In one effort, researchers at the University of Szeged in Hungary developed a de-identification model that successfully removes personal health information from hospital records in conformance with the Health Insurance Portability and Accountability Act.
A machine learning-based iterative system, the solution uses a named entity recognition approach on semistructured documents. Named entity recognition (NER) is a subtask of information extraction that locates and classifies elements in text into predefined categories.
"Our named entity approach is based on a complex feature set and boosted decision trees, and it uses a different feature representation from other state-of-the-art NER systems," said Gyorgy Szarvas, Ph.D., of the university's informatics department.
Szarvas's method identifies personal health information in several steps (J Am Med Inform Assoc 2007;14[5]:574-580). First, it labels all entities whose tags can be inferred from the structure of the text; it then uses this information to find further personal health information phrases in the free-text parts of the document. Customizing the system took only a few weeks.
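The two-pass idea described above can be illustrated with a minimal Python sketch. It is not the Szeged system, which relies on boosted decision trees and a complex feature set; here simple rules stand in for the learned model. The field labels (`Patient`, `Doctor`, `Hospital`), the tag names, and the function names are all hypothetical, chosen only for illustration.

```python
import re

# Hypothetical field labels and tags; a real record would use the
# institution's own header conventions.
FIELD_TAGS = {"Patient": "PATIENT", "Doctor": "DOCTOR", "Hospital": "HOSPITAL"}
FIELD_RE = re.compile(r"^(%s):\s*(.+)$" % "|".join(FIELD_TAGS), re.MULTILINE)


def collect_structured_entities(record):
    """Pass 1: map each phrase found in a labelled field to its tag."""
    entities = {}
    for match in FIELD_RE.finditer(record):
        label, phrase = match.group(1), match.group(2).strip()
        tag = FIELD_TAGS[label]
        entities[phrase] = tag
        # Register the parts of personal names too, so that a surname
        # appearing alone in the narrative is still caught in pass 2.
        if tag in ("PATIENT", "DOCTOR"):
            for token in phrase.split():
                entities.setdefault(token, tag)
    return entities


def deidentify(record):
    """Pass 2: replace every occurrence of a known phrase, longest first."""
    entities = collect_structured_entities(record)
    for phrase in sorted(entities, key=len, reverse=True):
        pattern = r"\b%s\b" % re.escape(phrase)
        record = re.sub(pattern, "[%s]" % entities[phrase], record)
    return record


if __name__ == "__main__":
    sample = (
        "Patient: John Smith\n"
        "Doctor: Anna Kovacs\n"
        "\n"
        "Mr. Smith was seen by Dr. Kovacs on the ward. John reports no pain."
    )
    print(deidentify(sample))
```

Replacing the longest phrases first keeps a full name from being split into two separate tags. A learned model would also generalize to names that never appear in a labelled field, which this rule-based stand-in cannot do.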
"Such systems can be built quite rapidly for any institute for de-identification or other NER-like tasks," he said.
Elsewhere, researchers at Mitre (Bedford, MA), along with Harvard, Brandeis, and Stanford universities, took a different approach, focusing instead on rapid adaptation of existing toolkits for named entity recognition. They used two existing tools: Carafe and LingPipe.
The researchers report that the out-of-the-box Carafe system achieved a good score (a phrase F-measure of 0.9664) with only four hours of work to adapt it to the de-identification task. With further tuning, they were able to reduce the token-level error term by over 36% through task-specific feature engineering and the introduction of a lexicon, achieving a phrase F-measure of 0.9736.
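For readers unfamiliar with the metric, the following Python sketch shows how an F-measure is computed and how a relative error reduction can be derived from two scores. It assumes the balanced F-measure (the harmonic mean of precision and recall) and treats error as one minus the score; both are assumptions, since the article does not spell out the exact definitions the researchers used. The function names are illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_error_reduction(score_before, score_after):
    """Fraction of the remaining error (1 - score) removed by an improvement."""
    error_before = 1 - score_before
    error_after = 1 - score_after
    return (error_before - error_after) / error_before


if __name__ == "__main__":
    # Phrase-level F-measures quoted in the article.
    baseline, tuned = 0.9664, 0.9736
    print(round(relative_error_reduction(baseline, tuned), 3))
```

Applied to the two phrase-level scores, this gives a reduction of roughly 21%. The 36% figure the researchers report is measured at the token level, for which the article gives no underlying scores, so the two numbers are not directly comparable.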
Challenge organizer Ozlem Uzuner, Ph.D., of the University at Albany, State University of New York, said the efforts show that most private health information can be recognized with more than 98% accuracy. Whether 98% accuracy is good enough is an open question best left to policymakers, Uzuner said.
"The results are nevertheless encouraging, from a technical perspective, and show that much can be accomplished to de-identify data with the best techniques," Uzuner said.
AMIA was so encouraged that it is now organizing similar challenges on other open research questions in medical language processing.