A Framework for Capturing Clinical Data Sets from Computerized Sources

  Clement J. McDonald, MD; J. Marc Overhage, MD, PhD; Paul Dexter, MD; Blaine Y. Takesue, MD; and Diane M. Dwyer, MD
  From the Regenstrief Institute for Health Care and Indiana University Medical Center, Indianapolis, Indiana; and the Maryland Department of Health and Mental Hygiene, Baltimore, Maryland.

  Note: This article is one of a series of articles comprising an Annals of Internal Medicine supplement entitled “Measuring Quality, Outcomes, and Cost of Care Using Large Databases: The Sixth Regenstrief Conference.”

  Grant Support: In part by grant HS 07719-03 from the Agency for Health Care Policy and Research, contracts NO1-LM-4-3410 and NO1-LM-6-3456 from the National Library of Medicine, and grant 92196-H from the John A. Hartford Foundation of New York.

  Requests for Reprints: Clement J. McDonald, MD, Department of Medicine, Regenstrief Institute for Health Care, Indiana University Medical Center, 5th floor RHC, 1001 West 10th Street, Indianapolis, IN 46202.

  Current Author Addresses: Drs. McDonald, Overhage, Dexter, and Takesue: Department of Medicine, Regenstrief Institute for Health Care, Indiana University School of Medicine, 1001 West 10th Street, Indianapolis, IN 46202. Dr. Dwyer: Epidemiology and Disease Control Program, Maryland Department of Health and Mental Hygiene, 201 West Preston Street, Room 325, Baltimore, MD 21201.

    Abstract

    The pressure to improve health care and provide better care at a lower cost has generated the need for efficient capture of clinical data. Many data sets are now being defined to analyze health care. Historically, review and research organizations have simply determined what data they wanted to collect, developed forms, and then gathered the information through chart review without regard to what is already available institutionally in computerized databases. Today, much electronic patient information is available in operational data systems (for example, laboratory systems, pharmacy systems, and surgical scheduling systems) and is accessible by agencies and organizations through standards for messages, codes, and encrypted electronic mail. Such agencies and organizations should define the elements of their data sets in terms of standardized operational data, and data producers should fully adopt these code and message standards. The Health Plan Employer Data and Information Set and the Council of State and Territorial Epidemiologists in collaboration with the Centers for Disease Control and Prevention and the Association of State and Territorial Public Health Laboratory Directors provide examples of how this can be done.

    The pressure to improve health care and provide better care at a lower cost has created new needs to access clinical data for outcome analysis [1], quality assessment, guideline development [2], utilization review, pharmacoepidemiology [3], public health, benefits management, and other purposes. These needs are usually identified as data sets (that is, predefined lists of clinical questions or observations).

    Data sets are not new to the health care industry. The UB92 hospital billing form and its progenitor, the UB82, both from the Health Care Financing Administration (HCFA), have been around for some time. Recently, however, the number and richness of clinical data sets have grown dramatically. New data sets have been established by the National Center for Vital Health Statistics [4] and the National Committee for Quality Assurance [5]. The HCFA piloted an 1800-element quality-assurance data set called the Uniform Clinical Data Set System from 1989 to 1993 [6] and is working on a simpler version called the Medicare Quality Indicator System. Other HCFA data sets include the Resident Assessment Instrument for long-term health care [7] and a draft Outcome and Assessment Information Set for providers of home health care [8]. The U.S. Centers for Disease Control and Prevention (CDC) has developed Data Elements for Emergency Department Systems (DEEDS) for reporting information on visits to emergency departments [9]; the National Immunization Program for reporting data on immunizations [10]; and, in collaboration with the Council of State and Territorial Epidemiologists (CSTE) and the Association of State and Territorial Public Health Laboratory Directors (ASTPHLD), a data set that reports laboratory findings on communicable diseases [11]. Other national data sets include the Trauma Registry of the American College of Surgeons [12], the Cardiovascular Data Standards for coronary arteriography [13], the Cooperative Project for coronary artery bypass graft surgery [14], and the Musculoskeletal Outcomes Data Evaluation and Management System for knee and hip replacements [15]. Cancer registries, hospitals, group practices, managed care providers, researchers, and pharmaceutical manufacturers are developing additional clinical data sets. We refer to the databases that carry data sets as analytic databases because they are usually designed for direct statistical analysis.

    As the need for formal data sets has burgeoned, so has the use of computers to process patient information in direct support of patient care. Operational systems in the laboratory, pharmacy, patient registration area, surgical suites, and electrocardiography carts (to name a few) now include most data on laboratory procedures, prescriptions, demographics and appointments, surgical logs, and electrocardiographic measurements. Unfortunately, the two developments are occurring in independent orbits with little interaction. With a few important exceptions, developers of national data sets do not consider operational systems as sources for the contents of their data sets. Developers can find the information they want by abstracting charts. However, chart abstraction is prone to error and expensive. In one study, chart reviewers could not find 10% of the laboratory test results that were in the charts [16], and commercial chart reviews cost between $10 and $15 per admission, depending on the amount of data retrieved (Kriss E. Personal communication. Boston, MA: MediQual). Chart reviews remain the only option for retrieving some kinds of information. However, when information exists in the databases of health care providers, manually extracting it from reports that are printed from one database and reentering the information into another database is time-consuming and inefficient.

    In this article, we review the barriers to the direct flow of operational data into analytic databases and the technical developments that have minimized these barriers. We also suggest specific actions that can unify the two orbits as the health care industry enters the computer age.

    The Difference between Operational and Analytic Databases

    Examples of operational databases are found in hospital pharmacies, laboratories, radiology departments, critical care units, and order-processing units. The first barriers to the direct use of operational system data in analytic databases are the differences in structure and detail that obscure similarities in the content of their information. A laboratory system would typically dedicate an entire record to each observation (for example, clinical measurement or laboratory test result). An ordering or pharmacy system would do the same for each item or prescription that is ordered. Table 1 shows the structure of an operational database for a clinical reporting system.

    Table 1. Operational Database: One Record per Observation*

    In contrast, analytic databases typically carry all variables of interest (for example, the most recent hemoglobin value, whether the patient is anemic, the number of units of blood transfused, and the lowest systolic blood pressure) in a single record that describes one patient, patient encounter, or patient procedure. Table 2 shows an analytic database analogue to the operational database of Table 1. In analytic databases, the variable is identified by the name of the field (for example, most recent cholesterol level) in which its value is stored, and all variables of interest are stored horizontally as separate fields in one record. The variables in an operational database are usually defined by a code or name stored in one field (with a name such as “observation ID” as shown in the third column of Table 1) and their values are stored in another field (with a name such as “value” as shown in the fourth column of Table 1). Different variables are stacked vertically in separate records.
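
    The structural difference described above can be illustrated with a minimal sketch. The field names, patient identifiers, and values below are illustrative assumptions and are not taken from Table 1 or Table 2; the function simply turns vertically stacked observation records into one horizontal record per patient.

```python
from collections import defaultdict

# Hypothetical operational records: one row per observation.
# Field names and values are illustrative assumptions only.
operational_rows = [
    {"patient_id": "P001", "observation_id": "HEMOGLOBIN", "value": 11.2},
    {"patient_id": "P001", "observation_id": "CHOLESTEROL", "value": 212.0},
    {"patient_id": "P002", "observation_id": "HEMOGLOBIN", "value": 14.1},
]


def to_analytic_records(rows):
    """Turn vertically stacked observations into one record per patient.

    Each distinct observation ID becomes a field name. If an observation
    repeats for a patient, the later row overwrites the earlier one.
    """
    records = defaultdict(dict)
    for row in rows:
        records[row["patient_id"]][row["observation_id"]] = row["value"]
    return dict(records)


analytic = to_analytic_records(operational_rows)
```

    In this sketch the observation identifier stored in one field of the operational record becomes the name of a field in the analytic record.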

    Table 2. Revised Model of an Analytic Database: One Record per Patient Event*

    Operational databases often contain repeated measurements (for example, all recent hemoglobin values for a patient), whereas analytic databases often contain a single measurement (for example, the lowest hemoglobin value during the first 24 hours of a hospital stay or the first Glasgow coma score during an emergency department visit). Operational databases usually carry many items of information about each value reported (for example, its units, date and time, and where the measurement was taken) as separate fields in the same record, whereas analytic databases usually contain only the variable's value. However, analytic databases may contain slightly more information. For example, an analytic database may have the value and date of the last measurement of diastolic blood pressure.

    Operational databases usually contain raw data (for example, the hemoglobin value), whereas analytic databases frequently carry conclusions or “yes” or “no” answers to questions (such as “is the patient anemic?”). Finally, the identifying codes in operational databases tend to be more detailed than the corresponding codes in analytic databases. For example, an operational database in the pharmacy might identify a prescription by the National Drug Code (NDC), which identifies the brand name, dose, and bottle size. In comparison, the corresponding variable in an analytic database might identify drugs by a more generalized code that identifies only the generic drug (such as propranolol) or drug class (such as β-blockers).

    In many cases, operational data can be converted into analytic variables. Three simple conversion rules are worth emphasizing. First, a continuous variable, such as the hemoglobin value or cholesterol level, can be converted into a binary diagnostic variable (such as specifying “yes” or “no” to the presence of anemia) and be given a numeric threshold that defines the diagnosis (for example, a hemoglobin value < 12). Second, detailed codes can be converted into more generalized codes by using simple cross-links (for example, converting NDC codes into generic drug codes). Finally, repeated values of a variable can be converted into a single value. Conversion occurs by selecting the first, last, or worst of a series of repeated values or by combining all occurrences on the basis of some rule. Examples include taking the mean value (as might be done for blood pressure levels), the sum (as might be done for determining chemotherapy drug doses), or the count (as might be done for records of blood transfusions). It is easy to imagine more complicated conversion rules. For example, a variable that specifies “yes” or “no” for the presence of diabetes might be defined in terms of thresholds on fasting blood sugar and hemoglobin A1c or for the current use of insulin or oral agents.
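
    The three conversion rules can be sketched in a few lines of code. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, the drug codes are made-up placeholders rather than real NDC codes, and repeated values are assumed to arrive in chronological order.

```python
from statistics import mean


# Rule 1: continuous value -> binary diagnostic variable via a threshold.
def is_anemic(hemoglobin, threshold=12.0):
    return hemoglobin < threshold


# Rule 2: detailed code -> generalized code via a cross-link table.
# The keys below are made-up placeholders, not real NDC codes.
NDC_TO_GENERIC = {
    "NDC-EXAMPLE-1": "propranolol",
    "NDC-EXAMPLE-2": "propranolol",
}


def generic_drug(ndc_code):
    return NDC_TO_GENERIC.get(ndc_code)  # None when the code is unmapped


# Rule 3: repeated values -> single value, selected or combined by a rule.
REDUCERS = {
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
    "min": min,
    "max": max,
    "mean": mean,
    "sum": sum,
    "count": len,
}


def reduce_values(values, rule):
    """Collapse a chronologically ordered list of values into one value."""
    return REDUCERS[rule](values)
```

    Which reducer counts as the "worst" value depends on the variable (the minimum for hemoglobin, for instance), so that choice is left to the caller.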

    Variations in the Codes and Structures of Operational Systems

    Until recently, a second barrier to the use of operational databases has been the lack of standards for reporting data from operational systems. Each vendor structured and reported the contents of its products differently. In some cases, each implementation of a vendor's product also varied. In addition, each laboratory and medical records department tended to define its own unique and idiosyncratic codes for identifying observations and findings. This cacophony presented an enormous barrier to the use of operational databases by external agencies.

    Today, standard message structures and formats exist for exporting patient information from operational systems. Message standards specify a uniform structure for electronically reporting clinical data from source databases to other databases. These standards also specify the format for reporting dates, times, names, numeric values, and codes. For example, the standard for date formats is CCYYMMDD (century, year, month, day). Therefore, 12 April 1979 is recorded as 19790412 and not as 4-12-79, 12-apr-79, or any other option.
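
    As a small illustration of the date convention, the following sketch formats and parses CCYYMMDD dates with the Python standard library. The function names are hypothetical.

```python
from datetime import date, datetime


def to_ccyymmdd(d):
    """Format a date as CCYYMMDD (four-digit year, month, day)."""
    return d.strftime("%Y%m%d")


def from_ccyymmdd(text):
    """Parse a CCYYMMDD string back into a date."""
    return datetime.strptime(text, "%Y%m%d").date()


formatted = to_ccyymmdd(date(1979, 4, 12))  # "19790412"
```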

    The American National Standards Institute Health Level 7 (HL7) standard is the most relevant to this discussion. This standard is widely used to transmit patient registration data; orders; and such clinical information as vital signs, primary complaint, and diagnostic test results (for example, electrolyte concentrations and the results of obstetric ultrasonography) [17]. The HL7 standard also specifies messages for referral information, clinical trial data, and many other kinds of operational transactions. The standard is supported by most vendors of medical information systems and has been adopted by most medium-sized and large hospitals, group practices, and commercial laboratories. Several programs within the CDC are using HL7, and others are developing or planning projects that will use it. The Veterans Administration uses HL7 as its system-wide communication standard, as do many institutions in Germany, The Netherlands, Australia, New Zealand, South Korea, Japan, Singapore, and Canada. A practical subset of the HL7 standard is the E1238-94 standard developed by the American Society for Testing and Materials (ASTM) [18].

    The HL7 message structure for observations corresponds very closely with the database structure for operational systems as shown in Table 1. Specifically, the message includes a record with fields for storing an identifier for a given observation (for example, serum potassium concentration, diastolic blood pressure, or primary complaint), its data type (reported as a number or a code), value (4.1 mmol/L serum potassium concentration, 85 mm Hg diastole, or primary complaint of chest pain), unit of measurement when the value is a number (for example, mg or mm Hg), normal range (3.5 to 5.5 mmol/L, 60 to 80 mm Hg diastole), date and time, and site of production (name and location of vendor). This standard provides the machinery for sending clinical information from operational systems to institutional databases, including outcomes management systems, electronic medical records, and external agencies (such as the HCFA quality review system, public health departments, and health maintenance organizations).
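
    The fields listed above can be pictured as a simple record type. The sketch below is an illustration of the information content only. It is not the HL7 wire format, and the class name, field names, identifier, timestamp layout, and producer are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Observation:
    observation_id: str           # what was measured
    data_type: str                # "number" or "code"
    value: Union[float, str]
    units: Optional[str]          # expected when the value is a number
    normal_range: Optional[str]
    observed_at: str              # assumed CCYYMMDDHHMM timestamp layout
    producer: str                 # site that produced the result

    def __post_init__(self):
        # Numeric results are only interpretable with a unit of measure.
        if self.data_type == "number" and self.units is None:
            raise ValueError("numeric observations need a unit of measurement")


potassium = Observation(
    observation_id="SERUM-POTASSIUM",  # hypothetical identifier
    data_type="number",
    value=4.1,
    units="mmol/L",
    normal_range="3.5-5.5",
    observed_at="199704120830",        # made-up date and time
    producer="Example Laboratory",     # made-up producer
)
```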

    Other important message standards are the Data Interchange Standards Association's Accredited Standards Committee X12 for insurance enrollment, payment, and other administrative messages [19]; the Digital Imaging and Communications in Medicine (DICOM) standard for diagnostic images [20]; the National Council for Prescription Drug Programs telecommunications standard for community pharmacy transactions [21]; and the Institute of Electrical and Electronics Engineers' Medical Information Bus for patient-connected devices in critical care units [22]. Appendix Tables 1 and 2 and the Duke standards Web site (http://www.mcis.duke.edu/standards/guide.htm) give more information about these and other standards discussed in this article.

    Appendix Table 1. Code Systems: Additional Information*
    Appendix Table 2. Message Standards: Additional Information*

    Code System Standards

    A code transmitted in HL7 is always paired with an identifier of the code system from which it was drawn. Such flexibility allows health care organizations to use their local code systems, which were the only kind available when HL7 was initiated. Use of identifiers also eases transition from one standard code to another, such as conversion from International Classification of Diseases version 9 (ICD-9) to version 10 (ICD-10).
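
    The pairing of a code with its code system can be sketched as follows. The type name, the system identifiers, and the local code are hypothetical; only the LOINC code 718-7 (blood hemoglobin) appears elsewhere in this article.

```python
from typing import NamedTuple


class CodedValue(NamedTuple):
    code: str
    system: str  # identifier of the code system the code was drawn from


# Hypothetical system identifiers and local code, for illustration only.
local_result = CodedValue(code="HGB", system="LOCAL-LAB")
standard_result = CodedValue(code="718-7", system="LOINC")


def understood_by(receiver_systems, coded):
    """A receiver can interpret a code only if it knows the code's system."""
    return coded.system in receiver_systems


external_reviewer = {"LOINC"}
```

    Because the system travels with the code, a receiver can tell immediately whether it is able to interpret a value or needs a mapping first.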

    However, an external reviewing organization would be unable to use submitted data unless every contributing organization followed the same set of code systems. Some data have always been represented as standard codes. For example, in the United States, discharge diagnoses are reported as ICD-CM codes [23], procedures as Current Procedural Terminology (CPT95) codes [24], and drugs as codes from the Food and Drug Administration's NDC directory [25]. However, until recently, standard code systems have not been available for such clinical data as test results, clinical observations, units of measure, symptoms, problems, and infectious organisms [26].

    The Logical Observations Identifier Names and Codes (LOINC) database fills an important gap in the code armamentarium. This database includes codes, names, and synonyms for more than 12 000 observations, including laboratory tests, vital signs, electrocardiographic measurements, intake and output measures, critical care, overall clinical impressions, discharge summary, and operations report headers. The LOINC database was specifically created to provide codes for the observation identifier field in the HL7 observation reporting message, which is analogous to information in the third column of Table 1. However, it can serve the same role in other standards, such as ASTM and DICOM.

    The LOINC database is being adopted by the largest commercial laboratories, including the Laboratory Corporation of America, Quest Diagnostics (formerly Corning Clinical Laboratories), the Associated Regional and University Pathologists, and LifeChem. Together, these laboratories account for more than 30% of the nation's commercial laboratory testing [27]. The database is also being adopted by many computer system vendors and health care providers, including Kaiser Permanente, the Veterans Administration, the U.S. Department of the Navy, Intermountain Health Care, the province of Ontario in Canada [28], Partners Healthcare System of Boston, and Clarian of Indianapolis. In addition, LOINC has been endorsed by the American Clinical Laboratory Association. The database and a program that maps local code systems to LOINC are being distributed free on the Duke standards Web site (http://www.mcis.duke.edu/standards/termcode/loinc.htm).

    Standard code systems are also available for other fields in operational databases, including units of measure (required for numeric results), coded values, drugs, and medical devices. Table 3 lists the American Medical Informatics Association's candidates for these fields [29].

    Table 3. Concepts That Are Coded in Health Care Information Systems*

    When reviewers want to compare long-term patient outcomes across health care organizations, universal provider and patient identifiers are also important. The Department of Health and Human Services's National Provider Identifier will soon be available to fill the need for a universal provider identifier (more information can be obtained from http://www.hcfa.gov/stats/nproidov.htm). The Health Insurance Portability and Accountability Act of 1996 (PL104-191) also mandates a universal patient identifier, but availability of the identifier depends on the outcome of ongoing debates on patient privacy [30].

    Finally, industry standards are now available for secure transmission of patient information across organizations. The Internet Engineering Task Force (IETF) has produced most of these standards (for information, contact the IETF Secretariat, c/o Corporation for National Research Initiatives, 1985 Preston White Drive, Suite 100, Reston, VA 20191). These standards include Secure Sockets Layer, which is an encryption scheme that prevents anyone who accesses the World Wide Web from deciphering messages on the wire [31], and EDI [Electronic Data Interchange] over Internet, which encrypts electronic mail (e-mail) and provides electronic signatures [32]. The latter provides a perfect delivery mechanism for reporting to external agencies. When using EDI over Internet, health care organizations can create an HL7 or other standard message with the required content, encrypt the message, and send it to the external agency through standard e-mail systems. To minimize the cost and work required by reporting organizations, all external agencies should support these mechanisms for delivering electronic messages.

    Recommendations

    The machinery is now in place to automatically capture and deliver some of the information required by external agencies. To take advantage of these opportunities and to prepare for the increasing spectrum of clinical information that will be electronically stored in the future, we make the following recommendations.

    First, health care organizations should map the local codes they use in their operational systems to such standard codes as LOINC for observation identifiers, International Organization for Standardization (ISO+) [33] for units of measure, and Systematized Nomenclature of Medicine (SNOMED) for coded findings. Operational systems would then include the standard codes (as a lingua franca) in all electronic reports so that every receiving system could understand them. In the case of test results, little time (possibly less than one person-week) would be required for a health care organization to map or cross-link the few local laboratory codes required by a given data set to standard test codes. Substantially more time (3 to 4 person-months) would be required to map all the tests to the standard codes in a sophisticated laboratory. However, the time that is invested would in turn provide content that could be understood by the databases of all laboratory clients. Indeed, the health care organizations previously mentioned are investing in LOINC coding for exactly these reasons.
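
    A cross-link of this kind can be as simple as a lookup table. In the sketch below, the local codes and record layout are hypothetical; the three LOINC codes are those cited elsewhere in this article. The function also reports local codes that still lack a mapping, which is one plausible way to track remaining mapping work.

```python
# Local codes are hypothetical; the LOINC codes are those cited in this article.
LOCAL_TO_LOINC = {
    "HGB": "718-7",         # blood hemoglobin
    "HCT-SPUN": "4544-3",   # spun hematocrit
    "HCT-AUTO": "4545-0",   # hematocrit calculated by automated counters
}


def add_standard_codes(results, mapping):
    """Return copies of the results with a 'loinc' field, plus unmapped local codes."""
    coded, unmapped = [], set()
    for result in results:
        loinc = mapping.get(result["local_code"])
        if loinc is None:
            unmapped.add(result["local_code"])
        coded.append({**result, "loinc": loinc})
    return coded, unmapped


coded_results, unmapped_codes = add_standard_codes(
    [{"local_code": "HGB", "value": 11.0}, {"local_code": "XYZ", "value": 1.0}],
    LOCAL_TO_LOINC,
)
```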

    Second, reviewers should define variables in terms of such standardized operational content (when feasible) and rules for translating the operational content into analytic variables. For example, a reviewer or an analyst could define a “yes” or “no” variable for anemia on the basis of a list of tests that can confirm anemia and establish the threshold at which each test defines anemia. At least three laboratory measurements would be included in this list: blood hemoglobin (LOINC code 718-7), spun hematocrit (LOINC code 4544-3), and hematocrit calculated by automated counters (LOINC code 4545-0). The reviewer might set the threshold for anemia as below 12 g/dL for hemoglobin and below 36% for hematocrit. This simple example could be extended to include sex-based thresholds and other operational variables, such as billing diagnoses, as needed by the external agency.
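
    The anemia definition above translates directly into a small table of codes and thresholds. The sketch below uses the three LOINC codes and the thresholds of 12 and 36 given in the text. The function name, the input layout, and the choice to answer “yes” when any qualifying test falls below its threshold are assumptions made for illustration.

```python
# LOINC code -> threshold below which the result indicates anemia.
ANEMIA_THRESHOLDS = {
    "718-7": 12.0,    # blood hemoglobin
    "4544-3": 36.0,   # spun hematocrit
    "4545-0": 36.0,   # hematocrit calculated by automated counters
}


def anemia_flag(observations):
    """Return 'yes', 'no', or None for a list of (loinc_code, value) pairs.

    None means that no qualifying test was found, so the variable is unknown.
    Assumed policy: any qualifying result below its threshold yields 'yes'.
    """
    relevant = [(code, value) for code, value in observations
                if code in ANEMIA_THRESHOLDS]
    if not relevant:
        return None
    below = any(value < ANEMIA_THRESHOLDS[code] for code, value in relevant)
    return "yes" if below else "no"
```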

    The Health Plan Employer Data and Information Set (HEDIS) defines the numerators and denominators for many of its performance standards in terms of similar definitions on operational administrative data. For example, the numerator for the measure for screening patients for breast cancer is defined as the patient with a CPT4 code from 76090 to 76092, revenue code 401 or 403, or ICD-9 procedure 87.37 or 87.36.
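
    The breast cancer screening numerator described above could be expressed as a simple test on administrative records. The code values come from the text; the function name and the claim layout (a dictionary with optional string fields) are assumptions.

```python
def counts_toward_mammography_numerator(claim):
    """Check one administrative record against the codes listed in the text.

    claim: dict with optional 'cpt4', 'revenue_code', and 'icd9_procedure'
    string fields (an assumed layout).
    """
    cpt4 = claim.get("cpt4")
    if cpt4 is not None and cpt4.isdigit() and 76090 <= int(cpt4) <= 76092:
        return True
    if claim.get("revenue_code") in {"401", "403"}:
        return True
    return claim.get("icd9_procedure") in {"87.36", "87.37"}
```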

    Some organizations (such as the CDC, CSTE, and ASTPHLD) have gone much further. To automate reporting of communicable and other reportable conditions from laboratories, the CDC, CSTE, and ASTPHLD have jointly defined a formal mapping strategy from standardized laboratory results to reportable diseases. For each reportable disease, these organizations have created a table of the LOINC laboratory tests that could indicate each disease and the criteria for deciding which results should be reported to public health agencies. For example, a Brucella-specific agglutination titer greater than 1:160; an immunoglobulin antibody test that yields positive results for Brucella species; or a culture with growth of Brucella abortus, Brucella canis, or Brucella melitensis would be reportable results. The specifications adopted by these organizations are defined as two related spreadsheets. Figure 1 shows a small portion of these spreadsheets (for additional information, contact Diane Dwyer, MD, at the address at the end of this text). The disease in column A of Figure 1 is defined by the tests listed in column B (the LOINC code) when they satisfy the criteria given in column C. In the case of tests with positive or negative results or with numeric results, the reporting criteria are included in column C. In the case of such laboratory procedures as cultures whose values are the names or codes for infectious organisms, column C contains a list of organisms whose elements can be found in spreadsheet 2 of Figure 1.

    Figure 1. Portions of two related spreadsheets that were jointly developed by the Centers for Disease Control and Prevention, the Council of State and Territorial Epidemiologists, and the Association of State and Territorial Public Health Laboratory Directors. LOINC = Logical Observations Identifier Names and Codes; SNOMED = Systematized Nomenclature of Medicine.

      These definitions are precise and complete and can be converted for automated retrieval by many clinical systems. The use of spreadsheets to define analytic variables from operational data could be extended to many other kinds of analytic variables. For example, columns could be added that specify which repeated observations to retrieve (such as the first and last or the minimum or maximum length of a hospital stay) or how to combine the observations (as a mean, a count, or a total).
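
      A table-driven definition of this kind lends itself to automated evaluation. The sketch below encodes the Brucella criteria given in the text as rows with a disease, a test, and a criterion. The test identifiers are placeholders (the actual spreadsheets use LOINC codes that are not reproduced here), and the row layout and function names are assumptions.

```python
# Each row mirrors one spreadsheet line: disease, test, and reporting criterion.
# Test identifiers are placeholders, not LOINC codes.
CRITERIA = [
    {"disease": "brucellosis", "test": "BRUCELLA-AGGLUTINATION-TITER",
     "kind": "titer_above", "limit": 160},
    {"disease": "brucellosis", "test": "BRUCELLA-ANTIBODY",
     "kind": "equals", "match": "positive"},
    {"disease": "brucellosis", "test": "CULTURE", "kind": "organism_in",
     "organisms": {"Brucella abortus", "Brucella canis", "Brucella melitensis"}},
]


def titer_denominator(text):
    """Extract the dilution from a titer string, for example '1:320' -> 320."""
    return int(text.split(":")[1])


def reportable_diseases(test, value):
    """Return the set of diseases for which this result meets a criterion."""
    hits = set()
    for row in CRITERIA:
        if row["test"] != test:
            continue
        if row["kind"] == "titer_above":
            if titer_denominator(value) > row["limit"]:
                hits.add(row["disease"])
        elif row["kind"] == "equals":
            if value == row["match"]:
                hits.add(row["disease"])
        elif row["kind"] == "organism_in":
            if value in row["organisms"]:
                hits.add(row["disease"])
    return hits
```

      Adding a disease or a test then means adding rows to the table rather than changing the program.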

      The HEDIS 3.0 data set asks health care organizations to select the local operational results, convert them to analytic variables, and then send the variables to reviewers in a special format. For public health reporting, states and jurisdictions ask laboratories to select only reportable results and send them in a standardized form (that is, HL7 messages that contain LOINC and SNOMED codes [11]). The receiving system then converts the standardized results into analytic variables. We encourage data set developers to conceptualize their requirements in terms of standardized contents of operational data systems. Converting the operational data to analytic variables is less important.

      The joint approach of the CDC, CSTE, and ASTPHLD is to make the receiving organization responsible for conversion because quality can be more easily assured by the receiving organization than by the numerous sending organizations. Because more than 5000 hospitals and 38 000 laboratories operate in the United States, the number of senders could be substantial. More important, many operational databases already have the ability to retrieve operational results and send them as HL7 messages to other databases. These HL7 mechanisms can be extended to external agencies without requiring the special programming that is necessary to generate special analytic variables and report them in agency-specified formats. Finally, by requiring the sending organizations to submit complete results, the receiving organization can revise conversion criteria as necessary without losing the ability to apply new criteria to old data for historical comparisons. Similar arguments apply to sending the standardized but detailed codes of operational databases rather than the more generalized codes of analytic databases (for example, a detailed NDC code instead of a drug category code).

      One argument against requiring full reporting is the need to store all the data. However, this argument is not compelling because the cost of storage is low (currently less than $500 for 1 billion bytes of online disk storage and $30 for 1 billion bytes of offline storage) and is declining.

      Once agencies have implemented a mechanism for drawing some of their required data from operational databases without using manual labor, they will face questions about the variables that still require chart abstraction. First, can agencies find surrogates in databases for the variables they obtain through chart abstraction? A recent study [34] suggests so. A few clinical and laboratory variables can predict outcomes and diagnoses that are abstracted from charts. Second, if the variable is worth collecting retrospectively to assess quality, why not collect it prospectively (perhaps through a nurse assessment system [35]) to assure quality? Similarly, perhaps the impression of imaging studies should be prospectively coded by the image readers rather than by chart reviewers. Curiously, payers require ICD-9 codes for the diagnostic impression of an office examination, which might cost $30, but not for magnetic resonance imaging, which might cost $900.

      We propose a pathway rather than a panacea. The benefits to public health agencies could be immediate and substantial because their data requirements can be met by operational systems that are readily available, such as laboratory databases. However, specialized data sets that contain functional status and specialized clinical measurements are not widely available in operational databases. Furthermore, coding standards still need to be defined or adopted (or both) for such subject matter. Therefore, this approach does not presently satisfy all requirements of reviewers.

      Regardless, most external agencies could adopt portions of this framework. They could define survey and clinical questions as a set of spreadsheets (similar to those used by the CDC, CSTE, and ASTPHLD) along with any variables that exist in operational data systems. The agencies could ask senders to report the variables that they collect manually along with the results they obtain automatically in the same HL7 format. In addition, agencies could accept this information as standard encrypted and certified e-mail messages [31]. These actions would establish a common framework for defining and delivering review data to external agencies and would encourage the more widespread linkage of analytic data sets to operational data sources. Such a standardized framework would ease the work of reporting organizations and pave the way for automatically capturing data from a whole generation of new clinical databases (for example, those containing physician orders and notes and structured reports of such special studies as endoscopy and obstetric ultrasonography). Finally, this framework bridges the differences between the databases of health care providers and those of external agencies and could bring the data collection activities of both into the same orbit.

      References

      1. 1.
      2. 2.
      3. 3.
      4. 4.
      5. 5.
      6. 6.
      7. 7.
      8. 8.
      9. 9.
      10. 10.
      11. 11.
      12. 12.
      13. 13.
      14. 14.
      15. 15.
      16. 16.
      17. 17.
      18. 18.
      19. 19.
      20. 20.
      21. 21.
      22. 22.
      23. 23.
      24. 24.
      25. 25.
      26. 26.
      27. 27.
      28. 28.
      29. 29.
      30. 30.
      31. 31.
      32. 32.
      33. 33.
      34. 34.
      35. 35.