
Record linkage

 

Record linkage is the process of bringing together two or more records relating to the same entity (e.g., person, family, event, community, business, hospital, or geographical area). In 1946, H. L. Dunn of the United States National Bureau of Statistics introduced the term in this way: "Each person in the world creates a Book of Life. This Book starts with birth and ends with death. Record linkage is the name of the process of assembling the pages of this Book into a volume" (Dunn, 1946). Computerized record linkage was first undertaken by the Canadian geneticist Howard Newcombe and his associates in 1959. Newcombe recognized the full implications of extending the principle to the arrangement of personal files and into family histories. Computerized record linkage has the advantages of quality control, speed, consistency, reproducibility of results, and the ability to handle large volumes of data. Newcombe published a handbook on its practical implementation in 1988.

Sir Donald Acheson established the Oxford Record Linkage Study in Oxford, England, in 1962. This medical record linkage system connects birth, morbidity, and mortality data for an entire community. This type of system links morbidity and mortality data and provides information for studies of health care utilization and for descriptive epidemiology of disease as analyzed by characteristics of time, place, and event.

There are several different approaches to linkage. At the crudest level, linkage may be based on agreement on one or more variables; this is referred to as deterministic linkage. Decision tables, a hierarchy of rules, and a variety of different sets of matching criteria may also be used to bring record pairs together. Although a "unique numerical identifier," such as a health card number, can be used, this number may have been issued more than once, changed over time, or recorded incorrectly. Checking and verifying the associated names is therefore prudent when using numerical identifiers.
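
As a rough illustration of deterministic linkage, the sketch below pairs records that agree exactly on a chosen set of variables. The field names (`surname`, `birth_date`, `sex`), the sample records, and the function name `deterministic_link` are hypothetical and are not taken from any particular linkage system.

```python
from collections import defaultdict

def deterministic_link(file_a, file_b, keys):
    """Pair records from file_a and file_b that agree exactly on every key."""
    # Index file_b by the tuple of key values so each lookup is a dict access.
    index = defaultdict(list)
    for rec in file_b:
        index[tuple(rec[k] for k in keys)].append(rec)
    pairs = []
    for rec in file_a:
        for match in index.get(tuple(rec[k] for k in keys), []):
            pairs.append((rec["id"], match["id"]))
    return pairs

# Hypothetical sample files.
births = [
    {"id": "b1", "surname": "QUIGLEY", "birth_date": "1973-01-02", "sex": "F"},
    {"id": "b2", "surname": "SMITH", "birth_date": "1980-05-17", "sex": "M"},
]
deaths = [
    {"id": "d1", "surname": "QUIGLEY", "birth_date": "1973-01-02", "sex": "F"},
    {"id": "d2", "surname": "SMYTH", "birth_date": "1980-05-17", "sex": "M"},
]

strict = deterministic_link(births, deaths, keys=("surname", "birth_date", "sex"))
loose = deterministic_link(births, deaths, keys=("birth_date", "sex"))
print(strict)
print(loose)
```

The strict rule misses the SMITH/SMYTH pair because of a spelling variation, while the looser rule accepts it at the risk of false links in larger files. This trade-off is one reason to move to hierarchies of rules or to probabilistic methods.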

A mathematical theory of probabilistic linkage was developed by I. P. Fellegi and A. B. Sunter in 1969. In the generalized record-linkage software subsequently developed, there are three main phases in linkage: searching, decision-making, and grouping. Conceptually, each record on one file is compared to each record on another file to form record pairs of all possible comparisons. In practice, in the searching phase, the files are blocked using identifiers (e.g., the phonetic code of the surname and gender code) to limit the number of potential pairs of records compared. In the decision-making phase, evidence contained in different records is compared to determine the probability, or "weight," that the records relate to the same entity. Record agreement on a rare name such as "Quigley," for example, carries more weight than agreement on a common name such as "Smith." For convenience, record pairs are commonly classified into three groups: (1) definite "linked" pairs; (2) definite "nonlinked" pairs; and (3) "possible" links, where the inference cannot be made without further evidence. In the final grouping phase, a group of appropriate records relating to the same individual or entity is formed. Records may have just one link to another record, or they may have several links. Two major types of errors may be made in classifying a record pair: the pair may be falsely linked (linked although the records refer to different entities), or it may be falsely nonlinked (left unlinked although the records refer to the same entity).
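
The searching and decision-making phases described above can be sketched as follows. This is a minimal illustration, not the design of any particular generalized linkage system: the blocking key uses American Soundex as the phonetic code, and the per-field weights in `WEIGHTS`, the thresholds `UPPER` and `LOWER`, the field names, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import product

SOUNDEX_CODES = {ch: digit for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6")) for ch in letters}

def soundex(name):
    """American Soundex code, used here as the phonetic blocking code."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code, prev = letters[0], SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = digit
    return (code + "000")[:4]

def block_key(rec):
    return (soundex(rec["surname"]), rec["sex"])

def candidate_pairs(file_a, file_b):
    """Searching phase: only compare records that share a blocking key."""
    blocks = defaultdict(lambda: ([], []))
    for rec in file_a:
        blocks[block_key(rec)][0].append(rec)
    for rec in file_b:
        blocks[block_key(rec)][1].append(rec)
    for recs_a, recs_b in blocks.values():
        yield from product(recs_a, recs_b)

# Hypothetical (agreement, disagreement) weights per field and thresholds.
WEIGHTS = {"surname": (4.0, -3.0), "given": (3.0, -2.0), "birth_year": (5.0, -4.0)}
UPPER, LOWER = 6.0, 0.0

def pair_weight(a, b):
    return sum(agree if a[f] == b[f] else disagree
               for f, (agree, disagree) in WEIGHTS.items())

def classify(weight):
    """Decision-making phase: three-way classification of a record pair."""
    if weight >= UPPER:
        return "linked"
    if weight <= LOWER:
        return "nonlinked"
    return "possible"

file_a = [
    {"id": "a1", "surname": "SMITH", "given": "JOHN", "birth_year": 1950, "sex": "M"},
    {"id": "a2", "surname": "QUIGLEY", "given": "MARY", "birth_year": 1973, "sex": "F"},
]
file_b = [
    {"id": "b1", "surname": "SMYTH", "given": "JOHN", "birth_year": 1950, "sex": "M"},
    {"id": "b2", "surname": "QUIGLEY", "given": "MARY", "birth_year": 1973, "sex": "F"},
    {"id": "b3", "surname": "SMITH", "given": "PETER", "birth_year": 1962, "sex": "M"},
]

results = {(a["id"], b["id"]): classify(pair_weight(a, b))
           for a, b in candidate_pairs(file_a, file_b)}
print(results)
```

Blocking reduces the six conceptually possible pairs to three. The fixed weights are a simplification: a value-specific scheme would, for instance, give agreement on "QUIGLEY" more weight than agreement on "SMITH".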

The potential for linkage varies greatly between countries according to how information is collected and identified. The National Death Index in the United States and the Canadian Mortality Data Base have facilitated linkages at a national level. National birth and cancer data are also available in Canada.

Agencies need to develop explicit policies and mechanisms for the review and approval process for record linkage projects so that no individual will be harmed in the linkage process, either by false linkages or by the release of confidential information. Distinctions should be made for linkages done for statistical research purposes, where only aggregate statistics are released. Where possible, informed individual consent should be obtained, and the nature of the "public good" to be served should be assessed and reviewed.

Record linkage is an important tool in creating data required for examining the health of the public and of the health care system itself. It can be used to improve data holdings, data collection, quality assessment, and the dissemination of information. Data sources can be examined to eliminate duplicate records, to identify underreporting and missing cases (e.g., census population counts), to create person-oriented health statistics, and to generate disease registries and health surveillance systems. Some cancer registries link various data sources (e.g., hospital admissions, pathology and clinical reports, and death registrations) to generate their registries.

Record linkage is also used to create health indicators. For example, fetal and infant mortality is a general indicator of a country's socioeconomic development, public health, and maternal and child services. If infant death records are matched to birth records, it is possible to use birth variables, such as birth weight and gestational age, along with mortality data, such as cause of death, in analyzing the data.

Linkages can help in follow-up studies of cohorts or other groups to determine factors such as vital status, residential status, or health outcomes. Tracing is often needed for follow-up of industrial cohorts, clinical trials, and longitudinal surveys to obtain the cause of death and/or cancer.

In addition, record linkage can aid in developing recommendations about regulatory standards at the national and international levels. A good example can be seen in the work of the United Nations Scientific Committee on the Effects of Atomic Radiation, which provides evaluations of the sources of ionizing radiation and the effects of exposures. This committee assesses the consequences to human health of a wide variety of doses of ionizing radiation and estimates the dose people receive all over the world from natural and man-made radiation sources. Linkage of a variety of data sources is required, including health, exposure, and outcome information (e.g., cancer and deaths).

(SEE ALSO: Confidentiality; Data Sources and Collection Methods; Epidemiology; Information Technology; Informed Consent; Privacy; Registries; Statistics for Public Health; Vital Statistics)

Bibliography

Baldwin, J. A.; Acheson, E. D.; and Graham, W. J., eds. (1987). Textbook of Medical Record Linkage. Oxford, UK: Oxford University Press.

Chong, N. (1998). "Computerized Record Linkage in Cancer Registries." In Automated Data Collection in Cancer Registration, eds. R. J. Black, L. Simonato, H. H. Storm, and E. Démaret. Lyon: IARC, Technical Reports No. 32:7–11.

Duncan, G. T.; Jabine, T. B.; and de Wolf, V. A., eds. (1993). Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. Washington, DC: National Academy Press.

Dunn, H. L. (1946). "Record Linkage." American Journal of Public Health 36:1412–1416.

Federal Committee on Statistical Methodology (1997). Record Linkage Techniques–1997 Proceedings of an International Workshop and Exposition. Washington, DC: U.S. Office of Management and Budget.

Fellegi, I. P., and Sunter, A. B. (1969). "A Theory for Record Linkage." Journal of the American Statistical Association 64:1183–1210.

Howe, G. R. (1998). "Use of Computerized Record Linkage in Cohort Studies." Epidemiologic Reviews 20:112–121.

Newcombe, H. B. (1988). Handbook of Record Linkage Methods for Health and Statistical Studies, Administration and Business. Oxford, UK: Oxford University Press.

Newcombe, H. B.; Fair, M. E.; and Lalonde, P. (1992). "The Use of Names for Linking Personal Records." Journal of the American Statistical Association 87:1193–1208.

Newcombe, H. B.; Kennedy, J. M.; Axford, S. J.; and James, A. P. (1959). "Automatic Linkage of Vital Records." Science 130:954–959.

Smith, M. E., and Newcombe, H. B. (1980). "Automated Follow-up Facilities in Canada for Monitoring Delayed Health Effects." American Journal of Public Health 70(12):1261–1268.

Statistics Canada (2000). Generalized Record Linkage System. Concepts, Research and General Systems. Ottawa: Author.

— MARTHA E. FAIR



Wikipedia: Record linkage

Record linkage (RL) refers to the task of finding entries that refer to the same entity in two or more files. Record linkage is an appropriate technique when you have to join data sets that do not have a unique database key in common. A data set that has undergone record linkage is said to be linked.

Record linkage is a useful tool when performing data mining tasks in which the data originate from different sources or different organizations. Most commonly, performing RL on data sets involves joining records of persons based on name, since no national identification number or similar identifier is recorded in the data. In mathematical graph theory, record linkage can be seen as a technique of resolving bipartite graphs.

Naming conventions

Record linkage is the term used by statisticians, epidemiologists and historians among others. Commercial mail and database applications refer to it as "merge/purge processing" or "list washing". Computer scientists often refer to it as "data matching" or as the "object identity problem". Other names used to describe the same concept include "entity resolution", "entity disambiguation", "duplicate detection", "record matching", "instance identification", "deduplication", "coreference resolution", "reference reconciliation" and "database hardening". This confusion of terminology has led to few cross-references between these research communities.[1][2]

Methods

There are several approaches to record linkage. The most straightforward is a rules-based approach, in which reasonable rules are developed and then refined as common exceptions are found. The advantage to this approach is that it is possible to get a good deal of accuracy without needing a lot of labeled data to train or test the rules on. The disadvantage is that to obtain very high accuracy, more and more exceptions and special cases would need to be handled, and eventually the list of rules gets too complex to be built by hand.

A very popular approach has been Probabilistic Record Linkage (PRL). In this approach, a large set of record pairs is labeled by hand as matching or differing. Statistics are then calculated from the agreement of fields on matching and differing pairs to determine weights for each field. During execution, the agreement or disagreement weight for each field is added to get a combined score that reflects how likely the records are to refer to the same entity. Often there is one threshold above which a pair is considered a match, and another threshold below which it is considered not to be a match. Between the two thresholds a pair is considered to be "possibly a match", and dealt with accordingly (e.g., human reviewed, linked, or not linked, depending on the application).
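
One way to turn labeled pairs into field weights is sketched below. It assumes the log-ratio form often used with this approach: for each field, m is the share of matching pairs that agree and u is the share of differing pairs that agree, giving an agreement weight of log2(m/u) and a disagreement weight of log2((1-m)/(1-u)). The smoothing constants, the function names, and the sample data are hypothetical choices made for illustration.

```python
import math

def estimate_weights(labeled_pairs, fields):
    """Return {field: (agreement_weight, disagreement_weight)}."""
    weights = {}
    for f in fields:
        # is_match -> [number of agreements, number of pairs]
        counts = {True: [0, 0], False: [0, 0]}
        for a, b, is_match in labeled_pairs:
            counts[is_match][0] += a[f] == b[f]
            counts[is_match][1] += 1
        # Additive smoothing keeps m and u strictly between 0 and 1.
        m = (counts[True][0] + 0.5) / (counts[True][1] + 1)
        u = (counts[False][0] + 0.5) / (counts[False][1] + 1)
        weights[f] = (math.log2(m / u), math.log2((1 - m) / (1 - u)))
    return weights

def score(a, b, weights):
    """Sum the agreement or disagreement weight of every field."""
    return sum(w[0] if a[f] == b[f] else w[1] for f, w in weights.items())

# Hypothetical hand-labeled pairs: (record, record, is_match).
labeled = [
    ({"surname": "SMITH", "sex": "M"}, {"surname": "SMITH", "sex": "M"}, True),
    ({"surname": "JONES", "sex": "F"}, {"surname": "JONES", "sex": "F"}, True),
    ({"surname": "BROWN", "sex": "M"}, {"surname": "BRAUN", "sex": "M"}, True),
    ({"surname": "SMITH", "sex": "M"}, {"surname": "JONES", "sex": "M"}, False),
    ({"surname": "BROWN", "sex": "F"}, {"surname": "SMITH", "sex": "M"}, False),
    ({"surname": "JONES", "sex": "F"}, {"surname": "BROWN", "sex": "F"}, False),
]

weights = estimate_weights(labeled, fields=("surname", "sex"))
for field, (agree, disagree) in weights.items():
    print(f"{field}: agree {agree:+.2f}, disagree {disagree:+.2f}")
```

In this toy sample, agreement on surname earns a larger weight than agreement on sex, because differing pairs agree on sex far more often by chance. The combined score from `score` would then be compared against the two thresholds described above.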

In recent years, a variety of machine learning techniques have been used in record linkage. It has been recognized that Probabilistic Record Linkage is equivalent to the "Naive Bayes" algorithm in the field of machine learning, and suffers from the same assumption of the independence of its features, which is typically not true. Higher accuracy can often be achieved by using various other machine learning techniques, including a single-layer Perceptron.

Regardless of whether rule-based, PRL or machine learning techniques are used, normalization of the data is very important. Names are often spelled differently in different sources (e.g., "Wm. Smith", "William Smith", "William J. Smith", "Bill Smith", etc.), dates can be recorded various ways ("1/2/73", "1973.1.2", "Jan 2, 1973"), and places can be recorded differently as well ("Berkeley, CA", "Berkeley, Alameda, California, USA", etc.). By normalizing these into a common format and using comparison techniques that handle additional variation, much more consistency can be achieved, resulting in higher accuracy in any record linkage technique.
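
The normalization step can be sketched as below for the name and date examples just given. The lookup table `GIVEN_NAME_VARIANTS`, the list of accepted date formats, the decision to keep only the first and last name tokens, and the reading of "1/2/73" as month/day/year are all assumptions made for illustration; a real system would need much larger tables and an explicit policy for ambiguous dates.

```python
import re
from datetime import datetime

# Hypothetical lookup of abbreviations and nicknames to a canonical given name.
GIVEN_NAME_VARIANTS = {"WM": "WILLIAM", "BILL": "WILLIAM"}

# Assumed input formats; "%m/%d/%y" treats 1/2/73 as January 2, 1973.
DATE_FORMATS = ("%m/%d/%y", "%Y.%m.%d", "%b %d, %Y")

def normalize_name(raw):
    """Return (given, surname) in upper case, dropping middle names and initials."""
    tokens = re.sub(r"[^A-Za-z\s]", "", raw).upper().split()
    if not tokens:
        return None
    given, surname = tokens[0], tokens[-1]
    return (GIVEN_NAME_VARIANTS.get(given, given), surname)

def normalize_date(raw):
    """Return an ISO 8601 date string, or None if no known format applies."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

names = ["Wm. Smith", "William Smith", "William J. Smith", "Bill Smith"]
dates = ["1/2/73", "1973.1.2", "Jan 2, 1973"]

print({normalize_name(n) for n in names})
print({normalize_date(d) for d in dates})
```

After normalization the four name spellings and the three date spellings each collapse to a single value, so exact comparison or field weighting can treat them as agreeing.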

History of RL theory

The initial idea goes back to Halbert L. Dunn in 1946.[3] In the 1950s, Howard Borden Newcombe laid the probabilistic foundations of modern record linkage theory.

In 1969, Ivan Fellegi and Alan Sunter formalized these ideas and proved that the probabilistic decision rule they described was optimal when the comparison attributes are conditionally independent. Their pioneering work "A Theory for Record Linkage"[4] remains the mathematical foundation for many record linkage applications.

Since the late 1990s, various machine learning techniques have been developed that can, under favorable conditions, be used to estimate the conditional probabilities required by the Fellegi-Sunter (FS) theory. Several researchers have reported that the conditional independence assumption of the FS algorithm is often violated in practice; however, published efforts to explicitly model the conditional dependencies among the comparison attributes have not resulted in an improvement in record linkage quality.[citation needed]

Mathematical model

In an application with two files, A and B, denote the rows (records) by α(a) in file A and β(b) in file B. Assign K characteristics to each record. The set of records that represent identical entities is defined by

 M = \left\{ (a,b); a = b, a \in A, b \in B \right\}

and the complement of set M, namely the set U representing different entities, is defined as

 U = \left\{ (a,b); a \neq b, a \in A, b \in B \right\} .

A vector γ is defined that contains the coded agreements and disagreements on each characteristic:

 \gamma \left[ \alpha ( a ), \beta ( b ) \right] = \{ \gamma^{1} \left[ \alpha ( a ) , \beta ( b ) \right] , \ldots , \gamma^{K} \left[ \alpha ( a ), \beta ( b ) \right] \}

where the superscripts 1, ..., K index the characteristics (sex, age, marital status, etc.) in the files. The conditional probabilities of observing a specific vector γ given (a, b) \in M and (a, b) \in U are defined as


 m(\gamma) = P \left\{ \gamma \left[ \alpha (a), \beta (b) \right] | (a,b) \in M \right\} =
 \sum_{(a, b) \in M} P \left\{\gamma\left[ \alpha(a), \beta(b) \right] \right\} \cdot
                 P \left[ (a, b) | M\right]

and


 u(\gamma) = P \left\{ \gamma \left[ \alpha (a), \beta (b) \right] | (a,b) \in U \right\} =
 \sum_{(a, b) \in U} P \left\{\gamma\left[ \alpha(a), \beta(b) \right] \right\} \cdot
                 P \left[ (a, b) | U\right],
respectively.
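
To connect these definitions to the additive field weights used in practice, the following sketch assumes that the K components of γ are conditionally independent given M and given U (the assumption under which the Fellegi-Sunter decision rule was shown to be optimal). The shorthand m_k and u_k for the per-characteristic probabilities and the base-2 logarithm are notational choices introduced here, not part of the definitions above. Under that assumption the two probabilities factorize,

 m(\gamma) = \prod_{k=1}^{K} m_k(\gamma^{k}), \qquad u(\gamma) = \prod_{k=1}^{K} u_k(\gamma^{k}),

with

 m_k(\gamma^{k}) = P \left\{ \gamma^{k} \left[ \alpha (a), \beta (b) \right] | (a,b) \in M \right\}, \qquad u_k(\gamma^{k}) = P \left\{ \gamma^{k} \left[ \alpha (a), \beta (b) \right] | (a,b) \in U \right\},

so that the logarithm of the ratio m(γ)/u(γ) becomes a sum of one weight per characteristic:

 w(\gamma) = \log_2 \frac{m(\gamma)}{u(\gamma)} = \sum_{k=1}^{K} \log_2 \frac{m_k(\gamma^{k})}{u_k(\gamma^{k})} .

A pair with a large w(γ) is evidence for (a, b) \in M, and a pair with a small w(γ) is evidence for (a, b) \in U.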

Applications in historical research

Record linkage is important to social history research since most data sets, such as census records and parish registers, were recorded long before the invention of national identification numbers. When old sources are digitized, linking of data sets is a prerequisite for longitudinal study. This process is often further complicated by the lack of standard spelling of names, family names that change according to place of dwelling, changes in administrative boundaries, and problems of checking the data against other sources. Record linkage was among the most prominent themes in the history and computing field in the 1980s, but has since received less attention in research.


References

  1. ^ Christen, P. & Churches, T.: Febrl - Freely extensible biomedical record linkage (Manual, release 0.3) p.9
  2. ^ Elmagarmid, Ahmed; Ipeirotis, Panagiotis G.; Verykios, Vassilios (January 2007). "Duplicate Record Detection: A Survey" (PDF). IEEE Transactions on Knowledge and Data Engineering 19 (1): pp. 1–16. doi:10.1109/TKDE.2007.9. http://archive.nyu.edu/handle/2451/27823. Retrieved 2009-03-30.
  3. ^ Dunn, Halbert L. (December 1946). "Record Linkage" (PDF). American Journal of Public Health 36 (12): pp. 1412–1416. doi:10.2105/AJPH.36.12.1412. http://www.ajph.org/cgi/reprint/36/12/1412. Retrieved 2008-05-31.
  4. ^ Fellegi, Ivan; Sunter, Alan (December 1969). "A Theory for Record Linkage". Journal of the American Statistical Association 64 (328): pp. 1183–1210. doi:10.2307/2286061. JSTOR 2286061.

Copyrights:

Encyclopedia of Public Health. Copyright © 2002 by The Gale Group, Inc. All rights reserved.
Wikipedia. This article is licensed under the Creative Commons Attribution/Share-Alike License. It uses material from the Wikipedia article "Record linkage".