Explore chapters and articles related to this topic
Record Linkage
Published in Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, Julia Lane, Big Data and Social Science, 2020
The purpose of a record linkage algorithm is to examine pairs of records and make a prediction as to whether they correspond to the same underlying entity. (There are some sophisticated algorithms that examine sets of more than two records at a time [Steorts et al., 2014], but pairwise comparison remains the standard approach.) At the core of every record linkage algorithm is a function that compares two records and outputs a “score” that quantifies the similarity between those records. Mathematically, the match score is a function of the output from individual field comparisons: agreement in the first name field, agreement in the last name field, etc. Field comparisons may be binary—indicating agreement or disagreement—or they may output a range of values indicating different levels of agreement. There are a variety of methods in the statistical and computer science literature that can be used to generate a match score, including nearest-neighbor matching, regression-based matching, and propensity score matching. The probabilistic approach to record linkage defines the match score in terms of a likelihood ratio (Fellegi and Sunter, 1969).
Schema on read modeling approach as a basis of big data analytics integration in EIS
Published in Enterprise Information Systems, 2018
Slađana Janković, Snežana Mladenović, Dušan Mladenović, Slavko Vesković, Draženko Glavić
The main task of data integration, regardless of whether it is traditional or Big Data integration, batch or real-time data integration, is to download the required data from their current warehouse, to change their format in order to be compatible with the destination warehouse and to place them at the target location (Loshin 2013). It is the challenges which data integration has to address that have changed. The three main steps in data integration include schema alignment, record linkage and data fusion. Schema alignment should respond to the challenge of semantic ambiguity, enabling the identification of attributes with the same meaning as well as those without it. Record linkage should find out which records refer to the same entity and which do not. Data fusion should enable the identification of accurate data in an integrated data set in cases when different sources offer conflicting values.