مرکز اطلاعات علمی Scientific Information Database (SID) - Trusted Source for Research and Academic Resources

Information Journal Paper

Title

Record Linkage with Machine Learning Methods

Pages

  1-24

Abstract

Introduction: With the advent of big data over the last two decades, the need to integrate databases in order to build a stronger evidence base for policy and service development is felt more than ever. Familiarity with data linkage methodology, as one of the data integration methods, and with machine learning methods that facilitate the process of linking records is therefore essential.

Material and Methods: The record linkage process has five major steps: data pre-processing, indexing, record pair comparison, classification, and evaluation. There are two key methods for linking records: exact and probabilistic record linkage. Exact linkage uses a unique identifier that is present in both files to link records; when a unique identity number is available in the different data sources, record linkage is easy to implement. Where a unique identifier is not available, or is not of sufficient quality, probabilistic record linkage is used instead: it checks the similarity of the features that are common to both files to find records that are likely to belong to the same person. Classifying the compared record pairs based on their comparison vectors is a two-class (match or non-match) or three-class (match, non-match, or potential match) classification task. In traditional data integration approaches, record pairs are classified into three classes rather than only matches and non-matches, and a manual clerical review is required to decide the final match status of the potential matches. Most research on record linkage in the past decade has concentrated on improving the classification accuracy of record pairs. Various machine learning techniques, both unsupervised and supervised, have been investigated.
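
The distinction between exact linkage, comparison vectors, and three-class classification can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names, the `national_id` identifier, the use of `difflib` for string similarity, the averaging of similarities into a single score, and both thresholds are assumptions made only for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical attributes; the abstract does not list the fields that were compared.
FIELDS = ("first_name", "last_name", "birth_year")


def similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def comparison_vector(rec_a: dict, rec_b: dict) -> list:
    """One similarity value per common field (the record pair's feature vector)."""
    return [similarity(str(rec_a[f]), str(rec_b[f])) for f in FIELDS]


def classify(vector, lower=0.6, upper=0.85) -> str:
    """Three-class decision on the mean similarity (thresholds are assumed)."""
    score = sum(vector) / len(vector)
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "potential match"  # would go to manual clerical review


def link(rec_a: dict, rec_b: dict, id_field: str = "national_id") -> str:
    """Exact linkage when both records carry an identifier, else similarity-based."""
    id_a, id_b = rec_a.get(id_field), rec_b.get(id_field)
    if id_a and id_b:
        return "match" if id_a == id_b else "non-match"
    return classify(comparison_vector(rec_a, rec_b))
```

A full probabilistic approach would weight each field by how informative agreement on it is, rather than taking a plain mean; the mean is used here only to keep the sketch short.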
In this paper, in addition to introducing the record linkage process and some related methods, machine learning algorithms are used to increase the speed of database integration, reduce costs, and improve record linkage performance. Most classification techniques, such as the support vector machine, decision tree, and bagging methods, classify each compared record pair individually and independently of all other record pairs. From the classification point of view, each compared record pair is represented by its comparison vector, which contains the individual similarity values calculated in the comparison step. These comparison vectors serve as the feature vectors used to train a classification model and to classify record pairs with unknown match status.

Results and Discussion: Two databases, one from the Statistical Center of Iran and one from the Social Security Organization, are linked. Three classification techniques (support vector machine, decision tree, and bagging) were used for data integration, and ROC curves were plotted to find the best classification method. The results showed that the support vector machine and decision tree methods performed better than the bagging method.

Conclusion: Statistical organizations are challenged by the need to integrate diverse sets of inconsistent data and produce stable outputs. Instead of making the best possible statistics from a single data source, it is necessary to find the combination of sources that delivers the indicators or statistics that most efficiently satisfy users' needs.
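
As a rough illustration of how ROC curves can be used to compare classifiers on record pairs, the following Python sketch computes ROC points and the area under the curve from match scores. It is an assumption-laden example, not the paper's procedure: the function names are hypothetical, labels are taken to be 1 for a true match and 0 for a non-match, and the abstract does not say whether the comparison was made by area under the curve or by visual inspection.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    The decision threshold is swept from the highest score to the lowest.
    Assumes `labels` contains at least one 1 (match) and one 0 (non-match).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every pair tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

Given the scores that two classifiers assign to the same labelled record pairs, the one whose curve lies closer to the top-left corner (larger area) separates matches from non-matches better on that data.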

  • Cite

    APA:

    Aghamohammadi, Z., & Rezaei Ghahroodi, Z. (2022). Record Linkage with Machine Learning Methods. Journal of Statistical Sciences, 16(1), 1-24. SID. https://sid.ir/paper/1021419/en

    Vancouver:

    Aghamohammadi Z, Rezaei Ghahroodi Z. Record Linkage with Machine Learning Methods. Journal of Statistical Sciences [Internet]. 2022;16(1):1-24. Available from: https://sid.ir/paper/1021419/en

    IEEE:

    Z. Aghamohammadi and Z. Rezaei Ghahroodi, “Record Linkage with Machine Learning Methods,” Journal of Statistical Sciences, vol. 16, no. 1, pp. 1–24, 2022. [Online]. Available: https://sid.ir/paper/1021419/en
