مرکز اطلاعات علمی Scientific Information Database (SID) - Trusted Source for Research and Academic Resources

Journal Paper Information

Title

A’laam Corpus: A Standard Corpus of Named Entity for Persian Language

Pages

  127-140

Abstract

Named entity recognition (NER) is a natural language processing (NLP) problem whose solutions are mainly used in text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system is tasked with determining the boundaries of each named entity, recognizing its type, and classifying it into predefined categories. The categories of named entities include the names of persons, organizations, and locations (e.g. city and country), as well as time expressions, quantities, monetary expressions, and percentages. In general, corpus-based approaches have proven well suited to the NER problem. Given a NER corpus, named entities can be recognized through rule-based or machine-learning methods. Corpus-based NER systems need standard, appropriately annotated corpora. However, such corpora exist mainly for languages such as English; for Persian/Farsi they are rare or limited in volume. This paper therefore describes the procedure for producing a standard named entity (NE) corpus, the A’laam corpus, for the Persian language. The A’laam corpus contains about 250,000 tokens tagged with 13 NE tags. It was developed at the Research Center for Development of Advanced Technologies (RCDAT). The tokens of the A’laam corpus are drawn from the Farsi Text Corpus, a standard Farsi corpus of more than 100 million words developed by the Research Center of Intelligent Signal Processing (renamed the Research Center for Development of Advanced Technologies in 2013). The words of the Farsi Text Corpus, selected from diverse written and spoken sources, were tokenized and corrected manually. In addition, an 8-million-word part of the Farsi Text Corpus has part-of-speech (POS) tags at the word level. In total, about 8,400 sentences of the Farsi Text Corpus were randomly selected to obtain the roughly 250,000 tokens of the A’laam corpus. 
The A’laam corpus includes words, POS tags, and named entity tags. To evaluate the corpus, a Persian NER system was trained on it. For this purpose the corpus was divided into a train section and a test section: the train section accounted for 90% of the corpus and the test section for the remaining 10%. Using the Conditional Random Fields (CRF) method, the Persian NER system achieved a precision of 92.94% and a recall of 78.48%.
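
The evaluation setup described in the abstract (sentences of word, POS tag, and NE tag triples, a 90%/10% train/test split, and precision/recall scoring) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the tag names ("PER", "LOC", "O"), the sentence-level shuffling, and the token-level scoring are all assumptions, since the abstract states neither the 13-tag inventory nor how the split or the scoring was carried out. The CRF model itself is omitted, and the F1 helper is an addition based on the standard harmonic-mean definition.

```python
import random
from typing import List, Sequence, Tuple

# One annotated token: (word, POS tag, NE tag). The tag names used in this
# sketch ("PER", "LOC", "O") are illustrative assumptions, not the corpus's
# actual 13-tag inventory.
Token = Tuple[str, str, str]
Sentence = List[Token]


def split_corpus(sentences: Sequence[Sentence],
                 train_ratio: float = 0.9,
                 seed: int = 0) -> Tuple[List[Sentence], List[Sentence]]:
    """Shuffle sentences and split them into train and test sections."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def precision_recall(gold: Sequence[str],
                     predicted: Sequence[str],
                     outside: str = "O") -> Tuple[float, float]:
    """Token-level precision and recall over named-entity (non-outside) tags."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p != outside and p == g:
            tp += 1          # entity tag predicted correctly
        else:
            if p != outside:
                fp += 1      # predicted an entity tag that is wrong
            if g != outside:
                fn += 1      # gold entity tag missed or mislabeled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    # Toy data: ten one-token sentences, split 9 / 1.
    toy = [[(f"w{i}", "N", "O")] for i in range(10)]
    train, test = split_corpus(toy)
    print(len(train), len(test))

    gold = ["PER", "PER", "O", "LOC", "O"]
    pred = ["PER", "O", "O", "LOC", "LOC"]
    print(precision_recall(gold, pred))
```

Under the standard definition, the reported precision of 92.94% and recall of 78.48% correspond to an F1 of about 85.1% (`f1_score(0.9294, 0.7848)`); the abstract itself does not state an F-measure.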

Cite

  APA: Hosseinnejad, Shadi, Shekofteh, Yasser, & Emami Azadi, Tahereh. (2017). A’laam Corpus: A Standard Corpus of Named Entity for Persian Language. Signal and Data Processing, 14(3 (serial 33)), 127-140. SID. https://sid.ir/paper/160701/en

  Vancouver: Hosseinnejad Shadi, Shekofteh Yasser, Emami Azadi Tahereh. A’laam Corpus: A Standard Corpus of Named Entity for Persian Language. Signal and Data Processing [Internet]. 2017;14(3 (serial 33)):127-140. Available from: https://sid.ir/paper/160701/en

  IEEE: Shadi Hosseinnejad, Yasser Shekofteh, and Tahereh Emami Azadi, “A’laam Corpus: A Standard Corpus of Named Entity for Persian Language,” Signal and Data Processing, vol. 14, no. 3 (serial 33), pp. 127–140, 2017. [Online]. Available: https://sid.ir/paper/160701/en
