Scientific Information Database (SID)

View: 106

Download: 0

Cites: 0

Journal Paper Information

Title

Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations

Pages

  175-188

Keywords

Natural Language Processing (NLP) 

Abstract

The main task of tokenization is to divide the sentences of a text into their constituent units and to remove punctuation marks (dots, commas, etc.). Each unit is a continuous lexical or grammatical written chain that forms an independent semantic unit. Tokenization operates at the word level, and the extracted units can serve as input to other components such as a stemmer. Building such a tool requires identifying the units that count as independent semantic units in Persian. The tokenizer, one of the most widely used preprocessing tools in text analysis, detects word boundaries and converts a text into a sequence of words for later analysis. For English, a great deal of work has been done on text tokenization and many tools have been developed, such as Stanford, Ragel, ANTLR, JFlex, JLex, Flex, and Quex. In recent decades, valuable research has also been conducted on tokenization in Persian, but all of it has worked on the lexical and syntactic layers; in the current research, we tried to focus on the semantic layer in addition to those two. Persian texts usually exhibit two simple but important problems. The first is multi-word tokens, which arise when one word is written attached to the next. The second is multi-part units, which arise when words that together form a single lexical unit are written separately. Because of the variety in Persian script and the frequent non-observance of word-separation and spelling rules on the one hand, and the lexical complexity of Persian on the other, language-processing tasks such as tokenization face many challenges. To obtain optimal performance from this tool, it is therefore necessary first to specify the computational linguistics considerations of tokenization in Persian and then, based on these considerations, to provide a data set for training and testing. In this article, while explaining these considerations, we prepared such a data set; it contains 21,183 tokens, and the average sentence length is 40.28 tokens.
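
The two writing problems described in the abstract suggest what a Persian tokenizer must do beyond plain whitespace splitting. The sketch below is a minimal illustration, not the paper's actual tool: it strips an assumed punctuation inventory, relies on the zero-width non-joiner (ZWNJ) that Persian orthography uses inside compound words, and re-joins separated multi-part units from a small hypothetical merge list (MERGE_PAIRS).

```python
import re

# Zero-width non-joiner (ZWNJ): joins Persian word parts without a
# visible space, e.g. "کتاب" + ZWNJ + "خانه" -> "کتاب‌خانه".
ZWNJ = "\u200c"

# Illustrative punctuation inventory (Persian and Latin marks);
# the paper's actual set is not specified in the abstract.
PUNCT_RE = re.compile(r"[.،,;؛:!؟?()«»\"']")

# Hypothetical merge list: word pairs that are sometimes written with a
# space but form one lexical unit and should become a single token.
MERGE_PAIRS = {("می", "رود"), ("کتاب", "خانه")}

def tokenize(text: str) -> list[str]:
    """Split Persian text into word tokens: remove punctuation, split on
    whitespace (ZWNJ is not whitespace, so compounds written with it
    survive intact), then merge known multi-part units."""
    text = PUNCT_RE.sub(" ", text)
    words = text.split()  # str.split() breaks on Unicode whitespace only
    tokens: list[str] = []
    for word in words:
        if tokens and (tokens[-1], word) in MERGE_PAIRS:
            tokens[-1] = tokens[-1] + ZWNJ + word  # re-join a split unit
        else:
            tokens.append(word)
    return tokens

if __name__ == "__main__":
    print(tokenize("او به کتاب خانه می رود."))
    # -> ['او', 'به', 'کتاب‌خانه', 'می‌رود']
```

A dictionary-based merge list is only one possible treatment of separated multi-part units, and the opposite attachment problem (a word glued to the next) would need lexicon- or model-based splitting, which is exactly why an annotated training and test corpus such as the one presented here is useful.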

Multimedia: No record.

Cites: No record.

References: No record.

Related Journal Papers: No record.

Related Seminar Papers: No record.

Related Plans: No record.

Recommended Workshops: No record.





