Recently, human action recognition in videos has become an active area of research due to its variety of important applications, such as intelligent security surveillance, smart environments, education, health-care monitoring systems, and data mining. There are, however, a number of challenges that make the development of these systems harder than common machine vision tasks, in terms of both accuracy and efficiency: changes in illumination, moving and cluttered backgrounds, camera motion, and the complexity of the actions themselves, to name a few. One commonly used approach to automatic human action recognition is to first extract feature points within the video frames, then describe those points locally, and finally code (cluster) the descriptors to feed a learning algorithm that builds an action recognition model. In this paper, we aim to increase the accuracy of such methods by combining texture information, extracted with a human retina-inspired descriptor (FREAK), with the appearance-based information of the moving objects. To reduce the overhead that the added texture information imposes on the model-building phase, and in the hope of further increasing accuracy, we propose a cascade approach for building the desired model. Experiments on two large datasets, namely UCF101 and HMDB51, confirm that the proposed method achieves results comparable to state-of-the-art methods.
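For concreteness, the following is a minimal sketch of the generic pipeline outlined above (keypoint detection, local FREAK description, bag-of-words coding, and a classifier), not the paper's full method; the cascade stage, the appearance features, and dataset handling are omitted. It assumes opencv-contrib-python (which ships the FREAK implementation) and scikit-learn; the detector choice (FAST), sampling step, and vocabulary size are illustrative assumptions.

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC

detector = cv2.FastFeatureDetector_create()  # step 1: feature points (FAST is an assumed choice)
freak = cv2.xfeatures2d.FREAK_create()       # step 2: retina-inspired local description

def frame_descriptors(frame):
    """Detect keypoints in one frame and describe them with FREAK."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    kps = detector.detect(gray)
    kps, descs = freak.compute(gray, kps)
    return descs  # (n_points, 64) uint8 array, or None if no keypoints survive

def video_descriptors(path, step=5):
    """Stack FREAK descriptors from every `step`-th frame of a video."""
    cap = cv2.VideoCapture(path)
    chunks, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            d = frame_descriptors(frame)
            if d is not None:
                chunks.append(d)
        i += 1
    cap.release()
    return np.vstack(chunks) if chunks else np.empty((0, 64), np.uint8)

# Step 3: code the local descriptors against a learned vocabulary (bag of words).
def fit_vocabulary(all_descs, k=256):
    return MiniBatchKMeans(n_clusters=k).fit(all_descs.astype(np.float32))

def bow_histogram(descs, vocab):
    words = vocab.predict(descs.astype(np.float32))
    hist, _ = np.histogram(words, bins=np.arange(vocab.n_clusters + 1))
    return hist / max(hist.sum(), 1)  # normalized per-video histogram

# Step 4: feed the coded features to a learning algorithm, e.g.:
# clf = LinearSVC().fit(np.stack(train_histograms), train_labels)
```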