Packet classification is a fundamental process in network processors, in which incoming packets are classified into distinct flows by matching them against a set of filters. Software implementations of packet classification algorithms, though cheaper and more scalable than hardware implementations, are slower. In this paper, we use the parallel processing capabilities of graphics processors (GPUs) to accelerate the Hierarchical Trie packet classification algorithm and propose several scenarios based on the architecture of their global and shared memories. The results of implementing these scenarios, which agree with the computed time and memory complexities, show that the scenarios that partition the filter set into sub-trees no larger than the shared memory and copy them into it perform worse than a scenario that keeps the entire data structure in global memory. The performance of the partitioning scenarios improves as the number of sub-trees and duplicated filters decreases. Moreover, a scenario that can keep the whole hierarchical trie and its corresponding filters in shared memory, without any partitioning, performs best. The experimental results show that, on the same GPU, this scenario achieves approximately 2.1 times the throughput of existing methods.
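To make the global-memory versus shared-memory scenarios concrete, the sketch below shows a deliberately simplified CUDA example: a one-dimensional binary trie matched against the high-order bits of a packet key, with one kernel that traverses the trie directly in global memory and one that first stages it in shared memory. This is not the paper's implementation; the node layout, sizes, rule IDs, and the single-field trie (instead of the multi-field hierarchical trie) are illustrative assumptions.

```cuda
// Minimal sketch (assumptions noted above): compare global- vs. shared-memory
// placement of a small binary trie used for per-packet rule lookup.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

struct TrieNode {
    int left;   // child index for bit 0, -1 if absent
    int right;  // child index for bit 1, -1 if absent
    int rule;   // rule id stored at this node, -1 if none
};

#define MAX_NODES 64   // assumed small enough to fit in shared memory
#define TRIE_DEPTH 8   // match on the top 8 key bits only (illustrative)

// Scenario A: every thread walks the trie directly in global memory.
__global__ void classifyGlobal(const TrieNode *trie, const uint32_t *pkts,
                               int *out, int nPkts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPkts) return;
    uint32_t key = pkts[i];
    int node = 0, best = -1;
    for (int b = 31; b >= 32 - TRIE_DEPTH && node != -1; --b) {
        if (trie[node].rule != -1) best = trie[node].rule;
        node = ((key >> b) & 1u) ? trie[node].right : trie[node].left;
    }
    if (node != -1 && trie[node].rule != -1) best = trie[node].rule;
    out[i] = best;
}

// Scenario B: the block first copies the (small) trie into shared memory,
// then all threads traverse the on-chip copy.
__global__ void classifyShared(const TrieNode *trie, int nNodes,
                               const uint32_t *pkts, int *out, int nPkts) {
    __shared__ TrieNode sTrie[MAX_NODES];
    for (int n = threadIdx.x; n < nNodes; n += blockDim.x)
        sTrie[n] = trie[n];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPkts) return;
    uint32_t key = pkts[i];
    int node = 0, best = -1;
    for (int b = 31; b >= 32 - TRIE_DEPTH && node != -1; --b) {
        if (sTrie[node].rule != -1) best = sTrie[node].rule;
        node = ((key >> b) & 1u) ? sTrie[node].right : sTrie[node].left;
    }
    if (node != -1 && sTrie[node].rule != -1) best = sTrie[node].rule;
    out[i] = best;
}

int main() {
    // Hypothetical 3-node trie: root, prefix 0* -> rule 0, prefix 1* -> rule 1.
    TrieNode h_trie[MAX_NODES] = {
        {  1,  2, -1 },  // root
        { -1, -1,  0 },  // prefix 0*
        { -1, -1,  1 },  // prefix 1*
    };
    const int nNodes = 3, nPkts = 4;
    uint32_t h_pkts[nPkts] = { 0x0A000001u, 0xC0A80001u, 0x7F000001u, 0xFFFFFFFFu };
    int h_out[nPkts];

    TrieNode *d_trie; uint32_t *d_pkts; int *d_out;
    cudaMalloc(&d_trie, sizeof(h_trie));
    cudaMalloc(&d_pkts, sizeof(h_pkts));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_trie, h_trie, sizeof(h_trie), cudaMemcpyHostToDevice);
    cudaMemcpy(d_pkts, h_pkts, sizeof(h_pkts), cudaMemcpyHostToDevice);

    classifyGlobal<<<1, 128>>>(d_trie, d_pkts, d_out, nPkts);          // scenario A
    classifyShared<<<1, 128>>>(d_trie, nNodes, d_pkts, d_out, nPkts);  // scenario B
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < nPkts; ++i)
        printf("packet %d -> rule %d\n", i, h_out[i]);

    cudaFree(d_trie); cudaFree(d_pkts); cudaFree(d_out);
    return 0;
}
```

The trade-off the paper studies follows the same pattern: staging the structure in shared memory pays off only when the whole trie (or a sub-tree with few duplicated filters) fits on-chip; otherwise the copy overhead and partitioning cost favor leaving the structure in global memory.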