This paper proposes a model of spoken word recognition. The model is particularly concerned with the extraction of cues from the signal, leading to a specification of a word in terms of bundles of distinctive features, which are assumed to be the building blocks of words. In the proposed model, auditory input is chunked into a set of successive time slices. The derivation of the underlying word pattern is assumed to proceed through three layers: features, phonemes, and words. The feature layer has a complete set of feature detectors at every time slice. In this layer, the detection of the underlying pattern of distinctive features from the speech signal proceeds in three steps. In the first step, numerical values for features are obtained by measuring acoustic attributes in each time slice. The acoustic attributes are either acoustic landmarks corresponding to articulator-free features, which are identified on the basis of amplitude changes in various energy bands, or acoustic cues in the vicinity of the landmarks corresponding to articulator-bound features. In the second step, the continuous perceptual feature values are processed into a much more structured, discrete representation, namely phonological surface structure. This is carried out in Perception Grammar, as suggested by Boersma (1998). In the third step, further processing turns the discrete representation into an abstract one, yielding the underlying pattern of distinctive features. The next layer of the model has a complete set of phoneme detectors for every three time slices, but each set spans six time slices, so the sets overlap. This means that the detection of adjacent phonemes also overlaps, which is intended to simulate coarticulation. The top layer has a complete set of word detectors centered on every three time slices. Again, the sets overlap, but the number of time slices per word detector is variable because it depends on the length of each individual word.
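
The overlapping layout of the detector sets can be illustrated with a minimal sketch. It is not the implementation of the model: the function names `detector_spans` and `overlap` are hypothetical, and the choice to index each set by its first time slice (rather than its center) is an assumption made only to keep the arithmetic simple. The stride of three slices and the span of six slices for phoneme detector sets are taken from the description above; for the word layer, the span would be set to the length of each individual word.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open range of time slices: [start, end)


def detector_spans(n_slices: int, stride: int = 3, width: int = 6) -> List[Span]:
    """Lay out detector sets over n_slices time slices.

    A new set begins every `stride` slices and covers `width` slices.
    Assumption: sets are indexed by their first slice, and only sets
    that fit entirely inside the input are kept.
    """
    return [(start, start + width)
            for start in range(0, n_slices - width + 1, stride)]


def overlap(a: Span, b: Span) -> int:
    """Number of time slices shared by two half-open spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


# Phoneme layer: one set every three slices, each spanning six slices.
phoneme_spans = detector_spans(n_slices=12, stride=3, width=6)

# Word layer: same stride, but the span depends on the word's length
# (here a hypothetical word lasting nine slices).
word_spans = detector_spans(n_slices=12, stride=3, width=9)

for left, right in zip(phoneme_spans, phoneme_spans[1:]):
    print(left, right, "shared slices:", overlap(left, right))
```

With these settings, each phoneme detector set shares three time slices with its neighbor, which is the overlap the model relies on to simulate coarticulation.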