The most critical attribute of human language is its unbounded combinatorial nature: smaller elements can be combined into larger structures based on a grammatical system, resulting in a hierarchy of linguistic units, e.g., words, phrases, and sentences. Mentally parsing and representing such structures, however, poses challenges for speech comprehension. In speech, hierarchical linguistic structures do not have boundaries clearly defined by acoustic cues and must therefore be internally and incrementally constructed during comprehension. Here we demonstrate that during listening to connected speech, cortical activity of different time scales concurrently tracks the time course of abstract linguistic structures at different hierarchical levels, e.g. words, phrases, and sentences. Critically, the neural tracking of hierarchical linguistic structures is dissociated from the encoding of acoustic cues as well as from the predictability of incoming words. The results demonstrate that a hierarchy of neural processing timescales underlies grammar-based internal construction of hierarchical linguistic structure.