The amount of digital video data is increasing worldwide, which highlights the need for efficient algorithms that can index, retrieve and browse this data by content. This can be achieved by identifying semantic descriptions extracted automatically from the video structure. Among these descriptions, text within video is a rich feature that supports effective video indexing and browsing. Unlike most video text detection and extraction methods, which treat video sequences as collections of still images, we propose in this paper a spatiotemporal video-text localization and identification approach that proceeds in two main steps: text region localization and text region classification. In the first step, we detect the significant appearance of new objects in a frame by a split-and-merge process applied to binarized edge frame pair differences. Detected objects are, a priori, considered as text. They are then filtered according to both local contrast variation and texture criteria in order to retain the effective text regions. The resulting text regions are classified based on a visual grammar descriptor containing a set of semantic text class regions characterized by visual features. A visual table of contents is then generated from the extracted text regions occurring within the video sequence, enriched by their semantic identification. Experiments performed on a variety of video sequences show the efficiency of our approach.
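To make the first step more concrete, the following is a minimal sketch of detecting newly appearing edges between a pair of frames, under stated assumptions. It is not the method of this paper: the gradient-magnitude edge detector, the threshold value, the single bounding box, and the max-minus-min contrast measure are all illustrative choices, and the split-and-merge process and texture criteria are not implemented. All function names (`edge_map`, `new_edge_mask`, `bounding_box`, `local_contrast`) are hypothetical.

```python
import numpy as np


def edge_map(frame, threshold=50.0):
    """Binarized gradient-magnitude edge map of a grayscale frame.

    The detector and the threshold are assumptions made for illustration.
    """
    gy, gx = np.gradient(frame.astype(np.float64))
    return np.hypot(gx, gy) > threshold


def new_edge_mask(prev_frame, curr_frame, threshold=50.0):
    """Edges present in the current frame but absent from the previous one."""
    return edge_map(curr_frame, threshold) & ~edge_map(prev_frame, threshold)


def bounding_box(mask):
    """Return (top, left, bottom, right) of the True pixels, or None if empty."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None
    return int(rows[0]), int(cols[0]), int(rows[-1]), int(cols[-1])


def local_contrast(frame, box):
    """Intensity range inside a box; a crude stand-in for a contrast criterion."""
    top, left, bottom, right = box
    region = frame[top:bottom + 1, left:right + 1].astype(np.float64)
    return float(region.max() - region.min())


if __name__ == "__main__":
    # Synthetic frame pair: a bright block appears in the second frame.
    prev = np.zeros((40, 60), dtype=np.uint8)
    curr = prev.copy()
    curr[10:20, 15:45] = 255

    box = bounding_box(new_edge_mask(prev, curr))
    if box is not None:
        print("candidate region:", box, "contrast:", local_contrast(curr, box))
```

A candidate region found this way would still have to pass the contrast and texture filtering before being treated as text, and a real sequence would need one region per appearing object rather than a single box.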