This paper presents a generic model for describing the content and structure of video and image data. The video model is used to create a set of hierarchical groups that describe the structure of a video, allowing semantic concepts, such as scenes, to be represented. The description of the video content includes temporal features extracted from the video shots and spatial features of the key-frames. We describe a method for creating a complete description of video material at three levels of abstraction: group, shot, and key-frame. We also describe a heuristic process for extracting key-frames from a video shot.
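
As an illustration only, the three levels of abstraction named above (group, shot, key-frame) could be sketched as a nested data structure. This is a minimal sketch under stated assumptions, not the model defined in the paper: the class names, field names, and the feature keys `motion` and `brightness` are all hypothetical, chosen only to show where temporal and spatial features would attach.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KeyFrame:
    """A representative frame of a shot, carrying spatial features."""
    frame_index: int
    # Hypothetical feature container; the paper's actual features are not assumed here.
    spatial_features: Dict[str, float] = field(default_factory=dict)


@dataclass
class Shot:
    """A contiguous run of frames, carrying temporal features and its key-frames."""
    start_frame: int
    end_frame: int
    temporal_features: Dict[str, float] = field(default_factory=dict)
    key_frames: List[KeyFrame] = field(default_factory=list)


@dataclass
class Group:
    """A node in the hierarchy; may hold shots directly and/or nested groups."""
    label: str
    shots: List[Shot] = field(default_factory=list)
    subgroups: List["Group"] = field(default_factory=list)

    def all_shots(self) -> List[Shot]:
        """Collect the shots of this group and of every nested group, depth-first."""
        collected = list(self.shots)
        for sub in self.subgroups:
            collected.extend(sub.all_shots())
        return collected


def build_example() -> Group:
    """Build a tiny video description: one video group containing one scene group."""
    shot1 = Shot(0, 99, {"motion": 0.2}, [KeyFrame(10, {"brightness": 0.5})])
    shot2 = Shot(100, 249, {"motion": 0.7}, [KeyFrame(120), KeyFrame(200)])
    scene = Group("scene 1", shots=[shot1, shot2])
    return Group("video", subgroups=[scene])
```

Here a semantic concept such as a scene is simply a `Group` whose members are shots, and nesting groups inside groups gives the hierarchy. How key-frames are actually selected is left out, since that depends on the heuristic process, which this sketch does not attempt to reproduce.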