1
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance

      Preprint
      , ,

      Read this article at

          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Comprehending occluded objects are not well studied in existing large-scale visual-language multi-modal models. Current state-of-the-art multi-modal large models struggles to provide satisfactory results in understanding occluded objects through universal visual encoders and supervised learning strategies. Therefore, we propose OCC-MLLM-CoT-Alpha, a multi-modal large vision language framework that integrates 3D-aware supervision and Chain-of-Thoughts guidance. Particularly, (1) we build a multi-modal large vision-language model framework which is consisted of a large multi-modal vision-language model and a 3D reconstruction expert model. (2) the corresponding multi-modal Chain-of-Thoughts is learned through a combination of supervised and reinforcement training strategies, allowing the multi-modal vision-language model to enhance the recognition ability with learned multi-modal chain-of-thoughts guidance. (3) A large-scale multi-modal chain-of-thoughts reasoning dataset, consisting of \(110k\) samples of occluded objects held in hand, is built. In the evaluation, the proposed methods demonstrate decision score improvement of 15.75%,15.30%,16.98%,14.62%, and 4.42%,3.63%,6.94%,10.70% for two settings of a variety of state-of-the-art models.

          Related collections

          Author and article information

          Journal
          07 April 2025
          Article
          2504.04781
          776d9669-fa20-4a1a-b12f-2495a34048e8

          http://arxiv.org/licenses/nonexclusive-distrib/1.0/

          History
          Custom metadata
          This work has been accepted to the Multimodal Algorithmic Reasoning (MAR) Workshop at CVPR 2025
          cs.CV

          Computer vision & Pattern recognition
          Computer vision & Pattern recognition

          Comments

          Comment on this article

          Related Documents Log