
      Generalizable stereo depth estimation with masked image modelling



          Abstract

          Generalizable and accurate stereo depth estimation is vital for 3D reconstruction, especially in surgery. Supervised learning methods achieve the best performance; however, the limited ground-truth data available for surgical scenes restricts their generalizability. Self-supervised methods require no ground truth, but they suffer from scale ambiguity and incorrect disparity predictions caused by the inconsistency of the photometric loss. This work proposes a two-phase training procedure that is generalizable while retaining the high performance of supervised methods. It entails: (1) self-supervised representation learning of left and right views via masked image modelling (MIM), to learn generalizable semantic stereo features; and (2) supervised fine-tuning of the MIM pre-trained model for disparity estimation on synthetic data only, to learn a robust depth representation. To improve the stereo representations learnt via MIM, perceptual loss terms are introduced that explicitly encourage the learning of higher, scene-level features. Qualitative and quantitative evaluation on surgical and natural scenes shows that the approach achieves sub-millimetre accuracy and the lowest errors, respectively, setting a new state of the art despite never training on surgical or natural scene data for disparity estimation.
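          The two-phase procedure described above lends itself to a short sketch. The code below is a minimal illustration only, assuming a PyTorch setup and a ViT/MAE-style encoder in line with the works cited under "Most cited references"; every name here (mask_patches, phase1_mim_step, phase2_disparity_step) and every hyperparameter is hypothetical rather than the authors' implementation.

```python
# Hedged sketch of the two-phase training described in the abstract.
# encoder, decoder and disp_head are hypothetical modules; the paper's
# actual architecture, masking ratio and losses may differ.
import torch
import torch.nn.functional as F

def mask_patches(img, patch=16, ratio=0.75):
    """Zero out a random fraction of non-overlapping patches (MAE-style)."""
    B, _, H, W = img.shape
    keep = (torch.rand(B, 1, H // patch, W // patch, device=img.device) > ratio).float()
    mask = F.interpolate(keep, size=(H, W), mode="nearest")  # 1 = visible, 0 = masked
    return img * mask, mask

def phase1_mim_step(encoder, decoder, left, right, optimizer):
    """Phase 1: self-supervised masked image modelling on both stereo views."""
    loss = 0.0
    for view in (left, right):
        masked, mask = mask_patches(view)
        recon = decoder(encoder(masked))
        # Reconstruction error is taken over the masked regions only.
        loss = loss + (((recon - view) ** 2) * (1.0 - mask)).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def phase2_disparity_step(encoder, disp_head, left, right, gt_disp, optimizer):
    """Phase 2: supervised disparity regression on synthetic data only."""
    # Assumes the encoder returns spatial feature maps that can be fused
    # along the channel dimension before the disparity head.
    feats = torch.cat([encoder(left), encoder(right)], dim=1)
    loss = F.smooth_l1_loss(disp_head(feats), gt_disp)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

          At inference, metric depth follows from the predicted disparity d via depth = f·B/d, with focal length f and stereo baseline B, which is why accurate disparity on a calibrated stereo rig translates into sub-millimetre depth accuracy.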

          Abstract

          This research develops a novel stereo depth estimation method that integrates self-supervised and supervised learning. It begins with masked image modelling for stereo-semantic feature learning, then refines the model through supervised training on synthetic data for disparity estimation. Enhanced by perceptual loss terms and model design, the method achieves sub-millimetre accuracy in surgical and natural scenes, setting a new benchmark without requiring real-world data for disparity training.
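          For the perceptual loss terms, the standard formulation from Johnson et al. (cited under "Most cited references" below) compares activations of a fixed pretrained network instead of raw pixels, pushing the model towards higher, scene-level features. A minimal sketch follows, assuming a VGG16 backbone and an arbitrary layer choice; the paper's exact loss composition may differ.

```python
# Hedged sketch of a Johnson-style perceptual loss; layer indices and
# weighting are assumptions, not the authors' configuration.
import torch
import torch.nn.functional as F
import torchvision

class PerceptualLoss(torch.nn.Module):
    def __init__(self, layers=(3, 8, 15)):  # relu1_2, relu2_2, relu3_3 in VGG16
        super().__init__()
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)  # the loss network stays frozen
        self.vgg = vgg
        self.layers = set(layers)

    def forward(self, pred, target):
        loss, x, y = 0.0, pred, target
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layers:
                # Match feature activations rather than raw pixel values.
                loss = loss + F.mse_loss(x, y)
            if i >= max(self.layers):
                break
        return loss
```

          In the two-phase scheme above, such a term would be added to the phase-1 reconstruction objective, weighting feature agreement between the reconstructed and original views alongside the pixel-wise error.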


          Most cited references (30)


          Perceptual losses for real-time style transfer and super-resolution. European Conference on Computer Vision (ECCV).


          Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).


          An image is worth 16x16 words: transformers for image recognition at scale. International Conference on Learning Representations (ICLR).


                Author and article information

                Contributors
                samyakh.tukra17@imperial.ac.uk
                Journal
                Healthcare Technology Letters (Healthc Technol Lett)
                ISSN: 2053-3713; journal DOI: 10.1049/(ISSN)2053-3713
                Publisher: John Wiley and Sons Inc. (Hoboken)
                Published online: 23 December 2023
                Issue: April-June 2024, Volume 11, Issue 2-3 (Special Issue: Papers from the 17th Joint Workshop on Augmented Environments for Computer Assisted Interventions at MICCAI 2023, doi: 10.1049/htl2.v11.2-3)
                Pages: 108-116
                Affiliations
                [1] Hamlyn Centre of Robotic Surgery, Department of Surgery and Cancer, Imperial College London, London, UK
                [2] Present address: Imperial College London, Exhibition Rd, South Kensington Campus, London, UK
                Author notes
                [*] Correspondence: Samyakh Tukra, Hamlyn Centre of Robotic Surgery, Department of Surgery and Cancer, Imperial College London, London, UK.
                Email: samyakh.tukra17@imperial.ac.uk

                Author information
                ORCID: https://orcid.org/0000-0003-4317-7458
                Article
                Article ID: HTL212067
                DOI: 10.1049/htl2.12067
                PMCID: PMC11022219
                PMID: 38638493
                © 2023 The Authors. Healthcare Technology Letters published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.

                This is an open access article under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 28 November 2023
                Accepted: 04 December 2023
                Page count
                Figures: 6, Tables: 3, Pages: 9, Words: 4795
                Funding
                Funded by: NIHR Imperial Biomedical Research Centre (doi: 10.13039/501100013342)
                Funded by: Royal Society (doi: 10.13039/501100000288); Award IDs: RGF\EA\180084, UF140290
                Categories
                Letter

                Keywords: computer vision, convolutional neural nets, learning (artificial intelligence), neural nets, stereo image processing
