Modern computational models have leveraged biological advances in human brain research. This study addresses the problem of multimodal learning with the help of brain-inspired models. Specifically, a unified multimodal learning architecture is proposed based on deep neural networks, which are inspired by the biology of the visual cortex of the human brain. This unified framework is validated by two practical multimodal learning tasks: image captioning, involving visual and natural language signals, and visual-haptic fusion, involving haptic and visual signals. Extensive experiments are conducted under the framework, and competitive results are achieved.