This paper describes a parallel implementation of a C++ matrix/vector library for a large distributed-memory multicomputer. The library is “self-optimising” by exploiting lazy evaluation: execution of matrix operations is delayed as long as possible. This exposes the context in which each intermediate result is used. The run-time system extracts a functional representation of the values being computed and optimises data distribution, grain size and scheduling prior to execution. This exploits results from the theory of program transformation for optimising parallel functional programs, while presenting an entirely conventional interface to the programmer. We present details of some of the simple optimisations we have implemented so far and illustrate their effect using a small example.
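To make the delayed-evaluation idea concrete, the following minimal sketch shows one way such an interface could be built in C++: an arithmetic operator returns a deferred computation (a thunk) rather than a result, so a whole expression is captured before anything executes. The names `LazyVec`, `make` and `force` are hypothetical illustrations, not the library's actual API, and the sketch omits the run-time analysis of distribution, grain size and scheduling that the paper describes.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical type for illustration: a vector value represented as a
// recipe (thunk) for computing it, run only when force() is called.
struct LazyVec {
    std::function<std::vector<double>()> thunk;
    std::vector<double> force() const { return thunk(); }
};

// Wrap an already-known value as a trivially forced LazyVec.
LazyVec make(std::vector<double> v) {
    return { [v] { return v; } };
}

// operator+ builds a deferred node instead of computing immediately.
// In the scheme the paper describes, the run-time system would inspect
// the resulting graph of deferred operations before choosing how to
// distribute and schedule the actual work.
LazyVec operator+(const LazyVec& a, const LazyVec& b) {
    return { [a, b] {
        std::vector<double> x = a.force();
        std::vector<double> y = b.force();
        for (std::size_t i = 0; i < x.size(); ++i) x[i] += y[i];
        return x;
    } };
}
```

In this sketch, writing `a + b + c` merely nests thunks; no arithmetic happens until `force()` is called on the final result, at which point the full expression context is available.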