We present bilinear CNNs, an architecture that efficiently represents an image as a pooled outer product of features from two CNNs and is effective at fine-grained recognition tasks. These models capture localized part-feature interactions similar to those in part-based models, but can also be viewed as an orderless texture representation. Based on this observation we derive a family of end-to-end trainable bilinear models that generalize classical image representations such as second-order pooling, Fisher vectors, the vector of locally aggregated descriptors, and the bag of visual words. This enables domain-specific fine-tuning and visualization of the learned models by approximate inversion. Through a number of experiments we show that these models offer better accuracy, speed, and memory trade-offs than prior work on various fine-grained, texture, and scene recognition datasets. The source code for the complete system is available at http://vis-www.cs.umass.edu/bcnn
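To illustrate the pooled outer product at the core of this representation, the following is a minimal NumPy sketch; the function name, variable names, and shapes are our own illustration and are not taken from the released code.

```python
import numpy as np

def bilinear_pool(feat_a, feat_b):
    """Pooled outer product of two CNN feature maps.

    feat_a: (L, C_a) array of C_a-dim descriptors at L spatial locations
    feat_b: (L, C_b) array of C_b-dim descriptors at the same L locations
    Returns a (C_a * C_b,) image descriptor.
    """
    # Sum over locations of the outer products f_a(l) f_b(l)^T,
    # which equals the matrix product feat_a^T @ feat_b.
    pooled = feat_a.T @ feat_b      # shape (C_a, C_b)
    return pooled.reshape(-1)       # flatten to a single feature vector

# Example usage with hypothetical feature maps from two CNNs on one image:
# descriptor = bilinear_pool(conv_features_net_a, conv_features_net_b)
```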