The paper addresses one-shot visual place localization, evaluated on the INDECS database. The authors present an extension of partitioned texton histogram features, Adaptive Partitioning of Texture Histograms (APTH) and Adaptive Partitioning of Colour Texture Histograms (APCTH), which are intended to better encode the distribution of texton features in an image. They apply a linear regression classifier to these features to classify locations in the INDECS database and in a second database that they compiled themselves, and they compare their results with other state-of-the-art features.
Overall, the paper reads well, and the motivation is clear: improve texton histograms so that they are more discriminative for one-shot visual place recognition. The authors have also made all their code and datasets available online so that others can reproduce their results, which is great.
However, the paper achieves inferior results compared to the one-shot visual place localization results in the Pronobis et al. 2006 paper associated with the INDECS database, and referred to by the authors in their paper as .
Specifically, the authors' best result on INDECS for room classification (images taken from many places in a single room under different lighting conditions, one class per room) comes from their APCTH method, and this result, in terms of percentage correct/recall, is 48.10 ± 0.53%.
In the original paper that uses the same INDECS dataset, Receptive Field Histogram features classified with an SVM achieve classification rates of 80.41% and 81.78% on sunny and night test data respectively, when trained on cloudy images (Figure 4c in ).
Why is there such a large difference, more than 30 percentage points, between a method from close to 10 years ago and the authors’ method, when both use the same database?
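The size of this gap can be checked directly from the figures quoted above. The following is a minimal sketch that only restates the reviewer's arithmetic; the variable names are illustrative, and it assumes both sets of numbers are percentage-correct scores that can be compared directly, which may not hold exactly if the evaluation protocols differ.

```python
# Figures quoted in this review (percent correct on INDECS).
# Assumption: the two evaluation protocols are comparable enough to subtract.
apcth_rooms = 48.10                          # authors' best room-classification result (APCTH)
rfh_svm = {"sunny": 80.41, "night": 81.78}   # 2006 results, trained on cloudy images

# Gap in percentage points for each test condition.
gaps = {cond: round(score - apcth_rooms, 2) for cond, score in rfh_svm.items()}

for cond, gap in gaps.items():
    print(f"{cond}: {gap:.2f} percentage points")
```

Under these assumptions the gap is a little over 32 percentage points for the sunny test data and a little under 34 for the night test data.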
The task may require a non-linear classifier, so the authors should try a different classifier, such as a Support Vector Machine as used in , rather than the linear regression classifier that they have used.
The authors have also developed their own database, the 3 Rooms dataset. However, the data is captured from only one position in each room, by panning the camera around and capturing 100+ images from the same location in each room.
Contrast this with the INDECS dataset, which is captured over 5 rooms, with multiple locations in each room (a minimum of 9 for the 1-person office, and a maximum of 32 for the corridor) and 12 images per location. These images are taken in 3 different lighting conditions (sunny, cloudy and night) and were captured over a period of 3 months.
For a fair comparison, the authors therefore need to be more methodical and meticulous in constructing their own dataset, for example by capturing several locations per room under varied lighting conditions.
The authors also provide evidence for the better performance of their method on the room localization task:
“Our proposed model (APTH) as well as the variant running on colored Textons (APCTH) performs well on all three place categorization tasks. For room classification (figure 3 left) we achieve 48.10 ± 0.53% correct on APCTH, which outperforms the colored Texton approach of 46.34 ± 0.56% and lies well above all other control models. Both runs were performed with n = 2 partitions.”
But the difference between their feature (APCTH) and the next-best feature (coloured texton histograms) is less than one percentage point (0.67) once the error bars are taken into account (48.10 - 0.53 = 47.57 and 46.34 + 0.56 = 46.90), and only 2 partitions have been used. Does this really indicate that their adaptive partitioning of texton features is doing anything useful, especially given that it is much inferior to other methods (GIST and PyrHist) in the view task (same location, same perspective, different light)?
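The error-bar comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the ± values are symmetric half-widths around the reported means (the review does not say whether they are standard deviations or standard errors); the helper `interval` and the variable names are hypothetical.

```python
def interval(mean, half_width):
    """Return (lower, upper) bounds for a result reported as mean ± half_width."""
    return (mean - half_width, mean + half_width)

# Room-classification results quoted from the paper (percent correct).
apcth = interval(48.10, 0.53)           # proposed APCTH feature
colour_texton = interval(46.34, 0.56)   # next-best feature

# Difference between the reported means.
raw_gap = 48.10 - 46.34

# Worst-case separation: lower bound of APCTH minus upper bound of the baseline.
margin = apcth[0] - colour_texton[1]

print(f"difference of means: {raw_gap:.2f} percentage points")
print(f"separation between error bars: {margin:.2f} percentage points")
```

The means differ by 1.76 percentage points, but the separation between the two intervals is only 0.67 percentage points, which is the figure the concern above rests on.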
In conclusion, this paper needs major revisions. The authors need to explain why their results are much worse than the 2006 result on one-shot visual place recognition achieved on the same dataset (INDECS). Their own dataset (3 Rooms) needs to be improved to make it more comparable to the original dataset. Finally, they need to demonstrate more conclusively that their adaptive partitioning of texton histograms improves on the previously published non-adaptive partitioning of texton histograms.