In this paper the author presents an evaluation of different algorithms for one-shot localization at different levels of granularity. That is, given a training set of images taken from different rooms, an algorithm should localize the view (the same image under different illumination and small shifts), the location (roughly the same 3D position but different viewpoints: the camera can rotate but not translate), and the room (all images taken with the camera inside the same room) from which a picture has been taken.
The task, and therefore the required features, are probably quite different from those needed by well-known SLAM methods, whose main task is to track local features in order to build a 3D reconstruction of the environment.
However, the proposed task could greatly help a SLAM algorithm when the robot is "kidnapped", i.e., the robot gets lost and sees an image with no spatial correlation to the previous ones, yet it still has to roughly localize itself (assuming the location is in its training set).
The idea of evaluating image classification techniques for this specific task seems quite interesting to me and can be useful for researchers in robotics, and more specifically in SLAM.
However, in my opinion the paper is not ready yet. The introduction and related work are not well organized (see detailed remarks) and in certain points lack depth.
The technical novelty of the paper seems quite limited, essentially adding some spatial capability to a previous method (Textons). The introduced dataset is too small for a proper evaluation, and the justification for its introduction is weak.
The experimental evaluation is limited and the evaluation protocol should be clarified.
The final discussion is quite general, and few clear conclusions can be drawn from this work.
+ The evaluation of localization at different levels of granularity is interesting.
- The introduction, motivation, and related work should be rearranged and clarified to make clear what the problem is and what the author proposes to solve it (see below).
- missing related work for image classification.
- The selection of the baseline and of the other methods to compare with seems somewhat biased and not fully thorough.
- The evaluation protocol should be improved and explained more clearly.
- End of second paragraph:
The author describes the desired characteristics of the features. However, this depends on the exact task he wants to achieve. In this sense, either the explanation of the task should be more detailed (with a definition of localization at view, location, and room level), or the introduction should remain very general; in the latter case the author should not specify the characteristics of the desired features.
- 3rd paragraph:
The author should explain in a sentence what SLAM is (the explanation currently in the related work can be moved here).
At the end of the introduction I would expect a strong motivation for this paper; instead, the last two paragraphs only mention three previous methods that use global descriptors, which in my opinion is a weak motivation.
Also, the contribution of the paper should be introduced and clearly explained in the introduction.
- In the related work the author should start from general methods for image retrieval and classification, of which the introduced task is a specialization.
- The last part of the related work gives some motivation for the proposed approach and, in my opinion, should go at the end of the introduction.
- Here the task is finally cast as classification. In my opinion this should be done from the beginning.
- Note that recall alone is not enough for evaluating classification. A commonly used measure is average precision, i.e., the average of the precision values computed at each rank where a relevant item is retrieved (a summary of the precision-recall curve).
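For concreteness, the measure I have in mind can be sketched as follows (a minimal illustration, assuming a binary relevance label for each item in ranked order; this is the standard ranking-based definition of AP, not code from the paper under review):

```python
def average_precision(ranked_relevance):
    """AP over a ranked list: mean of the precision values
    at each rank where a relevant item appears."""
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

# Example: relevant items at ranks 1 and 3
# -> (1/1 + 2/3) / 2 = 0.8333...
print(average_precision([1, 0, 1]))
```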
- I like the idea of a fine-grained evaluation of the task, from view classification, where the task can be cast as image retrieval, to room classification, where the task is clearly image classification. Here (or in the related work) it would be important to remark on the differences and similarities with image retrieval and classification. For instance, image classification is often used for classifying object categories; the proposed task is quite different because different rooms can share very similar views (e.g., a white wall), while this generally does not happen in object categorization.
- Here again I would expect the author to guide the reader: he should explain why Textons make sense and why they are possibly better than other techniques.
- Where possible, give more intuition about the practical meaning of the equations used. This helps the reader follow the text without having to stop and analyze each equation.
- I do not understand why Spatial Pyramid is a separate subsection; I assume this is a typo.
- I think each method should be used as presented in the original paper or in improved versions. In this sense, the spatial pyramid should be evaluated with the histogram intersection kernel.
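To clarify what I mean by the histogram intersection kernel (a minimal sketch of the standard definition, with hypothetical variable names; each row is one image's histogram descriptor, as used with spatial pyramids):

```python
import numpy as np

def histogram_intersection_kernel(H1, H2):
    """Compute K[i, j] = sum_k min(H1[i, k], H2[j, k])
    between two sets of (normalized) histograms."""
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=2)

# Example with one query histogram and two reference histograms
H_query = np.array([[0.5, 0.5]])
H_refs = np.array([[0.3, 0.7], [0.5, 0.5]])
K = histogram_intersection_kernel(H_query, H_refs)
# K[0, 0] = min(0.5, 0.3) + min(0.5, 0.7) = 0.8
# K[0, 1] = min(0.5, 0.5) + min(0.5, 0.5) = 1.0
```

This Gram matrix can then be fed to any kernel classifier (e.g., a precomputed-kernel SVM), which is how the spatial pyramid method was evaluated in its original formulation.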
- In Section 2.4 I do not understand how classification is performed as regression. In my understanding, classification and regression are two different tasks, one with discrete classes and the other with continuous values. The author should explain and show the exact formulation used by the learning toolbox (GURLS).
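If the author means the common regularized-least-squares approach (which, to my understanding, is what GURLS is built around), this should be stated explicitly: one regressor per class is fit to ±1 one-vs-rest label encodings, and the predicted class is the regressor with the highest output. A minimal sketch of this interpretation (my assumption, not necessarily the paper's exact formulation; linear primal form with a closed-form ridge solution):

```python
import numpy as np

def rls_ovr_train(X, y, n_classes, lam=1e-3):
    """Fit one ridge regressor per class on +/-1 one-vs-rest targets.
    X: (n_samples, n_features), y: integer class labels."""
    Y = -np.ones((X.shape[0], n_classes))
    Y[np.arange(X.shape[0]), y] = 1.0
    # Closed-form ridge solution: W = (X^T X + lam I)^{-1} X^T Y
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    return W

def rls_ovr_predict(X, W):
    """Predicted class = regressor with the highest real-valued score."""
    return np.argmax(X @ W, axis=1)
```

This makes the "classification as regression" step well defined; an explanation along these lines (or the kernelized variant, if that is what was used) would resolve my confusion.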
- I would call Section 2.5 simply "Datasets", as the same datasets are used for testing as well as for training.
- The split between training and test data should, in my understanding, go in Section 2.4.
- Why are only 10 images used for training in room classification? In a training/test split, typically more images are used for training; here, out of 216 images, only 10 are used for training. This seems quite strange and should at least be explicitly justified. The same holds for the other tasks.
- It would be interesting to explain how the locations of the images in the dataset were obtained, e.g., by visual inspection or by external measurements from another sensor.
- The 3Rooms dataset seems quite small, and the justification for its introduction is quite weak. Also, as the dataset is introduced for the first time in this paper, a more detailed description of it is expected.
The analysis of the proposed methods should not be limited to "performance" but should also cover their computational cost, especially considering that the proposed approaches may be useful in robotics applications, where run-time is very important.
In Figs. 3 and 4 the y-axis should have a clear definition of the value it represents; "performance" is not a valid axis label. Also, the exact evaluation protocol should be made clear: are the methods evaluated in a multi-class setting, or in terms of ranking as in the well-known PASCAL VOC challenge?
From the discussion I would expect a few clear conclusions drawn from the experiments. I appreciate that the author tries to formulate hypotheses to explain all the obtained results; however, in my opinion this should come after a simple and clear statement of the main experimental results.
- In Section 2, 5th line from the bottom: "each pixel is assigned TO the cluster".
- Section 2.3 should start with a capital letter and should not treat the title as part of the first sentence.