The conventional approach to object detection represents each individual object instance with a bounding box.
However, this is not practical in industry-relevant applications such as warehouses, where groups of instances of the same category severely occlude one another.
For example, as shown in [Fig. 1(g)](https://isrc.iscas.ac.cn/gitlab/research/locount-dataset/-/tree/master/Images/dataset-comparison.jpg), it is extremely difficult even for a well-trained annotator to annotate each of the stacked dinner plates.
Meanwhile, it is almost impossible for even state-of-the-art object detectors to detect all of the stacked plates accurately.
Thus, it is necessary to rethink the definition of object detection in such scenarios.