Optimized convolutional neural network architectures for efficient on-device vision-based object detection

This section provides a detailed review of the main milestones or most representative approaches developed in recent years to bring the process of object localization and classification to devices with limited computational and memory resources. We will start this review with a holistic view of the main detection frameworks collected from the related literature. In total, Table 1 lists thirty detectors conceived as CNNs with small size (number of parameters) and modest computational complexity (computational volume). Specifically, for each detection framework analyzed, we will examine in detail the different techniques or methods adopted at the architectural level for their construction, not only pointing out which main building blocks or network architectures are chosen as the main constituent elements for each of the different stages or components of its structure, but also comprehensively analyzing which specific topological adjustments or improvements were applied to each of the parts to achieve the desired efficiency-accuracy trade-off.

Table 1 Summary of most relevant architectural aspects of CNN-based lightweight frameworks for object detection


Given the relevance of the backbone as the detection system’s key structural element, we will spend a large part of the section studying the main architectural approaches and design principles that have led to the development of lightweight yet expressive CNNs appropriate for the extraction of features with the quality required to effectively perform bounding box regression and class prediction, both of which are tasks involved in the detection process. However, this review will not be limited to the backbone networks listed in Table 1. Supported by a second table (Table 2), we will extend the discussion of those networks beyond the items presented in the first table, analyzing the most relevant general-purpose CNN architectures designed from scratch for mobile or embedded devices. Although, as we will see, most of them have not been used so far as part of any detection framework, they all represent perfectly valid approaches for this purpose. Moreover, due to their convolutional nature, they are based on structures and topological principles similar to those comprising the foundation of detectors, so their incorporation into the analysis will complement the global discussion, providing further relevant information, both at the micro and macroarchitectural level.

Table 2 Efficient CNN architectures for mobile and embedded vision applications


3.1 Lightweight object detection frameworks

The data collected in Table 1 provide context for the current mobile scenario, chronologically locating recent research efforts on studying and creating lightweight object detectors within the last 5 years. A first high-level inspection of the works analyzed in the table, focusing exclusively on the first three fields, which provide more general data, reveals certain aspects of interest that outline the evolution of this trend in recent years. Specifically, the increasing number of related papers published (from only two in 2017 to fifteen in the last year and a half) clearly highlights a significant and growing interest in applying the on-device paradigm to object detection. The table also confirms the massive adoption of the single-stage pipeline configuration as the predominant architectural model, with ThunderNet [91] being the only two-stage detection framework among all the lightweight detectors and base detection frameworks listed.

Maintaining the same level of abstraction, but extending the analysis of Table 1 to the columns that provide specific data on each of the components comprising the architecture of the different detection systems considered, we see that only a marginal number of papers, namely MAOD [92], CornerNet-Squeeze [93], and LightDet [94], explore the joint application of adjustments to backbone, neck, and head. The remaining majority is evenly split between works that explore enhancements to two of the elements forming the detection system, in its different permutations [91, 95,96,97,98,99,100,101,102,103,104,105,106], and approaches that focus on just one component [48, 107,108,109,110,111,112,113,114,115,116,117]. The main object of interest in the latter case is the neck and, to a lesser extent, the backbone. If we delve deeper into this classification and extract the number of studies per individual component examined, it is possible to rank the three components by the level of attention they received in the studies considered. The resulting list, in decreasing order of interest, is: neck > backbone > head. Two things are therefore clear: the emphasis on developing specific approaches is aimed at improving the neck, and there is a relative absence of actions focused on the detection head, whose structure, in general terms, is directly defined by the detection framework used as the base macroarchitecture. The remainder of the section covers the main contributions made in relation to the three components in the last few years.

3.1.1 Neck-specific design considerations

We will now increase the level of detail of the analysis to focus the discussion on Table 1, dealing with specific architectural aspects of the different networks used as neck within the several ultra-compact detectors studied.

3.1.1.1 Classification according to the multiscale-detection-enabler mechanism used

We start the discussion with the Base Network field, which contains the most relevant CNN microarchitectures adopted as the base structure for designing the final neck architecture. Setting aside the RPN, which is intended for the synthesis of RoIs within two-step detection frameworks rather than for enhancing the representational power of the features involved, it is possible to group those microarchitectures into three differentiated categories or approach types according to the type of multiscale-detection-enabler mechanism used: (i) the exploitation of a pyramidal feature hierarchy, (ii) the recovery of high-resolution representations from low-resolution representations, and (iii) the maintenance of high-resolution representations throughout the entire network. This classification does not include HyperNet [87], more specifically, the Hyper Features extraction network used as its neck. Although it relies on the fusion of different feature maps, which excludes it from (i), the feature aggregation performed does not involve generating higher-resolution representations, so it cannot be categorized under (ii) or (iii) either. Setting aside this exception, we now discuss the specific features that characterize each of the three approach types indicated, and we also address specific aspects of the related architectures to clarify the underlying design principles.

i. Pyramidal feature hierarchy This is the simplest approach of the three categories identified, purely focused on detecting objects of interest with different sizes due to the exploitation of pyramid-shaped multiscale feature maps. It is represented in Table 1 by the SSD and SSDLite [48] detectors. SSD, as mentioned in Sect. 2, constitutes one of the more paradigmatic architectures within this approach, while SSDLite simply represents a lighter version of SSD, explicitly conceived for mobile devices by replacing conventional convolution operations with depth-wise separable convolutions, a practice that has been extensively adopted for alleviating the computational complexity of traditional CNNs, as shown below. Aside from this particular point, SSDLite does not introduce additional architectural concepts of interest beyond those already materialized in the original SSD architecture.
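
To make the substitution concrete, the following is a minimal PyTorch sketch of a depth-wise separable convolution block of the kind SSDLite uses in place of standard convolutions; the class name, channel widths, and the BatchNorm/ReLU6 arrangement are illustrative assumptions rather than the exact SSDLite configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise separable convolution: a per-channel (depth-wise) 3x3
    convolution followed by a 1x1 point-wise convolution that mixes channels.
    This is the kind of factorization SSDLite applies to SSD's layers."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

# Rough cost intuition for a 3x3 layer with 256 -> 256 channels:
# standard convolution ~ 3*3*256*256 ≈ 590k multiply-accumulates per position,
# separable version    ~ 3*3*256 + 256*256 ≈ 68k, roughly an 8-9x reduction.
```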

ii. Recovery of high-resolution representations This approach involves the fusion of multiscale feature maps to address the lack of accuracy when detecting small objects, progressively upsampling low-resolution representations to recover high-resolution representations. This category possibly encompasses the most widely used neck-specific methods among those reviewed, covering mainstream pyramidal architectures such as FPN [86] or Hourglass [88], more compact alternatives such as Depth-wise FPN (D-FPN) [95]—which incorporates more efficient depth-wise convolutions into regular FPN—or YOLOv3-Tiny’s neck [120], and even more differentiated proposals such as NAS-FPN [111]—exploiting feature-fusion building blocks automatically derived using Neural Architecture Search (NAS)—or FSSD [89]—an improved version of the SSD architecture. Among the approaches just mentioned, FPN is undoubtedly the most representative architecture for pyramid-like feature representation generation in object detection. According to the data presented in Table 1, FPN represents the architectural option selected by the largest number of authors (ten papers out of a total of fifteen, including the D-FPN and NAS-FPN variants) as the foundation for building the neck part in the lightweight detection proposals.

D-FPN and YOLOv3-Tiny’s neck are particularly interesting, since both follow the current on-device trend of exploring computer vision solutions tailored to low-power devices. D-FPN shares the same dual-path architecture as FPN (an initial downsampling stage followed by a second, inverse stage) but succeeds in reducing the upsampling path’s computational complexity by exploiting a more efficient structure consisting of a bilinear interpolation layer followed by a depth-wise convolution (see the sketch below). As for YOLOv3-Tiny’s neck, the final implemented network results from a profound structural simplification, as is the case with the detector’s global architecture. That simplification is achieved by means of aggressive optimization practices, such as a significant reduction in both the number of scales considered and the number of layers integrated relative to the original YOLOv3 network [77], or the fusion of single-scale features only, which is certainly to the detriment of the semantic richness of the extracted features and, ultimately, detection accuracy.
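
The sketch below illustrates, in PyTorch, the kind of top-down fusion step described for D-FPN (parameter-free bilinear upsampling followed by a depth-wise convolution); the lateral 1×1 convolution, the 256-channel width, and the class name are illustrative assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFPNTopDownStep(nn.Module):
    """One top-down fusion step in the spirit of D-FPN: the coarser pyramid
    level is upsampled with parameter-free bilinear interpolation, and the
    merged map is smoothed with a cheap depth-wise convolution instead of
    the standard 3x3 convolution used in the original FPN."""
    def __init__(self, channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(channels, channels, kernel_size=1)  # lateral 1x1, as in FPN
        self.smooth = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                                groups=channels)  # depth-wise smoothing

    def forward(self, coarse, fine):
        up = F.interpolate(coarse, size=fine.shape[-2:], mode="bilinear",
                           align_corners=False)
        return self.smooth(self.lateral(fine) + up)

# Example: fuse a 13x13 map into a 26x26 map of the same channel width.
p5 = torch.randn(1, 256, 13, 13)
c4 = torch.randn(1, 256, 26, 26)
p4 = DFPNTopDownStep()(p5, c4)   # -> torch.Size([1, 256, 26, 26])
```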

iii. Maintenance of high-resolution representations Along the lines of the previous approach, this strategy pursues a convolutional architecture design aimed again at the generation of high-resolution representations; in this case, however, communicating those representations throughout the entire detection network in order to avoid transitions between high and low resolutions, common in multiscale approaches. Specifically, High-Resolution Net (HRNet) [126], the only architecture listed in Table 1 corresponding to this paradigm, goes beyond multilevel fusion and, as an alternative, proposes taking a high-resolution convolution stream directly as a starting point, subsequently connecting in parallel one stream per considered resolution, thus exchanging information between multiple streams. In this way, multi-resolution fusion can be performed recurrently, resulting in high-resolution representations with great semantic richness and spatial precision.
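
As a rough illustration of the idea of keeping parallel resolution streams that repeatedly exchange information, the following toy PyTorch module connects one high-resolution and one low-resolution stream; the channel widths and fusion operators are simplified assumptions and do not reproduce the actual HRNet exchange units.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamExchange(nn.Module):
    """Toy two-resolution exchange unit in the spirit of HRNet: a high- and a
    low-resolution stream are kept in parallel and exchange information
    (downsampled high-res features are added to the low-res stream, upsampled
    low-res features are added to the high-res stream)."""
    def __init__(self, ch_high=32, ch_low=64):
        super().__init__()
        self.high_to_low = nn.Conv2d(ch_high, ch_low, kernel_size=3, stride=2, padding=1)
        self.low_to_high = nn.Conv2d(ch_low, ch_high, kernel_size=1)

    def forward(self, x_high, x_low):
        fused_low = x_low + self.high_to_low(x_high)
        up = F.interpolate(self.low_to_high(x_low), size=x_high.shape[-2:],
                           mode="bilinear", align_corners=False)
        fused_high = x_high + up
        return fused_high, fused_low

x_high = torch.randn(1, 32, 64, 64)
x_low = torch.randn(1, 64, 32, 32)
h, l = TwoStreamExchange()(x_high, x_low)   # the high-res stream stays at 64x64
```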

3.1.1.2 Classification according to the enhancement type produced

The space of architectural solutions just discussed clearly indicates a strong presence of pyramidal CNN models, originally designed as building blocks of standard unified detectors, halfway between the current on-device approach and the more complex traditional architectures. Those models, although able to produce better representations than those generated by ultra-compact networks, typically feature a complexity and size impracticable for systems with modest capabilities, as well as an accuracy level lower than that commonly reached by two-step detection frameworks. In that regard, the use of a pre-existing base CNN architecture, even though it can lighten and, on certain occasions, completely bypass the study and design of specific solutions, streamlining the design of new neck architectures, does not in itself constitute an optimal solution. As noted in Sect. 2, a twofold effort is needed to advance toward an improved speed-accuracy trade-off: (i) an emphasis on techniques for size and computational complexity reduction, and consequently latency reduction, and (ii) the exploration of methods toward more expressive networks with greater detection capacity. Next, we present the key strategies and methods adopted in both directions to obtain a structure compliant with the on-device paradigm’s efficiency principles, omitting overly specific design details in order to keep a level of abstraction that facilitates applying the required adjustments to different architectures or future cases.

i. Size and latency reduction This approach pursues network architectures that result in models with fewer parameters and lower computational complexity, that is, with smaller size and higher inference speed. The use of factorized convolutional filters represents the most paradigmatic mechanism related to this approach [48, 91, 92, 95, 97, 108, 111], constituting in itself a CNN-compression-specific subcategory of techniques that encompasses several lighter and faster variants of the standard convolution operation, such as depth-wise [48, 91, 92, 95, 97, 111] and group convolutions [108]. In addition, this group also embraces techniques based on integrating building blocks into the architecture, such as attention modules, used in [93] to reduce the number of pixels to be processed in region-based detection and thereby increase the speed of object detectors, and Fire modules, explored in [96, 108] to, again, lower the number of parameters while preserving accuracy (a sketch of this module is given below). Along the same lines, supplementing the exploitation of such blocks, Wang et al. [106] report the application of the recent CSP design [125] to the various structural components of a detector as a highly beneficial alternative to the more traditional residual connections, able to reduce the number of parameters, the computations, and the inference time. Finally, this category also includes simpler techniques such as the direct reduction of the quantity of weights in the network, for instance, removing larger feature maps as in [107]; the use of layers based on 1×1 filters instead of fully connected layers to perform predictions [107]; or the simple optimization of the number of filters used, even if that involves breaking the ruling microarchitectural homogeneity in CNN building blocks, as in [96].
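
As announced above, this is a minimal PyTorch sketch of a Fire module of the kind explored in [96, 108]: a 1×1 "squeeze" layer reduces the channel count before a mixed 1×1/3×3 "expand" layer; the channel counts in the usage comment are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire module as popularized by SqueezeNet: a 1x1 'squeeze' layer cuts
    the channel count before a mixed 1x1/3x3 'expand' layer, which keeps the
    parameter count low while preserving representational capacity."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.act(self.squeeze(x))
        return torch.cat([self.act(self.expand1x1(s)),
                          self.act(self.expand3x3(s))], dim=1)

# e.g. Fire(128, squeeze_ch=16, expand_ch=64) maps 128 -> 128 channels with
# far fewer parameters than a single 3x3 convolution of the same width.
```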

ii. Detection performance increase This approach is significantly more heterogeneous than the one presented in (i); it aims at achieving better classification performance but is mainly focused on increasing detection accuracy, especially in complex applications such as small-target detection. It is possible, therefore, to identify two differentiated strategy types: general-purpose methods [91, 92, 97, 107, 108], applicable to any CNN and grounded in concepts that will emerge again in the discussion of the backbone; and object-detection-specific methods [91, 94, 97,98,99, 102,103,104,105, 109, 115, 117], primarily focused on improving localization tasks.

Regarding general-purpose methods, since the emergence of CNNs there has been a well-known and long-standing interest in improving the performance of vision-based systems in classification tasks, exploring solutions aimed primarily at increasing the representation capacity of the built networks and, consequently, improving learning and accuracy. Although many of the actions taken to that end, such as making the network deeper, are largely impractical in the on-device context, the underlying philosophy remains completely applicable, and it also constitutes the foundation of a considerable body of more specific approaches or subcategories seeking lightweight solutions. As shown in Table 1, various studies can be found in the literature, such as: [108], which explicitly seeks to provide CNNs with a better and more efficient representation learning capacity by leveraging group convolutions; studies that adopt simpler practices, for instance, exploiting larger convolution filters [91] and removing subsampling layers [92], to achieve or maintain a large receptive field, enabling the subsequent encoding of a larger volume of information; approaches [107, 108] that, for example, rely on the addition of shortcut connections (residual blocks) to the network architecture in order to alleviate the vanishing gradient problem; strategies for increasing nonlinearity, such as the use of 1×1 pointwise convolution operations [97] or the use of the Hard Swish (h-swish) activation function instead of a more standard option such as the Rectified Linear Unit (ReLU) [92] (sketched below); and, finally, mechanisms for better information flow, such as the aforementioned shortcut connections [107, 108].
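
For reference, the h-swish activation mentioned above takes only a few lines; the PyTorch sketch below follows the x·ReLU6(x + 3)/6 formulation popularized by MobileNetV3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSwish(nn.Module):
    """Hard Swish: x * ReLU6(x + 3) / 6. It approximates the Swish activation
    with cheap piecewise-linear operations, which matters on mobile hardware."""
    def forward(self, x):
        return x * F.relu6(x + 3.0) / 6.0

x = torch.linspace(-4, 4, 9)
print(HSwish()(x))   # matches nn.Hardswish() in recent PyTorch releases
```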

In terms of detection-specific enhancement solutions, these are usually methods that, building on multiscale feature maps [98, 109] (essential for the detection of multiple targets with different sizes), aggregate low-level high-resolution features with high-level semantic features to achieve greater semantic richness [91, 94, 97,98,99, 102,103,104,105, 109, 115, 117]. Apart from two specific contributions that propose efforts directly related to the exploitation of multiscale features—increasing the number of different scale levels considered for the output [98] and using encoder–decoder structures for feature generation at different levels [109]—the methods identified in the table devoted to lightweight detection architectures are essentially located in the space of solutions aimed at obtaining more valuable features, semantically speaking. More specifically, the data presented in the table in this respect depict a scenario where the fusion of multiscale feature maps [94] constitutes the dominant approach and where related works primarily focus both on different information transfer and exchange structures, namely dense connections [103, 115] and inverted residual blocks [99, 104, 105], and on attention mechanisms, an approach primarily aimed at extracting more discriminative features, mainly channel-wise [98, 104, 105] but also simultaneously at the spatial and channel level [102] (a minimal channel-attention sketch is given below). Additionally, several other approaches that also seek to improve the network’s expressiveness can be identified; in this case, they do so by enriching intermediate features through the use of convolutions on dimension-reduction blocks [97] or by using attention modules to adjust the feature distribution and thus facilitate the distinction between background and foreground features (spatial attention) [91].
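
As noted, a minimal sketch of the channel-wise attention idea (a squeeze-and-excitation-style block) is shown below in PyTorch; the reduction ratio and class name are illustrative assumptions and do not correspond to any specific detector in Table 1.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention: global average pooling
    summarizes each channel, a small bottleneck MLP produces per-channel
    weights, and the input features are rescaled channel-wise."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # (b, c) channel descriptors
        return x * w.view(b, c, 1, 1)     # reweight the feature maps

feats = torch.randn(2, 64, 32, 32)
out = ChannelAttention(64)(feats)         # same shape, recalibrated channels
```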

3.1.2 Detection head architectural principles

Twenty-one [47, 95, 96, 99, 100, 102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117] out of the thirty papers reviewed show no action explicitly focused on the design of the head for lightweight detectors other than the study and selection of the base network architecture to be used. It is possible to go a little further and even state that there is no trace of apparent activity in this regard, since the different microarchitectures used for this purpose (at least, the ones reported in Table 1) are merely the structures proposed as the detection head of the corresponding base detection frameworks. This pattern extends beyond this group of publications that do not address head-specific enhancements and remains constant throughout all the works listed, with the dual exception of BMNet, which does not provide head-specific information at all, and LightDet, which proposes its own microarchitecture conceived from scratch.

Regarding the different architectural alternatives used as a reference for the design of the head in lightweight object detectors, Table 1 shows a general scenario very similar to the one described in the previous point for the neck, dominated by networks initially conceived as structural elements of unified detectors, but one that, in this particular case, features a slightly broader range of options. An initial high-level exploration of the data collected in the table makes it possible to infer a predominant network type or profile characterized not only by its unified architecture but also by its ability to detect objects with different scales and aspect ratios thanks to multiscale feature processing and the use of anchor boxes in the detection process. Thus, fitting the profile outlined, a total of six base head designs (originally part of SSD, SSDLite, YOLOv3, YOLOv3-Tiny, RetinaNet, and RefineDet) adopted in eighteen of the twenty-five works studied can be found in Table 1. There are also, albeit almost marginally, microarchitectural alternatives that do not fit this profile, either because they have been implemented as part of a two-stage detection pipeline (Faster R-CNN and Light-Head R-CNN), despite relying on anchors, or simply because they have been conceived as part of non-anchor-based detectors (YOLO, CornerNet, and FCOS [127]).

Regarding any modifications applied to the base architecture, the two-fold approach already identified during the neck-related discussion in Sect. 3.1.1 (and present in the backbone analysis as well) emerges again. Thus, it is possible to classify the head-specific enhancement methods into the same two categories or approach types: a first group focused on size and latency reduction (i) and a second body of techniques with an emphasis on increasing or at least maintaining detection accuracy to make up for the potential harm caused in this respect by the techniques in the first category (ii). Overall, there is a major gap in terms of prevalence distribution between neck-related adjustments and those targeting the detection head. More specifically, the data presented in Table 1 show an evident polarization of head-centered techniques into two distinct groups that did not emerge in the analysis of neck-related approaches. Thus, except for a couple of papers that can be simultaneously associated with the two different approach types considered [92, 94], every single modification can be located in one of the two indicated solution spaces. That divergence becomes even more pronounced if we take into account the size of those spaces: quite even for the two groups in the case of the neck, but significantly uneven in the case of the head. Furthermore, regarding the latter, the subgroup of methods that seek to lower computational and memory cost has an evident prominence (an approach embodied by five publications [48, 91, 93, 98, 118] referenced in Table 1) compared to the group that encompasses accuracy-centric modifications (with only two representative works [97, 101] in the table).

Turning now to specific approaches, we identify in group (i) strategies mainly aimed at reducing the number of parameters in the models produced, which reduces the models’ size and, in many cases, their computational complexity as well. Among them we can distinguish techniques eminently focused on the inner configuration of layers and, therefore, on modifying filter-specific aspects: the use of depth-wise separable convolutions instead of standard convolutions [48], the decrease in the number of channels [91], or the use of smaller convolutional filters [93, 94]. Also included in this parameter-reduction-oriented subgroup are methods that address more general layer-related considerations, such as replacing fully connected layers with convolutional layers [118] (sketched below) or simply omitting a subset of the layers found in the original architecture [98]. Finally, to complement the different approaches just mentioned, we also associate with group (i) a different subcategory or approach type that directly pursues computational complexity reduction, represented in Table 1 by a single paper [92] that proposes adding dedicated layers for removing the background of the given input image (suppression of non-useful information) to reduce the number of pixels to be processed.
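
The sketch below illustrates the "convolutions instead of fully connected layers" idea for the prediction layers of an anchor-based head; the channel count, number of anchors, and number of classes are illustrative assumptions, not the configuration used in [118].

```python
import torch
import torch.nn as nn

class ConvPredictionHead(nn.Module):
    """Minimal fully convolutional detection head: instead of flattening the
    feature map into fully connected layers, per-anchor class scores and box
    offsets are predicted with small (here 1x1) convolutions, so the number
    of parameters is independent of the spatial resolution."""
    def __init__(self, in_ch=128, num_anchors=3, num_classes=20):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, num_anchors * num_classes, kernel_size=1)
        self.reg = nn.Conv2d(in_ch, num_anchors * 4, kernel_size=1)

    def forward(self, feat):
        return self.cls(feat), self.reg(feat)

feat = torch.randn(1, 128, 10, 10)
cls_logits, box_deltas = ConvPredictionHead()(feat)
# cls_logits: (1, 60, 10, 10), box_deltas: (1, 12, 10, 10)
```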

The second head-specific category or group (ii) is monopolized by the exploitation of residual blocks as a constituent part of the head’s structure [92, 94, 97, 101]. Such an approach is able both to increase detection accuracy and to contribute to reducing the resulting network’s memory requirements. Used as an enhancement mechanism for the neck’s architecture as well, as indicated in Sect. 3.1.1, this type of block integrates the so-called shortcut connections to allow the flow of information between shallow and late-stage layers and thus keeps the semantic richness of features [97]. The value of the residual connection goes beyond its accuracy-enhancement ability; it has also been adopted as the base structure for conceiving architectural alternatives equally advantageous in terms of detection accuracy, such as the bottleneck residual block [101], capable of fusing high-level multi-scale features, the inverted residuals and linear bottlenecks [92] (sketched below) that increase the representational power of channel-wise nonlinear transformations, and a lighter version [94] that, inspired by group convolutions and comprised of two different branches, leverages channel shuffle to allow information exchange between branches.
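
For concreteness, the following is a minimal PyTorch sketch of an inverted residual block with a linear bottleneck, the structure referenced above; the expansion factor and layer ordering follow the common MobileNetV2-style pattern and are not the exact blocks used in [92].

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual with linear bottleneck: a 1x1 expansion, a depth-wise
    3x3 convolution in the expanded space, and a 1x1 linear projection back to
    a narrow output, with a shortcut when shapes allow it."""
    def __init__(self, in_ch, out_ch, expand=6, stride=1):
        super().__init__()
        hidden = in_ch * expand
        self.use_shortcut = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),   # linear: no activation
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out

y = InvertedResidual(32, 32)(torch.randn(1, 32, 28, 28))  # shortcut active: (1, 32, 28, 28)
```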

3.1.3 Efforts for a more efficient backbone

Concerning the architectures integrated as backbone in the detection frameworks under examination, the related data collected in the table confirm the predominant, but not exclusive, use of simplified CNN architectures. Excluding FRDet [100]—with no representative data reported in Table 1 about the network or architecture used as backbone—twenty-four of the thirty frameworks use a lightweight subnetwork as backbone. Furthermore, in that group we can identify just ten distinct alternatives, a number that becomes even lower if they are grouped into architecture families: MobileNets [47, 48, 92, 99, 111,112,113, 115, 117], ShuffleNets [91, 94, 97, 104], SqueezeNet [96, 108, 118], PeleeNet [102, 107], and DarkNet-19 [103, 109, 114, 116]. Google’s MobileNets emerges as the dominant lightweight architectural solution. This observation, although it ignores configuration or structural efficiency matters (they will be addressed in the next section), is entirely consistent with the evolutionary sequence of on-device vision models reported in the literature, where MobileNets, first introduced in 2017 [47] and available in three different versions, stands as the most mature compact alternative as well as one of the main drivers of the growing attention that ultra-compact vision models have generated in the research community during the last few years. Finally, regarding the rest of the backbone-specific architectures referred to in the table, apart from the lightweight alternatives, it is possible to identify a second group that encompasses five standard CNN architectures [93, 95, 98, 101, 110], where, beyond the mere intuition of a more specific nature, it is not possible to infer any pattern of interest for this analysis.

In a joint review of the several architectures adopted as backbone and the adjustments applied to them, it is possible to extract observations that, while not backed by specific metrics and measurements, provide valuable intuition about the performance and, in general, the suitability of the architectural solutions proposed. In that sense, even though lightweight CNN architectures have been designed from scratch, bearing in mind the hardware limitations of the target devices, and have largely succeeded in deriving models of extremely reduced size and complexity, they may still be insufficient or inadequate solutions depending on various factors such as the hardware platform and the application domain. This reality is reflected in Table 1, where few studies report directly employing compact CNN architectures as the backbone of the detector [47, 48, 97, 105, 107,108,109, 111, 113, 118], while a fair majority proposes specific enhancements or optimizations [91, 92, 94, 96, 99, 102,103,104, 106, 112, 114, 116, 117] for the architectures previously selected. Going into more detail, a closer look at the data on such modifications allows us to identify an approach fundamentally oriented toward obtaining greater precision [91, 92, 94, 99, 103, 104, 112, 114, 117], which confirms the need to make up for the accuracy degradation typically resulting from the structural simplification or miniaturization of the network.

Limiting our focus to the modifications applied to the architecture selected as the starting point for building the backbone, it is possible to categorize the strategies and methods listed in the table into the same two groups considered for this purpose in Sects. 3.1.1 and 3.1.2. Hence, we identify once again a group of techniques on one side of the table that respond to a size and latency reduction approach and, on the other side, a collection of methods focused on preserving and increasing accuracy.

The first group contains techniques basically aimed at reducing the computational cost of the network. As we pointed out in relation to the mechanisms designed for enhancing the head’s structure, it is possible to identify two different types of solutions within this group according to the architectural level they operate on. In particular, at the microarchitectural level we find (i) approaches based again on the exploitation of more efficient variants of the convolution operation, such as depth-wise convolutions [95, 98, 102], depth-wise separable convolutions [93, 116], and group convolutions [98]; and (ii) strategies that directly affect the configuration of the convolution filters used in layers or blocks of the CNN, i.e., both methods targeting the number of filters—reducing the number of filters in the first layer of the network [112], selecting an optimal number of filters per building block [96], and choosing a better compression rate for Fire modules [100]—and additional techniques focused on the number of channels—linearly increasing the number of channels as the network deepens [95], or assigning the same number of channels to both the input and output of residual blocks such as Res2Net [101]. Finally, in addition to the slimming strategies pointed out, there is a second collection of solutions that address the same problem but through a macro lens, exploring different design options such as the removal of some layers present in the original architecture, the CSP-ization of the network [106], the use of Fire modules [93, 103] (with greater ability to reduce the number of parameters), the thinning of layers and building blocks [93], and a more appropriate distribution of subsampling layers across the network [93, 101].

To close this review of the specific tweaks performed on the backbone, we can also identify in Table 1 a substantial number of studies that explore alternatives in the search for greater accuracy in object detection. Among the options listed, the residual block structure, based on shortcut connections, is revealed as the most versatile approach in this regard, constituting an effective solution for increasing accuracy in both classification and detection tasks and also being the preferred option [98,99,100, 103, 114] among the several related alternatives presented in the table. Interest in the ability of residual connections to enable better feature propagation and guarantee maximum information flow across the network goes beyond the residual block. Thus, that type of connection has been successfully incorporated into other building blocks, being used, for example, as an upgrade to Fire modules [100] or as an integral part of the inverted residual blocks exploited in [99] to achieve better multi-scale detection. In addition to the detection-specific techniques for better accuracy just mentioned, the accuracy-focused approach also encompasses a second collection of methods primarily designed to provide better class predictions. In the table, we can distinguish the following: (i) approaches that seek to increase the size of the receptive field through, among other practices, the use of larger convolution filters [91, 104, 117], the insertion of bottleneck layers for subsampling at different stages of the network [101], and the use of dilated convolutions in the network stem [92, 94]; (ii) studies such as [94] or [91] that, by using either dilated convolutions or convolutional filters with a higher number of channels in early stages of the network, seek to extract and preserve more low-level features; and, finally, (iii) more specific alternatives, already mentioned in the two previous analyses of neck- and head-oriented modifications, such as the use of the h-swish activation function [112], the inclusion of attention modules to increase the representational power of the network [92, 117], or the application of channel shuffle after group convolutions to enable information exchange between groups (sketched below).
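
The channel shuffle operation mentioned above fits in a few lines; the PyTorch sketch below shows the standard reshape–transpose–flatten formulation applied after group convolutions.

```python
import torch

def channel_shuffle(x, groups):
    """ShuffleNet-style channel shuffle: after a group convolution, channels
    are interleaved across groups so that information can flow between them
    in the next grouped layer."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # interleave the groups
    return x.view(b, c, h, w)

x = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())
# [0, 4, 1, 5, 2, 6, 3, 7] -> channels from the two groups are interleaved
```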

3.2 Highly efficient CNN architectures for backbone construction

The analysis of the mechanisms and design strategies adopted for conceiving lightweight detection frameworks reveals a constant interest in finding a better trade-off between accuracy and detection speed. In this context, the backbone constitutes the key component within the object detector architecture, not only because it lays down the structural guidelines for detectors but also because it is the component responsible for processing input images in the first stage of the detection pipeline in order to extract the features that are later supplied to the two remaining components of the detector. Backed by the data collected in Table 2, we extend the analysis performed in Sect. 3.1.3 with additional lightweight CNN architectures that, despite not having been used to date for building detection frameworks in the on-device context, have been entirely conceived under the design principles of this paradigm. As in the different subsections included in Sect. 3.1, we will address the structural specificities of the different CNN architectures considered, focusing our efforts on identifying the principal techniques and methods applied in each case.

In a first high-level review of the data included in the table, focused only on the first four columns, it is possible to derive several general points that add further detail to the on-device scenario presented so far. The architectural developments, with the exception of the 2016 study by Iandola et al. [49], are temporally located between 2017 and 2021, just like the different detection frameworks analyzed above. Once again, this confirms the chronological parallelism between lightweight-CNN-specific and ultra-compact-detector-specific development approaches already pointed out in Sect. 3.1.3, and it also reinforces the key role that recent general computer vision progress has played in developing ultra-compact detection systems. Regarding the CNN architectures used as a reference for conceiving algorithmic solutions deployable on low-powered devices, except for a handful of authors working on conventional CNNs [47, 49, 54, 55, 101, 105, 107, 125, 128, 129], the mainstream focus has been on exploiting lightweight architectures as the starting point. Moving on to the detail of the specific architectures used for that purpose, a family-based grouping of the several approaches considered can be easily observed (MobileNets [48, 130,131,132,133,134], ShuffleNets [50, 52, 135], SqueezeNet [136], and CondenseNet [55]), bearing a strong similarity to the approach laid out in the previous section, with the MobileNets family again emerging as the most used lightweight architectural solution. The residual structure also stands out as the design pattern with the most significant presence in these architectures, either directly as part of the base standard CNNs [54, 55, 101, 107, 125, 128, 129] or as the base structure of the novel building blocks resulting from the enhancement techniques and methods applied.

If we turn our attention to the various adjustments made, the first thing that stands out is the significantly higher number of enhancements listed in Table 2 for each study compared with the number of entries for the same concept in Table 1. Those data highlight the complexity of the miniaturization and rational simplification of CNN architectures for use on edge devices, as well as the enormous research efforts undertaken in this line in recent years, which have made it possible to enhance and streamline the design of specific on-device solutions in different vision-related application domains, such as object detection. There is, however, a gap between the predominant backbone-focused adjustment type found in Table 1 and the type derived from Table 2. More specifically, in the first case, we mostly find techniques and methods that emphasize achieving greater precision for producing a lightweight backbone design (mainly defined by the base CNN architecture), yet with an effective expression capacity to properly act as a structuring element in detection systems. In contrast, for the adjustments listed in Table 2, the focus is placed on obtaining more efficient CNN architectures (in particular, in seventeen [47,48,49,50, 52, 54, 55, 104, 107, 125, 128,129,130,131, 133, 135, 136] of the twenty-one works under study), also considering new avenues of exploration in this respect, such as the reduction of memory access cost or the exploitation of more efficient optimized-implementation-based operations at the code level.

In terms of scope at the architectural level, we can make a first classification of the enhancement techniques and methods considered into two different types of approach: those operating at the microarchitectural level, i.e., at the inner level of layers and modules; and those working at the macroarchitectural level, defining arrangement-specific aspects of the different modules or layers within the CNN architecture. Beyond the data collected in the Architectural scope field in Table 2, which aims to capture the general essence of the several related works listed, a more detailed analysis of the reported adjustments provides a much more accurate picture of the trend in terms of structural design, especially given such a substantial body of information as the one presented in the table. Thus, although a majority of microarchitectural adjustments can already be noted from the data in the Architectural scope field, this becomes even more evident when the data included in the Adjustments column are incorporated into the study. Numerically speaking, only fifteen macroarchitectural adjustments are identified, compared to the fifty observed at the microarchitectural level. More specifically, the first group of approaches encompasses strategies to enhance the CNN’s overall architecture by (i) replacing a certain type of layer or building block with more lightweight alternatives [49, 131, 132] or variants with greater capacity to maintain or even increase the expressiveness of the network [101], (ii) implementing guidelines governing how certain network properties or elements evolve as the network becomes deeper [49, 55], and (iii) appropriately configuring the connections between layers or modules [48, 55, 105, 125, 129]. Within the group of micro approaches, we find a wide range of options that can be categorized into two distinct subgroups: an initial collection of techniques that focus on convolutional-filter-specific aspects or properties such as the number of filters [107], their size in the spatial dimension [49], the number of channels [49, 52, 101, 105, 107, 130], the communication between them [50, 54], or the number of channel groups [101]; and a second subgroup encompassing methods targeting the internal structure of layers or modules, such as the exploitation of alternative operations to convolution [47, 48, 50, 52, 54, 105, 107, 128, 130, 131, 133,134,135,136], the replacement [48] or omission [133] of nonlinearity, or the application of an attention mechanism [53, 132, 133].

Keeping structural consistency with the different subsections in Sect. 3.1, we establish a second categorization of the adjustments under consideration, according to the network properties targeted. Thus, we identify techniques that respond to a size and latency reduction approach, as well as methods focused on preserving and increasing accuracy as much as possible.

In the first category, the usage of less costly convolutions—depth-wise convolution [47, 48, 54, 105], separable convolution [136], and depth-wise separable convolution [53]—stands out again as the most common approach, extended in this case by the exploration of other practices that revolve around additional efficient operations: replacing costly standard convolutions with memory shift operations [50, 128] for information fusion, information exchange between channels, and channel concatenation; replacing 1×1 group convolutions with a less complex channel split operation [52]; using simpler linear operations for partially generating feature maps [133] instead of fully using convolutions for that purpose (see the sketch below); or designing a novel building block to encode spatial and channel information with higher efficiency than depth-wise separable convolutions [135]. Precisely in relation to channels and, specifically, to the introduction of sparsity in the connections, we identify a second large group of adjustments that encompasses some of the strategies already observed in previous analyses, such as the replacement of pointwise convolutions with group convolutions [54, 132], but also previously unseen related techniques, such as the replacement of pointwise convolutions with sparser channel-wise convolutions [131], or the conception of a novel type of convolution that extends group convolution and, in contrast to the latter, allows an output channel to depend on an arbitrary subset of input channels, thus obtaining greater computational efficiency and reducing the number of parameters. Also related to channels, but in this case focused on the number of channels handled, we identify additional enhancement strategies that pursue channel reduction [49, 136] and some others that lead to interesting guidelines about what the ratio between the number of input and output channels should be in order to lower the computational cost [105, 107] or the memory access cost [52]. Rounding off this collection of efficiency-focused adjustments, there are several solutions of a more specific nature, such as exploiting convolutions with a more efficient software implementation [107, 136], using more efficient residual structures [125, 129] or downsampling strategies [105], omitting the h-swish nonlinearity due to its high latency [133], merging successive element-wise operations and thus lowering this type of costly operation in terms of memory access [52], removing redundant connections [55], using smaller-sized convolutional filters [49], or replacing heavy layers with a lighter alternative [49].
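
As an illustration of the "simpler linear operations for partially generating feature maps" idea referenced above, the following PyTorch sketch produces half of the output channels with a regular convolution and derives the rest from them with a cheap depth-wise convolution; the 50/50 split, kernel sizes, and class name are illustrative assumptions rather than the exact design of [133].

```python
import torch
import torch.nn as nn

class CheapFeatureModule(nn.Module):
    """Sketch of the 'cheap operations' idea: only half of the output channels
    are produced by a regular convolution; the other half are derived from
    them with an inexpensive depth-wise convolution, and the two sets are
    concatenated."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        primary = out_ch // 2
        self.primary_conv = nn.Conv2d(in_ch, primary, kernel_size=1, bias=False)
        self.cheap_op = nn.Conv2d(primary, out_ch - primary, kernel_size=3,
                                  padding=1, groups=primary, bias=False)

    def forward(self, x):
        main = self.primary_conv(x)
        return torch.cat([main, self.cheap_op(main)], dim=1)

y = CheapFeatureModule(64, 128)(torch.randn(1, 64, 16, 16))   # (1, 128, 16, 16)
```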

Finally, regarding refinement approaches expressly designed to preserve or increase accuracy, it is possible to distinguish a considerable range of different techniques, which are, however, practically evenly distributed. Specifically, except for just one of the adjustments under consideration, it is possible to cluster the options listed into seven distinct groups, each comprising two specific strategies or methods. We have identified the following groups in the table: (i) approaches aiming to prevent feature map size reduction in order to avoid harming the network expressiveness, either by delaying subsampling, i.e., moving subsampling layers or blocks to deeper stages of the network [49], or by using transition layers—composed of convolution and pooling operations—without compression [107]; (ii) methods that, like channel shuffle [54] already introduced in Sect. 3.1.3, enable information exchange between channels, either via more efficient memory shift operations [128], or in a more straightforward way replacing point-wise group convolutions with alternatives that do not block the above-mentioned information exchange between groups [53]; (iii) techniques based on the exploitation of the residual block structure, adding to the network architecture not only the already well-known shortcut connections [48], but also dense connections to boost feature reuse [55]; (iv) strategies that rely on the gradual increase of the growth rate in dense-connection-based networks [55, 107] to cost-effectively increase feature expressiveness (a toy sketch of a dense block with a fixed growth rate is given below); (v) approaches focused on the receptive field that, in line with some of the practices already identified for neck and head refinement, aim to both increase its size [134] and generate variations with different scales [107]; (vi) the integration of attention mechanisms [132, 133] to boost representational power; and even (vii) simpler practices such as removing nonlinearities in shallow layers to preserve the representational capacity of the network as well [48].
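
To make the dense-connection and growth-rate ideas in points (iii) and (iv) concrete, the following is a toy PyTorch dense block; the growth rate, depth, and layer composition are illustrative assumptions and do not reproduce any specific architecture in Table 2.

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Minimal dense block: every layer receives the concatenation of all
    previous feature maps and adds 'growth_rate' new channels, which boosts
    feature reuse at a modest parameter cost."""
    def __init__(self, in_ch, growth_rate=16, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, kernel_size=3, padding=1, bias=False)))
            ch += growth_rate

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

out = TinyDenseBlock(32)(torch.randn(1, 32, 8, 8))   # 32 + 3*16 = 80 channels
```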