1 Introduction

Machine Learning (ML) and Deep Learning (DL) techniques have spread widely across the most diverse areas in the last decade, achieving state-of-the-art results in several fields. Convolutional Neural Network (CNN) models—e.g., ResNet [29] or Inception [51]—and their recent extensions with Vision Transformers [24, 55] provide a classification of input images. Similar architectures—e.g., Yolo [47]—can also identify the position of the detected objects. Other examples of DL applications are natural language processing [14, 62], anomaly detection [52], event-based cameras [10], and autonomous driving [11, 12].

All DL techniques share some common characteristics. The first is the huge number of parameters. To give an idea, the ResNet CNNs have 11 to 60 million parameters, whereas Inception-V3 has 24 to 43 million. The image and language modeling architectures based on transformers have one or more orders of magnitude more parameters, e.g., the ViT architectures [24] have 86 to 632 million parameters, BERT [14] 110 to 345 million, and GPT-3 [8] 175 billion. The memory required by the parameters alone thus ranges from dozens of megabytes to several gigabytes. Secondly, the DL training—i.e., the optimization problem to find the parameters’ values—is time-consuming (and extremely energy-hungry), requiring days or weeks on high-performance computers with clusters of GPUs or TPUs. For instance, [48] roughly estimates 4.5 GPU-years to complete the optimal training of their DL models, whereas [5] estimates 16 GPU-weeks to train the Inception-V3 CNN. Finally, although the inference—i.e., the processing of one possibly unseen input by a trained DL model—is significantly simpler than training, the number of multiply-add operations required is huge. As an example, the ResNet CNNs require 5 to 11 billion multiplications to classify one image.
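To give a concrete sense of these figures, the following minimal Python sketch estimates the parameter memory as the parameter count times the bytes per parameter; the parameter counts used are indicative values consistent with those above, not figures taken from the cited papers.

```python
# Rough estimate of the memory needed to store the model parameters alone
# (indicative parameter counts; actual footprints depend on the framework and format).
def param_memory_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Memory in megabytes for num_params parameters stored at bytes_per_param bytes each."""
    return num_params * bytes_per_param / (1024 ** 2)

for name, params in [("ResNet-18", 11_700_000),
                     ("Inception-V3", 24_000_000),
                     ("ViT-Huge", 632_000_000)]:
    print(f"{name}: ~{param_memory_mb(params):.0f} MB with 32-bit floats")
```

Even the smallest of these models exceeds the memory budget of typical MCUs by two to three orders of magnitude, which motivates the approximation and distribution techniques discussed in the following sections.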

Following the tremendous spread of pervasive devices—Embedded Systems, Internet of Things (IoT) units, and MicroController Units (MCUs)—intelligence for pervasive devices emerged as a novel, challenging, and breakthrough research direction and application of ML and DL. On the one hand, pervasive devices are nowadays used in a broad range of applications, such as autonomous driving, smart cities, and smart industries. On the other hand, moving the (machine and deep learning-based) intelligent processing—e.g., fault detection or change detection algorithms [22]—as close as possible to where data are generated is crucial to support real-time applications, prolong the system lifetime, and increase the Quality-of-Service [2, 50, 53].

In contrast to these benefits of intelligence for pervasive devices stand the constraints on memory, computation, and energy characterizing such devices. On IoT units, the memory capacity ranges from kilobytes to megabytes [21], and the power consumption should be in the order of milliwatts [44]. MCUs, instead, rarely have 1 MB of memory and require their applications to run in the microwatt range [38]. For instance, one of the most powerful MCUs is the STM32H743ZI (by STMicroelectronics), which has a 480 MHz Cortex-M7 processor and 1024 KB of RAM, further split into five blocks with different speeds and purposes. The majority of MCUs have instead 96 to 256 KB of RAM.

This work addresses the challenges mentioned above by designing deep learning-based intelligent techniques for pervasive devices, i.e., DL models whose requirements in terms of memory footprint, computational load, and energy consumption do not exceed the corresponding technological constraints of the IoT or microcontroller unit where they are executed. In the related literature, Tiny Machine Learning (TML) [16, 27, 38, 49] is a newborn research direction with the same goals. In particular, TML takes into account all three main approaches developed in the related literature to reduce a DL model’s complexity (see Sect. 2) and provides a unified and comprehensive approach to the problem [2, 20]. More in detail, almost all TML works focus on allowing ML or DL inference on IoT units or MCUs, with very few works tackling on-device learning [9, 19, 20]. On-device learning is a demanding task, but it allows adapting the TML model over time from the incoming novel data and in response to concept drift, i.e., statistical variations of the data-generation process. Concept drift is widespread in real-world scenarios—typical causes are sensor faults, aging or seasonality effects, and user behavior changes—and, if not correctly addressed, might have catastrophic impacts on the TML model [22].

In the aforementioned scenario, this work proposes a methodology to design and deploy Deep and Wide Tiny Machine Learning solutions for the considered tiny device(s), i.e., IoT units, embedded systems, or MCUs. Specifically, the methodology deals with the problem from two different perspectives. The first one, presented in Sect. 3, details how to design inference-based Deep TML solutions, i.e., TML solutions based on Deep Learning. Section 4 extends such an approach by introducing on-device learning mechanisms. Section 5 discusses the second approach, namely Wide (Deep) TML, which splits the DL computation over a set of possibly heterogeneous tiny devices.

2 Related Literature

The related literature addresses the problem of reducing the DL models’ complexity (with or without the goal of matching the technological constraints of some target devices) in three main ways, detailed in the following paragraphs. Please refer to [15] for a detailed version of this analysis.

DL Hardware Design. The design of dedicated hardware solutions provides the best power consumption and inference times, at the expense of a complex design phase and reduced flexibility. Examples of newly designed hardware are Tensor Processing Units (TPUs) and Neural Processing Units or Accelerators [4, 34].

Approximating DL. The two main approximation techniques for DL models are task dropping (or pruning) and precision scaling [2, 40, 45]. Pruning techniques are organized into structured algorithms, which prune the structure of DL models (e.g., entire layers), and unstructured ones, which can also trim the parameters within a layer without preserving the structure [23, 37, 43, 59]. Precision scaling mechanisms aim instead at replacing the classical 32-bit floating-point representation of parameters with 8-bit data types, novel or fixed-point data types, as well as ternary or binary ones [33, 64].

Other approaches encompass the definition of optimized architectures or layers—e.g., the depth-wise convolutions in MobileNet [31] or the groups of convolutions in ShuffleNet [61]—the definition of sparse architectures [39, 42], and the introduction of Early-Exits [7, 17].

Distributing DL. In this approach, the DL computation is distributed over a set of heterogeneous tiny devices, without introducing any approximation but adding data transmission as a new layer of complexity [32, 54, 63].

Machine Learning in the Presence of Concept Drift. The literature about ML or DL in the presence of concept drift organizes the works into two main classes [22, 28]: passive solutions [25, 36] always adapt the model, independently of any evidence of a concept drift, whereas active approaches [18, 57, 60] employ techniques to detect changes and then adapt the model.

3 Deep Tiny Machine Learning

Let \(\mathcal {D}\) be a DL model that provides an output \(\hat{y}\) for an input x, hence \(\hat{y} = \mathcal {D} \left( x\right) \). Its processing pipeline is a sequence of M layers \(\varphi _{\theta _\ell }^{(\ell )}\), \(\ell \in \left\{ 1, \ldots , M\right\} \), with parameters \(\theta _{\ell }\), where each layer can be, for instance, a convolution, a pooling operator, or a non-linearity. Such a pipeline is then formalized as \( \hat{y} = \mathcal {D}\left( x\right) = \varphi _{\theta _M}^{(M)} \left( x_{M-1}\right) , \) with \( x_{\ell } = \varphi _{\theta _\ell }^{(\ell )} \left( x_{\ell -1}\right) , \) and where \(x_0\) degenerates to the input x.
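This pipeline formalization can be mirrored in a few lines of code. The following is a minimal, framework-agnostic Python sketch in which `run_pipeline` is a hypothetical helper and each layer \(\varphi _{\theta _\ell }^{(\ell )}\) is an arbitrary callable:

```python
from typing import Callable, Sequence

Layer = Callable[[object], object]   # a layer maps x_{l-1} to x_l

def run_pipeline(layers: Sequence[Layer], x):
    """Evaluate y_hat = D(x) = phi_M(phi_{M-1}(... phi_1(x) ...))."""
    for phi in layers:               # x_0 = x, then x_l = phi_l(x_{l-1})
        x = phi(x)
    return x                         # x_M = y_hat

# e.g., run_pipeline([lambda v: 2 * v, lambda v: v + 1], 3) returns 7
```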

Fig. 1. A sketch of the proposed mechanisms to reduce the memory footprint of a Deep Learning model \(\mathcal {D}\)

Let \(\varOmega \) be a transformation function that takes as inputs a DL model \(\mathcal {D}\) and a set of approximation parameters, and provides as output the corresponding approximated DL model \(\hat{\mathcal {D}}\). The definition of such a function, as well as the set of possible approximation parameters, strictly depend on the considered approximation techniques. For the purpose of this presentation, two approximation techniques are employed and shown in Fig. 1:

  • task dropping (or structured pruning), defined as the skipping of some tasks to save memory and computation [45]. In our case, task dropping, parameterized by \(\tilde{m} \le M\), consists in keeping the first \(\tilde{m}\) layers of the original DL model \(\mathcal {D}\), i.e., \( \hat{\mathcal {D}} = \varOmega \left( \mathcal {D}, \left\{ \tilde{m},\mathcal {K}\right\} \right) = \mathcal {K}_{\theta _\mathcal {K}}\left( \varphi _{\theta _{\tilde{m}}}^{(\tilde{m})} \left( x_{\tilde{m}-1}\right) \right) , \) where \(\mathcal {K}\) is an additional layer (e.g., a classifier) that may be added on top of the approximated pipeline to maintain the original DL model’s scope. Other task dropping techniques can be considered as well, such as keeping non-consecutive layers in the approximated pipeline, unstructured pruning techniques, or layer-specific mechanisms (e.g., pruning filters within convolutional layers);

  • precision scaling, which consists in changing the representation of the parameters \(\theta _{\ell }\) of a DL model \(\mathcal {D}\). The transformation is parameterized by q and is defined as \( \hat{\mathcal {D}} = \varOmega \left( \mathcal {D}, \left\{ q\right\} \right) = \varphi _{\tilde{\theta }_M}^{(M)} \left( \ldots \left( \varphi _{\tilde{\theta }_{1}}^{(1)} \left( x\right) \right) \right) , \) where \(\tilde{\theta }_\ell \) represents the approximation of the parameters \(\theta _{\ell }\). In the simplest version, q represents the number of bits at which to quantize the parameters (e.g., 1, 2, or 8 bits). Still, custom data types or different quantization schemes for each DL model layer could be considered as well (a code sketch of both approximations follows this list).
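The following is a minimal NumPy sketch of the two transformations \(\varOmega \), assuming that a model is represented as a list of callables with a parallel list of parameter tensors; `task_dropping` and `precision_scaling` are hypothetical helper names, and the quantization uses a plain uniform min-max scheme as a stand-in for the schemes in [33, 64]:

```python
import numpy as np

def task_dropping(layers, params, m_tilde, new_head):
    """Keep the first m_tilde layers of D and append a new head K (e.g., a classifier).
    `layers` and `params` are parallel lists of callables and parameter tensors."""
    return layers[:m_tilde] + [new_head], params[:m_tilde]

def precision_scaling(params, q=8):
    """Quantize each parameter tensor to q bits with a uniform min-max scheme."""
    approximated = []
    for theta in params:
        lo, hi = float(theta.min()), float(theta.max())
        levels = 2 ** q - 1
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = np.round((theta - lo) / scale)      # integer codes in [0, levels]
        approximated.append(codes * scale + lo)     # de-quantized approximation of theta
    return approximated
```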

Finally, let \(\bar{m}\) and \(\bar{c}\) be the memory and computational constraints of a target tiny device, respectively. The definition of energy constraints is more complex, since every DL inference affects the energy budget, whereas the memory and computational footprints remain fixed. As a consequence, energy constraints are not employed in the following, and the device is assumed to be plugged into an energy source.

The methodology. In this scenario, a methodology \(\mathcal {M}\) to automatically design Deep Tiny Machine Learning solutions, i.e., DL-based TML models satisfying the technological constraints of the tiny device executing them, is defined as the function \( \mathcal {M} \left( \mathcal {D}, \varOmega , \mathcal {T}, \bar{m}, \bar{c} \right) , \) as in [2], with \(\mathcal {T}\) being a validation set. Such a methodology operates in two steps. The first step identifies the set \(\mathcal {F}_{all}\) of all the feasible solutions. Formally, \( \mathcal {F}_{all} = \left\{ \hat{\mathcal {D}} \text { s.t. } m_{\hat{\mathcal {D}}} \le \bar{m} \text { and } c_{\hat{\mathcal {D}}} \le \bar{c}\right\} , \) where \(m_{\hat{\mathcal {D}}}\) and \(c_{\hat{\mathcal {D}}}\) represent the memory and the computational requirements of the approximation \(\hat{\mathcal {D}}\), respectively. The set is generated by applying the transformation \(\varOmega \) for all the considered approximations over the range of their parameters (in our case \(\tilde{m} \in \left\{ 1, \ldots , M\right\} \), a set of possible layers \(\mathcal {K}\), and a set of precision scaling mechanisms q). A trade-off between the granularity of the explored approximations and the corresponding exploration time is hence expected.
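A minimal sketch of this first step is given below; the per-layer footprints are synthetic placeholders standing in for \(m_{\hat{\mathcal {D}}}\) and \(c_{\hat{\mathcal {D}}}\), and the exploration simply enumerates the \((\tilde{m}, q)\) pairs and keeps the combinations within the budgets \(\bar{m}\) and \(\bar{c}\):

```python
from itertools import product

# Hypothetical per-layer memory (KB) and computation (MACs) of the original model D.
layer_mem_kb = [120, 90, 60, 45, 30]
layer_macs   = [4e6, 3e6, 2e6, 1e6, 5e5]

def footprint(m_tilde, q):
    """Memory (KB) and computation (MACs) of keeping m_tilde layers quantized at q bits."""
    mem  = sum(layer_mem_kb[:m_tilde]) * q / 32     # precision scaling shrinks the memory
    macs = sum(layer_macs[:m_tilde])
    return mem, macs

m_bar_kb, c_bar_macs = 256, 8e6                     # budgets of the target tiny device
F_all = [(m_tilde, q)
         for m_tilde, q in product(range(1, len(layer_mem_kb) + 1), [8, 16, 32])
         if footprint(m_tilde, q)[0] <= m_bar_kb and footprint(m_tilde, q)[1] <= c_bar_macs]
print(F_all)   # the feasible (m_tilde, q) pairs
```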

The methodology’s second step takes as input the set of feasible solutions \(\mathcal {F}_{all}\) and the validation set \(\mathcal {T}\) and provides as output a Pareto front \(\mathcal {F}\) of those solutions. More in detail, the front is defined by the triples \(\langle \hat{\alpha }_i, m_{\hat{\mathcal {D}}_i}, c_{\hat{\mathcal {D}}_i} \rangle \), computed by training only the layer \(\mathcal {K}\) (in a transfer-learning fashion [58]) on disjoint subsets of \(\mathcal {T}\) and then evaluating it with the metric \(\hat{\alpha }\).
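The Pareto front over the resulting triples can be extracted by discarding dominated solutions; the following is a minimal sketch assuming the metric \(\hat{\alpha }\) is to be maximized and memory and computation are to be minimized:

```python
def pareto_front(solutions):
    """Keep the non-dominated triples (accuracy, memory, computation): a solution is
    dominated if a different one is at least as accurate, at least as small,
    and at least as cheap."""
    front = []
    for acc, mem, comp in solutions:
        dominated = any(a >= acc and m <= mem and c <= comp and (a, m, c) != (acc, mem, comp)
                        for a, m, c in solutions)
        if not dominated:
            front.append((acc, mem, comp))
    return front

# pareto_front([(0.91, 180, 6e6), (0.88, 120, 4e6), (0.87, 150, 5e6)])
# keeps the first two triples and drops the third, dominated by the second.
```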

Selecting the target solution: a discussion. Since the methodology provides as output a Pareto front, the choice of the optimal solution for each application scenario is left to the methodology’s user. As an example, real-time scenarios suggest choosing the solution with the smallest computational complexity (probably paying in terms of accuracy). In contrast, in scenarios where accuracy is crucial, this reasoning no longer holds.

A second aspect to consider is the generalization of the chosen solution beyond the validation set \(\mathcal {T}\) (as it happens in autoML tools [26, 56]). Although not investigated for the proposed methodology, we expect that the introduced technological constraints will mitigate the problem by not allowing the creation of over-complex (and overfitting-prone) DL models.

4 On-Device Tiny Machine Learning

The need for adaptation. The methodology briefly introduced in Sect. 3 proposes a set of solutions whose inference is feasible on the desired tiny device. Consequently, a deployed solution can process novel unseen inputs x and provide the corresponding outputs \(\hat{y}\), but it will remain fixed over time. The latter is a major limitation in real scenarios for two main reasons. First, the data needed to train (more precisely, fine-tune) the DL solution might be extremely limited or available only incrementally. Hence, it is desirable to deploy a first (approximated) DL model on the tiny device and then rely on incremental learning mechanisms to improve the model on newly arrived data [19]. The second reason is the intrinsic non-stationarity of the environment in which the DL model operates, due to seasonality, faults in the sensors, or aging effects, to make a few examples. A model that does not adapt in response to an environmental change—formally, a concept drift, which can be viewed as a statistical variation in the behavior of the data-generation process—is likely to become useless quickly [22].

The proposed solution, presented in [19, 20], exploits the freedom that the methodology discussed in Sect. 3 allows in the choice of the additional layer \(\mathcal {K}\) to be added on top of the approximated DL model \(\hat{\mathcal {D}}\). Although a straightforward approach to the problem suggests relying on classical DL-based classifiers (e.g., fully-connected layers) or parametric, supervised machine learning classifiers (e.g., Support Vector Machines [13] or Random Forests [30]), here the choice is to rely on the non-parametric k-Nearest Neighbor (kNN) [3].

The kNN algorithm predicts according to the majority vote of the k samples nearest to the input x within its knowledge (or training) set \(\mathcal {T}\), where k is typically set as the ceiling of the square root of the cardinality of \(\mathcal {T}\) [1]. The main advantages of choosing the kNN are the absence of a training phase (providing a training set is enough) and the cost-free adaptation procedure: it is sufficient to modify the training set \(\mathcal {T}\) to update the kNN classifier as well. There are two drawbacks: the training set has to be kept entirely in memory, and the prediction time increases with the cardinality of the training set.
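A minimal NumPy sketch of such a kNN head, operating on the features extracted by \(\hat{\mathcal {D}}\) and assuming non-negative integer class labels, is the following (the class name `KNNHead` is a hypothetical choice):

```python
import numpy as np
from math import ceil, sqrt

class KNNHead:
    """Non-parametric classifier K placed on top of the approximated model D_hat.
    It stores the features extracted by D_hat and their integer labels."""

    def __init__(self, features: np.ndarray, labels: np.ndarray):
        self.features, self.labels = features, labels

    @property
    def k(self) -> int:
        return ceil(sqrt(len(self.labels)))            # k = ceil(sqrt(|T|)), as in [1]

    def predict(self, z: np.ndarray) -> int:
        dists = np.linalg.norm(self.features - z, axis=1)
        nearest = self.labels[np.argsort(dists)[:self.k]]
        return int(np.bincount(nearest).argmax())      # majority vote of the k nearest samples
```

Training amounts to storing the pairs \(\left( \hat{\mathcal {D}}(x), y\right) \) for the samples in \(\mathcal {T}\), and adapting the classifier amounts to editing the stored arrays, as done by the adaptation module discussed below.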

Fig. 2. The architecture of a DL model \(\mathcal {K} \circ \hat{\mathcal {D}}\) that can learn on device

Figure 2 shows the resulting architecture \(\mathcal {K}\circ \hat{\mathcal {D}}\) of a DL model that can learn on a tiny device, with an approximated DL model \(\hat{\mathcal {D}}\) followed by the kNN classifier \(\mathcal {K}\). More in detail, Fig. 2a shows the training of such an architecture, which consists in providing a training set \(\mathcal {T}\) to the kNN, whereas Fig. 2b shows the architecture during the operational life, with an adaptation module that, in a test-then-train scenario [18, 22], receives as input the classification \(\hat{y}=\mathcal {K}\circ \hat{\mathcal {D}}\left( x\right) \) of an input x and then the true label y.

Adaptation module. The adaptation module can adapt the model over time according to different policies (see Sect. 2 for the nomenclature); a minimal code sketch of these policies follows the list:

  • Incremental Approach [19]. This approach adds every supervised sample to the kNN’s training set. It may be useful when only a few data are available at the training stage.

  • Passive Update [20]. By definition, a passive approach adapts the model at every incoming (supervised) sample. In particular, in this scenario, only the misclassified samples are added to the kNN’s training set since they are assumed to bring more information about the underlying problem.

  • Active Update [20]. Active approaches rely on an explicit mechanism to detect a concept drift, namely, Change Detection Tests (CDTs). Examples of CDTs can be found in [6, 41, 46, 60]. Once a CDT detects a change, the adaptation takes place in two steps, assuming that the algorithm has kept a (history) window of the last seen samples. The first step identifies the samples within the history window that represent the new knowledge after the concept drift. Those samples are then used to update the model (here, they entirely replace the kNN’s training set).

  • Hybrid Update [20]. The hybrid approach fuses the passive and active updates to combine the benefits and mitigate the drawbacks of each approach. More in detail, the active update aims at detecting the concept drift and thus removing obsolete knowledge from the model. In contrast, the passive update continuously updates the model (also in stationary conditions). In such a way, the passive updates alleviate the case where the active approach does not detect a concept drift, whereas the active approach keeps under control the memory footprint, which otherwise grows indefinitely with the passive updates.
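The sketch below builds on the kNN head sketched above and illustrates the four policies. Two simplifications are deliberate assumptions of this sketch, not part of [19, 20]: the change-detection test is a plain error-rate threshold over a sliding window, standing in for the CDTs of [6, 41, 46, 60], and the whole window is used as the post-change knowledge.

```python
import numpy as np

class AdaptationModule:
    """Sketch of the incremental, passive, active, and hybrid adaptation policies."""

    def __init__(self, knn, policy="hybrid", window=50, cdt_threshold=0.4):
        self.knn, self.policy = knn, policy
        self.window, self.cdt_threshold = window, cdt_threshold
        self.history = []                              # last (feature, label, correct) triples

    def update(self, z, y_true, y_pred):
        correct = (y_pred == y_true)
        self.history = (self.history + [(z, y_true, correct)])[-self.window:]

        if self.policy == "incremental":
            self._add(z, y_true)                       # add every supervised sample
        if self.policy in ("passive", "hybrid") and not correct:
            self._add(z, y_true)                       # add only the misclassified samples
        if self.policy in ("active", "hybrid") and self._change_detected():
            self.knn.features = np.stack([h[0] for h in self.history])
            self.knn.labels = np.array([h[1] for h in self.history])
            self.history = []                          # the new knowledge replaces the old set

    def _add(self, z, y):
        self.knn.features = np.vstack([self.knn.features, z])
        self.knn.labels = np.append(self.knn.labels, y)

    def _change_detected(self):
        if len(self.history) < self.window:
            return False
        error_rate = 1.0 - np.mean([h[2] for h in self.history])
        return error_rate > self.cdt_threshold
```

During the operational life of Fig. 2b, the test-then-train loop first calls the kNN head to obtain \(\hat{y}\) and then, once the true label y becomes available, calls update.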

Limitations. The main limitation of the proposed approach is that the on-device adaptation mechanisms are supervised. Introducing semi-supervised or unsupervised adaptation mechanisms will broaden the range of applications, since supervised information is rarely, or only occasionally, available in many real scenarios.

5 Wide (Deep) Tiny Machine Learning

Sections 3 and 4 enable the inference and the on-device training of DL models on tiny devices by relying on approximation mechanisms that reduce the memory and computational requirements of the DL model until the technological constraints of the tiny device in terms of memory and computation are met. This approach comes at the expense of a drop in the DL model’s evaluation metric. A different approach to the problem aims at distributing the computation over a set of possibly heterogeneous tiny devices. In this way, the original DL model is split into smaller tasks—e.g., each DL layer can be a task—that are feasible for the tiny devices executing them; since no approximation is introduced, the DL model is expected to behave in the same way as if it were not distributed. Tiny devices are referred to as IoT units or units in the following.

The methodology. A methodology working in this scenario requires the definition of:

  • The IoT system model. A sample IoT system comprises C data-acquisition units \(\{s_1, \ldots , s_C\}\), C target units \(\{f_1, \ldots , f_C\}\), and a set \(\mathbb {N}_N = \{1, \dots , N\}\) of N IoT computing units. Each unit has a memory capacity \(\bar{m}_i\) and an available computation \(\bar{c}_i\), for each \(i \in \mathbb {N}_N\). See Fig. 3b and c as an example.

    Such a system is modeled as a graph with nodes \(V=\mathbb {N}_N \cup \left\{ s_1,\dots ,s_C,f_1,\dots ,f_C\right\} \) and arcs connecting units within the range of the transmission technology. In this formalization, the distance \(d_{i_1,i_2}\), for each \(i_1,i_2 \in V\), is the shortest communication path between units \(i_1\) and \(i_2\) within the graph.

  • The split-policy. The simplest way to split a DL model is layer-wise. Other policies, e.g., also splitting within the DL layers (e.g., the convolutional filters as in [63]), could be considered as well.

    In the most general formulation, the methodology distributes C DL models, i.e., one per data source. The u-th DL model, for each \(u \in \mathbb {N}_C= \{1, \ldots , C\}\), has \(M_u\) layers, each of which has: memory demand \(m_{u,j}\); computation \(c_{u,j}\); and activations of size \(K_{u,j}\), for each \(j \in \{1, \ldots , M_u\}\).

  • The latency. The transmission latency \(t^{(t)}_{u,i,k,j}\) (between units \(i, k \in \mathbb {N}_N\)) models the time required to transmit the activations \(K_{u,j}\) of layer j of DL model u and may include possible delays (e.g., due to transmission failures or queues). The processing latency \(t^{(p)}_{u,i,j}\) models the time unit i requires to carry out the computation \(c_{u,j}\) of the j-th layer of DL model u.

In these settings, the methodology [21] is a quadratic binary optimization problem on the variables \(\alpha _{u,i,j}\) (equal to one if unit i runs layer j of DL model u, zero otherwise) that minimizes the decision-making latency [2] (i.e., from source to target):

$$\begin{aligned} \sum _{u=1}^{C}\sum _{i=1}^{N}\sum _{k=1}^{N}\sum _{j=1}^{M-1} \alpha _{u,i,j}\cdot \alpha _{u,k,j+1}\cdot t^{(t)}_{u,i,k,j} + t^{(t)}_s + t^{(t)}_f + \sum _{u=1}^{C}\sum _{i=1}^{N}\sum _{j=1}^{M} \alpha _{u,i,j} \cdot t^{(p)}_{u,i,j}, \end{aligned}$$

where \(M=\max _u\{M_u\}\); the first term models the transmission times among IoT units; \(t^{(t)}_s\) and \(t^{(t)}_f\) subsume the transmission times from the sources and to the targets; and the last term models the processing time. Please refer to [21] for details.

Fig. 3. Distributing two 5-layer DL models over an IoT system with or without shared layers. The squares are Raspberry Pi 3B+ units and the circles STM32H7 MCUs. The transmission range is shown by the green dash-dotted line

Such an optimization problem is subject to (at least) three sets of constraints: two ensuring that the memory and computational technological constraints are met, and one guaranteeing that each DL layer is assigned exactly once:

$$\begin{aligned} \forall i \in \mathbb {N}_N, \sum _{u=1}^{C}\sum _{j=1}^{M} \alpha _{u,i,j} \cdot m_{u,j} \le \bar{m}_i \quad \text {and}\quad \forall i \in \mathbb {N}_N, \sum _{u=1}^{C}\sum _{j=1}^{M} \alpha _{u,i,j} \cdot c_{u,j} \le \bar{c}_i, \end{aligned}$$
$$\begin{aligned} \forall u \in \mathbb {N}_C, \forall j \in \mathbb {N}_M, \quad \sum _{i=1}^{N} \alpha _{u,i,j} = 1 \text { if } j \le M_u, \text { and } 0 \text { otherwise}. \end{aligned}$$

Other constraints can be introduced to model specific situations, as explained in [21]. For instance, Fig. 3 shows an application of the methodology where two 5-layer DL models (Fig. 3a) have to be placed in the IoT system shown in Fig. 3b and c. In particular, to ensure that the first two layers are shared between the two DL models in Fig. 3c, additional constraints force the first two layers of the DL models onto the same IoT unit and prevent those layers from being counted twice in the memory or computational budget [21]. A minimal sketch of the underlying placement problem follows.
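The following brute-force Python sketch illustrates the placement problem for a single DL model on a small IoT system; all per-layer and per-unit figures and the hop-independent transmission latency are synthetic assumptions. It enumerates the assignments \(\alpha \), discards those violating the memory and computational budgets, and keeps the one with the smallest latency, whereas the actual methodology [21] solves the quadratic binary program instead.

```python
from itertools import product

# Hypothetical instance: one DL model with M layers to be placed on N units.
M, N = 4, 3
layer_mem  = [80, 60, 40, 20]      # memory demand m_j per layer (KB)
layer_comp = [4, 3, 2, 1]          # computation c_j per layer (MMACs)
unit_mem   = [128, 128, 256]       # memory budgets m_bar_i (KB)
unit_comp  = [5, 5, 8]             # computation budgets c_bar_i (MMACs)
t_proc = [[j + 1 for j in range(M)] for _ in range(N)]                 # t^(p)_{i,j} (ms)
t_tx   = [[0 if i == k else 2 for k in range(N)] for i in range(N)]    # t^(t)_{i,k} (ms)

best, best_latency = None, float("inf")
for assign in product(range(N), repeat=M):   # assign[j] = unit running layer j (the alphas)
    feasible = all(
        sum(layer_mem[j]  for j in range(M) if assign[j] == i) <= unit_mem[i] and
        sum(layer_comp[j] for j in range(M) if assign[j] == i) <= unit_comp[i]
        for i in range(N))
    if not feasible:
        continue
    latency = sum(t_proc[assign[j]][j] for j in range(M)) + \
              sum(t_tx[assign[j]][assign[j + 1]] for j in range(M - 1))
    if latency < best_latency:
        best, best_latency = assign, latency

print(best, best_latency)   # layer-to-unit assignment and decision-making latency (ms)
```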

Limitations. The proposed methodology is a quadratic binary optimization problem that is NP-complete since it can be converted to a binary linear program, which is one of Karp’s 21 NP-complete problems [35].

The second main limitation is that the optimization problem does not consider energy constraints for the reasons stated in Sect. 3.

6 Conclusions

This work addressed the problem of designing DL models in a TML scenario, i.e., DL models whose requirements in terms of memory, computation, and energy match the corresponding technological constraints introduced by the IoT unit or the microcontroller on which they are deployed. To achieve this goal, two main approaches have been proposed: the first one relies on approximation techniques to reduce the memory and computational requirements, with the possibility of defining solutions that can learn directly on the device; the second one does not introduce any approximation in the DL model, but splits it into smaller tasks and distributes them over a set of possibly heterogeneous devices.

Future work encompasses the definition of semi-supervised on-device mechanisms, the introduction of energy constraints, and a different approach for the methodology in Sect. 5 that is not NP-complete.