1 Introduction

Machine Learning (ML) and Deep Learning (DL) techniques have spread widely across the most diverse areas in the last decade, achieving state-of-the-art results in several fields. Convolutional Neural Network (CNN) models—e.g., ResNet [29] or Inception [51]—and their recent extensions with Vision Transformers [24, 55] provide a classification of input images. Similar architectures—e.g., Yolo [47]—can also identify the position of the detected objects. Other examples of DL applications are natural language processing [14, 62], anomaly detection [52], event-based cameras [10], and autonomous driving [11, 12].

All DL techniques share some common characteristics. The first is the huge number of parameters. To give an idea, the ResNet CNNs have 11 to 60 million parameters, whereas Inception-V3 has 24 to 43 million. The image and language modeling architectures based on transformers have one or more orders of magnitude more parameters, e.g., the ViT architectures [24] have 86 to 632 million parameters, BERT [14] 110 to 345 million, and GPT-3 [8] 175 billion. The memory required by the parameters alone thus ranges from dozens of megabytes to several gigabytes. Secondly, the DL training—i.e., the optimization problem to find the parameters’ values—is time-consuming (and extremely energy-hungry), requiring days or weeks on high-performance computers with clusters of GPUs or TPUs. For instance, [48] roughly estimates 4.5 GPU-years to complete the optimal training of their DL models, whereas [5] estimates 16 GPU-weeks to train the Inception-V3 CNN. Finally, although the inference—i.e., the processing of one possibly unseen input by a trained DL model—is significantly simpler than training, the number of multiply-add operations required is huge. As an example, the ResNet CNNs require 5 to 11 billion multiplications to classify one image.
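To give a concrete sense of these figures, the following minimal Python sketch estimates the parameter memory as the parameter count times the bytes per parameter; the parameter counts used are indicative values consistent with those above, not figures taken from the cited papers.

```python
# Rough estimate of the memory needed to store the model parameters alone
# (indicative parameter counts; actual footprints depend on the framework and format).
def param_memory_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Memory in megabytes for num_params parameters stored at bytes_per_param bytes each."""
    return num_params * bytes_per_param / (1024 ** 2)

for name, params in [("ResNet-18", 11_700_000),
                     ("Inception-V3", 24_000_000),
                     ("ViT-Huge", 632_000_000)]:
    print(f"{name}: ~{param_memory_mb(params):.0f} MB with 32-bit floats")
```

Even the smallest of these models exceeds the memory budget of typical MCUs by two to three orders of magnitude, which motivates the approximation and distribution techniques discussed in the following sections.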

Following the tremendous spread of pervasive devices—Embedded Systems, Internet of Things (IoT) units, and MicroController Units (MCUs)—intelligence for pervasive devices emerged as a novel, challenging, and breakthrough research direction and application of ML and DL. On the one hand, pervasive devices are nowadays used in a broad range of applications, such as autonomous driving, smart cities, and smart industries. On the other hand, moving the (machine and deep learning-based) intelligent processing—e.g., fault detection or change detection algorithms [22]—as close as possible to where data are generated is crucial to support real-time applications, prolong the system lifetime, and increase the Quality-of-Service [2, 50, 53].

In contrast to these benefits of intelligence for pervasive devices stand the constraints on memory, computation, and energy characterizing such devices. On IoT units, the memory capacity ranges from kilobytes to megabytes [21], and the power consumption should be in the order of milliwatts [44]. MCUs, instead, rarely have 1 MB of memory and require their applications to run in the microwatt range [38]. For instance, one of the most powerful MCUs is the STM32H743ZI (by STMicroelectronics), which has a 480 MHz Cortex-M7 processor and 1024 KB of RAM, further split into five blocks with different speeds and purposes. The majority of MCUs have instead 96 to 256 KB of RAM.

This work addresses the challenges mentioned above by designing deep learning-based intelligent techniques for pervasive devices, i.e., DL models whose requirements in terms of memory footprint, computational load, and energy consumption do not exceed the corresponding technological constraints of the IoT or microcontroller unit where they are executed. In the related literature, Tiny Machine Learning (TML) [16, 27, 38, 49] is a newborn research direction with the same goals. In particular, TML takes into account all three main approaches developed in the related literature to reduce a DL model’s complexity (see Sect. 2) and provides a unified and comprehensive approach to the problem [2, 20]. More in detail, almost all TML works focus on allowing ML or DL inference on IoT units or MCUs, with very few works tackling on-device learning [9, 19, 20]. On-device learning is a demanding task, but it allows adapting the TML model over time from the incoming novel data and in response to concept drift, i.e., statistical variations of the data-generation process. Concept drift is widespread in real-world scenarios—typical causes are sensor faults, aging or seasonality effects, and user behavior changes—and, if not correctly addressed, might have catastrophic impacts on the TML model [22].

In the aforementioned scenario, this work proposes a methodology to design and deploy Deep and Wide Tiny Machine Learning solutions for the considered tiny device(s), i.e., IoT units, embedded systems, or MCUs. Specifically, the methodology deals with the problem from two different perspectives. The first one, presented in Sect. 3, details how to design inference-based Deep TML solutions, i.e., TML solutions based on Deep Learning. Section 4 extends such an approach by introducing on-device learning mechanisms. Section 5 discusses the second approach, namely Wide (Deep) TML, which splits the DL computation over a set of possibly heterogeneous tiny devices.

2 Related Literature

The related literature addresses the problem of reducing the DL models’ complexity (with or without the goal of matching the technological constraints of some target devices) in three main ways, detailed in the following paragraphs. Please refer to [15] for a detailed version of this analysis.

DL Hardware Design. The design of dedicated hardware solutions provides the best power consumption and inference times, at the expense of a complex design phase and reduced flexibility. Examples of newly designed hardware are Tensor Processing Units (TPUs) and Neural Processing Units or Accelerators [4, 34].

Approximating DL. The two main approximation techniques for DL models are task dropping (or pruning) and precision scaling [2, 40, 45]. Pruning techniques are organized into structured algorithms, which prune the structure of DL models (e.g., entire layers), and unstructured ones, which can also trim the parameters within a layer without preserving the structure [23, 37, 43, 59]. Precision scaling mechanisms aim instead at replacing the classical 32-bit floating-point representation of parameters with 8-bit data types, novel or fixed-point data types, as well as ternary or binary ones [33, 64].

Other approaches encompass the definition of optimized architectures or layers—e.g., the depth-wise convolutions in MobileNet [31] or the groups of convolutions in ShuffleNet [61]—the definition of sparse architectures [39, 42], and the introduction of Early-Exits [7, 17].

Distributing DL. In this approach, the DL computation is distributed over a set of heterogeneous tiny devices, without introducing any approximation but adding data transmission as a new layer of complexity [32, 54, 63].

Machine Learning in the Presence of Concept Drift. The literature about ML or DL in the presence of concept drift organizes the works into two main classes [22, 28]: passive solutions [25, 36] always adapt the model, independently of any evidence of a concept drift, whereas active approaches [18, 57, 60] employ techniques to detect changes and then adapt the model.

3 Deep Tiny Machine Learning

Let \(\mathcal {D}\) be a DL model that provides an output \(\hat{y}\) for an input x, hence \(\hat{y} = \mathcal {D} \left( x\right) \). Its processing pipeline is a sequence of M layers \(\varphi _{\theta _\ell }^{(\ell )}\), \(\ell \in \left\{ 1, \ldots , M\right\} \), with parameters \(\theta _{\ell }\), where each layer can be, for instance, a convolution, a pooling operator, or a non-linearity. Such a pipeline is then formalized as \( \hat{y} = \mathcal {D}\left( x\right) = \varphi _{\theta _M}^{(M)} \left( x_{M-1}\right) , \) with \( x_{\ell } = \varphi _{\theta _\ell }^{(\ell )} \left( x_{\ell -1}\right) , \) and where \(x_0\) degenerates to the input x.
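This pipeline formalization can be mirrored in a few lines of code. The following is a minimal, framework-agnostic Python sketch in which `run_pipeline` is a hypothetical helper and each layer \(\varphi _{\theta _\ell }^{(\ell )}\) is an arbitrary callable:

```python
from typing import Callable, Sequence

Layer = Callable[[object], object]   # a layer maps x_{l-1} to x_l

def run_pipeline(layers: Sequence[Layer], x):
    """Evaluate y_hat = D(x) = phi_M(phi_{M-1}(... phi_1(x) ...))."""
    for phi in layers:               # x_0 = x, then x_l = phi_l(x_{l-1})
        x = phi(x)
    return x                         # x_M = y_hat

# e.g., run_pipeline([lambda v: 2 * v, lambda v: v + 1], 3) returns 7
```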

Fig. 1. A sketch of the proposed mechanisms to reduce the memory footprint of a Deep Learning model \(\mathcal {D}\)

Let \(\varOmega \) be a transformation function that takes as inputs a DL model \(\mathcal {D}\) and a set of approximation parameters, and provides as output the corresponding approximated DL model \(\hat{\mathcal {D}}\). The definition of such a function, as well as the set of possible approximation parameters, strictly depend on the considered approximation techniques. For the purpose of this presentation, two approximation techniques are employed and shown in Fig. 1:

  • task dropping (or structured pruning), defined as the skipping of some tasks to save memory and computation [45]. In our case, task dropping, parameterized by \(\tilde{m} \le M\), consists in keeping the first \(\tilde{m}\) layers of the original DL model \(\mathcal {D}\), i.e., \( \hat{\mathcal {D}} = \varOmega \left( \mathcal {D}, \left\{ \tilde{m},\mathcal {K}\right\} \right) = \mathcal {K}_{\theta _\mathcal {K}}\left( \varphi _{\theta _{\tilde{m}}}^{(\tilde{m})} \left( x_{\tilde{m}-1}\right) \right) , \) where \(\mathcal {K}\) is an additional layer (e.g., a classifier) that may be added on top of the approximated pipeline to maintain the original DL model’s scope. Other task dropping techniques can be considered as well, such as keeping non-consecutive layers in the approximated pipeline, unstructured pruning techniques, or layer-specific mechanisms (e.g., pruning filters within convolutional layers);

  • precision scaling, which consists in changing the representation of the parameters \(\theta _{\ell }\) of a DL model \(\mathcal {D}\). The transformation is parameterized by q and is defined as \( \hat{\mathcal {D}} = \varOmega \left( \mathcal {D}, \left\{ q\right\} \right) = \varphi _{\tilde{\theta }_M}^{(M)} \left( \ldots \left( \varphi _{\tilde{\theta }_{1}}^{(1)} \left( x\right) \right) \right) , \) where \(\tilde{\theta }_\ell \) represents the approximation of the parameters \(\theta _{\ell }\). In the simplest version, q represents the number of bits at which to quantize the parameters (e.g., 1, 2, or 8 bits). Still, custom data types or different quantization schemes for each DL model layer could be considered as well (a code sketch of both approximations follows this list).
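The following is a minimal NumPy sketch of the two transformations \(\varOmega \), assuming that a model is represented as a list of callables with a parallel list of parameter tensors; `task_dropping` and `precision_scaling` are hypothetical helper names, and the quantization uses a plain uniform min-max scheme as a stand-in for the schemes in [33, 64]:

```python
import numpy as np

def task_dropping(layers, params, m_tilde, new_head):
    """Keep the first m_tilde layers of D and append a new head K (e.g., a classifier).
    `layers` and `params` are parallel lists of callables and parameter tensors."""
    return layers[:m_tilde] + [new_head], params[:m_tilde]

def precision_scaling(params, q=8):
    """Quantize each parameter tensor to q bits with a uniform min-max scheme."""
    approximated = []
    for theta in params:
        lo, hi = float(theta.min()), float(theta.max())
        levels = 2 ** q - 1
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = np.round((theta - lo) / scale)      # integer codes in [0, levels]
        approximated.append(codes * scale + lo)     # de-quantized approximation of theta
    return approximated
```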

Finally, let \(\bar{m}\) and \(\bar{c}\) be the memory and computational constraints of a target tiny device, respectively. The definition of energy constraints is more complex, since every DL inference affects the energy budget, whereas the memory and computational footprints remain fixed. As a consequence, energy constraints are not employed in the following, and the device is assumed to be plugged into an energy source.

The methodology. In this scenario, a methodology \(\mathcal {M}\) to automatically design Deep Tiny Machine Learning solutions, i.e., DL-based TML models satisfying the technological constraints of the tiny device executing them, is defined as the function \( \mathcal {M} \left( \mathcal {D}, \varOmega , \mathcal {T}, \bar{m}, \bar{c} \right) , \) as in [2], with \(\mathcal {T}\) being a validation set. Such a methodology operates in two steps. The first step identifies the set \(\mathcal {F}_{all}\) of all the feasible solutions. Formally, \( \mathcal {F}_{all} = \left\{ \hat{\mathcal {D}} \text { s.t. } m_{\hat{\mathcal {D}}} \le \bar{m} \text { and } c_{\hat{\mathcal {D}}} \le \bar{c}\right\} , \) where \(m_{\hat{\mathcal {D}}}\) and \(c_{\hat{\mathcal {D}}}\) represent the memory and the computational requirements of the approximation \(\hat{\mathcal {D}}\), respectively. The set is generated by applying the transformation \(\varOmega \) for all the considered approximations over the range of their parameters (in our case \(\tilde{m} \in \left\{ 1, \ldots , M\right\} \), a set of possible layers \(\mathcal {K}\), and a set of precision scaling mechanisms q). A trade-off between the granularity of the explored approximations and the corresponding exploration time is hence expected.
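A minimal sketch of this first step is given below; the per-layer footprints are synthetic placeholders standing in for \(m_{\hat{\mathcal {D}}}\) and \(c_{\hat{\mathcal {D}}}\), and the exploration simply enumerates the \((\tilde{m}, q)\) pairs and keeps the combinations within the budgets \(\bar{m}\) and \(\bar{c}\):

```python
from itertools import product

# Hypothetical per-layer memory (KB) and computation (MACs) of the original model D.
layer_mem_kb = [120, 90, 60, 45, 30]
layer_macs   = [4e6, 3e6, 2e6, 1e6, 5e5]

def footprint(m_tilde, q):
    """Memory (KB) and computation (MACs) of keeping m_tilde layers quantized at q bits."""
    mem  = sum(layer_mem_kb[:m_tilde]) * q / 32     # precision scaling shrinks the memory
    macs = sum(layer_macs[:m_tilde])
    return mem, macs

m_bar_kb, c_bar_macs = 256, 8e6                     # budgets of the target tiny device
F_all = [(m_tilde, q)
         for m_tilde, q in product(range(1, len(layer_mem_kb) + 1), [8, 16, 32])
         if footprint(m_tilde, q)[0] <= m_bar_kb and footprint(m_tilde, q)[1] <= c_bar_macs]
print(F_all)   # the feasible (m_tilde, q) pairs
```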

The methodology’s second step takes as input the set of feasible solutions \(\mathcal {F}_{all}\) and the validation set \(\mathcal {T}\) and provides as output a Pareto front \(\mathcal {F}\) of those solutions. More in detail, the front is defined by the triples \(\langle \hat{\alpha }_i, m_{\hat{\mathcal {D}}_i}, c_{\hat{\mathcal {D}}_i} \rangle \), computed by training only the layer \(\mathcal {K}\) (in a transfer-learning fashion [58]) on disjoint subsets of \(\mathcal {T}\) and then evaluating it with the metric \(\hat{\alpha }\).
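The Pareto front over the resulting triples can be extracted by discarding dominated solutions; the following is a minimal sketch assuming the metric \(\hat{\alpha }\) is to be maximized and memory and computation are to be minimized:

```python
def pareto_front(solutions):
    """Keep the non-dominated triples (accuracy, memory, computation): a solution is
    dominated if a different one is at least as accurate, at least as small,
    and at least as cheap."""
    front = []
    for acc, mem, comp in solutions:
        dominated = any(a >= acc and m <= mem and c <= comp and (a, m, c) != (acc, mem, comp)
                        for a, m, c in solutions)
        if not dominated:
            front.append((acc, mem, comp))
    return front

# pareto_front([(0.91, 180, 6e6), (0.88, 120, 4e6), (0.87, 150, 5e6)])
# keeps the first two triples and drops the third, dominated by the second.
```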

Selecting the target solution: a discussion. Since the methodology provides as output a Pareto front, the choice of the optimal solution for each application scenario is left to the methodology’s user. As an example, real-time scenarios suggest choosing the solution with the smallest computational complexity (probably paying in terms of accuracy). In contrast, in scenarios where accuracy is crucial, this reasoning no longer holds.

A second aspect to consider is the generalization of the chosen solution beyond the validation set \(\mathcal {T}\) (as it happens in autoML tools [26, 56]). Although not investigated for the proposed methodology, we expect that the introduced technological constraints will mitigate the problem by not allowing the creation of over-complex (and overfitting-prone) DL models.

4 On-Device Tiny Machine Learning

The need for adaptation. The methodology briefly introduced in Sect. 3 proposes a set of solutions whose inference is feasible on the desired tiny device. Consequently, a deployed solution can process novel unseen inputs x and provide the corresponding outputs \(\hat{y}\), but it will remain fixed over time. The latter is a major limitation in real scenarios for two main reasons. First, the data needed to train (more precisely, fine-tune) the DL solution might be extremely limited or available only incrementally. Hence, it is desirable to deploy a first (approximated) DL model on the tiny device and then rely on incremental learning mechanisms to improve the model on newly arrived data [19]. The second reason is the intrinsic non-stationarity of the environment in which the DL model operates, due to seasonality, faults in the sensors, or aging effects, to make a few examples. A model that does not adapt in response to an environmental change—formally, a concept drift, which can be viewed as a statistical variation in the behavior of the data-generation process—is likely to become useless quickly [22].

The proposed solution, presented in [19, 20], exploits the freedom that the methodology discussed in Sect. 3 allows in the choice of the additional layer \(\mathcal {K}\) to be added on top of the approximated DL model \(\hat{\mathcal {D}}\). Although a straightforward approach to the problem suggests relying on classical DL-based classifiers (e.g., fully-connected layers) or parametric, supervised machine learning classifiers (e.g., Support Vector Machines [13] or Random Forests [30]), here the choice is to rely on the non-parametric k-Nearest Neighbor (kNN) [3].

The kNN algorithm predicts according to the majority vote of the k samples nearest to the input x within its knowledge (or training) set \(\mathcal {T}\), where k is typically set as the ceiling of the square root of the cardinality of \(\mathcal {T}\) [1]. The main advantages of choosing the kNN are the absence of a training phase (providing a training set is enough) and the cost-free adaptation procedure: it is sufficient to modify the training set \(\mathcal {T}\) to update the kNN classifier as well. There are two drawbacks: the training set has to be kept entirely in memory, and the prediction time increases with the cardinality of the training set.
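A minimal NumPy sketch of such a kNN head, operating on the features extracted by \(\hat{\mathcal {D}}\) and assuming non-negative integer class labels, is the following (the class name `KNNHead` is a hypothetical choice):

```python
import numpy as np
from math import ceil, sqrt

class KNNHead:
    """Non-parametric classifier K placed on top of the approximated model D_hat.
    It stores the features extracted by D_hat and their integer labels."""

    def __init__(self, features: np.ndarray, labels: np.ndarray):
        self.features, self.labels = features, labels

    @property
    def k(self) -> int:
        return ceil(sqrt(len(self.labels)))            # k = ceil(sqrt(|T|)), as in [1]

    def predict(self, z: np.ndarray) -> int:
        dists = np.linalg.norm(self.features - z, axis=1)
        nearest = self.labels[np.argsort(dists)[:self.k]]
        return int(np.bincount(nearest).argmax())      # majority vote of the k nearest samples
```

Training amounts to storing the pairs \(\left( \hat{\mathcal {D}}(x), y\right) \) for the samples in \(\mathcal {T}\), and adapting the classifier amounts to editing the stored arrays, as done by the adaptation module discussed below.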

Fig. 2. The architecture of a DL model \(\mathcal {K} \circ \hat{\mathcal {D}}\) that can learn on device

Figure 2 shows the resulting architecture \(\mathcal {K}\circ \hat{\mathcal {D}}\) of a DL model that can learn on a tiny device, with an approximated DL model \(\hat{\mathcal {D}}\) followed by the kNN classifier \(\mathcal {K}\). More in detail, Fig. 2a shows the training of such an architecture, which consists in providing a training set \(\mathcal {T}\) to the kNN, whereas Fig. 2b shows the architecture during the operational life, with an adaptation module that, in a test-then-train scenario [18, 22], receives as input the classification \(\hat{y}=\mathcal {K}\circ \hat{\mathcal {D}}\left( x\right) \) of an input x and then the true label y.

Adaptation module. The adaptation module can adapt the model over time according to different policies (see Sect. 2 for the nomenclature); a minimal code sketch of these policies follows the list:

  • Incremental Approach [19]. This approach adds every supervised sample to the kNN’s training set. It may be useful when only a few data are available at the training stage.

  • Passive Update [20]. By definition, a passive approach adapts the model at every incoming (supervised) sample. In particular, in this scenario, only the misclassified samples are added to the kNN’s training set since they are assumed to bring more information about the underlying problem.

  • Active Update [20]. Active approaches rely on an explicit mechanism to detect a concept drift, namely, Change Detection Tests (CDTs). Examples of CDTs can be found in [6, 41, 46, 60]. Once a CDT detects a change, the adaptation takes place in two steps, assuming that the algorithm has kept a (history) window of the last seen samples. The first step identifies the samples within the history window that represent the new knowledge after the concept drift. Those samples are then used to update the model (here, they entirely replace the kNN’s training set).

  • Hybrid Update [20]. The hybrid approach fuses the passive and active updates to combine the benefits and mitigate the drawbacks of each approach. More in detail, the active update aims at detecting the concept drift and thus removing obsolete knowledge from the model. In contrast, the passive update continuously updates the model (also in stationary conditions). In such a way, the passive updates alleviate the case where the active approach does not detect a concept drift, whereas the active approach keeps under control the memory footprint, which otherwise grows indefinitely with the passive updates.
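The sketch below builds on the kNN head sketched above and illustrates the four policies. Two simplifications are deliberate assumptions of this sketch, not part of [19, 20]: the change-detection test is a plain error-rate threshold over a sliding window, standing in for the CDTs of [6, 41, 46, 60], and the whole window is used as the post-change knowledge.

```python
import numpy as np

class AdaptationModule:
    """Sketch of the incremental, passive, active, and hybrid adaptation policies."""

    def __init__(self, knn, policy="hybrid", window=50, cdt_threshold=0.4):
        self.knn, self.policy = knn, policy
        self.window, self.cdt_threshold = window, cdt_threshold
        self.history = []                              # last (feature, label, correct) triples

    def update(self, z, y_true, y_pred):
        correct = (y_pred == y_true)
        self.history = (self.history + [(z, y_true, correct)])[-self.window:]

        if self.policy == "incremental":
            self._add(z, y_true)                       # add every supervised sample
        if self.policy in ("passive", "hybrid") and not correct:
            self._add(z, y_true)                       # add only the misclassified samples
        if self.policy in ("active", "hybrid") and self._change_detected():
            self.knn.features = np.stack([h[0] for h in self.history])
            self.knn.labels = np.array([h[1] for h in self.history])
            self.history = []                          # the new knowledge replaces the old set

    def _add(self, z, y):
        self.knn.features = np.vstack([self.knn.features, z])
        self.knn.labels = np.append(self.knn.labels, y)

    def _change_detected(self):
        if len(self.history) < self.window:
            return False
        error_rate = 1.0 - np.mean([h[2] for h in self.history])
        return error_rate > self.cdt_threshold
```

During the operational life of Fig. 2b, the test-then-train loop first calls the kNN head to obtain \(\hat{y}\) and then, once the true label y becomes available, calls update.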

Limitations. The main limitation of the proposed approach is that the on-device adaptation mechanisms are supervised. Introducing semi-supervised or unsupervised adaptation mechanisms will broaden the range of applications, since supervised information is rarely, or only occasionally, available in many real scenarios.

5 Wide (Deep) Tiny Machine Learning

Sections 3 and 4 enable the inference and the on-device training of DL models on tiny devices by relying on approximation mechanisms that reduce the memory and computational requirements of the DL model until the technological constraints of the tiny device in terms of memory and computation are met. This approach comes at the expense of a drop in the DL model’s evaluation metric. A different approach to the problem aims at distributing the computation over a set of possibly heterogeneous tiny devices. In this way, the original DL model is split into smaller tasks—e.g., each DL layer can be a task—that are feasible for the tiny devices executing them; since no approximation is introduced, the DL model is expected to behave in the same way as if it were not distributed. Tiny devices are referred to as IoT units or units in the following.

The methodology. A methodology working in this scenario requires the definition of:

  • The IoT system model. A sample IoT system comprises C data-acquisition units \(\{s_1, \ldots , s_C\}\), C target units \(\{f_1, \ldots , f_C\}\), and a set \(\mathbb {N}_N = \{1, \dots , N\}\) of N IoT computing units. Each unit has a memory capacity \(\bar{m}_i\) and an available computation \(\bar{c}_i\), for each \(i \in \mathbb {N}_N\). See Fig. 3b and c as an example.

    Such a system is modeled as a graph with nodes \(V=\mathbb {N}_N \cup \left\{ s_1,\dots ,s_C,f_1,\dots ,f_C\right\} \) and arcs connecting units within the range of the transmission technology. In this formalization, the distance \(d_{i_1,i_2}\), for each \(i_1,i_2 \in V\), is the shortest communication path between units \(i_1\) and \(i_2\) within the graph.

  • The split-policy. The simplest way to split a DL model is layer-wise. Other policies, e.g., also splitting within the DL layers (e.g., the convolutional filters as in [63]), could be considered as well.

    In the most general formulation, the methodology distributes C DL models, i.e., one per data source. The u-th DL model, for each \(u \in \mathbb {N}_C= \{1, \ldots , C\}\), has \(M_u\) layers, each of which has: memory demand \(m_{u,j}\); computation \(c_{u,j}\); and activations of size \(K_{u,j}\), for each \(j \in \{1, \ldots , M_u\}\).

  • The latency. The transmission latency \(t^{(t)}_{u,i,k,j}\) (between units \(i, k \in \mathbb {N}_N\)) models the time required to transmit the activations \(K_{u,j}\) of layer j of DL model u and may include possible delays (e.g., due to transmission failures or queues). The processing latency \(t^{(p)}_{u,i,j}\) models the time unit i requires to carry out the computation \(c_{u,j}\) of the j-th layer of DL model u.

In these settings, the methodology [21] is a quadratic binary optimization problem on the variables \(\alpha _{u,i,j}\) (equal to one if unit i runs layer j of DL model u, zero otherwise) that minimizes the decision-making latency [2] (i.e., from source to target):

$$\begin{aligned} \sum _{u=1}^{C}\sum _{i=1}^{N}\sum _{k=1}^{N}\sum _{j=1}^{M-1} \alpha _{u,i,j}\cdot \alpha _{u,k,j+1}\cdot t^{(t)}_{u,i,k,j} + t^{(t)}_s + t^{(t)}_f + \sum _{u=1}^{C}\sum _{i=1}^{N}\sum _{j=1}^{M} \alpha _{u,i,j} \cdot t^{(p)}_{u,i,j}, \end{aligned}$$

where \(M=\max _u\{M_u\}\); the first term models the transmission times among IoT units; \(t^{(t)}_s\) and \(t^{(t)}_f\) subsume the transmission times from the sources and to the targets; and the last term models the processing time. Please refer to [21] for details.

Fig. 3. Distributing two 5-layer DL models over an IoT system with or without shared layers. The squares are Raspberry Pi 3B+ units and the circles STM32H7 MCUs. The transmission range is shown by the green dash-dotted line

Such an optimization problem is subject to (at least) three sets of constraints: two ensuring that the memory and computational technological constraints are met, and one guaranteeing that each DL layer is assigned exactly once:

$$\begin{aligned} \forall i \in \mathbb {N}_N, \sum _{u=1}^{C}\sum _{j=1}^{M} \alpha _{u,i,j} \cdot m_{u,j} \le \bar{m}_i \quad \text {and}\quad \forall i \in \mathbb {N}_N, \sum _{u=1}^{C}\sum _{j=1}^{M} \alpha _{u,i,j} \cdot c_{u,j} \le \bar{c}_i, \end{aligned}$$
$$\begin{aligned} \forall u \in \mathbb {N}_C, \forall j \in \mathbb {N}_M, \quad \sum _{i=1}^{N} \alpha _{u,i,j} = 1 \text { if } j \le M_u, \text { and } 0 \text { otherwise}. \end{aligned}$$

Other constraints can be introduced to model specific situations, as explained in [21]. For instance, Fig. 3 shows an application of the methodology where two 5-layer DL models (Fig. 3a) have to be placed in the IoT system shown in Fig. 3b and c. In particular, to ensure that the first two layers are shared between the two DL models in Fig. 3c, additional constraints force the first two layers of the DL models onto the same IoT unit and prevent those layers from being counted twice in the memory or computational budget [21]. A minimal sketch of the underlying placement problem follows.
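The following brute-force Python sketch illustrates the placement problem for a single DL model on a small IoT system; all per-layer and per-unit figures and the hop-independent transmission latency are synthetic assumptions. It enumerates the assignments \(\alpha \), discards those violating the memory and computational budgets, and keeps the one with the smallest latency, whereas the actual methodology [21] solves the quadratic binary program instead.

```python
from itertools import product

# Hypothetical instance: one DL model with M layers to be placed on N units.
M, N = 4, 3
layer_mem  = [80, 60, 40, 20]      # memory demand m_j per layer (KB)
layer_comp = [4, 3, 2, 1]          # computation c_j per layer (MMACs)
unit_mem   = [128, 128, 256]       # memory budgets m_bar_i (KB)
unit_comp  = [5, 5, 8]             # computation budgets c_bar_i (MMACs)
t_proc = [[j + 1 for j in range(M)] for _ in range(N)]                 # t^(p)_{i,j} (ms)
t_tx   = [[0 if i == k else 2 for k in range(N)] for i in range(N)]    # t^(t)_{i,k} (ms)

best, best_latency = None, float("inf")
for assign in product(range(N), repeat=M):   # assign[j] = unit running layer j (the alphas)
    feasible = all(
        sum(layer_mem[j]  for j in range(M) if assign[j] == i) <= unit_mem[i] and
        sum(layer_comp[j] for j in range(M) if assign[j] == i) <= unit_comp[i]
        for i in range(N))
    if not feasible:
        continue
    latency = sum(t_proc[assign[j]][j] for j in range(M)) + \
              sum(t_tx[assign[j]][assign[j + 1]] for j in range(M - 1))
    if latency < best_latency:
        best, best_latency = assign, latency

print(best, best_latency)   # layer-to-unit assignment and decision-making latency (ms)
```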

Limitations. The proposed methodology is a quadratic binary optimization problem that is NP-complete since it can be converted to a binary linear program, which is one of Karp’s 21 NP-complete problems [35].

The second main limitation is that the optimization problem does not consider energy constraints for the reasons stated in Sect. 3.

6 Conclusions

This work addressed the problem of designing DL models in a TML scenario, i.e., DL models whose requirements in terms of memory, computation, and energy match the corresponding technological constraints introduced by the IoT unit or the microcontroller on which they are deployed. To achieve this goal, two main approaches have been proposed: the first one relies on approximation techniques to reduce the memory and computational requirements, with the possibility of defining solutions that can learn directly on the device; the second one does not introduce any approximation in the DL model, but splits it into smaller tasks and distributes them over a set of possibly heterogeneous devices.

Future work encompasses the definition of semi-supervised on-device mechanisms, the introduction of energy constraints, and a different approach for the methodology in Sect. 5 that is not NP-complete.