Building complex systems by integrating low-cost, inherently unreliable Commercial Off-The-Shelf (COTS) components is one of today’s challenges in the design, analysis and development of systems that must exhibit a certain degree of dependability. Over the last decade, the complexity of electronic systems has grown at an increasing pace, thanks to the decrease in component size and cost. However, the use of low-cost execution resources to achieve high performance makes modern electronic devices increasingly unreliable, because such components are more and more susceptible to faults. The number of both hard and soft faults is growing, mainly due to the shrinking of the components themselves, to variations in the manufacturing process, and to the exposure of devices to radiation and noise fluctuations.
Therefore, when adopting the COTS-based design approach for the realization of modern and pervasive electronic systems, reliability has become one of the main optimization goals, together with performance. Nevertheless, in non-critical environments, where requirements are less stringent, reliability must be balanced against the costs it introduces; moreover, in many situations, the need for reliability may change during operation, depending on the specific working scenario. For these reasons, we claim that there is a need for a new way to dynamically “tune” fault management properties based on the working scenario, thus finding a satisfactory trade-off between benefits and costs at run-time.
We have considered multicore and manycore architectures, working on the definition of run-time resource management strategies aimed at improving system reliability while fulfilling performance constraints. The proposed solutions aim at mitigating soft errors as well as permanent ones; moreover, we also tackle the issue of optimal resource exploitation in order to extend the system lifetime by mitigating wear-out effects.
Recent publications:
- [DATE2014]
A. Das, A. Kumar, B. Veeravalli, A. Miele, C. Bolchini, “An adaptive approach for online fault management in many-core architectures,” in Proc. IEEE Design, Automation and Test in Europe, DATE, pp. 1429-1432, Mar. 2014, doi: http://dx.doi.org/10.1109/DATE.2012.6176589
- [JETTA2013a]
C. Bolchini, M. Carminati, A. Miele, “Self-Adaptive Fault Tolerance in Multi-/Many-Core Systems,” in Springer Journal of Electronic Testing, Vol. 29, No. 2, pp. 159-175, Apr. 2013, doi: http://dx.doi.org/10.1007/s10836-013-5367-y
- [DFT2013]
C. Bolchini, M. Carminati, A. Miele, A. Das, A. Kumar, B. Veeravalli, “Run-time mapping for reliable many-cores based on energy/performance trade-offs,” in Proc. IEEE Intl Symp. Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT, pp. 58-64, Oct. 2013, doi: http://dx.doi.org/10.1109/DFT.2013.6653583
- [DATE2012]
C. Bolchini, A. Miele, D. Sciuto, “An adaptive approach for online fault management in many-core architectures,” in Proc. Design Automation & Test in Europe, DATE, pp. 1429-1432, Mar. 2012, url: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=6176589
This work partially fits under the umbrella of the EU COST Action IC1103 MEDIAN, of which I am one of the five initial proponents; the action started in Dec. 2011 and is running until Nov. 2015. It fosters collaborations between participants by supporting short-term scientific missions and other similar activities.