

#### POLITECNICO MILANO 1863

A fault-injection methodology for the system-level reliability analysis of computing systems modeled in SystemC

Antonio Miele – antonio.miele@polimi.it

- Widespread adoption of complex computing systems (e.g. multi-cores) in mission-/safety-critical applications
- Necessity to perform a reliability-aware design of computing systems
- Fault injection is the technique commonly used for reliability analysis



Limitations in common practices for fault injection:

- Performing fault injection only in the late design stages
- Analyzing mainly final results
- When internal memory elements are monitored, analyzing the raw traces



A framework and a methodology for **system-level fault-injection-based reliability analysis** in multi-cores specified in SystemC/TLM

Key points:

- Support for an accurate definition of the fault campaign
- Capability to perform error monitoring at both architecture and application level
- Customizable error analysis and classification approach

#### **Reference system**

- The hardware is composed of one or more processors and HW modules
- The application is organized into tasks/functions
- Tasks are mapped onto the various units



#### **Background: ReSP**

ReSP is a system-level simulation platform for multicores:

- HW components modeled in SystemC/TLM
- Features a functional model generator for microprocessors




- Python offers introspection and scripting capabilities:
  - non-intrusive visibility into the components
  - run-time composition and management of the specification
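To illustrate what non-intrusive visibility means, the following Python sketch walks the attributes of a component object and reads every register-like member without modifying it. `Register`, `ToyProcessor` and `inspect_registers` are hypothetical stand-ins that only mimic the `leonProc.npc.read()` style of the ReSP console; they are not ReSP's actual Python bindings.

```python
class Register:
    """Hypothetical register wrapper exposing read()/write(), mimicking leonProc.npc."""

    def __init__(self, value=0):
        self._value = value

    def read(self):
        return self._value

    def write(self, value):
        self._value = value


class ToyProcessor:
    """Hypothetical processor model; only the attribute names echo the ReSP console."""

    def __init__(self):
        self.pc = Register(84)
        self.npc = Register(88)
        self.sp = Register(0x1000)
        self.name = "leonProc"  # plain attribute without read()/write(): skipped below


def inspect_registers(component):
    """Return {attribute: value} for every attribute offering read() and write().

    The component is only observed through introspection, never modified.
    """
    state = {}
    for attr_name, attr in vars(component).items():
        if callable(getattr(attr, "read", None)) and callable(getattr(attr, "write", None)):
            state[attr_name] = attr.read()
    return state


print(inspect_registers(ToyProcessor()))  # {'pc': 84, 'npc': 88, 'sp': 4096}
```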







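The console stops the simulation, injects the fault, and resumes the simulation. In the session below, the next program counter (`npc`) of the `leonProc` processor is read (88) and overwritten with 84 before the simulation is resumed until the program exits.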

    ReSP v0.3.2 - Politecnico di Milano, European Space Agency
    This tool is distributed under the GPL License
    Type show_commands() to get the list of available commands
    >>> load_architecture('architectures/test/test.py')
    File architectures/test/test.py correctly loaded
    >>> run_simulation(500)
    >>> leonProc.npc.read()
    88
    >>> leonProc.npc.write(84)
    >>> run_simulation()
    Program exited with value 0
    SystemC: simulation stopped by user.
    Real Elapsed Time (seconds): 0.01
    Simulated Elapsed Time (nano-seconds): 1307.0
    >>>
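The stop/inject/resume sequence of the session can be mimicked on a toy model. The sketch below assumes a hypothetical `ToySimulator` that sums a data array in an accumulator register; it is not the ReSP API, only an illustration of pausing a run, corrupting architectural state, and comparing the resumed run against a fault-free (golden) one.

```python
class ToySimulator:
    """Hypothetical simulator that can be paused and resumed (not the ReSP API)."""

    def __init__(self, data):
        self.data = list(data)
        self.time = 0  # number of executed steps
        self.acc = 0   # architectural register targeted by the fault

    def run(self, steps=None):
        """Execute `steps` iterations (all remaining ones if None), then pause."""
        end = len(self.data) if steps is None else min(self.time + steps, len(self.data))
        while self.time < end:
            self.acc += self.data[self.time]
            self.time += 1
        return self.acc


def inject_bit_flip(sim, bit):
    """Flip one bit of the accumulator while the simulation is paused."""
    sim.acc ^= 1 << bit


golden = ToySimulator([1, 2, 3, 4]).run()   # fault-free reference run

faulty_sim = ToySimulator([1, 2, 3, 4])
faulty_sim.run(2)                # stop the simulation after two steps (acc = 3)
inject_bit_flip(faulty_sim, 4)   # inject the fault (acc = 3 ^ 16 = 19)
faulty = faulty_sim.run()        # resume until completion (19 + 3 + 4)
print(golden, faulty)            # 10 26
```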






- Processor models expose a configurable debugging interface
- Custom C++/Python functions can analyze the execution
- Application Binary Interface (ABI) can be exploited to interpret raw data (in particular, on function calls/returns)
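As an example of exploiting the ABI, the sketch below names the raw argument registers captured at a function call. It assumes, as in the function table of the edge-detection example, that the first four parameters are passed in reg0 to reg3; the register snapshot is hypothetical, and stack-passed parameters are deliberately not handled.

```python
# Assumption: the first four arguments travel in reg0..reg3, as in the example table.
ARG_REGISTERS = ["reg0", "reg1", "reg2", "reg3"]

# Parameter names per function, taken from the edge-detection example table.
SIGNATURES = {
    "rgb2gray": ["inputImg", "outputImg", "width", "height"],
    "edgeDetector": ["inputImg", "outputImg", "width", "height"],
    "edgeOverlapping": ["inputImg", "edgeImg", "outputImg", "width", "height"],
}


def decode_call_arguments(function, registers):
    """Name the raw argument registers captured when `function` is called."""
    params = SIGNATURES[function]
    if len(params) > len(ARG_REGISTERS):
        # Extra arguments are passed on the stack; this sketch does not read memory.
        raise NotImplementedError(f"{function} has stack-passed parameters")
    return {name: registers[reg] for name, reg in zip(params, ARG_REGISTERS)}


# Hypothetical register snapshot taken at the call to rgb2gray.
regs = {"reg0": 0x4000, "reg1": 0x8380, "reg2": 96, "reg3": 60}
args = decode_call_arguments("rgb2gray", regs)
print(args)  # {'inputImg': 16384, 'outputImg': 33664, 'width': 96, 'height': 60}
# With this snapshot, width * height = 5760, the outputImg dimension in the table.
```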



- Probes (similar to saboteurs) can be used to analyze transmitted data
- Custom Python functions can be used to monitor the internal status of components (called at every scheduler delta cycle)
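A minimal sketch of such a monitor, assuming a hypothetical scheduler that invokes each registered Python callback once per delta cycle with a snapshot of the component state. The registration mechanism and the state representation in the actual platform may differ; only the "call every monitor at every delta cycle" pattern is illustrated.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
Monitor = Callable[[int, State], None]


class DeltaCycleScheduler:
    """Hypothetical scheduler that calls every registered monitor after each delta cycle."""

    def __init__(self) -> None:
        self.monitors: List[Monitor] = []

    def register_monitor(self, monitor: Monitor) -> None:
        self.monitors.append(monitor)

    def run(self, states: List[State]) -> None:
        """`states` holds one snapshot of the component per delta cycle."""
        for cycle, state in enumerate(states):
            for monitor in self.monitors:
                monitor(cycle, state)


def make_change_monitor(key: str, log: list) -> Monitor:
    """Build a monitor that logs (cycle, value) each time `key` changes value."""
    last = None

    def monitor(cycle: int, state: State) -> None:
        nonlocal last
        if state[key] != last:
            log.append((cycle, state[key]))
            last = state[key]

    return monitor


log = []
scheduler = DeltaCycleScheduler()
scheduler.register_monitor(make_change_monitor("status", log))
scheduler.run([{"status": 0}, {"status": 0}, {"status": 1}, {"status": 1}, {"status": 0}])
print(log)  # [(0, 0), (2, 1), (4, 0)]
```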






...

*Figure: framework flow (inputs: application source, architecture specification; output: dependability report).*

| Function        | Memory addresses | Parameter                                           | Position              | Type                                                                                       | Location                                |
|-----------------|------------------|-----------------------------------------------------|-----------------------|--------------------------------------------------------------------------------------------|-----------------------------------------|
| main            | 0x1498 – 0x1640  |                                                     |                       |                                                                                            |                                         |
| rgb2gray        | 0xea0 – 0xf64    | inputImg<br>outputImg<br>width<br>height            | 0<br>1<br>2<br>3      | char* (dim = 17,280)<br>char* (dim = 5760)<br>unsigned<br>unsigned                         | reg0<br>reg1<br>reg2<br>reg3            |
| edgeDetector    | 0xf68 – 0x12fc   | inputImg<br>outputImg<br>width<br>height            | 0<br>1<br>2<br>3      | char* (dim = 5760)<br>char* (dim = 5760)<br>unsigned<br>unsigned                           | reg0<br>reg1<br>reg2<br>reg3            |
| edgeOverlapping | 0x1300 – 0x1480  | inputImg<br>edgeImg<br>outputImg<br>width<br>height | 0<br>1<br>2<br>3<br>4 | char* (dim = 17,280)<br>char* (dim = 5760)<br>char* (dim = 17,280)<br>unsigned<br>unsigned | reg0<br>reg1<br>reg2<br>reg3<br>SP + 0x |
| readBitmap      | HW               | inputImg<br>width<br>height                         | 0<br>1<br>2           | char* (dim = 17,280)<br>unsigned<br>unsigned                                               | reg0<br>reg1<br>reg2                    |
| writeBitmap     | HW               | outputImg<br>width<br>height                        | 0<br>1<br>2           | char* (dim = 17,280)<br>unsigned<br>unsigned                                               | reg0<br>reg1<br>reg2                    |



- A semi-automated analysis of the components' SystemC specification
  - Identifies injection locations, and
  - Supports the definition of fault models
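Once an injection location is identified, a fault model describes how its value is corrupted. The sketch below covers three classic models (bit-flip, stuck-at-0, stuck-at-1) on a fixed-width location; the `Fault` descriptor and the `corrupt` helper are hypothetical and do not reproduce the fault models defined by the framework.

```python
from dataclasses import dataclass

FAULT_MODELS = ("bit-flip", "stuck-at-0", "stuck-at-1")


@dataclass(frozen=True)
class Fault:
    """Hypothetical fault descriptor: where, which bit, which model, and when."""
    location: str
    bit: int
    model: str
    time: int


def corrupt(value: int, fault: Fault, width: int = 32) -> int:
    """Return `value` as seen after applying `fault` to a `width`-bit location."""
    if fault.model not in FAULT_MODELS:
        raise ValueError(f"unknown fault model {fault.model!r}")
    if not 0 <= fault.bit < width:
        raise ValueError("bit outside the location width")
    mask = 1 << fault.bit
    if fault.model == "bit-flip":
        value ^= mask
    elif fault.model == "stuck-at-0":
        value &= ~mask
    else:  # "stuck-at-1"
        value |= mask
    return value & ((1 << width) - 1)


print(corrupt(0b1010, Fault("reg0", bit=1, model="bit-flip", time=500)))  # 8
```

For a permanent stuck-at fault, the corruption would have to be re-applied after every write to the location; the sketch only computes the corrupted value.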



- A fault-free run is performed to characterize the golden execution:
  - At architecture level: relevant traces are collected
  - At application level: an execution flow graph is derived
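The golden execution flow graph can then serve as a reference for faulty runs. In this sketch the graph is simply the multiset of transitions between consecutively executed functions, and the golden call order is hypothetical; comparing edge counts reveals skipped or unexpected function invocations.

```python
from collections import Counter


def execution_flow_graph(functions):
    """Directed edges between consecutively executed functions, with occurrence counts."""
    return Counter(zip(functions, functions[1:]))


def flow_deviations(golden_functions, faulty_functions):
    """Edges missing from, or unexpected in, the faulty run's execution flow graph."""
    golden = execution_flow_graph(golden_functions)
    faulty = execution_flow_graph(faulty_functions)
    # Counter subtraction keeps only positive counts.
    return dict(golden - faulty), dict(faulty - golden)


# Hypothetical golden order of calls, in which main invokes each stage in turn.
golden_run = ["main", "rgb2gray", "main", "edgeDetector", "main", "edgeOverlapping", "main"]
faulty_run = ["main", "rgb2gray", "main", "edgeDetector", "main"]  # stops early
missing, unexpected = flow_deviations(golden_run, faulty_run)
print(missing)     # {('main', 'edgeOverlapping'): 1, ('edgeOverlapping', 'main'): 1}
print(unexpected)  # {}
```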





- The fault list is defined according to:
  - the liveness analysis on raw traces
  - the function to corrupt
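The two criteria can be combined as follows: the liveness analysis discards injection instants at which a corrupted value would be overwritten before being read, and the function to corrupt restricts injections to the time window in which it executes. The access trace, the window, and the length-weighted random sampling are assumptions made for illustration.

```python
import random


def live_intervals(accesses):
    """Return [start, end) intervals in which a corrupted value would later be read.

    `accesses` is a time-sorted list of (time, op) pairs for one location, op in {"R", "W"}.
    An interval between two accesses is live when it ends with a read; if it ends
    with a write, the fault is overwritten before being used.
    """
    return [(t0, t1) for (t0, _), (t1, op1) in zip(accesses, accesses[1:]) if op1 == "R"]


def build_fault_list(accesses, window, n_faults, width=32, seed=0):
    """Sample (time, bit) injections in live intervals clipped to the time window
    in which the function to corrupt is executing."""
    lo, hi = window
    clipped = [(max(s, lo), min(e, hi)) for s, e in live_intervals(accesses)]
    clipped = [(s, e) for s, e in clipped if s < e]
    if not clipped:
        return []
    rng = random.Random(seed)
    weights = [e - s for s, e in clipped]  # longer live intervals receive more faults
    faults = []
    for _ in range(n_faults):
        start, end = rng.choices(clipped, weights=weights)[0]
        faults.append((rng.randrange(start, end), rng.randrange(width)))
    return faults


accesses = [(0, "W"), (10, "R"), (15, "W"), (30, "W"), (40, "R")]
print(live_intervals(accesses))  # [(0, 10), (30, 40)]
faults = build_fault_list(accesses, window=(5, 35), n_faults=20)
```

Instants before the first recorded access are ignored here; a complete analysis would also consider the initial value of the location.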





- Definition of custom architecture/application-level monitoring and classification functionalities in C++/Python
- Example: a classifier for the edge detection application
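A minimal classifier for the edge detection application could compare the produced image with the golden one pixel by pixel. The three classes follow the ones used in the case study below; the `exit_status` encoding and the corrupted-pixel fraction are hypothetical additions meant to give a richer result than a plain pass/fail verdict.

```python
def classify_run(golden_output, faulty_output, exit_status):
    """Classify one faulty run against the golden one.

    exit_status uses a hypothetical encoding: "ok", "exception" or "timeout".
    Returns (class, fraction of corrupted output pixels or None).
    """
    if exit_status in ("exception", "timeout"):
        return "Exception/timeout", None
    if len(faulty_output) != len(golden_output):
        return "Error", 1.0  # truncated or oversized image: every pixel counted as wrong
    wrong = sum(g != f for g, f in zip(golden_output, faulty_output))
    if wrong == 0:
        return "No effect", 0.0
    return "Error", wrong / len(golden_output)


golden = bytes([0, 255, 0, 255])
print(classify_run(golden, bytes([0, 255, 0, 255]), "ok"))    # ('No effect', 0.0)
print(classify_run(golden, bytes([0, 255, 255, 255]), "ok"))  # ('Error', 0.25)
print(classify_run(golden, b"", "timeout"))                   # ('Exception/timeout', None)
```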





- A fairly standard execution flow for the fault injection campaign
- Richer results allow a more in-depth analysis
- Analysis results may provide feedback for refining the classification strategy
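The campaign loop itself can be sketched as below; `simulate` and `classify` are placeholders for the simulator run and the custom classifier, and the toy fault list is hypothetical. Keeping the raw results alongside the aggregated percentages is what allows the classification strategy to be refined and re-applied without re-running the simulations.

```python
from collections import Counter, defaultdict


def run_campaign(fault_list, simulate, classify):
    """Run every fault, classify its outcome and report percentages per location.

    fault_list: iterable of (location, fault) pairs.
    simulate(fault) -> raw run result; classify(result) -> class label.
    """
    counts = defaultdict(Counter)
    raw_results = []  # kept so that results can be re-classified later
    for location, fault in fault_list:
        result = simulate(fault)
        raw_results.append((location, fault, result))
        counts[location][classify(result)] += 1
    report = {
        location: {label: 100.0 * n / sum(c.values()) for label, n in c.items()}
        for location, c in counts.items()
    }
    return report, raw_results


# Toy usage: the "simulation" returns the fault itself; 0 means the fault was masked.
faults = [("PC", 0), ("PC", 1), ("PC", 2), ("PC", 3), ("Memory", 0), ("Memory", 0)]
report, raw = run_campaign(
    faults,
    simulate=lambda f: f,
    classify=lambda r: "No effect" if r == 0 else "Exception/timeout",
)
print(report)  # {'PC': {'No effect': 25.0, 'Exception/timeout': 75.0}, 'Memory': {'No effect': 100.0}}
```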



Reliability analysis of a thread-level triple modular redundancy (TMR) edge detector running on a multicore



- A first analysis is performed on the main results
- Classification:
  - No effect
  - Errors
  - Exception/timeout

| Outcome           | Memory (%) | RB0-RB1 (%) | RB2-RB3 (%) | Link Reg. (%) | SP Reg. (%) | PC Reg. (%) |
|-------------------|------------|-------------|-------------|---------------|-------------|-------------|
| No effect         | 92.7       | 35.2        | 79.9        | 22.0          | 30.6        | 22.5        |
| Errors            | 2.0        | 30.2        | 7.7         | 0.0           | 1.4         | 2.0         |
| Exception/timeout | 5.3        | 34.6        | 12.4        | 78.0          | 68.0        | 75.5        |





### Conclusions

The various aspects of the methodology have been presented in three scientific papers:

- A. Miele: A fault-injection methodology for the system-level dependability analysis of multiprocessor embedded systems. Microprocessors and Microsystems, Elsevier, August 2014
- G. Beltrame, C. Bolchini, A. Miele: Multi-level Fault Modeling for Transaction-level Specifications. In Proc. of the ACM Great Lakes Symposium on VLSI (GLSVLSI), 2009
- C. Bolchini, A. Miele, D. Sciuto: Fault Models and Injection Strategies in SystemC Specifications. In Proc. of the Euromicro Conference on Digital System Design (DSD), IEEE, 2008



# ... questions?

Contact: antonio.miele@polimi.it