Reliability against Soft Errors

We proposed a novel, assisted re-execution based technique, FluidCheck (TACO'16), to efficiently detect the occurrence of soft errors in many-core processors. The primary contribution is the proposed processor architecture, which first allows dynamic and flexible migration of leader and checker threads, and second reduces the pressure that execution assistance places on the NoC. The secondary contribution is a family of scheduling heuristics that govern the dynamic migration of leader and checker threads based on the behavior of the multiprogrammed workload running on the processor. We demonstrated that 16-core processors can be protected against soft errors with a mere 27% loss in performance, surpassing traditional schemes by 42%.
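
The checking principle can be illustrated in software. The following is a minimal Java sketch, with purely illustrative names (Committed, assistQueue), of a checker thread that consumes the execution assistance forwarded by a leader thread, re-executes each instruction, and flags divergences as possible soft errors; it models only the leader/checker idea, not the proposed processor architecture.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Illustrative sketch: a leader thread forwards the operands and result of each
    // committed instruction; a checker thread re-executes the instruction and compares.
    public class LeaderCheckerSketch {
        // One forwarded record per committed instruction (names are illustrative).
        record Committed(long opA, long opB, long result) {}

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Committed> assistQueue = new ArrayBlockingQueue<>(1024);

            Thread leader = new Thread(() -> {
                for (long i = 0; i < 1_000; i++) {
                    long a = i, b = 2 * i;
                    long r = a + b;                                  // "execute" an ADD
                    try {
                        assistQueue.put(new Committed(a, b, r));     // forward assistance
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });

            Thread checker = new Thread(() -> {
                for (long i = 0; i < 1_000; i++) {
                    try {
                        Committed c = assistQueue.take();
                        if (c.opA() + c.opB() != c.result())         // re-execute and compare
                            System.err.println("Possible soft error: re-execution diverged");
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });

            leader.start(); checker.start();
            leader.join();  checker.join();
        }
    }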

Security against Hardware Trojans

We proposed a novel hardware trojan detection scheme, SecCheck (ISVLSI'16), for systems containing ICs designed and/or fabricated by untrusted parties. SecCheck advocates employing slow, trusted, home-grown cores to verify the functioning of fast, untrusted, third-party cores. The verification is done by adapting traditional reliability techniques, and its overhead is reduced by temporally overlapping computation and verification wherever possible. We demonstrated that the technique can, in the ideal case, provide security against hardware trojans with a mere 10-17% penalty in performance. We additionally proposed heuristics to quickly compute an efficient schedule of the application on the processor cores. The heuristics speed up schedule computation by a factor of 500, with less than 1% degradation in the quality of the schedule.
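
The benefit of overlapping computation and verification can be shown with a small, hypothetical software model: while a fast, "untrusted" worker computes task i+1, a slow, "trusted" worker re-computes task i and compares the outputs. The names below (OverlapVerifySketch, compute) and the thread-pool setup are illustrative and are not taken from the SecCheck design.

    import java.util.concurrent.*;

    public class OverlapVerifySketch {
        public static void main(String[] args) throws Exception {
            ExecutorService untrustedCore = Executors.newSingleThreadExecutor();
            ExecutorService trustedCore   = Executors.newSingleThreadExecutor();

            Future<Long> pendingCheck = null;
            for (int task = 0; task < 8; task++) {
                final int t = task;
                // The untrusted core produces a result quickly.
                long untrustedResult = untrustedCore.submit(() -> compute(t)).get();
                if (pendingCheck != null) pendingCheck.get();    // drain the previous check
                // The trusted core verifies task t in the background; this verification
                // overlaps with the untrusted core's computation of task t+1.
                pendingCheck = trustedCore.submit(() -> {
                    long golden = compute(t);                    // slow, trusted re-computation
                    if (golden != untrustedResult)
                        System.err.println("Possible trojan: task " + t + " mismatched");
                    return golden;
                });
            }
            if (pendingCheck != null) pendingCheck.get();
            untrustedCore.shutdown();
            trustedCore.shutdown();
        }

        static long compute(int task) {                          // stand-in for real work
            long s = 0;
            for (int i = 0; i <= task * 1000; i++) s += i;
            return s;
        }
    }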

Accountability in 3PIP-containing SoCs

SoCs are increasingly becoming heterogeneous, incorporating designs from various design houses. When an SoC miscomputes or underperforms, none of the design houses wants to be implicated for the error, as this affects its credibility in the market. A perfectly accountable system is one in which the error-causing component can be identified unambiguously. Debugging complex SoCs requires run-time information to help the debug engineers re-create the scenario under which the error occurred. Such logging is typically done through specialized circuitry placed by the system integrator (SI). The SI, however, cannot be deemed a trustworthy organization (the IEEE DASC standard 1735 even describes how third-party IP vendors can protect themselves from the SI stealing their designs). Thus, having the SI log the run-time information is unfair to the third-party IP (3PIP) vendors -- the SI may simply fabricate logs to shift the blame towards a 3PIP. I proposed that fair information collection, required for the desired accountability, can be achieved by having a trusted online audit system that authenticates all communication between components designed by different design houses. If two such on-chip components communicate, the contents, the time of sending, and the time of receipt of the message are certified. This ensures that the collected logs are authentic and that the debugging performed using them is fair. Realizing a fair, trustworthy, and efficient audit system is a challenge, and I have proposed a range of alternative designs for it.

I proposed a new model of representing SoCs, and classified SoCs based on this model. This helps better understand the accountability problem and organize the various solutions to it. The most complete solution, which makes the fewest assumptions about the scenario, employs redundancy to achieve accountability: both communicating components have an embedded meter, and every communicated message is certified by both meters.
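
The sketch below illustrates what such a dual-meter certified log entry could look like, under the simplifying assumption that each meter holds a secret key and certifies the record (contents, send time, receipt time) with a keyed MAC. The key names, message format, and choice of HmacSHA256 are illustrative, not the actual meter design.

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class DualMeterAuditSketch {
        // Each embedded meter certifies the record with its own secret key, so
        // neither the sender nor the receiver can forge the log entry alone.
        static String certify(byte[] key, String record) throws Exception {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return Base64.getEncoder()
                         .encodeToString(mac.doFinal(record.getBytes(StandardCharsets.UTF_8)));
        }

        public static void main(String[] args) throws Exception {
            byte[] senderMeterKey   = "sender-meter-secret".getBytes(StandardCharsets.UTF_8);
            byte[] receiverMeterKey = "receiver-meter-secret".getBytes(StandardCharsets.UTF_8);

            String contents = "WRITE addr=0x1000 data=0xCAFE";   // illustrative message
            long   sendTime = 120_450;                           // cycle the message was sent
            long   recvTime = 120_480;                           // cycle the message was received

            String record = contents + "|sent@" + sendTime + "|recv@" + recvTime;
            String senderCert   = certify(senderMeterKey, record);
            String receiverCert = certify(receiverMeterKey, record);

            // Both certificates are stored with the log entry; an auditor holding the
            // meter keys can later check that neither side tampered with the record.
            System.out.println(record + " | " + senderCert + " | " + receiverCert);
        }
    }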

When a component miscomputes or underperforms, it would like to sabotage the audit process so that it is not implicated. Given such an adversarial environment, any proposed auditing solution must be proven resilient against such attempts at sabotage. I proposed a game-theory-based approach to formally prove the resilience of an auditing solution, and used this approach to prove the resilience of the proposed solutions.

The paradigm of heterogeneous SoCs is the way forward. However, accountability must be enforced, and enforced economically, if the paradigm is to survive. I designed and evaluated the redundant metering solution described earlier, and demonstrated that accountability can indeed be achieved with meager area (0.194%) and performance (0.49%) overheads.

I presented an introductory version of this work as "Sec-X" (ISVLSI'15). A more refined and thorough treatment has been submitted to the journal "ACM Transactions on Embedded Computing Systems" (TECS). The reviews are awaited.

Hardware Implementation of the kCAS Synchronization Primitive

Modern processors provide atomic instructions such as compare-and-set (CAS) that operate on exactly one address. The implementation of lock-free data structures can be greatly simplified if the hardware provides a kCAS instruction -- an instruction that performs the compare-and-set on k addresses. Along with a Master's student, I proposed a design that realizes the kCAS instruction; we believe we are the first to attempt this. The design is essentially an extension of the coherence protocol, with additional memory per core to maintain the state of an ongoing kCAS instruction. We also proposed an optimization that performs an early back-off when we can determine that the ongoing kCAS is going to fail. This reduces the time spent on kCAS operations and hence improves performance. The kCAS instruction not only enables the programmer to implement and debug parallel data structures easily, but also improves application performance by 4X on average, largely because kCAS simplifies the data structures' implementation. Our design has a minimal area overhead of 0.0016%.
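
To make the semantics concrete, the following is a hypothetical software reference model of kCAS: all k locations are compared against their expected values and, only if every comparison succeeds, all k locations are updated in one atomic step. The model uses a single lock purely for clarity; the actual design achieves atomicity through extensions to the coherence protocol.

    public class KCasModel {
        private final long[] memory = new long[1024];

        // Returns true iff every memory[addr[i]] == expected[i]; in that case all
        // k locations are updated to newVal[i] as a single atomic step.
        public synchronized boolean kcas(int[] addr, long[] expected, long[] newVal) {
            for (int i = 0; i < addr.length; i++)
                if (memory[addr[i]] != expected[i])
                    return false;                        // give up on the first mismatch
            for (int i = 0; i < addr.length; i++)
                memory[addr[i]] = newVal[i];
            return true;
        }

        public synchronized long read(int addr) { return memory[addr]; }

        public static void main(String[] args) {
            KCasModel mem = new KCasModel();
            // e.g. atomically update two words of a lock-free data structure
            // (here, two word addresses) in one step.
            boolean ok = mem.kcas(new int[]{10, 20}, new long[]{0, 0}, new long[]{42, 43});
            System.out.println("kCAS succeeded: " + ok + ", mem[10] = " + mem.read(10));
        }
    }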

This work was presented at the "Design, Automation and Test in Europe" (DATE'17) conference.

Tejas Architectural Simulator

Our research group has developed an architectural simulator, Tejas (PATMOS'15), that is capable of simulating state-of-the-art multi-core processor SoCs. It has been validated against real hardware, and demonstrated to be on par with other cycle-accurate simulators in terms of simulation speed. Tejas has been made open source under the Apache v2 license. It is currently used in 51 countries by over 950 users, for both teaching and research purposes. I am among the chief designers, developers, and maintainers of Tejas. Please visit the Tejas web page for more information.

Techniques to Improve the Performance of an Architectural Simulator

One way to accelerate the simulation of parallel benchmarks is to have multiple simulator threads, each simulating a different core (or set of cores) of the simulated architecture. There are, however, two issues with this approach. First, accesses to shared structures such as the last-level cache or the directory require synchronization between the threads, which degrades performance. We proposed a novel lock-free implementation of the ports that govern access to the shared structures, which reduces the synchronization penalty to a large extent. Second, the individual simulator threads progress at different speeds depending on how they are scheduled. Consequently, at any point in time, the clocks of the structures they simulate have different values. The simulator threads therefore need to be periodically synchronized to keep the clock skew within acceptable bounds. This synchronization, usually done using barriers, again degrades performance. We proposed to use an advanced variant of a barrier, known as a phaser, for this purpose. Our techniques helped us achieve a performance gain of around 11X when employing 64 simulator threads. We have submitted this work to the journal "ACM Transactions on Modeling and Computer Simulation" (TOMACS), and have been asked to make a minor revision.
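
The phaser-based skew control described above can be sketched with Java's java.util.concurrent.Phaser, assuming a simple model in which each simulator thread advances its local clock by a fixed quantum of cycles and then waits for the others before starting the next quantum; the thread count and quantum below are illustrative.

    import java.util.concurrent.Phaser;

    public class PhaserSyncSketch {
        static final int THREADS = 4;          // simulator threads (illustrative)
        static final int QUANTUM = 1_000;      // cycles simulated between sync points
        static final int QUANTA  = 10;         // total quanta to simulate

        public static void main(String[] args) throws InterruptedException {
            Phaser phaser = new Phaser(THREADS);
            Thread[] workers = new Thread[THREADS];
            for (int id = 0; id < THREADS; id++) {
                final int core = id;
                workers[id] = new Thread(() -> {
                    long clock = 0;
                    for (int q = 0; q < QUANTA; q++) {
                        clock += QUANTUM;                // simulate QUANTUM cycles of this core
                        phaser.arriveAndAwaitAdvance();  // clock skew never exceeds one quantum
                    }
                    System.out.println("core " + core + " finished at cycle " + clock);
                });
                workers[id].start();
            }
            for (Thread t : workers) t.join();
        }
    }

Unlike a plain barrier, a phaser allows parties to register and deregister dynamically, which is convenient when the set of active simulator threads changes over the course of a run.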

We proposed another technique to improve simulator performance, this time for serial benchmarks. The idea exploits the following observation: the architectural state at a certain point in the simulation, say at the x-th instruction, can be approximately attained by simulating only the last Δ instructions before the x-th one; it is not necessary to simulate every instruction from the first to the x-th. With this principle, the simulation can be broken down into disjoint, contiguous chunks of instructions. Each chunk can be simulated in parallel, and the results can be aggregated once all the chunks are simulated. We demonstrated that this technique can obtain speedups of up to 10X when 16-way chunking is performed.
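
The chunking scheme can be sketched as follows, under the simplifying assumption of a trace of TOTAL instructions split into CHUNKS equal chunks, each preceded by a DELTA-instruction warm-up whose statistics are discarded; the simulate stub and all constants are placeholders for a real timing model.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class ChunkedSimSketch {
        static final long TOTAL  = 16_000_000;  // instructions in the benchmark (illustrative)
        static final int  CHUNKS = 16;
        static final long DELTA  = 100_000;     // warm-up window before each chunk

        // Stand-in for simulating instructions [from, to); returns cycles consumed.
        static long simulate(long from, long to, boolean countStats) {
            return countStats ? (to - from) : 0; // a real simulator would model timing here
        }

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(CHUNKS);
            List<Future<Long>> results = new ArrayList<>();
            long chunkSize = TOTAL / CHUNKS;

            for (int c = 0; c < CHUNKS; c++) {
                final long start = c * chunkSize, end = start + chunkSize;
                results.add(pool.submit(() -> {
                    long warmStart = Math.max(0, start - DELTA);
                    simulate(warmStart, start, false);   // warm up state, discard statistics
                    return simulate(start, end, true);   // simulate the chunk itself
                }));
            }

            long totalCycles = 0;
            for (Future<Long> f : results) totalCycles += f.get();  // aggregate chunk results
            pool.shutdown();
            System.out.println("Estimated cycles: " + totalCycles);
        }
    }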