Chao Lu, Esha Telang, Aydin Aysu, Kanad Basu. Preprint. arXiv preprint arXiv:2401.01521
Quantum computing offers significant acceleration capabilities over its classical counterpart in various application domains. Consequently, there has been substantial focus on improving quantum computing capabilities. However, to date, the security implications of these quantum computing platforms have been largely overlooked. With the emergence of cloud-based quantum computing services, it is critical to investigate the extension of classical computer security threats to the realm of quantum computing. In this study, we investigated timing-based side-channel vulnerabilities within IBM’s cloud-based quantum service. The proposed attack effectively subverts the confidentiality of the executed quantum algorithm, using a more realistic threat model compared to existing approaches. Our experimental results, conducted using IBM’s quantum cloud service, demonstrate that with just 10 measurements, it is possible to identify the underlying quantum computer that executed the circuit. Moreover, when evaluated using the popular Grover circuit, we showcase the ability to leak the quantum oracle with a mere 500 measurements. These findings underline the pressing need to address timing-based vulnerabilities in quantum computing platforms and advocate for enhanced security measures to safeguard sensitive quantum algorithms and data.
Anuj Dubey, Aydin Aysu. Conference Paper. IEEE International Test Conference (ITC)
Machine learning (ML) has recently emerged as an application with confidentiality needs. A trained ML model is indeed a high-value intellectual property (IP), making it a lucrative target for notorious side-channel attacks. Recent works have already shown the possibility of reverse engineering the model internals by exploiting side channels such as timing and power consumption, but the defenses are largely unexplored. Preventing ML IP theft is highly relevant given that the demand for ML will only increase in the coming years. Securing ML hardware against side-channel attacks requires analyzing the vulnerabilities in the current ML applications and developing full-stack countermeasures from the ground up, covering cryptographic proofs, circuit design, firmware support, architecture/microarchitecture integration, compiler extensions, software design, and physical testing. There is a need to work on all abstraction levels because focusing on just one or a few levels cannot provide a complete solution to this nascent problem.
Our research achieves four key objectives to realize the first complete solution for side-channel protected ML. First, we analyze the side-channel vulnerabilities in the various hardware blocks of an ML accelerator and assess the feasibility of model parameter extraction. Second, we design provably-secure gadgets, implement them on FPGA, and empirically validate possible countermeasures. Third, we add usability and flexibility to the solution: the ability to support multiple ML architectures via secure software APIs and compiler extensions on a RISC-V core. Fourth, we fabricate the final solution at the SkyWater 130 nm node.
Emre Karabulut, Aydin Aysu. Preprint. Cryptology ePrint Archive
Sampling random values from a discrete Gaussian distribution with high precision is a major and computationally intensive operation of upcoming or existing cryptographic standards. FALCON is one such algorithm that the National Institute of Standards and Technology chose to standardize as a next-generation, quantum-secure digital signature algorithm. The discrete Gaussian sampling of FALCON has both flexibility and efficiency needs: it constitutes 72% of total signature generation in reference software and requires sampling from a variable mean and standard deviation. Unfortunately, there are no prior works on accelerating this complete sampling procedure. In this paper, we propose a hardware-software co-design for accelerating FALCON’s discrete Gaussian sampling subroutine. The proposed solution handles the flexible computations for setting the variable parameters in software and executes core operations with low-latency, parameterized, custom hardware. The hardware parameterization allows trading off area vs. performance. On a Xilinx SoC FPGA architecture, the results show that compared to the reference software, our solution can accelerate the sampling up to 9.83× and the full signature scheme by 2.7×. Moreover, we quantified that our optimized multiplier circuits can improve the throughput over a straightforward implementation by 60%.
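To make the "variable mean and standard deviation" requirement concrete, here is a minimal sketch of a discrete Gaussian sampler based on plain rejection sampling. It is an illustration only: it is not FALCON's sampler, it is not constant-time, and the function name, the `tail` cut-off, and the use of Python's `random` module are assumptions made for this example.

```python
import math
import random


def sample_discrete_gaussian(mu, sigma, rng, tail=10):
    """Draw one integer z with probability proportional to
    exp(-(z - mu)^2 / (2 * sigma^2)), using rejection sampling.

    Hypothetical helper for illustration; not suitable for cryptographic use.
    """
    # Candidate range: integers within `tail` standard deviations of the mean.
    lo = math.ceil(mu - tail * sigma)
    hi = math.floor(mu + tail * sigma)
    while True:
        z = rng.randint(lo, hi)  # uniform candidate in [lo, hi]
        accept_prob = math.exp(-((z - mu) ** 2) / (2 * sigma ** 2))
        if rng.random() < accept_prob:
            return z


if __name__ == "__main__":
    rng = random.Random(1)
    # The mean and standard deviation may change on every call.
    print([sample_discrete_gaussian(0.3, 1.5, rng) for _ in range(10)])
```

Because most candidates are rejected, a loop of this kind is slow and its running time depends on the random values drawn, which is one reason dedicated, carefully engineered samplers are used in practice.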
Furkan Aydin, Aydin Aysu. Preprint. Cryptology ePrint Archive
Homomorphic encryption (HE) allows computing on encrypted data in the ciphertext domain without knowing the encryption key. It is possible, however, to break fully homomorphic encryption (FHE) algorithms by using side channels. This article demonstrates side-channel leakages of the Microsoft SEAL HE library. The proposed attack can steal encryption keys during the key generation phase by abusing the leakage of ternary value assignments that occurs during the number theoretic transform (NTT) algorithm. We propose two attacks: one for non-optimized code compiled with the -O0 flag, which targets addition and subtraction operations, and one for code compiled with the -O3 optimization flag, which targets guard and mul_root operations. In particular, the attacks can steal the secret key coefficients from a single power/electromagnetic measurement trace of SEAL’s NTT implementation. To achieve high accuracy with a single trace, we develop novel machine-learning side-channel profilers. On an ARM Cortex-M4F processor, our attacks are able to extract secret key coefficients with an accuracy of 98.3% when compiler optimization is disabled, and 98.6% when compiler optimization is enabled. We finally demonstrate that our attack can evade an application of the random delay insertion defense.
Ahmet Can Mert, Ferhat Yaman, Emre Karabulut, Erdinc Ozturk, Erkay Savas, Aydin Aysu. Book. International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation
This survey summarizes the software implementation knowledge of the Number Theoretic Transform (NTT), a major subroutine of lattice-based cryptosystems. The NTT is a special type of Fast Fourier Transform defined over finite fields, and as such, NTT enables faster polynomial multiplication. There has been over a decade of NTT implementations following different design methods (e.g., CPU vs. GPU), aiming at different optimization goals (e.g., memory footprint vs. high throughput), and proposing different styles of optimizations at different abstraction levels (e.g., arithmetic vs. assembly). At the same time, there are several techniques for evaluating and mitigating implementation attacks on NTT. Yet there is no quick guideline to help new developers/practitioners or future researchers given the continuing industry and academic efforts on NTT implementations. Our goal in this paper is to provide an overview of a decade of work. To that end, we survey NTT software implementations and categorize them based on their target platforms, optimization goals, and implementation security enhancements. We furthermore provide an executive summary of the key ideas proposed in related works. We hope this paper will serve as a pit stop for designers entering the NTT world and help them navigate to their destination.
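As a rough illustration of why the NTT speeds up polynomial multiplication, the sketch below multiplies two small polynomials by transforming them, multiplying pointwise, and transforming back. The parameters (modulus 17, length 8, root of unity 2) are toy values chosen for this example, and the recursive radix-2 structure is only one of many implementation styles; none of it is taken from the surveyed works.

```python
Q = 17      # toy prime modulus (assumption for this example)
N = 8       # transform length; N must divide Q - 1
OMEGA = 2   # primitive N-th root of unity modulo Q (2**8 % 17 == 1, 2**4 % 17 == 16)


def ntt(a, omega=OMEGA, q=Q):
    """Recursive radix-2 Cooley-Tukey NTT of a list whose length is a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    omega_sq = omega * omega % q
    even = ntt(a[0::2], omega_sq, q)
    odd = ntt(a[1::2], omega_sq, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q            # butterfly: upper half
        out[k + n // 2] = (even[k] - t) % q   # butterfly: lower half
        w = w * omega % q
    return out


def intt(a, omega=OMEGA, q=Q):
    """Inverse NTT: forward transform with omega^-1, then scale by n^-1."""
    n = len(a)
    inv_omega = pow(omega, -1, q)  # modular inverse (Python 3.8+)
    inv_n = pow(n, -1, q)
    return [x * inv_n % q for x in ntt(a, inv_omega, q)]


def poly_mul(f, g):
    """Multiply polynomials with coefficients mod Q; len(f) + len(g) - 1 must be <= N."""
    fa = f + [0] * (N - len(f))  # zero-pad so cyclic wrap-around does not occur
    ga = g + [0] * (N - len(g))
    pointwise = [x * y % Q for x, y in zip(ntt(fa), ntt(ga))]
    return intt(pointwise)


def schoolbook_mul(f, g):
    """Reference O(n^2) multiplication used to check the NTT result."""
    out = [0] * N
    for i, x in enumerate(f):
        for j, y in enumerate(g):
            out[(i + j) % N] = (out[(i + j) % N] + x * y) % Q
    return out


if __name__ == "__main__":
    f, g = [1, 2, 3, 4], [5, 6, 7, 8]
    print(poly_mul(f, g), schoolbook_mul(f, g))
```

The transform-based route needs on the order of n log n multiplications instead of n squared, which is the advantage the survey refers to; production code typically uses iterative, in-place butterflies and negacyclic variants rather than this recursive form.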
Aydin Aysu, Scott R Graham. Journal Paper. Digital Threats: Research and Practice
Digital threats to computing and network security continue their relentless advance, with malware growing in sophistication and frequently targeting the lower levels of the computing stack. Recently, these threats have evolved and started to target hardware vulnerabilities. Such attacks are hard to detect at the higher abstraction levels and even harder to mitigate given the challenges of changing the hardware infrastructure. To be effective, defensive measures must also consider the physical effects of computing at lower levels of the hardware stack, defending against hardware side-channel attacks, counterfeit chip production, untrusted foundries, and supply-chain security threats, as well as incorporating side-channel information and other hardware-layer behaviors in defensive tools.
This special issue invited submissions in this area, including novel research and experimentation results involving digital threats to hardware security. We selected six papers ranging from exploitation of hardware effects through software- and hardware-supported protections. Each submission was reviewed by at least three reviewers through multiple rounds, helping to strengthen and clarify the authors’ contributions. Below we outline all six outstanding articles that are accepted for publication in this special issue.
Anuj Dubey, Rosario Cammarota, Avinash Varna, Raghavan Kumar, Aydin Aysu. Conference Paper. IEEE International Symposium on Hardware Oriented Security and Trust (HOST)
Physical side-channel attacks are a major threat because they can steal confidential data from devices. There has been a recent surge in such attacks on edge machine learning (ML) hardware to extract the model parameters. Consequently, there has also been work, although limited, on building corresponding defenses against such attacks. Current solutions take either fully software- or fully hardware-centric approaches, which are limited in performance and flexibility, respectively.
In this paper, we propose the first hardware-software co-design solution for building side-channel-protected ML hardware. Our solution targets edge devices and addresses both performance and flexibility needs. To that end, we develop a secure RISC-V-based coprocessor design that can execute a neural network implemented in C/C++. Our coprocessor uses masking to execute various neural network operations like weighted summations, activation functions, and output layer computation in a side-channel secure fashion. We extend the original RV32I instruction set with custom instructions to control the masking gadgets inside the secure coprocessor. We further use the custom instructions to implement easy-to-use APIs that are exposed to the end-user as a shared library. Finally, we demonstrate the empirical side-channel security of the design up to 1M traces.
Emre Karabulut, Amro Awad, Aydin Aysu. Conference Paper. IEEE International Symposium on Circuits and Systems (ISCAS)
FPGAs have recently been added to the cloud to offer energy-efficient acceleration. Multi-tenancy is an emerging phenomenon in cloud FPGAs to enable resource efficiency. In a multi-tenant scenario, multiple users can share the same FPGA fabric either spatially (i.e., tenants share different resources at the same time) or temporally (i.e., tenants share the same resources in different time slots). Undesired access to or manipulation of other tenants’ data can cause security and safety issues. Although safety/security concepts in access control policies have been thoroughly studied in conventional cloud systems, they are relatively unknown for cloud FPGAs. Moreover, these concepts may not trivially extend to cloud FPGAs due to their different nature. This paper proposes an improved access control mechanism for multi-tenant cloud FPGAs. Compared to existing commercial tools, our solution allows dynamic configuration of access control privileges. Compared to earlier academic proposals with dynamic configuration, the results show that our proposal has three advantages: (i) enabling secure resource sharing of on-chip BRAMs to tenants, (ii) enabling safe sharing by resolving deadlocks and faulty access requests, and (iii) improvement in latency and throughput.
Ashley Calhoun, Erick Ortega, Ferhat Yaman, Anuj Dubey, Aydin Aysu. Conference Paper. Proceedings of the Great Lakes Symposium on VLSI 2022, Jun 2022
Hardware security for machine learning (ML) and artificial intelligence (AI) circuits is becoming a major topic within the cybersecurity framework. Although much research is ongoing on this front, the community omits the educational components. In this paper, we present a training module comprising a set of hands-on experiments that allow teaching hardware security concepts to newcomers. Specifically, we propose 5 experiments and related training material that teach side-channel attacks and defenses on the hardware implementations of neural networks. We report the organization and the findings after testing these experiments with sophomore undergraduate students at North Carolina State University. The students first study the basics of neural networks and then build a neural network inference circuit on a breadboard. They then conduct a differential power analysis attack on the hardware to steal trained weights and build a circuit-balancing (hiding) style defense to mitigate the attack. The students develop all related hardware and software code to perform attacks and build defenses. The results show that such complex notions of digital circuit design, neural networks, and side-channel analysis can be taught at the sophomore level with a well-thought-out set of experiments. Future extensions could include establishing an online infrastructure for remote teaching and efficient scaling to a broader audience.
Seetal Potluri, Shamik Kundu, Akash Kumar, Kanad Basu, Aydin Aysu. Journal Paper. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Aug 2022
Existing logic-locking attacks are known to successfully decrypt a functionally correct key of a locked combinational circuit. Extensions of these attacks to real-world Intellectual Properties (IPs, which are sequential circuits) have been demonstrated through the scan chain by selectively initializing the combinational logic and analyzing the responses. In this paper, we propose SeqL+ to mitigate a broad class of such attacks. The key idea is to lock selective functional-input/scan output pairs of flip-flops without feedback to cause attackers to decrypt an incorrect key, and to scramble flip-flops with feedback to increase key-length without introducing further vulnerabilities.
We conduct a formal study of the scan-locking and scan-scrambling problems and demonstrate automating our proposed defense on any given IP. This study reveals the first formulation and complexity analysis of Boolean Satisfiability (SAT)-based attack on scan-scrambling. We formulate the attack as a conjunctive normal form (CNF) using a worst-case reduction in terms of scramble-graph size n, making SAT-based attack applicable, and show that scramble equivalence classes are equi-sized and of cardinality 1. In order to defeat SAT-attack, we propose an iterative swapping-based scan-cell scrambling algorithm that has O(n) implementation time-complexity and SAT-decryption time-complexity in terms of a user-configurable cost constraint α (0 < α ≤ 1).
We empirically validate that SeqL+ hides functionally correct keys from the attacker, thereby increasing the likelihood of the decrypted key being functionally incorrect. When tested on pipelined combinational benchmarks (ISCAS, MCNC), sequential benchmarks (ITC) and a fully fledged RISC-V CPU, SeqL+ gave 100% resilience to a broad range of state-of-the-art attacks including SAT, Double-DIP, HackTest, SMT, FALL, Shift-and-Leak, Multi-cycle, Scan-flushing, and Removal attacks.
Gregor Haas, Aydin Aysu. Conference Paper. Accepted to the Design Automation Conference (DAC), San Francisco, USA, Jul 2022
Cryptographic instruction set extensions are commonly used for ciphers which would otherwise face unacceptable side channel risks. A prominent example of such an extension is the ARMv8 Cryptographic Extension, or ARM CE for short, which defines dedicated instructions to securely accelerate AES. However, while these extensions may be resistant to traditional “digital” side channel attacks, they may still be vulnerable to physical side channel attacks.
In this work, we demonstrate the first such attack on a standard ARM CE AES implementation. We specifically focus on the implementation used by Apple’s CoreCrypto library which we run on the Apple A10 Fusion SoC. To that end, we implement an optimized side channel acquisition infrastructure involving both custom iPhone software and accelerated analysis code. We find that an adversary which can observe 5-30 million known-ciphertext traces can reliably extract secret AES keys using electromagnetic (EM) radiation as a side channel. This corresponds to an encryption operation on less than half of a gigabyte of data, which could be acquired in less than 2 seconds on the iPhone 7 we examined. Our attack thus highlights the need for side channel defenses for real devices and production, industry-standard encryption software.
Furkan Aydin, Emre Karabulut, Seetal Potluri, Erdem Alkim, Aydin Aysu. Conference Paper. Design, Automation and Test in Europe Conference, Antwerp, Belgium, Mar 2022
This paper demonstrates the first side-channel attack on homomorphic encryption (HE), which allows computing on encrypted data. We reveal a power-based side-channel leakage of Microsoft SEAL prior to v3.6 that implements the Brakerski/Fan-Vercauteren (BFV) protocol. Our proposed attack targets the Gaussian sampling in SEAL’s encryption phase and can extract the entire message with a single power measurement.
Our attack works by (1) identifying each coefficient index being sampled, (2) extracting the sign value of the coefficients from control-flow variations, (3) recovering the coefficients with a high probability from data-flow variations, and (4) using a Blockwise Korkine-Zolotarev (BKZ) algorithm to efficiently explore and estimate the remaining search space. Using real power measurements, the results on a RISC-V FPGA implementation of SEAL (v3.2) show that the proposed attack can reduce the plaintext encryption security level from 2^128 to 2^4.4. Therefore, as HE gears toward real-world applications, such attacks and related defenses should be considered.
Archit Gajjar, Priyank Kashyap, Aydin Aysu, Paul Franzon, Sumon Dey, Chris Cheng. Conference Paper. IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), New York, USA, Mar 2022
Advanced ensemble trees have proven quite effective in providing real-time predictions for applications such as ransomware detection, medical diagnosis, recommendation engines, fraud detection, failure prediction, and crime risk, to name a few. In particular, XGBoost, one of the most prominent and widely used decision-tree methods, has gained popularity due to various optimizations of the gradient boosting framework that provide increased accuracy for classification and regression problems. XGBoost’s relatively fast training, handling of missing values, flexibility, and parallel processing make it a good candidate for handling data center workloads. Today’s data centers, with enormous Input/Output Operations per Second (IOPS), demand real-time accelerated inference with low latency and high throughput because of the significant data processing required by applications such as ransomware detection or fraud detection. This paper showcases an FPGA-based XGBoost accelerator, designed with High-Level Synthesis (HLS) tools and design flow, that accelerates binary classification inference. We employ Alveo U50 and U200 to demonstrate the performance of the proposed design and compare it with existing state-of-the-art CPU (Intel Xeon E5-2686 v4) and GPU (Nvidia Tensor Core T4) implementations with relevant datasets. We show a latency speedup of our proposed design over state-of-the-art CPU and GPU implementations, along with gains in energy efficiency and cost-effectiveness. The proposed accelerator is up to 65.8x and 5.3x faster in terms of latency than the CPU and GPU, respectively. The Alveo U50 is a more cost-effective device, and the Alveo U200 stands out as more energy-efficient.
Anuj Dubey, Emre Karabulut, Amro Awad, Aydin Aysu. Conference Paper. 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS)
This paper shows the first side-channel attack on neural network (NN) IPs through a remote power monitor. We demonstrate that a remote monitor implemented with time-to-digital converters can be exploited to steal the weights from a hardware implementation of NN inference. Such an attack alleviates the need to have physical access to the target device and thus expands the attack vector to multi-tenant cloud FPGA platforms. Our results quantify the effectiveness of the attack on an FPGA implementation of NN inference and compare it to an attack with physical access. We demonstrate that it is indeed possible to extract the weights using DPA with 25000 traces if the SNR is sufficient. The paper, therefore, motivates secure virtualization: to protect the confidentiality of high-valued NN model IPs in multi-tenant execution environments, platform developers need to employ strong countermeasures against physical side-channel attacks.
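The general idea of recovering a weight from power measurements can be sketched with a simulated, correlation-based variant of power analysis. Everything in the sketch is an assumption made for illustration: the leakage is modeled as the Hamming weight of an 8-bit product plus Gaussian noise, the traces are generated in software rather than measured through a remote monitor, and the function names are hypothetical. It does not reproduce the paper's setup.

```python
import random


def hamming_weight(v):
    """Number of set bits in a non-negative integer."""
    return bin(v).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def simulate_traces(secret_weight, n_traces, noise_sigma, rng):
    """Simulated device: one leakage sample per known random 8-bit input."""
    inputs = [rng.randrange(256) for _ in range(n_traces)]
    traces = [
        hamming_weight((secret_weight * x) & 0xFF) + rng.gauss(0, noise_sigma)
        for x in inputs
    ]
    return inputs, traces


def recover_weight(inputs, traces):
    """Try all 256 weight guesses and keep the one whose predicted
    leakage correlates best with the observed traces."""
    best_guess, best_corr = None, -1.0
    for guess in range(256):
        model = [hamming_weight((guess * x) & 0xFF) for x in inputs]
        corr = pearson(model, traces)
        if corr > best_corr:
            best_guess, best_corr = guess, corr
    return best_guess


if __name__ == "__main__":
    rng = random.Random(7)
    inputs, traces = simulate_traces(91, 2000, 0.5, rng)
    print(recover_weight(inputs, traces))
```

In a real attack the noise is far higher and the leakage model less exact, which is consistent with the abstract's remark that success depends on the SNR and on collecting enough traces.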
Aydin Aysu. Book. Proceedings of the 2022 on Cloud Computing Security Workshop
FPGAs are increasingly being used in cloud systems, mainly due to their performance and energy advantages. Recent FPGAs have a relatively large amount of resources, which enables multi-tenancy and hence improves the utilization and economic value for both the cloud providers and customers. However, the ability to co-locate designs from different tenants requires efficient safeguards and support. In this talk, I will explore security and safety issues related to multi-tenant cloud FPGA. Specifically, I will describe recent work on remote physical side-channel attacks and power safety issues, and how to mitigate them for next-generation cloud infrastructure.
Furkan Aydin, Aydin Aysu. Conference Paper. Proceedings of the 2022 Workshop on Attacks and Solutions in Hardware Security
This paper reveals a new side-channel leakage of the Microsoft SEAL homomorphic encryption library. The proposed attack exploits the leakage of ternary value assignments made during the Number Theoretic Transform (NTT) sub-routine. Notably, the attack can steal the secret key coefficients from a single power/electromagnetic measurement trace. To achieve high accuracy with a single trace, we build a novel machine-learning-based side-channel profiler. Moreover, we implement a defense mechanism based on random delay insertion to mitigate the shown leakage. The results on an ARM Cortex-M4F processor show that our attack extracts secret key coefficients with 98.3% accuracy and that the random delay insertion defense does not reduce the success rate of our attack.
Emre Karabulut, Chandu Yuvarajappa, Mohammed Iliyas Shaik, Seetal Potluri, Amro Awad, Aydin Aysu. Conference Paper. Proceedings of the 2022 Workshop on Attacks and Solutions in Hardware Security, Los Angeles, USA, Nov 2022
FPGAs are increasingly being used in cloud systems, mainly due to their performance and energy advantages. Recent FPGAs have a relatively large amount of resources, which enables multi-tenancy and hence improves the utilization and economic value for both the cloud providers and customers. However, the ability to co-locate designs from different tenants requires efficient safeguards and support. Fortunately, the majority of the recent FPGAs, e.g., those from Xilinx (currently AMD), include partial reconfiguration (PR) capabilities which enable partitioning and independently programming the FPGA resources. The PR capability of FPGAs is considered vital for the temporal and spatial sharing of FPGAs in cloud environments. In this work, we systematically study how the various power profiles for FPGA partitions can impact the process of programming partitions and the overall functionality of the FPGA. Surprisingly, we observe that high power activity in partitions can significantly impact the programming time of other partitions. Even worse, we observe that carefully crafted power viruses can delay the whole PR process or even make it fail, and in some cases cause the whole FPGA to shut down. Accordingly, we describe such attacks in detail and discuss how they can impact the availability and timeliness (in the case of real-time workloads) of multi-tenant FPGAs. Finally, we propose a lightweight solution that can effectively detect such abnormal power activities and hence block any channels for such attacks before the PR process starts.
Hossein Sayadi, Mehrdad Aliasgari, Furkan Aydin, Seetal Potluri, Aydin Aysu, Jack Edmonds, Sara Tehranipoor. Conference Paper. 2022 IEEE 28th International Symposium on On-Line Testing and Robust System Design (IOLTS)
Recent developments in Artificial Intelligence (AI) and Machine Learning (ML), driven by a substantial increase in the size of data in emerging computing systems, have led to successful applications of such intelligent techniques in various disciplines including security. Traditionally, the integrity of data has been protected with various security protocols at the software level, with the underlying hardware assumed to be secure. This assumption, however, is no longer true given the increasing number of attacks reported on the hardware. The emergence of new security threats (e.g., malware, side-channel attacks, etc.) requires patching/updating the software-based solutions, which needs a vast amount of memory and hardware resources. Therefore, security should be delegated to the underlying hardware, building a bottom-up solution for securing computing devices rather than treating it as an afterthought. This paper highlights the growing role of AI/ML techniques in the hardware and architecture security field and provides insightful discussions on pressing challenges, opportunities, and future directions of designing accurate and efficient machine learning-based attacks and defense mechanisms in response to emerging hardware security vulnerabilities in modern computer systems and the next generation of cryptosystems.
Furkan Aydin, Aydin Aysu, Mohit Tiwari, Andreas Gerstlauer, Michael Orshansky. Journal Paper. ACM Transactions on Embedded Computing Systems (TECS), vol. 20, no. 6, Nov 2021
Key exchange protocols and key encapsulation mechanisms establish secret keys to communicate digital information confidentially over public channels. Lattice-based cryptography variants of these protocols are promising alternatives given their quantum-cryptanalysis resistance and implementation efficiency. Although lattice cryptosystems can be mathematically secure, their implementations have shown side-channel vulnerabilities. But such attacks largely presume collecting multiple measurements under a fixed key, leaving more dangerous single-trace attacks unexplored.
This article demonstrates successful single-trace power side-channel attacks on lattice-based key exchange and encapsulation protocols. Our attack targets both hardware and software implementations of matrix multiplications used in lattice cryptosystems. The crux of our idea is to apply a horizontal attack that makes hypotheses on several intermediate values within a single execution all relating to the same secret, and to combine their correlations for accurately estimating the secret key. We illustrate that the design of protocols combined with the nature of lattice arithmetic enables our attack. Since a straightforward attack suffers from false positives, we demonstrate a novel extend-and-prune procedure to recover the key by following the sequence of intermediate updates during multiplication.
We analyze two protocols, Frodo and FrodoKEM, and reveal that they are vulnerable to our attack. We implement both stand-alone hardware and RISC-V based software realizations and test the effectiveness of the proposed attack by using concrete parameters of these protocols on physical platforms with real measurements. We show that the proposed attack can estimate secret keys from a single power measurement with over 99% success rate.
Seetal Potluri, Aydin Aysu. Conference Paper. IEEE International Conference On Computer Aided Design (ICCAD), Munich, Germany, Nov 2021
Stealing trained machine learning (ML) models is a new and growing concern due to the model’s development cost. Existing work on ML model extraction either applies a mathematical attack or exploits hardware vulnerabilities such as side-channel leakage. This paper shows a new style of attack, for the first time, on ML models running on embedded devices by abusing the scan-chain infrastructure. We illustrate that having coarse-grained scan-chain access to non-linear layer outputs is sufficient to steal ML models. To that end, we propose a novel small-signal analysis inspired attack that applies small perturbations into the input signals, identifies the quiescent operating points, and selectively activates certain neurons. We then couple this with a Linear Constraint Satisfaction based approach to efficiently extract model parameters such as weights and biases. We conduct our attack on neural network inference topologies defined in earlier works, and we automate our attack. The results show that our attack outperforms mathematical model extraction proposed in CRYPTO 2020, USENIX 2020, and ICML 2020 by an increase in accuracy of 2^20.7x, 2^50.7x, and 2^33.9x, respectively, and a reduction in queries by 2^6.5x, 2^4.6x, and 2^14.2x, respectively.
Priyank Kashyap, Furkan Aydin, Seetal Potluri, Paul Franzon, Aydin Aysu. Journal Paper. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 40, no. 6, pp. 1217-1229, Jun 2021
Advancements in quantum computing present a security threat to classical cryptography algorithms. Lattice-based key exchange protocols show strong promise due to their resistance to theoretical quantum-cryptanalysis and low implementation overhead. By contrast, their physical implementations have shown vulnerability against side-channel attacks (SCAs) even with a single power measurement. The state-of-the-art SCAs are, however, limited to simple, sequentialized executions of post-quantum key-exchange (PQKE) protocols, leaving the vulnerability of complex, parallelized architectures unknown. This article proposes 2Deep, a deep-learning (DL)-based SCA with data augmentation techniques, targeting parallelized implementations of PQKE protocols, namely Frodo and NewHope. Specifically, we explore approaches that convert one-dimensional (1D) time-series power measurement data into two-dimensional (2D) images to formulate SCA as an image recognition task. The results show our attack’s superiority over conventional techniques including horizontal differential power analysis (DPA), template attacks (TAs), and straightforward DL approaches. We demonstrate improvements up to 1.5× to recover a 100% success rate compared to DL with 1D input data while using fewer data. We furthermore show that machine learning improves the results up to 1.25× compared to TAs. In addition, we perform cross-device attacks that obtain profiles from a single device, which has never been explored. Our 2D approach is especially favored in this setting, improving the success rate of attacking Frodo from 20% to 99% compared to the 1D approach. Our work thus urges countermeasures even on parallel architectures and single-trace attacks.
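One simple way to picture the 1D-to-2D conversion is to rescale a power trace to 8-bit pixel values and fold it row by row into a fixed-width grid. The sketch below does exactly that; the row-major folding, the zero padding, and the function name are assumptions for illustration, and the article may well use different, more elaborate encodings.

```python
def trace_to_image(trace, width):
    """Convert a 1D sequence of samples into a 2D list of 8-bit pixel rows.

    Hypothetical helper: samples are min-max scaled to 0..255, laid out
    row-major, and the last row is zero-padded to `width` pixels.
    """
    lo, hi = min(trace), max(trace)
    if hi == lo:
        pixels = [0] * len(trace)  # constant trace: no contrast to encode
    else:
        pixels = [round(255 * (v - lo) / (hi - lo)) for v in trace]
    # Pad so the number of pixels is a multiple of the row width.
    remainder = len(pixels) % width
    if remainder:
        pixels += [0] * (width - remainder)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


if __name__ == "__main__":
    for row in trace_to_image([0, 1, 2, 3], width=3):
        print(row)
```

With this layout, samples that are `width` steps apart in time become vertical neighbors, so a 2D convolutional network can pick up periodic structure that a 1D model would only see at long range.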
Anuj Dubey, Afzal Ahmad, Muhammad Adeel Pasha, Rosario Cammarota, Aydin AysuJournal Paper IACR Transactions on Cryptographic Hardware and Embedded Systems, Nov 2021
Intellectual Property (IP) thefts of trained machine learning (ML) models through side-channel attacks on inference engines are becoming a major threat. Indeed, several recent works have shown reverse engineering of the model internals using such attacks, but the research on building defenses is largely unexplored. There is a critical need to efficiently and securely transform those defenses from cryptography such as masking to ML frameworks. Existing works, however, revealed that a straightforward adaptation of such defenses either provides partial security or leads to high area overheads. To address those limitations, this work proposes a fundamentally new direction to construct neural networks that are inherently more compatible with masking. The key idea is to use modular arithmetic in neural networks and then efficiently realize masking, in either Boolean or arithmetic fashion, depending on the type of neural network layers. We demonstrate our approach on the edge-computing friendly binarized neural networks (BNN) and show how to modify the training and inference of such a network to work with modular arithmetic without sacrificing accuracy. We then design novel masking gadgets using Domain-Oriented Masking (DOM) to efficiently mask the unique operations of ML such as the activation function and the output layer classification, and we prove their security in the glitch-extended probing model. Finally, we implement fully masked neural networks on an FPGA, quantify that they can achieve a similar latency while reducing the FF and LUT costs over the state-of-the-art protected implementations by 34.2% and 42.6%, respectively, and demonstrate their first-order side-channel security with up to 1M traces.
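The appeal of modular arithmetic for masking can be illustrated with a small sketch. It assumes a first-order arithmetic masking modulo 2^8 in which each input is split into two shares and the weights multiply each share separately; the modulus, the function names, and the choice of which operand is shared are illustrative assumptions, not details taken from the paper.

```python
import secrets

MOD = 1 << 8  # illustrative modulus; the paper's parameters may differ


def arith_share(x, mod=MOD):
    """Split x into two shares with (s0 + s1) % mod == x % mod."""
    m = secrets.randbelow(mod)
    return ((x - m) % mod, m)


def arith_unshare(s, mod=MOD):
    """Recombine two arithmetic shares."""
    return (s[0] + s[1]) % mod


def masked_dot(weights, shared_inputs, mod=MOD):
    """Weighted sum computed on each share independently.

    Because the sum is linear modulo `mod`, the two accumulators never
    need to be combined during the computation.
    """
    acc0 = acc1 = 0
    for w, (x0, x1) in zip(weights, shared_inputs):
        acc0 = (acc0 + w * x0) % mod
        acc1 = (acc1 + w * x1) % mod
    return (acc0, acc1)
```

For example, `arith_unshare(masked_dot([3, -1, 2], [arith_share(x) for x in (5, 7, 9)]))` recombines to the unmasked weighted sum modulo 256. Non-linear steps such as activation functions do not split this way, which is where dedicated masking gadgets are needed.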
Emre Karabulut, Aydin AysuConference Paper The Design Automation Conference (DAC), San Francisco, USA, Dec 2021
This paper proposes the first side-channel attack on FALCON—a NIST Round-3 finalist for the post-quantum digital signature standard. We demonstrate a known-plaintext attack that uses the electromagnetic measurements of the device to extract the secret signing keys, which then can be used to forge signatures on arbitrary messages. The proposed attack targets the unique floating-point multiplications within FALCON’s Fast Fourier Transform through a novel extend-and-prune strategy that extracts the sign, mantissa, and exponent variables without false positives. The extracted floating-point values are then mapped back to the secret key’s coefficients. Our attack, notably, does not require pre-characterizing the power profile of the target device or crafting special inputs. Instead, the statistical differences on obtained traces are sufficient to successfully execute our proposed differential electromagnetic analysis. The results on an ARM-Cortex-M4 running the FALCON NIST’s reference software show that approximately 10k measurements are sufficient to extract the entire key.
Single-Trace Side-Channel Attacks on ω-Small Polynomial Sampling with Applications to NTRU, NTRU Prime, and CRYSTALS-DILITHIUM
Emre Karabulut, Erdem Alkim, Aydin AysuConference Paper IEEE International Symposium on Hardware Oriented Security and Trust (HOST), Washington DC, USA, Dec 2021
This paper proposes a new single-trace side-channel attack on lattice-based post-quantum protocols. We target the ω-small polynomial sampling of NTRU, NTRU Prime, and CRYSTALS-DILITHIUM algorithms (which are NIST Round-3 finalist and alternative candidates), and we demonstrate the vulnerabilities of their sub-routines to a power-based side-channel attack. Specifically, we reveal that the sorting algorithm in NTRU/NTRU Prime and the shuffling in CRYSTALS-DILITHIUM’s ω-small polynomial sampling process leak information about the ‘-1’, ‘0’, or ‘+1’ assignments made to the coefficients. We further demonstrate that these assignments can be found within a single power measurement and that revealing them allows secret and session key recovery for NTRU/NTRU Prime, while reducing the challenge polynomial’s entropy for CRYSTALS-DILITHIUM. We execute our proposed attacks on an ARM Cortex-M4 microcontroller running the reference software submissions from NIST Round-3 software packages. The results show that our attacks can extract coefficients with a success rate of 99.78% for NTRU and NTRU Prime, reducing the search space to 2^41 or below. For CRYSTALS-DILITHIUM, our attack recovers the coefficients’ signs with over 99.99% success, reducing rejected challenge polynomials’ entropy between 39 and 60 bits. Our work informs the proposers about the single-trace vulnerabilities of their software and urges them to develop single-trace resilient software for low-cost microcontrollers.
Anuj Dubey, Rosario Cammarota, Vikram Suresh, Aydin AysuJournal Paper arXiv preprint arXiv:2109.00187, Sep 2021
Machine learning (ML) models can be trade secrets due to their development cost. Hence, they need protection against malicious forms of reverse engineering (e.g., in IP piracy). With a growing shift of ML to the edge devices, in part for performance and in part for privacy benefits, the models have become susceptible to the so-called physical side-channel attacks. ML being a relatively new target compared to cryptography poses the problem of side-channel analysis in a context that lacks published literature. The gap between the burgeoning edge-based ML devices and the research on adequate defenses to provide side-channel security for them thus motivates our study. Our work develops and combines different flavors of side-channel defenses for ML models in the hardware blocks. We propose and optimize the first defense based on Boolean masking. We first implement all the masked hardware blocks. We then present an adder optimization to reduce the area and latency overheads. Finally, we couple it with a shuffle-based defense. We quantify that the area-delay overhead of masking ranges from 5.4× to 4.7× depending on the adder topology used and demonstrate a first-order side-channel security of millions of power traces. Additionally, the shuffle countermeasure impedes a straightforward second-order attack on our first-order masked implementation.
Francesco Regazzoni, Shivam Bhasin, Amir Ali Pour, Ihab Alshaer, Furkan Aydin, Aydin Aysu, Vincent Beroulle, Giorgio Di Natale, Paul Franzon, David Hely, Naofumi Homma, Akira Ito, Dirmanto Jap, Priyank Kashyap, Ilia Polian, Seetal Potluri, Rei Ueno, Elena-Ioana Vatajelu, Ville Yli-MäyryConference Paper IEEE/ACM International Conference On Computer Aided Design (ICCAD), pp. 1-6, Nov 2020
Machine learning techniques have significantly changed our lives. They have helped improve our everyday routines, and they have also proven to be an extremely helpful tool for more advanced and complex applications. However, the implications of hardware security problems under a massive diffusion of machine learning techniques are still to be completely understood. This paper first highlights novel applications of machine learning for hardware security, such as evaluation of post-quantum cryptography hardware and extraction of physically unclonable functions from neural networks. Later, practical model extraction attacks based on electromagnetic side-channel measurements are demonstrated, followed by a discussion of strategies to protect proprietary models by watermarking them.
Ahmet Can Mert, Emre Karabulut, Erdinc Ozturk, Erkay Savas, Aydin AysuJournal Paper IEEE Transactions on Computers (Early Access), Aug 2020
Efficient lattice-based cryptosystems operate with polynomial rings with the Number Theoretic Transform (NTT) to reduce the computational complexity of polynomial multiplication. NTT has therefore become a major arithmetic component (thus computational bottleneck) in various cryptographic constructions like hash functions, key-encapsulation mechanisms, digital signatures, and homomorphic encryption. Although there exist several hardware designs in prior work for NTT, they all are isolated design instances fixed for specific NTT parameters or parallelization level. This paper provides an extensive study of flexible design methods for NTT implementation. To that end, we evaluate three cases: (1) parametric hardware design, (2) high-level synthesis (HLS) design approach, and (3) design for software implementation compiled on soft-core processors, all of which target reconfigurable hardware devices. We evaluate the designs that implement multiple NTT parameters and/or processing elements, demonstrate the design details for each case, and provide a fair comparison with each other and prior work. On a Xilinx Virtex-7 FPGA, compared to HLS and processor-based methods, the results show that the parametric hardware design is on average 4.4x and 73.9x smaller and 22.5x and 19.3x faster, respectively. Surprisingly, HLS tools can yield less efficient solutions than processor-based approaches in some cases.
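To make the role of the NTT concrete, the following minimal sketch multiplies two small polynomials by transforming, multiplying pointwise, and transforming back. The parameters (q = 17, n = 4, root of unity 4) are toy values chosen for illustration, and the product is cyclic (modulo x^n - 1) for simplicity; lattice schemes typically use much larger parameters and a negacyclic product modulo x^n + 1. The function names are hypothetical and the code reflects none of the paper's hardware designs.

```python
Q = 17      # toy prime modulus (illustrative only)
OMEGA = 4   # primitive 4th root of unity mod 17: 4**4 % 17 == 1, 4**2 % 17 == 16


def ntt(a, omega, q):
    """Recursive radix-2 NTT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    sq = omega * omega % q
    even = ntt(a[0::2], sq, q)
    odd = ntt(a[1::2], sq, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q            # butterfly, upper half
        out[k + n // 2] = (even[k] - t) % q   # uses omega**(n/2) == -1 mod q
        w = w * omega % q
    return out


def intt(a, omega, q):
    """Inverse NTT: forward transform with omega^-1, then scale by n^-1."""
    inv_n = pow(len(a), -1, q)
    return [x * inv_n % q for x in ntt(a, pow(omega, -1, q), q)]


def cyclic_mul_ntt(a, b, omega=OMEGA, q=Q):
    """Multiply a and b modulo (x^n - 1, q) via pointwise products."""
    fa, fb = ntt(a, omega, q), ntt(b, omega, q)
    return intt([x * y % q for x, y in zip(fa, fb)], omega, q)


def cyclic_mul_schoolbook(a, b, q=Q):
    """Quadratic-time reference used to check the NTT result."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] = (out[(i + j) % n] + a[i] * b[j]) % q
    return out
```

Calling `cyclic_mul_ntt([1, 2, 3, 4], [5, 6, 7, 8])` gives the same coefficients as `cyclic_mul_schoolbook` on the same inputs. The modular inverse via `pow(x, -1, q)` requires Python 3.8 or later.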
Aydin Aysu, Furkan Aydin, Priyank Kashyap, Seetal Potluri, Paul FranzonPresentation
Anuj Dubey, Rosario Cammarota, Aydin AysuConference Paper The International Conference on Computer-Aided Design (ICCAD), Pages 1-9, Virtual Conference, Nov 2020
Recent work on stealing machine learning (ML) models from inference engines with physical side-channel attacks warrants an urgent need for effective side-channel defenses. This work proposes the first fully-masked neural network inference engine design.
Masking uses secure multi-party computation to split the secrets into random shares and to decorrelate the statistical relation of secret-dependent computations to side-channels (e.g., the power draw). In this work, we construct secure hardware primitives to mask all the linear and non-linear operations in a neural network. We address the challenge of masking integer addition by converting each addition into a sequence of XOR and AND gates and by augmenting Trichina’s secure Boolean masking style. We improve the traditional Trichina’s AND gates by adding pipelining elements for better glitch-resistance and we architect the whole design to sustain a throughput of 1 masked addition per cycle.
We implement the proposed secure inference engine on a Xilinx Spartan-6 (XC6SLX75) FPGA. The results show that masking incurs an overhead of 3.5% in latency and 5.9× in area. Finally, we demonstrate the security of the masked design with 2M traces.
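The idea of splitting secrets into random shares and building addition from masked XOR and AND gates can be sketched in software. The block below is a functional model only: it uses two Boolean shares per bit, a Trichina-style AND gate with one fresh random bit, and a ripple-carry adder. It does not model the pipelining, glitch behavior, or throughput of the hardware described above, and all function names are hypothetical.

```python
import secrets


def share(bit):
    """Split one bit into two Boolean shares with s0 ^ s1 == bit."""
    m = secrets.randbits(1)
    return (bit ^ m, m)


def unshare(s):
    return s[0] ^ s[1]


def masked_xor(x, y):
    """XOR is linear, so it is applied share by share."""
    return (x[0] ^ y[0], x[1] ^ y[1])


def masked_and(x, y):
    """Trichina-style AND: the fresh bit r is folded in first.

    The output shares satisfy z ^ r == (x0 ^ x1) & (y0 ^ y1).
    """
    r = secrets.randbits(1)
    z = r ^ (x[0] & y[0])
    z ^= x[0] & y[1]
    z ^= x[1] & y[0]
    z ^= x[1] & y[1]
    return (z, r)


def masked_add(xs, ys):
    """Ripple-carry addition on LSB-first lists of shared bits.

    The final carry is dropped, so the result is modulo 2**len(xs).
    """
    carry = share(0)
    out = []
    for x, y in zip(xs, ys):
        p = masked_xor(x, y)                    # propagate
        out.append(masked_xor(p, carry))        # sum bit
        g = masked_and(x, y)                    # generate
        carry = masked_xor(g, masked_and(p, carry))
    return out


def share_int(v, width):
    return [share((v >> i) & 1) for i in range(width)]


def unshare_int(bits):
    return sum(unshare(b) << i for i, b in enumerate(bits))
```

As a usage example, `unshare_int(masked_add(share_int(9, 4), share_int(5, 4)))` recombines to 14. In software the evaluation order inside `masked_and` carries no security guarantee; in hardware, ordering and register placement are what matter for glitch resistance.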
DeePar-SCA: Breaking Parallel Architectures of Lattice Cryptography via Learning Based Side-Channel Attacks
Furkan Aydin, Priyank Kashyap, Seetal Potluri, Paul Franzon, Aydin AysuConference Paper The International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), Pages 262-280, Virtual Conference, Oct 2020
This paper proposes the first deep-learning based side-channel attacks on post-quantum key-exchange protocols. We target hardware implementations of two lattice-based key-exchange protocols, Frodo and NewHope, and analyze power side-channels of the security-critical arithmetic functions. The challenge in applying side-channel attacks stems from the single-trace nature of the protocols: each new execution will use a fresh and unique key, limiting the adversary to a single power measurement. Although such single-trace attacks are known, they have been so far constrained to sequentialized designs running on simple microcontrollers. By using deep-learning and data augmentation techniques, we extend those attacks to break parallelized hardware designs, and we quantify the attack’s limitations. Specifically, we demonstrate single-trace deep-learning based attacks that outperform traditional attacks such as horizontal differential power analysis and template attacks by up to 900% and 25%, respectively. The developed attacks can therefore break implementations that are otherwise secure, motivating active countermeasures even on parallel architectures for key-exchange protocols.
Qinhan Tan, Seetal Potluri, Aydin AysuConference Paper The International Symposium of Circuit and Systems (ISCAS), Virtual Conference, Oct 2020
Intellectual Property (IP) theft is a serious concern for the integrated circuit (IC) industry. To address this concern, the logic locking countermeasure transforms a logic circuit into a different one to obfuscate its inner details. The transformation caused by obfuscation is reversed only upon application of the programmed secret key, thus preserving the circuit’s original function. This technique is known to be vulnerable to Satisfiability (SAT)-based attacks. But in order to succeed, SAT-based attacks implicitly assume a perfectly reverse-engineered circuit, which is difficult to achieve in practice due to reverse engineering (RE) errors caused by automated circuit extraction. In this paper, we analyze the effects of random circuit RE-errors on the success of SAT-based attacks. Empirical evaluation on ISCAS and MCNC benchmarks as well as a fully-fledged RISC-V CPU reveals that the attack success degrades exponentially with an increase in the number of random RE-errors. Therefore, the adversaries either have to equip RE-tools with near perfection or propose better SAT-based attacks that can work with RE-imperfections.
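As a toy illustration of logic locking, the sketch below locks a three-input function with one XOR key gate and one XNOR key gate, then finds the functionally correct key by exhaustive comparison against the original function. The circuit, the key-gate placement, and the function names are invented for this example, and the exhaustive search is only a stand-in for a SAT-based attack, which prunes keys using distinguishing inputs instead of enumerating them.

```python
from itertools import product


def original(a, b, c):
    """Unlocked reference function (plays the role of the oracle)."""
    return (a & b) | c


def locked(a, b, c, key):
    """Locked netlist with two key gates on internal wires."""
    k0, k1 = key
    w = (a & b) ^ k0      # XOR key gate: transparent when k0 == 0
    v = 1 ^ (c ^ k1)      # XNOR key gate: transparent when k1 == 1
    return w | v


def functionally_correct_keys():
    """Keys for which the locked circuit matches the oracle on all inputs."""
    inputs = list(product((0, 1), repeat=3))
    return [
        key
        for key in product((0, 1), repeat=2)
        if all(locked(*x, key) == original(*x) for x in inputs)
    ]
```

Here only the key `(0, 1)` restores the original function. An attacker working from an imperfectly reverse-engineered `locked` netlist would be comparing the oracle against the wrong circuit, which is the situation the paper studies.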
Emre Karabulut, Aydin AysuConference Paper The International Conference on Field-Programmable Logic and Applications (FPL), Pages 1-7, Virtual Conference, Oct 2020
Lattice-based cryptography has been growing in demand due to its quantum attack resiliency. Polynomial multiplication is a major computational bottleneck of lattice cryptosystems. To address the challenge, lattice-based cryptosystems use the Number Theoretic Transform (NTT). Although NTT reduces complexity, it is still a well-known computational bottleneck. At the same time, NTT arithmetic needs vary for different algorithms, motivating flexible solutions.
Although there are prior hardware and software NTT designs, they do not simultaneously offer flexibility and efficiency. This work provides an efficient and flexible NTT solution through domain-specific architectural support on RISC-V. Rather than using instruction-set extensions with compiler modifications or loosely coupling a RISC-V core with an NTT co-processor, our proposal uses application-specific dynamic instruction scheduling, memory dependence prediction, and datapath optimizations. This allows achieving a direct translation of C code to optimized NTT executions. We demonstrate the flexibility of our approach by implementing the NTT used in several lattice-based cryptography protocols: NewHope, qTESLA, CRYSTALS-Kyber, CRYSTALS-Dilithium, and Falcon. The results on the FPGA technology show that the proposed design is respectively 6×, 40×, and 3× more efficient than the baseline solution, Berkeley Out-of-Order Machine, and a prior HW/SW co-design, while providing the needed flexibility.
Anuj Dubey, Rosario Cammarota, Aydin AysuConference Paper IEEE International Symposium on Hardware Oriented Security and Trust (HOST), San Jose, CA, USA, May 2020
Differential Power Analysis (DPA) has been an active area of research for the past two decades to study the attacks for extracting secret information from cryptographic implementations through power measurements and their defenses. Unfortunately, the research on power side-channels has so far predominantly focused on analyzing implementations of ciphers such as AES, DES, RSA, and recently post-quantum cryptography primitives (e.g., lattices). Meanwhile, machine-learning, and in particular deep-learning applications are becoming ubiquitous with several scenarios where the Machine Learning Models are Intellectual Properties requiring confidentiality. The problem of extending side-channel analysis to Machine Learning Model extraction is largely unexplored.
This paper extends the DPA framework to neural-network classifiers. First, it shows DPA attacks on classifiers that can extract the secret model parameters such as weights and biases of a neural network. Second, it proposes the first countermeasures against these attacks by augmenting masking. The resulting design uses novel masked components such as masked adder trees for fully-connected layers and masked Rectifier Linear Units for activation functions. On a SAKURA-X FPGA board, experiments show both the insecurity of an unprotected design and the security of our proposed protected design.
Seetal Potluri, Aydin Aysu, Akash KumarConference Paper IEEE International Symposium on Quality and Electronic Design (ISQED), Virtual Conference, Mar 2020
Existing logic-locking attacks are known to successfully decrypt the functionally correct key of a locked combinational circuit. It is possible to extend these attacks to real-world Silicon-based Intellectual Properties (IPs, which are sequential circuits) through scan-chains by selectively initializing the combinational logic and analyzing the responses. In this paper, we propose SeqL, which achieves functional isolation and locks selective flip-flop functional-input/scan-output pairs, thus rendering the decrypted key functionally incorrect. We conduct a formal study of the scan-locking problem and demonstrate automating our proposed defense on any given IP. We show that SeqL hides functionally correct keys from the attacker, thereby increasing the likelihood of the decrypted key being functionally incorrect. When tested on pipelined combinational benchmarks (ISCAS, MCNC), sequential benchmarks (ITC) and a fully-fledged RISC-V CPU, SeqL gave 100% resilience to a broad range of state-of-the-art attacks including SAT, Double-DIP, HackTest, SMT, FALL, Shift-and-Leak, and Multi-cycle attacks.
A Flexible and Scalable NTT Hardware: Applications from Homomorphically Encrypted Deep Learning to Post-Quantum Cryptography
Ahmet Can Mert, Emre Karabulut, Erdinc Ozturk, Erkay Savas, Michela Becchi, Aydin AysuConference Paper Design, Automation and Test in Europe Conference (DATE 2020), Pages 1-6, Grenoble, France, Mar 2020
The Number Theoretic Transform (NTT) enables faster polynomial multiplication and is becoming a fundamental component of next-generation cryptographic systems. NTT hardware designs have two prevalent problems related to design-time flexibility. First, algorithms have different arithmetic structures, causing the hardware designs to be manually tuned for each setting. Second, applications have diverse throughput/area needs but the hardware has been designed for a fixed, pre-defined number of processing elements.
This paper proposes a parametric NTT hardware generator that takes arithmetic configurations and the number of processing elements as inputs to produce an efficient hardware with the desired parameters and throughput. We illustrate the employment of the proposed design in two applications with different needs: A homomorphically encrypted deep neural network inference (CryptoNets) and a post-quantum digital signature scheme (qTESLA). We propose the first NTT hardware acceleration for both applications on FPGAs. Compared to prior software and high-level synthesis solutions, the results show that our hardware can accelerate NTT up to 28× and 48×, respectively. Therefore, our work paves the way for high-level, automated, and modular design of next-generation cryptographic hardware solutions.
Fast and Efficient Implementation of Lightweight Crypto Algorithm PRESENT on FPGA through Processor Instruction Set Extension
Abdullah Varici, Gurol Saglam, Seckin Ipek, Abdullah Yildiz, Sezer Goren, Aydin Aysu, Deniz Iskender, T. Baris Aktemur, H. Fatih UgurdagConference Paper IEEE East-West Design & Test Symposium (EWDTS), Pages 1-5, Batum, Georgia, Sep 2019
As Internet of Things (IoT) technology becomes widespread, the importance of information security increases. PRESENT algorithm is a major lightweight symmetric-key encryption algorithm for IoT devices. Compared to the Advanced Encryption Standard (AES), PRESENT uses a lower amount of resources while achieving the same level of security. In this paper, we implement PRESENT with different design methodologies including hand-coded RTL, Vivado HLS, PicoBlaze, VerySimpleCPU (VSCPU) based microcontrollers, and a customized VSCPU. The customized VSCPU design is based on optimizing the instruction set architecture for the algorithm specifics of PRESENT. Our results show that the customized VSCPU design methodology can be more efficient than HLS and PicoBlaze while providing the flexibility compared to RTL designs.
Aydin AysuConference Paper Great Lakes Symposium on VLSI, Pages 237-242, Tysons Corner, USA, May 2019
Evolving threats against cryptographic systems and the increasing diversity of computing platforms enforce teaching cryptographic engineering to a wider audience. This paper describes the development of a new graduate course on hardware security taught at North Carolina State University during Fall 2018. The course targets an audience with no background on cryptography or hardware vulnerabilities. The course focuses especially on post-quantum cryptosystems (the next-generation cryptosystems mitigating quantum computer attacks) and evolves into designing specialized hardware accelerators for post-quantum cryptography, executing sophisticated implementation attacks (e.g., side-channel and fault attacks), and building countermeasures on such hardware designs. We discuss the curriculum design, hands-on assignment’s development, final research project outcome, and the results obtained from the course together with the associated challenges. Our experience shows that such a course is feasible, can achieve its goals, and is liked by the students, but there is room for improvement.
Shijia Wei, Aydin Aysu, Michael Orshansky, Andreas Gerstlauer, Mohit TiwariConference Paper Hardware Oriented Security and Trust (HOST), Pages 1-10, McLean, USA, May 2019
High-assurance embedded systems are deployed for decades and expensive to re-certify – hence, each new attack is an unpatchable problem that can only be detected by monitoring out-of-band channels such as the system’s power trace or electromagnetic emissions. Micro-Architectural attacks, for example, have recently come to prominence since they break all existing software-isolation based security – for example, by hammering memory rows to gain root privileges or by abusing speculative execution and shared hardware to leak secret data. This work is the first to use anomalies in an embedded system’s power trace to detect evasive micro-architectural attacks. To this end, we introduce power-mimicking micro-architectural attacks – including DRAM-rowhammer attacks, side/covert-channel and speculation-driven attacks – to study their evasiveness. We then quantify the operating range of the power-anomalies detector using the Odroid XU3 board – showing that rowhammer attacks cannot evade detection while covert channel and speculation-driven attacks can evade detection but are forced to operate at a 36× and 7× lower bandwidth. Our power-anomaly detector is efficient and can be embedded out-of-band into (e.g.,) programmable batteries. While rowhammer, side-channel, and speculation-driven attack defenses require invasive code- and hardware-changes in general-purpose systems, we show that power-anomalies are a simple and effective defense for embedded systems. Power-anomalies can help future-proof embedded systems against vulnerabilities that are likely to emerge as new hardware like phase-change memories and accelerators become mainstream.
Xiaodan Xi, Aydin Aysu, Michael OrshanskyConference Paper Hardware Oriented Security and Trust (HOST), Pages 118-125, Washington, USA, April-May 2018
Side-channel attacks on cryptographic implementations threaten system security via the loss of the secret key. Fresh re-keying techniques aim to mitigate these attacks by regularly updating the key so that the side-channel exposure for each key is minimized. Existing key update schemes generate fresh keys by processing a root key with arithmetic operations which have, unfortunately, been demonstrated to be also vulnerable to side-channel attacks.
We propose a novel approach to fresh re-keying that replaces the arithmetic key update function with a strong Physically Unclonable Function (PUF). We show that the security of our scheme hinges on the resilience of the PUF to a power side-channel attack and propose a realization based on a Subthreshold Current Array (SCA) PUF. We show that SCA-PUF is resistant to simple power analysis and that it is resilient to a modeling attack that uses machine learning on the power side-channel. We target an insecure device and secure server encryption scenario for which we provide an efficient and scalable method of PUF enrollment. We finally propose an end-to-end encryption system with the PUF-based fresh re-keying scheme, using a reverse fuzzy extractor construction.
Aydin Aysu, Youssef Tobah, Mohit Tiwari, Andreas Gerstlauer, Michael OrshanskyConference Paper Hardware Oriented Security and Trust (HOST), Pages 81-88, Washington, USA, April-May 2018
Key exchange protocols establish a secret key to confidentially communicate digital information over public channels. Lattice-based key exchange protocols are a promising alternative for next-generation applications due to their quantum-cryptanalysis resistance and implementation efficiency. While these constructions rely on the theory of quantum-resistant lattice problems, their practical implementations have shown vulnerability against side-channel attacks in the context of public-key encryption or digital signatures. Applying such attacks on key exchange protocols is, however, much more challenging because the secret key changes after each execution of the protocol, limiting the side-channel adversary to a single measurement.
In this paper, we demonstrate the first successful power side-channel attack on lattice-based key exchange protocols. The attack targets the hardware implementation of matrix and polynomial multiplication used in these protocols. The crux of our idea is to apply a horizontal attack that makes hypotheses on several intermediate values within a single execution, all relating to the same secret, and to combine their correlations for accurately estimating the secret key. We illustrate that the design of key exchange protocols combined with the nature of lattice arithmetic enables our attack. Since a straightforward attack suffers from false positives, we demonstrate a novel procedure to recover the key by following the sequence of intermediate updates during multiplication.
We analyze two key exchange protocols, NewHope (USENIX’16) and Frodo (CCS’16), and show that their implementations can be vulnerable to our attack. We test the effectiveness of the proposed attack using concrete parameters of these protocols on a physical platform with real measurements. On a SAKURA-G FPGA Board, we show that the proposed attack can estimate the entire secret key from a single power measurement with over 99% success rate.
Aydin Aysu, Michael Orshansky, Mohit TiwariConference Paper Design, Automation, and Test in Europe – DATE, Pages 1253-1258, Dresden, Germany, Mar 2018
We describe the first hardware implementation of a quantum-secure encryption scheme along with its low-cost power side-channel countermeasures. The encryption uses an implementation-friendly Binary-Ring-Learning-with-Errors (B-RLWE) problem with binary errors that can be efficiently generated in hardware. We demonstrate that a direct implementation of B-RLWE exhibits vulnerability to power side-channel attacks, even to Simple Power Analysis, due to the nature of binary coefficients. We mitigate this vulnerability with a redundant addition and memory update. To further protect against Differential Power Analysis (DPA), we use a B-RLWE specific opportunity to construct a lightweight yet effective countermeasure based on randomization of intermediate states and masked threshold decoding. On a SAKURA-G FPGA board, we show that our method increases the required number of measurements for DPA attacks by 40x compared to unprotected design. Our results also quantify the trade-off between side-channel security and hardware area-cost of B-RLWE.
Aydin Aysu, Ye Wang, Patrick Schaumont, Michael OrshanskyConference Paper Hardware Oriented Security and Trust – HOST, Pages 134-139, McLean, USA, May, 2017
An ideal Physical Unclonable Function produces a string of static random bits. Noise causes these bits to be unstable over subsequent readings and biases cause these bits to have a tendency towards a fixed value. Although the debiasing of random strings is a well-studied problem, the combined problem of noise and bias is unique to PUF design. This paper proposes a new lightweight noise-aware debiasing method superior to earlier techniques. The method is based on identifying an m-to-l encoding that compresses m-bit noisy and biased PUF outputs into l-bit strings which have a reduced combined effect of bias and noise. We describe a methodology for deriving an efficient encoding based on the bias and noise level of the input string. Notably, the method does not require intermediate storage or transmission of PUF-specific mask (debiasing helper) data for reconstruction. We test our method on PUFs with a range of bias and noise levels, and demonstrate its advantages over two debiasing approaches published at CHES 2015 which are based on XOR operation and Von Neumann corrector. The results quantify that the proposed method can achieve up to 76% reduction over the previous method in the number of PUF bits required to establish an authentication system with an error rate of one part in a million and a security level of 80-bits.
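For context, the Von Neumann corrector named above as one of the baseline approaches can be sketched in a few lines. This is the classic pairwise debiasing rule, not the m-to-l encoding proposed in the paper; the function name and the choice to also return the indices of the kept pairs are illustrative assumptions.

```python
def von_neumann_corrector(bits):
    """Classic Von Neumann debiasing on non-overlapping bit pairs.

    A pair 01 yields 0, a pair 10 yields 1, and pairs 00 and 11 are
    discarded. Returns the debiased bits and the indices of the pairs
    that were kept.
    """
    out = []
    kept_pairs = []
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            out.append(first)
            kept_pairs.append(i // 2)
    return out, kept_pairs
```

For independent bits with a fixed bias, the surviving output is unbiased, but many input bits are discarded, and the rule by itself does nothing about noise between readings. The list of kept pairs is shown only to make visible the kind of per-device bookkeeping that a pair-selection scheme involves.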
Aydin AysuThesis PhD Dissertation Virginia Tech
In the context of a system design, resource-constraints refer to severe restrictions on allowable resources, while resource-efficiency is the capability to achieve a desired performance and, at the same time, to reduce wasting resources. To design for low-cost platforms, these fundamental concepts are useful under different scenarios and they call for different approaches, yet they are often mixed. Resource-constrained systems require aggressive optimizations, even at the expense of performance, to meet the stringent resource limitations. On the other hand, resource-efficient systems need a careful trade-off between resources and performance, to achieve the best possible combination. Designing systems for resource-constraints with the optimizations for resource-efficiency, or vice versa, can result in a suboptimal solution.
Using modern cryptographic applications as the driving domain, I first distinguish resource-constraints from resource-efficiency. Then, I introduce the recurring strategies to handle these cases and apply them on modern cryptosystem designs. I illustrate that by clarifying the application context, and then by using appropriate strategies, it is possible to push the envelope on what is perceived as achievable, by up to two orders-of-magnitude.
In the first part of this dissertation, I focus on resource-constrained modern cryptosystems. The driving application is Physical Unclonable Function (PUF) based symmetric-key authentication. I first propose the smallest block cipher in 128-bit security level. Then, I show how to systematically extend this design into the smallest application-specific instruction set processor for PUF-based authentication protocols. I conclude this part by proposing a compact method to combine multiple PUF components within a system into a single device identifier.
In the second part of this dissertation, I focus on resource-efficient modern cryptosystems. The driving application is post-quantum public-key schemes. I first demonstrate energy-efficient computing techniques for post-quantum digital signatures. Then, I propose an area-efficient partitioning and a Hardware/Software codesign for its implementation. The results of these implemented modern cryptosystems validate the advantage of my approach by quantifying the drastic improvements over the previous best.
Aydin Aysu, Patrick Schaumont. Journal Paper. IEEE Transactions on Computers, Volume 65, Issue 9, Pages 2925-2931, Sep, 2016
Energy-harvesting techniques can be combined with wireless embedded sensors to obtain battery-free platforms with an extended lifetime. Although energy-harvesting offers a continuous supply of energy, the delivery rate is typically limited to a few Joules per day. This is a severe constraint to the achievable computing throughput on the embedded sensor node, and to the achievable latency obtained from applications running on those nodes. In this paper, we address these constraints with precomputation. The idea is to reduce the amount of computation required in response to application inputs, by partitioning the algorithm into an offline part, computed before the inputs are available, and an online part, computed in response to the actual input. We show that this technique works well on hash-based cryptographic signatures, which have a complex key generation for each new message that requires a signature. By precomputing the key-material, and by storing it as run-time coupons in non-volatile memory, there is a drastic reduction of the run-time energy needs for a signature, and a drastic reduction of the run-time latency to generate it. For a Winternitz hash-based scheme at 84-bit quantum security level on an MSP430 microcontroller, we measured a run-time energy reduction of 11.9× and a run-time latency reduction of 23.5×.
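The offline/online split described above can be illustrated with a minimal Winternitz one-time signature sketch. This is not the paper's MSP430 implementation: the SHA-256 hash, the parameter w = 16, and the function names (`make_coupon`, `sign`, `verify`) are assumptions chosen for illustration. The point is only that key generation, which is independent of the message, can run ahead of time and be stored as a "coupon".

```python
import hashlib
import os

W = 16                        # assumed Winternitz parameter: one chain per 4-bit digit
N_MSG = 64                    # a 256-bit digest yields 64 base-16 digits
N_CSUM = 3                    # checksum is at most 64 * 15 = 960 < 16**3
N_CHAINS = N_MSG + N_CSUM


def H(data):
    return hashlib.sha256(data).digest()


def chain(value, steps):
    """Apply the hash function `steps` times."""
    for _ in range(steps):
        value = H(value)
    return value


def digits(message):
    """Base-16 digits of the message digest followed by the checksum digits."""
    d = [nib for byte in H(message) for nib in (byte >> 4, byte & 0xF)]
    csum = sum(W - 1 - x for x in d)
    return d + [(csum >> 8) & 0xF, (csum >> 4) & 0xF, csum & 0xF]


def make_coupon():
    """Offline phase: generate a one-time key pair before any message exists."""
    sk = [os.urandom(32) for _ in range(N_CHAINS)]
    pk = [chain(s, W - 1) for s in sk]   # the expensive, message-independent part
    return sk, pk


def sign(message, sk):
    """Online phase: consume a stored coupon to sign the actual message."""
    return [chain(s, d) for s, d in zip(sk, digits(message))]


def verify(message, sig, pk):
    return all(chain(s, W - 1 - d) == p
               for s, d, p in zip(sig, digits(message), pk))
```

In this sketch a sensor node would call `make_coupon()` whenever harvested energy is available and call `sign()` only when a message arrives. Each coupon must be used for a single message.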
Aydin Aysu, Shravya Gaddam, Harsha Mandadi, Carol Pinto, Luke Wegryn, Patrick Schaumont. Conference Paper. Design, Automation and Test in Europe (DATE 2016), Pages 1517-1522, Dresden, Germany, Mar 2016
Modern, complex printed circuit boards contain high-end commercial off-the-shelf components such as high-capacity FPGAs and expensive peripherals. This paper describes a strategy to build a hardware attestation protocol for such a board. The owner or operator of the PCB wants to achieve the assurance that the board installed in the field is physically the same as the one that was originally deployed. Our methodology builds a unique identifier for the PCB by cryptographically linking individual component-level identifiers from the board. The component-level identifiers are implemented using Physical Unclonable Functions (PUFs) within the components of the board. We discuss a generic methodology for design and dimensioning of the critical post-processing parameters of the PUF, and we present several strategies to combine multiple PUFs into a combined Fusion PUF. We present a prototype of the proposed technique on an FPGA board running μClinux, and we characterize its performance on a population of 22 PCBs.
Aydin Aysu, Ege Gulcan, Daisuke Moriyama, Patrick Schaumont. Journal Paper. IET Journal on Information Security, Volume 10, Issue 5, Pages 232-241, Sep, 2016
There is a disconnection between the theory and the practice of lightweight physical unclonable function (PUF)-based protocols. At a theoretical level, there exist several PUF-based authentication protocols with unique features and novel efficiency claims, but most of these solutions lack real-world implementations with simple performance figures. On the other hand, practical protocol implementations are ad-hoc designs fixed to a specific functionality and with limited area optimizations. This work aims to bring these approaches on PUF protocols closer. The authors’ contribution is twofold. First, they provide a novel ASIP (application-specific instruction set processor) that can efficiently execute PUF-based authentication protocols. The key novelty of the proposed ASIP is optimization for area without degrading the performance. Second, they demonstrate the capability of their ASIP by mapping three secure PUF-based authentication protocols and benchmark their execution time, memory footprint, communication overhead, and power/energy consumption. Their results demonstrate the advantage of ASIP over dedicated architectures and also as opposed to general-purpose programming on an MSP430. The results further demonstrate various efficiency metrics that can be used to compare PUF-based protocol implementations.
Christopher Huth, Aydin Aysu, Jorge Guajardo, Paul Duplys, Tim Guneysu. Conference Paper. Information Security and Cryptology – ICISC, Pages 28-48, Seoul, South Korea, Dec, 2016
The Internet of Things (IoT) is both a boon and a bane. It offers great potential for new business models and ecosystems, but raises major security and privacy concerns. Because many IoT systems collect, process, and store personal data, secure and privacy-preserving identity management is of utmost significance. Yet, strong resource limitations of IoT devices render resource-hungry public-key cryptography infeasible. Additionally, the security model of IoT requires solutions to work under memory-leakage attacks. Existing constructions address either the privacy issue or the lightweightness, but not both. Our work contributes towards bridging this gap by combining physically unclonable functions (PUFs) and channel-based key agreement (CBKA): (i) We show a flaw in a PUF-based authentication protocol, when outsider chosen perturbation security cannot be guaranteed. (ii) We present a solution to this flaw by introducing CBKA with an improved definition. (iii) We propose a provably secure and lightweight authentication protocol by combining PUFs and CBKA.
Xu Guo, Aydin Aysu. Patent. US Patent 9497573 B2
One feature pertains to a near field communication (NFC) target device comprising a memory circuit adapted to store sensitive data, an NFC interface adapted to transmit and receive information using NFC protocols, and a processing circuit. The processing circuit receives a plurality of provider identification (PID) numbers from a plurality of providers, where each PID number is associated with a different provider. The processing circuit also stores the PID numbers at the memory circuit, and assigns a privilege mask to each PID number received and stored. The NFC target device may also include a physical unclonable function (PUF) circuit. The processing circuit may additionally provide one or more PID numbers as input challenges to the PUF circuit, and receive one or more PUF output responses from the PUF circuit, where the PUF output responses are different from one another and are associated with different providers.
Ege Gulcan, Aydin Aysu, Patrick Schaumont. Conference Paper. International Conference on Cryptology in India (INDOCRYPT 2015), Pages 329-346, Bengaluru, India, November, 2015
There is a significant effort in building lightweight cryptographic operations, yet the proposed solutions are typically single-purpose modules that can implement a single functionality. In contrast, we propose BitCryptor, a multi-purpose, compact processor for cryptographic applications on reconfigurable hardware. The proposed crypto engine can perform pseudo-random number generation, strong collision-resistant hashing and variable-key block cipher encryption. The hardware architecture utilizes SIMON, a recent lightweight block cipher, as its core. The complete engine uses a bit-serial design methodology to minimize the area. Implementation results on the Xilinx Spartan-3 s50 FPGA show that the proposed architecture occupies 95 slices (187 LUTs, 102 registers), which is 10× smaller than the nearest comparable multi-purpose design. BitCryptor is also smaller than the majority of recently proposed lightweight single-purpose designs. Therefore, it is a very efficient cryptographic IP block for resource-constrained domains, providing a good performance at a minimal area overhead.
Aydin Aysu, Ege Gulcan, Daisuke Moriyama, Patrick Schaumont. Conference Paper. Cryptographic Hardware and Embedded Systems (CHES 2015), Pages 556-576, St. Malo, France, Sep, 2015
We demonstrate a prototype implementation of a provably secure protocol that supports privacy-preserving mutual authentication between a server and a constrained device. Our proposed protocol is based on a physically unclonable function (PUF) and it is optimized for resource-constrained platforms. The reported results include a full protocol analysis, the design of its building blocks, their integration into a constrained device, and finally its performance evaluation. We show how to obtain efficient implementations for each of the building blocks of the protocol, including a fuzzy extractor with a novel helper-data construction technique, a truly random number generator (TRNG), and a pseudo-random function (PRF). The prototype is implemented on a SASEBO-GII board, using the on-board SRAM as the source of entropy for the PUF and the TRNG. We present three different implementations. The first two execute on an MSP430 soft-core processor and have security levels of 64 bits and 128 bits, respectively. The third uses a hardware accelerator and has a 128-bit security level. To the best of our knowledge, this work is the first effort to describe the end-to-end design and evaluation of a privacy-preserving PUF-based authentication protocol.
Aydin Aysu, Patrick Schaumont. Journal Paper. ACM Transactions on Embedded Computing Systems, Volume 14, Issue 3, Pages 1-18, April, 2015
Advances in quantum computing have spurred a significant amount of research into public-key cryptographic algorithms that are resistant against postquantum cryptanalysis. Lattice-based cryptography is one of the important candidates because of its reasonable complexity combined with reasonable signature sizes. However, in a postquantum world, not only the cryptography will change but also the computing platforms. Large numbers of resource-constrained embedded systems will connect to a cloud of powerful server computers. We present an optimization technique for lattice-based signature generation on such embedded systems; our goal is to optimize latency rather than throughput. Indeed, on an embedded system, the latency of a single signature for user identification or message authentication is more important than the aggregate signature generation rate. We build a high-performance implementation using hardware/software codesign techniques. The key idea is to partition the signature generation scheme into offline and online phases. The signature scheme allows this separation because a large portion of the computation does not depend on the message to be signed and can be handled before the message is given. Then, we can map complex precomputation operations in software on a low-cost processor and utilize hardware resources to accelerate simpler online operations. To find the optimum hardware architecture for the target platform, we define and explore the design space and implement two design configurations. We realize our solutions on the Altera Cyclone-IV CGX150 FPGA. The implementation consists of a NIOS soft-core processor and a low-latency hash and polynomial multiplication engine. On average, the proposed low-latency architecture can generate a signature with a latency of 96 clock cycles at 40 MHz, resulting in a response time of 2.4 μs for a signing request. On equivalent platforms, this corresponds to a performance improvement of 33 and 105 times compared to previous hardware and software implementations, respectively.
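The reported response time follows directly from the cycle count and the clock frequency. The short check below reproduces that arithmetic; the helper name `latency_us` is a hypothetical name introduced for this illustration.

```python
def latency_us(cycles, clock_hz):
    """Convert a cycle count at a given clock frequency into microseconds."""
    return cycles * 1e6 / clock_hz


# 96 clock cycles at 40 MHz, the figures quoted in the abstract
response_us = latency_us(96, 40e6)   # about 2.4 microseconds
```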
Aydin Aysu, Patrick Schaumont. Journal Paper. Elsevier Microprocessors and Microsystems Journal, Volume 39, Issue 7, Pages 589-597, October, 2015
Physical Unclonable Functions (PUFs) enable the generation of device-unique, on-chip, and digital identifiers by exploiting the manufacturing process variation. The past decade has seen an extensive effort in PUF design. Yet, most PUF constructions are regarded as stand-alone hardware building blocks. In contrast, we propose PUF constructions that are tightly integrated into the design of a micro-processor. The proposed PUFs are essentially a collection of time-to-digital converters that are integrated into the custom instruction or memory-mapped interface of a processor. Therefore, the processor can issue the PUF challenges and collect the associated responses using instruction executions. This integration enables practical, run-time physical authentication and it allows flexible post-processing mechanisms using software. In this article, we describe the design, implementation, and the performance analysis details of such hardware/software co-designed authentication mechanisms on FPGAs. We propose two variants of the PUF architecture: a synchronous module that requires minimal place and route constraints utilizing the common clock of the SoC, and an asynchronous alternative that is independent of the clock but realized with a controlled placement. We implemented the synchronous architecture on the Altera Cyclone-IV FPGAs and performed a large-scale characterization on 55 boards. The asynchronous design is realized on the Xilinx Virtex-5 FPGAs and tested on 100 boards. Measurements reveal that the proposed solutions can authenticate trillions of devices and provide better performance than the ring oscillator based alternative.
Nahid Farhady Ghalaty, Aydin Aysu, Patrick Schaumont. Conference Paper. Design, Automation and Test in Europe (DATE), Pages 204-230, Dresden, Germany, Mar 2014
Fault Sensitivity Analysis (FSA) is a new type of side-channel attack that exploits the relation between the sensitive data and the faulty behavior of a circuit, the so-called fault sensitivity. This paper analyzes the behavior of different implementations of AES S-box architectures against FSA, and proposes a systematic countermeasure against this attack. This paper has two contributions. First, we study the behavior and structure of several S-box implementations, to understand the causes behind the fault sensitivity. We identify two factors: the timing of fault sensitive paths, and the number of logic levels of fault sensitive gates within the netlist. Next, we propose a systematic countermeasure against FSA. The countermeasure masks the effect of these factors by intelligent insertion of delay elements. We evaluate our methodology by means of an FPGA prototype with built-in timing-measurement. We show that FSA can be thwarted at low hardware overhead. Compared to earlier work, our method operates at the logic-level, is systematic, and can be easily generalized to bigger circuits.
Aydin Aysu, Ege Gulcan, Patrick Schaumont. Journal Paper. IEEE Embedded Systems Letters, Volume 6, Issue 2, Pages 37-40, April, 2014
While the advanced encryption standard (AES) is extensively in use in a number of applications, its area cost limits its deployment in resource-constrained platforms. In this letter, we have implemented SIMON, a recent promising low-cost alternative to AES, on reconfigurable platforms. The Feistel network, the construction of the round function, and the key generation of SIMON enable bit-serial hardware architectures which can significantly reduce the cost. Moreover, encryption and decryption can be done using the same hardware. The results show that with an equivalent security level, SIMON is 86% smaller than AES, 70% smaller than PRESENT (a standardized low-cost AES alternative), and its smallest hardware architecture only costs 36 slices (72 LUTs, 30 registers). To the best of our knowledge, this work sets the new area records as we propose the hardware architecture of the smallest block cipher ever published on field-programmable gate arrays (FPGAs) at 128-bit level of security. Therefore, SIMON is a strong alternative to AES for low-cost FPGA-based applications.
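The reason encryption and decryption can share hardware is visible in the Feistel structure of the SIMON round. The sketch below models the round in software for 32-bit words. It is a word-level illustration and not the bit-serial architecture of the letter, and the round keys are arbitrary placeholder values rather than the output of the real SIMON key schedule.

```python
WORD_BITS = 32                       # word size of the SIMON variants with 64-bit blocks
MASK = (1 << WORD_BITS) - 1


def rol(x, r):
    """Rotate a WORD_BITS-wide word left by r positions."""
    return ((x << r) | (x >> (WORD_BITS - r))) & MASK


def f(x):
    """SIMON round function: (x <<< 1 AND x <<< 8) XOR (x <<< 2)."""
    return (rol(x, 1) & rol(x, 8)) ^ rol(x, 2)


def encrypt(x, y, round_keys):
    for k in round_keys:
        x, y = y ^ f(x) ^ k, x
    return x, y


def decrypt(x, y, round_keys):
    # Same round function and same operations, with the keys in reverse order.
    for k in reversed(round_keys):
        x, y = y, x ^ f(y) ^ k
    return x, y


# Placeholder round keys for illustration only (NOT the SIMON key schedule).
demo_keys = [0x01234567, 0x89ABCDEF, 0x0F1E2D3C, 0x4B5A6978]
```

Because decryption reuses `f` unchanged and only swaps the word roles and the key order, a single datapath can serve both directions.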
Ege Gulcan, Aydin Aysu, Patrick Schaumont. Conference Paper. International Workshop on Lightweight Cryptography for Security and Privacy (LightSec), Pages 556-576, Istanbul, Turkey, Sep, 2014
SIMON is a recent, lightweight block cipher developed by the NSA. Previous work on SIMON shows that it is a very promising alternative to AES for resource-constrained platforms. While SIMON offers a range of block sizes and key lengths, a straightforward implementation would select fixed values in order to achieve a compact design. In contrast, we propose a flexible hardware architecture on FPGAs that still preserves the compactness of SIMON. The proposed implementation can execute all configurations of SIMON, and thus provides a versatile architecture that enables adaptive security using a variable key-size. Moreover, it also reduces the inefficiency of encrypting slightly longer messages by supporting a variable block-size. The implementation results show that the proposed architecture occupies 90 and 32 slices on Spartan-3 and Spartan-6 FPGAs, respectively. To the best of our knowledge, these area results are smaller than other block ciphers of similar security level. Furthermore, we also quantify the cost of flexibility and show the trade-off between the security level, throughput and area.
Aydin Aysu, Patrick Schaumont. Conference Paper. Reconfigurable Computing and FPGAs (ReConFig), Pages 1-6, Cancun, Mexico, Dec, 2013
Generation of device-unique digital signatures using Physically Unclonable Functions (PUFs) has been an active area of research for the last decade. However, most PUFs are conceived and designed as stand-alone hardware modules. In contrast, this paper proposes a PUF architecture that is tightly integrated into the core of a system-on-chip (SoC), with the purpose of creating a physical SoC authentication mechanism. The proposed PUF is integrated into the custom instruction interface of the NIOS-II processor. Therefore, PUF challenges can be issued by instruction calls, which allows run-time authentication and enables the implementation of flexible post-processing mechanisms in software. The proposed PUF utilizes critical timing path violations of a custom instruction execution to generate digital signatures which are unique for individual chips due to random process variations. We implement PASC on a low-cost Altera DE0-Nano Development Board and we validate the quality of the authentication keys on 15 boards.
Patrick Schaumont, Aydin Aysu. Conference Paper. Security, Privacy, and Applied Cryptography Engineering (SPACE), Pages 1-20, Kharagpur, India, October, 2013
This contribution explores the design dimensions, the primary quality factors of a design, of secure embedded systems design. Design dimensions define the design space, and they enable a designer to distinguish a high-quality design from a low-quality design. Besides well-known dimensions such as performance and flexibility, secure embedded systems design introduces a new one: risk, or the potential for loss. Risk is on equal footing with flexibility and performance. The design challenges for risk cannot be met by optimizing for performance or flexibility alone. Hence, secure-embedded system design requires a trade-off between flexibility, performance, and risk. We illustrate this trade-off for each pair of factors through several driver applications, including parallel cryptography, integration of physical unclonable functions and side-channel countermeasures.
Aydin Aysu, Nahid Farhady Ghalaty, Zane Franklin, Moein Pahlavan Yali, Patrick Schaumont. Conference Paper. International Workshop on Embedded Systems Security (WESS), Pages 2:1-6, Quebec, Canada, Sep, 2013
With the Internet of Things on the horizon, correct authentication of Things within a population will become one of the major concerns for security. Physical authentication, which is implementing digital fingerprints by utilizing device-unique manufacturing variations, has great potential for achieving this purpose. MEMS sensors that are used in the Internet of Things have not been explored as a source of variation. In this paper, we target a commonly used MEMS sensor, an accelerometer, and utilize its process variations to generate digital fingerprints. This is achieved by measuring the accelerometer’s response to an applied electrostatic impulse and its inherent offset values. Our results revealed that MEMS sensors could be used as a source for digital fingerprints for run-time authentication applications.
Aydin Aysu, Cameron Patterson, Patrick Schaumont. Conference Paper. Hardware-Oriented Security and Trust (HOST), Pages 81-86, Austin, USA, June, 2013
The interest in lattice-based cryptography is increasing due to its quantum resistance and its provable security under some worst-case hardness assumptions. As this is a relatively new topic, the search for efficient hardware architectures for lattice-based cryptographic building blocks is still an active area of research. We present area optimizations for the most critical and computationally-intensive operation in lattice-based cryptography: polynomial multiplication with the Number Theoretic Transform (NTT). The proposed methods are implemented on an FPGA for polynomial multiplication over the ring ℤ_p[x]/⟨x^n + 1⟩. The proposed hardware architectures reduce slice usage, number of utilized memory blocks and total memory accesses by using a simplified address generation, improved memory organization and on-the-fly operand generations. Compared to prior work, with similar performance the proposed hardware architectures can save up to 67% of occupied slices, 80% of used memory blocks and 60% of memory accesses, and can fit into the smallest Xilinx Spartan-6 FPGA.
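As a software illustration of NTT-based multiplication modulo x^n + 1, the sketch below uses toy parameters (n = 4, p = 17, with 2 as a primitive 2n-th root of unity) that are assumptions for readability, not the parameters of the paper. The transform is written as a direct O(n²) sum; hardware and production software use O(n log n) butterfly networks instead.

```python
P = 17                   # toy prime with P ≡ 1 (mod 2N); assumption for illustration
N = 4                    # polynomial degree bound (ring is Z_P[x]/(x^N + 1))
PSI = 2                  # primitive 2N-th root of unity mod P: 2**4 ≡ -1 (mod 17)
OMEGA = PSI * PSI % P    # primitive N-th root of unity


def inv(x):
    """Modular inverse via Fermat's little theorem (P is prime)."""
    return pow(x, P - 2, P)


def ntt(a, root):
    """Direct (quadratic-time) number theoretic transform with the given root."""
    return [sum(a[j] * pow(root, i * j, P) for j in range(N)) % P
            for i in range(N)]


def negacyclic_mul(a, b):
    """Multiply a and b modulo (x^N + 1, P) using the NTT."""
    # Scaling coefficient i by PSI^i turns the negacyclic product into a cyclic one.
    a_s = [a[i] * pow(PSI, i, P) % P for i in range(N)]
    b_s = [b[i] * pow(PSI, i, P) % P for i in range(N)]
    prod = [x * y % P for x, y in zip(ntt(a_s, OMEGA), ntt(b_s, OMEGA))]
    c_s = ntt(prod, inv(OMEGA))                  # inverse transform, still unscaled
    n_inv, psi_inv = inv(N), inv(PSI)
    return [c_s[i] * n_inv * pow(psi_inv, i, P) % P for i in range(N)]


def schoolbook_mul(a, b):
    """Reference multiplication: terms wrapping past x^N change sign."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] = (c[i + j] + a[i] * b[j]) % P
            else:
                c[i + j - N] = (c[i + j - N] - a[i] * b[j]) % P
    return c
```

Comparing `negacyclic_mul` against `schoolbook_mul` on the same inputs is a simple way to check the transform before moving to an optimized architecture.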
Aydin Aysu, Murat Sayinta, Cevahir Cigla. Conference Paper. Very Large Scale Integration (VLSI-SoC), Pages 204-209, Istanbul, Turkey, October, 2013
There exist numerous solutions to the stereo matching problem and many have been implemented on FPGAs. However, these solutions are conceived and designed as stand-alone modules without considering the area constraints and hard deadlines of the underlying application. In this paper, we propose a low-cost and real-time stereo matching system that is specifically designed to be integrated into the video pipeline of a Full-HD 3D-TV system, which is compliant with the HDMI 1.4a specification. The proposed hardware architecture was implemented in VHDL and mapped on a low-cost Spartan6-XC6SLX9 FPGA. The implementation runs at 148.5 MHz and achieves 60 fps in order to meet the HDMI 1.4a requirements, while using only 3k LUTs and 29 of 16Kb-BRAMs. Compared to existing work, the proposed implementation can generate high quality disparity maps with a much more compact implementation. Moreover, to the best of our knowledge, we present the first hardware implementation of the information permeability based stereo matching algorithm, which exploits one of the most efficient aggregation methodologies compared to the state-of-the-art. We also identify the problems brought out during the system level integration and provide novel solutions to them.
Efficient hardware implementations of high throughput SHA-3 candidates keccak, luffa and blue midnight wish for single- and multi-message hashing
Abdulkadir Akin, Aydin Aysu, Onur Can Ulusel, Erkay Savas. Conference Paper. International conference on Security of information and networks, Pages 168-177, Taganrog, Russia, Sep, 2011
In November 2007 NIST announced that it would organize the SHA-3 competition to select a new cryptographic hash function family by 2012. In the selection process, hardware performances of the candidates will play an important role. Our analysis of previously proposed hardware implementations shows that three SHA-3 candidate algorithms can provide superior performance in hardware: Keccak, Luffa and Blue Midnight Wish (BMW). In this paper, we provide efficient and fast hardware implementations of these three algorithms. Considering both single- and multi-message hashing applications with an emphasis on both speed and efficiency, our work presents a more comprehensive analysis of their hardware performances by providing different performance figures for different target devices. To the best of our knowledge, this is the first work that provides a comparative analysis of SHA-3 candidates in multi-message applications. We discover that the BMW algorithm can provide much higher throughput than previously reported if used in multi-message hashing. We also show that better utilization of resources can increase speed via different configurations. We implement our designs using Verilog HDL, and map to both ASIC and FPGA devices (Spartan3, Virtex2, and Virtex 4) to give a better comparison with those in the literature. We report total area, maximum frequency, maximum throughput and throughput/area of the designs for all target devices. Given that the selection process for SHA-3 is still open, our results will be instrumental in evaluating the hardware performance of the candidates.
Aydin Aysu, Gokhan Sayilar, Ilker Hamzaoglu. Journal Paper. IEEE Transactions on Consumer Electronics, Volume 57, Issue 3, Pages 1377-1383, August, 2011
Multiple reference frame motion estimation (MRF ME) increases the video coding efficiency at the expense of increased computational complexity and energy consumption. Therefore, in this paper, a low complexity H.264 MRF ME algorithm and a low energy adaptive hardware for its real-time implementation are proposed. The proposed MRF ME algorithm reduces the computational complexity of MRF ME by using a dynamically determined number of reference frames for each Macroblock (MB) and early termination. The proposed H.264 MRF ME hardware is implemented in Verilog HDL. The proposed H.264 MRF ME hardware has 29-72% less energy consumption than an H.264 MRF ME hardware using 5 reference frames for all MBs with a negligible PSNR loss. Therefore, it can be used in consumer electronics products that require real-time video processing or compression with low power consumption.
Ilker Hamzaoglu, Aydin Aysu, Onur Can Ulusel. Conference Paper. Signal Processing and Communications Applications (SIU), Pages 984-987, Antalya, Turkey, April, 2011
Motion Estimation (ME) is the most computationally intensive part of video compression systems. Multiple reference frame (MRF) ME used in H.264 standard increases the video coding efficiency at the expense of increased computational complexity and power consumption. Therefore, in this paper, we present a reconfigurable baseline H.264 video encoder hardware in which the number of reference frames used for MRF ME can be configured based on the application requirements in order to trade-off video coding efficiency and power consumption. The proposed H.264 video encoder hardware is based on an existing low cost H.264 intra frame coder hardware and it includes new reconfigurable MRF ME, mode decision and motion compensation hardware. The proposed H.264 video encoder hardware is capable of processing 55 CIF (352×288) frames per second and its power consumption ranges between 115 mW and 235 mW depending on the number of reference frames used for MRF ME.
Aydin Aysu. Thesis. MS Thesis, Sabanci University
The recently developed H.264 / MPEG-4 Part 10 video compression standard achieves better video compression efficiency than previous video compression standards at the expense of increased computational complexity and power consumption. Multiple reference frame (MRF) Motion Estimation (ME) is the most computationally intensive and power consuming part of H.264 video encoders. Therefore, in this thesis, we designed and implemented a reconfigurable baseline H.264 video encoder hardware for real-time portable applications in which the number of reference frames used for MRF ME can be configured based on the application requirements in order to trade-off video coding efficiency and power consumption. The proposed H.264 video encoder hardware is based on an existing low cost H.264 intra frame coder hardware and it includes new reconfigurable MRF ME, mode decision and motion compensation hardware. We first proposed a low complexity H.264 MRF ME algorithm and a low energy adaptive hardware for its real-time implementation. The proposed MRF ME algorithm reduces the computational complexity of MRF ME by using a dynamically determined number of reference frames for each Macroblock and early termination. The proposed MRF ME hardware architecture is implemented in Verilog HDL and mapped to a Xilinx Spartan 6 FPGA. The FPGA implementation is verified with post place & route simulations. The proposed H.264 MRF ME hardware has 29-72% less energy consumption on this FPGA than an H.264 MRF ME hardware using 5 reference frames for all MBs with a negligible PSNR loss. We then designed the H.264 video encoder hardware and implemented it in Verilog HDL. The proposed video encoder hardware is mapped to a Xilinx Virtex 6 FPGA and verified with post place & route simulations. 
The bitstream generated by the proposed video encoder hardware for an input frame is successfully decoded by the H.264 Joint Model reference software decoder and the decoded frame is displayed using a YUV Player tool for visual verification. The FPGA implementation of the proposed H.264 video encoder hardware works at 135 MHz and can code 55 CIF (352×288) frames per second; its power consumption ranges between 115 mW and 235 mW depending on the number of reference frames used for MRF ME.