Sitemap

A list of all the posts and pages found on the site. For the robots out there, an XML version is available for digesting as well.

Pages

Posts

Future Blog Post

This post will show up by default. To disable scheduling of future posts, edit _config.yml and set future: false.

Blog Post number 4

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 3

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 2

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 1

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

news

portfolio

publications

Self-Correction Trace Model: A Full-System Simulator for Optical Network-on-Chip

Published in 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW 2012), 2012

Abstract

Improvements in emerging technology bring nanophotonics into the on-chip interconnect, providing large communication capacity for future large-scale CMP processors. Full-system simulation has been adopted by many researchers as an important approach to architecture research. Since optical devices are fundamentally different from conventional electronic elements, new methodologies and tools are needed to simulate an Optical Network-on-Chip (ONOC) under real workloads. In this paper, we introduce a high-precision full-system ONOC simulation system. To build this system, we propose a self-correction trace model that enables accurate simulation within a reasonable period of time. Finally, to test our simulation system, we present a simple case study comparing our system running a real application with a baseline NoC simulator. The results show that our simulation system achieves high precision without substantially extending the total simulation time.

Recommended citation: Self-Correction Trace Model: A Full-System Simulator for Optical Network-on-Chip. Mingzhe Zhang, Liqiang He, Dongrui Fan. 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IPDPSW 2012.

Energy-Performance Modeling and Optimization of Parallel Computing in On-Chip Networks

Published in 2013 12th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA 2013), 2013

Abstract

This paper discusses the energy-performance trade-off of networks-on-chip (NoCs) with real parallel applications. First, we propose an accurate energy-performance analytical model that captures and analyzes the impacts of both frequency-independent and frequency-dependent power. Second, we combine communication overhead, memory access overhead, frequency scaling, and core-count scaling to quantify the performance of and energy consumed by NoCs. Third, we propose a new energy-performance optimization method that chooses a pair of frequency and core count to obtain optimal energy or performance. Finally, we use eight PARSEC parallel applications to evaluate our model and the optimization method. The experimental results confirm that our model predicts NoC energy and performance well and selects the correct frequency level and core count for most parallel applications.

Recommended citation: Energy-Performance Modeling and Optimization of Parallel Computing in On-Chip Networks. Shuai Zhang, Zhiyong Liu, Dongrui Fan, Fenglong Song, Mingzhe Zhang. 2013 12th IEEE International Symposium on Parallel and Distributed Processing with Applications. ISPA 2013.

A Path-Adaptive Opto-electronic Hybrid NoC for Chip Multi-processor

Published in 2013 12th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA 2013), 2013

Abstract

The continuous development of manufacturing technology allows optical components to be integrated on chip, providing a feasible solution for communication between the cores of many-core processors. Given the limitations of manufacturing technology and the characteristics of optical communication, an opto-electronic hybrid NoC is currently a reasonable choice. Today, common hybrid NoCs connect separate tile clusters with an optical network; with these structures, communication between cores in different tile clusters must go through the optical network. Meanwhile, most applications can hardly be divided into isolated parallel parts, leading to unavoidable communication between the tiles. Thus, scalability and flexibility for various applications are limited. In this paper, we present a path-adaptive opto-electronic hybrid NoC architecture. Rather than dividing the cores into separate clusters, the proposed structure provides an optical network layer and an electronic network layer in a mesh topology. Furthermore, a modified routing strategy allows the on-chip routers to decide whether to transmit a packet through the optical links or the electronic ones, according to the distance between the source and destination nodes. With this method, the NoC remains flexible for diverse applications without scaling limitations or performance degradation. The experimental results show that, for a 256-core NoC, our proposed architecture achieves 1.26x the network efficiency of Corona while reducing power consumption by 21%.

Recommended citation: A Path-Adaptive Opto-electronic Hybrid NoC for Chip Multi-processor. Mingzhe Zhang, Da Wang, Xiaochun Ye, Liqiang He, Dongrui Fan, Zhiyong Liu. 2013 12th IEEE International Symposium on Parallel and Distributed Processing with Applications. ISPA 2013.

Spontaneous reload cache: Mimicking a larger cache with minimal hardware requirement

Published in 2013 IEEE 8th International Conference on Networking, Architecture and Storage (NAS 2013), 2013

Abstract

In modern processor systems, on-chip Last Level Caches (LLCs) are used to bridge the speed gap between CPUs and off-chip memory. In recent years, the effectiveness of the LRU policy in last-level caches has been questioned. A significant amount of recent work has explored the design space of replacement policies for CPUs' last-level cache systems and proposed a variety of replacement policies. All of this work is based on the traditional idea of a passive cache, which triggers memory accesses exclusively when there is a cache miss. Such passive cache systems have a theoretical performance upper bound, represented by the Optimal Algorithm. In this work, we introduce a novel cache system called Spontaneous Reload Cache (SR-Cache). Contrary to passive caches, no matter whether a cache access is a hit or a miss, an SR-Cache can actively load or reload an off-chip data block that is predicted to be used in the near future and evict the data block that has the lowest probability of being reused soon. We show that, with minimal hardware overhead, SR-Cache achieves much better performance than conventional passive caches.

Recommended citation: Spontaneous reload cache: Mimicking a larger cache with minimal hardware requirement. Lunkai Zhang, Mingzhe Zhang, Lingjun Fan, Da Wang, Paolo Ienne. 2013 IEEE 8th International Conference on Networking, Architecture and Storage. NAS 2013.

SimICT: A fast and flexible framework for performance and power evaluation of large-scale architecture

Published in Proceedings of the 2013 International Symposium on Low Power Electronics and Design (ISLPED2013), 2013

Abstract

Simulation is an important method for evaluating future computer systems. However, the increasing complexity of target systems has made the development of simulators very difficult. Furthermore, detailed simulation of large-scale parallel architectures is so slow that full evaluation of real applications becomes a great challenge.

Recommended citation: SimICT: A fast and flexible framework for performance and power evaluation of large-scale architecture. Xiaochun Ye, Dongrui Fan, Ninghui Sun, Shibin Tang, Mingzhe Zhang, Hao Zhang. Proceedings of the 2013 International Symposium on Low Power Electronics and Design. ISLPED 2013.

SpongeDirectory: Flexible sparse directories utilizing multi-level memristors

Published in The 23rd International Conference on Parallel Architecture and Compilation Techniques (PACT 2014), 2014

Abstract

Cache-coherent shared memory is critical for programmability in many-core systems. Several directory-based schemes have been proposed, but dynamic, non-uniform sharing makes efficient directory storage challenging, with each scheme giving up storage space, performance, or energy. We introduce SpongeDirectory, a sparse directory structure that exploits multi-level memristor technology. SpongeDirectory expands directory storage in place when needed by increasing the number of bits stored on a single memristor device, trading latency and energy for storage. We explore several SpongeDirectory configurations, finding that a provisioning rate of 0.5× with memristors optimized for low energy consumption is the most competitive. This optimal SpongeDirectory configuration has performance comparable to a conventional sparse directory, requires 18× less storage space, and consumes 8× less energy.

Recommended citation: SpongeDirectory: Flexible sparse directories utilizing multi-level memristors. Lunkai Zhang, Dmitri Strukov, Hebatallah Saadeldeen, Dongrui Fan, Mingzhe Zhang, Diana Franklin. 2014 The 23rd International Conference on Parallel Architecture and Compilation Techniques. PACT 2014.

FreeRider: Non-local adaptive network-on-chip routing with packet-carried propagation of congestion information

Published in IEEE Transactions on Parallel and Distributed Systems (TPDS), 2015

Abstract

Non-local adaptive routing techniques, which utilize the status of both local and distant links to make routing decisions, have recently been shown to be effective at improving the performance of Network-on-Chip (NoC). The essence of non-local adaptive routing is an additional network dedicated to propagating congestion information about distant links across the NoC. While this dedicated Congestion Propagation Network (CPN) helps routers make promising routing decisions, it incurs additional wiring and power costs and becomes unnecessary overhead when the NoC load is light. Moreover, the CPN has to be extended if more sophisticated congestion information is to be used to enhance NoC performance, bringing even larger wiring and power costs. This paper proposes an innovative non-local adaptive routing technique called FreeRider, which does not use a dedicated CPN but instead leverages free bits in the head flits of existing packets to carry and propagate rich congestion information without introducing additional wires or flits. To balance the network load, FreeRider adopts a novel three-stage strategy for output link selection that fully utilizes the propagated information to make routing decisions. Experimental results on both synthetic traffic patterns and application traces show that FreeRider achieves better throughput, shorter latency, and lower power consumption than a state-of-the-art adaptive routing technique with a dedicated CPN.
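As a toy sketch of the packet-carried idea, spare head-flit bits can encode a handful of small congestion levels for downstream routers to read. The 2-bit-per-link layout below is a hypothetical illustration, not the paper's actual flit format:

```python
def pack_congestion(levels):
    """Pack up to four 2-bit congestion levels (0-3) into one spare byte
    of a head flit; no extra wires or flits are needed."""
    assert len(levels) <= 4 and all(0 <= lv <= 3 for lv in levels)
    byte = 0
    for i, lv in enumerate(levels):
        byte |= lv << (2 * i)
    return byte

def unpack_congestion(byte, n=4):
    """Recover the propagated congestion levels at a downstream router."""
    return [(byte >> (2 * i)) & 0x3 for i in range(n)]
```

A round trip (`unpack_congestion(pack_congestion(levels))`) returns the original levels, mimicking how congestion information rides along with a packet for free.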

Recommended citation: FreeRider: Non-local adaptive network-on-chip routing with packet-carried propagation of congestion information. Shaoli Liu, Tianshi Chen, Ling Li, Xi Li, Mingzhe Zhang, Chao Wang, Haibo Meng, Xuehai Zhou, Yunji Chen. IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol. 26, Issue 8, pp. 2272-2285.

COMRANCE: A rapid method for Network-on-Chip design space exploration

Published in 2016 The 7th International Green and Sustainable Computing Conference (IGSC 2016), 2016

Abstract

As the communication subsystem connecting various on-chip components, the Network-on-Chip (NoC) has a great influence on the performance of multi-/many-core processors. Because the NoC model contains a large number of parameters, design space exploration (DSE) for NoC is a critical problem for architects. As with core design, the existing DSE process mainly depends on iterative, time-consuming simulations. To lower the time budget, many previous studies focus on reducing the number of simulations. However, most of the proposed approaches are based on regression or machine learning techniques, whose accuracy is significantly affected by the size of the training set; building the training set still requires many simulations.

Recommended citation: COMRANCE: A rapid method for Network-on-Chip design space exploration. Mingzhe Zhang, Yangguang Shi, Fa Zhang, Zhiyong Liu. 2016 The 7th International Green and Sustainable Computing Conference. IGSC 2016.

Balancing performance and lifetime of MLC PCM by using a region retention monitor

Published in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA2017), 2017

Abstract

Multi Level Cell (MLC) Phase Change Memory (PCM) is an enhancement of PCM technology that provides higher capacity by allowing multiple digital bits to be stored in a single PCM cell. However, the retention time of MLC PCM is limited by the resistance drift problem, and refresh operations are required. Previous work shows that there exists a trade-off between write latency and retention: a write scheme with more SET iterations and smaller current provides a longer retention time, but at the cost of a longer write latency. Conversely, a write scheme with fewer SET iterations achieves high write performance but requires many more refresh operations due to its significantly reduced retention time, which hurts the lifetime of MLC PCM. In this paper, we show that only a small part of memory (i.e., hot memory regions) is frequently accessed in a given period of time. Based on this observation, we propose the Region Retention Monitor (RRM), a novel structure that records and predicts the write frequency of memory regions. For every incoming memory write, the RRM selects a proper write latency. Our evaluations show that the RRM helps the system improve the balance between performance and memory lifetime. On the performance side, a system with RRM bridges 77.2% of the performance gap between systems using only long writes and systems using only short writes. On the lifetime side, a system with RRM achieves a lifetime of 6.4 years, while systems using only long writes and only short writes achieve lifetimes of 10.6 and 0.3 years, respectively. Moreover, the aggressiveness of the RRM can easily be controlled through an attribute called the hot threshold: a more aggressively configured RRM achieves performance only 3.5% below that of a system using static short writes, while still achieving a lifetime of 5.78 years.
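A minimal software sketch of the hot-region idea: count writes per fixed-size region and give hot regions the fast, short-retention write, since they will be rewritten soon anyway. The region size and hot threshold here are illustrative assumptions, not the paper's hardware sizing:

```python
from collections import Counter

class RegionRetentionMonitor:
    """Toy RRM: tracks per-region write counts and picks a write mode.
    Regions whose recent write count reaches `hot_threshold` get the
    short (fast, short-retention) write; all others get the long write."""

    def __init__(self, region_bits=20, hot_threshold=8):
        self.region_bits = region_bits      # 2**20 B = 1 MiB regions (assumed)
        self.hot_threshold = hot_threshold  # illustrative value
        self.counts = Counter()

    def on_write(self, addr):
        region = addr >> self.region_bits
        self.counts[region] += 1
        return "short" if self.counts[region] >= self.hot_threshold else "long"

    def new_epoch(self):
        # Periodically reset so the monitor reflects recent behavior only.
        self.counts.clear()
```

Lowering `hot_threshold` makes the monitor more aggressive: more writes take the fast path, trading lifetime for performance, as in the paper's sensitivity discussion.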

Recommended citation: Balancing performance and lifetime of MLC PCM by using a region retention monitor. Mingzhe Zhang, Lunkai Zhang, Lei Jiang, Zhiyong Liu, Frederic T Chong. 2017 IEEE International Symposium on High Performance Computer Architecture. HPCA 2017.

Quick-and-Dirty: Improving Performance of MLC PCM by Using Temporary Short Writes

Published in 35th IEEE International Conference on Computer Design (ICCD2017), 2017

Abstract

Low write performance is a major obstacle to the commercialization of MLC PCM. One opportunity for improving the latency of MLC PCM writes is to use fewer SET iterations in a single write. Unfortunately, the data written by these short writes have significantly shorter retention times and thus need frequent refreshes. As a result, it is impractical to use these short-latency, short-retention writes globally. In this paper, we analyze the temporal behavior of write operations in typical applications and propose Quick-and-Dirty (QnD), a lightweight scheme to improve the performance of MLC PCM. QnD dynamically performs short-latency, short-retention writes when write operations are bursty, and then uses idle-memory intervals to refresh the data written by those short writes when the memory system is relatively quiet, mitigating the short-retention problem. Our experimental results show that QnD improves performance by 30.9% on geometric mean while still providing acceptable memory lifetime (7.58 years on geometric mean). We also provide sensitivity studies of the aggressiveness, memory coverage, and granularity of the QnD technique.

Recommended citation: Quick-and-Dirty: Improving Performance of MLC PCM by Using Temporary Short Writes. Mingzhe Zhang, Lunkai Zhang, Lei Jiang, Frederic T Chong, Zhiyong Liu. 35th IEEE International Conference on Computer Design. ICCD 2017.

Mmalloc: A Dynamic Memory Management on Many-core Coprocessor for the Acceleration of Storage-intensive Bioinformatics Application

Published in 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2018), 2018

Abstract

In the past decades, many applications in bioinformatics have achieved great success by extracting useful information from huge amounts of data. However, when storage-intensive applications like BWA-MEM are ported to coprocessors for acceleration, they often hit a memory bottleneck that severely limits program performance and scalability. While dynamic memory allocation is an important topic on CPUs and GPUs, there has been relatively little work on many-core coprocessors. This paper introduces Mmalloc, a fast and highly scalable allocator that accelerates storage-intensive applications on many-core coprocessors. Mmalloc is the first allocator to consider the architectural differences between the MIC and the CPU. Mmalloc removes the global heap to reduce long-distance on-chip coherence and communication. Mmalloc uses a binary sorted interval tree to manage memory. We also separate the header information from the data area, using a logical structure to preserve the locality of processed data. Our results on BWA-MEM benchmarks demonstrate that Mmalloc achieves better speedup and scalability on a many-core coprocessor than state-of-the-art CPU allocators such as Hoard.

Recommended citation: Mmalloc: A Dynamic Memory Management on Many-core Coprocessor for the Acceleration of Storage-intensive Bioinformatics Application. Zihao Wang, Mingzhe Zhang, Jingrong Zhang, Rui Yan, Xiaohua Wan, Zhiyong Liu, Fa Zhang, Xuefeng Cui. 2018 IEEE International Conference on Bioinformatics and Biomedicine. BIBM 2018.

A Survey on Architecture Research of Non-Volatile Memory based on Dynamical Trade-off (in chinese)

Published in Journal of Computer Research and Development, 2019

Abstract

As a promising alternative candidate to DRAM, non-volatile memory (NVM) technology has gained increasing interest from both industry and academia. Currently, the main problems that limit the wide adoption of NVM include considerably long write latency, high write energy, and limited write endurance. The traditional solutions to these problems are based on computer architecture methods, such as adding an extra level of hierarchy or a scheduling scheme. Unfortunately, these solutions often suffer from unavoidably high software/hardware overheads and can hardly optimize the architecture for more than one target at a time. In recent years, as research on non-volatile materials has advanced, several dynamic trade-offs inherent in the materials have been identified, which also provide new opportunities for computer architecture research. Based on these trade-offs, several novel NVM architectures have been proposed. Compared with the traditional solutions, these architectures have a series of advantages, such as low hardware overhead and the ability to optimize for multiple targets. In this survey, we first introduce the existing problems of NVM and the traditional solutions. Then, we present three important dynamic trade-offs of NVM. After that, we introduce the newly proposed architectures based on these trade-offs. Finally, we draw conclusions for this line of research and point out some potential opportunities.

Recommended citation: A Survey on Architecture Research of Non-Volatile Memory based on Dynamical Trade-off (in chinese). Mingzhe Zhang, Fa Zhang, Zhiyong Liu. Journal of Computer Research and Development, Vol. 56, Issue 4, pp. 677-691.

Magma: A Monolithic 3D Vertical Heterogeneous ReRAM-based Main Memory Architecture

Published in 2019 Proceedings of the 56th Annual Design Automation Conference (DAC 2019), 2019

Abstract

3D vertical ReRAM (3DV-ReRAM) emerges as one of the most promising alternatives to DRAM due to its good scalability beyond 10nm. Monolithic 3D (M3D) integration enables 3DV-ReRAM to improve its array area efficiency by stacking peripheral circuits underneath an array. A 3DV-ReRAM array has to be large enough to fully cover the peripheral circuits, but such a large array size significantly increases its access latency. In this paper, we propose Magma, an M3D-stacked heterogeneous ReRAM array architecture for future main memory systems, built by stacking a large unipolar 3DV-ReRAM array on top of a small bipolar 3DV-ReRAM array and peripheral circuits shared by the two arrays. We further architect the small bipolar array as a direct-mapped cache for the main memory system. Compared to homogeneous ReRAMs, on average, Magma improves system performance by 11.4%, reduces system energy by 24.3%, and obtains a >5-year lifetime.

Recommended citation: Magma: A Monolithic 3D Vertical Heterogeneous ReRAM-based Main Memory Architecture. Farzaneh Zokaee, Mingzhe Zhang, Xiaochun Ye, Dongrui Fan, Lei Jiang. 2019 Proceedings of the 56th Annual Design Automation Conference. DAC 2019.

C-MAP: Improving the Effectiveness of Mapping Method for CGRA by Reducing NoC Congestion

Published in 21st IEEE International Conference on High Performance Computing and Communications (HPCC 2019), 2019

Abstract

The Coarse-Grained Reconfigurable Architecture (CGRA) is considered one of the most promising candidates for big-data applications, providing significant throughput improvement and high energy efficiency. Unlike the dynamic-issue superscalar method in conventional processors, the CGRA uses the static placement dynamic issue (SPDI) execution model, in which the compiler decides how to map instructions onto the distributed processing elements (PEs), and each PE executes an instruction when its required data is ready. Since the dataflow of the program is determined in the logical view, an improper mapping of instructions may lead to more network congestion and hurt performance. Furthermore, the search for the optimal mapping in CGRA is proven to be NP-complete and can hardly be completed in limited time. In this paper, we propose a novel mapping algorithm named Congestion-MAP (C-Map). C-Map improves the effectiveness of CGRA mapping from the perspective of reducing network congestion and enhancing the continuity of the dataflow. Furthermore, C-Map accelerates mapping optimization for CGRA by using network analysis methods, which support fast comparison of mapping plans and parallel exploration. Additionally, with C-Map, we analyze the impact of several key considerations in CGRA instruction mapping, such as NoC workload reduction and workload balance. The experimental results show that C-Map improves performance by 2.2× on geometric mean.

Recommended citation: C-MAP: Improving the Effectiveness of Mapping Method for CGRA by Reducing NoC Congestion. Shuqian An, Mingzhe Zhang, Xiaochun Ye, Da Wang, Hao Zhang, Dongrui Fan, Zhimin Tang. 21st IEEE International Conference on High Performance Computing and Communications. HPCC 2019.

Quick-and-Dirty: An Architecture for High-Performance Temporary Short Writes in MLC PCM

Published in IEEE Transactions on Computers (TC), 2019

Abstract

MLC PCM provides high-density data storage and extended data retention; therefore it is a promising alternative to DRAM for main memory. However, its low write performance is a major obstacle to commercialization. One opportunity for improving the latency of MLC PCM writes is to use fewer SET iterations in a single write. Unfortunately, this comes at a cost: the data written by these short writes have remarkably shorter retention times and thus need frequent refreshes. As a result, it is impractical to use these short-latency, short-retention writes globally. In this paper, we analyze the temporal behavior of write operations in typical applications and show that the write operations are bursty in nature; that is, during some time intervals the memory is subject to a large number of writes, while during others hardly any memory operations take place. Based on this observation, we propose Quick-and-Dirty (QnD), a lightweight scheme to improve the performance of MLC PCM. When write performance becomes the system bottleneck, QnD performs some write operations using the short-latency, short-retention write mode. Then, when the memory system is relatively quiet, QnD uses idle-memory intervals to refresh the data written by short-latency, short-retention writes in order to mitigate the short-retention problem. Our experimental results show that QnD improves performance by 30.9 percent on geometric mean while still providing acceptable memory lifetime (7.58 years on geometric mean). We also provide sensitivity studies of the aggressiveness, memory coverage, and granularity of the QnD technique.
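The burst/quiet policy can be sketched in a few lines: during bursts, issue fast short-retention writes and remember the dirty lines; during quiet intervals, spend idle cycles refreshing them. The queue-length trigger and the bookkeeping set are hypothetical simplifications of the hardware described in the paper:

```python
class QnDController:
    """Toy QnD policy. `write` picks a write mode from the current
    pending-write queue length; `idle_cycle` consumes quiet time to
    refresh lines written with the short (dirty) mode."""

    def __init__(self, burst_threshold=4):
        self.burst_threshold = burst_threshold  # illustrative trigger
        self.needs_refresh = set()              # lines awaiting refresh

    def write(self, addr, queue_len):
        if queue_len >= self.burst_threshold:   # bursty phase
            self.needs_refresh.add(addr)        # short write: refresh later
            return "short"
        return "long"                           # quiet enough for a long write

    def idle_cycle(self):
        if self.needs_refresh:
            return ("refresh", self.needs_refresh.pop())
        return ("nop", None)
```

Raising `burst_threshold` makes the scheme less aggressive: fewer short writes, fewer refreshes, longer lifetime but less latency benefit, mirroring the aggressiveness knob in the sensitivity study.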

Recommended citation: Quick-and-Dirty: An Architecture for High-Performance Temporary Short Writes in MLC PCM. Mingzhe Zhang, Lunkai Zhang, Lei Jiang, Frederic T Chong, Zhiyong Liu. IEEE Transactions on Computers (TC), Vol. 68, Issue 9, pp. 1365-1375.

FindeR: Accelerating FM-Index-based Exact Pattern Matching in Genomic Sequences through ReRAM technology

Published in 28th International Conference on Parallel Architectures and Compilation (PACT2019), 2019

Abstract

Genomics is the key to enabling precision medicine, ensuring global food security, and enforcing wildlife conservation. The massive genomic data produced by various genome sequencing technologies present a significant challenge for genome analysis. Because of errors from sequencing machines and genetic variations, approximate pattern matching (APM) is a must for practical genome analysis. Recent work proposes FPGA-, ASIC- and even processing-in-memory-based accelerators to boost APM throughput by accelerating dynamic-programming-based algorithms (e.g., Smith-Waterman). However, existing accelerators lack efficient hardware acceleration for exact pattern matching (EPM), an even more critical and essential function used widely in almost every step of genome analysis, including assembly, alignment, annotation, and compression.

Recommended citation: FindeR: Accelerating FM-Index-based Exact Pattern Matching in Genomic Sequences through ReRAM technology. Farzaneh Zokaee, Mingzhe Zhang, Lei Jiang. 28th International Conference on Parallel Architectures and Compilation. PACT 2019.

Architecting Effectual Computation for Machine Learning Accelerators

Published in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2019

Abstract

Inference efficiency is the predominant design consideration for modern machine learning accelerators. The ability to execute multiply-and-accumulate (MAC) operations significantly impacts throughput and energy consumption during inference. However, MAC operations suffer from significant ineffectual computations that severely undermine inference efficiency and must be appropriately handled by the accelerator. The ineffectual computations manifest in two ways: first, zero values as input operands of the multiplier waste time and energy but contribute nothing to the model inference; second, zero bits in non-zero values occupy a large portion of the multiplication time but are useless to the final result. In this article, we propose an ineffectual-free yet cost-effective computing architecture, called Split-and-ACcumulate (SAC), with two essential-bit detection mechanisms to address these intractable problems in tandem. It replaces the conventional MAC operation in the accelerator by manipulating only the essential bits in the parameters (weights) to accomplish the partial-sum computation. It also eliminates multiplications without any accuracy loss and supports a wide range of precision configurations. Based on SAC, we propose an accelerator family called Tetris and demonstrate its application in accelerating state-of-the-art deep learning models. Tetris includes two implementations, designed for either high performance (i.e., cloud applications) or low power consumption (i.e., edge devices), depending on the built-in essential-bit detection mechanism. We evaluate our design on the Vivado HLS platform and achieve up to 6.96× performance enhancement and up to 55.1× energy efficiency improvement over conventional accelerator designs.
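The essential-bit idea can be illustrated with a tiny sketch: each non-zero weight contributes one shifted activation per set bit, so zero weights and zero bits cost nothing. This assumes unsigned integer weights for simplicity and does not model the paper's hardware detection mechanisms:

```python
def essential_bits(w):
    """Positions of the set ('essential') bits of a non-negative weight."""
    return [i for i in range(w.bit_length()) if (w >> i) & 1]

def sac_dot(weights, activations):
    """Split-and-accumulate dot product: every multiply is replaced by
    shift-adds over the weight's essential bits; zero weights (and the
    zero bits inside non-zero weights) are skipped entirely."""
    acc = 0
    for w, a in zip(weights, activations):
        if w == 0:                 # ineffectual operand: skip the whole term
            continue
        for pos in essential_bits(w):
            acc += a << pos        # each essential bit adds a shifted activation
    return acc
```

For example, `sac_dot([3, 0, 5], [2, 7, 1])` performs four shift-adds instead of three multiplies, yet yields the same result as the conventional dot product.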

Recommended citation: Architecting Effectual Computation for Machine Learning Accelerators. Hang Lu, Mingzhe Zhang, Yinhe Han, Huawei Li, Xiaowei Li. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD).

When Deep Learning Meets the Edge: Auto-Masking Deep Neural Networks for Efficient Machine Learning on Edge Devices

Published in 37th IEEE International Conference on Computer Design (ICCD 2019), 2019

Abstract

Deep neural networks (DNNs) have demonstrated promising performance in various machine learning tasks. Due to privacy issues and unpredictable transmission latency, inferring DNN models directly on edge devices drives the development of intelligent systems such as self-driving cars, smart Internet-of-Things (IoT) devices, and autonomous robots. The on-device DNN model is obtained by expensive training on vast volumes of high-quality training data in a cloud datacenter and then deployed onto these devices, where it is expected to work effectively at the edge. However, edge devices often deal with low-quality images caused by compression or environmental noise. The well-trained model, though it may work perfectly in the cloud, cannot adapt to these edge-specific conditions without a remarkable accuracy drop. In this paper, we propose an automated strategy, called “AutoMask”, to enable effective machine learning and accelerate DNN inference on edge devices. AutoMask comprises end-to-end trainable software strategies and a cost-effective hardware accelerator architecture to improve the adaptability of the device without compromising its constrained computation and storage resources. Extensive experiments on the ImageNet dataset and various state-of-the-art DNNs show that AutoMask achieves significant inference acceleration and storage reduction while maintaining a comparable accuracy level on an embedded Xilinx Z7020 FPGA as well as an NVIDIA Jetson TX2.

Recommended citation: When Deep Learning Meets the Edge: Auto-Masking Deep Neural Networks for Efficient Machine Learning on Edge Devices. Ning Lin, Hang Lu, Jingliang Gao, Mingzhe Zhang, Xiaowei Li. 37th IEEE International Conference on Computer Design. ICCD 2019.

Balancing Performance and Energy Efficiency of ONoC by Using Adaptive Bandwidth

Published in 37th IEEE International Conference on Computer Design (ICCD 2019), 2019

Abstract

Considerable energy consumption is a challenge for implementing on-chip optical links. Previous work shows that there exists a trade-off between optical link bandwidth and energy consumption. In this paper, we analyze the temporal behavior of on-chip communication in typical applications and make the following observation: neither the ONoC nor the processor cores are always busy, and there are significant amounts of time during which these components are relatively idle. Based on this observation, we present a novel technique called Shifting Link that dynamically changes the bandwidth of an optical link according to the workload. To evaluate the proposed method, we implement Shifting Link in the FlexiShare ONoC architecture. The experimental results show that, on geometric mean, Shifting Link reduces energy consumption by 35.0% with only a 5.8% decrease in performance.

Recommended citation: Balancing Performance and Energy Efficiency of ONoC by Using Adaptive Bandwidth. Mingzhe Zhang, Lunkai Zhang, Frederic T. Chong, Zhiyong Liu. 37th IEEE International Conference on Computer Design. ICCD 2019.

Self-adaptive Address Mapping Mechanism for Access Pattern Awareness on DRAM

Published in 17th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA 2019), 2019

Abstract

As DRAM is considerably slower storage than the CPU, its long access latency becomes a serious issue and affects the whole execution when fetching data is on the critical path. It is beneficial if the data layout on DRAM, which is decided by the address mapping, can serve data accesses with either good locality or bank-level parallelism. In some cases, however, there is a huge mismatch between the access patterns and the data layout of applications, which makes it difficult to obtain locality or parallelism, and current general-purpose address mappings cannot resolve this well.
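The locality-versus-parallelism tension in address mapping can be seen in a tiny model. This is a generic illustration, not the paper's self-adaptive mechanism; the bit widths (4 banks, 64 B lines, 8 lines per row) are hypothetical.

```python
# Two ways of extracting the bank index from a physical address, and how they
# spread a streaming access pattern differently. Widths are toy values.

BANK_BITS, LINE_BITS, ROW_LINE_BITS = 2, 6, 3  # 4 banks, 64 B lines, 8 lines/row

def locality_mapping(addr):
    """Bank bits above the row-offset bits: consecutive lines stay in one bank,
    maximizing row-buffer locality."""
    line = addr >> LINE_BITS
    bank = (line >> ROW_LINE_BITS) & ((1 << BANK_BITS) - 1)
    row = line >> (ROW_LINE_BITS + BANK_BITS)
    return bank, row

def parallel_mapping(addr):
    """Bank bits taken from the low line bits: consecutive lines rotate across
    banks, maximizing bank-level parallelism."""
    line = addr >> LINE_BITS
    bank = line & ((1 << BANK_BITS) - 1)
    row = line >> (BANK_BITS + ROW_LINE_BITS)
    return bank, row

# Streaming access: eight consecutive 64 B cache lines.
addrs = [i * 64 for i in range(8)]
print([locality_mapping(a)[0] for a in addrs])  # all land in bank 0
print([parallel_mapping(a)[0] for a in addrs])  # rotate 0,1,2,3,0,1,2,3
```

A strided pattern inverts the picture, which is exactly the mismatch the abstract describes: no single static mapping serves every access pattern well.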

Recommended citation: Self-adaptive Address Mapping Mechanism for Access Pattern Awareness on DRAM. Chundian Li, Mingzhe Zhang, Zhiwei Xu, Xianhe Sun. 17th IEEE International Symposium on Parallel and Distributed Processing with Applications. ISPA 2019.

Application-Oriented Data Migration to Accelerate In-Memory Database on Hybrid Memory

Published in Micromachines, 2021

Abstract

With the advantage of faster data access than traditional disks, in-memory database systems, such as Redis and Memcached, have been widely applied in data centers and embedded systems. The performance of an in-memory database greatly depends on the access speed of memory. Driven by the requirements of high bandwidth and low energy, die-stacked memory (e.g., High Bandwidth Memory (HBM)) has been developed to extend the channel number and width. However, the capacity of die-stacked memory is limited due to the interposer challenge. Thus, hybrid memory systems combining traditional Dynamic Random Access Memory (DRAM) with die-stacked memory have emerged. Existing works have proposed to place and manage data on hybrid memory architectures from the hardware's point of view. This paper instead manages in-memory database data on hybrid memory from the application's point of view. We first perform a preliminary study on the hotness distribution of client requests on Redis. From the results, we observe that most requests target a small portion of the data objects in the in-memory database. We then propose Application-oriented Data Migration (ADM) to accelerate in-memory databases on hybrid memory. We design a hotness management method and two migration policies to migrate data into or out of HBM. We take Redis under comprehensive benchmarks as a case study for the proposed method. The experimental results verify that our proposed method effectively improves performance and reduces energy consumption compared with the existing Redis database.
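The core idea of counting per-object accesses and promoting hot objects into a small fast tier can be sketched in a few lines. This is a hedged toy model, not ADM's actual hotness management or migration policies; the class name, capacity, and threshold are hypothetical.

```python
# Toy hotness-driven migration: promote an object to the fast tier (e.g., HBM)
# once its access count crosses a threshold, demoting the coldest resident if full.
from collections import Counter

class HotnessMigrator:
    def __init__(self, fast_capacity, promote_threshold):
        self.counts = Counter()       # per-object access counts
        self.fast = set()             # objects currently in the fast tier
        self.fast_capacity = fast_capacity
        self.promote_threshold = promote_threshold

    def access(self, key):
        self.counts[key] += 1
        if key not in self.fast and self.counts[key] >= self.promote_threshold:
            if len(self.fast) >= self.fast_capacity:
                # Demote the coldest resident object to make room.
                coldest = min(self.fast, key=lambda k: self.counts[k])
                if self.counts[coldest] >= self.counts[key]:
                    return            # fast tier already holds hotter data
                self.fast.discard(coldest)
            self.fast.add(key)

m = HotnessMigrator(fast_capacity=2, promote_threshold=3)
for key in ["a", "a", "a", "b", "c", "c", "c", "a"]:
    m.access(key)
print(sorted(m.fast))  # the hot keys "a" and "c" end up in the fast tier
```

A real system would additionally decay counts over time so that formerly hot objects can cool down, which this sketch omits.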

Recommended citation: Application-Oriented Data Migration to Accelerate In-Memory Database on Hybrid Memory. Wenze Zhao, Yajuan Du, Mingzhe Zhang, Mingyang Liu, Kailun Jin, Rachata Ausavarungnirun. Micromachines.

Accelerating Graph Processing with Lightweight Learning-Based Data Reordering

Published in IEEE Computer Architecture Letters, 2022

Abstract

Graph processing is a vital component in various application domains. However, good graph processing performance is hard to achieve due to intensive irregular data accesses. Noticing that in real-world graphs a small portion of vertices occupy most connections, several techniques have been proposed to reorder vertices based on their access frequency for better data access locality. These approaches can be further improved by identifying the data to reorder more effectively, which reduces reordering overhead and improves overall performance. In this letter, we propose Learning-Based Reordering (LBR), a novel lightweight framework that identifies and reorders hot data adaptively for given graphs, algorithms, and thread counts. Our experimental evaluation indicates that LBR decreases reordering overhead by 24.7% while improving performance by 9.9% compared to the best-performing existing scheme.
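The frequency-based reordering that LBR builds on can be shown with a minimal sketch: give the most-referenced vertices the smallest IDs so their per-vertex data packs into a few hot cache lines. This illustrates only the generic idea, not LBR's learned, adaptive policy; the function name and edge list are made up for illustration.

```python
# Degree/frequency-based vertex reordering: hottest vertices get the lowest IDs.
from collections import Counter

def reorder_by_frequency(edges):
    """Return a mapping old_id -> new_id with the most-referenced vertices first."""
    freq = Counter()
    for u, v in edges:
        freq[u] += 1
        freq[v] += 1
    ranked = sorted(freq, key=lambda vertex: -freq[vertex])
    return {old: new for new, old in enumerate(ranked)}

# A toy star-shaped graph: vertex 5 appears in four of the five edges.
edges = [(0, 5), (1, 5), (2, 5), (3, 5), (4, 2)]
mapping = reorder_by_frequency(edges)
print(mapping[5])  # the highest-degree vertex gets new ID 0
remapped = [(mapping[u], mapping[v]) for u, v in edges]
```

The reordering overhead the letter targets is exactly the cost of computing `freq` and rewriting `remapped` for a large graph, which is why limiting reordering to the hot subset pays off.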

Recommended citation: Accelerating Graph Processing with Lightweight Learning-Based Data Reordering. Mo Zou, Mingzhe Zhang, Rujia Wang, Xian-He Sun, Xiaochun Ye, Dongrui Fan, Zhimin Tang. IEEE Computer Architecture Letters (CAL).

VNet: a versatile network to train real-time semantic segmentation models on a single GPU

Published in Science China Information Sciences, 2022

Abstract

Modern semantic segmentation, which has important applications such as medical image analysis, image editing, and video surveillance, has made remarkable progress using deep convolutional neural network models. Recently, efficient real-time semantic segmentation methods have received considerable attention, as intelligent edge devices not only require faster inference from semantic segmentation models but also cannot rely on the cloud services of data centers. There are two feasible approaches to developing an efficient semantic segmentation model. The first is designing efficient models: developing the model architecture from scratch (e.g., ENet [1]). The second, which is less common but increasingly popular, is network compression: developing lightweight models (e.g., ICNet [2]) with pruning methods [3] that are widely used in image classification tasks. However, with both approaches it is difficult to develop lightweight and fast semantic segmentation models without compromising on accuracy.

Recommended citation: VNet: a versatile network to train real-time semantic segmentation models on a single GPU. Wenxing Li, Ning Lin, Mingzhe Zhang, Hang Lu, Xiaoming Chen, Xiaowei Li. Science China Press. Sci China Inf Sci, 2022, 65(3): 139105, https://doi.org/10.1007/s11432-020-2971-8.

Enhancing GPU Performance via Neighboring Directory Table Based Inter-TLB Sharing

Published in 40th IEEE International Conference on Computer Design (ICCD2022), 2022

Abstract

Modern discrete GPUs support Unified Virtual Memory (UVM), which simplifies GPU programming. However, UVM entails address translation on each memory access, which introduces expensive performance overhead. In this work, we select various workloads and conduct experiments on GPU performance. Our investigation shows that many workloads have low L1 TLB hit ratios of less than 40% on average; for one particular workload, the hit ratio is as low as 15%, which leads to significant performance degradation. Through further analysis, we find that many common entries exist between neighboring private L1 TLBs, showing clear inter-TLB sharing behavior. To leverage this sharing, we propose a hardware scheme based on neighboring directory tables, named NeiDty. In NeiDty, L1 TLBs can probe physical addresses from neighboring L1 TLBs through a lightweight interconnect network, and the neighboring directory tables keep track of the shared entries among neighboring L1 TLBs. In addition, we find it better to install a shared address translation after two consecutive neighboring-TLB hits rather than one. We run eight typical workloads with Gem5-GPU, and the results show that NeiDty increases the average hit ratio of the L1 TLB by 14% and improves average performance by 10%.
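The probe-the-neighbor idea can be modeled functionally in a few lines: on a private-TLB miss, ask a neighboring TLB before falling back to the costly page-table walk, and only install a shared entry locally after two consecutive neighbor hits. This is a behavioral toy, not NeiDty's directory-table hardware; class and field names are invented for illustration.

```python
# Toy model of inter-TLB probing with a two-consecutive-hits install policy.
class TLB:
    def __init__(self):
        self.entries = {}            # virtual page -> physical frame
        self.neighbor = None         # the neighboring TLB we may probe
        self.neighbor_hits = {}      # vpage -> consecutive neighbor-hit count

    def lookup(self, vpage, page_table):
        if vpage in self.entries:
            return self.entries[vpage], "local hit"
        if self.neighbor and vpage in self.neighbor.entries:
            # Neighbor hit: install locally only after two consecutive hits.
            self.neighbor_hits[vpage] = self.neighbor_hits.get(vpage, 0) + 1
            if self.neighbor_hits[vpage] >= 2:
                self.entries[vpage] = self.neighbor.entries[vpage]
            return self.neighbor.entries[vpage], "neighbor hit"
        self.neighbor_hits[vpage] = 0
        frame = page_table[vpage]    # fall back to the expensive page-table walk
        self.entries[vpage] = frame
        return frame, "page walk"

page_table = {0x10: 0xAA, 0x20: 0xBB}
t0, t1 = TLB(), TLB()
t0.neighbor, t1.neighbor = t1, t0
t1.lookup(0x10, page_table)            # t1 walks the page table and caches 0x10
print(t0.lookup(0x10, page_table)[1])  # t0 misses locally but hits in t1
```

Waiting for two consecutive neighbor hits before installing filters out one-off sharing, avoiding pollution of the small local TLB, which matches the observation in the abstract.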

Recommended citation: Enhancing GPU Performance via Neighboring Directory Table Based Inter-TLB Sharing. Yajuan Du, Mingyang Liu, Yuqi Yang, Mingzhe Zhang and Xulong Tang. 40th IEEE International Conference on Computer Design. ICCD 2022.

TensorFHE: Achieving Practical Computation on Encrypted Data Using GPGPU

Published in The 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA2023), 2023

Abstract

In the cloud computing era, privacy protection is becoming pervasive in a broad range of applications (e.g., machine learning, data mining, etc.). Fully Homomorphic Encryption (FHE) is considered the perfect solution as it enables privacy-preserving computation on untrusted servers. Unfortunately, prohibitive performance overhead blocks the wide adoption of FHE (about 10,000× slower than normal computation). As heterogeneous architectures have gained remarkable success in several fields, achieving high performance for FHE with specifically designed accelerators seems a natural choice. Until now, most FHE accelerator designs have been ASIC-based, focusing on efficiently implementing one FHE operation at a time with significantly higher performance than GPUs and FPGAs. However, recent state-of-the-art FHE accelerators rely on expensive, large on-chip storage and a high-end manufacturing process (i.e., 7 nm), which increases the implementation overhead of FHE adoption.

Recommended citation: TensorFHE: Achieving Practical Computation on Encrypted Data Using GPGPU. Shengyu Fan, Zhiwei Wang, Weizhi Xu, Rui Hou, Dan Meng, Mingzhe Zhang. The 29th IEEE International Symposium on High-Performance Computer Architecture. HPCA2023.

Poseidon: Practical Homomorphic Encryption Accelerator

Published in The 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA2023), 2023

Abstract

With the development of Fully Homomorphic Encryption (FHE), an important solution for privacy-preserving computing, the explosion of data size and computing intensity in FHE applications has brought enormous challenges to hardware design. In this paper, we propose a novel HBM+FPGA acceleration scheme named “Poseidon,” which focuses on improving the efficiency of hardware resources and bandwidth. To implement a practical and efficient accelerator that supports complex FHE applications requiring bootstrapping on limited FPGA resources, we refine the FHE application into inseparable computational streams, implement these lowest-level computational units (CUs), and combine them through multiplexing into upper-layer FHE operators that complete the entire FHE application. To utilize resources more efficiently and improve parallelism, we adopt the radix-based NTT algorithm and propose HFAuto, a highly parallel automorphism computing unit suitable for FPGAs. We then design the accelerator based on the optimized CUs and HBM to maximize data and computation parallelism with the limited hardware resources. We evaluate Poseidon with four domain-specific FHE applications on the Alveo U280, a practical HBM+FPGA device. The empirical studies show that the efficient reuse of computational units and on-chip storage makes Poseidon vastly superior to the state-of-the-art FPGA-based accelerator and close in performance to F1, an ASIC-based accelerator: (1) up to 370× speedup over CPU for all FHE operators; (2) up to 1300×/52× speedup over CPU and the best FPGA solution for NTT; (3) up to 10.6×/8.7× speedup over the SOTA GPU implementation and the ASIC-based F1 for homomorphic logistic regression.
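The NTT that Poseidon (and FHE accelerators generally) build polynomial multiplication around is the modular-arithmetic analogue of the FFT. Below is a textbook iterative radix-2 version as a hedged sketch of the underlying math only, not Poseidon's hardware pipeline; the toy parameters (q = 17, n = 8, generator 3) are chosen so that q ≡ 1 (mod 2n).

```python
# Textbook iterative radix-2 NTT over Z_q, with its inverse. Toy parameters.
Q = 17                           # small NTT-friendly prime (Q - 1 divisible by N)
N = 8                            # transform size (power of two)
ROOT = pow(3, (Q - 1) // N, Q)   # primitive N-th root of unity mod Q

def ntt(a, root):
    a = list(a)
    n = len(a)
    # Bit-reversal permutation brings the recursion's leaves into order.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Iterative butterflies, doubling the sub-transform length each pass.
    length = 2
    while length <= n:
        w_len = pow(root, n // length, Q)
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w % Q
                a[k], a[k + length // 2] = (u + v) % Q, (u - v) % Q
                w = w * w_len % Q
        length *= 2
    return a

def intt(a):
    # Inverse transform: use root^-1 and scale by n^-1 (both via Fermat's little theorem).
    inv_n = pow(len(a), Q - 2, Q)
    res = ntt(a, pow(ROOT, Q - 2, Q))
    return [x * inv_n % Q for x in res]

coeffs = [1, 2, 3, 4, 0, 0, 0, 0]
assert intt(ntt(coeffs, ROOT)) == coeffs  # round trip recovers the polynomial
```

In real FHE parameter sets n is in the tens of thousands and q is a product of machine-word-sized primes, which is why the accelerator's NTT radix choice and memory layout dominate its performance.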

Recommended citation: Poseidon: Practical Homomorphic Encryption Accelerator. Yinghao Yang, Huaizhi Zhang, Shengyu Fan, Hang Lu, Mingzhe Zhang, Xiaowei Li. The 29th IEEE International Symposium on High-Performance Computer Architecture. HPCA2023.

Alchemist: A Unified Accelerator Architecture for Cross-Scheme Fully Homomorphic Encryption

Published in Proceedings of the 61st Annual Design Automation Conference (DAC 2024), 2024

Abstract

The use of cross-scheme fully homomorphic encryption (FHE) in privacy-preserving applications presents a new challenge to hardware accelerator design. Existing accelerator architectures with customized polynomial-level operator abstractions fail to efficiently handle hybrid FHE schemes due to the mismatch between computational demands and available hardware resources under various parameter settings. In this work, we propose a new accelerator architecture built around a novel finer-grained low-level operator, the Meta-OP, which not only mathematically supports a diverse range of polynomial operations but is also hardware-friendly for accelerator design without complex topological logic. We then design a new slot-based data management scheme to efficiently handle the distinct memory access patterns over the Meta-OP. With slot-based data management, Alchemist can accelerate both arithmetic and logic FHE workloads with high hardware utilization. In our experiments, Alchemist is up to 24,829× faster than a CPU. For arithmetic FHE, compared with the SOTA ASIC accelerators, Alchemist achieves a 29.4× performance-per-area improvement on average. For logic FHE, compared with the SOTA ASIC accelerators, Alchemist achieves a 7.0× overall speedup on average.

Recommended citation: Alchemist: A Unified Accelerator Architecture for Cross-Scheme Fully Homomorphic Encryption. Jianan Mu, Husheng Han, Shangyi Shi, Jing Ye, Zizhen Liu, Shengwen Liang, Meng Li, Mingzhe Zhang, Song Bian, Xing Hu, Huawei Li, Xiaowei Li. Proceedings of the 61st Annual Design Automation Conference. DAC 2024.

Flagger: Cooperative Acceleration for Large-Scale Cross-Silo Federated Learning Aggregation

Published in 51st ACM/IEEE Annual International Symposium on Computer Architecture (ISCA 2024), 2024

Abstract

Cross-silo federated learning (FL) leverages homomorphic encryption (HE) to obscure the model updates from the clients. However, HE poses the challenges of complex cryptographic computations and inflated ciphertext sizes. As cross-silo FL scales to accommodate larger models and more clients, the overheads of HE can overwhelm a CPU-centric aggregator architecture, including excessive network traffic, enormous data volume, intricate computations, and redundant data movements. Tackling these issues, we propose Flagger, an efficient and high-performance FL aggregator. Flagger meticulously integrates the data processing unit (DPU) with computational storage drives (CSD), employing these two distinct near-data processing (NDP) accelerators as a holistic architecture to collaboratively enhance FL aggregation. With the delicate delegation of complex FL aggregation tasks, we build Flagger-DPU and Flagger-CSD to exploit both in-network and in-storage HE acceleration to streamline FL aggregation. We also implement Flagger-Runtime, a dedicated software layer, to coordinate NDP accelerators and enable direct peer-to-peer data exchanges, markedly reducing data migration burdens. Our evaluation results reveal that Flagger expedites the aggregation in FL training iterations by 436% on average, compared with traditional CPU-centric aggregators.

Recommended citation: Flagger: Cooperative Acceleration for Large-Scale Cross-Silo Federated Learning Aggregation. Xiurui Pan, Yuda An, Shengwen Liang, Bo Mao, Mingzhe Zhang, Qiao Li, Myoungsoo Jung, Jie Zhang. 51st ACM/IEEE Annual International Symposium on Computer Architecture. ISCA 2024.

talks

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.