All sessions will be held in Santa Clara Ballroom unless otherwise noted.
Papers are available for download below to registered attendees now and to everyone beginning Tuesday, February 21. Paper abstracts are available to everyone now. Copyright to the individual works is retained by the author[s].
Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Message from the Program Co-Chairs | Table of Contents
Papers and Proceedings
The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from their respective presentation pages. Copyright to the individual works is retained by the author[s].
Tuesday, February 21, 2023
8:00 am–9:00 am
Continental Breakfast
Mezzanine East/West
9:00 am–9:15 am
Opening Remarks and Awards
Program Co-Chairs: Ashvin Goel, University of Toronto, and Dalit Naor, The Academic College of Tel Aviv–Yaffo
9:15 am–10:15 am
Keynote Address
Building and Operating a Pretty Big Storage System (My Adventures in Amazon S3)
Andy Warfield, Amazon
Five years ago I decided to leave my faculty position at UBC and join Amazon. A lot of that time has been spent working as an engineer on the S3 team. In this talk, I'm going to reflect on some of my experiences working on a cloud storage service, and in particular talk about what it's been like to build storage at the scale of S3. Now, we all know that S3 is pretty big and so it's probably not all that surprising to hear that I expected "scale" to be something that I'd learn a fair bit about when I decided to join. I was, however, a little surprised at some of the ways that scale would really factor into my role, and how it would show up in operating the business. So in the talk, we'll learn about building storage on large quantities of hard drives, and we'll also learn about some of the cultural, organizational, and personal dimensions of building, evolving, and running a really big storage system.
10:15 am–10:45 am
Break with Refreshments
Mezzanine East/West
10:45 am–12:05 pm
Coding and Cloud Storage
Session Chair: Ali R. Butt, Virginia Tech
Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)
Saurabh Kadekodi, Shashwat Silas, David Clausen, and Arif Merchant, Google
Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is of critical importance. Locally recoverable codes (LRCs) have deservedly gained central importance in this field, because they can balance many of these requirements. In our work we study wide LRCs: LRCs with a large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners to unlock higher storage savings, but they come with their own challenges. Of particular interest is their reliability, since wider stripes are prone to more simultaneous failures.
We conduct a practically-minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon that is sensitive to several design choices, some of which are overlooked by theoreticians, and others by practitioners. Based on these insights, we construct novel LRCs called Uniform Cauchy LRCs, which show excellent performance in simulations, and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we also give a remarkably simple and novel construction of distance optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers.
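To make the locality property concrete, here is a minimal Python sketch of the single-failure repair path of an LRC stripe. It uses plain XOR local parities and omits the global parities (and the Cauchy constructions the paper studies); the group size and block contents are illustrative only.

```python
# Minimal illustration of locality in an LRC stripe: data blocks are split
# into local groups, each protected by an XOR parity. A single lost block
# is rebuilt from its own group only, instead of reading the whole stripe.
# This sketch omits the global parities (e.g., Reed-Solomon over GF(2^8))
# that a real code such as the paper's Uniform Cauchy LRC would add.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def encode_local_parities(data_blocks, group_size):
    """Return one XOR parity per local group of `group_size` data blocks."""
    groups = [data_blocks[i:i + group_size]
              for i in range(0, len(data_blocks), group_size)]
    return [xor_blocks(g) for g in groups]

def repair_single_failure(group_blocks, parity, lost_index):
    """Rebuild one lost block from the survivors in its local group."""
    survivors = [b for i, b in enumerate(group_blocks) if i != lost_index]
    return xor_blocks(survivors + [parity])

# A "wide" stripe: 12 data blocks in groups of 4 -> 3 local parities.
data = [bytes([i]) * 8 for i in range(12)]
parities = encode_local_parities(data, group_size=4)
rebuilt = repair_single_failure(data[0:4], parities[0], lost_index=2)
assert rebuilt == data[2]   # repaired by reading only 4 blocks, not 12
```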
ParaRC: Embracing Sub-Packetization for Repair Parallelization in MSR-Coded Storage
Xiaolu Li, Huazhong University of Science and Technology; Keyun Cheng, Kaicheng Tang, and Patrick P. C. Lee, The Chinese University of Hong Kong; Yuchong Hu, Huazhong University of Science and Technology; Dan Feng, Huazhong University of Science and Technology, Wuhan, China; Jie Li and Ting-Yi Wu, Huawei Technologies Co., Ltd., Hong Kong
Minimum-storage regenerating (MSR) codes are provably optimal erasure codes that minimize the repair bandwidth (i.e., the amount of traffic transferred during a repair operation) with minimum storage redundancy in distributed storage systems. However, the practical repair performance of MSR codes still has significant room for improvement, as the mathematical structure of MSR codes makes their repair operations difficult to parallelize. We present ParaRC, a parallel repair framework for MSR codes. ParaRC exploits the sub-packetization nature of MSR codes to parallelize the repair of sub-blocks and balance the repair load (i.e., the amount of traffic sent or received by a node) across the available nodes. We show that there exists a trade-off between the repair bandwidth and the maximum repair load, and further propose a fast heuristic that approximately minimizes the maximum repair load with limited search time for large coding parameters. We prototype our heuristic in ParaRC and show that ParaRC reduces the degraded read and full-node recovery times over the conventional centralized repair approach in MSR codes by up to 59.3% and 39.2%, respectively.
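The maximum-repair-load objective can be pictured with a toy scheduler. The sketch below greedily assigns per-sub-block repair transfers to the least-loaded helper node (a standard longest-processing-time heuristic, not ParaRC's actual search algorithm); the task sizes and node count are made up.

```python
import heapq

# Toy version of balancing repair load across helper nodes: each sub-block
# repair contributes some traffic, and we greedily hand the heaviest tasks
# to the currently least-loaded node (LPT heuristic). This is a stand-in
# for ParaRC's search, just to illustrate the max-load objective.

def balance_repair_load(task_sizes, num_nodes):
    """Assign tasks to nodes, approximately minimizing the max node load."""
    heap = [(0, node) for node in range(num_nodes)]  # (load, node id)
    heapq.heapify(heap)
    assignment = {node: [] for node in range(num_nodes)}
    for size in sorted(task_sizes, reverse=True):
        load, node = heapq.heappop(heap)
        assignment[node].append(size)
        heapq.heappush(heap, (load + size, node))
    max_load = max(sum(tasks) for tasks in assignment.values())
    return assignment, max_load

# Sub-block repair traffic (arbitrary units) spread over 4 helper nodes.
assignment, max_load = balance_repair_load(
    [8, 7, 6, 5, 4, 4, 3, 2, 2, 1], num_nodes=4)
print("max per-node repair load:", max_load)
```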
InftyDedup: Scalable and Cost-Effective Cloud Tiering with Deduplication
Iwona Kotlarska, Andrzej Jackowski, Krzysztof Lichota, Michal Welnicki, and Cezary Dubnicki, 9LivesData, LLC; Konrad Iwanicki, University of Warsaw
Cloud tiering is the process of moving selected data from on-premise storage to the cloud, which has recently become important for backup solutions. As subsequent backups usually contain repeating data, deduplication in cloud tiering can significantly reduce cloud storage utilization, and hence costs.
In this paper, we introduce InftyDedup, a novel system for cloud tiering with deduplication. Unlike existing solutions, it maximizes scalability by utilizing cloud services not only for storage but also for computation. Following a distributed batch approach with dynamically assigned cloud computation resources, InftyDedup can deduplicate multi-petabyte backups from multiple sources at costs on the order of a couple of dollars. Moreover, by selecting between hot and cold cloud storage based on the characteristics of each data chunk, our solution further reduces the overall costs by up to 26%–44%. InftyDedup is implemented in a state-of-the-art commercial backup system and evaluated in the cloud of a hyperscaler.
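The core mechanics, fingerprint-based deduplication plus a per-chunk hot/cold placement decision, can be sketched as follows. Fixed-size chunking and the reference-count promotion rule below are stand-ins for InftyDedup's actual policy.

```python
import hashlib

# Sketch of deduplicated cloud tiering: chunks are identified by content
# hash, stored once, and routed to hot or cold storage by how valuable we
# expect them to be on restore. Fixed-size chunking and the reference-
# count threshold are simplifications, not InftyDedup's actual policy.

CHUNK_SIZE = 4096
HOT_REFCOUNT_THRESHOLD = 4     # hypothetical placement rule

chunk_store = {}               # fingerprint -> (tier, data)
refcounts = {}                 # fingerprint -> number of references

def backup(data: bytes):
    """Split a backup stream into chunks, dedup them, and place new ones."""
    manifest = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        refcounts[fp] = refcounts.get(fp, 0) + 1
        if fp not in chunk_store:
            chunk_store[fp] = ("cold", chunk)      # new data starts cold
        elif refcounts[fp] >= HOT_REFCOUNT_THRESHOLD:
            chunk_store[fp] = ("hot", chunk_store[fp][1])  # promote
        manifest.append(fp)
    return manifest

manifest1 = backup(b"A" * 8192 + b"B" * 4096)
manifest2 = backup(b"A" * 8192)                    # fully deduplicated
print(len(chunk_store), "unique chunks stored")    # 2 unique, not 5 total
```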
Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems
Ruiming Lu, Shanghai Jiao Tong University; Erci Xu, Alibaba Inc. and Shanghai Jiao Tong University; Yiming Zhang, Xiamen University; Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, and Zongpeng Zhu, Alibaba Inc.; Guangtao Xue, Shanghai Jiao Tong University; Jiwu Shu, Xiamen University; Minglu Li, Shanghai Jiao Tong University and Zhejiang Normal University; Jiesheng Wu, Alibaba Inc.
Awarded Best Paper!
The newly emerging "fail-slow" failures plague both software and hardware: the victim components are still functioning, yet with degraded performance. To address this problem, this paper presents Perseus, a practical fail-slow detection framework for storage devices. Perseus leverages a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of drives. Over 10 months of close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the (node-level) 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root cause analysis of fail-slow drives covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow study.
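As a rough illustration of regression-based fail-slow detection, the sketch below fits a least-squares line to fleet-wide (throughput, latency) samples and flags drives whose latency sits far above the prediction. Perseus's actual model and thresholds differ; the `ratio` cutoff and sample data here are invented.

```python
# Minimal flavor of regression-based fail-slow detection: fit a straight
# line to (throughput, latency) samples across a fleet, then flag drives
# whose mean latency sits far above the fitted prediction. Perseus uses a
# more careful model; the 2x cutoff here is illustrative only.

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def flag_fail_slow(samples, ratio=2.0):
    """samples: {drive: [(throughput, latency), ...]}; return slow drives."""
    xs = [x for pts in samples.values() for x, _ in pts]
    ys = [y for pts in samples.values() for _, y in pts]
    a, b = fit_line(xs, ys)
    slow = []
    for drive, pts in samples.items():
        # A drive is suspect if its mean latency is ratio x the prediction.
        pred = sum(a * x + b for x, _ in pts) / len(pts)
        mean = sum(y for _, y in pts) / len(pts)
        if mean > ratio * pred:
            slow.append(drive)
    return slow

fleet = {"d0": [(100, 1.0), (200, 1.9)],
         "d1": [(120, 1.2), (180, 1.7)],
         "bad": [(100, 5.0), (200, 9.0)]}
print(flag_fail_slow(fleet))   # ['bad']
```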
12:05 pm–2:00 pm
Conference Luncheon
Mezzanine East/West
2:00 pm–3:00 pm
Key-Value Stores
Session Chair: Ethan Miller, University of California, Santa Cruz, and Pure Storage
ADOC: Automatically Harmonizing Dataflow Between Components in Log-Structured Key-Value Stores for Improved Performance
Jinghuan Yu, City University of Hong Kong; Sam H. Noh, UNIST & Virginia Tech; Young-ri Choi, UNIST; Chun Jason Xue, City University of Hong Kong
Log-Structured Merge-tree (LSM) based Key-Value (KV) systems are widely deployed. A widely acknowledged problem with LSM-KVs is write stalls, which refer to sudden performance drops under heavy write pressure. Prior studies have attributed write stalls to a particular cause, such as a resource shortage or a scheduling issue. In this paper, we conduct a systematic study of the causes of write stalls by evaluating RocksDB with a variety of storage devices, and show that conclusions focusing on individual aspects, though valid, are not generally applicable. Through a thorough review and further experiments with RocksDB, we show that data overflow, the rapid expansion of one or more components in an LSM-KV system due to a surge in data flow into one of the components, explains the formation of write stalls. We contend that by balancing and harmonizing data flow among components, we can reduce data overflow and thus write stalls. As evidence, we propose a tuning framework called ADOC (Automatic Data Overflow Control) that automatically adjusts the system configurations, specifically the number of threads and the batch size, to minimize data overflow in RocksDB. Our extensive experimental evaluations with RocksDB show that ADOC reduces the duration of write stalls by as much as 87.9% and improves performance by as much as 322.8% compared with the auto-tuned RocksDB. Compared to the manually optimized state-of-the-art SILK, ADOC achieves up to 66% higher throughput for the synthetic write-intensive workload that we used, while achieving comparable performance for the real-world YCSB workloads; however, SILK uses over 20% more DRAM on average.
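The knob-tuning idea can be pictured as a small feedback loop over overflow signals. The rules, thresholds, and bounds below are hypothetical, not ADOC's controller; they only illustrate adjusting the thread count and batch size in response to overflow indicators such as pending flush bytes and L0 file counts.

```python
# Toy feedback loop in the spirit of ADOC: each control interval we look
# for overflow signals and nudge two knobs, the number of background
# threads and the write batch size. The adjustment rules and bounds are
# hypothetical; ADOC's actual controller is more involved.

def tune(stats, threads, batch_kb, max_threads=16, max_batch_kb=1024):
    """One control step: return the adjusted (threads, batch_kb)."""
    if stats["flush_pending_mb"] > 256:
        # MemTables are backing up: add flush/compaction parallelism.
        threads = min(threads * 2, max_threads)
    elif stats["l0_files"] > 20:
        # L0 is overflowing: larger batches amortize per-write overheads
        # and slow the inflow relative to compaction.
        batch_kb = min(batch_kb * 2, max_batch_kb)
    else:
        # No overflow: decay the knobs to reclaim resources.
        threads = max(threads - 1, 1)
        batch_kb = max(batch_kb // 2, 64)
    return threads, batch_kb

threads, batch_kb = 4, 128
for interval_stats in [{"flush_pending_mb": 300, "l0_files": 10},
                       {"flush_pending_mb": 100, "l0_files": 30},
                       {"flush_pending_mb": 50, "l0_files": 5}]:
    threads, batch_kb = tune(interval_stats, threads, batch_kb)
    print(threads, batch_kb)
```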
FUSEE: A Fully Memory-Disaggregated Key-Value Store
Jiacheng Shen, The Chinese University of Hong Kong; Pengfei Zuo, Huawei Cloud; Xuchuan Luo, Fudan University; Tianyi Yang, The Chinese University of Hong Kong; Yuxin Su, Sun Yat-sen University; Yangfan Zhou, Fudan University; Michael R. Lyu, The Chinese University of Hong Kong
Distributed in-memory key-value (KV) stores are embracing the disaggregated memory (DM) architecture for higher resource utilization. However, existing KV stores on DM employ a semi-disaggregated design that stores KV pairs on DM but manages metadata with monolithic metadata servers, hence still suffering from low resource efficiency on metadata servers. To address this issue, this paper proposes FUSEE, a FUlly memory-diSaggrEgated KV StorE that brings disaggregation to metadata management. FUSEE replicates metadata, i.e., the index and memory management information, on memory nodes, manages them directly on the client side, and handles complex failures under the DM architecture. To scalably replicate the index on clients, FUSEE proposes a client-centric replication protocol that allows clients to concurrently access and modify the replicated index. To efficiently manage disaggregated memory, FUSEE adopts a two-level memory management scheme that splits the memory management duty among clients and memory nodes. Finally, to handle the metadata corruption under client failures, FUSEE leverages an embedded operation log scheme to repair metadata with low log maintenance overhead. We evaluate FUSEE with both micro and YCSB hybrid benchmarks. The experimental results show that FUSEE outperforms the state-of-the-art KV stores on DM by up to 4.5 times with less resource consumption.
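The two-level memory management scheme can be sketched as a coarse-grained server-side chunk allocator paired with client-local sub-allocation; the chunk size, capacities, and bump-allocation policy below are illustrative assumptions, not FUSEE's actual layout.

```python
# Toy two-level memory management: the memory node hands out large chunks
# (coarse-grained, infrequent), and each client bump-allocates small KV
# objects inside its own chunk with purely local bookkeeping.

CHUNK = 1 << 20                          # 1 MiB chunks from memory nodes

class MemoryNode:
    def __init__(self, capacity):
        self.free_chunks = list(range(0, capacity, CHUNK))

    def alloc_chunk(self):
        return self.free_chunks.pop()    # the only memory-node-side step

class ClientAllocator:
    def __init__(self, node):
        self.node = node
        self.base = node.alloc_chunk()
        self.off = 0

    def alloc(self, nbytes):
        """Client-local allocation; no memory-node involvement."""
        if self.off + nbytes > CHUNK:
            self.base = self.node.alloc_chunk()
            self.off = 0
        addr = self.base + self.off
        self.off += nbytes
        return addr

node = MemoryNode(capacity=1 << 24)
client = ClientAllocator(node)
print(hex(client.alloc(64)), hex(client.alloc(128)))
```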
ROLEX: A Scalable RDMA-oriented Learned Key-Value Store for Disaggregated Memory Systems
Pengfei Li, Yu Hua, Pengfei Zuo, Zhangyu Chen, and Jiajie Sheng, Huazhong University of Science and Technology
Awarded Best Paper!
Disaggregated memory systems separate monolithic servers into different components, including compute and memory nodes, to enjoy the benefits of high resource utilization, flexible hardware scalability, and efficient data sharing. By exploiting high-performance RDMA (Remote Direct Memory Access), the compute nodes directly access the remote memory pool without involving remote CPUs. Hence, ordered key-value (KV) stores (e.g., B-trees and learned indexes) keep all data sorted to provide range query service via the high-performance network. However, existing ordered KV stores fail to work well on disaggregated memory systems, since they either consume multiple network roundtrips to search the remote data or rely heavily on memory nodes equipped with insufficient computing resources to process data modifications. In this paper, we propose a scalable RDMA-oriented KV store with learned indexes, called ROLEX, to bring ordered KV storage to disaggregated memory systems for efficient data storage and retrieval. ROLEX leverages a retraining-decoupled learned index scheme that dissociates model retraining from data modification operations by adding a bias and data-movement constraints to the learned models. Based on this decoupling, data modifications are executed directly by compute nodes via one-sided RDMA verbs with high scalability, while model retraining is removed from the critical path of data modification and executed asynchronously on memory nodes using dedicated computing resources. Our experimental results on YCSB and real-world workloads demonstrate that ROLEX achieves competitive performance on static workloads and improves performance on dynamic workloads by up to 2.2× over state-of-the-art schemes on disaggregated memory systems. We have released the source code on GitHub.
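A minimal sketch of the learned-index mechanics helps here: a linear model predicts a key's position in a sorted array, and a per-segment error bound limits how far a lookup must probe. ROLEX's bias and data-movement constraints keep that bound valid across inserts without synchronous retraining; the toy below simply retrains eagerly when the bound would break.

```python
import bisect

# Toy learned-index segment: a linear model predicts a key's slot in a
# sorted array; eps bounds the prediction error, so a lookup probes only
# [pred - eps, pred + eps]. ROLEX keeps eps valid across inserts so
# retraining can happen off the critical path; this sketch just retrains
# whenever the bound breaks.

class LearnedSegment:
    def __init__(self, keys):
        self.keys = sorted(keys)
        self._train()

    def _train(self):
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        self.slope = (n - 1) / (hi - lo) if hi > lo else 0.0
        self.eps = max(abs(self._predict(k) - i)
                       for i, k in enumerate(self.keys))

    def _predict(self, key):
        return round((key - self.keys[0]) * self.slope)

    def lookup(self, key):
        pos = self._predict(key)
        lo = max(0, pos - self.eps)
        hi = min(len(self.keys), pos + self.eps + 1)
        i = bisect.bisect_left(self.keys, key, lo, max(lo, hi))
        return i < len(self.keys) and self.keys[i] == key

    def insert(self, key):
        bisect.insort(self.keys, key)
        err = max(abs(self._predict(k) - i)
                  for i, k in enumerate(self.keys))
        if err > self.eps:
            self._train()   # stand-in for ROLEX's asynchronous retraining

seg = LearnedSegment([10, 20, 30, 40, 50])
seg.insert(35)
assert seg.lookup(35) and not seg.lookup(36)
```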
3:00 pm–3:30 pm
Break with Refreshments
Mezzanine East/West
3:30 pm–4:30 pm
AI and Storage
Session Chair: Avani Wildani, Emory University
GL-Cache: Group-level learning for efficient and high-performance caching
Juncheng Yang, Carnegie Mellon University; Ziming Mao, Yale University; Yao Yue, Pelikan Foundation; K. V. Rashmi, Carnegie Mellon University
Web applications rely heavily on software caches to achieve low-latency, high-throughput services. To adapt to changing workloads, three types of learned caches (learned evictions) have been designed in recent years: object-level learning, learning-from-distribution, and learning-from-simple-experts. However, we argue that the learning granularity in existing approaches is either too fine (object-level), incurring significant computation and storage overheads, or too coarse (workload- or expert-level) to capture the differences between objects, leaving a considerable efficiency gap.
In this work, we propose a new approach for learning in caches (group-level learning), which clusters similar objects into groups and performs learning and eviction at the group level. Learning at the group level accumulates more signals for learning, leverages more features with adaptive weights, and amortizes overheads over objects, thereby achieving both high efficiency and high throughput.
We designed and implemented GL-Cache on an open-source production cache to demonstrate group-level learning. Evaluations on 118 production block I/O and CDN cache traces show that GL-Cache has a higher hit ratio and higher throughput than state-of-the-art designs. Compared to LRB (object-level learning), GL-Cache improves throughput by 228× and hit ratio by 7% on average across cache sizes. For 10% of the traces (P90), GL-Cache provides a 25% hit ratio increase over LRB. Compared to the best of all learned caches, GL-Cache achieves a 64% higher throughput, a 3% higher hit ratio on average, and a 13% hit ratio increase at the P90.
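The group-level idea can be sketched as follows: objects inserted close together in time share a group, hits and bytes are accounted per group, and eviction drops the lowest-utility group wholesale. The utility formula here (hits per byte, decayed by age) is an invented stand-in for GL-Cache's learned model.

```python
import time
from collections import defaultdict

# Sketch of group-level eviction: objects inserted in the same time window
# form a group; accounting is per group, and eviction removes the whole
# lowest-utility group at once, amortizing decision costs over objects.

GROUP_WINDOW = 60.0            # seconds of insertions that share a group

class GroupCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.obj_group = {}                      # key -> group id
        self.groups = defaultdict(lambda: {"keys": set(), "bytes": 0,
                                           "hits": 0, "born": 0.0})

    def _utility(self, g, now):
        age = max(now - g["born"], 1.0)
        return (g["hits"] / g["bytes"]) / age    # assumed utility proxy

    def get(self, key):
        gid = self.obj_group.get(key)
        if gid is not None:
            self.groups[gid]["hits"] += 1
            return True
        return False

    def put(self, key, size, now=None):
        now = now or time.time()
        while self.used + size > self.capacity and self.groups:
            victim = min(self.groups,
                         key=lambda g: self._utility(self.groups[g], now))
            g = self.groups.pop(victim)
            self.used -= g["bytes"]
            for k in g["keys"]:
                del self.obj_group[k]
        gid = int(now // GROUP_WINDOW)
        g = self.groups[gid]
        if not g["keys"]:
            g["born"] = now
        g["keys"].add(key)
        g["bytes"] += size
        self.used += size
        self.obj_group[key] = gid
```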
SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
Redwan Ibne Seraj Khan and Ahmad Hossein Yazdani, Virginia Tech; Yuqi Fu, University of Virginia; Arnab K. Paul, BITS Pilani; Bo Ji and Xun Jian, Virginia Tech; Yue Cheng, University of Virginia; Ali R. Butt, Virginia Tech
Deep learning training (DLT) applications exhibit unique I/O workload behaviors that pose new challenges for storage system design. DLT is I/O intensive since data samples need to be fetched continuously from remote storage. Accelerators such as GPUs have been extensively used to support these applications. As accelerators become more powerful and more data-hungry, I/O performance lags behind. This creates a crucial performance bottleneck, especially in distributed DLT. At the same time, exponentially growing dataset sizes make it impossible to store these datasets entirely in memory. While today's DLT frameworks typically use a random sampling policy that treats all samples equally, recent findings indicate that not all samples are equally important: different data samples contribute differently towards improving the accuracy of a model. This observation creates an opportunity for DLT I/O optimizations that exploit the data locality enabled by importance sampling.
To this end, we design and implement SHADE, a new DLT-aware caching system that detects fine-grained importance variations at per-sample level and leverages the variance to make informed caching decisions for a distributed DLT job. SHADE adopts a novel, rank-based approach, which captures the relative importance of data samples across different minibatches. SHADE then dynamically updates the importance scores of all samples during training. With these techniques, SHADE manages to significantly improve the cache hit ratio of the DLT job, and thus, improves the job's training performance. Evaluation with representative computer vision (CV) models shows that SHADE, with a small cache, improves the cache hit ratio by up to 4.5 times compared to the LRU caching policy.
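A minimal sketch of importance-aware admission: each sample carries a score (in SHADE, derived from per-sample loss and rank within the minibatch), and the cache keeps the highest-scoring samples. The exponential-moving-average update below is an assumption for illustration, not SHADE's rank-based scheme.

```python
# Sketch of importance-aware sample caching: admit and retain the samples
# that currently matter most to training, evict the least important one
# only when a more important newcomer arrives.

class ImportanceCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.scores = {}       # sample id -> current importance
        self.data = {}         # sample id -> cached bytes

    def update_score(self, sid, loss, alpha=0.5):
        """Smooth per-sample importance; here an EMA over observed loss."""
        old = self.scores.get(sid, loss)
        self.scores[sid] = (1 - alpha) * old + alpha * loss

    def admit(self, sid, blob):
        if sid in self.data:
            return
        if len(self.data) >= self.capacity:
            victim = min(self.data, key=lambda s: self.scores.get(s, 0.0))
            # Only displace the victim if the newcomer matters more.
            if self.scores.get(sid, 0.0) <= self.scores.get(victim, 0.0):
                return
            del self.data[victim]
        self.data[sid] = blob

cache = ImportanceCache(capacity=2)
for sid, loss in [("a", 0.1), ("b", 0.9), ("c", 0.5)]:
    cache.update_score(sid, loss)
    cache.admit(sid, b"sample-bytes")
print(sorted(cache.data))      # ['b', 'c'] -- 'a' had the lowest loss
```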
Intelligent Resource Scheduling for Co-located Latency-critical Services: A Multi-Model Collaborative Learning Approach
Lei Liu, Beihang University; Xinglei Dou, ICT, CAS; Yuetao Chen, ICT, CAS; Sys-Inventor Lab
Latency-critical services have been widely deployed in cloud environments. For cost-efficiency, multiple services are usually co-located on a server. Thus, run-time resource scheduling becomes the pivot for QoS control in these complicated co-location cases. However, the scheduling exploration space enlarges rapidly with increasing server resources, making it hard for schedulers to provide ideal solutions quickly. More importantly, we observe that there are “resource cliffs” in the scheduling exploration space. They affect the exploration efficiency and always lead to severe QoS fluctuations in previous schedulers. To address these problems, we propose a novel ML-based intelligent scheduler, OSML. It learns the correlation between architectural hints (e.g., IPC, cache misses, memory footprint), scheduling solutions, and QoS demands based on a data set we collected from 11 widely deployed services running on off-the-shelf servers. OSML employs multiple ML models that work collaboratively to predict QoS variations, shepherd the scheduling, and recover from QoS violations in complicated co-location cases. OSML can intelligently avoid resource cliffs during scheduling and reach an optimal solution much faster than previous approaches for co-located LC services. Experimental results show that OSML supports higher loads and meets QoS targets with lower scheduling overheads and shorter convergence time than previous studies.
5:30 pm–7:00 pm
FAST '23 Poster Session and Reception
Mezzanine East/West
Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Take part in discussions with your colleagues over complimentary food and drinks. View the complete list of accepted posters.
Wednesday, February 22, 2023
8:00 am–9:00 am
Continental Breakfast
Mezzanine East/West
9:00 am–10:40 am
File Systems
Session Chair: Peter Desnoyers, Northeastern University and Red Hat
CJFS: Concurrent Journaling for Better Scalability
Joontaek Oh, Seung Won Yoo, and Hojin Nam, KAIST; Changwoo Min, Virginia Tech; Youjip Won, KAIST
In this paper, we propose CJFS, the Concurrent Journaling Filesystem. CJFS extends the EXT4 filesystem and addresses the fundamental limitations of the EXT4 journaling design, which are the main cause of the poor scalability of the EXT4 filesystem. The heavy-weight EXT4 journal suffers from two limitations. First, the journal commit is a strictly serial activity. Second, the journal commit uses the original page cache entry, not a copy of it, so any access to an in-flight page cache entry is blocked. To address these limitations, we propose four techniques: Dual Thread Journaling, Multi-version Shadow Paging, Opportunistic Coalescing, and Compound Flush. With the Dual Thread design, CJFS can commit a transaction before the preceding journal commit finishes. With Multi-version Shadow Paging, CJFS is free from transaction conflicts even when multiple committing transactions exist. With Opportunistic Coalescing, CJFS mitigates the transaction lock-up overhead in journal commit and thus increases the coalescing degree (i.e., the number of system calls associated with a single transaction) of a running transaction. With Compound Flush, CJFS minimizes the number of flush calls. CJFS improves throughput over EXT4 by 81%, 68%, and 125% in filebench varmail, dbench, and OLTP-Insert on MySQL, respectively, by removing the transaction conflict and lock-up overheads.
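Two of the techniques can be loosely modeled in a few lines: committing a copy (snapshot) of dirty pages so writers are never blocked on an in-flight commit, and running commits on more than one thread. The sketch below omits flush ordering, checksums, and all real durability machinery.

```python
import threading, queue

# Loose model of two CJFS ideas: (1) Multi-version Shadow Paging, reduced
# to "commit a private copy of dirty pages so the live page cache is
# never blocked", and (2) Dual Thread Journaling, reduced to "more than
# one commit may be in flight at once".

page_cache = {}                      # block number -> bytes
commit_q = queue.Queue()

def journal_write(txid, snapshot):
    pass                             # stand-in for the actual journal I/O

def commit_worker():
    while True:
        txid, snapshot = commit_q.get()
        if snapshot is None:         # shutdown sentinel
            return
        journal_write(txid, snapshot)
        commit_q.task_done()

def commit(txid, dirty_blocks):
    # Snapshot now; persist asynchronously. Writers may overwrite the
    # live pages immediately without corrupting the in-flight commit.
    snapshot = {b: bytes(page_cache[b]) for b in dirty_blocks}
    commit_q.put((txid, snapshot))

workers = [threading.Thread(target=commit_worker) for _ in range(2)]
for w in workers:
    w.start()
page_cache[7] = b"v1"
commit(1, [7])
page_cache[7] = b"v2"                # not blocked by the in-flight commit
commit(2, [7])
commit_q.join()
for _ in workers:
    commit_q.put((None, None))
```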
Unsafe at Any Copy: Name Collisions from Mixing Case Sensitivities
Aditya Basu and John Sampson, The Pennsylvania State University; Zhiyun Qian, University of California, Riverside; Trent Jaeger, The Pennsylvania State University
File name confusion attacks, such as malicious symlinks and file squatting, have long been studied as sources of security vulnerabilities. However, a recently emerged type, i.e., case-sensitivity-induced name collisions, has not been scrutinized. These collisions are introduced by differences in name resolution under case-sensitive and case-insensitive file systems or directories. A prominent example is the recent Git vulnerability (CVE-2021-21300) which can lead to code execution on a victim client when it clones a maliciously crafted repository onto a case-insensitive file system. With trends including ext4 adding support for per-directory case-insensitivity and the broad deployment of the Windows Subsystem for Linux, the prerequisites for such vulnerabilities are increasingly likely to exist even in a single system.
In this paper, we make a first effort to investigate how and where the lack of any uniform approach to handling name collisions leads to a diffusion of responsibility and resultant vulnerabilities. Interestingly, we demonstrate the existence of a range of novel security challenges arising from name collisions and their inconsistent handling by low-level utilities and applications. Specifically, our experiments show that utilities handle many name collision scenarios unsafely, leaving the responsibility to applications whose developers are unfortunately not yet aware of the threats. We examine three case studies as a first step towards systematically understanding the emerging type of name collision vulnerability.
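A small checker conveys the class of problem studied. The sketch below walks a directory tree and reports names that are distinct under case-sensitive resolution but collide after case folding (Python's casefold approximates, but does not exactly match, any particular file system's folding rules).

```python
import os
from collections import defaultdict

# Find names that are distinct on a case-sensitive file system but would
# collide after case folding, e.g., when the tree is copied to a
# case-insensitive file system or a case-folded directory.

def find_case_collisions(root):
    """Return {(dir, folded_name): [original names...]} with collisions."""
    collisions = {}
    for dirpath, dirnames, filenames in os.walk(root):
        folded = defaultdict(list)
        for name in dirnames + filenames:
            folded[name.casefold()].append(name)
        for fname, names in folded.items():
            if len(names) > 1:
                collisions[(dirpath, fname)] = names
    return collisions

# e.g., a cloned repository containing both 'Makefile' and 'makefile'
for (d, _), names in find_case_collisions(".").items():
    print(f"collision in {d}: {names}")
```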
ConfD: Analyzing Configuration Dependencies of File Systems for Fun and Profit
Tabassum Mahmud, Om Rameshwar Gatla, Duo Zhang, Carson Love, Ryan Bumann, and Mai Zheng, Iowa State University
File systems play an essential role in modern society for managing precious data. To meet diverse needs, they often support many configuration parameters. Such flexibility comes at the price of additional complexity which can lead to subtle configuration-related issues. To address this challenge, we study the configuration-related issues of two major file systems (i.e., Ext4 and XFS) in depth, and identify a prevalent pattern called multilevel configuration dependencies. Based on the study, we build an extensible tool called ConfD to extract the dependencies automatically, and create six plugins to address different configuration-related issues. Our experiments on Ext4 and XFS show that ConfD can extract more than 150 configuration dependencies for the file systems with a low false positive rate. Moreover, the dependency-guided plugins can identify various configuration issues (e.g., mishandling of configurations, regression test failures induced by valid configurations).
HadaFS: A File System Bridging the Local and Shared Burst Buffer for Exascale Supercomputers
Xiaobin He, National Supercomputing Center in Wuxi; Bin Yang, Tsinghua University, Department of Computer Science, and National Supercomputing Center in Wuxi; Jie Gao and Wei Xiao, National Supercomputing Center in Wuxi; Qi Chen, Tsinghua University, Department of Computer Science; Shupeng Shi and Dexun Chen, National Supercomputing Center in Wuxi; Weiguo Liu, Shandong University; Wei Xue, Tsinghua University, Department of Computer Science, BNRist, and National Supercomputing Center in Wuxi; Zuo-ning Chen, Chinese Academy of Engineering
Current supercomputers introduce SSDs to form a Burst Buffer (BB) layer to meet HPC applications' growing I/O requirements. BBs can be divided into two types by deployment location. One is the local BB, known for its scalability and performance. The other is the shared BB, which has the advantages of data sharing and lower deployment cost. How to unify the advantages of the local BB and the shared BB is a key issue in the HPC community.
We propose a novel BB file system named HadaFS that brings the advantages of local BB deployments to shared BB deployments. First, HadaFS offers a new Localized Triage Architecture (LTA) to solve the problem of ultra-scale expansion and data sharing. Then, HadaFS proposes a full-path indexing approach with three metadata synchronization strategies to solve the problems of complex metadata management in traditional file systems and the mismatch with application I/O behaviors. Moreover, HadaFS integrates a data management tool named Hadash, which supports efficient data query in the BB and accelerates data migration between the BB and traditional HPC storage. HadaFS has been deployed on the Sunway New-generation Supercomputer (SNS), serving hundreds of applications and scaling to a maximum of 600,000 clients.
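Full-path indexing can be illustrated in a few lines: metadata is keyed by a hash of the complete path in a flat table, so a lookup is a single probe regardless of path depth, at the cost of making hierarchy operations such as directory renames more expensive. The schema below is illustrative, not HadaFS's on-disk format.

```python
import hashlib

# Sketch of full-path indexing: instead of resolving a path component by
# component through directory inodes, metadata is keyed directly by a
# hash of the full path in a flat table.

metadata = {}     # path-hash -> metadata record

def path_key(path):
    return hashlib.sha1(path.encode()).hexdigest()

def create(path, **attrs):
    metadata[path_key(path)] = {"path": path, **attrs}

def stat(path):
    return metadata.get(path_key(path))

create("/job42/rank0/ckpt.0001", size=0, stripe=4)
print(stat("/job42/rank0/ckpt.0001"))   # one probe, no directory walk
```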
Fisc: A Large-scale Cloud-native-oriented File System
Qiang Li, Alibaba Group; Lulu Chen, Fudan University and Alibaba Group; Xiaoliang Wang, Nanjing University; Shuo Huang, Alibaba Group; Qiao Xiang, Xiamen University; Yuanyuan Dong, Wenhui Yao, Minfei Huang, Puyuan Yang, Shanyang Liu, Zhaosheng Zhu, Huayong Wang, Haonan Qiu, Derui Liu, Shaozong Liu, Yujie Zhou, Yaohui Wu, Zhiwu Wu, Shang Gao, Chao Han, Zicheng Luo, Yuchao Shao, Gexiao Tian, Zhongjie Wu, Zheng Cao, and Jinbo Wu, Alibaba Group; Jiwu Shu, Xiamen University; Jie Wu, Fudan University; Jiesheng Wu, Alibaba Group
The wide adoption of Cloud Native shifts the boundary between cloud users and CSPs (Cloud Service Providers) from VM-based infrastructure to container-based applications. However, traditional file systems face challenges. First, traditional file system clients (e.g., Tectonic, Colossus, HDFS) are sophisticated and compete for scarce resources in application containers. Second, it is challenging for CSPs to pass I/O from containers to storage clusters while guaranteeing security, availability, and performance.
To provide file system service for cloud-native applications, we design Fisc, a cloud-native-oriented file system. Fisc introduces four key designs: 1) a lightweight file system client in the container, 2) a DPU-based virtio-Fisc device to implement hardware offloading, 3) a storage-aware mechanism that routes I/O to storage nodes to improve I/O availability and enable local reads, and 4) a full-path QoS mechanism to guarantee the QoS of hybrid-deployed applications. Fisc has been deployed in production for over three years. It now serves cloud-native applications running on over 3 million cores. Results show that the Fisc client consumes only 80% of the CPU resources of a traditional file system client. In the production environment, the online search task's latency is less than 500 μs when accessing the remote storage cluster.
10:40 am–11:10 am
Break with Refreshments
Mezzanine East/West
11:10 am–12:10 pm
Persistent Memory Systems
Session Chair: Peter Macko, MongoDB
TENET: Memory Safe and Fault Tolerant Persistent Transactional Memory
R. Madhava Krishnan, Virginia Tech; Diyu Zhou, EPFL; Wook-Hee Kim, Konkuk University; Sudarsun Kannan, Rutgers University; Sanidhya Kashyap, EPFL; Changwoo Min, Virginia Tech
Byte-addressable Non-Volatile Memory (NVM) allows programs to directly access storage through a memory interface, without going through the expensive conventional storage stack. However, direct access to NVM makes the NVM data vulnerable to software memory bugs (memory safety) and hardware errors (fault tolerance). This issue is critical because, unlike DRAM, corrupted data can persist forever, even after a system restart. Despite the plethora of research on NVM programs and systems, little attention has been paid to protecting NVM data from software bugs and hardware errors. In this paper, we propose TENET, a new NVM programming framework that guarantees memory safety and fault tolerance to protect NVM data against software bugs and hardware errors. TENET provides the most popular Persistent Transactional Memory (PTM) programming model. TENET leverages the concurrency and commit-time guarantees of a PTM to provide performant and cost-efficient memory safety and fault tolerance. Our evaluation shows that TENET protects NVM data at a modest performance overhead and storage cost compared to other PTMs with partial or no memory safety and fault tolerance support.
MadFS: Per-File Virtualization for Userspace Persistent Memory Filesystems
Shawn Zhong, Chenhao Ye, Guanzhou Hu, Suyan Qu, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Michael Swift, University of Wisconsin–Madison
Persistent memory (PM) can be accessed directly from userspace without kernel involvement, but most PM filesystems still perform metadata operations in the kernel for security and rely on the kernel for cross-process synchronization.
We present per-file virtualization, where a virtualization layer implements a complete set of file functionalities, including metadata management, crash consistency, and concurrency control, in userspace. We observe that not all file metadata need to be maintained by the kernel and propose embedding insensitive metadata into the file for userspace management. For crash consistency, copy-on-write (CoW) benefits from the embedding of the block mapping since the mapping can be efficiently updated without kernel involvement. For cross-process synchronization, we introduce lock-free optimistic concurrency control (OCC) at user level, which tolerates process crashes and provides better scalability.
Based on per-file virtualization, we implement MadFS, a library PM filesystem that maintains the embedded metadata as a compact log. Experimental results show that on concurrent workloads, MadFS achieves up to 3.6× the throughput of ext4-DAX. For real-world applications, MadFS provides up to 48% speedup for YCSB on LevelDB and 85% for TPC-C on SQLite compared to NOVA.
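The lock-free OCC idea can be sketched with a single atomically published log tail: writers prepare an entry privately, try to claim the next slot, and retry on conflict, so a crashed client never strands a lock for other processes. The model below uses a Python lock to stand in for a hardware compare-and-swap on a persistent-memory log; it is not MadFS's implementation.

```python
import threading

# Sketch of lock-free optimistic concurrency control: prepare privately,
# publish with a compare-and-swap on the log tail, retry on conflict.
# No locks are held across an operation, so a process that crashes
# mid-operation cannot block others.

class OptimisticLog:
    def __init__(self):
        self.entries = []                   # committed log entries
        self._cas_guard = threading.Lock()  # models one CAS on the tail

    def try_append(self, expected_tail, entry):
        """Model of compare-and-swap: append only if the tail is unchanged."""
        with self._cas_guard:
            if len(self.entries) != expected_tail:
                return False                # lost the race; caller retries
            self.entries.append(entry)
            return True

    def commit(self, make_entry):
        while True:
            tail = len(self.entries)
            entry = make_entry(tail)        # revalidate against current state
            if self.try_append(tail, entry):
                return tail

log = OptimisticLog()
threads = [threading.Thread(
               target=lambda i=i: log.commit(lambda tail: f"op{i}@{tail}"))
           for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(len(log.entries))                     # 8: every commit landed once
```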
On Stacking a Persistent Memory File System on Legacy File Systems
Hobin Woo, Samsung Electronics; Daegyu Han, Sungkyunkwan University; Seungjoon Ha, Samsung Electronics; Sam H. Noh, UNIST & Virginia Tech; Beomseok Nam, Sungkyunkwan University
In this work, we design and implement a Stackable Persistent memory File System (SPFS), which serves NVMM as a persistent writeback cache to NVMM-oblivious filesystems. SPFS can be stacked on a disk-optimized file system to improve I/O performance by absorbing frequent order-preserving small synchronous writes in NVMM while also exploiting the VFS cache of the underlying disk-optimized file system for non-synchronous writes. A stackable file system must be lightweight in that it manages only NVMM and not the disk or VFS cache. Therefore, SPFS manages all file system metadata including extents using simple but highly efficient dynamic hash tables. To manage extents using hash tables, we design a novel Extent Hashing algorithm that exhibits fast insertion as well as fast scan performance. Our performance study shows that SPFS effectively improves I/O performance of the lower file system by up to 9.9×.
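Hash-based extent management can be pictured as follows: with aligned, fixed-size extents, the table key for any file offset is computable in O(1), so lookups avoid tree traversal and rebalancing. SPFS's Extent Hashing is more refined (it handles variable extent sizes); the 2 MB extents here are an assumption for illustration.

```python
# Sketch of managing extents with a hash table instead of a tree: each
# extent covers a fixed-size aligned range of a file, so the key for any
# byte offset is computable directly and inserts need no rebalancing.

EXTENT_BYTES = 2 * 1024 * 1024

extent_table = {}      # (inode, extent index) -> NVMM block address

def insert_extent(inode, offset, nvmm_addr):
    extent_table[(inode, offset // EXTENT_BYTES)] = nvmm_addr

def lookup(inode, offset):
    """Translate a file offset to an NVMM address, or None if not cached."""
    base = extent_table.get((inode, offset // EXTENT_BYTES))
    if base is None:
        return None
    return base + offset % EXTENT_BYTES

insert_extent(inode=7, offset=0, nvmm_addr=0x10000000)
print(hex(lookup(7, 4096)))    # 0x10001000: hit within the first extent
print(lookup(7, EXTENT_BYTES)) # None: second extent not yet written
```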
12:10 pm–2:00 pm
Conference Luncheon
Mezzanine East/West
2:00 pm–3:00 pm
Remote Memory
Session Chair: Hyungon Moon, UNIST (Ulsan National Institute of Science and Technology)
Citron: Distributed Range Lock Management with One-sided RDMA
Jian Gao, Youyou Lu, Minhui Xie, Qing Wang, and Jiwu Shu, Tsinghua University
Range locks enable concurrent accesses to disjoint parts of shared storage. However, existing range lock managers rely on centralized CPU resources to process lock requests, which results in a server-side CPU bottleneck and suboptimal performance in distributed scenarios.
We propose Citron, an RDMA-enabled distributed range lock manager that bypasses server-side CPUs by using only one-sided RDMA in the range lock acquisition and release paths. Citron manages range locks with a static data structure called a segment tree, which effectively accommodates dynamically located and sized ranges while requiring limited and nearly constant synchronization costs from clients. Citron can also scale itself up in microseconds to adapt to shared storage that grows at runtime. Evaluation shows that under various workloads, Citron delivers up to 3.05× the throughput and 76.4% lower tail latency than CPU-based approaches.
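The segment-tree mechanism is worth sketching: locking a range marks its O(log n) canonical nodes, and a conflicting request is detected either at a locked node on the way down or via a descendant counter at a canonical node. Citron manipulates such nodes remotely with one-sided RDMA verbs; the purely local Python model below shows only the data-structure logic.

```python
# Local model of range locking on a static segment tree over [0, size):
# try_lock(lo, hi) marks the O(log n) canonical nodes covering the range.
# A conflict shows up either as a locked node on the way down (an existing
# lock covering part of the request) or as a nonzero descendant counter at
# a canonical node (an existing lock strictly inside the request).

class RangeLockTree:
    def __init__(self, size):
        self.size = size          # assumed to be a power of two
        self.locked = {}          # node id -> True for canonical lock nodes
        self.inner = {}           # node id -> #locked nodes in its subtree

    def _acquire(self, node, nlo, nhi, lo, hi, acquired):
        if hi <= nlo or nhi <= lo:
            return True                       # disjoint: nothing to do
        if self.locked.get(node):
            return False                      # an existing lock covers us
        if lo <= nlo and nhi <= hi:           # canonical node for [lo, hi)
            if self.inner.get(node, 0):
                return False                  # an existing lock lies inside
            self.locked[node] = True
            acquired.append(node)
            return True
        mid = (nlo + nhi) // 2
        return (self._acquire(2 * node, nlo, mid, lo, hi, acquired) and
                self._acquire(2 * node + 1, mid, nhi, lo, hi, acquired))

    def try_lock(self, lo, hi):
        acquired = []
        if self._acquire(1, 0, self.size, lo, hi, acquired):
            for node in acquired:             # bump ancestor counters
                n = node // 2
                while n:
                    self.inner[n] = self.inner.get(n, 0) + 1
                    n //= 2
            return acquired                   # opaque lock handle
        for node in acquired:                 # roll back a partial acquire
            del self.locked[node]
        return None

    def unlock(self, handle):
        for node in handle:
            del self.locked[node]
            n = node // 2
            while n:
                self.inner[n] -= 1
                n //= 2

tree = RangeLockTree(1 << 20)
h1 = tree.try_lock(0, 4096)
assert tree.try_lock(2048, 8192) is None      # overlap: denied
h2 = tree.try_lock(4096, 8192)                # disjoint: granted
tree.unlock(h1)
assert tree.try_lock(0, 2048) is not None     # freed range: granted again
```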
Patronus: High-Performance and Protective Remote Memory
Bin Yan, Youyou Lu, Qing Wang, Minhui Xie, and Jiwu Shu, Tsinghua University
RDMA-enabled remote memory (RM) systems are gaining popularity with improved memory utilization and elasticity. However, since it is commonly believed that fine-grained RDMA permission management is impractical, existing RM systems forgo memory protection, an indispensable property in real-world deployments. In this paper, we propose PATRONUS, an RM system that simultaneously offers protection and high performance. PATRONUS introduces a fast permission management mechanism by exploiting advanced RDMA hardware features with a set of elaborate software techniques. Moreover, to retain high performance in exception scenarios (e.g., client failures, illegal access), PATRONUS attaches microsecond-scale leases to permissions and reserves spare RDMA resources for fast recovery. We evaluate PATRONUS over two one-sided data structures and two function-as-a-service (FaaS) applications. The experiments show that protection brings only 2.4% to 27.7% overhead across all workloads, and our system performs up to 5.2× better than the best competitor.
More Than Capacity: Performance-oriented Evolution of Pangu in Alibaba
Qiang Li, Alibaba Group; Qiao Xiang, Xiamen University; Yuxin Wang, Haohao Song, and Ridi Wen, Xiamen University and Alibaba Group; Wenhui Yao, Yuanyuan Dong, Shuqi Zhao, Shuo Huang, Zhaosheng Zhu, Huayong Wang, Shanyang Liu, Lulu Chen, Zhiwu Wu, Haonan Qiu, Derui Liu, Gexiao Tian, Chao Han, Shaozong Liu, Yaohui Wu, Zicheng Luo, Yuchao Shao, Junping Wu, Zheng Cao, Zhongjie Wu, Jiaji Zhu, and Jinbo Wu, Alibaba Group; Jiwu Shu, Xiamen University; Jiesheng Wu, Alibaba Group
This paper presents how the Pangu storage system continuously evolves with hardware technologies and the business model to provide high-performance, reliable storage services with 100-microsecond-level I/O latency. Pangu's evolution includes two phases. In the first phase, Pangu embraces the emergence of solid-state drive (SSD) storage and remote direct memory access (RDMA) network technologies by innovating its file system and designing a user-space storage operating system to substantially reduce I/O latency while providing high throughput and IOPS. In the second phase, Pangu evolves from a volume-oriented storage provider to a performance-oriented one. To adapt to this change of business model, Pangu upgrades its infrastructure with storage servers of much larger SSD capacity and raises RDMA bandwidth from 25 Gbps to 100 Gbps. It introduces a series of key designs, including traffic amplification reduction, remote direct cache access, and CPU computation offloading, to ensure Pangu fully harvests the performance improvement brought by the hardware upgrade. Beyond these technical innovations, we also share our operating experiences during Pangu's evolution and discuss important lessons learned from them.
3:00 pm–3:30 pm
Break with Refreshments
Mezzanine East/West
3:30 pm–3:45 pm
FAST '23 Test of Time Award Presentation
A Study of Practical Deduplication
Dutch T. Meyer and William J. Bolosky
Published in the Proceedings of the 9th USENIX Conference on File and Storage Technologies, February 2011
3:45 pm–5:00 pm
Work-in-Progress Reports (WiPs)
View the list of accepted Work-in-Progress Reports.
5:00 pm–6:00 pm
Happy Hour
Mezzanine East/West
Thursday, February 23, 2023
8:00 am–9:00 am
Continental Breakfast
Mezzanine East/West
9:00 am–10:20 am
IO Stacks
Session Chair: Carl Waldspurger, Carl Waldspurger Consulting
λ-IO: A Unified IO Stack for Computational Storage
Zhe Yang, Youyou Lu, Xiaojian Liao, Youmin Chen, Junru Li, Siyu He, and Jiwu Shu, Tsinghua University
The emerging computational storage device offers an opportunity for in-storage computing. It alleviates the overhead of data movement between the host and the device, and thus accelerates data-intensive applications. In this paper, we present λ-IO, a unified IO stack managing both computation and storage resources across the host and the device. We propose a set of designs, spanning the interface, runtime, and scheduling, to tackle three critical issues. We implement λ-IO in a full-stack software and hardware environment and evaluate it with synthetic and real applications against Linux IO, showing up to a 5.12× performance improvement.
Revitalizing the Forgotten On-Chip DMA to Expedite Data Movement in NVM-based Storage Systems
Jingbo Su, Jiahao Li, and Luofan Chen, University of Science and Technology of China; Cheng Li, University of Science and Technology of China and Anhui Province Key Laboratory of High Performance Computing; Kai Zhang and Liang Yang, SmartX; Sam H. Noh, UNIST & Virginia Tech; Yinlong Xu, University of Science and Technology of China and Anhui Province Key Laboratory of High Performance Computing
Data-intensive applications executing on NVM-based storage systems experience serious bottlenecks when moving data between DRAM and NVM. We advocate for the use of the long-existing but recently neglected on-chip DMA to expedite data movement, with three contributions. First, we explore new latency-oriented optimization directions, driven by a comprehensive DMA study, to design a high-performance DMA module that significantly lowers the I/O size threshold at which benefits are observed. Second, we propose a new data movement engine, Fastmove, that coordinates the use of the DMA along with the CPU through judicious scheduling and load splitting, so that the DMA's limitations are compensated for and the overall gains are maximized. Finally, with a general kernel-based design, simple APIs, and DAX file system integration, Fastmove lets applications transparently exploit the DMA and its new features without code changes. We run three data-intensive applications, MySQL, GraphWalker, and Filebench, atop NOVA, ext4-DAX, and XFS-DAX, with standard benchmarks like TPC-C and popular graph algorithms like PageRank. Across single- and multi-socket settings, compared to conventional CPU-only NVM accesses, Fastmove delivers 1.13-2.16× peak-throughput speedups for TPC-C on MySQL, reduces average latency by 17.7-60.8%, and saves 37.1-68.9% of the CPU usage spent on data movement. It also shortens the execution time of graph algorithms in GraphWalker by 39.7-53.4% and delivers 1.12-1.27× throughput speedups for Filebench.
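The load-splitting idea reduces to a small planning function: small copies stay on the CPU because DMA setup costs dominate, while large copies are split so both engines finish together. The threshold and bandwidth numbers below are made-up parameters, not measurements from the paper.

```python
# Toy version of the load-splitting idea: below a size threshold, the
# per-operation DMA setup cost dominates and the CPU wins; above it, the
# copy is split proportionally to bandwidth so the CPU and the DMA engine
# complete at roughly the same time.

DMA_THRESHOLD = 64 * 1024     # below this, DMA setup cost dominates
CPU_GBPS, DMA_GBPS = 8.0, 24.0

def plan_copy(nbytes):
    """Return (cpu_bytes, dma_bytes) for one data-movement request."""
    if nbytes < DMA_THRESHOLD:
        return nbytes, 0
    cpu_share = CPU_GBPS / (CPU_GBPS + DMA_GBPS)
    cpu_bytes = int(nbytes * cpu_share)
    return cpu_bytes, nbytes - cpu_bytes

for size in (4 * 1024, 1 * 1024 * 1024):
    print(size, plan_copy(size))
```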
NVMeVirt: A Versatile Software-defined Virtual NVMe Device
Sang-Hoon Kim, Ajou University; Jaehoon Shim, Euidong Lee, Seongyeop Jeong, Ilkueon Kang, and Jin-Soo Kim, Seoul National University
There have been drastic changes in the storage device landscape recently. At the center of the diverse storage landscape lies the NVMe interface, which allows high-performance and flexible communication models required by these next-generation device types. However, its hardware-oriented definition and specification are bottlenecking the development and evaluation cycle for new revolutionary storage devices.
In this paper, we present NVMeVirt, a novel approach to facilitating software-defined NVMe devices. A user can define any NVMe device type with custom features, and NVMeVirt bridges the gap between the host I/O stack and the virtual NVMe device in software. We demonstrate the advantages and features of NVMeVirt by realizing various storage types and configurations, such as conventional SSDs, low-latency high-bandwidth NVM SSDs, zoned namespace SSDs, and key-value SSDs, with support for PCI peer-to-peer DMA and NVMe-oF target offloading. We also make the case for storage research with NVMeVirt, such as studying the performance characteristics of database engines and extending the NVMe specification for improved key-value SSD performance.
SMRSTORE: A Storage Engine for Cloud Object Storage on HM-SMR Drives
Su Zhou, Erci Xu, Hao Wu, Yu Du, Jiacheng Cui, Wanyu Fu, Chang Liu, Yingni Wang, Wenbo Wang, Shouqu Sun, Xianfei Wang, Bo Feng, Biyun Zhu, Xin Tong, Weikang Kong, Linyan Liu, Zhongjie Wu, Jinbo Wu, Qingchao Luo, and Jiesheng Wu, Alibaba Group
Cloud object storage vendors are always in pursuit of better cost efficiency. Emerging Shingled Magnetic Recording (SMR) drives are becoming economically favorable in archival storage systems due to significantly improved areal density. However, for standard-class object storage, previous studies and our preliminary exploration revealed that the existing SMR drive solutions can experience severe throughput variations due to garbage collection (GC).
In this paper, we introduce SMRSTORE, an SMR-based storage engine for standard-class object storage that compromises neither performance nor durability. The key features of SMRSTORE include implementing chunk store interfaces directly over SMR drives, using a completely log-structured design, and applying guided data placement to reduce GC for consistent performance. The evaluation shows that SMRSTORE delivers performance comparable to Ext4 on Conventional Magnetic Recording (CMR) drives and can be up to 2.16× faster than F2FS on SMR drives. By switching to SMR drives, we have decreased the total cost by up to 15% while providing performance on par with the prior system. Currently, we have deployed SMRSTORE in the standard class of Alibaba Cloud Object Storage Service (OSS) to store hundreds of PBs of data. We plan to use SMR drives for all classes of OSS in the near future.
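Guided placement on a log-structured layout can be sketched briefly: data with similar expected lifetime is appended to the same zone, so zones tend to become wholly dead and GC moves little live data. The lifetime classes and zone size below are illustrative.

```python
# Sketch of guided placement on zones: group data by expected lifetime so
# that whole zones tend to die together and garbage collection relocates
# little or no live data.

ZONE_BLOCKS = 4

class ZonedLog:
    def __init__(self):
        self.zones = {}              # lifetime class -> open zone (list)
        self.sealed = []             # full zones awaiting GC / reset

    def append(self, lifetime_class, block):
        zone = self.zones.setdefault(lifetime_class, [])
        zone.append(block)
        if len(zone) == ZONE_BLOCKS:     # zone full: seal, open a new one
            self.sealed.append((lifetime_class, zone))
            self.zones[lifetime_class] = []

log = ZonedLog()
for i in range(6):
    log.append("short-lived", f"tmp{i}")
log.append("long-lived", "meta0")
print(len(log.sealed), "sealed zone(s)")   # 1, containing only tmp blocks
```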
10:20 am–10:50 am
Break with Refreshments
Mezzanine East/West
10:50 am–11:50 am
SSDs and Smartphones
Session Chair: Geoff Kuenning, Harvey Mudd College
Multi-view Feature-based SSD Failure Prediction: What, When, and Why
Yuqi Zhang and Wenwen Hao, Samsung R&D Institute China Xi'an, Samsung Electronics; Ben Niu and Kangkang Liu, Tencent; Shuyang Wang, Na Liu, and Xing He, Samsung R&D Institute China Xi'an, Samsung Electronics; Yongwong Gwon and Chankyu Koh, Samsung Electronics
Solid state drives (SSDs) play an important role in large-scale data centers. SSD failures affect the stability of storage systems and cause additional maintenance overhead. To predict and handle SSD failures in advance, this paper proposes a multi-view and multi-task random forest (MVTRF) scheme. MVTRF predicts SSD failures based on multi-view features extracted from both long-term and short-term monitoring data of SSDs. In particular, multi-task learning is adopted to simultaneously predict what type of failure will occur and when it will occur through the same model. We also extract the key decisions of MVTRF to analyze why a failure will occur. These failure details are useful for verifying and handling SSD failures. The proposed MVTRF is evaluated on large-scale real data from data centers. The experimental results show that MVTRF has higher failure prediction accuracy, improving precision by 46.1% and recall by 57.4% on average compared with existing schemes. The results also demonstrate the effectiveness of MVTRF on failure type and time prediction and failure cause identification, which helps improve the efficiency of failure handling.
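The multi-task angle maps naturally onto a multi-output random forest, which scikit-learn supports directly. In the sketch below, one forest jointly predicts a failure-type label and a time-to-failure bucket; the features and labels are invented for illustration and are not the paper's feature set.

```python
# Sketch of the multi-task idea with scikit-learn's multi-output random
# forest: one model jointly predicts the failure type and the
# time-to-failure bucket from features summarizing long- and short-term
# telemetry. Feature names and labels here are invented.

from sklearn.ensemble import RandomForestClassifier

# Each row: [long-term error trend, short-term latency spike count]
X = [[0.1, 2], [0.9, 14], [0.2, 1], [0.8, 20], [0.7, 9], [0.1, 0]]
# Two tasks per sample: failure type (0 = none, 1 = media, 2 = firmware)
# and time-to-failure bucket (0 = none, 1 = within 30d, 2 = within 7d).
y = [[0, 0], [1, 2], [0, 0], [2, 2], [1, 1], [0, 0]]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)                       # multi-output: one forest, two tasks
print(model.predict([[0.85, 16]]))    # e.g., [[1 2]]: media failure, <7d
```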
Fast Application Launch on Personal Computing/Communication Devices
Junhee Ryu, SK hynix; Dongeun Lee, Texas A&M University - Commerce; Kang G. Shin, University of Michigan; Kyungtae Kang, Hanyang University
We present Paralfetch, a novel prefetcher that speeds up app launches on personal computing/communication devices by: 1) accurately collecting launch-related disk read requests, 2) pre-scheduling these requests to improve I/O throughput during prefetching, and 3) overlapping app execution with disk prefetching to hide disk access time from app execution. We have implemented Paralfetch under Linux kernels on a desktop/laptop PC, a Raspberry Pi 3 board, and an Android smartphone. Tests with popular apps show that Paralfetch significantly reduces app launch times on flash-based drives, and outperforms GSoC Prefetch and FAST, which are representative app prefetchers available for Linux-based systems.
Integrated Host-SSD Mapping Table Management for Improving User Experience of Smartphones
Yoona Kim and Inhyuk Choi, Seoul National University; Juhyung Park, Jaeheon Lee, and Sungjin Lee, DGIST; Jihong Kim, Seoul National University
Host Performance Booster (HPB) was proposed to improve the performance of high-capacity mobile flash storage systems by utilizing unused host DRAM memory. In this paper, we investigate how HPB should be managed so that the user experience of smartphones can be enhanced by HPB-enabled high-performance mobile storage systems. From our empirical study on Android environments, we identified two requirements for an efficient HPB management scheme on smartphones. First, HPB should be managed in a foreground-app-centric manner so that user-perceived latency can be greatly reduced. Second, the capacity of the HPB memory should be dynamically adjusted so as not to degrade the user experience of the foreground app. As an efficient HPB management solution that meets the identified requirements, we propose an integrated host-SSD mapping table management scheme, HPBvalve, for smartphones. HPBvalve prioritizes the foreground app when managing mapping table entries in the HPB memory, and dynamically resizes the overall capacity of the HPB memory depending on the memory pressure status of the smartphone. Our experimental results using the prototype implementation demonstrate that HPBvalve improves UX-critical app launching time by up to 43% (250 ms) over the existing HPB management scheme, without negatively affecting memory pressure. Meanwhile, L2P mapping misses are reduced by up to 78%.
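A toy version of foreground-app-centric map caching: logical-to-physical (L2P) entries cached in host DRAM are evicted background-apps-first, and the cache can be resized when the system reports memory pressure. The policy details below are assumptions for illustration, not HPBvalve's algorithm.

```python
from collections import OrderedDict

# Sketch of foreground-app-centric mapping-table caching: entries owned by
# the foreground app are evicted last, and the whole cache can shrink when
# the device reports memory pressure.

class HPBCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # lpn -> (ppn, owner app)
        self.foreground = None

    def _evict_one(self):
        # Prefer a background app's entry; fall back to LRU order.
        for lpn, (_, owner) in self.entries.items():
            if owner != self.foreground:
                del self.entries[lpn]
                return
        self.entries.popitem(last=False)

    def insert(self, lpn, ppn, owner):
        while len(self.entries) >= self.capacity:
            self._evict_one()
        self.entries[lpn] = (ppn, owner)

    def resize(self, new_capacity):
        """Shrink under memory pressure, grow when memory is plentiful."""
        self.capacity = new_capacity
        while len(self.entries) > self.capacity:
            self._evict_one()

cache = HPBCache(capacity=2)
cache.foreground = "camera"
cache.insert(10, 1000, "camera")
cache.insert(11, 1001, "browser")
cache.insert(12, 1002, "camera")       # evicts the browser entry first
print(list(cache.entries))             # [10, 12]
```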
11:50 am–12:45 pm
Farewell Boxed Lunch
Mezzanine East/West