All sessions will be held in Santa Clara Ballroom unless otherwise noted.
Papers are available for download below to registered attendees now and to everyone beginning Tuesday, February 21. Paper abstracts are available to everyone now. Copyright to the individual works is retained by the author[s].
Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Message from the Program Co-Chairs | Table of Contents
Papers and Proceedings
The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from their respective presentation pages. Copyright to the individual works is retained by the author[s].
Tuesday, February 21, 2023
8:00 am–9:00 am
Continental Breakfast
Mezzanine East/West
9:00 am–9:15 am
Opening Remarks and Awards
Program Co-Chairs: Ashvin Goel, University of Toronto, and Dalit Naor, The Academic College of Tel Aviv–Yaffo
9:15 am–10:15 am
Keynote Address
Building and Operating a Pretty Big Storage System (My Adventures in Amazon S3)
Andy Warfield, Amazon
Five years ago I decided to leave my faculty position at UBC and join Amazon. A lot of that time has been spent working as an engineer on the S3 team. In this talk, I'm going to reflect on some of my experiences working on a cloud storage service, and in particular talk about what it's been like to build storage at the scale of S3. Now, we all know that S3 is pretty big and so it's probably not all that surprising to hear that I expected "scale" to be something that I'd learn a fair bit about when I decided to join. I was, however, a little surprised at some of the ways that scale would really factor into my role, and how it would show up in operating the business. So in the talk, we'll learn about building storage on large quantities of hard drives, and we'll also learn about some of the cultural, organizational, and personal dimensions of building, evolving, and running a really big storage system.
10:15 am–10:45 am
Break with Refreshments
Mezzanine East/West
10:45 am–12:05 pm
Coding and Cloud Storage
Session Chair: Ali R. Butt, Virginia Tech
Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)
Saurabh Kadekodi, Shashwat Silas, David Clausen, and Arif Merchant, Google
Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is of critical importance. Locally recoverable codes (LRCs) have deservedly gained central importance in this field, because they can balance many of these requirements. In our work we study wide LRCs: LRCs with a large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners to unlock higher storage savings, but they come with their own challenges. Of particular interest is their reliability, since wider stripes are prone to more simultaneous failures.
We conduct a practically-minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon that is sensitive to several design choices, some of which are overlooked by theoreticians, and others by practitioners. Based on these insights, we construct novel LRCs called Uniform Cauchy LRCs, which show excellent performance in simulations, and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we also give a remarkably simple and novel construction of distance optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers.
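To make the locality property concrete, here is a minimal Python sketch of the single-failure repair path of an LRC stripe. It uses plain XOR local parities and omits the global parities (and the Cauchy constructions the paper studies); the group size and block contents are illustrative only.

```python
# Minimal illustration of locality in an LRC stripe: data blocks are split
# into local groups, each protected by an XOR parity. A single lost block
# is rebuilt from its own group only, instead of reading the whole stripe.
# This sketch omits the global parities (e.g., Reed-Solomon over GF(2^8))
# that a real code such as the paper's Uniform Cauchy LRC would add.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def encode_local_parities(data_blocks, group_size):
    """Return one XOR parity per local group of `group_size` data blocks."""
    groups = [data_blocks[i:i + group_size]
              for i in range(0, len(data_blocks), group_size)]
    return [xor_blocks(g) for g in groups]

def repair_single_failure(group_blocks, parity, lost_index):
    """Rebuild one lost block from the survivors in its local group."""
    survivors = [b for i, b in enumerate(group_blocks) if i != lost_index]
    return xor_blocks(survivors + [parity])

# A "wide" stripe: 12 data blocks in groups of 4 -> 3 local parities.
data = [bytes([i]) * 8 for i in range(12)]
parities = encode_local_parities(data, group_size=4)
rebuilt = repair_single_failure(data[0:4], parities[0], lost_index=2)
assert rebuilt == data[2]   # repaired by reading only 4 blocks, not 12
```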
ParaRC: Embracing Sub-Packetization for Repair Parallelization in MSR-Coded Storage
Xiaolu Li, Huazhong University of Science and Technology; Keyun Cheng, Kaicheng Tang, and Patrick P. C. Lee, The Chinese University of Hong Kong; Yuchong Hu, Huazhong University of Science and Technology; Dan Feng, Huazhong University of Science and Technology, Wuhan, China; Jie Li and Ting-Yi Wu, Huawei Technologies Co., Ltd., Hong Kong
Minimum-storage regenerating (MSR) codes are provably optimal erasure codes that minimize the repair bandwidth (i.e., the amount of traffic transferred during a repair operation) with minimum storage redundancy in distributed storage systems. However, the practical repair performance of MSR codes still has significant room for improvement, as the mathematical structure of MSR codes makes their repair operations difficult to parallelize. We present ParaRC, a parallel repair framework for MSR codes. ParaRC exploits the sub-packetization nature of MSR codes to parallelize the repair of sub-blocks and balance the repair load (i.e., the amount of traffic sent or received by a node) across the available nodes. We show that there exists a trade-off between the repair bandwidth and the maximum repair load, and further propose a fast heuristic that approximately minimizes the maximum repair load with limited search time for large coding parameters. We prototype our heuristic in ParaRC and show that ParaRC reduces the degraded read and full-node recovery times over the conventional centralized repair approach in MSR codes by up to 59.3% and 39.2%, respectively.
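The maximum-repair-load objective can be pictured with a toy scheduler. The sketch below greedily assigns per-sub-block repair transfers to the least-loaded helper node (a standard longest-processing-time heuristic, not ParaRC's actual search algorithm); the task sizes and node count are made up.

```python
import heapq

# Toy version of balancing repair load across helper nodes: each sub-block
# repair contributes some traffic, and we greedily hand the heaviest tasks
# to the currently least-loaded node (LPT heuristic). This is a stand-in
# for ParaRC's search, just to illustrate the max-load objective.

def balance_repair_load(task_sizes, num_nodes):
    """Assign tasks to nodes, approximately minimizing the max node load."""
    heap = [(0, node) for node in range(num_nodes)]  # (load, node id)
    heapq.heapify(heap)
    assignment = {node: [] for node in range(num_nodes)}
    for size in sorted(task_sizes, reverse=True):
        load, node = heapq.heappop(heap)
        assignment[node].append(size)
        heapq.heappush(heap, (load + size, node))
    max_load = max(sum(tasks) for tasks in assignment.values())
    return assignment, max_load

# Sub-block repair traffic (arbitrary units) spread over 4 helper nodes.
assignment, max_load = balance_repair_load(
    [8, 7, 6, 5, 4, 4, 3, 2, 2, 1], num_nodes=4)
print("max per-node repair load:", max_load)
```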
InftyDedup: Scalable and Cost-Effective Cloud Tiering with Deduplication
Iwona Kotlarska, Andrzej Jackowski, Krzysztof Lichota, Michal Welnicki, and Cezary Dubnicki, 9LivesData, LLC; Konrad Iwanicki, University of Warsaw
Cloud tiering is the process of moving selected data from on-premise storage to the cloud, which has recently become important for backup solutions. As subsequent backups usually contain repeating data, deduplication in cloud tiering can significantly reduce cloud storage utilization, and hence costs.
In this paper, we introduce InftyDedup, a novel system for cloud tiering with deduplication. Unlike existing solutions, it maximizes scalability by utilizing cloud services not only for storage but also for computation. Following a distributed batch approach with dynamically assigned cloud computation resources, InftyDedup can deduplicate multi-petabyte backups from multiple sources at costs on the order of a couple of dollars. Moreover, by selecting between hot and cold cloud storage based on the characteristics of each data chunk, our solution further reduces the overall costs by up to 26%–44%. InftyDedup is implemented in a state-of-the-art commercial backup system and evaluated in the cloud of a hyperscaler.
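The core mechanics, fingerprint-based deduplication plus a per-chunk hot/cold placement decision, can be sketched as follows. Fixed-size chunking and the reference-count promotion rule below are stand-ins for InftyDedup's actual policy.

```python
import hashlib

# Sketch of deduplicated cloud tiering: chunks are identified by content
# hash, stored once, and routed to hot or cold storage by how valuable we
# expect them to be on restore. Fixed-size chunking and the reference-
# count threshold are simplifications, not InftyDedup's actual policy.

CHUNK_SIZE = 4096
HOT_REFCOUNT_THRESHOLD = 4     # hypothetical placement rule

chunk_store = {}               # fingerprint -> (tier, data)
refcounts = {}                 # fingerprint -> number of references

def backup(data: bytes):
    """Split a backup stream into chunks, dedup them, and place new ones."""
    manifest = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        refcounts[fp] = refcounts.get(fp, 0) + 1
        if fp not in chunk_store:
            chunk_store[fp] = ("cold", chunk)      # new data starts cold
        elif refcounts[fp] >= HOT_REFCOUNT_THRESHOLD:
            chunk_store[fp] = ("hot", chunk_store[fp][1])  # promote
        manifest.append(fp)
    return manifest

manifest1 = backup(b"A" * 8192 + b"B" * 4096)
manifest2 = backup(b"A" * 8192)                    # fully deduplicated
print(len(chunk_store), "unique chunks stored")    # 2 unique, not 5 total
```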
Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems
Ruiming Lu, Shanghai Jiao Tong University; Erci Xu, Alibaba Inc. and Shanghai Jiao Tong University; Yiming Zhang, Xiamen University; Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, and Zongpeng Zhu, Alibaba Inc.; Guangtao Xue, Shanghai Jiao Tong University; Jiwu Shu, Xiamen University; Minglu Li, Shanghai Jiao Tong University and Zhejiang Normal University; Jiesheng Wu, Alibaba Inc.
Awarded Best Paper!
The newly emerging "fail-slow" failures plague both software and hardware: the victim components are still functioning, yet with degraded performance. To address this problem, this paper presents Perseus, a practical fail-slow detection framework for storage devices. Perseus leverages a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of drives. Over 10 months of close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the (node-level) 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root cause analysis of fail-slow drives covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow study.
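As a rough illustration of regression-based fail-slow detection, the sketch below fits a least-squares line to fleet-wide (throughput, latency) samples and flags drives whose latency sits far above the prediction. Perseus's actual model and thresholds differ; the `ratio` cutoff and sample data here are invented.

```python
# Minimal flavor of regression-based fail-slow detection: fit a straight
# line to (throughput, latency) samples across a fleet, then flag drives
# whose mean latency sits far above the fitted prediction. Perseus uses a
# more careful model; the 2x cutoff here is illustrative only.

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def flag_fail_slow(samples, ratio=2.0):
    """samples: {drive: [(throughput, latency), ...]}; return slow drives."""
    xs = [x for pts in samples.values() for x, _ in pts]
    ys = [y for pts in samples.values() for _, y in pts]
    a, b = fit_line(xs, ys)
    slow = []
    for drive, pts in samples.items():
        # A drive is suspect if its mean latency is ratio x the prediction.
        pred = sum(a * x + b for x, _ in pts) / len(pts)
        mean = sum(y for _, y in pts) / len(pts)
        if mean > ratio * pred:
            slow.append(drive)
    return slow

fleet = {"d0": [(100, 1.0), (200, 1.9)],
         "d1": [(120, 1.2), (180, 1.7)],
         "bad": [(100, 5.0), (200, 9.0)]}
print(flag_fail_slow(fleet))   # ['bad']
```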
12:05 pm–2:00 pm
Conference Luncheon
Mezzanine East/West
2:00 pm–3:00 pm
Key-Value Stores
Session Chair: Ethan Miller, University of California, Santa Cruz, and Pure Storage
ADOC: Automatically Harmonizing Dataflow Between Components in Log-Structured Key-Value Stores for Improved Performance
Jinghuan Yu, City University of Hong Kong; Sam H. Noh, UNIST & Virginia Tech; Young-ri Choi, UNIST; Chun Jason Xue, City University of Hong Kong
Log-Structured Merge-tree (LSM) based Key-Value (KV) systems are widely deployed. A widely acknowledged problem with LSM-KVs is write stalls, which refer to sudden performance drops under heavy write pressure. Prior studies have attributed write stalls to a particular cause, such as a resource shortage or a scheduling issue. In this paper, we conduct a systematic study of the causes of write stalls by evaluating RocksDB with a variety of storage devices, and show that conclusions focusing on individual aspects, though valid, are not generally applicable. Through a thorough review and further experiments with RocksDB, we show that data overflow, the rapid expansion of one or more components in an LSM-KV system due to a surge in data flow into one of the components, explains the formation of write stalls. We contend that by balancing and harmonizing data flow among components, we can reduce data overflow and thus write stalls. As evidence, we propose a tuning framework called ADOC (Automatic Data Overflow Control) that automatically adjusts the system configurations, specifically the number of threads and the batch size, to minimize data overflow in RocksDB. Our extensive experimental evaluations with RocksDB show that ADOC reduces the duration of write stalls by as much as 87.9% and improves performance by as much as 322.8% compared with the auto-tuned RocksDB. Compared to the manually optimized state-of-the-art SILK, ADOC achieves up to 66% higher throughput for the synthetic write-intensive workload that we used, while achieving comparable performance for the real-world YCSB workloads; however, SILK uses over 20% more DRAM on average.
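The knob-tuning idea can be pictured as a small feedback loop over overflow signals. The rules, thresholds, and bounds below are hypothetical, not ADOC's controller; they only illustrate adjusting the thread count and batch size in response to overflow indicators such as pending flush bytes and L0 file counts.

```python
# Toy feedback loop in the spirit of ADOC: each control interval we look
# for overflow signals and nudge two knobs, the number of background
# threads and the write batch size. The adjustment rules and bounds are
# hypothetical; ADOC's actual controller is more involved.

def tune(stats, threads, batch_kb, max_threads=16, max_batch_kb=1024):
    """One control step: return the adjusted (threads, batch_kb)."""
    if stats["flush_pending_mb"] > 256:
        # MemTables are backing up: add flush/compaction parallelism.
        threads = min(threads * 2, max_threads)
    elif stats["l0_files"] > 20:
        # L0 is overflowing: larger batches amortize per-write overheads
        # and slow the inflow relative to compaction.
        batch_kb = min(batch_kb * 2, max_batch_kb)
    else:
        # No overflow: decay the knobs to reclaim resources.
        threads = max(threads - 1, 1)
        batch_kb = max(batch_kb // 2, 64)
    return threads, batch_kb

threads, batch_kb = 4, 128
for interval_stats in [{"flush_pending_mb": 300, "l0_files": 10},
                       {"flush_pending_mb": 100, "l0_files": 30},
                       {"flush_pending_mb": 50, "l0_files": 5}]:
    threads, batch_kb = tune(interval_stats, threads, batch_kb)
    print(threads, batch_kb)
```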
FUSEE: A Fully Memory-Disaggregated Key-Value Store
Jiacheng Shen, The Chinese University of Hong Kong; Pengfei Zuo, Huawei Cloud; Xuchuan Luo, Fudan University; Tianyi Yang, The Chinese University of Hong Kong; Yuxin Su, Sun Yat-sen University; Yangfan Zhou, Fudan University; Michael R. Lyu, The Chinese University of Hong Kong
Distributed in-memory key-value (KV) stores are embracing the disaggregated memory (DM) architecture for higher resource utilization. However, existing KV stores on DM employ a semi-disaggregated design that stores KV pairs on DM but manages metadata with monolithic metadata servers, hence still suffering from low resource efficiency on metadata servers. To address this issue, this paper proposes FUSEE, a FUlly memory-diSaggrEgated KV StorE that brings disaggregation to metadata management. FUSEE replicates metadata, i.e., the index and memory management information, on memory nodes, manages them directly on the client side, and handles complex failures under the DM architecture. To scalably replicate the index on clients, FUSEE proposes a client-centric replication protocol that allows clients to concurrently access and modify the replicated index. To efficiently manage disaggregated memory, FUSEE adopts a two-level memory management scheme that splits the memory management duty among clients and memory nodes. Finally, to handle the metadata corruption under client failures, FUSEE leverages an embedded operation log scheme to repair metadata with low log maintenance overhead. We evaluate FUSEE with both micro and YCSB hybrid benchmarks. The experimental results show that FUSEE outperforms the state-of-the-art KV stores on DM by up to 4.5 times with less resource consumption.
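The two-level memory management scheme can be sketched as a coarse-grained server-side chunk allocator paired with client-local sub-allocation; the chunk size, capacities, and bump-allocation policy below are illustrative assumptions, not FUSEE's actual layout.

```python
# Toy two-level memory management: the memory node hands out large chunks
# (coarse-grained, infrequent), and each client bump-allocates small KV
# objects inside its own chunk with purely local bookkeeping.

CHUNK = 1 << 20                          # 1 MiB chunks from memory nodes

class MemoryNode:
    def __init__(self, capacity):
        self.free_chunks = list(range(0, capacity, CHUNK))

    def alloc_chunk(self):
        return self.free_chunks.pop()    # the only memory-node-side step

class ClientAllocator:
    def __init__(self, node):
        self.node = node
        self.base = node.alloc_chunk()
        self.off = 0

    def alloc(self, nbytes):
        """Client-local allocation; no memory-node involvement."""
        if self.off + nbytes > CHUNK:
            self.base = self.node.alloc_chunk()
            self.off = 0
        addr = self.base + self.off
        self.off += nbytes
        return addr

node = MemoryNode(capacity=1 << 24)
client = ClientAllocator(node)
print(hex(client.alloc(64)), hex(client.alloc(128)))
```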
ROLEX: A Scalable RDMA-oriented Learned Key-Value Store for Disaggregated Memory Systems
Pengfei Li, Yu Hua, Pengfei Zuo, Zhangyu Chen, and Jiajie Sheng, Huazhong University of Science and Technology
Awarded Best Paper!
Disaggregated memory systems separate monolithic servers into different components, including compute and memory nodes, to enjoy the benefits of high resource utilization, flexible hardware scalability, and efficient data sharing. By exploiting high-performance RDMA (Remote Direct Memory Access), the compute nodes directly access the remote memory pool without involving remote CPUs. Hence, ordered key-value (KV) stores (e.g., B-trees and learned indexes) keep all data sorted to provide range query service via the high-performance network. However, existing ordered KV stores fail to work well on disaggregated memory systems, since they either consume multiple network roundtrips to search the remote data or rely heavily on memory nodes equipped with insufficient computing resources to process data modifications. In this paper, we propose a scalable RDMA-oriented KV store with learned indexes, called ROLEX, to bring ordered KV storage to disaggregated memory systems for efficient data storage and retrieval. ROLEX leverages a retraining-decoupled learned index scheme that dissociates model retraining from data modification operations by adding a bias and data-movement constraints to the learned models. Based on this decoupling, data modifications are executed directly by compute nodes via one-sided RDMA verbs with high scalability, while model retraining is removed from the critical path of data modification and executed asynchronously on memory nodes using dedicated computing resources. Our experimental results on YCSB and real-world workloads demonstrate that ROLEX achieves competitive performance on static workloads and improves performance on dynamic workloads by up to 2.2× over state-of-the-art schemes on disaggregated memory systems. We have released the source code on GitHub.
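A minimal sketch of the learned-index mechanics helps here: a linear model predicts a key's position in a sorted array, and a per-segment error bound limits how far a lookup must probe. ROLEX's bias and data-movement constraints keep that bound valid across inserts without synchronous retraining; the toy below simply retrains eagerly when the bound would break.

```python
import bisect

# Toy learned-index segment: a linear model predicts a key's slot in a
# sorted array; eps bounds the prediction error, so a lookup probes only
# [pred - eps, pred + eps]. ROLEX keeps eps valid across inserts so
# retraining can happen off the critical path; this sketch just retrains
# whenever the bound breaks.

class LearnedSegment:
    def __init__(self, keys):
        self.keys = sorted(keys)
        self._train()

    def _train(self):
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        self.slope = (n - 1) / (hi - lo) if hi > lo else 0.0
        self.eps = max(abs(self._predict(k) - i)
                       for i, k in enumerate(self.keys))

    def _predict(self, key):
        return round((key - self.keys[0]) * self.slope)

    def lookup(self, key):
        pos = self._predict(key)
        lo = max(0, pos - self.eps)
        hi = min(len(self.keys), pos + self.eps + 1)
        i = bisect.bisect_left(self.keys, key, lo, max(lo, hi))
        return i < len(self.keys) and self.keys[i] == key

    def insert(self, key):
        bisect.insort(self.keys, key)
        err = max(abs(self._predict(k) - i)
                  for i, k in enumerate(self.keys))
        if err > self.eps:
            self._train()   # stand-in for ROLEX's asynchronous retraining

seg = LearnedSegment([10, 20, 30, 40, 50])
seg.insert(35)
assert seg.lookup(35) and not seg.lookup(36)
```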
3:00 pm–3:30 pm
Break with Refreshments
Mezzanine East/West
3:30 pm–4:30 pm
AI and Storage
Session Chair: Avani Wildani, Emory University
GL-Cache: Group-level learning for efficient and high-performance caching
Juncheng Yang, Carnegie Mellon University; Ziming Mao, Yale University; Yao Yue, Pelikan Foundation; K. V. Rashmi, Carnegie Mellon University
Web applications rely heavily on software caches to achieve low-latency, high-throughput services. To adapt to changing workloads, three types of learned caches (learned evictions) have been designed in recent years: object-level learning, learning-from-distribution, and learning-from-simple-experts. However, we argue that the learning granularity in existing approaches is either too fine (object-level), incurring significant computation and storage overheads, or too coarse (workload- or expert-level) to capture the differences between objects, leaving a considerable efficiency gap.
In this work, we propose a new approach for learning in caches (group-level learning), which clusters similar objects into groups and performs learning and eviction at the group level. Learning at the group level accumulates more signals for learning, leverages more features with adaptive weights, and amortizes overheads over objects, thereby achieving both high efficiency and high throughput.
We designed and implemented GL-Cache on an open-source production cache to demonstrate group-level learning. Evaluations on 118 production block I/O and CDN cache traces show that GL-Cache has a higher hit ratio and higher throughput than state-of-the-art designs. Compared to LRB (object-level learning), GL-Cache improves throughput by 228× and hit ratio by 7% on average across cache sizes. For 10% of the traces (P90), GL-Cache provides a 25% hit ratio increase over LRB. Compared to the best of all learned caches, GL-Cache achieves a 64% higher throughput, a 3% higher hit ratio on average, and a 13% hit ratio increase at the P90.
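The group-level idea can be sketched as follows: objects inserted close together in time share a group, hits and bytes are accounted per group, and eviction drops the lowest-utility group wholesale. The utility formula here (hits per byte, decayed by age) is an invented stand-in for GL-Cache's learned model.

```python
import time
from collections import defaultdict

# Sketch of group-level eviction: objects inserted in the same time window
# form a group; accounting is per group, and eviction removes the whole
# lowest-utility group at once, amortizing decision costs over objects.

GROUP_WINDOW = 60.0            # seconds of insertions that share a group

class GroupCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.obj_group = {}                      # key -> group id
        self.groups = defaultdict(lambda: {"keys": set(), "bytes": 0,
                                           "hits": 0, "born": 0.0})

    def _utility(self, g, now):
        age = max(now - g["born"], 1.0)
        return (g["hits"] / g["bytes"]) / age    # assumed utility proxy

    def get(self, key):
        gid = self.obj_group.get(key)
        if gid is not None:
            self.groups[gid]["hits"] += 1
            return True
        return False

    def put(self, key, size, now=None):
        now = now or time.time()
        while self.used + size > self.capacity and self.groups:
            victim = min(self.groups,
                         key=lambda g: self._utility(self.groups[g], now))
            g = self.groups.pop(victim)
            self.used -= g["bytes"]
            for k in g["keys"]:
                del self.obj_group[k]
        gid = int(now // GROUP_WINDOW)
        g = self.groups[gid]
        if not g["keys"]:
            g["born"] = now
        g["keys"].add(key)
        g["bytes"] += size
        self.used += size
        self.obj_group[key] = gid
```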
SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
Redwan Ibne Seraj Khan and Ahmad Hossein Yazdani, Virginia Tech; Yuqi Fu, University of Virginia; Arnab K. Paul, BITS Pilani; Bo Ji and Xun Jian, Virginia Tech; Yue Cheng, University of Virginia; Ali R. Butt, Virginia Tech
Deep learning training (DLT) applications exhibit unique I/O workload behaviors that pose new challenges for storage system design. DLT is I/O intensive since data samples need to be fetched continuously from remote storage. Accelerators such as GPUs have been extensively used to support these applications. As accelerators become more powerful and more data-hungry, I/O performance lags behind. This creates a crucial performance bottleneck, especially in distributed DLT. At the same time, exponentially growing dataset sizes make it impossible to store these datasets entirely in memory. While today's DLT frameworks typically use a random sampling policy that treats all samples equally, recent findings indicate that not all samples are equally important: different data samples contribute differently towards improving the accuracy of a model. This observation creates an opportunity for DLT I/O optimizations that exploit the data locality enabled by importance sampling.
To this end, we design and implement SHADE, a new DLT-aware caching system that detects fine-grained importance variations at per-sample level and leverages the variance to make informed caching decisions for a distributed DLT job. SHADE adopts a novel, rank-based approach, which captures the relative importance of data samples across different minibatches. SHADE then dynamically updates the importance scores of all samples during training. With these techniques, SHADE manages to significantly improve the cache hit ratio of the DLT job, and thus, improves the job's training performance. Evaluation with representative computer vision (CV) models shows that SHADE, with a small cache, improves the cache hit ratio by up to 4.5 times compared to the LRU caching policy.
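A minimal sketch of importance-aware admission: each sample carries a score (in SHADE, derived from per-sample loss and rank within the minibatch), and the cache keeps the highest-scoring samples. The exponential-moving-average update below is an assumption for illustration, not SHADE's rank-based scheme.

```python
# Sketch of importance-aware sample caching: admit and retain the samples
# that currently matter most to training, evict the least important one
# only when a more important newcomer arrives.

class ImportanceCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.scores = {}       # sample id -> current importance
        self.data = {}         # sample id -> cached bytes

    def update_score(self, sid, loss, alpha=0.5):
        """Smooth per-sample importance; here an EMA over observed loss."""
        old = self.scores.get(sid, loss)
        self.scores[sid] = (1 - alpha) * old + alpha * loss

    def admit(self, sid, blob):
        if sid in self.data:
            return
        if len(self.data) >= self.capacity:
            victim = min(self.data, key=lambda s: self.scores.get(s, 0.0))
            # Only displace the victim if the newcomer matters more.
            if self.scores.get(sid, 0.0) <= self.scores.get(victim, 0.0):
                return
            del self.data[victim]
        self.data[sid] = blob

cache = ImportanceCache(capacity=2)
for sid, loss in [("a", 0.1), ("b", 0.9), ("c", 0.5)]:
    cache.update_score(sid, loss)
    cache.admit(sid, b"sample-bytes")
print(sorted(cache.data))      # ['b', 'c'] -- 'a' had the lowest loss
```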
Intelligent Resource Scheduling for Co-located Latency-critical Services: A Multi-Model Collaborative Learning Approach
Lei Liu, Beihang University; Xinglei Dou, ICT, CAS; Yuetao Chen, ICT, CAS; Sys-Inventor Lab
Latency-critical services have been widely deployed in cloud environments. For cost-efficiency, multiple services are usually co-located on a server. Thus, run-time resource scheduling becomes the pivot for QoS control in these complicated co-location cases. However, the scheduling exploration space enlarges rapidly with increasing server resources, making it hard for schedulers to provide ideal solutions quickly. More importantly, we observe that there are “resource cliffs” in the scheduling exploration space. They affect the exploration efficiency and always lead to severe QoS fluctuations in previous schedulers. To address these problems, we propose a novel ML-based intelligent scheduler, OSML. It learns the correlation between architectural hints (e.g., IPC, cache misses, memory footprint), scheduling solutions, and QoS demands based on a data set we collected from 11 widely deployed services running on off-the-shelf servers. OSML employs multiple ML models that work collaboratively to predict QoS variations, shepherd the scheduling, and recover from QoS violations in complicated co-location cases. OSML can intelligently avoid resource cliffs during scheduling and reach an optimal solution much faster than previous approaches for co-located LC services. Experimental results show that OSML supports higher loads and meets QoS targets with lower scheduling overheads and shorter convergence time than previous studies.
5:30 pm–7:00 pm
FAST '23 Poster Session and Reception
Mezzanine East/West
Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Take part in discussions with your colleagues over complimentary food and drinks. View the complete list of accepted posters.
Wednesday, February 22, 2023
8:00 am–9:00 am
Continental Breakfast
Mezzanine East/West
9:00 am–10:40 am
File Systems
Session Chair: Peter Desnoyers, Northeastern University and Red Hat
CJFS: Concurrent Journaling for Better Scalability
Joontaek Oh, Seung Won Yoo, and Hojin Nam, KAIST; Changwoo Min, Virginia Tech; Youjip Won, KAIST
In this paper, we propose CJFS, the Concurrent Journaling Filesystem. CJFS extends the EXT4 filesystem and addresses the fundamental limitations of the EXT4 journaling design, which are the main cause of the poor scalability of the EXT4 filesystem. The heavy-weight EXT4 journal suffers from two limitations. First, the journal commit is a strictly serial activity. Second, the journal commit uses the original page cache entry, not a copy of it, so any access to an in-flight page cache entry is blocked. To address these limitations, we propose four techniques: Dual Thread Journaling, Multi-version Shadow Paging, Opportunistic Coalescing, and Compound Flush. With the Dual Thread design, CJFS can commit a transaction before the preceding journal commit finishes. With Multi-version Shadow Paging, CJFS is free from transaction conflicts even when multiple committing transactions exist. With Opportunistic Coalescing, CJFS mitigates the transaction lock-up overhead in journal commit and thus increases the coalescing degree (i.e., the number of system calls associated with a single transaction) of a running transaction. With Compound Flush, CJFS minimizes the number of flush calls. CJFS improves throughput over EXT4 by 81%, 68%, and 125% in filebench varmail, dbench, and OLTP-Insert on MySQL, respectively, by removing the transaction conflict and lock-up overheads.
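Two of the techniques can be loosely modeled in a few lines: committing a copy (snapshot) of dirty pages so writers are never blocked on an in-flight commit, and running commits on more than one thread. The sketch below omits flush ordering, checksums, and all real durability machinery.

```python
import threading, queue

# Loose model of two CJFS ideas: (1) Multi-version Shadow Paging, reduced
# to "commit a private copy of dirty pages so the live page cache is
# never blocked", and (2) Dual Thread Journaling, reduced to "more than
# one commit may be in flight at once".

page_cache = {}                      # block number -> bytes
commit_q = queue.Queue()

def journal_write(txid, snapshot):
    pass                             # stand-in for the actual journal I/O

def commit_worker():
    while True:
        txid, snapshot = commit_q.get()
        if snapshot is None:         # shutdown sentinel
            return
        journal_write(txid, snapshot)
        commit_q.task_done()

def commit(txid, dirty_blocks):
    # Snapshot now; persist asynchronously. Writers may overwrite the
    # live pages immediately without corrupting the in-flight commit.
    snapshot = {b: bytes(page_cache[b]) for b in dirty_blocks}
    commit_q.put((txid, snapshot))

workers = [threading.Thread(target=commit_worker) for _ in range(2)]
for w in workers:
    w.start()
page_cache[7] = b"v1"
commit(1, [7])
page_cache[7] = b"v2"                # not blocked by the in-flight commit
commit(2, [7])
commit_q.join()
for _ in workers:
    commit_q.put((None, None))
```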
Unsafe at Any Copy: Name Collisions from Mixing Case Sensitivities
Aditya Basu and John Sampson, The Pennsylvania State University; Zhiyun Qian, University of California, Riverside; Trent Jaeger, The Pennsylvania State University
File name confusion attacks, such as malicious symlinks and file squatting, have long been studied as sources of security vulnerabilities. However, a recently emerged type, i.e., case-sensitivity-induced name collisions, has not been scrutinized. These collisions are introduced by differences in name resolution under case-sensitive and case-insensitive file systems or directories. A prominent example is the recent Git vulnerability (CVE-2021-21300) which can lead to code execution on a victim client when it clones a maliciously crafted repository onto a case-insensitive file system. With trends including ext4 adding support for per-directory case-insensitivity and the broad deployment of the Windows Subsystem for Linux, the prerequisites for such vulnerabilities are increasingly likely to exist even in a single system.
In this paper, we make a first effort to investigate how and where the lack of any uniform approach to handling name collisions leads to a diffusion of responsibility and resultant vulnerabilities. Interestingly, we demonstrate the existence of a range of novel security challenges arising from name collisions and their inconsistent handling by low-level utilities and applications. Specifically, our experiments show that utilities handle many name collision scenarios unsafely, leaving the responsibility to applications whose developers are unfortunately not yet aware of the threats. We examine three case studies as a first step towards systematically understanding the emerging type of name collision vulnerability.
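A small checker conveys the class of problem studied. The sketch below walks a directory tree and reports names that are distinct under case-sensitive resolution but collide after case folding (Python's casefold approximates, but does not exactly match, any particular file system's folding rules).

```python
import os
from collections import defaultdict

# Find names that are distinct on a case-sensitive file system but would
# collide after case folding, e.g., when the tree is copied to a
# case-insensitive file system or a case-folded directory.

def find_case_collisions(root):
    """Return {(dir, folded_name): [original names...]} with collisions."""
    collisions = {}
    for dirpath, dirnames, filenames in os.walk(root):
        folded = defaultdict(list)
        for name in dirnames + filenames:
            folded[name.casefold()].append(name)
        for fname, names in folded.items():
            if len(names) > 1:
                collisions[(dirpath, fname)] = names
    return collisions

# e.g., a cloned repository containing both 'Makefile' and 'makefile'
for (d, _), names in find_case_collisions(".").items():
    print(f"collision in {d}: {names}")
```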
ConfD: Analyzing Configuration Dependencies of File Systems for Fun and Profit
Tabassum Mahmud, Om Rameshwar Gatla, Duo Zhang, Carson Love, Ryan Bumann, and Mai Zheng, Iowa State University
File systems play an essential role in modern society for managing precious data. To meet diverse needs, they often support many configuration parameters. Such flexibility comes at the price of additional complexity which can lead to subtle configuration-related issues. To address this challenge, we study the configuration-related issues of two major file systems (i.e., Ext4 and XFS) in depth, and identify a prevalent pattern called multilevel configuration dependencies. Based on the study, we build an extensible tool called ConfD to extract the dependencies automatically, and create six plugins to address different configuration-related issues. Our experiments on Ext4 and XFS show that ConfD can extract more than 150 configuration dependencies for the file systems with a low false positive rate. Moreover, the dependency-guided plugins can identify various configuration issues (e.g., mishandling of configurations, regression test failures induced by valid configurations).
HadaFS: A File System Bridging the Local and Shared Burst Buffer for Exascale Supercomputers
Xiaobin He, National Supercomputing Center in Wuxi; Bin Yang, Tsinghua University, Department of Computer Science, and National Supercomputing Center in Wuxi; Jie Gao and Wei Xiao, National Supercomputing Center in Wuxi; Qi Chen, Tsinghua University, Department of Computer Science; Shupeng Shi and Dexun Chen, National Supercomputing Center in Wuxi; Weiguo Liu, Shandong University; Wei Xue, Tsinghua University, Department of Computer Science, BNRist, and National Supercomputing Center in Wuxi; Zuo-ning Chen, Chinese Academy of Engineering
Current supercomputers introduce SSDs to form a Burst Buffer (BB) layer to meet HPC applications' growing I/O requirements. BBs can be divided into two types by deployment location. One is the local BB, known for its scalability and performance. The other is the shared BB, which has the advantages of data sharing and lower deployment cost. How to unify the advantages of the local BB and the shared BB is a key issue in the HPC community.
We propose a novel BB file system named HadaFS that brings the advantages of local BB deployments to shared BB deployments. First, HadaFS offers a new Localized Triage Architecture (LTA) to solve the problem of ultra-scale expansion and data sharing. Then, HadaFS proposes a full-path indexing approach with three metadata synchronization strategies to solve the problems of complex metadata management in traditional file systems and the mismatch with application I/O behaviors. Moreover, HadaFS integrates a data management tool named Hadash, which supports efficient data query in the BB and accelerates data migration between the BB and traditional HPC storage. HadaFS has been deployed on the Sunway New-generation Supercomputer (SNS), serving hundreds of applications and scaling to a maximum of 600,000 clients.
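Full-path indexing can be illustrated in a few lines: metadata is keyed by a hash of the complete path in a flat table, so a lookup is a single probe regardless of path depth, at the cost of making hierarchy operations such as directory renames more expensive. The schema below is illustrative, not HadaFS's on-disk format.

```python
import hashlib

# Sketch of full-path indexing: instead of resolving a path component by
# component through directory inodes, metadata is keyed directly by a
# hash of the full path in a flat table.

metadata = {}     # path-hash -> metadata record

def path_key(path):
    return hashlib.sha1(path.encode()).hexdigest()

def create(path, **attrs):
    metadata[path_key(path)] = {"path": path, **attrs}

def stat(path):
    return metadata.get(path_key(path))

create("/job42/rank0/ckpt.0001", size=0, stripe=4)
print(stat("/job42/rank0/ckpt.0001"))   # one probe, no directory walk
```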
Fisc: A Large-scale Cloud-native-oriented File System
Qiang Li, Alibaba Group; Lulu Chen, Fudan University and Alibaba Group; Xiaoliang Wang, Nanjing University; Shuo Huang, Alibaba Group; Qiao Xiang, Xiamen University; Yuanyuan Dong, Wenhui Yao, Minfei Huang, Puyuan Yang, Shanyang Liu, Zhaosheng Zhu, Huayong Wang, Haonan Qiu, Derui Liu, Shaozong Liu, Yujie Zhou, Yaohui Wu, Zhiwu Wu, Shang Gao, Chao Han, Zicheng Luo, Yuchao Shao, Gexiao Tian, Zhongjie Wu, Zheng Cao, and Jinbo Wu, Alibaba Group; Jiwu Shu, Xiamen University; Jie Wu, Fudan University; Jiesheng Wu, Alibaba Group
The wide adoption of Cloud Native shifts the boundary between cloud users and CSPs (Cloud Service Providers) from VM-based infrastructure to container-based applications. However, traditional file systems face challenges. First, traditional file system clients (e.g., Tectonic, Colossus, HDFS) are sophisticated and compete for scarce resources in application containers. Second, it is challenging for CSPs to pass I/O from containers to storage clusters while guaranteeing security, availability, and performance.
To provide file system service for cloud-native applications, we design Fisc, a cloud-native-oriented file system. Fisc introduces four key designs: 1) a lightweight file system client in the container, 2) a DPU-based virtio-Fisc device to implement hardware offloading, 3) a storage-aware mechanism that routes I/O to storage nodes to improve I/O availability and enable local reads, and 4) a full-path QoS mechanism to guarantee the QoS of hybrid-deployed applications. Fisc has been deployed in production for over three years. It now serves cloud-native applications running on over 3 million cores. Results show that the Fisc client consumes only 80% of the CPU resources of a traditional file system client. In the production environment, the online search task's latency is less than 500 μs when accessing the remote storage cluster.
10:40 am–11:10 am
Break with Refreshments
Mezzanine East/West
11:10 am–12:10 pm
Persistent Memory Systems
Session Chair: Peter Macko, MongoDB
TENET: Memory Safe and Fault Tolerant Persistent Transactional Memory
R. Madhava Krishnan, Virginia Tech; Diyu Zhou, EPFL; Wook-Hee Kim, Konkuk University; Sudarsun Kannan, Rutgers University; Sanidhya Kashyap, EPFL; Changwoo Min, Virginia Tech
Byte-addressable Non-Volatile Memory (NVM) allows programs to directly access storage through a memory interface, without going through the expensive conventional storage stack. However, direct access to NVM makes the NVM data vulnerable to software memory bugs (memory safety) and hardware errors (fault tolerance). This issue is critical because, unlike DRAM, corrupted data can persist forever, even after a system restart. Despite the plethora of research on NVM programs and systems, little attention has been paid to protecting NVM data from software bugs and hardware errors. In this paper, we propose TENET, a new NVM programming framework that guarantees memory safety and fault tolerance to protect NVM data against software bugs and hardware errors. TENET provides the most popular Persistent Transactional Memory (PTM) programming model. TENET leverages the concurrency and commit-time guarantees of a PTM to provide performant and cost-efficient memory safety and fault tolerance. Our evaluation shows that TENET protects NVM data at a modest performance overhead and storage cost compared to other PTMs with partial or no memory safety and fault tolerance support.
MadFS: Per-File Virtualization for Userspace Persistent Memory Filesystems
Shawn Zhong, Chenhao Ye, Guanzhou Hu, Suyan Qu, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Michael Swift, University of Wisconsin–Madison
Persistent memory (PM) can be accessed directly from userspace without kernel involvement, but most PM filesystems still perform metadata operations in the kernel for security and rely on the kernel for cross-process synchronization.
We present per-file virtualization, where a virtualization layer implements a complete set of file functionalities, including metadata management, crash consistency, and concurrency control, in userspace. We observe that not all file metadata need to be maintained by the kernel and propose embedding insensitive metadata into the file for userspace management. For crash consistency, copy-on-write (CoW) benefits from the embedding of the block mapping since the mapping can be efficiently updated without kernel involvement. For cross-process synchronization, we introduce lock-free optimistic concurrency control (OCC) at user level, which tolerates process crashes and provides better scalability.
Based on per-file virtualization, we implement MadFS, a library PM filesystem that maintains the embedded metadata as a compact log. Experimental results show that on concurrent workloads, MadFS achieves up to 3.6× the throughput of ext4-DAX. For real-world applications, MadFS provides up to 48% speedup for YCSB on LevelDB and 85% for TPC-C on SQLite compared to NOVA.
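The lock-free OCC idea can be sketched with a single atomically published log tail: writers prepare an entry privately, try to claim the next slot, and retry on conflict, so a crashed client never strands a lock for other processes. The model below uses a Python lock to stand in for a hardware compare-and-swap on a persistent-memory log; it is not MadFS's implementation.

```python
import threading

# Sketch of lock-free optimistic concurrency control: prepare privately,
# publish with a compare-and-swap on the log tail, retry on conflict.
# No locks are held across an operation, so a process that crashes
# mid-operation cannot block others.

class OptimisticLog:
    def __init__(self):
        self.entries = []                   # committed log entries
        self._cas_guard = threading.Lock()  # models one CAS on the tail

    def try_append(self, expected_tail, entry):
        """Model of compare-and-swap: append only if the tail is unchanged."""
        with self._cas_guard:
            if len(self.entries) != expected_tail:
                return False                # lost the race; caller retries
            self.entries.append(entry)
            return True

    def commit(self, make_entry):
        while True:
            tail = len(self.entries)
            entry = make_entry(tail)        # revalidate against current state
            if self.try_append(tail, entry):
                return tail

log = OptimisticLog()
threads = [threading.Thread(
               target=lambda i=i: log.commit(lambda tail: f"op{i}@{tail}"))
           for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(len(log.entries))                     # 8: every commit landed once
```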
On Stacking a Persistent Memory File System on Legacy File Systems
Hobin Woo, Samsung Electronics; Daegyu Han, Sungkyunkwan University; Seungjoon Ha, Samsung Electronics; Sam H. Noh, UNIST & Virginia Tech; Beomseok Nam, Sungkyunkwan University
In this work, we design and implement a Stackable Persistent memory File System (SPFS), which serves NVMM as a persistent writeback cache to NVMM-oblivious filesystems. SPFS can be stacked on a disk-optimized file system to improve I/O performance by absorbing frequent order-preserving small synchronous writes in NVMM while also exploiting the VFS cache of the underlying disk-optimized file system for non-synchronous writes. A stackable file system must be lightweight in that it manages only NVMM and not the disk or VFS cache. Therefore, SPFS manages all file system metadata including extents using simple but highly efficient dynamic hash tables. To manage extents using hash tables, we design a novel Extent Hashing algorithm that exhibits fast insertion as well as fast scan performance. Our performance study shows that SPFS effectively improves I/O performance of the lower file system by up to 9.9×.
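Hash-based extent management can be pictured as follows: with aligned, fixed-size extents, the table key for any file offset is computable in O(1), so lookups avoid tree traversal and rebalancing. SPFS's Extent Hashing is more refined (it handles variable extent sizes); the 2 MB extents here are an assumption for illustration.

```python
# Sketch of managing extents with a hash table instead of a tree: each
# extent covers a fixed-size aligned range of a file, so the key for any
# byte offset is computable directly and inserts need no rebalancing.

EXTENT_BYTES = 2 * 1024 * 1024

extent_table = {}      # (inode, extent index) -> NVMM block address

def insert_extent(inode, offset, nvmm_addr):
    extent_table[(inode, offset // EXTENT_BYTES)] = nvmm_addr

def lookup(inode, offset):
    """Translate a file offset to an NVMM address, or None if not cached."""
    base = extent_table.get((inode, offset // EXTENT_BYTES))
    if base is None:
        return None
    return base + offset % EXTENT_BYTES

insert_extent(inode=7, offset=0, nvmm_addr=0x10000000)
print(hex(lookup(7, 4096)))    # 0x10001000: hit within the first extent
print(lookup(7, EXTENT_BYTES)) # None: second extent not yet written
```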
12:10 pm–2:00 pm
Conference Luncheon
Mezzanine East/West
2:00 pm–3:00 pm
Remote Memory
Session Chair: Hyungon Moon, UNIST (Ulsan National Institute of Science and Technology)
Citron: Distributed Range Lock Management with One-sided RDMA
Jian Gao, Youyou Lu, Minhui Xie, Qing Wang, and Jiwu Shu, Tsinghua University
Range locks enable concurrent accesses to disjoint parts of shared storage. However, existing range lock managers rely on centralized CPU resources to process lock requests, which results in a server-side CPU bottleneck and suboptimal performance in distributed scenarios.
We propose Citron, an RDMA-enabled distributed range lock manager that bypasses server-side CPUs by using only one-sided RDMA in the range lock acquisition and release paths. Citron manages range locks with a static data structure called a segment tree, which effectively accommodates dynamically located and sized ranges while requiring limited and nearly constant synchronization costs from clients. Citron can also scale itself up in microseconds to adapt to shared storage that grows at runtime. Evaluation shows that under various workloads, Citron delivers up to 3.05× the throughput and 76.4% lower tail latency than CPU-based approaches.
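The segment-tree mechanism is worth sketching: locking a range marks its O(log n) canonical nodes, and a conflicting request is detected either at a locked node on the way down or via a descendant counter at a canonical node. Citron manipulates such nodes remotely with one-sided RDMA verbs; the purely local Python model below shows only the data-structure logic.

```python
# Local model of range locking on a static segment tree over [0, size):
# try_lock(lo, hi) marks the O(log n) canonical nodes covering the range.
# A conflict shows up either as a locked node on the way down (an existing
# lock covering part of the request) or as a nonzero descendant counter at
# a canonical node (an existing lock strictly inside the request).

class RangeLockTree:
    def __init__(self, size):
        self.size = size          # assumed to be a power of two
        self.locked = {}          # node id -> True for canonical lock nodes
        self.inner = {}           # node id -> #locked nodes in its subtree

    def _acquire(self, node, nlo, nhi, lo, hi, acquired):
        if hi <= nlo or nhi <= lo:
            return True                       # disjoint: nothing to do
        if self.locked.get(node):
            return False                      # an existing lock covers us
        if lo <= nlo and nhi <= hi:           # canonical node for [lo, hi)
            if self.inner.get(node, 0):
                return False                  # an existing lock lies inside
            self.locked[node] = True
            acquired.append(node)
            return True
        mid = (nlo + nhi) // 2
        return (self._acquire(2 * node, nlo, mid, lo, hi, acquired) and
                self._acquire(2 * node + 1, mid, nhi, lo, hi, acquired))

    def try_lock(self, lo, hi):
        acquired = []
        if self._acquire(1, 0, self.size, lo, hi, acquired):
            for node in acquired:             # bump ancestor counters
                n = node // 2
                while n:
                    self.inner[n] = self.inner.get(n, 0) + 1
                    n //= 2
            return acquired                   # opaque lock handle
        for node in acquired:                 # roll back a partial acquire
            del self.locked[node]
        return None

    def unlock(self, handle):
        for node in handle:
            del self.locked[node]
            n = node // 2
            while n:
                self.inner[n] -= 1
                n //= 2

tree = RangeLockTree(1 << 20)
h1 = tree.try_lock(0, 4096)
assert tree.try_lock(2048, 8192) is None      # overlap: denied
h2 = tree.try_lock(4096, 8192)                # disjoint: granted
tree.unlock(h1)
assert tree.try_lock(0, 2048) is not None     # freed range: granted again
```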
Patronus: High-Performance and Protective Remote Memory
Bin Yan, Youyou Lu, Qing Wang, Minhui Xie, and Jiwu Shu, Tsinghua University
RDMA-enabled remote memory (RM) systems are gaining popularity with improved memory utilization and elasticity. However, since it is commonly believed that fine-grained RDMA permission management is impractical, existing RM systems forgo memory protection, an indispensable property in real-world deployments. In this paper, we propose PATRONUS, an RM system that simultaneously offers protection and high performance. PATRONUS introduces a fast permission management mechanism by exploiting advanced RDMA hardware features with a set of elaborate software techniques. Moreover, to retain high performance in exception scenarios (e.g., client failures, illegal access), PATRONUS attaches microsecond-scale leases to permissions and reserves spare RDMA resources for fast recovery. We evaluate PATRONUS over two one-sided data structures and two function-as-a-service (FaaS) applications. The experiments show that protection brings only 2.4% to 27.7% overhead across all workloads, and our system performs up to 5.2× better than the best competitor.
More Than Capacity: Performance-oriented Evolution of Pangu in Alibaba
Qiang Li, Alibaba Group; Qiao Xiang, Xiamen University; Yuxin Wang, Haohao Song, and Ridi Wen, Xiamen University and Alibaba Group; Wenhui Yao, Yuanyuan Dong, Shuqi Zhao, Shuo Huang, Zhaosheng Zhu, Huayong Wang, Shanyang Liu, Lulu Chen, Zhiwu Wu, Haonan Qiu, Derui Liu, Gexiao Tian, Chao Han, Shaozong Liu, Yaohui Wu, Zicheng Luo, Yuchao Shao, Junping Wu, Zheng Cao, Zhongjie Wu, Jiaji Zhu, and Jinbo Wu, Alibaba Group; Jiwu Shu, Xiamen University; Jiesheng Wu, Alibaba Group
This paper presents how the Pangu storage system continuously evolves with hardware technologies and the business model to provide high-performance, reliable storage services with 100-microsecond-level I/O latency. Pangu's evolution includes two phases. In the first phase, Pangu embraces the emergence of solid-state drive (SSD) storage and remote direct memory access (RDMA) network technologies by innovating its file system and designing a user-space storage operating system to substantially reduce I/O latency while providing high throughput and IOPS. In the second phase, Pangu evolves from a volume-oriented storage provider to a performance-oriented one. To adapt to this change of business model, Pangu upgrades its infrastructure with storage servers of much larger SSD capacity and raises RDMA bandwidth from 25 Gbps to 100 Gbps. It introduces a series of key designs, including traffic amplification reduction, remote direct cache access, and CPU computation offloading, to ensure Pangu fully harvests the performance improvement brought by the hardware upgrade. Beyond these technical innovations, we also share our operating experiences during Pangu's evolution and discuss important lessons learned from them.
3:00 pm–3:30 pm
Break with Refreshments
Mezzanine East/West
3:30 pm–3:45 pm
FAST '23 Test of Time Award Presentation
A Study of Practical Deduplication
Dutch T. Meyer and William J. Bolosky
Published in the Proceedings of the 9th USENIX Conference on File and Storage Technologies, February 2011
3:45 pm–5:00 pm
Work-in-Progress Reports (WiPs)
View the list of accepted Work-in-Progress Reports.
5:00 pm–6:00 pm
Happy Hour
Mezzanine East/West
Thursday, February 23, 2023
8:00 am–9:00 am
Continental Breakfast
Mezzanine East/West
9:00 am–10:20 am
IO Stacks
Session Chair: Carl Waldspurger, Carl Waldspurger Consulting
λ-IO: A Unified IO Stack for Computational Storage
Zhe Yang, Youyou Lu, Xiaojian Liao, Youmin Chen, Junru Li, Siyu He, and Jiwu Shu, Tsinghua University
The emerging computational storage device offers an opportunity for in-storage computing. It alleviates the overhead of data movement between the host and the device, and thus accelerates data-intensive applications. In this paper, we present λ-IO, a unified IO stack managing both computation and storage resources across the host and the device. We propose a set of designs, spanning the interface, runtime, and scheduling, to tackle three critical issues. We implement λ-IO in a full-stack software and hardware environment and evaluate it with synthetic and real applications against Linux IO, showing up to a 5.12× performance improvement.
Revitalizing the Forgotten On-Chip DMA to Expedite Data Movement in NVM-based Storage Systems
Jingbo Su, Jiahao Li, and Luofan Chen, University of Science and Technology of China; Cheng Li, University of Science and Technology of China and Anhui Province Key Laboratory of High Performance Computing; Kai Zhang and Liang Yang, SmartX; Sam H. Noh, UNIST & Virginia Tech; Yinlong Xu, University of Science and Technology of China and Anhui Province Key Laboratory of High Performance Computing
Data-intensive applications executing on NVM-based storage systems experience serious bottlenecks when moving data between DRAM and NVM. We advocate for the use of the long-existing but recently neglected on-chip DMA to expedite data movement, with three contributions. First, we explore new latency-oriented optimization directions, driven by a comprehensive DMA study, to design a high-performance DMA module that significantly lowers the I/O size threshold at which benefits are observed. Second, we propose a new data movement engine, Fastmove, that coordinates the use of the DMA along with the CPU through judicious scheduling and load splitting, so that the DMA's limitations are compensated for and the overall gains are maximized. Finally, with a general kernel-based design, simple APIs, and DAX file system integration, Fastmove lets applications transparently exploit the DMA and its new features without code changes. We run three data-intensive applications, MySQL, GraphWalker, and Filebench, atop NOVA, ext4-DAX, and XFS-DAX, with standard benchmarks like TPC-C and popular graph algorithms like PageRank. Across single- and multi-socket settings, compared to conventional CPU-only NVM accesses, Fastmove delivers 1.13-2.16× peak-throughput speedups for TPC-C on MySQL, reduces average latency by 17.7-60.8%, and saves 37.1-68.9% of the CPU usage spent on data movement. It also shortens the execution time of graph algorithms in GraphWalker by 39.7-53.4% and delivers 1.12-1.27× throughput speedups for Filebench.
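The load-splitting idea reduces to a small planning function: small copies stay on the CPU because DMA setup costs dominate, while large copies are split so both engines finish together. The threshold and bandwidth numbers below are made-up parameters, not measurements from the paper.

```python
# Toy version of the load-splitting idea: below a size threshold, the
# per-operation DMA setup cost dominates and the CPU wins; above it, the
# copy is split proportionally to bandwidth so the CPU and the DMA engine
# complete at roughly the same time.

DMA_THRESHOLD = 64 * 1024     # below this, DMA setup cost dominates
CPU_GBPS, DMA_GBPS = 8.0, 24.0

def plan_copy(nbytes):
    """Return (cpu_bytes, dma_bytes) for one data-movement request."""
    if nbytes < DMA_THRESHOLD:
        return nbytes, 0
    cpu_share = CPU_GBPS / (CPU_GBPS + DMA_GBPS)
    cpu_bytes = int(nbytes * cpu_share)
    return cpu_bytes, nbytes - cpu_bytes

for size in (4 * 1024, 1 * 1024 * 1024):
    print(size, plan_copy(size))
```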
NVMeVirt: A Versatile Software-defined Virtual NVMe Device
Sang-Hoon Kim, Ajou University; Jaehoon Shim, Euidong Lee, Seongyeop Jeong, Ilkueon Kang, and Jin-Soo Kim, Seoul National University
There have been drastic changes in the storage device landscape recently. At the center of the diverse storage landscape lies the NVMe interface, which allows high-performance and flexible communication models required by these next-generation device types. However, its hardware-oriented definition and specification are bottlenecking the development and evaluation cycle for new revolutionary storage devices.
In this paper, we present NVMeVirt, a novel approach to facilitating software-defined NVMe devices. A user can define any NVMe device type with custom features, and NVMeVirt bridges the gap between the host I/O stack and the virtual NVMe device in software. We demonstrate the advantages and features of NVMeVirt by realizing various storage types and configurations, such as conventional SSDs, low-latency high-bandwidth NVM SSDs, zoned namespace SSDs, and key-value SSDs, with support for PCI peer-to-peer DMA and NVMe-oF target offloading. We also make the case for storage research with NVMeVirt, such as studying the performance characteristics of database engines and extending the NVMe specification for improved key-value SSD performance.
SMRSTORE: A Storage Engine for Cloud Object Storage on HM-SMR Drives
Su Zhou, Erci Xu, Hao Wu, Yu Du, Jiacheng Cui, Wanyu Fu, Chang Liu, Yingni Wang, Wenbo Wang, Shouqu Sun, Xianfei Wang, Bo Feng, Biyun Zhu, Xin Tong, Weikang Kong, Linyan Liu, Zhongjie Wu, Jinbo Wu, Qingchao Luo, and Jiesheng Wu, Alibaba Group
Cloud object storage vendors are always in pursuit of better cost efficiency. Emerging Shingled Magnetic Recording (SMR) drives are becoming economically favorable in archival storage systems due to significantly improved areal density. However, for standard-class object storage, previous studies and our preliminary exploration revealed that the existing SMR drive solutions can experience severe throughput variations due to garbage collection (GC).
In this paper, we introduce SMRSTORE, an SMR-based storage engine for standard-class object storage that compromises neither performance nor durability. The key features of SMRSTORE include implementing chunk store interfaces directly over SMR drives, using a completely log-structured design, and applying guided data placement to reduce GC for consistent performance. The evaluation shows that SMRSTORE delivers performance comparable to Ext4 on Conventional Magnetic Recording (CMR) drives and can be up to 2.16× faster than F2FS on SMR drives. By switching to SMR drives, we have decreased the total cost by up to 15% while providing performance on par with the prior system. Currently, we have deployed SMRSTORE in the standard class of Alibaba Cloud Object Storage Service (OSS) to store hundreds of PBs of data. We plan to use SMR drives for all classes of OSS in the near future.
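Guided placement on a log-structured layout can be sketched briefly: data with similar expected lifetime is appended to the same zone, so zones tend to become wholly dead and GC moves little live data. The lifetime classes and zone size below are illustrative.

```python
# Sketch of guided placement on zones: group data by expected lifetime so
# that whole zones tend to die together and garbage collection relocates
# little or no live data.

ZONE_BLOCKS = 4

class ZonedLog:
    def __init__(self):
        self.zones = {}              # lifetime class -> open zone (list)
        self.sealed = []             # full zones awaiting GC / reset

    def append(self, lifetime_class, block):
        zone = self.zones.setdefault(lifetime_class, [])
        zone.append(block)
        if len(zone) == ZONE_BLOCKS:     # zone full: seal, open a new one
            self.sealed.append((lifetime_class, zone))
            self.zones[lifetime_class] = []

log = ZonedLog()
for i in range(6):
    log.append("short-lived", f"tmp{i}")
log.append("long-lived", "meta0")
print(len(log.sealed), "sealed zone(s)")   # 1, containing only tmp blocks
```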
10:20 am–10:50 am
Break with Refreshments
Mezzanine East/West
10:50 am–11:50 am
SSDs and Smartphones
Session Chair: Geoff Kuenning, Harvey Mudd College
Multi-view Feature-based SSD Failure Prediction: What, When, and Why
Yuqi Zhang and Wenwen Hao, Samsung R&D Institute China Xi'an, Samsung Electronics; Ben Niu and Kangkang Liu, Tencent; Shuyang Wang, Na Liu, and Xing He, Samsung R&D Institute China Xi'an, Samsung Electronics; Yongwong Gwon and Chankyu Koh, Samsung Electronics
Solid state drives (SSDs) play an important role in large-scale data centers. SSD failures affect the stability of storage systems and cause additional maintenance overhead. To predict and handle SSD failures in advance, this paper proposes a multi-view and multi-task random forest (MVTRF) scheme. MVTRF predicts SSD failures based on multi-view features extracted from both long-term and short-term monitoring data of SSDs. In particular, multi-task learning is adopted to simultaneously predict what type of failure will occur and when it will occur through the same model. We also extract the key decisions of MVTRF to analyze why a failure will occur. These failure details are useful for verifying and handling SSD failures. The proposed MVTRF is evaluated on large-scale real data from data centers. The experimental results show that MVTRF has higher failure prediction accuracy, improving precision by 46.1% and recall by 57.4% on average compared with existing schemes. The results also demonstrate the effectiveness of MVTRF on failure type and time prediction and failure cause identification, which helps improve the efficiency of failure handling.
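The multi-task angle maps naturally onto a multi-output random forest, which scikit-learn supports directly. In the sketch below, one forest jointly predicts a failure-type label and a time-to-failure bucket; the features and labels are invented for illustration and are not the paper's feature set.

```python
# Sketch of the multi-task idea with scikit-learn's multi-output random
# forest: one model jointly predicts the failure type and the
# time-to-failure bucket from features summarizing long- and short-term
# telemetry. Feature names and labels here are invented.

from sklearn.ensemble import RandomForestClassifier

# Each row: [long-term error trend, short-term latency spike count]
X = [[0.1, 2], [0.9, 14], [0.2, 1], [0.8, 20], [0.7, 9], [0.1, 0]]
# Two tasks per sample: failure type (0 = none, 1 = media, 2 = firmware)
# and time-to-failure bucket (0 = none, 1 = within 30d, 2 = within 7d).
y = [[0, 0], [1, 2], [0, 0], [2, 2], [1, 1], [0, 0]]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)                       # multi-output: one forest, two tasks
print(model.predict([[0.85, 16]]))    # e.g., [[1 2]]: media failure, <7d
```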
Fast Application Launch on Personal Computing/Communication Devices
Junhee Ryu, SK hynix; Dongeun Lee, Texas A&M University - Commerce; Kang G. Shin, University of Michigan; Kyungtae Kang, Hanyang University
We present Paralfetch, a novel prefetcher that speeds up app launches on personal computing/communication devices by: 1) accurately collecting launch-related disk read requests, 2) pre-scheduling these requests to improve I/O throughput during prefetching, and 3) overlapping app execution with disk prefetching to hide disk access time from app execution. We have implemented Paralfetch under Linux kernels on a desktop/laptop PC, a Raspberry Pi 3 board, and an Android smartphone. Tests with popular apps show that Paralfetch significantly reduces app launch times on flash-based drives, and outperforms GSoC Prefetch and FAST, which are representative app prefetchers available for Linux-based systems.
Integrated Host-SSD Mapping Table Management for Improving User Experience of Smartphones
Yoona Kim and Inhyuk Choi, Seoul National University; Juhyung Park, Jaeheon Lee, and Sungjin Lee, DGIST; Jihong Kim, Seoul National University
Host Performance Booster (HPB) was proposed to improve the performance of high-capacity mobile flash storage systems by utilizing unused host DRAM memory. In this paper, we investigate how HPB should be managed so that the user experience of smartphones can be enhanced by HPB-enabled high-performance mobile storage systems. From our empirical study on Android environments, we identified two requirements for an efficient HPB management scheme on smartphones. First, HPB should be managed in a foreground-app-centric manner so that user-perceived latency can be greatly reduced. Second, the capacity of the HPB memory should be dynamically adjusted so as not to degrade the user experience of the foreground app. As an efficient HPB management solution that meets the identified requirements, we propose an integrated host-SSD mapping table management scheme, HPBvalve, for smartphones. HPBvalve prioritizes the foreground app when managing mapping table entries in the HPB memory, and dynamically resizes the overall capacity of the HPB memory depending on the memory pressure status of the smartphone. Our experimental results using the prototype implementation demonstrate that HPBvalve improves UX-critical app launching time by up to 43% (250 ms) over the existing HPB management scheme, without negatively affecting memory pressure. Meanwhile, L2P mapping misses are reduced by up to 78%.
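A toy version of foreground-app-centric map caching: logical-to-physical (L2P) entries cached in host DRAM are evicted background-apps-first, and the cache can be resized when the system reports memory pressure. The policy details below are assumptions for illustration, not HPBvalve's algorithm.

```python
from collections import OrderedDict

# Sketch of foreground-app-centric mapping-table caching: entries owned by
# the foreground app are evicted last, and the whole cache can shrink when
# the device reports memory pressure.

class HPBCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # lpn -> (ppn, owner app)
        self.foreground = None

    def _evict_one(self):
        # Prefer a background app's entry; fall back to LRU order.
        for lpn, (_, owner) in self.entries.items():
            if owner != self.foreground:
                del self.entries[lpn]
                return
        self.entries.popitem(last=False)

    def insert(self, lpn, ppn, owner):
        while len(self.entries) >= self.capacity:
            self._evict_one()
        self.entries[lpn] = (ppn, owner)

    def resize(self, new_capacity):
        """Shrink under memory pressure, grow when memory is plentiful."""
        self.capacity = new_capacity
        while len(self.entries) > self.capacity:
            self._evict_one()

cache = HPBCache(capacity=2)
cache.foreground = "camera"
cache.insert(10, 1000, "camera")
cache.insert(11, 1001, "browser")
cache.insert(12, 1002, "camera")       # evicts the browser entry first
print(list(cache.entries))             # [10, 12]
```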
11:50 am–12:45 pm
Farewell Boxed Lunch
Mezzanine East/West