See the Program at a Glance for a quick overview of what's happening at NSDI '17.
All sessions will be held in the Grand Ballroom unless otherwise noted.
The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from the presentation page. Copyright to the individual works is retained by the author[s].
Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Table of Contents | Message from the Program Co-Chairs
Full Proceedings PDFs
NSDI '17 Full Proceedings (PDF)
NSDI '17 Proceedings Interior (PDF, best for mobile devices)
NSDI '17 Proceedings Errata Slip #1 (PDF)
NSDI '17 Proceedings Errata Slip #2 (PDF)
Full Proceedings ePub (for iPad and most eReaders)
NSDI '17 Full Proceedings (ePub)
Full Proceedings Mobi (for Kindle)
NSDI '17 Full Proceedings (Mobi)
Downloads for Registered Attendees
(Sign in to your USENIX account to download these files.)
Monday, March 27
7:30 am–8:45 am
Continental Breakfast
Grand Ballroom Foyer
8:45 am–9:00 am
Opening Remarks and Awards
Program Co-Chairs: Aditya Akella, University of Wisconsin–Madison, and Jon Howell, Google
9:00 am–10:40 am
Storage Systems
Session Chair: Wyatt Lloyd, University of Southern California
The Design, Implementation, and Deployment of a System to Transparently Compress Hundreds of Petabytes of Image Files for a File-Storage Service
Daniel Reiter Horn, Ken Elkabany, and Chris Lesniewski-Laas, Dropbox; Keith Winstein, Stanford University
Community Award Winner!
We report the design, implementation, and deployment of Lepton, a fault-tolerant system that losslessly compresses JPEG images to 77% of their original size on average. Lepton replaces the lowest layer of baseline JPEG compression—a Huffman code—with a parallelized arithmetic code, so that the exact bytes of the original JPEG file can be recovered quickly. Lepton matches the compression efficiency of the best prior work, while decoding more than nine times faster and in a streaming manner. Lepton has been released as open-source software and has been deployed for a year on the Dropbox file-storage backend. As of February 2017, it had compressed more than 203 PiB of user JPEG files, saving more than 46 PiB.
Decibel: Isolation and Sharing in Disaggregated Rack-Scale Storage
Mihir Nanavati, Jake Wires, and Andrew Warfield, Coho Data and University of British Columbia
The performance characteristics of modern non-volatile storage devices have led to storage becoming a shared, rack-scale resource. System designers have an imperative to rethink existing storage abstractions in light of this development. This paper proposes a lightweight storage abstraction that encapsulates full-system hardware resources and focuses only on isolating and sharing storage devices across multiple, remote tenants. In support of this, it prototypes a shared-nothing runtime capable of managing hardware and scheduling this abstraction in a manner that avoids performance interference and preserves the performance of the storage devices for tenants.
vCorfu: A Cloud-Scale Object Store on a Shared Log
Michael Wei, University of California, San Diego, and VMware Research Group; Amy Tai, Princeton University and VMware Research Group; Christopher J. Rossbach, The University of Texas at Austin and VMware Research Group; Ittai Abraham, VMware Research Group; Maithem Munshed, Medhavi Dhawan, and Jim Stabile, VMware; Udi Wieder and Scott Fritchie, VMware Research Group; Steven Swanson, University of California, San Diego; Michael J. Freedman, Princeton University; Dahlia Malkhi, VMware Research Group
This paper presents vCorfu, a strongly consistent cloud-scale object store built over a shared log. vCorfu augments the traditional replication scheme of a shared log to provide fast reads and leverages a new technique, composable state machine replication, to compose large state machines from smaller ones, enabling state machine replication to be used efficiently in huge data stores. We show that vCorfu outperforms Cassandra, a popular state-of-the-art NoSQL store, while providing strong consistency (opacity, read-own-writes), efficient transactions, and global snapshots at cloud scale.
Curator: Self-Managing Storage for Enterprise Clusters
Ignacio Cano, University of Washington; Srinivas Aiyar, Varun Arora, Manosiz Bhattacharyya, Akhilesh Chaganti, Chern Cheah, Brent Chun, Karan Gupta, and Vinayak Khot, Nutanix Inc.; Arvind Krishnamurthy, University of Washington
Modern cluster storage systems perform a variety of background tasks to improve the performance, availability, durability, and cost-efficiency of stored data. For example, cleaners compact fragmented data to generate long sequential runs, tiering services automatically migrate data between solid-state and hard disk drives based on usage, recovery mechanisms replicate data to improve availability and durability in the face of failures, cost saving techniques perform data transformations to reduce the storage costs, and so on.
In this work, we present Curator, a background MapReduce-style execution framework for cluster management tasks, in the context of a distributed storage system used in enterprise clusters. We describe Curator’s design and implementation, and evaluate its performance using a handful of relevant metrics. We further report experiences and lessons learned from its five-year construction period, as well as thousands of customer deployments. Finally, we propose a machine learning-based model to identify an efficient execution policy for Curator’s management tasks that can adapt to varying workload characteristics.
10:40 am–11:10 am
Break with Refreshments
Grand Ballroom Foyer
11:10 am–12:50 pm
Packet Processing
Session Chair: Anirudh Badam, Microsoft
Evaluating the Power of Flexible Packet Processing for Network Resource Allocation
Naveen Kr. Sharma, Antoine Kaufmann, and Thomas Anderson, University of Washington; Changhoon Kim, Barefoot Networks; Arvind Krishnamurthy, University of Washington; Jacob Nelson, Microsoft Research; Simon Peter, The University of Texas at Austin
Recent hardware switch architectures make it feasible to perform flexible packet processing inside the network. This allows operators to configure switches to parse and process custom packet headers using flexible match+action tables in order to exercise control over how packets are processed and routed. However, flexible switches have limited state, support limited types of operations, and limit per-packet computation in order to be able to operate at line rate.
Our work addresses these limitations with a set of general building blocks that mask them using approximation techniques, thereby enabling the implementation of realistic network protocols. In particular, we use these building blocks to tackle the network resource allocation problem within datacenters and realize approximate variants of congestion control and load balancing protocols, such as XCP, RCP, and CONGA, that require explicit support from the network. Our evaluations show that these approximations are accurate and that they do not exceed the hardware resource limits associated with these flexible switches. We demonstrate their feasibility by implementing RCP with the production Cavium CNX880xx switch. This implementation provides significantly faster and lower-variance flow completion times compared with TCP.
APUNet: Revitalizing GPU as Packet Processing Accelerator
Younghwan Go, Muhammad Asim Jamshed, YoungGyoun Moon, Changho Hwang, and KyoungSoo Park, Korea Advanced Institute of Science and Technology (KAIST)
Many research works have recently experimented with GPU to accelerate packet processing in network applications. Most works have shown that GPU brings a significant performance boost when it is compared to the CPU-only approach, thanks to its highly-parallel computation capacity and large memory bandwidth. However, a recent work argues that for many applications, the key enabler for high performance is the inherent feature of GPU that automatically hides memory access latency rather than its parallel computation power. It also claims that CPU can outperform or achieve a similar performance as GPU if its code is re-arranged to run concurrently with memory access, employing optimization techniques such as group prefetching and software pipelining.
In this paper, we revisit the claim of the work and see if it can be generalized to a large class of network applications. Our findings with eight popular algorithms widely used in network applications show that (a) there are many compute-bound algorithms that do benefit from the parallel computation capacity of GPU while CPU-based optimizations fail to help, and (b) the relative performance advantage of CPU over GPU in most applications is due to data transfer bottleneck in PCIe communication of discrete GPU rather than lack of capacity of GPU itself. To avoid the PCIe bottleneck, we suggest employing integrated GPU in recent APU platforms as a cost-effective packet processing accelerator. We address a number of practical issues in fully exploiting the capacity of APU and show that network applications based on APU achieve multi-10 Gbps performance for many compute/memory-intensive algorithms.
Stateless Network Functions: Breaking the Tight Coupling of State and Processing
Murad Kablan, Azzam Alsudais, and Eric Keller, University of Colorado Boulder; Franck Le, IBM Research
In this paper we present Stateless Network Functions, a new architecture for network functions virtualization, where we decouple the existing design of network functions into a stateless processing component along with a data store layer. In breaking the tight coupling, we enable a more elastic and resilient network function infrastructure. Our StatelessNF processing instances are architected around efficient pipelines utilizing DPDK for high performance network I/O, packaged as Docker containers for easy deployment, and a data store interface optimized based on the expected request patterns to efficiently access a RAMCloud-based data store. A network-wide orchestrator monitors the instances for load and failure, manages instances to scale and provide resilience, and leverages an OpenFlow-based network to direct traffic to instances. We implemented three example network functions (network address translator, firewall, and load balancer). Our evaluation shows (i) we are able to reach a throughput of 10Gbit/sec, with an added latency overhead of between 100μs and 500μs, (ii) we are able to have a failover which does not disrupt ongoing traffic, and (iii) when scaling out and scaling in we are able to match the ideal performance.
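To illustrate the decoupling of state from processing (though not StatelessNF's actual DPDK/RAMCloud/OpenFlow implementation), the sketch below shows a firewall whose per-flow state lives entirely in an external key-value store, so any processing instance can handle any packet and instances can be added, removed, or replaced without losing state. The KVStore class and packet field names are hypothetical stand-ins.

```python
# Minimal sketch of a "stateless" network function: all per-flow state is
# read from and written to an external data store, so any instance can
# process any packet and instances can fail or scale freely.

class KVStore:
    """Hypothetical data-store interface (in-memory stand-in for a
    low-latency store such as RAMCloud)."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value

def flow_key(pkt):
    # 5-tuple identifying the flow this packet belongs to
    return (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"], pkt["proto"])

def stateless_firewall(pkt, store):
    """Allow packets of established flows; otherwise consult a toy policy."""
    key = flow_key(pkt)
    if store.get(key) == "established":
        return "forward"
    if pkt["dst_port"] in (80, 443):      # toy policy: permit web traffic
        store.put(key, "established")     # externalize the new flow state
        return "forward"
    return "drop"

store = KVStore()
pkt = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
       "src_port": 12345, "dst_port": 443, "proto": "tcp"}
print(stateless_firewall(pkt, store))     # -> forward
```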
mOS: A Reusable Networking Stack for Flow Monitoring Middleboxes
Muhammad Asim Jamshed, YoungGyoun Moon, Donghwi Kim, Dongsu Han, and KyoungSoo Park, Korea Advanced Institute of Science and Technology (KAIST)
Awarded Best Paper!
Stateful middleboxes, such as intrusion detection systems and application-level firewalls, have provided key functionalities in operating modern IP networks. However, designing an efficient middlebox is challenging due to the lack of networking stack abstraction for TCP flow processing. Thus, middlebox developers often write the complex flow management logic from scratch, which is not only prone to errors, but also wastes efforts for similar functionalities across applications.
This paper presents the design and implementation of mOS, a reusable networking stack for stateful flow processing in middlebox applications. Our API allows developers to focus on the core application logic instead of dealing with low-level packet/flow processing themselves. Under the hood, it implements an efficient event system that scales to monitoring millions of concurrent flow events. Our evaluation demonstrates that mOS enables modular development of stateful middleboxes, often significantly reducing development efforts represented by the source lines of code, while introducing little performance overhead in multi-10Gbps network environments.
12:50 pm–2:00 pm
Symposium Luncheon
Palm Garden Room
The 2017 NSDI Test of Time award will be presented during the symposium luncheon.
View all Test of Time award winners.
2:00 pm–3:40 pm
Security and Privacy
Session Chair: Dahlia Malkhi, VMware
One Key to Sign Them All Considered Vulnerable: Evaluation of DNSSEC in the Internet
Haya Shulman and Michael Waidner, Fraunhofer Institute for Secure Information Technology SIT
We perform the first Internet study of the cryptographic security of DNSSEC-signed domains. To that end, we collected 2.1M DNSSEC keys for popular signed domains; of these, 1.9M are RSA keys. We analyse the RSA keys and show that a large fraction of signed domains are using vulnerable keys: 35% are signed with RSA keys that share their moduli with some other domain, and 66% use keys that are too short (1024 bits or less) or keys whose modulus has a GCD > 1 with the modulus of some other domain. As we show, to a large extent the vulnerabilities are due to poor key generation practices, but also due to potentially faulty hardware or software bugs.
The DNSSEC key collection and analysis are performed on a daily basis with the DNSSEC Keys Validation Engine, which we developed. The statistics as well as the DNSSEC Keys Validation Engine are made available online as a service for Internet users.
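The shared-factor weakness the authors measure follows from elementary arithmetic: a GCD greater than 1 between two RSA moduli reveals a prime factor of both, and hence both private keys. The sketch below is not the paper's Validation Engine; it is a toy pairwise check (a real scan over millions of keys would use a batch-GCD product tree instead of this O(n^2) loop).

```python
from math import gcd

def find_shared_factors(moduli):
    """Return (i, j, g) for pairs of distinct RSA moduli sharing a factor g > 1.
    Such a shared prime lets an attacker factor both moduli trivially."""
    weak = []
    for i in range(len(moduli)):
        for j in range(i + 1, len(moduli)):
            g = gcd(moduli[i], moduli[j])
            if g > 1 and moduli[i] != moduli[j]:
                weak.append((i, j, g))
    return weak

# Toy example: the first two "moduli" share the prime factor 101.
moduli = [101 * 103, 101 * 107, 109 * 113]
print(find_shared_factors(moduli))   # -> [(0, 1, 101)]
```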
Enhancing Security and Privacy of Tor's Ecosystem by Using Trusted Execution Environments
Seongmin Kim, Juhyeng Han, and Jaehyeong Ha, Korea Advanced Institute of Science and Technology (KAIST); Taesoo Kim, Georgia Institute of Technology; Dongsu Han, Korea Advanced Institute of Science and Technology (KAIST)
With Tor being a popular anonymity network, many attacks have been proposed to break its anonymity or leak information of a private communication on Tor. However, guaranteeing complete privacy in the face of an adversary on Tor is especially difficult because Tor relays are under complete control of world-wide volunteers. Currently, one can gain private information, such as circuit identifiers and hidden service identifiers, by running Tor relays and can even modify their behaviors with malicious intent.
This paper presents a practical approach to effectively enhancing the security and privacy of Tor by utilizing Intel SGX, a commodity trusted execution environment. We present a design and implementation of Tor, called SGX-Tor, that prevents code modification and limits the information exposed to untrusted parties. We demonstrate that our approach is practical and effectively reduces the power of an adversary to a traditional network-level adversary. Finally, SGX-Tor incurs moderate performance overhead; the end-to-end latency and throughput overheads for HTTP connections are 3.9% and 11.9%, respectively.
ViewMap: Sharing Private In-Vehicle Dashcam Videos
Minho Kim, Jaemin Lim, Hyunwoo Yu, Kiyeon Kim, Younghoon Kim, and Suk-Bok Lee, Hanyang University
Today, the search for dashcam video evidence is conducted manually, and the procedure does not guarantee privacy. In this paper, we motivate, design, and implement ViewMap, an automated public service system that enables sharing of private dashcam videos under anonymity. ViewMap takes a profile-based approach where each video is represented in a compact form called a view profile (VP), and the anonymized VPs are treated as entities for search, verification, and reward instead of their owners. ViewMap exploits the line-of-sight (LOS) properties of dedicated short-range communications (DSRC) such that each vehicle makes VP links with nearby ones that share the same sight while driving. ViewMap uses such LOS-based VP links to build a map of visibility around a given incident, and identifies VPs whose videos are worth reviewing. Original videos are never transmitted unless they are verified to be taken near the incident and anonymously solicited. ViewMap offers untraceable rewards for the provision of videos whose owners remain anonymous. We demonstrate the feasibility of ViewMap via field experiments on real roads using our DSRC testbeds and trace-driven simulations.
A System to Verify Network Behavior of Known Cryptographic Clients
Andrew Chi, Robert A. Cochran, Marie Nesfield, Michael K. Reiter, and Cynthia Sturton, The University of North Carolina at Chapel Hill
Numerous exploits of client-server protocols and applications involve modifying clients to behave in ways that untampered clients would not, such as crafting malicious packets. In this paper, we develop a system for verifying in near real-time that a cryptographic client’s message sequence is consistent with its known implementation. Moreover, we accomplish this without knowing all of the client-side inputs driving its behavior. Our toolchain for verifying a client’s messages explores multiple candidate execution paths in the client concurrently, an innovation useful for aspects of certain cryptographic protocols such as message padding (which will be permitted in TLS 1.3). In addition, our toolchain includes a novel approach to symbolically executing the client software in multiple passes that defers expensive functions until their inputs can be inferred and concretized. We demonstrate client verification on OpenSSL and BoringSSL to show that, e.g., Heartbleed exploits can be detected without Heartbleed-specific filtering and within seconds of the first malicious packet. On legitimate traffic our verification keeps pace with Gmail-shaped workloads, with a median lag of 0.85s.
3:40 pm–4:10 pm
Break with Refreshments
Grand Ballroom Foyer
4:10 pm–5:50 pm
Wireless Networking
Session Chair: Ranveer Chandra, Microsoft
FlexCore: Massively Parallel and Flexible Processing for Large MIMO Access Points
Christopher Husmann, Georgios Georgis, and Konstantinos Nikitopoulos, University of Surrey; Kyle Jamieson, Princeton University and University College London
Large MIMO base stations remain among wireless network designers' best tools for increasing wireless throughput while serving many clients, but current system designs sacrifice throughput with simple linear MIMO detection algorithms. Higher-performance detection techniques are known, but remain off the table because these systems parallelize their computation at the level of a whole OFDM subcarrier, sufficing only for the less demanding linear detection approaches they opt for. This paper presents FlexCore, the first computational architecture capable of parallelizing the detection of large numbers of mutually-interfering information streams at a granularity below individual OFDM subcarriers, in a nearly-embarrassingly parallel manner while utilizing any number of available processing elements. For 12 clients sending 64-QAM symbols to a 12-antenna base station, our WARP testbed evaluation shows similar network throughput to the state-of-the-art while using an order of magnitude fewer processing elements. For the same scenario, our combined WARP-GPU testbed evaluation demonstrates a 19x computational speedup, with 97% increased energy efficiency when compared with the state of the art. Finally, for the same scenario, an FPGA-based comparison between FlexCore and the state of the art shows that FlexCore can achieve up to 96% better energy efficiency, and can offer up to 32x the processing throughput.
Facilitating Robust 60 GHz Network Deployment By Sensing Ambient Reflectors
Teng Wei, University of Wisconsin—Madison; Anfu Zhou, Beijing University of Posts and Telecommunications; Xinyu Zhang, University of Wisconsin—Madison
60 GHz millimeter-wave networks represent the next frontier in high-speed wireless access technologies. Due to the use of highly directional and electronically steerable beams, the performance of 60 GHz networks becomes a sensitive function of environment structure and reflectivity, which cannot be handled by existing networking paradigms. In this paper, we propose E-Mi, a framework that harnesses 60 GHz radios' sensing capabilities to boost network performance. E-Mi uses a single pair of 60 GHz transmitter and receiver to sense the environment. It can resolve all dominant reflection paths between the two nodes, from which it reconstructs a coarse outline of major reflectors in the environment. It then feeds the reflector information into a ray-tracer to predict the channel and network performance of arbitrarily located links. Our experiments on a custom-built 60 GHz testbed verify that E-Mi can accurately sense a given environment, and predict the channel quality of different links with 2.8 dB median error. The prediction is then used to optimize the deployment of 60 GHz access points, with 2.2x to 4.5x capacity gain over empirical approaches.
Skip-Correlation for Multi-Power Wireless Carrier Sensing
Romil Bhardwaj, Krishna Chintalapudi, and Ramachandran Ramjee, Microsoft Research
Carrier sensing is a key mechanism that enables decentralized sharing of unlicensed spectrum. However, carrier sensing in its current form is fundamentally unsuitable when devices transmit at different power levels, a scenario increasingly common given the diversity of Wi-Fi APs in the market and the need for Wi-Fi's co-existence with new upcoming standards such as LAA/LWA. The primary contribution of this paper is a novel carrier sensing mechanism – skip correlation – that extends carrier sensing to accommodate multiple transmit power levels. Through an FPGA based implementation on the WARP platform, we demonstrate the effectiveness of our technique in a variety of scenarios including support for backward compatibility.
FM Backscatter: Enabling Connected Cities and Smart Fabrics
Anran Wang, Vikram Iyer, Vamsi Talla, Joshua R. Smith, and Shyamnath Gollakota, University of Washington
This paper enables connectivity on everyday objects by transforming them into FM radio stations. To do this, we show for the first time that ambient FM radio signals can be used as a signal source for backscatter communication. Our design creates backscatter transmissions that can be decoded on any FM receiver including those in cars and smartphones. This enables us to achieve a previously infeasible capability: backscattering information to cars and smartphones in outdoor environments.
Our key innovation is a modulation technique that transforms backscatter, which is a multiplication operation on RF signals, into an addition operation on the audio signals output by FM receivers. This enables us to embed both digital data as well as arbitrary audio into ambient analog FM radio signals. We build prototype hardware of our design and successfully embed audio transmissions over ambient FM signals. Further, we achieve data rates of up to 3.2 kbps and ranges of 5–60 feet, while consuming as little as 11.07 μW of power. To demonstrate the potential of our design, we also fabricate our prototype on a cotton t-shirt by machine sewing patterns of a conductive thread to create a smart fabric that can transmit data to a smartphone. We also embed FM antennas into posters and billboards and show that they can communicate with FM receivers in cars and smartphones.
Tuesday, March 28
7:30 am–8:45 am
Continental Breakfast
Grand Ballroom Foyer
8:45 am–10:00 am
Privacy and Security
Session Chair: Jacob Lorch, Microsoft
Prio: Private, Robust, and Scalable Computation of Aggregate Statistics
Henry Corrigan-Gibbs and Dan Boneh, Stanford University
This paper presents Prio, a privacy-preserving system for the collection of aggregate statistics. Each Prio client holds a private data value (e.g., its current location), and a small set of servers compute statistical functions over the values of all clients (e.g., the most popular location). As long as at least one server is honest, the Prio servers learn nearly nothing about the clients’ private data, except what they can infer from the aggregate statistics that the system computes. To protect functionality in the face of faulty or malicious clients, Prio uses secret-shared non-interactive proofs (SNIPs), a new cryptographic technique that yields a hundred-fold performance improvement over conventional zero-knowledge approaches. Prio extends classic private aggregation techniques to enable the collection of a large class of useful statistics. For example, Prio can perform a least-squares regression on high-dimensional client-provided data without ever seeing the data in the clear.
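As a rough illustration of the private-aggregation core Prio builds on (omitting the SNIP proofs that protect against malformed client submissions), each client can additively secret-share its value so that no single server sees it, yet the servers together can still recover the sum. The sketch below assumes honest clients and a toy public modulus.

```python
import secrets

P = 2**61 - 1   # public prime modulus; all arithmetic is mod P (illustrative choice)

def share(value, num_servers):
    """Split a client's private value into additive shares, one per server."""
    shares = [secrets.randbelow(P) for _ in range(num_servers - 1)]
    last = (value - sum(shares)) % P
    return shares + [last]

def aggregate(all_client_shares):
    """Each server sums the shares it received; summing those partial sums
    (mod P) yields the total without any single server seeing a client value."""
    num_servers = len(all_client_shares[0])
    per_server = [sum(cs[s] for cs in all_client_shares) % P
                  for s in range(num_servers)]
    return sum(per_server) % P

clients = [3, 7, 5]                              # private values held by clients
shares = [share(v, num_servers=3) for v in clients]
print(aggregate(shares))                         # -> 15, learned only in aggregate
```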
Opaque: An Oblivious and Encrypted Distributed Analytics Platform
Wenting Zheng, Ankur Dave, Jethro G. Beekman, Raluca Ada Popa, Joseph E. Gonzalez, and Ion Stoica, University of California, Berkeley
Many systems run rich analytics on sensitive data in the cloud, but are prone to data breaches. Hardware enclaves promise data confidentiality and secure execution of arbitrary computation, yet still suffer from access pattern leakage. We propose Opaque, a distributed data analytics platform supporting a wide range of queries while providing strong security guarantees. Opaque introduces new distributed oblivious relational operators that hide access patterns, and new query planning techniques to optimize these new operators. Opaque is implemented on Spark SQL with few changes to the underlying system. Opaque provides data encryption, authentication and computation verification with a performance ranging from 52% faster to 3.3x slower as compared to vanilla Spark SQL; obliviousness comes with a 1.6–46x overhead. Opaque provides an improvement of three orders of magnitude over state-of-the-art oblivious protocols, and our query optimization techniques improve performance by 2–5x.
Splinter: Practical Private Queries on Public Data
Frank Wang, Catherine Yun, Shafi Goldwasser, and Vinod Vaikuntanathan, MIT CSAIL; Matei Zaharia, Stanford InfoLab
Many online services let users query public datasets such as maps, flight prices, or restaurant reviews. Unfortunately, the queries to these services reveal highly sensitive information that can compromise users’ privacy. This paper presents Splinter, a system that protects users’ queries on public data and scales to realistic applications. A user splits her query into multiple parts and sends each part to a different provider that holds a copy of the data. As long as any one of the providers is honest and does not collude with the others, the providers cannot determine the query. Splinter uses and extends a new cryptographic primitive called Function Secret Sharing (FSS) that makes it up to an order of magnitude more efficient than prior systems based on Private Information Retrieval and garbled circuits. We develop protocols extending FSS to new types of queries, such as MAX and TOPK queries. We also provide an optimized implementation of FSS using AES-NI instructions and multicores. Splinter achieves end-to-end latencies below 1.6 seconds for realistic workloads including a Yelp clone, flight search, and map routing.
10:00 am–10:30 am
Break with Refreshments
Grand Ballroom Foyer
10:30 am–11:45 am
SDN and Network Design
Session Chair: Srini Seshan, Carnegie Mellon University
VFP: A Virtual Switch Platform for Host SDN in the Public Cloud
Daniel Firestone, Microsoft
Many modern scalable cloud networking architectures rely on host networking for implementing VM network policy, e.g., tunneling for virtual networks, NAT for load balancing, stateful ACLs, QoS, and more. We present the Virtual Filtering Platform (VFP), a programmable virtual switch that powers Microsoft Azure, a large public cloud, and provides this policy. We define several major goals for a programmable virtual switch based on our operational experiences, including support for multiple independent network controllers, policy based on connections rather than only on packets, efficient caching and classification algorithms for performance, and efficient offload of flow policy to programmable NICs, and we demonstrate how VFP achieves these goals. VFP has been deployed on >1M hosts running IaaS and PaaS workloads for over 4 years. We present the design of VFP and its API, its flow language and compiler used for flow processing, performance results, and experiences deploying and using VFP in Azure over several years.
SCL: Simplifying Distributed SDN Control Planes
Aurojit Panda and Wenting Zheng, University of California, Berkeley; Xiaohe Hu, Tsinghua University; Arvind Krishnamurthy, University of Washington; Scott Shenker, University of California, Berkeley, and International Computer Science Institute
We consider the following question: what consistency model is appropriate for coordinating the actions of a replicated set of SDN controllers? We first argue that the conventional requirement of strong consistency, typically achieved through the use of Paxos or other consensus algorithms, is conceptually unnecessary to handle unplanned network updates. We present an alternate approach, based on the weaker notion of eventual correctness, and describe the design of a simple coordination layer (SCL) that can seamlessly turn a set of single-image SDN controllers (that obey certain properties) into a distributed SDN system that achieves this goal (whereas traditional consensus mechanisms do not). We then show through analysis and simulation that our approach provides faster responses to network events. While our primary focus is on handling unplanned network updates, our coordination layer also handles policy updates and other situations where consistency is warranted. Thus, contrary to the prevailing wisdom, we argue that distributed SDN control planes need only be slightly more complicated than single-image controllers.
Robust Validation of Network Designs under Uncertain Demands and Failures
Yiyang Chang, Sanjay Rao, and Mohit Tawarmalani, Purdue University
A key challenge confronting wide-area network architects is validating that their network designs provide assurable performance in the face of variable traffic demands and failures. Validation is hard because of the exponential, and possibly non-enumerable, set of scenarios that must be considered. Current theoretical tools provide overly conservative bounds on network performance since, to remain tractable, they do not adequately model the flexible routing strategies that networks employ in practice to adapt to failures and changing traffic demands. In this paper, we develop an optimization-theoretic framework to derive the worst-case network performance across scenarios of interest by modeling flexible routing adaptation strategies. We present an approach to tackling the resulting intractable problems, which can achieve tighter bounds on network performance than current techniques. While our framework is general, we focus on bounding worst-case link utilizations, and on case studies involving topology design and MPLS tunnels, chosen both for their practical importance and to illustrate key aspects of our framework. Evaluations over real network topologies and traffic data show the promise of the approach.
11:45 am–1:15 pm
Lunch (on your own)
1:15 pm–2:30 pm
Data-Driven Systems
Session Chair: Mosharaf Chowdhury, University of Michigan
Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads
We describe ExCamera, a system that can edit, transform, and encode a video, including 4K and VR material, with low latency. The system makes two major contributions.
First, we designed a framework to run general-purpose parallel computations on a commercial “cloud function” service. The system starts up thousands of threads in seconds and manages inter-thread communication.
Second, we implemented a video encoder intended for fine-grained parallelism, using a functional-programming style that allows computation to be split into thousands of tiny tasks without harming compression efficiency. Our design reflects a key insight: the work of video encoding can be divided into fast and slow parts, with the “slow” work done in parallel, and only “fast” work done serially.
Live Video Analytics at Scale with Approximation and Delay-Tolerance
Haoyu Zhang, Microsoft and Princeton University; Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, and Paramvir Bahl, Microsoft; Michael J. Freedman, Princeton University
Video cameras are pervasively deployed for security and smart city scenarios, with millions of them in large cities worldwide. Achieving the potential of these cameras requires efficiently analyzing the live videos in real time. We describe VideoStorm, a video analytics system that processes thousands of video analytics queries on live video streams over large clusters. Given the high costs of vision processing, resource management is crucial. We consider two key characteristics of video analytics: resource-quality tradeoff with multi-dimensional configurations, and variety in quality and lag goals. VideoStorm's offline profiler generates query resource-quality profiles, while its online scheduler allocates resources to queries to maximize performance on quality and lag, in contrast to the commonly used fair sharing of resources in clusters. Deployment on an Azure cluster of 101 machines shows improvement by as much as 80% in quality of real-world queries and 7x better lag, processing video from operational traffic cameras.
Pytheas: Enabling Data-Driven Quality of Experience Optimization Using Group-Based Exploration-Exploitation
Junchen Jiang, Carnegie Mellon University; Shijie Sun, Tsinghua University; Vyas Sekar, Carnegie Mellon University; Hui Zhang, Carnegie Mellon University and Conviva Inc.
Content providers are increasingly using data-driven mechanisms to optimize quality of experience (QoE). Many existing approaches formulate this process as a prediction problem of learning optimal decisions (e.g., server, bitrate, relay) based on observed QoE of recent sessions. While prediction-based mechanisms have shown promising QoE improvements, they are necessarily incomplete as they: (1) suffer from many known biases (e.g., incomplete visibility) and (2) cannot respond to sudden changes (e.g., load changes). Drawing a parallel from machine learning, we argue that data-driven QoE optimization should instead be cast as a real-time exploration and exploitation (E2) process rather than as a prediction problem. Adopting E2 in network applications, however, introduces key architectural (e.g., how to update decisions in real time with fresh data) and algorithmic (e.g., capturing complex interactions between session features vs. QoE) challenges. We present Pytheas, a framework which addresses these challenges using a group-based E2 mechanism. The insight is that application sessions sharing the same features (e.g., IP prefix, location) can be grouped so that we can run E2 algorithms at a per-group granularity. This naturally captures the complex interactions and is amenable to real-time control with fresh measurements. Using an end-to-end implementation and a proof-of-concept deployment in CloudLab, we show that Pytheas improves video QoE over a state-of-the-art prediction-based system by up to 31% on average and 78% on 90th percentile of per-session QoE.
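The sketch below is not Pytheas's algorithm; it illustrates the group-based E2 idea with a standard UCB1 bandit run independently per session group, where the group key, decision names, and simulated QoE values are made up for the example.

```python
import math, random
from collections import defaultdict

class GroupUCB:
    """Per-group UCB1 bandit: sessions in the same group (e.g., same ISP and
    location) share exploration state, so fresh QoE measurements from one
    session immediately inform decisions for similar sessions."""
    def __init__(self, decisions):
        self.decisions = decisions
        self.counts = defaultdict(lambda: {d: 0 for d in decisions})
        self.rewards = defaultdict(lambda: {d: 0.0 for d in decisions})

    def choose(self, group):
        counts, rewards = self.counts[group], self.rewards[group]
        for d in self.decisions:                 # try every decision once first
            if counts[d] == 0:
                return d
        total = sum(counts.values())
        def ucb(d):
            return rewards[d] / counts[d] + math.sqrt(2 * math.log(total) / counts[d])
        return max(self.decisions, key=ucb)

    def update(self, group, decision, qoe):
        self.counts[group][decision] += 1
        self.rewards[group][decision] += qoe

controller = GroupUCB(decisions=["cdn-A", "cdn-B"])
for _ in range(100):
    group = ("ISP-X", "Seattle")
    d = controller.choose(group)
    qoe = random.gauss(0.9 if d == "cdn-A" else 0.6, 0.05)   # simulated per-session QoE
    controller.update(group, d, qoe)
print(controller.counts[("ISP-X", "Seattle")])   # traffic shifts toward the better decision
```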
2:30 pm–3:00 pm
Break with Refreshments
Grand Ballroom Foyer
3:00 pm–4:15 pm
Datacenter Networking
Session Chair: George Porter, University of California, San Diego
Let It Flow: Resilient Asymmetric Load Balancing with Flowlet Switching
Erico Vanini and Rong Pan, Cisco Systems; Mohammad Alizadeh, Massachusetts Institute of Technology; Parvin Taheri and Tom Edsall, Cisco Systems
Datacenter networks require efficient multi-path load balancing to achieve high bisection bandwidth. Despite much progress in recent years towards addressing this challenge, a load balancing design that is both simple to implement and resilient to network asymmetry has remained elusive. In this paper, we show that flowlet switching, an idea first proposed more than a decade ago, is a powerful technique for resilient load balancing with asymmetry. Flowlets have a remarkable elasticity property: their size changes automatically based on traffic conditions on their path. We use this insight to develop LetFlow, a very simple load balancing scheme that is resilient to asymmetry. LetFlow simply picks paths at random for flowlets and lets their elasticity naturally balance the traffic on different paths. Our extensive evaluation with real hardware and packet-level simulations shows that LetFlow is very effective. Despite being much simpler, it performs significantly better than other traffic-oblivious schemes like WCMP and Presto in asymmetric scenarios, while achieving average flow completion times within 10-20% of CONGA in testbed experiments and 2x of CONGA in simulated topologies with large asymmetry and heavy traffic load.
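A minimal sketch of the flowlet-switching idea behind LetFlow: a flow's packets stick to one path until an inactivity gap ends the current flowlet, at which point a new path is chosen uniformly at random. The 500 microsecond gap and the path names below are illustrative, not values from the paper.

```python
import random

FLOWLET_TIMEOUT = 0.0005   # 500 us inactivity gap starts a new flowlet (illustrative)

class LetFlowBalancer:
    """Pick a path uniformly at random for each new flowlet of a flow.
    A flowlet ends when the flow is idle longer than FLOWLET_TIMEOUT, so
    switching paths between flowlets cannot reorder packets within one."""
    def __init__(self, paths):
        self.paths = paths
        self.last_seen = {}    # flow -> timestamp of last packet
        self.path_of = {}      # flow -> path of the current flowlet

    def route(self, flow, now):
        idle = now - self.last_seen.get(flow, float("-inf"))
        if flow not in self.path_of or idle > FLOWLET_TIMEOUT:
            self.path_of[flow] = random.choice(self.paths)   # new flowlet, new path
        self.last_seen[flow] = now
        return self.path_of[flow]

lb = LetFlowBalancer(paths=["spine-1", "spine-2", "spine-3"])
print(lb.route("flowA", now=0.0000))   # first flowlet: random path
print(lb.route("flowA", now=0.0001))   # same flowlet: same path
print(lb.route("flowA", now=0.0010))   # gap > timeout: may switch paths
```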
Flowtune: Flowlet Control for Datacenter Networks
Jonathan Perry, Hari Balakrishnan, and Devavrat Shah, Massachusetts Institute of Technology
Rapid convergence to a desired allocation of network resources to endpoint traffic is a difficult problem. The reason is that congestion control decisions are distributed across the endpoints, which vary their offered load in response to changes in application demand and network feedback on a packet-by-packet basis. We propose a different approach for datacenter networks, flowlet control, in which congestion control decisions are made at the granularity of a flowlet, not a packet. With flowlet control, allocations have to change only when flowlets arrive or leave. We have implemented this idea in a system called Flowtune using a centralized allocator that receives flowlet start and end notifications from endpoints. The allocator computes optimal rates using a new, fast method for network utility maximization, and updates endpoint congestion-control parameters. Experiments show that Flowtune outperforms DCTCP, pFabric, sfqCoDel, and XCP on tail packet delays in various settings, converging to optimal rates within a few packets rather than over several RTTs. Benchmarks on an EC2 deployment show a fairer rate allocation than Linux’s Cubic. A data aggregation benchmark shows 1.61x lower p95 coflow completion time.
Flexplane: An Experimentation Platform for Resource Management in Datacenters
Amy Ousterhout, Jonathan Perry, and Hari Balakrishnan, MIT CSAIL; Petr Lapukhov, Facebook
Flexplane enables users to program data plane algorithms and conduct experiments that run real application traffic over them at hardware line rates. Flexplane explores an intermediate point in the design space between past work on software routers and emerging work on programmable hardware chipsets. Like software routers, Flexplane enables users to express resource management schemes in a high-level language (C++), but unlike software routers, Flexplane runs at close to hardware line rates. To achieve these two goals, a centralized emulator faithfully emulates, in real-time on a multi-core machine, the desired data plane algorithms with very succinct representations of the original packets. Real packets traverse the network when notified by the emulator, sharing the same fate and relative delays as their emulated counterparts.
Flexplane accurately predicts the behavior of several network schemes such as RED and DCTCP, sustains aggregate throughput of up to 760 Gbits/s on a 10-core machine (≈ 20x faster than software routers), and enables experiments with real-world operating systems and applications (e.g., Spark) running on diverse network schemes at line rate, including those such as HULL and pFabric that are not available in hardware today.
4:15 pm–4:30 pm
Short Break
4:30 pm–5:45 pm
Cloud and Distributed Systems
Session Chair: John Wilkes, Google
I Can’t Believe It’s Not Causal! Scalable Causal Consistency with No Slowdown Cascades
Syed Akbar Mehdi, Cody Littley, and Natacha Crooks, The University of Texas at Austin; Lorenzo Alvisi, The University of Texas at Austin and Cornell University; Nathan Bronson, Facebook; Wyatt Lloyd, University of Southern California
We describe the design, implementation, and evaluation of Occult (Observable Causal Consistency Using Lossy Timestamps), the first scalable, geo-replicated data store that provides causal consistency to its clients without exposing the system to the possibility of slowdown cascades, a key obstacle to the deployment of causal consistency at scale. Occult supports read/write transactions under PC-PSI, a variant of Parallel Snapshot Isolation that contributes to Occult’s immunity to slowdown cascades by weakening how PSI replicates transactions committed at the same replica. While PSI insists that they all be totally ordered, PC-PSI simply requires total order Per Client session. Nonetheless, Occult guarantees that all transactions read from a causally consistent snapshot of the datastore without requiring any coordination in how transactions are asynchronously replicated.
CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics
Omid Alipourfard, Yale University; Hongqiang Harry Liu and Jianshu Chen, Microsoft Research; Shivaram Venkataraman, University of California, Berkeley; Minlan Yu, Yale University; Ming Zhang, Alibaba Group
Picking the right cloud configuration for recurring big data analytics jobs running in clouds is hard, because there can be tens of possible VM instance types and even more cluster sizes to pick from. Choosing poorly can significantly degrade performance and increase the cost to run a job by 2-3x on average, and as much as 12x in the worst case. However, it is challenging to automatically identify the best configuration for a broad spectrum of applications and cloud configurations with low search cost. CherryPick is a system that leverages Bayesian Optimization to build performance models for various applications, and the models are just accurate enough to distinguish the best or close-to-the-best configuration from the rest with only a few test runs. Our experiments on five analytic applications in AWS EC2 show that CherryPick has a 45-90% chance of finding optimal configurations, and otherwise finds near-optimal ones, saving up to 75% of the search cost compared to existing solutions.
AdaptSize: Orchestrating the Hot Object Memory Cache in a Content Delivery Network
Daniel S. Berger, University of Kaiserslautern; Ramesh K. Sitaraman, University of Massachusetts Amherst and Akamai Technologies; Mor Harchol-Balter, Carnegie Mellon University
Most major content providers use content delivery networks (CDNs) to serve web and video content to their users. A CDN is a large distributed system of servers that caches and delivers content to users. The first-level cache in a CDN server is the memory-resident Hot Object Cache (HOC). A major goal of a CDN is to maximize the object hit ratio (OHR) of its HOCs. But, the small size of the HOC, the huge variance in the requested object sizes, and the diversity of request patterns make this goal challenging.
We propose AdaptSize, the first adaptive, size-aware cache admission policy for HOCs that achieves a high OHR, even when object size distributions and request characteristics vary significantly over time. At the core of AdaptSize is a novel Markov cache model that seamlessly adapts the caching parameters to the changing request patterns. Using request traces from one of the largest CDNs in the world, we show that our implementation of AdaptSize achieves significantly higher OHR than widely used production systems: 30-48% and 47-91% higher OHR than Nginx and Varnish, respectively. AdaptSize also achieves 33-46% higher OHR than state-of-the-art research systems. Further, AdaptSize is more robust to changing request patterns than the traditional tuning approach of hill climbing and shadow queues studied in other contexts.
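As a rough sketch of size-aware admission in front of a small LRU Hot Object Cache, the code below admits an object with probability exp(-size/c) for a fixed parameter c, so small objects are almost always admitted and very large ones rarely. AdaptSize's contribution is the Markov model that continually re-tunes this parameter, which is omitted here; the exact admission form and constants are simplified assumptions, not the paper's implementation.

```python
import math, random
from collections import OrderedDict

def admit(size, c):
    """Size-aware probabilistic admission (fixed c for illustration only)."""
    return random.random() < math.exp(-size / c)

class HOC:
    """Tiny LRU Hot Object Cache with size-aware admission."""
    def __init__(self, capacity, c):
        self.capacity, self.c = capacity, c
        self.used = 0
        self.cache = OrderedDict()          # object id -> size, in LRU order

    def request(self, obj, size):
        if obj in self.cache:               # hit
            self.cache.move_to_end(obj)
            return True
        if admit(size, self.c):             # miss: maybe admit
            while self.used + size > self.capacity and self.cache:
                _, evicted = self.cache.popitem(last=False)   # evict LRU
                self.used -= evicted
            if self.used + size <= self.capacity:
                self.cache[obj] = size
                self.used += size
        return False

hoc = HOC(capacity=1000, c=200)
hoc.request("small.js", 20)       # exp(-0.1): very likely admitted
hoc.request("video.mp4", 5000)    # exp(-25): almost never admitted
```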
6:30 pm–8:00 pm
Poster Session and Reception
Palm Garden Room
Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Take part in discussions with your colleagues over complimentary food and drinks. View the complete list of accepted posters.
Wednesday, March 29
8:00 am–9:00 am
Continental Breakfast
Grand Ballroom Foyer
9:00 am–10:40 am
Mobile Systems and IoT
Session Chair: Deepak Ganesan, University of Massachusetts Amherst
Bringing IoT to Sports Analytics
Mahanth Gowda, Ashutosh Dhekne, Sheng Shen, and Romit Roy Choudhury, University of Illinois at Urbana–Champaign; Xue Yang, Lei Yang, Suresh Golwalkar, and Alexander Essanian, Intel
This paper explores the possibility of bringing IoT to sports analytics, particularly to the game of Cricket. We develop solutions to track a ball's 3D trajectory and spin with inexpensive sensors and radios embedded in the ball. Unique challenges arise rendering existing localization and motion tracking solutions inadequate. Our system, iBall, mitigates these problems by fusing disparate sources of partial information – wireless, inertial sensing, and motion models – into a non-linear error minimization framework. Measured against a mm-level ground truth, the median ball location error is 8cm while rotational error remains below 12° even at the end of the flight. The results do not rely on any calibration or training, hence we expect the core techniques to extend to other sports like baseball, with some domain-specific modifications.
FarmBeats: An IoT Platform for Data-Driven Agriculture
Deepak Vasisht, Microsoft and Massachusetts Institute of Technology; Zerina Kapetanovic, Microsoft and University of Washington; Jongho Won, Microsoft and Purdue University; Xinxin Jin, Microsoft and University of California, San Diego; Ranveer Chandra, Ashish Kapoor, Sudipta N. Sinha, and Madhusudhan Sudarshan, Microsoft; Sean Stratman, Dancing Crow Farm
Data-driven techniques help boost agricultural productivity by increasing yields, reducing losses and cutting down input costs. However, these techniques have seen sparse adoption owing to high costs of manual data collection and limited connectivity solutions. In this paper, we present FarmBeats, an end-to-end IoT platform for agriculture that enables seamless data collection from various sensors, cameras and drones. FarmBeats's system design, which explicitly accounts for weather-related power and Internet outages, has enabled six-month-long deployments in two US farms.
Enabling High-Quality Untethered Virtual Reality
Omid Abari, Dinesh Bharadia, Austin Duffield, and Dina Katabi, Massachusetts Institute of Technology
Today’s virtual reality (VR) headsets require a cable connection to a PC or game console. This cable significantly limits the player’s mobility and, hence, her VR experience. The high data rate requirement of this link (multiple Gbps) precludes its replacement by WiFi. Thus, in this paper, we focus on using mmWave technology to deliver multi-Gbps wireless communication between VR headsets and their game consoles. We address the two key problems that prevent existing mmWave links from being used in VR systems. First, mmWave signals suffer from a blockage problem, i.e., they operate mainly in line-of-sight and can be blocked by simple obstacles such as the player lifting her hand in front of the headset. Second, mmWave radios use highly directional antennas with very narrow beams; they work only when the transmitter’s beam is aligned with the receiver’s beam. Any small movement of the headset can break the alignment and stall the data stream. We present MoVR, a novel system that allows mmWave links to sustain high data rates even in the presence of a blockage and mobility. MoVR does this by introducing a smart mmWave mirror and leveraging VR headset tracking information. We implement MoVR and empirically demonstrate its performance using an HTC VR headset.
Improving User Perceived Page Load Times Using Gaze
Conor Kelton, Jihoon Ryoo, Aruna Balasubramanian, and Samir R. Das, Stony Brook University
We take a fresh look at Web page load performance from the point of view of user experience. Our user study shows that perceptual performance, defined as user-perceived page load time (uPLT), poorly correlates with traditional page load time (PLT) metrics. However, most page load optimizations are designed to improve the traditional PLT metrics, rendering their impact on user experience uncertain. Instead, we present WebGaze, a system that specifically optimizes for the uPLT metric. The key insight in WebGaze is that user attention and interest can be captured using a user's eye gaze and can in turn be used to improve uPLT. We collect eye gaze data from 50 users across 45 Web pages and find that there is commonality in user attention across users. Specifically, users are drawn to certain regions on the page, which we call regions of high collective fixation. WebGaze prioritizes loading objects that exhibit a high degree of collective fixation to improve user-perceived latencies. We compare WebGaze with three alternate strategies, one of which is the state-of-the-art system that also uses prioritization to improve user experience. Our evaluation based on a user study shows that WebGaze improves median uPLT for 73% of the Web pages compared to all three alternate strategies.
10:40 am–11:10 am
Break with Refreshments
Grand Ballroom Foyer
11:10 am–12:25 pm
Networking in the Datacenter
Session Chair: Boon Thau Loo, University of Pennsylvania
RAIL: A Case for Redundant Arrays of Inexpensive Links in Data Center Networks
Danyang Zhuo, University of Washington; Monia Ghobadi, Ratul Mahajan, Amar Phanishayee, and Xuan Kelvin Zou, Microsoft Research; Hang Guan, Columbia University; Arvind Krishnamurthy and Thomas Anderson, University of Washington
While there are many proposals to reduce the cost of data center networks (DCN), little attention has been paid to the role played by the physical links that carry packets. By studying over 300K optical links across many production DCNs, we show that these links are operating quite conservatively relative to the requirements in the IEEE standards. Motivated by this observation, to reduce DCN costs, we propose using transceivers—a key contributor to DCN cost—beyond their currently specified limit. Our experiments with multiple commodity transceivers show that their reach can be “stretched” 1.6 to 4 times their specification. However, with stretching, the performance of 1–5% of the DCN paths can fall below the IEEE standard. We develop RAIL, a system to ensure that in such a network, applications only use paths that meet their performance needs. Our proposal can reduce the network cost by up to 10% for 10Gbps networks and 44% for 40Gbps networks, without affecting the applications’ performance.
Enabling Wide-Spread Communications on Optical Fabric with MegaSwitch
Li Chen and Kai Chen, The Hong Kong University of Science and Technology; Zhonghua Zhu, Omnisensing Photonics; Minlan Yu, Yale University; George Porter, University of California, San Diego; Chunming Qiao, University at Buffalo; Shan Zhong, CoAdna
Existing wired optical interconnects face a challenge of supporting wide-spread communications in production clusters. Initial proposals are constrained, as they only support connections among a small number of racks (e.g., 2 or 4) at a time, with switching time of milliseconds. Recent efforts on reducing optical circuit reconfiguration time to microseconds partially mitigate this problem by rapidly time-sharing optical circuits across more nodes, but are still limited by the total number of parallel circuits available simultaneously.
In this paper, we seek an optical interconnect that can enable unconstrained communications within a computing cluster of thousands of servers. We present MegaSwitch, a multi-fiber ring optical fabric that exploits space division multiplexing across multiple fibers to deliver rearrangeably non-blocking communications to 30+ racks and 6000+ servers. We have implemented a 5-rack 40-server MegaSwitch prototype with commercial optical devices, and used testbed experiments as well as large-scale simulations to explore MegaSwitch's architectural benefits and tradeoffs.
Passive Realtime Datacenter Fault Detection and Localization
Arjun Roy, University of California, San Diego; Hongyi Zeng and Jasmeet Bagga, Facebook; Alex C. Snoeren, University of California, San Diego
Datacenters are characterized by their large scale, stringent reliability requirements, and significant application diversity. However, the realities of employing hardware with small but non-zero failure rates mean that datacenters are subject to significant numbers of failures, impacting the performance of the services that rely on them. To make matters worse, these failures are not always obvious; network switches and links can fail partially, dropping or delaying various subsets of packets without necessarily delivering a clear signal that they are faulty. Thus, traditional fault detection techniques involving end-host or router-based statistics can fall short in their ability to identify these errors.
We describe how to expedite the process of detecting and localizing partial datacenter faults using an end-host method generalizable to most datacenter applications. In particular, we correlate transport-layer flow metrics and network-I/O system call delay at end hosts with the path that traffic takes through the datacenter and apply statistical analysis techniques to identify outliers and localize the faulty link and/or switch(es). We evaluate our approach in a production Facebook front-end datacenter.
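A toy version of the correlation idea, with made-up link names and a much cruder outlier test than the paper's statistical analysis, might look like the sketch below: attribute each flow's retransmission rate to every link on its path, average per link, and flag links that stand far above their peers.

```python
from statistics import median

def link_scores(flows):
    """flows: list of (path, retransmits, packets), where path is the list of
    links the flow traversed. Attribute each flow's retransmission rate to
    every link on its path, then average per link."""
    totals = {}
    for path, retx, pkts in flows:
        rate = retx / max(pkts, 1)
        for link in path:
            s = totals.setdefault(link, [0.0, 0])
            s[0] += rate
            s[1] += 1
    return {link: s[0] / s[1] for link, s in totals.items()}

def flag_outliers(scores, factor=2.0):
    """Flag links whose average rate is far above the median of their peers."""
    med = median(scores.values())
    return [link for link, s in scores.items() if s > factor * med and s > 0]

# Toy traffic: every flow crossing link a1-c2 suffers heavy retransmissions.
flows = [
    (["r1-a1", "a1-c1"], 0, 1000),
    (["r1-a1", "a1-c2"], 50, 1000),
    (["r2-a1", "a1-c1"], 1, 1000),
    (["r2-a1", "a1-c2"], 48, 1000),
    (["r1-a1", "a1-c1"], 0, 1000),
    (["r2-a1", "a1-c1"], 2, 1000),
]
print(flag_outliers(link_scores(flows)))    # -> ['a1-c2']
```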
12:25 pm–2:00 pm
Lunch (on your own)
2:00 pm–3:40 pm
Big Data Systems
Session Chair: Rodrigo Fonseca, Brown University
Clipper: A Low-Latency Online Prediction Serving System
Daniel Crankshaw, Xin Wang, and Giulio Zhou, University of California, Berkeley; Michael J. Franklin, University of California, Berkeley, and The University of Chicago; Joseph E. Gonzalez and Ion Stoica, University of California, Berkeley
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment.
In this paper, we introduce Clipper, a general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks and applications. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluate Clipper on four common machine learning benchmark datasets and demonstrate its ability to meet the latency, accuracy, and throughput demands of online serving applications. Finally, we compare Clipper to the Tensorflow Serving system and demonstrate that we are able to achieve comparable throughput and latency while enabling model composition and online learning to improve accuracy and render more robust predictions.
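Clipper's selection policies are bandit-based and also support ensembles; the toy sketch below only illustrates the general shape of adaptive model selection behind a single prediction interface, using hypothetical model names and a simple epsilon-greedy rule rather than Clipper's actual algorithms.

```python
import random

class AdaptiveSelector:
    """Toy adaptive model selection layer: route each query to one of several
    deployed models and shift traffic toward whichever model's feedback
    (e.g., observed accuracy) is best, without modifying the models."""
    def __init__(self, models, epsilon=0.1):
        self.models = models                        # name -> callable(features) -> prediction
        self.epsilon = epsilon
        self.stats = {m: [0.0, 1e-9] for m in models}   # [reward sum, count]

    def predict(self, features):
        if random.random() < self.epsilon:          # explore
            name = random.choice(list(self.models))
        else:                                       # exploit best observed model
            name = max(self.stats, key=lambda m: self.stats[m][0] / self.stats[m][1])
        return name, self.models[name](features)

    def feedback(self, name, reward):
        self.stats[name][0] += reward
        self.stats[name][1] += 1

models = {
    "logreg": lambda x: int(sum(x) > 1.0),          # hypothetical deployed models
    "tree":   lambda x: int(x[0] > 0.5),
}
sel = AdaptiveSelector(models)
name, y = sel.predict([0.7, 0.6])
sel.feedback(name, reward=1.0)   # application reports whether the prediction was good
```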
Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds
Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, and Phillip B. Gibbons, Carnegie Mellon University; Onur Mutlu, ETH Zurich and Carnegie Mellon University
Machine learning (ML) is widely used to derive useful information from large-scale data (such as user activities, pictures, and videos) generated at increasingly rapid rates, all over the world. Unfortunately, it is infeasible to move all this globally-generated data to a centralized data center before running an ML algorithm over it—moving large amounts of raw data over wide-area networks (WANs) can be extremely slow, and is also subject to the constraints of privacy and data sovereignty laws. This motivates the need for a geo-distributed ML system spanning multiple data centers. Unfortunately, communicating over WANs can significantly degrade ML system performance (by as much as 53.7x in our study) because the communication overwhelms the limited WAN bandwidth.
Our goal in this work is to develop a geo-distributed ML system that (1) employs an intelligent communication mechanism over WANs to efficiently utilize the scarce WAN bandwidth, while retaining the accuracy and correctness guarantees of an ML algorithm; and (2) is generic and flexible enough to run a wide range of ML algorithms, without requiring any changes to the algorithms.
To this end, we introduce a new, general geo-distributed ML system, Gaia, that decouples the communication within a data center from the communication between data centers, enabling different communication and consistency models for each. We present a new ML synchronization model, Approximate Synchronous Parallel (ASP), whose key idea is to dynamically eliminate insignificant communication between data centers while still guaranteeing the correctness of ML algorithms. Our experiments on our prototypes of Gaia running across 11 Amazon EC2 global regions and on a cluster that emulates EC2 WAN bandwidth show that Gaia provides 1.8–53.5x speedup over two state-of-the-art distributed ML systems, and is within 0.94–1.40x of the speed of running the same ML algorithm on machines on a local area network (LAN).
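A minimal sketch of the significance-filter idea in ASP: local updates are applied immediately within a data center, but a delta crosses the WAN only when it is large relative to the parameter's current value. The 1% threshold and the bookkeeping below are illustrative simplifications; ASP additionally bounds how stale a parameter may become, which is omitted here.

```python
SIGNIFICANCE = 0.01   # send an update only if it changes the parameter by >= 1% (illustrative)

class SignificanceFilter:
    """Accumulate local updates per parameter and propagate to remote data
    centers only when the accumulated change is significant relative to the
    parameter's value. Insignificant deltas stay local, saving WAN bandwidth."""
    def __init__(self, params):
        self.params = dict(params)                 # parameter -> current value
        self.pending = {k: 0.0 for k in params}    # parameter -> unsent accumulated delta

    def local_update(self, key, delta):
        self.params[key] += delta
        self.pending[key] += delta

    def wan_flush(self):
        to_send = {}
        for key, delta in self.pending.items():
            base = abs(self.params[key]) + 1e-12
            if abs(delta) / base >= SIGNIFICANCE:
                to_send[key] = delta
                self.pending[key] = 0.0
        return to_send                             # ship only these deltas across the WAN

f = SignificanceFilter({"w1": 1.0, "w2": 1.0})
f.local_update("w1", 0.5)       # large relative change
f.local_update("w2", 0.0001)    # tiny change
print(f.wan_flush())            # -> {'w1': 0.5}; w2's delta keeps accumulating locally
```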
Efficient Memory Disaggregation with Infiniswap
Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang Shin, University of Michigan
Memory-intensive applications suffer large performance loss when their working sets do not fully fit in memory. Yet, they cannot leverage otherwise unused remote memory when paging out to disks even in the presence of large imbalance in memory utilizations across a cluster. Existing proposals for memory disaggregation call for new architectures, new hardware designs, and/or new programming models, making them infeasible.
This paper describes the design and implementation of INFINISWAP, a remote memory paging system designed specifically for an RDMA network. INFINISWAP opportunistically harvests and transparently exposes unused memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across many machines’ remote memory. Because one-sided RDMA operations bypass remote CPUs, INFINISWAP leverages the power of many choices to perform decentralized slab placements and evictions.
We have implemented and deployed INFINISWAP on an RDMA cluster without any modifications to user applications or the OS and evaluated its effectiveness using multiple workloads running on unmodified VoltDB, Memcached, PowerGraph, GraphX, and Apache Spark. Using INFINISWAP, throughputs of these applications improve between 4x (0.94x) and 15.4x (7.8x) over disk (Mellanox nbdX), and median and tail latencies between 5.4x (2x) and 6x (2.3x). INFINISWAP achieves these gains with negligible remote CPU usage, whereas nbdX becomes CPU-bound. INFINISWAP increases the overall memory utilization of a cluster and works well at scale.
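The "power of many choices" placement mentioned above can be sketched as follows: sample a few candidate machines and place the slab on the one advertising the most free memory. The slab size, sample size, and host names below are illustrative assumptions, not INFINISWAP's implementation:

```python
# Hypothetical sketch of decentralized, power-of-choices slab placement.
import random

def place_slab(free_mem_mb, num_choices=2, slab_mb=1024):
    """Pick a host for one slab from a dict {host: free memory in MB}.
    Returns the chosen host, or None if no sampled host can fit the slab."""
    candidates = random.sample(list(free_mem_mb), min(num_choices, len(free_mem_mb)))
    best = max(candidates, key=lambda h: free_mem_mb[h])
    return best if free_mem_mb[best] >= slab_mb else None

# Example: distribute one machine's swap space, slab by slab.
cluster = {"host-a": 8192, "host-b": 2048, "host-c": 16384}
placements = {}
for slab_id in range(4):
    host = place_slab(cluster)
    if host is not None:
        placements[slab_id] = host
        cluster[host] -= 1024
print(placements)
```

Sampling a handful of hosts rather than polling the whole cluster keeps placement decentralized while still spreading slabs toward lightly loaded machines.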
Tux²: Distributed Graph Computation for Machine Learning
Wencong Xiao, Beihang University and Microsoft Research; Jilong Xue, Peking University and Microsoft Research; Youshan Miao, Microsoft Research; Zhen Li, Beihang University and Microsoft Research; Cheng Chen and Ming Wu, Microsoft Research; Wei Li, Beihang University; Lidong Zhou, Microsoft Research
TUX2 is a new distributed graph engine that bridges graph computation and distributed machine learning. TUX2 inherits the benefits of an elegant graph computation model, efficient graph layout, and balanced parallelism to scale to billion-edge graphs; we extend and optimize it for distributed machine learning to support heterogeneity, a Stale Synchronous Parallel model, and a new MEGA (Mini-batch, Exchange, GlobalSync, and Apply) model.
We have developed a set of representative distributed machine learning algorithms in TUX2, covering both supervised and unsupervised learning. Compared to implementations on distributed machine learning platforms, writing these algorithms in TUX2 takes only about 25% of the code: Our graph computation model hides the detailed management of data layout, partitioning, and parallelism from developers. Our extensive evaluation of TUX2, using large data sets with up to 64 billion edges, shows that TUX2 outperforms state-of-the-art distributed graph engines PowerGraph and PowerLyra by an order of magnitude, while beating two state-of-the-art distributed machine learning systems by at least 48%.
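The MEGA stages can be pictured as a single training loop over graph partitions. The following is a minimal Python sketch of that control flow only; the stage functions are user-supplied placeholders, not TUX2's API:

```python
# Hypothetical sketch of the MEGA (Mini-batch, Exchange, GlobalSync, Apply) stages.
def mega_loop(partitions, num_epochs, exchange, global_sync, apply_update):
    for epoch in range(num_epochs):
        for mini_batch in partitions:          # Mini-batch: a subset of vertices/edges
            messages = exchange(mini_batch)    # Exchange: propagate data along edges
            shared = global_sync(messages)     # GlobalSync: aggregate shared model state
            apply_update(mini_batch, shared)   # Apply: update vertex data with the result
```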
3:40 pm–4:10 pm
Break with Refreshments
Grand Ballroom Foyer
4:10 pm–5:50 pm
Network Verification and Debugging
Session Chair: Minlan Yu, Yale University
Correct by Construction Networks Using Stepwise Refinement
Building software-defined network controllers is an exercise in software development and, as such, is likely to introduce bugs. We present Cocoon, a framework for SDN development that facilitates both the design and verification of complex networks, using stepwise refinement to move from a high-level specification to the final network implementation.
A Cocoon user specifies intermediate design levels in a hierarchical design process that delineates the modularity in complicated network forwarding and makes verification extremely efficient. For example, an enterprise network, equipped with VLANs, ACLs, and Level 2 and Level 3 routing, can be decomposed cleanly into abstractions for each mechanism, and the resulting stepwise verification is over 200x faster than verifying the final implementation. Cocoon further separates static network design from its dynamically changing configuration. The former is verified at design time, while the latter is checked at run time using statically defined invariants. We present six different SDN use cases including B4 and F10. Our performance evaluation demonstrates that Cocoon is not only faster than existing verification tools but can also find many bugs statically before the network design has been fully specified.
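The core refinement obligation is that a lower-level design forwards every packet exactly as the higher-level specification does. Cocoon discharges this symbolically; the toy Python sketch below merely tests it on sample packets, with a made-up department/VLAN policy:

```python
# Hypothetical illustration of refinement checking between a spec and a design.
def spec_forward(pkt):
    """High-level spec: hosts in the same department may reach each other."""
    return pkt["src_dept"] == pkt["dst_dept"]

def impl_forward(pkt):
    """Refined design: the same policy realized with an (illustrative) VLAN tag."""
    vlan = {"eng": 10, "sales": 20}
    return vlan[pkt["src_dept"]] == vlan[pkt["dst_dept"]]

def check_refinement(packets):
    """The design refines the spec if both forward every packet identically."""
    return all(spec_forward(p) == impl_forward(p) for p in packets)

packets = [
    {"src_dept": "eng", "dst_dept": "eng"},
    {"src_dept": "eng", "dst_dept": "sales"},
]
assert check_refinement(packets)
```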
Verifying Reachability in Networks with Mutable Datapaths
Aurojit Panda, University of California, Berkeley; Ori Lahav, Max Planck Institute for Software Systems (MPI-SWS); Katerina Argyraki, École Polytechnique Fédérale de Lausanne (EPFL); Mooly Sagiv, Tel Aviv University; Scott Shenker, University of California, Berkeley, and International Computer Science Institute
Recent work has made great progress in verifying the forwarding correctness of networks [26–28, 35]. However, these approaches cannot be used to verify networks containing middleboxes, such as caches and firewalls, whose forwarding behavior depends on previously observed traffic. We explore how to verify reachability properties for networks that include such “mutable datapath” elements, both for the original network and in the presence of failures. The main challenge lies in handling large and complicated networks. We achieve scaling by developing and leveraging the concept of slices, which allow network-wide verification to only require analyzing small portions of the network. We show that with slices the time required to verify an invariant on many production networks is independent of the size of the network itself.
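The slicing idea can be illustrated with a toy reachability check: instead of analyzing the whole topology, verify the invariant on the small subgraph that can influence it. The graph encoding below is an illustrative simplification and omits the paper's handling of mutable middlebox state:

```python
# Hypothetical sketch of checking a reachability invariant on a slice.
from collections import deque

def reachable(adj, src, dst):
    """BFS reachability over an adjacency dict {node: [next hops]}."""
    seen, frontier = {src}, deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:
            return True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

def slice_of(adj, nodes):
    """Restrict the network to a slice containing only the given nodes."""
    return {n: [m for m in adj.get(n, []) if m in nodes] for n in nodes}

full = {"h1": ["fw"], "fw": ["h2"], "h2": [], "h3": ["h4"], "h4": []}
small = slice_of(full, {"h1", "fw", "h2"})   # the slice relevant to h1 -> h2
assert reachable(small, "h1", "h2") == reachable(full, "h1", "h2")
```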
Automated Bug Removal for Software-Defined Networks
Yang Wu, Ang Chen, and Andreas Haeberlen, University of Pennsylvania; Wenchao Zhou, Georgetown University; Boon Thau Loo, University of Pennsylvania
When debugging an SDN application, diagnosing the problem is merely the first step: the operator must still find a fix that solves the problem, without causing new problems elsewhere. However, most existing debuggers focus exclusively on diagnosis and offer the network operator little or no help with finding an effective fix. Finding a suitable fix is difficult because the number of candidates can be enormous.
In this paper, we propose a step towards automated repair for SDN applications. Our approach consists of two elements. The first is a data structure that we call meta provenance, which can be used to efficiently find good candidate repairs. Meta provenance is inspired by the provenance concept from the database community; however, whereas standard provenance can only reason about changes to data, meta provenance can also reason about changes to programs. The second element is a system that can efficiently backtest a set of candidate repairs using historical data from the network. This is used to eliminate candidate repairs that do not work well, or that cause other problems.
We have implemented a system that maintains meta provenance for SDNs, as well as a prototype debugger that uses the meta provenance to automatically suggest repairs. Results from several case studies show that, for problems of moderate complexity, our debugger can find high-quality repairs within one minute.
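The backtesting step described above can be sketched as a filter over candidate repairs: replay recorded events through each candidate and keep only those that satisfy all invariants. The data representation below (repairs as callables, invariants as predicates) is an assumption for illustration, not the paper's system:

```python
# Hypothetical sketch of backtesting candidate repairs against historical events.
def backtest(candidate_repairs, history, invariants):
    """candidate_repairs: callables mapping an event to a forwarding decision
    history:            list of recorded network events
    invariants:         predicates over (event, decision) pairs
    Returns the repairs that violate no invariant on the replayed history."""
    surviving = []
    for repair in candidate_repairs:
        decisions = [(event, repair(event)) for event in history]
        if all(inv(event, decision) for event, decision in decisions for inv in invariants):
            surviving.append(repair)
    return surviving
```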
Delta-net: Real-time Network Verification Using Atoms
Alex Horn, Fujitsu Labs of America; Ali Kheradmand, University of Illinois at Urbana–Champaign; Mukul Prasad, Fujitsu Labs of America
Real-time network verification promises to automatically detect violations of network-wide reachability invariants on the data plane. To be useful in practice, these violations need to be detected in the order of milliseconds, without raising false alarms. To date, most real-time data plane checkers address this problem by exploiting at least one of the following two observations: (i) only small parts of the network tend to be affected by typical changes to the data plane, and (ii) many different packets tend to share the same forwarding behaviour in the entire network. This paper shows how to effectively exploit a third characteristic of the problem, namely: similarity among forwarding behaviour of packets through parts of the network, rather than its entirety. We propose the first provably amortized quasi-linear algorithm to do so. We implement our algorithm in a new real-time data plane checker, Delta-net. Our experiments with SDN-IP, a globally deployed ONOS software-defined networking application, and several hundred million IP prefix rules generated using topologies and BGP updates from real-world deployed networks, show that Delta-net checks a rule insertion or removal in approximately 40 microseconds on average, a more than 10x improvement over the state-of-the-art. We also show that Delta-net eliminates an inherent bottleneck in the state-of-the-art that restricts its use in answering Datalog-style “what if” queries.
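A toy version of the atom idea: keep the address space partitioned into disjoint intervals induced by all rule boundaries, so inserting a rule only splits the atoms its range overlaps. The sketch below tracks boundaries only and omits Delta-net's per-atom forwarding graphs; the class name and example prefixes are illustrative:

```python
# Hypothetical sketch of interval "atoms" over the IPv4 address space.
import bisect

class AtomSpace:
    def __init__(self, lo=0, hi=2**32):
        self.bounds = [lo, hi]   # sorted boundaries; atoms are [bounds[i], bounds[i+1])

    def insert_rule(self, lo, hi):
        """Add a rule matching the range [lo, hi); return the atoms it now covers."""
        for b in (lo, hi):
            i = bisect.bisect_left(self.bounds, b)
            if i == len(self.bounds) or self.bounds[i] != b:
                self.bounds.insert(i, b)          # split the atom containing b
        i = bisect.bisect_left(self.bounds, lo)
        j = bisect.bisect_left(self.bounds, hi)
        return [(self.bounds[k], self.bounds[k + 1]) for k in range(i, j)]

space = AtomSpace()
print(space.insert_rule(0x0A000000, 0x0A010000))  # e.g. 10.0.0.0/16 as a range
print(space.insert_rule(0x0A008000, 0x0A008100))  # a narrower rule splits one atom
```

Because only the atoms overlapping the new rule are touched, the work per rule insertion or removal stays proportional to the change rather than to the size of the network.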