ABSTRACT
Today's data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Non-uniform bandwidth among data center nodes complicates application design and limits overall system performance.
In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.
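As a rough illustration of the bandwidth shortfall described above, the following minimal sketch computes the fraction of aggregate edge bandwidth that a layer of uplinks can carry. The function name and all numbers are hypothetical and chosen only to reproduce the 50% case; they are not taken from the paper.

```python
def supported_fraction(hosts: int, host_gbps: float, uplink_gbps_total: float) -> float:
    """Fraction of aggregate edge bandwidth that the uplinks can carry.

    hosts, host_gbps and uplink_gbps_total are illustrative parameters,
    not values reported in the paper.
    """
    edge_gbps = hosts * host_gbps  # aggregate bandwidth available at the edge
    return min(1.0, uplink_gbps_total / edge_gbps)


# Hypothetical example: 48 hosts at 1 Gb/s each behind 24 Gb/s of uplink capacity.
fraction = supported_fraction(48, 1.0, 24.0)
print(f"{fraction:.0%} of edge bandwidth supported")  # prints "50% of edge bandwidth supported"
```

Under these assumed numbers the uplinks are oversubscribed 2:1, so hosts communicating across the uplinks can collectively use only half of their combined link capacity.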