DOI: 10.1145/3190508.3190546
research-article

Service fabric: a distributed platform for building microservices in the cloud

Published: 23 April 2018

ABSTRACT

We describe Service Fabric (SF), Microsoft's distributed platform for building, running, and maintaining microservice applications in the cloud. SF has been running in production for more than 10 years, powering many critical services at Microsoft. This paper outlines key design philosophies in SF. We then adopt a bottom-up approach to describe the low-level components of its architecture, focusing on their modular use and on support for strong semantics such as fault tolerance and consistency within each component of SF. We discuss lessons learned and present experimental results from production data.

Published in

EuroSys '18: Proceedings of the Thirteenth EuroSys Conference
April 2018
631 pages
ISBN: 9781450355841
DOI: 10.1145/3190508

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

EuroSys '18 paper acceptance rate: 43 of 262 submissions (16%)
Overall acceptance rate: 241 of 1,308 submissions (18%)