ABSTRACT
We describe Service Fabric (SF), Microsoft's distributed platform for building, running, and maintaining microservice applications in the cloud. SF has been running in production for 10+ years, powering many critical services at Microsoft. This paper outlines key design philosophies in SF. We then adopt a bottom-up approach to describe low-level components in its architecture, focusing on modular use and support for strong semantics like fault-tolerance and consistency within each component of SF. We discuss lessons learned, and present experimental results from production data.