User:Luc taesch/sandbox

From Wikipedia, the free encyclopedia

The concept of Chaos Monkey was invented in 2011 by Netflix to test the resilience of its IT infrastructure.[1] The purpose of the tool is to simulate failures in a real environment and to verify that the computer system continues to work.

Concept

Historically, non-functional requirements were included in the general functional specifications when software was designed. These requirements covered the software's ability to tolerate failures and to remain resilient in order to ensure optimal quality of service. Development teams often skipped these topics, either for lack of time under pressure to deliver quickly or for lack of knowledge of the field.

In 2011, Netflix engineers Yury Izrailevsky, now Director of Cloud & Infrastructure, and Ariel Tseitlin, now Director of Cloud Solutions,[2] had the idea of changing this paradigm by running a tool that deliberately causes breakdowns in the production environment, the real environment used by Netflix customers. They proposed moving from a model where teams build software hoping that nothing will break to a model where they know for certain that failures will occur, because the failures are provoked. Taking resilience into account in software design is then no longer an option but an obligation:

"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services."

"Imagine a monkey entering a data center, one of those server farms that host all the critical functions of our online activities. The monkey randomly rips out cables, destroys devices and overturns everything within reach. The challenge for IT managers is to design the information system they are responsible for so that it keeps working despite these monkeys, since no one ever knows when they will arrive or what they will destroy."

Netflix released the source code for this tool in 2012.

Different variants of the Simian Army

Netflix Simian Army

The Simian Army is a suite of tools developed by Netflix to test the reliability, security, and resiliency of its Amazon Web Services infrastructure.

  • Chaos Monkey

The first tool developed by Netflix. It randomly selects instances in the production environment and deliberately takes them out of service.

  • Chaos Gorilla

At the very top of the Simian Army hierarchy, Chaos Gorilla takes down an entire Amazon Availability Zone.<ref>[http://www.ictjournal.ch/news/2016-10-27/resilience-cloud-netflix-mises-on-its-shared-subscriptions-website-sites "Cloud Resiliency: Netflix Bets on its Killer Monkeys AWS instances"].</ref>

  • Latency Monkey

Tool that introduces artificial delays at the communication layer to test how well the system tolerates degraded performance in an external component it depends on. Very long delays can simulate a complete outage (an infinite delay) without having to ask the partner concerned to shut down its service.

  • Doctor Monkey

Tool that detects instances showing signs of ill health, such as CPU overload, and removes them from service so that the root cause can be analyzed, or eventually shuts them down.

  • Janitor Monkey

Tool that disables unused instances to avoid wasting resources.

  • Conformity Monkey

Tool that disables any nonconforming instance so that the system can recreate it properly.

  • Security Monkey

Derived from Conformity Monkey, a tool that disables all instances that have vulnerabilities.

  • 10-18 Monkey

Tool that detects localization and internationalization (l10n-i18n) problems on instances.
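
As an illustration of the Chaos Monkey principle described in this list (pick a random instance and disable it during usual hours of activity), here is a minimal in-memory sketch. It is not Netflix's implementation: the `Instance` and `ChaosMonkeySketch` names, the 09:00 to 17:00 weekday window, and the idea of flipping a `running` flag instead of calling a cloud API are all assumptions made for the example.

```python
import random
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Instance:
    # Hypothetical stand-in for a cloud server instance.
    instance_id: str
    running: bool = True


@dataclass
class ChaosMonkeySketch:
    instances: list
    # Assumed "usual hours of activity": 09:00 to 17:00, Monday to Friday.
    start_hour: int = 9
    end_hour: int = 17
    rng: random.Random = field(default_factory=random.Random)

    def in_working_hours(self, now: datetime) -> bool:
        # weekday() is 0 for Monday and 6 for Sunday.
        return now.weekday() < 5 and self.start_hour <= now.hour < self.end_hour

    def unleash(self, now: datetime):
        """Disable one randomly chosen running instance, or return None."""
        if not self.in_working_hours(now):
            return None
        candidates = [i for i in self.instances if i.running]
        if not candidates:
            return None
        victim = self.rng.choice(candidates)
        victim.running = False  # A real tool would terminate the server here.
        return victim


if __name__ == "__main__":
    fleet = [Instance(f"i-{n}") for n in range(5)]
    monkey = ChaosMonkeySketch(fleet)
    victim = monkey.unleash(datetime.now())
    print("disabled:", victim.instance_id if victim else "nothing (outside hours)")
```

Restricting the failures to working hours reflects the idea quoted above: engineers are present to observe and fix whatever the induced failure reveals.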

Chaos Monkey and DevOps

The DevOps Tool Chain

As part of the DevOps movement, special attention is paid to the safe operation of computer systems, which provides a sufficient level of confidence despite frequent releases. As a contribution to the DevOps tool chain, Chaos Monkey meets the need for continuous testing.

It belongs to the "Design for failure" pattern<ref>"The Great Patterns of the Web - Design for failure", OCTO Talks!. Accessed 2017-10-22.</ref>: a computer application must be able to withstand the failure of any underlying software or hardware component.
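
The "Design for failure" idea can be sketched in a few lines: the caller assumes its dependency will fail and prepares a degraded answer in advance, while a small fault injector provokes the failure on purpose. This is only an illustrative sketch, and every name in it (`flaky`, `with_fallback`, `personalized_titles`, `popular_titles`, `DependencyDown`) is hypothetical and does not come from Netflix's tools.

```python
import random


class DependencyDown(Exception):
    """Raised by the hypothetical fault injector to simulate an outage."""


def flaky(func, failure_rate, rng):
    """Wrap func so that each call fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyDown(func.__name__)
        return func(*args, **kwargs)
    return wrapper


def with_fallback(primary, fallback):
    """Call primary; if it raises, serve the fallback result instead."""
    def wrapper(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback(*args, **kwargs)
    return wrapper


def personalized_titles(user_id):
    # Stand-in for a remote service that may be unavailable.
    return [f"pick-for-{user_id}"]


def popular_titles(user_id):
    # Degraded but always available answer.
    return ["popular-1", "popular-2"]


if __name__ == "__main__":
    rng = random.Random()
    unreliable = flaky(personalized_titles, failure_rate=0.5, rng=rng)
    recommend = with_fallback(unreliable, popular_titles)
    for _ in range(4):
        print(recommend("u1"))
```

Running the injector continuously, rather than waiting for a real outage, is what forces the fallback path to be exercised and kept working.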

Related projects

Chaos Engineering

"Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production."<ref>[http://principlesofchaos.org/ "Principles of Chaos Engineering"]. Accessed 2017-10-21.</ref> It is a community built around the principles defined on the site http://principlesofchaos.org/, initiated by Netflix.<ref>[https://www.infoq.com/en/news/2014/10/netflix-chaos-engineering "Chaos Engineering by Netflix"]. Retrieved 2017-10-22.</ref>

Facebook Storm

To prepare for the loss of a data center, Facebook regularly tests the resistance of its infrastructure to extreme events. Known as Project Storm, the program simulates massive data center failures.

Days of Chaos

Inspired by AWS GameDays, volunteer application teams at Voyages-sncf.com took part in a Day of Chaos to test the resilience of their applications. Every 30 minutes, operators simulated failures in pre-production. Teams earned points for detecting, diagnosing and resolving them. This type of gamified event helps to introduce development teams to the concept of resilience.

The concept was presented at the 2017 DevOps REX conference<ref>devops REX, "Days of Chaos: the development of the devops culture at Voyages-Sn ...", Slideshare, 2017-10-03.</ref> and is described on the site http://days-of-chaos.com, which aims to collect other such experiments.

Chaos Toolkit

The Chaos Toolkit was born from the desire to simplify access to the discipline of Chaos Engineering and to demonstrate that the experimental approach can be applied at different levels: infrastructure, platform, and application. The Chaos Toolkit is an open-source tool, licensed under Apache 2 and published in October 2017.<ref>[https://medium.com/russmiles/introducing-and-extending-the-chaos-toolkit-ddfa142acc2b "Introducing and Extending the Chaos Toolkit"]. 2017-10-06.</ref>

Notes and references

  1. ^ "The Netflix Simian Army". Netflix TechBlog. 2011-07-19. Retrieved 2017-10-21.
  2. ^ "The Netflix Simian Army". https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116. 2011-07-19. Retrieved 2017-10-21.
