AWS Well Architected Framework (WAF) – a critical view

I now work with AWS services on a daily basis, and I began to wonder what this thing called WAF actually is. As a software engineer I do have my concerns with architecture. As discussed in my two articles on principal engineering, I prefer not to use the word architecture. I very much prefer principles.

Most software today is very much like an Egyptian pyramid with millions of bricks piled on top of each other, with no structural integrity, but just done by brute force and thousands of slaves.
Alan Kay

Alan is known for very interesting quotes and a rather interesting vision of software engineering. He co-created the term object and OOP (object-oriented programming), so it might come as a bit of a shock that he regrets that name today. He is right in the quote above, though. As engineers we focus on producing code more than on finding solutions to a problem. Even less time is spent trying to understand whether there is a problem to solve in the first place.

Much of this is challenged by the original UNIX philosophy, where worse is better. It is still one of the most widely preached approaches among software engineers. Things should be kept simple; efficiency comes second. Long-term vision depends on what happens in the future. Interfaces matter less as long as you can emit text output and pass it on. Interoperability and portability are key. You should build individual, simple programs to handle a problem (do you still think microservices are a new invention?).

All of that worked on smaller-scale and academic projects. When engineering went massively commercial, we lost our way. Complex, monolithic behemoths were created. People tried to solve the resulting problems with methods from civil engineering and the construction industry. While a few of the analogies might have worked well, software engineering was changing too rapidly for standards like IEEE 1471 or ISO 42010 to keep up. A lot of similar front-loaded approaches followed. This proved to be a failed approach.

Now you might think I’m ‘anti-architecture’ in the sense of a good architecture description, but the real truth is I am simply tired of seeing architects fail because they focus on conformance and governance based documentation over real transformation. 42010 and 1471 are a part of that problem. I am calling for a new standard in architecture. One that focuses on the architect and their primary value function, business technology strategy, instead of documentation and model wizard.
Paul Preiss, https://itabok.iasaglobal.org/the-biggest-problems-with-iso-42010-and-ieee-1471/

A lot of this early thinking on architecting, computing and processing was created when engineering was a domain that focused mainly on engineer-machine or engineer-engineer types of problems. It has changed a great deal since.

Computer engineering is constantly evolving, and the problems we are trying to solve change very rapidly. One might argue that the early era of computing (the 1960s to the 1990s) has little to do with the scale and type of challenges we solve today.

One, the general population has become more proficient with computers. There are now generations of people who can’t imagine their lives without technology. Contrary to the UNIX philosophy, which treated UIs as secondary, it is UI and UX that rule the general world. Efficiency is important for keeping both your costs down and your user base in. Speed of delivery matters a lot more than it did in the past. In general, the more inefficient MVP code you create without knowing its future use, the more inefficient it will become as you keep building both extensions and new services on top of it.

Two, engineering departments have become the centres of value generation for the majority of companies on the market. It is this department that builds and releases either key support for your products or, well, your products.

Three, we operate in a web-centric model. The internet is the first place people go to look for or acquire resources; in many cases this even includes basic necessities.

The old world and the word architecture, along with many old-school software engineering paradigms, fail to deal with these three major changes or deal with them very poorly. How do OOP and old-school design patterns deal with the fact that most of the world’s population is concentrated around messages? How does planning architecture upfront deal with the fact that you might scale 100 times within months? How do you deal with the fact that your company might need to pivot multiple times a year?

With $1.2 billion in investment, 8 million active users, and 3 million paid users, Slack went from a failed gaming startup to one of the highest-value startups in the world. (in 2018)
John Koetsier, https://www.forbes.com/sites/johnkoetsier/2018/11/30/how-slack-became-the-fastest-growing-enterprise-software-ever/?sh=70a72fbf6e7a

The clash between the old approach to engineering and the problems of the new world has created an enormous problem.

But hold on, we were going to discuss AWS WAF, so what do all these things have to do with it?

[Figure: The five pillars of AWS WAF. Image source: VOLANSYS]

The framework from AWS stands on the 5 pillars above. Before I read it, I had the usual attitude: that it was crafted only to serve Amazon’s sales needs. I expected little more than yet another marketing tool. I was wrong.

The framework is the result of Amazon’s own years of experience in building modern services.

Every day, experts at AWS assist customers in architecting systems to take advantage of best practices in the cloud. We work with you on making architectural trade-offs as your designs evolve. As you deploy these systems into live environments, we learn how well these systems perform and the consequences of those trade-offs.

Based on what we have learned, we have created the AWS Well-Architected Framework, which provides a consistent set of best practices for customers and partners to evaluate architectures, and provides a set of questions you can use to evaluate how well an architecture is aligned to AWS best practices.
AWS Well Architected Framework, https://docs.aws.amazon.com/wellarchitected/latest/framework/wellarchitected-framework.pdf

The paper, along with the many other online resources that accompany it, is a set of guiding principles. And questions. I was surprised by how it avoids diagrams and very specific UML-driven prescriptions. It does recommend AWS services that can resolve some of the issues it raises, but that is not the key message.

The one thing it does perfectly, in my opinion, is deal with the real engineering problems we face today. You might decide to use a different technology provider than AWS, but the principles presented will still work.

I have been writing on this blog about outsourcing complexity. In this respect, using services that are optimized across hundreds of use cases might be a better choice for you than building tonnes of internal code you will need to maintain. WAF provides you with answers.

In my first article on principal engineering, I covered the concept of a pattern language. Again, WAF responds to this challenge pretty well, normalizing your concerns into the 5 most important pillars: five patterns you need to cater for, each then extended into multiple sub-patterns.

The framework answers my three big questions listed above well. There is a lot of effort put into doing things internet-first.

This is how the introductory paragraphs relate to AWS WAF. They state the questions; WAF is the answer.

I wouldn’t be an engineer if I did not attempt to add a few critical words. It is hard. WAF is a recommendation that helps you assess your individual needs through a set of questions, which then helps to guide your future direction. It is not a prescription. At the same time, it is always worth trying to see where the WAF-recommended approach might not apply. I have chosen a few examples.

Stop guessing your capacity needs

With this principle AWS tries to guide us towards planning capacity only from the actual production load. The advised approach is autoscaling. It works well on paper, though in real life autoscaling events might be less smooth than you expect. Some hardcore examples I know of where autoscaling was not so great are S3 partitioning issues (to be fair, AWS advises good data partitioning upfront, but who does that for a system that is still simple?) and a large portion of the cache being dropped from CloudFront.

For simple systems with low traffic running on ECS, it is good to have 2 standing instances. Just a bit of extra redundancy. Everything fails. A system that handles 200 messages a day will suffer more, in relative terms, from the loss of 1 message during scaling (0.5% of its daily traffic) than one that handles 1,000,000 (0.0001%).

I am also not a huge fan of scaling on CPU or memory alone, as I have seen the number of requests overwhelm services without early-enough warning. Also, the metrics AWS exposes from Docker containers can be slightly misleading, sometimes bordering on very misleading.

Solution: I still tend to go for slight over-provisioning to cover odd traffic peaks. Which is guessing capacity.
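To make the trade-off concrete, here is a minimal sketch of sizing by request rate instead of CPU, with a bit of headroom and a floor of 2 standing instances. The function name, the headroom value and the per-instance capacity are all hypothetical illustrations of my own, not AWS defaults or an AWS API.

```python
import math


def desired_instances(requests_per_second: float,
                      capacity_per_instance: float,
                      headroom: float = 0.25,
                      min_instances: int = 2) -> int:
    """Return how many instances to run for an observed request rate.

    Every parameter here is an assumption made for illustration:
    - capacity_per_instance: requests/second one instance handles comfortably
    - headroom: fraction of extra capacity kept for odd traffic peaks
    - min_instances: standing instances kept even when traffic is near zero
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    if requests_per_second < 0 or headroom < 0:
        raise ValueError("requests_per_second and headroom must be non-negative")

    # Inflate the observed load by the headroom, then round up to whole instances.
    needed = requests_per_second * (1 + headroom) / capacity_per_instance
    return max(min_instances, math.ceil(needed))
```

The headroom is exactly the guess the principle tells you to stop making. The sketch only makes that guess explicit and keeps it small.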

Implement a strong identity foundation

In this instance AWS would like us to take a security-first approach and authenticate by default. Again, I am not a huge fan of applying this flatly everywhere. For internal services that are not internet-facing and expose non-personal data, I believe it is more efficient to have no authentication at all. Alternatively, I would go for a model that allows all reads and authenticates writes. This also saves some additional Terraform or code to process the auth.

Solution: There are cases where you don’t need it. VPC rules alone can do the job.
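The allow-all read, authenticate writes model can be sketched in a few lines. This is an illustration under my own assumptions, not an AWS mechanism: the function name, the set of read methods and the token check are all hypothetical, and a real service would validate credentials properly instead of looking them up in a set.

```python
from typing import FrozenSet, Optional

# HTTP methods treated as reads in this sketch (an assumption, adjust to taste).
READ_METHODS: FrozenSet[str] = frozenset({"GET", "HEAD", "OPTIONS"})


def is_allowed(method: str,
               token: Optional[str],
               valid_tokens: FrozenSet[str]) -> bool:
    """Allow every read; allow writes only when a known token is presented."""
    if method.upper() in READ_METHODS:
        return True  # reads are open to anyone inside the network boundary
    # Writes (POST, PUT, DELETE, ...) need a token we recognise.
    return token is not None and token in valid_tokens
```

This only makes sense behind the VPC rules mentioned above. The network boundary does the work that authentication would otherwise do for reads.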

Manage change in automation

I will not challenge the need for automation. All I challenge is that extensive automation and pipelines might slow you down during service recovery. While AWS advises having automated recovery procedures, your organization needs to reach a very high standard of maturity before that is possible. Also, building that kind of redundancy into the initial versions of a service would be slightly over the top.

Solution: If rolling back to the previous version takes more than 1 minute, think about where you got it wrong.
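One way to read that rule: a rollback should be a lookup of something you already have, not a fresh trip through the whole pipeline. Here is a minimal sketch, assuming previously deployed versions stay available. The class and its methods are hypothetical names of my own, not part of any AWS service.

```python
from typing import List, Optional


class ReleaseHistory:
    """Remembers deployed versions so a rollback is a lookup, not a rebuild."""

    def __init__(self) -> None:
        self._versions: List[str] = []

    def deploy(self, version: str) -> None:
        """Record a version as the one currently serving traffic."""
        self._versions.append(version)

    @property
    def current(self) -> Optional[str]:
        """The version serving traffic now, or None if nothing is deployed."""
        return self._versions[-1] if self._versions else None

    def rollback(self) -> str:
        """Drop the current version and return the one that came before it."""
        if len(self._versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._versions.pop()
        return self._versions[-1]
```

If the previous artifact has to be rebuilt before it can be served again, the minute is already gone.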

Go global in minutes

Unless you don’t need to go global in minutes. Deploying to every region in the world when your service is mainly used in one country is definitely a huge overhead. It’s great to think global.

Solution: Go local in minutes, if you need to. Each AWS region has several availability zones. In most cases (catastrophic failures aside) that is sufficient.

If you know the enemy and know yourself, you need not fear the result of a hundred battles.
Sun Tzu

As you have probably noticed by now, my comments on WAF are rather cosmetic. I have not found a better framework to help guide software decisions against modern problems, and I am more than keen to start using it properly and frequently myself.

To summarize today’s lessons:

There are still a lot of engineering paradigms from the past in use that do not deal well with the engineering problems of today
Software engineering architecture is in fact a set of principles that should be discussed frequently, rather than architecture in its construction industry sense
AWS Well Architected Framework provides a lot of principles and questions on how to build applications, products and services that solve the problems of today (and it is based on practical knowledge, not academic assumptions)
“Are you well architected?” is a question that requires an in-depth review, not the drawing of hundreds of useless diagrams
Discuss, plan, do, discuss, plan, do, until you arrive at a vision of your product that works for you

