
Yep. Amazon (GCP, Azure) are making bank off the idea that you can just pay opex to run in multiple regions and lay off your SREs, but if you really wanted high availability (and not just some service credits when things go wrong) you would spend the engineering effort to go multi-provider. At that point on-prem looks better, but at least with multi-provider you’re no longer in the business of ordering hardware and power.

The idea that failures between different systems are uncorrelated doesn’t even work at a hardware level (e.g. two different sticks of RAM); it’s pure fantasy when you’re talking about software stacks, especially when you have things like the blow-up-the-world button called “BGP.” I’m deeply uncomfortable reading formulas for availability that add or subtract nines. If there’s real money on the line, you can afford to do some better math.
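To make the correlation point concrete, here is a minimal Python sketch of a common-cause failure model. Everything in it is an assumption for illustration: two replicas, a shared event (a bad routing change, say) that takes both down with probability `q`, and an independent per-replica failure probability `p`. The numbers are invented, not measurements from any provider.

```python
def naive_pair_availability(q: float, p: float) -> float:
    """Treat the two replicas as independent, using each one's marginal
    unavailability (this is what adding nines implicitly does)."""
    marginal_unavailability = 1.0 - (1.0 - q) * (1.0 - p)
    return 1.0 - marginal_unavailability ** 2


def actual_pair_availability(q: float, p: float) -> float:
    """The pair is up only if the shared event did not happen AND
    at least one replica survived its own independent failure."""
    return (1.0 - q) * (1.0 - p ** 2)


# Hypothetical figures: each replica looks roughly "two nines" on its own,
# and about a tenth of that downtime comes from the shared cause.
q = 0.001  # probability of the common-cause event (assumed)
p = 0.009  # independent per-replica unavailability (assumed)

naive = naive_pair_availability(q, p)
actual = actual_pair_availability(q, p)

print(f"naive estimate:  {naive:.6f}")
print(f"with correlation: {actual:.6f}")
print(f"downtime ratio:   {(1 - actual) / (1 - naive):.1f}x")
```

In this toy model the pair can never be less unavailable than `q`, no matter how many replicas you add, which is the part the nines arithmetic hides.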

Disclosure: Work at a cloud provider, opinions are my own.



I can't speak for your employer, but at AWS this is not our intent. Our job is to help customers build better architectures, and if you read the framework you will see it's vendor-agnostic.

You can't approach reliability with a "the sky is falling, so there's no point" attitude. You are actually going to have to think about component failure, blast radius, and what happens after a failure.

Rather than spreading FUD, why not help customers by showing them the better math?


> I can't speak for your employer, but at AWS this is not our intent.

Lost the antecedent, here. What is not your intent?

Granted my understanding of cloud economics is somewhat murky. My general impression is that the big selling point of cloud is the move from capex to opex, and the second selling point is that you don’t need the same level of operations expertise to run cloud compared to on-prem. My hot take is that for high reliability you still need tons of operations expertise, and in these scenarios, solutions like (partial) on-prem and multi-provider become much more favorable.

> rather than spreading FUD, why not help customers by showing them the better math?

I feel like I’m being accused of spreading FUD, and I want to know why? All I am really trying to say here is that you can’t just do simple arithmetic on published #nines and end up with something that approximates the truth for your service in a useful way. Depending on the operation of your business this approximation may be acceptable or it may not be.

Just to recap, the bad math is to come up with some threshold for acceptable performance in all components, model each as an independent Bernoulli variable, and then plug them into some boolean formula. This is the math published in the guide here, and you can do it with arithmetic on #nines. This is bad because it leads you towards a shallow understanding of your system and creates pressure to inflate estimates of availability beyond actual availability, sometimes by a lot.
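For concreteness, the arithmetic being criticized can be sketched in a few lines of Python. This is only an illustration of the naive approach, not the guide's own code; the helper names (`nines_to_availability`, `series`, `parallel`) and the example figures are made up here.

```python
import math


def nines_to_availability(nines: float) -> float:
    """2 nines -> 0.99, 3 nines -> 0.999, and so on."""
    return 1.0 - 10.0 ** (-nines)


def availability_to_nines(availability: float) -> float:
    return -math.log10(1.0 - availability)


def series(*availabilities: float) -> float:
    """All components must be up (boolean AND), assuming independence."""
    return math.prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one component must be up (boolean OR), assuming independence."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)


# Two redundant "two nines" replicas behind a "three nines" dependency.
replica = nines_to_availability(2)
redundant = parallel(replica, replica)   # nines "add": 2 + 2 = 4
chain = series(redundant, nines_to_availability(3))

print(f"redundant pair: {availability_to_nines(redundant):.2f} nines")
print(f"whole chain:    {availability_to_nines(chain):.2f} nines")
```

Every line of this depends on the independence assumption in `series` and `parallel`, which is exactly the assumption that tends not to hold.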

Unfortunately, if you are really interested in calculating availability you need to come up with a model for your particular system. This is a complex subject, and the model has to strike some balance between accuracy, simplicity (so it can be understood and used to inform strategy), usefulness (to customers / downstream), and supportability (to engineers working on the service). I can't tell you how to do that; my best guess is to hire someone who knows enough statistics to be dangerous and lock them in a room for three months with a terminal and access to your metrics.

(As a side note, the tools for this kind of analysis are much better for systems like electronic circuits. For example, a part might be labeled as "1% tolerance" but when you are running simulations you use a probability distribution.)
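A rough Python sketch of what "use a probability distribution" could look like for availability, in the spirit of the circuit-simulation analogy. The two components, the Beta distributions, and their parameters are all hypothetical; in practice they would be fitted to your own metrics. Note that this still treats the components as independent and only replaces point estimates with distributions.

```python
import random
import statistics


def simulate_system_availability(trials: int = 10_000, seed: int = 0) -> list[float]:
    """Sample per-component availability from a distribution instead of
    using a single published number, and return sorted system availabilities."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        # Beta(a, b) has mean a / (a + b). Both choices are assumptions.
        frontend = rng.betavariate(999, 1)  # mean 0.999, tightly clustered
        database = rng.betavariate(99, 1)   # mean 0.99, noticeably wider
        results.append(frontend * database)  # both must be up
    results.sort()
    return results


samples = simulate_system_availability()
p05 = samples[int(0.05 * len(samples))]
median = statistics.median(samples)
p95 = samples[int(0.95 * len(samples))]

print(f"5th percentile: {p05:.4f}")
print(f"median:         {median:.4f}")
print(f"95th percentile: {p95:.4f}")
```

The point of looking at percentiles rather than a single product of means is that the bad tail, not the average, is usually what breaks your SLO.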

When you actually do this, you often find out that your system is much less available (reliable, durable) than it was designed to be. It’s common that your cloud provider could stay within SLO and your service would still be “down” as far as your definition went. So then you have the engineering problem of figuring out how to improve things, which may involve multi-region, multi-cloud, on-prem, redesigning parts of your system, changing utilization targets, etc. The model helps because it can reveal key insights like “if you improve latency in this part of the system, it improves availability in this other part of the system”, so you can decide where to spend engineering resources.

I can guarantee you that the folks who are in charge of, say, EC2 are not just doing math with #nines of the systems underlying EC2. They are measuring and modeling. If I really wanted to figure out how to calculate availability of my systems, I would want to read case studies of real-world systems.


On the intent question, you said:

> Yep. Amazon (GCP, Azure) are making bank off the idea that you can just pay opex to run in multiple regions and lay off your SREs, but if you really wanted high availability (and not just some service credits when things go wrong) you would spend the engineering effort to go multi-provider.

That is not AWS's intent.

And it's FUD to suggest that somehow cloud vendors want to make money out of reliability. For example, in AWS, a webapp with a single Region approach using multiple AZs will give you higher availability, but similar costs to a single AZ approach.


> For example, in AWS, a webapp with a single Region approach using multiple AZs will give you higher availability, but similar costs to a single AZ approach.

Whether your costs are similar or not depends on the amount of inter-AZ traffic, which is charged for.


> That is not AWS's intent.

To clarify, I’m not talking about intent or “wants” in any way.

(If you are speaking of AWS’s intent, do you have some kind of privileged knowledge of what AWS “wants” to do?)

> And it's FUD to suggest that somehow cloud vendors want to make money out of reliability.

Let’s forget about what AWS “wants”.

I think that AWS deserves to make money for making reliable services, and that it is right to pay them more money to get more reliability. This is not FUD, this is how business works. When I am working with a cloud provider to host my services, I am not trying to extract the most favorable terms possible—I want a deal where both parties are making money. If AWS is not making money from reliability, then that means, conversely, that I can’t buy it from them (more or less—this is a simplification).

And you can see that reliability is all over AWS’s marketing materials, because reliability is important to AWS’s customers.

> For example, in AWS, a webapp with a single Region approach using multiple AZs will give you higher availability, but similar costs to a single AZ approach.

Reliability is more than just a collection of IaaS products wired together. The fact that two particular configurations have similar price points is not informative.

Single-region, multi-AZ is great for a large percentage of customers. When that’s not enough, you pay more money, and you should do some modeling of system availability because there are plenty of categories of errors that will now prevent you from reaching those higher availability targets.


> the idea that you can just pay opex to run in multiple regions and lay off your SREs,

> That is not AWS's intent.

Yet that's what all their produced literature describes.

> it's FUD to suggest that somehow cloud vendors want to make money out of reliability.

It's literally a selling point. Not sure how mentioning it is FUD.

It's FUD (and rightly so) to point out that your cross-platform solution is likely less production-ready and reliable than AWS (in total) as a single point of failure.


> Yet that's what all their produced literature describes.

No, it's not. I've read it. The Well Architected stuff is actually really good about not being AWS only. These are generally good principles to keep in mind regardless of which provider you go with. And that's how it's written, and that's how AWS teaches it in person (I've actually been through their Well Architected course). Yes, they use AWS tooling to teach it, but nothing about it requires AWS.

So, no. You are completely wrong.

> It's literally a selling point. Not sure how mentioning it is FUD.

Because AWS's pricing doesn't necessarily increase cost with increased reliability. Suggesting that you need to spend more money to increase reliability is FUD.

> It's FUD (and rightly so) to point out that your cross-platform solution is likely less production-ready and reliable than AWS (in total) as a single point of failure.

That's not what was pointed out. Rather, it was pointed out that you can get high reliability without resorting to multi-cloud and higher costs. For you to continue suggesting otherwise, you first have to start off by explaining why you think high availability can't be obtained using a single cloud provider.


> The Well Architected stuff

While that is the topic, that is not what I was referencing. "Produced literature" is a bit more expansive than that and casually accessible (e.g. https://media.amazonwebservices.com/architecturecenter/AWS_a...). If you're going to reply about how someone is wrong, please have the courtesy to digest the statements made. Lashing out with non sequiturs is not constructive, when you could take a statement in good faith and consider that you misinterpreted it.


> Because AWS's pricing doesn't necessarily increase cost with increased reliability. Suggesting that you need to spend more money to increase reliability is FUD.

I do not understand this statement. In general, it costs more to increase the reliability of a system. This includes both infrastructure cost and engineering cost. I also do not understand why this could be FUD. I don’t have any fear, uncertainty, or doubt when I pay more for better services. This is normal and expected. Conversely this means that I can save money if I identify components of my system with lower demands for infrastructure reliability.


You appear to be speaking on behalf of Amazon Web Services but don’t mention this relationship or your company contact information in your HN profile.

Is this normal and allowed at AWS, or do you just not have a Social Media Policy?

You have said you work on the AWS Well-Architected team in your other comments, but it feels like strong statements about intent of a company should be attributable to an employee directly, and not an anonymous internet handle.

Others from AWS like @jeffbarr and @_msw_ seem to be much more transparent when making statements about Amazon.


I generally keep my work and personal profiles separate on social media. When I cross that division I disclose if I have some interest that would influence my statements. My work handle on Twitter is @WellArchitected, and my LinkedIn profile is https://www.linkedin.com/in/philipfitzsimons/ (I'm not in the superstar league of @jeffbarr and @_msw_).


Something that often goes ignored is that in high availability situations like that, humans are the biggest risk by a huge margin. Computers (almost always) don't accidentally wipe out the prod database thinking it was beta; that's usually a human who made a simple mistake.

Dumping money into IaaS while neglecting to recognize that high availability requires extreme quality of both engineering and testing/QA is a super common mistake that's being made these days. This means slowing down, which most upper management doesn't want to do.

Things like executing DR plans regularly to test that they work are extremely costly to the business, and are one of those "IT" things that you hopefully never have to actually do in production. But without things like that, what's the point of 3xing the cost of your fleet and adding complexity (another great vector to cause failure) for "availability"?

I think the answer often comes down to the fact that people are hard to hire. Computers are easy to provision.


Yes, humans are really important in these situations. That's why "Operational Excellence" is the first pillar of AWS Well-Architected: you can only get so far in terms of reliability and security if you don't consider people and process. https://d1.awsstatic.com/whitepapers/architecture/AWS-Operat... (I work on the AWS Well-Architected team)


(erased)


Uh, don't you need to HIRE some SREs to get that multi-everything setup engineered correctly?


Depends on the complexity of your service. The question is also not about what is “true” (you need to hire SREs if you want high reliability) but what is believed. Beliefs are what drive sales. Cloud is so pervasive that customer beliefs about what cloud can/cannot do are different from both reality and from marketing materials.



