Published: 23 March 2020
Summary

Managing costs is a challenge for organizations using public cloud services, but also an opportunity to drive efficient consumption of IT. This research provides I&O technical professionals with a framework to manage costs of cloud-integrated IaaS and PaaS providers such as AWS and Microsoft Azure.

Overview

Key Findings
Recommendations

Technical professionals in charge of IT operations and cloud management should:
Problem Statement

The adoption of cloud computing introduces a number of new challenges, but managing cloud spending is proving to be one of the most difficult. When using public cloud IaaS and PaaS, organizations are billed continuously as consumption occurs, rather than up front as happens when they procure their data center capacity. In cloud computing, organizations are confronted with the difficulty of creating accurate cost estimates. They are often hit by bills that they can't readily explain and struggle to identify the items responsible for the spending. As a result, financial management is often overlooked until spending is out of control. However, cloud computing also allows unprecedented visibility into IT costs, which organizations can use to drive more efficient consumption of IT. Traditional enterprise data centers are equipped with finite, preprocured, capital expenditure (capex)-oriented capacity, but more efficient use of that capacity does not automatically translate into cost savings. The cloud computing model reverses this paradigm. Furthermore, the following factors increase the complexity of managing costs of public cloud:
To control the costs of public cloud IaaS and PaaS, prevent overspending and drive more efficient consumption of cloud services, organizations must develop financial management processes. These processes affect multiple roles and departments, including I&O, the cloud center of excellence (CCOE), finance and the firsthand consumers of cloud services. Ultimately, these processes translate into new management requirements and demand the adoption of new tools. This research guides organizations through how to manage costs of public cloud IaaS and PaaS, to address the listed challenges and unlock new savings opportunities. Only through the thorough application of this research can organizations make their use of public cloud services cost-effective.

The Gartner Approach

Common guidance on public cloud cost management is often limited to a list of pragmatic tasks, such as turning off unused instances or deleting unused storage. While these practices are certainly recommended — and mentioned in this research — common guidance usually fails to establish a strategic and comprehensive view of cost management. Focusing just on operational tasks such as turning off unused instances can cause disruption and frustration among your cloud users, leading to shadow IT. Simply executing cost reduction tasks won't guarantee that spend remains within expectations if nobody has previously set an expectation. Furthermore, without the ability to track and organize costs around cost centers and applications, you will not be able to provide the visibility needed to make your cloud consumers care about how much they spend. Gartner's methodology provides a structured framework for public cloud cost management. It provides guidance not only on the operational aspects, but also on architecture, governance, application development and DevOps.
Using this structured approach, you will be able to set your priorities, involve key stakeholders and determine the organizational changes required to develop and maintain these new capabilities. By applying this methodology, you will initiate the cultural shift that makes your cloud consumers more accountable for their IT spend. Ultimately, you will learn how to manage your costs in relation to the business value that cloud services generate. Public cloud cost management is part of the broader cloud economics discipline. Cloud economics also includes aspects of total cost of ownership (TCO) and ROI calculation as organizations evaluate the adoption of cloud services. These aspects are out of the scope of this document but are covered in

The Guidance Framework

Managing cloud costs is a multifaceted, complex problem. First, organizations must learn how to forecast consumption and set budget expectations. Then, they must gain continuous visibility into what users are spending for each initiative, project or application. Once tracking is established, organizations must seek methods to reduce their monthly bill. Costs can be reduced by leveraging the achieved visibility to detect anomalies and drive corrective actions. As organizations mature in these capabilities, they can achieve scale by automating decision making where possible. Gartner has created a framework to manage cloud spending on an ongoing basis. This guidance framework provides a series of capabilities that organizations must develop to budget, track and optimize cloud spend. The framework applies regardless of where your organization is in its cloud adoption journey and identifies best practices in each of the included components. Managing cloud costs requires the development of capabilities in five distinct areas: The guidance framework is depicted in Figure 1. The logical flow between these five areas should not be interpreted as a mandate to implement them sequentially.
Instead, organizations should apply an iterative approach and develop each area as independently as possible. Although there are obvious dependencies between areas, these shouldn't block the development of subsequent capabilities. For example, you can start reducing your costs even if you don't possess full visibility into your spending. This framework is provider-neutral and can be applied to all major cloud providers, including AWS, GCP and Microsoft Azure. The described approach helps organizations develop a consistent multicloud governance model to achieve operational excellence in managing public cloud spending.

Prework

Before using the framework, technical professionals must complete some fundamental prework. The items described in this section must be completed prior to using any component in the framework. Specifically, you must:
Public cloud cost management is not only a concern for the I&O organization. Similarly, it's not only the finance team that should care about budgets and cost containment. This guidance framework impacts several departments, which should also cooperate in its implementation. Organizations are transforming their culture and processes to adopt cloud computing. They are adopting product-oriented delivery and DevOps. This transformation forces organizations to become more decentralized and to require granular cost controls. The individuals and teams responsible for cloud costs also change as organizations move through different stages of this transformation. You must determine the stakeholders to involve in your cost management practice. At a minimum, the collaboration should embrace the following disciplines:
In many cases, there is a team in charge of both governance and architecture. This team is often the CCOE. Other common names for such a team include "cloud architecture" or "cloud custodians." A CCOE is a centralized cloud computing governance function for the organization as a whole. It serves as an internal cloud service broker. It acts in a consultative role to the consumers of cloud services within the organization. It is a key ingredient for cloud-enabled transformation and is typically tasked with helping drive that transformation. For more information on the role and responsibilities of a CCOE and how to establish one, see

Figure 2 depicts the relationships of the CCOE with the identified stakeholders.

Appoint Ownership of the Cost Management Practice

The finance department does not own the cloud cost management practice. Finance simply doesn't possess the technical knowledge required to make decisions that are heavily tied to technical configurations and operational metrics. For example, resizing a compute instance requires a reboot, and finance isn't equipped to make this decision while fully considering its technical impact. Cost management is primarily a governance matter, and its ownership naturally resides with the IT governance team or the CCOE — when the organization has one. The implementation of this framework starts from its translation into specific policies, which the CCOE is in charge of defining. In this context, IT operations acts as the enforcement arm of the CCOE for certain aspects of the framework, such as cost reduction techniques. In other cases, when there is no CCOE or when I&O is accountable for all IT spending, the operations team can take full ownership of the cost management practice. I&O is already well-versed in monitoring and reporting metrics such as availability and performance. Cost would simply be another metric to track and act on.
Although this is an acceptable approach to begin with, at some point the governance team must take over and take the practice beyond a purely operational perspective. Owning the cost management process doesn't mean executing firsthand each of the listed practices. This execution will actually be distributed among stakeholders. Ultimately, many of the cost management practices, such as tagging and rightsizing, will be executed directly by cloud consumers as they take on more responsibility for their spending. Owning this process means making sure that the cost policies are successfully defined and enforced. While the CCOE makes the "final call" on some of these policies, their definition must be accomplished with a great degree of collaboration and transparency with the identified stakeholders.

Define Cost Accountability

The traditional consumers of IT services within an organization have never been concerned about costs. In an enterprise data center, I&O owns and is accountable for the entire IT budget. The lack of chargeback models for the calculation of unit costs made IT consumers consider the data center literally a "free" resource. However, in such a model, I&O is also in charge of provisioning resources and procuring capacity. Consequently, I&O is in full control of the costs of IT and can simply refuse requests that are not within an established budget. Conversely, the self-service nature of cloud services and their widely available self-service interfaces have shifted some of this control away from I&O. When using cloud services, I&O does not make all resource procurement decisions and should not be held accountable for the entirety of the spending. Organizations must work on shifting this accountability toward the consumers of cloud services. This shift requires a cultural change and does not happen overnight. This Gartner framework (see the Evolve component) helps lay the foundations for this shift to happen over time.
However, when starting to implement cost management, organizations can choose to keep spending ownership in the I&O team for an initial — and limited — time frame. This approach will help accelerate the implementation of potentially disruptive processes such as resource decommissioning or rightsizing. In the future, by applying this guidance framework in its entirety, the ownership of cloud spending becomes decentralized and distributed to all teams in charge of deploying cloud applications and projects.

Select Governance Model

Because cost management is primarily a governance concern, you must define the model you want to use to enforce cost policies. Gartner identifies two main approaches to cloud governance:
The two models are depicted in Figure 3. Gartner believes that cloud computing can be adopted at scale only by adopting an "on the side" governance model. Only this model can unlock the benefits, such as agility and speed, that organizations seek from cloud technologies. If you are adopting cloud computing to accelerate business innovation, you must implement the "on the side" governance model as described in

This framework was developed for organizations that apply such a model. The "in the way" model has proven not to be effective with cloud computing. The centralized IT organization is typically not staffed to provide timely support for the growing number of provisioning and change requests from lines of business (LOBs). Cloud consumers increase pressure and demand on the IT organization, especially because they have the possibility to experience cloud directly. As a consequence, organizations that implement governance "in the way" typically experience a higher degree of shadow IT. See for the strengths and weaknesses of different self-service approaches. Once this prework is complete, organizations can move to implementing the cost management framework for integrated IaaS and PaaS. The rest of the document describes the framework in detail.

Plan

Organizations need to develop capabilities to produce an application budget and consumption forecast as accurately as possible. Setting an expectation upfront creates a baseline against which the organization can measure actual consumption. Develop this capability and run this process prior to deploying applications, projects and workloads in the public cloud. Create forecasts for each new application you deploy in public cloud environments and for each application you migrate from on-premises into a public cloud environment. The Plan component of this guidance framework is depicted in Figure 4.
Define Requirements

This capability is a consultative function provided by the CCOE to uncover the application's nonfunctional requirements that have an impact on costs. The goal is to identify the precise outcomes that drive the subsequent cloud services design and avoid overarchitecting applications. Collaborate with product owners and business stakeholders to understand the purpose of each application and the value it delivers to the organization. Determine the key metrics that demand the use of specific architectural principles. Even when your cloud consumers believe they have predetermined the required set of cloud services, challenge their assumptions and clarify what they're trying to accomplish. The more cloud consumers are accountable for their costs, the more they value the support of the CCOE in this planning activity. Consumers shouldn't consider this activity an administrative burden that slows down their cloud projects. Rather, you must position the CCOE as a subject matter expert that helps cloud consumers accomplish their goals and determine the most cost-effective architecture to deliver their requirements. To make users more accountable for their cloud costs, see Shift Budget Accountability in the Evolve component. It is easier to define requirements for existing applications because you have historical usage data to observe. New applications will require more assumptions about their expected utilization, and you'll possibly need to employ an iterative approach. Regardless of the difficulty, defining as many requirements as possible upfront is instrumental to a precise cloud services design. Such services are available with many configuration options, each bearing its own cost. Certain options should be enabled only if truly needed. For example, you will choose to enable (and pay for) cross-region data replication only if the application is critical for business continuity.
Conversely, overarchitecting cloud services can only result in overspending. Ask your cloud consumers the following questions to determine their workloads' requirements:
The definition of requirements is part of the extended function of IT as a broker of cloud services. The role of a service broker is to enable developers, consumers and lines of business to quickly access technology services while safeguarding the interests of the business through the application of centralized policies and procedures. Taking on this role involves a delicate balance between enabling agility and maintaining governance and control. For more information, see

Architect With Cost in Mind

You must use the defined requirements to help your cloud consumers design the most cost-effective architecture for their application. For example:
Every architectural component carries a cost. Therefore, you must design your cloud architectures with cost in mind. In the past, organizations used to design for availability, performance and security to be delivered from a finite set of resources. The cost of servers, storage, network and data center staff was already on the books, and the efficiency goal was to maximize utilization and return on investment. Therefore, there was a natural tendency to overarchitect. The cloud reverses this paradigm and allows for a more precise design that is perfectly aligned to workload requirements. Cloud services are provided with two different charging models: allocation-based and consumption-based. Allocation-based services require users to preprovision capacity, and cloud providers charge for that provisioned capacity as long as it exists, regardless of whether it is used. Conversely, consumption-based services don't require preprovisioning and are billed based on units of consumption. Figure 5 illustrates the differences between these two charging models. Organizations must design architectures that leverage cloud services based on the expected usage. An application with expected spikes in usage will possibly be more cost-effective if powered by consumption-based services. A stable application will be best served by allocation-based models. Architecting with cost in mind means picking the right services to deliver the exact set of known requirements and not more than that. Because not all requirements may be known at this stage and you may be making assumptions, you won't be able to produce the definitive architecture yet. The final architecture will result from multiple iterations and optimizations that you'll conduct in the first few weeks or months after deployment. However, designing an architecture as part of the planning capability is the fundamental starting point to produce your consumption forecast.
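The trade-off between the two charging models can be sketched with a toy cost comparison. All prices, workload profiles and instance counts below are hypothetical assumptions for illustration, not real provider rates:

```python
# Sketch: comparing allocation-based and consumption-based charging for
# two usage profiles. All figures are hypothetical.

HOURLY_INSTANCE_PRICE = 0.10   # allocation-based: $/hour per provisioned instance
PRICE_PER_UNIT = 0.0000002     # consumption-based: $ per request

def allocation_cost(hours, provisioned_instances):
    """Provisioned capacity is billed for every hour it exists, used or not."""
    return hours * provisioned_instances * HOURLY_INSTANCE_PRICE

def consumption_cost(requests_per_hour):
    """Consumption-based services bill only for the units actually consumed."""
    return sum(requests_per_hour) * PRICE_PER_UNIT

# A spiky workload: near-idle for 23 hours, then one large burst.
spiky = [1_000] * 23 + [20_000_000]
# A steady workload with the same total volume spread evenly across the day.
steady = [sum(spiky) // 24] * 24

# Assume the burst forces 4 provisioned instances to be kept around all day,
# while the steady profile fits in a single instance.
print(allocation_cost(24, 4), consumption_cost(spiky))    # spiky favors consumption
print(allocation_cost(24, 1), consumption_cost(steady))   # steady favors allocation
```

Under these toy numbers, the spiky profile is cheaper on the consumption-based model, while the steady profile is cheaper on the allocation-based one, which mirrors the guidance above.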
Choose Pricing Models

Cloud providers offer services through multiple pricing models. For example, you can buy the same instance with pay-as-you-go (PAYG) pricing or by committing to consume it for a given term, such as one year or three years. The instance will be priced differently based on the chosen model, and it will be delivered with a different service level. Figure 6 is a screenshot from the Gartner Cloud Decisions tool that indicates the price range for an m5.xlarge instance on AWS. Note that there is an order-of-magnitude difference from the cheapest option (spot, Linux OS, us-east-2 region) to the most expensive one (on demand, Dedicated Host, Linux, sa-east-1 region). The difference in pricing shown in Figure 6 means that planning your cloud consumption using the wrong pricing model can have a huge impact on your cloud forecast. Your forecast can be up to 10 times more expensive — or 10 times cheaper — than the actual future spending. Therefore, it's important that you learn how to make pricing model decisions upfront. Switching pricing models after deployment is possible and also recommended by this framework. However, choosing a model upfront allows you to produce a more accurate estimate and reduce the need for future budget adjustments. In general, cloud provider pricing models vary based on the following attributes:
Scrutinize your cloud provider's price lists and learn the combinations of attributes that influence pricing. You will likely end up with multiple pricing models applied to different groups of resources. Typically, organizations commit for longer terms for their baseline demand and use PAYG for temporary spikes or bursts. If you're not sure which model to choose at this planning stage, prioritize models that allow you to retain maximum flexibility. Prioritizing flexible, low-commitment models will protect you from overbuying. It will also give you the freedom to change configurations once you have a clearer picture of your needs. However, the most flexible pricing model (PAYG) also comes at the highest price. If you're expecting a general increase in the use of cloud services within your organization, you may opt for the more flexible commitment-based models such as AWS Savings Plans, Google CUDs or EA-based negotiated discounts. Such models can pay off as your cloud usage ramps up because they apply to a broader set of services and resources.

Forecast Consumption

Once you've designed your architecture and chosen your pricing models, you can create your consumption forecast. Create a workload model with the data you've gathered and input this information into purpose-built tools that can calculate an estimate of your monthly charges. Such tools include cloud providers' pricing calculators and third-party tools such as Gartner CloudMatch (part of the Gartner Cloud Decisions tool). At this stage, your forecast is based on assumptions and, as a consequence, it probably won't be a perfect match with the actual bill. Rather, it will contain a margin of error. This margin will likely shrink over time as you learn how to better collect requirements and model your workloads. It is important to create forecasts even if you expect a wide margin of error because they're fundamental to setting macrolevel expectations of your cloud spending.
Without such forecasts, you won't be able to understand whether you're spending less or more than you expected, and you won't be able to improve your forecasting ability. As you progress through this framework, you will learn how to adjust your cloud consumption forecasts when piloting your workload (see Deploy Pilot Application in the Plan component) or even once you start looking at your actual spending (see the Track component).

Automate Forecasting in the Continuous Integration/Continuous Delivery (CI/CD) Pipeline

Many organizations adopt a "shift left" mentality to place the onus for quality, reliability and uptime with application delivery teams that practice DevOps. Such organizations place increased expectations on these teams to forecast costs, optimize resources and implement continuous capacity management. This requires pulling traditionally manual processes for capacity management and forecasting into the automated CI/CD process. In such a process, your CI build generates a release candidate of your application. The release candidate includes an application manifest with metadata such as versioning, system requirements, application configuration and potentially a run book. You can use this application manifest to make the CI/CD toolchain aware of the application components and the resources the application consumes. Application teams can use this additional metadata to produce a cost forecast of the application in a production environment. Once produced, the forecast could be fed into an issue tracking system to manage the request for additional resources or to collect approvals from finance. In addition, this data can be used to baseline and measure the forecasts of successive releases and produce a historical view. This historical view will allow you to improve your forecasting ability, much in line with the agile principles used to measure the difficulty level of implementing user stories.
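As an illustration, a CI stage could derive a rough monthly forecast from such a manifest. The manifest schema, the resource names and the unit prices below are hypothetical assumptions, not any provider's actual rates:

```python
# Sketch of a CI stage that turns an application manifest into a monthly
# cost forecast. The price table and manifest schema are illustrative.

UNIT_PRICES = {          # $/hour per resource type (hypothetical)
    "vm.medium": 0.10,
    "db.small": 0.20,
    "storage.gb": 0.0001,
}

HOURS_PER_MONTH = 730

def forecast_monthly_cost(manifest):
    """Sum the hourly price of every resource the manifest declares."""
    hourly = 0.0
    for component in manifest["components"]:
        for resource, quantity in component["resources"].items():
            hourly += UNIT_PRICES[resource] * quantity
    return round(hourly * HOURS_PER_MONTH, 2)

manifest = {
    "application": "order-service",   # hypothetical application
    "version": "1.4.2",
    "components": [
        {"name": "api", "resources": {"vm.medium": 3}},
        {"name": "db", "resources": {"db.small": 1, "storage.gb": 500}},
    ],
}

print(forecast_monthly_cost(manifest))  # → 401.5
```

The pipeline could attach this figure to the release candidate and file it in the issue tracker for finance approval, as described above.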
Although Gartner sees early adopter organizations experimenting with it, the practice of automating forecasts in the CI/CD pipeline is still emerging. The tools available to calculate forecasts are still maturing, and they were not originally intended for automation pipelines.

Deploy Pilot Application

To improve the accuracy of your forecast and refine your estimate, you must deploy a pilot of your application before deploying it in production. The pilot stage allows you to detect early the configurations that are not suited to your actual demand. Cloud cost calculators provide an initial baseline to create an estimate for the designed architecture. However, their forecasting accuracy is proportional to the accuracy of the assumptions made when modeling the workload. Some people use the analogy "garbage in, garbage out" to describe such tools and highlight the importance of the quality of the inputs. Deploying pilot applications before production is a best practice that serves many different purposes, such as finding bugs, discovering architectural issues or functional testing. Once completed, pilot deployments are often promoted to production, without the need to redeploy new infrastructure. However, few organizations include cost monitoring during the pilot stage. To monitor consumption of your pilots comprehensively, you must first develop some of the capabilities described in the Track component of this framework. Once you have visibility into key metrics of your pilots, you can monitor utilization and cost and make the architectural adjustments that improve the accuracy of your consumption forecast. Look for utilization patterns and study the consumption trends. Use performance management tools to understand the behavior in terms of CPU, memory, IOPS and data transfer. Look for specific time frames when the application may not be utilized or when it would possibly require additional resources to deliver the required performance target.
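This kind of pattern mining can be sketched in a few lines, assuming hourly CPU samples collected during the pilot; the synthetic data and the 10% idle threshold below are illustrative assumptions:

```python
# Sketch: mining a pilot's hourly CPU samples for idle windows that are
# candidates for scheduled shutdown or downsizing.
from collections import defaultdict

def idle_hours(samples, threshold=10.0):
    """samples: list of (hour_of_day, cpu_percent) tuples from the pilot.
    Returns the hours whose average CPU sits below the threshold."""
    by_hour = defaultdict(list)
    for hour, cpu in samples:
        by_hour[hour].append(cpu)
    return sorted(h for h, vals in by_hour.items()
                  if sum(vals) / len(vals) < threshold)

# Synthetic pilot data: busy during business hours, idle overnight.
samples = [(h, 55.0) for h in range(8, 20)] + \
          [(h, 3.0) for h in list(range(0, 8)) + list(range(20, 24))]

print(idle_hours(samples))  # → the overnight hours, shut-down candidates
```

Feeding a few weeks of real pilot metrics through this kind of analysis is what lets you refine both the architecture and the consumption forecast.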
The length of this phase will vary. Gartner typically observes organizations running pilot applications to monitor consumption for one to three months. The actual duration will heavily depend on your degree of familiarity with the involved cloud provider and the maturity of your cloud adoption. The more workloads you deploy in the public cloud, the more accurate your planning ability will become, allowing you to shorten the length of your pilots.

Establish Budget

Complete the Plan component by establishing a budget figure for each application, project or workload you're deploying in a public cloud environment. This figure helps set expectations around cloud costs and lowers the general anxiety about uncontrolled spending growth. The cloud spending owners — whether I&O or the individual cloud consumer teams — should ask the finance organization for formal approval of this budget. Learning how to build budgets upfront and allowing the organization to approve spending before it occurs is fundamental for enabling cost governance. Having cloud consumers ask for budget approvals makes them more accountable for their spend, as described in "Shift Budget Accountability" in the Evolve component. Once established, the budget figure should be configured in a budget tracking system, which can compare actual spending to the established expectations and alert owners when needed. Examples of how to configure budget alerts for GCP and Microsoft Azure can be found in and

Developing cost planning capabilities is key to setting agreed-upon expectations for cloud spending. Skipping this component of the framework and not establishing application budgets would raise concerns about the lack of cost discipline. Furthermore, without planning, organizations will struggle to make their cloud consumers accountable for their spending.

Track

Once your budget is established and your application is deployed, you must maintain visibility into cloud spending.
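The comparison a budget tracking system performs can be sketched generically: alert the spending owner whenever month-to-date spend crosses a percentage of the approved budget. This is an illustration of the logic only, not any provider's actual alerting API, and the thresholds and figures are hypothetical:

```python
# Sketch: the threshold check behind a budget alert. Thresholds and
# figures are illustrative assumptions.

def budget_alerts(budget, month_to_date_spend, thresholds=(0.5, 0.9, 1.0)):
    """Return the threshold fractions the current spend has crossed."""
    ratio = month_to_date_spend / budget
    return [t for t in thresholds if ratio >= t]

# A $10,000 approved budget with $9,200 already spent this month has
# crossed the 50% and 90% alert thresholds but not yet 100%.
print(budget_alerts(budget=10_000, month_to_date_spend=9_200))  # → [0.5, 0.9]
```

Provider-native budget tools on GCP and Microsoft Azure apply this same pattern, firing notifications as each configured threshold is crossed.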
Many companies save money by simply gaining visibility into who is spending money and on which projects. With the right insights, your organization may begin to question whether deployed resources are adding value and whether or not they are necessary. Tracking spending requires the creation of an organized view of costs. Organizing cloud costs is not an activity to conduct on the bill itself. Trying to manually attribute each line item to your cost centers is likely to be unsuccessful due to the large amount of data to process. Furthermore, processing issued bills won't give you a daily view into your spending, which is essential for containing waste and optimizing resources. Providers' native hierarchies and tags are the foundational mechanisms to organize resources on all major cloud providers. With well-organized resources and a cost allocation strategy in place, organizations can monitor cost and utilization metrics to detect anomalies and implement chargeback and showback. The Track section of this guidance framework is depicted in Figure 7.

Design Native Hierarchy

All major cloud providers offer native mechanisms to classify resources in a hierarchical structure. For example, AWS offers "accounts" and "organizations." Microsoft Azure offers "management groups," "subscriptions" and "resource groups." GCP offers "folders" and "projects." Resource placement within these native constructs is mandatory at the time of provisioning. The element of the hierarchy where a resource is placed appears in the provider's bill, next to each line item, such as an hour of consumption of a compute instance. Therefore, technical professionals must also think about cost allocation when designing their resource placement strategy in a provider's native hierarchy. However, this hierarchy isn't designed primarily to implement a cost structure. Rather, it is intended to provide resource isolation and management at scale.
An AWS account or a Microsoft Azure subscription bears many constraints that organizations must be aware of and that should take priority over cost allocation. For example, organizations shouldn't use one account per application only to have the ability to track how much each application costs. Using multiple accounts complicates resource management because each account acts as an anchor for quotas, permissions and other policies. This management overhead largely outweighs the benefits when multiple accounts are used solely for cost attribution. For more information about designing a provider's governance structure and related constraints, see and

Although a provider's native hierarchy is fundamental to enable basic cost allocation, organizations must complement it with other mechanisms, such as tags or labels, to implement cross-cutting resource metadata (see the following section, Implement Tagging Strategy).

Implement Tagging Strategy

Tags (or labels) are a fundamental governance construct offered by all major cloud providers, including AWS, GCP and Microsoft Azure. Tags implement metadata that applies across the elements of a provider's native hierarchy. Tags appear in the provider's bill next to each line item and can be used to break down cost reports. For example, by tagging development resources with the "environment = development" tag, organizations can group the spending of all development environments across all their accounts, subscriptions or projects. Tags provide maximum flexibility and minimal constraints to implement a multifaceted cloud resource classification strategy. For cost tracking, Gartner recommends the use of tags in addition to and independently from other native governance constructs. Tags provide several advantages compared to relying solely on native hierarchical constructs, specifically:
Tags will appear in your bill from the moment they're implemented. They will not apply retroactively to bills that were issued prior to the application of tags. To enable cost tracking, implement your tagging strategy as soon as possible. Figure 8 shows an example of how the combination of native hierarchy constructs ("Account" in Figure 8) and tags ("Application" and "Environment" in Figure 8) leads to cost breakdown reports that allow organizations to gain insight into their spending. Although most organizations understand the importance of tags, even beyond the cost allocation use case, Gartner inquiries show that many tagging initiatives fail. This is due to high management complexity, low maturity of providers' tooling and a generally low perceived value among cloud consumers. To address these issues, Gartner has developed a guidance framework for

This framework provides a sample tagging dictionary, and helps mitigate risks and avoid common pitfalls, setting organizations up for success in their tagging initiatives. Gartner recommends defining a tagging dictionary and promoting it internally through workshops and other dissemination activities. Organizations must establish an audit process that allows them to detect and remediate mistagged resources. Furthermore, organizations must use automation to mitigate the administrative burden of implementing tags. Lastly, enforcement measures must be put in place to prevent resource provisioning when tags are not implemented in accordance with the guidelines. Just like cost management, the CCOE leads the development of a tagging strategy. The CCOE should involve all stakeholders from the start, to allow them to grasp the value of tags and their use cases. For additional details on this Gartner approach to tagging, see

Allocate Costs of Shared Resources

In some situations, even the combination of tags with the provider's native hierarchy is not enough to properly allocate spending across cost centers.
This happens when resources are shared between multiple projects or departments, or by the entire organization. For example, a single e-learning application may be used by multiple departments to train their teams. As another example, the network connection (e.g., AWS Direct Connect or Microsoft Azure ExpressRoute) between the organization's data center and a public cloud provider is used by everyone accessing cloud services. In these and similar situations, organizations must determine how to split the costs of shared resources. This cost allocation activity is typically handled manually and, therefore, does not scale. Consequently, Gartner recommends minimizing the use of shared resources by:
Sometimes, the effort required to allocate costs for shared resources may outweigh its benefits. Therefore, you may want to develop this strategy only for the most expensive shared resources that experience heavily unbalanced usage from your cloud consumers.

Define Metrics to Track

Once resources are classified with the desired set of metadata, organizations must establish visibility into cost metrics. Organizations must define the metrics they want to track to enable cost governance. Such metrics are used to:
Besides the cost of services, organizations must track other related metrics, such as utilization, capacity, availability and performance. For example, to identify spending waste, it is fundamental to look at how much a resource is actually used and to compare its utilization with its provisioned capacity. Furthermore, if you take actions to reduce spend by shrinking the infrastructure footprint, you want to make sure you're not impacting availability and performance. Gartner recommends building the following dashboards and reports and updating them at least daily. These reports should be available for each project, application, department and any other resource metadata:
You will be able to calculate the estimated spending waste once you have developed some of the capabilities described in the Reduce and Optimize components of this framework. You can estimate the savings that would derive from the identified cost optimization opportunities and build reports that showcase the most and least disciplined teams and individuals. Ultimately, these reports will allow you to increase your consumers' spending accountability. This practice is described in more detail in the Incentivize Financial Responsibility section in the Evolve component of this framework.

Alert on Anomalies

Monitoring cloud spending can be overwhelming, especially when cost must be continuously correlated with metrics such as utilization or performance. Technical professionals should not spend time monitoring metrics when the metrics are simply portraying a "normal" situation. In light of this, organizations must introduce automation to detect and alert when there is a deviation from a normal trend, i.e., an anomaly. To do so, define the conditions that represent an anomaly by using a policy. For example, a department's daily spending that is 10 times bigger than the day before can be the symptom of a problem and should be flagged as an anomaly. However, a rule-based policy might also trigger false positives, for example when anomalies occur on a regular basis. As a more sophisticated practice, you can build machine learning models that learn normal trends by consuming historical data points and predict what the normal value range might look like over time (see the gray band in Figure 9). Once the model has predicted a normal value range with confidence, any metric value outside of that range is flagged as an anomaly (the red line in Figure 9). Anomalies in metric values should draw your attention and, therefore, you must trigger alerts when anomalies are detected.
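As an illustration of the statistical end of this spectrum, the following Python sketch learns a "normal" band from historical daily spend and flags new data points that fall outside it. The three-sigma threshold and the sample values are illustrative assumptions, not a prescribed policy.

```python
# Minimal spend-anomaly detector, assuming daily spend values.
# The k-sigma band is a simplification of the model-based approach.
from statistics import mean, stdev

def detect_anomalies(history, new_points, k=3.0):
    """Flag points outside the band mean +/- k * stdev learned from history."""
    mu, sigma = mean(history), stdev(history)
    lower, upper = mu - k * sigma, mu + k * sigma
    return [(value, not (lower <= value <= upper)) for value in new_points]
```

A real implementation would feed this from the provider's billing exports and route flagged points to the alerting channels described below.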
Notify resource owners, product teams, finance, the CCOE or any other individual or team that must be aware of the potential issue. For example, if the cloud estimate for a given project is $10,000 per month, and the consumption is already at $8,000 after the first week, the organization should be made aware of it. Alerting on anomalies enables organizations to promptly undertake corrective actions, instead of realizing the issue once the bill arrives. In nonproduction environments, alerts can also be used to trigger other, more disruptive actions. Such actions may include the shutdown of all cost-accruing services until the cause of the anomaly is found and resolved.

Implement Chargeback and Showback

The unprecedented spending transparency provided by cloud services enables organizations to quickly implement chargeback and showback strategies. Resource classification with tags and other metadata allows you to precisely attribute costs to your internal departments and cost centers. With chargeback, each department gets an internal bill with the cost of the services it generates. Organizations use chargeback to enable spending accountability, to give senior management visibility into the costs of IT and to gain the ability to respond to unexpected demand. Showback is a form of chargeback that provides cost breakdown and visibility without the need to issue an internal bill. In traditional data centers, both chargeback and showback have been very complex, due to the difficulty of calculating the costs of each infrastructure item. Cloud services solve this problem by providing granular cost metrics programmatically, making chargeback and showback much easier to implement. In a chargeback scenario, IT acts as an internal service provider to the organization as a whole. Sometimes, IT chooses to apply service charges to each internal client to cover the cost of shared or centralized services.
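To make the mechanics concrete, a minimal showback/chargeback aggregation might look like the following sketch. The line-item shape and the "department" tag key are assumptions for illustration.

```python
# Hypothetical showback report: group billing line items by a
# "department" tag; a nonzero markup turns showback into chargeback
# with a service charge applied on top of raw costs.
from collections import defaultdict

def showback(line_items, markup=0.0):
    """Return cost per department; markup=0.10 adds a 10% service charge."""
    totals = defaultdict(float)
    for item in line_items:
        dept = item.get("tags", {}).get("department", "untagged")
        totals[dept] += item["cost"] * (1.0 + markup)
    return dict(totals)
```

Note how untagged line items surface as their own bucket, which also makes the report a useful audit tool for the tagging strategy.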
When implementing a chargeback strategy, you must choose whether you are charging services “at cost” or at a different price. To price differently you can:
Ultimately, IT can choose a chargeback strategy that implements both models, possibly using a lower markup. Developing cost-tracking capabilities is foundational to enable cost governance. Having visibility into cloud spending is fundamental to verify the correctness of expectations, detect anomalies, increase accountability and provide observability into the metrics that can drive costs down. Skipping this component of the framework would make the entire cost management initiative fail. Without visibility into each metric and its trends, organizations wouldn't know whether spending is under control and will likely overspend.

Reduce

Using the gained visibility into spending metrics, you must seek opportunities to reduce your monthly bill. The importance of this component and of the following Optimize component is further reinforced by Gartner's predictions on cloud overspending. This framework component highlights common methods that organizations use to reduce their spending, as shown in Figure 10. These methods can be applied without the need to change the application architecture or code. Therefore, they are easier to implement and have an easy-to-calculate ROI. For example, you can estimate the savings from the rightsizing of a compute instance that has been overprovisioned for several weeks.

Dispose Unused Resources

You must look for resources that have been deployed but are not being used at all. These are allocation-based resources that, once provisioned, accrue cost irrespective of their usage. Such resources require the specification of a certain capacity at provisioning time, and that capacity determines their cost. To detect such resources, look for extremely low utilization metrics over a period of time. Once found, initiate a workflow that ultimately disposes of the unused resources to gain cost savings. Although it sounds obvious, disposing of unused resources is not a common practice within traditional data centers.
On-premises, organizations operate resources within a finite, preprocured capacity. Money is spent upfront to procure the overall capacity, not for its actual utilization. Furthermore, once capacity gets allocated to projects, people are reluctant to give it back, for fear that they won't be able to obtain it when they need it again. Consequently, organizations are not prepared to manage the disposal of unused resources. In cloud computing, capacity allocations (such as the number of CPUs and gigabytes of RAM of a compute instance) are extremely granular, can be changed frequently and are billed down to one-second increments. These characteristics of cloud computing make the disposal of unused resources highly impactful in reducing your monthly bill. The definition of what "unused" means is described using policies that define rules based on metric values. For example, a compute instance whose CPU has been utilized, on average, below 1% for at least 24 hours should be considered unused and should be disposed of. To increase result accuracy, organizations should refine this policy by using multiple metrics. For example, compute instance utilization can be determined by inspecting RAM, network bandwidth and Secure Shell (SSH) or Remote Desktop Protocol (RDP) login sessions when these are relevant (such as in the case of development instances). There are several resource types that organizations must monitor and dispose of, if unused. As a minimum, Gartner recommends looking for the following resource types:
To minimize disruption, your disposition policy should encompass multiple stages. First, it should mark identified resources as unused. Then, it should notify the owners and solicit an action from them. In the absence of a change in the resource utilization pattern within a grace period (for example, 48 hours), the disposition policy should eventually execute an administrative deletion.

Schedule Services

You may have resources that remain idle only in certain hours of the day or on certain days of the week. This is the typical behavior of dev/test workloads that depend on the presence of developers at work. In this case, a retroactive detection mechanism that looks for idle resources is not efficient. With such a mechanism, you'd spend time and money just to deem resources "idle" before you can actually turn them off. In such cases, you must schedule cloud services to be on and off based on expected utilization patterns. If you know what the expected utilization is, you can describe it using a "duty schedule" tag. Then you can use any cron-like scheduler to read that tag value and turn services on and off accordingly. If you don't know what to expect in terms of utilization, you can make assumptions about future behavior based on historical data. You can observe your cyclic workloads over a period of time and derive utilization patterns over a defined working cycle, such as a week or a month. Then, develop a scheduling policy that matches the identified patterns and proactively turns services off when the expected usage is low. As an example, Figure 11 shows a CPU-based utilization pattern of a compute instance over a week, with one-hour granularity. When building utilization patterns, organizations should refine the policy that defines the boundary between the used/unused conditions by using multiple metrics. For example, compute instance metrics should include CPU, RAM and network bandwidth, but also SSH/RDP login sessions, especially for development instances.
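As a sketch of the tag-driven approach, the following interprets a hypothetical "duty schedule" tag value such as "weekdays-08-18" (days, start hour, end hour). A cron-like job would evaluate it periodically and start or stop the resource accordingly; the tag format is an assumption for illustration.

```python
# Minimal duty-schedule interpreter for a hypothetical tag value
# formatted as "<days>-<start_hour>-<end_hour>", e.g. "weekdays-08-18".
from datetime import datetime

def desired_state(tag_value, now):
    """Return 'on' or 'off' for a resource given its duty-schedule tag."""
    days, start, end = tag_value.split("-")
    # "weekdays" limits the schedule to Monday-Friday; anything else is daily.
    in_days = now.weekday() < 5 if days == "weekdays" else True
    in_hours = int(start) <= now.hour < int(end)
    return "on" if (in_days and in_hours) else "off"
```

The scheduler would compare the desired state with the actual state of each tagged resource and issue start/stop calls only on mismatches.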
Scheduling services can be highly impactful. You can save up to 70% on development instances if you schedule them to be on only for eight hours a day, five days a week. If developers need to work outside business hours and find their instances offline, you can allow them to turn the instances on manually by specifying how long they need this exception for. You should also cap the amount of time a developer can request for this exception to a maximum number of hours. Not all cloud services can be turned on and off while persisting data. Compute instances do persist data when the data is stored on a decoupled block storage volume. Conversely, other services, such as Amazon Redshift, do not persist data when their nodes are turned off. In such cases, you must build into your start/stop operations the tasks required to back up and restore the data from an external storage service. Scheduling services is not recommended for production workloads. To address similar variable usage patterns in production workloads, see the Optimize component.

Rightsize Allocation-Based Services

Allocation-based services require that you request a specific allocation at provisioning time. This allocation could be the number of CPUs, the amount of RAM or the maximum number of IOPS of the underlying infrastructure. You pay for this allocation irrespective of whether you're using it. Often, you end up using resources at a much smaller percentage of what they can deliver. When this happens, you must rightsize your resources to reduce costs. Rightsizing is the practice of adjusting a cloud service's allocation size to the actual workload demand. Examples of allocation-based services that are good candidates for rightsizing are:
Traditional data centers are commonly underutilized. Deployed resources are often overprovisioned because consumers compete for the same finite IT capacity. Consumers tend to provision bigger resources than they need just to secure that capacity for their projects in light of expected (or hoped-for) future growth. This practice is well known to I&O departments. But because a data center's capacity is procured upfront, driving more efficient usage of data centers does not have an immediate impact on cost. On the contrary, more efficient usage would translate into larger wasted capacity and questionable investments. Cloud computing reverses this paradigm. Client organizations can count on virtually infinite capacity and can focus on managing virtual resources on demand. Cloud providers bill organizations based on the provisioned virtual resources and allow their clients to adjust service allocations with an immediate impact on billing. As a consequence, rightsizing cloud resources can have a huge impact on reducing your monthly bill. To implement rightsizing, you must monitor resource utilization over a defined period (for example, one week), compare it with the provisioned capacity and change the allocation size if a resource is found to be larger than necessary. Because your demand may vary over time, you must be ready to rightsize in both directions, down and up, by also increasing resource size when performance is suffering. Your ultimate goal is to develop a continuous rightsizing process that can enforce a defined target utilization threshold. Continuous rightsizing is no different from what is also known in the industry as "vertical autoscaling." For example, if you set your utilization threshold for compute instances at 70% on the CPU metric, your rightsizing process changes instance sizes accordingly to keep the utilization line as flat as possible, as depicted in Figure 12. Rightsizing is one of the most effective cost optimization best practices for public cloud IaaS and PaaS.
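A single continuous rightsizing cycle can be reduced to a simple decision rule. The 70% target mirrors the example above, while the doubling assumption (each size step doubles capacity) and the tolerance are illustrative.

```python
# Sketch of one rightsizing cycle against a target utilization threshold.
# Assumes instance sizes double in capacity at each step (illustrative).
def rightsize(avg_utilization, target=0.70, tolerance=0.15):
    """Return 'downsize', 'upsize' or 'keep' for the next cycle."""
    if avg_utilization * 2 <= target:
        # Even after halving capacity, utilization stays at or below target.
        return "downsize"
    if avg_utilization > target + tolerance:
        # Sustained utilization above target: performance may be suffering.
        return "upsize"
    return "keep"
```

Running this on every monitoring window, in both directions, is what keeps the utilization line flat in the manner of Figure 12.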
Together with unused resources, overprovisioned service allocations are among the top contributors to public cloud spending waste. If you need to quickly reduce your costs, Gartner recommends that you prioritize rightsizing among the capabilities to develop. When developing rightsizing capabilities, Gartner recommends:
Rightsizing is an efficient capacity management practice for any allocation-based cloud service. This practice is necessary to achieve savings because cloud providers ask their client organizations to choose an allocation size for their provisioned services. However, as providers increase their serverless capabilities, the concern for dynamic capacity management will shift to the cloud providers themselves. Providers that implement continuous rightsizing will be providing serverless capabilities that dynamically scale services based on observed demand. Microsoft Azure SQL Database Serverless is an example of a cloud service for which the cloud provider implements continuous rightsizing behind the scenes, relieving clients of this concern and unlocking cost benefits for dynamic workloads. More information on using serverless technologies for cost optimization can be found in the Use Serverless Technologies section in the Optimize component.

Leverage Discount Models

Not all workloads benefit from the flexibility of the low-commitment PAYG pricing model. Some workloads are stable and their future utilization is predictable. To address such situations, cloud providers offer discounted prices in exchange for the client's commitment to use their services for a period of time. There are two types of discount models for cloud services:
Negotiated discounts are part of an enterprise agreement (EA) that your organization may sign with cloud providers. The primary purpose of signing an EA is to receive better terms and conditions than those offered by the provider's standard click-through agreement. One such condition can be a discount applied to the billed cloud services. If your organization doesn't have an EA in place with your cloud provider, ask your procurement and vendor management department to negotiate one. Contact your sales representative to initiate the discussion. Although EAs are negotiated, cloud providers have a fairly standardized framework for their discount models. Discounts are applied as a percentage reduction (such as 5% or 20%) and can cover your entire bill or a specific set of services that have a higher volume of utilization. In exchange for a negotiated discount, you will need to commit to a certain minimum spend over the validity of the EA.

Programmatic Discounts

Cloud providers also offer discounts that can be purchased programmatically. Such discounts do not require a negotiation with the provider's sales team. Client organizations can purchase these discounts in the form of "vouchers" using a management operation, which can be automated. The purchased discount is normally billed with a one-time charge and has a specific validity period, after which it expires. Examples of programmatic discounts are:
During its validity, any existing cloud resource that matches the discount conditions can "consume" it in exchange for receiving a zero-dollar charge for a specific billing period (normally one hour). Discounts are not purchased for a specific resource, and they can match multiple resources during their period of validity. Assuming a matching resource is always found throughout its validity, a programmatic discount can make the actual resource costs up to 70% lower than the PAYG model. When a purchased discount exists but no matching resource is found, that constitutes spending waste. When more resources exist than those covered by purchased discounts, cloud providers bill the excess using the standard PAYG pricing. Cloud providers offer several types of programmatic discounts, which differ based on their applicability, such as a specific service, a provider's region or a resource type. All discounts provide a trade-off between flexibility and benefits. The more stringent the conditions organizations are willing to commit to, the higher the benefits they will receive, such as higher discount levels. More flexible discounts increase the likelihood that they will match your actual usage. Some discounts require you to manually change their flexible attributes to match your utilization. For example, AWS Convertible Reserved Instances (RIs) require that you convert them to leverage their flexibility. Other discounts, such as AWS Savings Plans or Google committed use discounts (CUDs), automatically apply across a wider spectrum of resources. Because AWS RIs and Savings Plans offer similar discount levels, Gartner recommends prioritizing Savings Plans over RIs due to their wider applicability. Programmatic discounts can significantly reduce your cloud bill. Determine your baseline and purchase enough discounts to cover your stable, predictable workloads. If you are unsure about how much to commit, you can observe your past utilization and use it to make assumptions about the future.
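One conservative way to derive that baseline is to commit only to usage you observe nearly all the time. The sketch below picks a low percentile of historical hourly instance counts (percentile 0 is the observed minimum); the policy is an assumption for illustration, not a provider recommendation.

```python
# Hypothetical baseline sizing for programmatic discounts: commit to a
# low percentile of observed hourly usage so that purchased discounts
# are almost always matched by a running resource.
def commitment_baseline(hourly_instance_counts, percentile=0.0):
    """Return how many instances to cover with committed discounts."""
    ordered = sorted(hourly_instance_counts)
    index = int(percentile * (len(ordered) - 1))
    return ordered[index]
```

Raising the percentile buys more coverage (more savings when matched) at a growing risk of paying for unmatched commitments.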
Then, you can decide your level of "aggressiveness," bearing in mind that more aggressive commitments bear higher risks of spending waste, as shown in Figure 13. Managing your programmatic discounts centrally, rather than by workload or department, will improve the accuracy of your utilization estimates. It will also increase the likelihood of consuming purchased discounts and reduce the risk of spending waste. Deciding to sign up for programmatic discounts and managing your discount portfolio to ensure maximal coverage is a complex matter. Although cloud providers are introducing more simplification, Gartner recommends relying on tools that help determine your baseline and suggest discount purchases and modifications. When managing programmatic discounts, Gartner recommends:
Programmatic discounts are a cost reduction practice that can quickly drive your costs down. Together with deleting unused resources and rightsizing, this practice should be on your priority list if you urgently need to reduce your monthly bill. However, you shouldn't rush your commitment decisions because they bear consequences over the medium to long term. Although they're both very effective in reducing your bill, negotiated and programmatic discounts also bear the risk of encouraging the wrong consumption behavior. It's easy to fall into the trap of buying larger commitments and then artificially driving utilization up just to match those prepurchased commitments. That's exactly the same logic that organizations used to apply in traditional data centers, and it causes many inefficiencies. For example, Gartner does not recommend changing a compute instance size to a bigger one just to match an unused RI that's sitting in your portfolio. If the RI is unused, it's probably because you have overcommitted, and you should take this fact into account when deciding on the RI renewal. If you size an instance to match the RI, the RI would be considered consumed and you'll end up overcommitting again in the future.

Upgrade Instance Generation

Over the years, cloud providers such as AWS and Microsoft Azure have refreshed their compute platforms a few times. Each newer platform is based on new hardware and new processor and memory technologies, and usually comes with storage and networking updates. At every refresh, cloud providers have also launched new instance types from the new platform, grouped under a new "generation." The new-generation instances are meant to address the same use cases as the previous generation, but with renewed power. Often, these new instance types are less expensive because they are more efficient. Figure 14 shows some metrics tracked by Gartner Cloud Decisions over the years.
It indicates the relative progression of price, CPU, memory and network performance for three generations of AWS's "M" general-purpose instance. The chart shows that, while the price has decreased slightly, CPU and memory performance have increased over time, especially with the introduction of the Nitro technology in the fifth generation. Whenever a new instance generation is available, consider the performance increase as your ability to achieve more while spending the same amount of money. As a consequence, develop your compute instance rightsizing practice to also work across instance families. You may be able to save money by choosing a smaller size in a new instance generation and delivering the same performance. For more information on rightsizing, see Rightsize Allocation-Based Services in the Reduce component of this framework.

Establish a DevOps Feedback Loop

CI/CD platforms are configured with metadata about software releases describing the infrastructure components that an application needs to run. For example, a Kubernetes deployment manifest contains the number of pods and the amount of RAM and CPU. Such deployment manifests are managed through a version control system and maintained as part of the CI/CD process. Organizations must establish a DevOps feedback loop between cost reduction methodologies and the CI/CD pipeline. Such a feedback loop allows CI/CD platforms to be aware of the changes made by the cost management practice. Otherwise, it would be counterproductive to rightsize resources and then have a CI/CD platform overprovision them again at the next release. A robust CI/CD process and application platform includes the capability to track metrics such as utilization and capacity of resources as they move from development into production. Organizations must make these metrics available to the application development teams or to whoever is responsible for the deployment manifests.
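As a sketch of such a feedback loop, the fragment below folds rightsizing recommendations back into a Kubernetes-style manifest, represented here as a plain dict. The field names mirror common Kubernetes conventions, but the recommendation format is hypothetical.

```python
# Fold rightsizing recommendations back into a deployment manifest so
# the next release does not reprovision the old, oversized requests.
def apply_recommendation(manifest, recommendation):
    """Overwrite container resource requests with the recommended sizes."""
    for container in manifest["spec"]["containers"]:
        if container["name"] in recommendation:
            container["resources"]["requests"] = recommendation[container["name"]]
    return manifest
```

In practice, a step like this would run in the pipeline itself, committing the updated manifest back to version control for review.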
Furthermore, the cost management practice can publish the recommended sizes in shared repositories that can be automatically read by the CI/CD platform at the time of the software release. Developing cost reduction capabilities is the quickest way to access cost savings. Because these practices do not require architectural changes to your applications, they are more easily applicable to a large set of use cases. Skipping this component of the framework will make your organization overspend on cloud services and won't allow you to profit from the elasticity of cloud computing.

Optimize

Optimizing cloud spending goes beyond the tactical cost reduction techniques mentioned in the previous Reduce component. Strategic optimization techniques often require application architectural changes to reduce the need for resources. Cloud computing has inspired modern application architectures that are also referred to as "cloud-native." Such architectures are designed around the native features of cloud services and can often deliver more favorable ROIs compared to traditional ones. The Optimize component of the framework (depicted in Figure 15) illustrates optimization best practices that you can adopt to optimize your monthly bill.

Use Preemptible Instances

Sometimes, the choice of desired service availability determines the price of a resource. Some cloud providers offer compute instances at a much lower price compared to the standard PAYG model. However, their availability is also lower. Preemptible instances are based on a provider's spare capacity and can be terminated by the provider at any time when standard demand rises. Examples of preemptible instances are:
Assess your application's architecture and find the components and use cases that may be suitable for infrastructure that might become suddenly unavailable. For example, batch workloads may simply pause when the infrastructure goes down and restart once the provider's spare capacity becomes available again. Also, stateless workloads can take advantage of preemptible instances, leaving it to load balancers to handle the sudden unavailability of nodes. You can further mitigate the risk of unavailability by:
Leverage preemptible instances to gain significant cost benefits if your workload can adapt to their limitations and if you can mitigate the risk of unavailability.

Set Up Data Storage Life Cycle Policies

Organizations can use multiple cloud services to store data. You can use not only traditional block or file storage, but also object storage or database services designed for different use cases. The line between all of these data storage services is blurring, but besides their functional differences, they also come at different costs. Each storage service may also be provided in different tiers at different prices. Storage tiers provide equivalent functionality, but can differ based on their degree of availability, redundancy and retrieval latency. Selecting a low-latency, georedundant tier with 99.99% availability for data that is not critical for your organization may be a waste of money. Selecting the right storage service and tier is key to making cloud services cost-effective. However, some data may be characterized by usage patterns that differ at each phase of its life cycle. For example, some data may become less frequently accessed as time goes by (for example, a social network timeline). In this case, you can optimize your costs by moving older data to less expensive tiers or services. Other times, you may need only the ability to query, in real time, data that's older than a number of months. In such cases, you may want to use a mix of database and object storage services at different phases of the data life cycle. But while changing a service tier is a fairly simple management operation, changing the service type is much more complex, as it may require data transformation. Table 1 shows the main pricing and functional differences between the tiers of the Amazon S3 object storage service. Develop a strategy to select the right service and tier at each phase of your data life cycle.
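As a simplified illustration of a life cycle policy, the rule below maps an object's age to a storage tier. The tier names echo Amazon S3 storage classes, but the day thresholds are assumptions to calibrate against your own access patterns.

```python
# Illustrative age-based life cycle rule: pick a storage tier from an
# object's age in days. Thresholds are assumptions, not recommendations.
def select_tier(age_days):
    if age_days < 30:
        return "STANDARD"       # frequently accessed, lowest latency
    if age_days < 90:
        return "STANDARD_IA"    # infrequent access, cheaper storage
    return "GLACIER"            # archival, cheapest storage, slow retrieval
```

A scheduled job (or a provider-native life cycle rule) would evaluate this for each object and issue the corresponding tier transitions.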
Optimize storage tiers by automatically moving objects across tiers based on detected usage patterns. Because this practice is highly cost-effective, yet constitutes operational overhead, some cloud providers have started to offer it as a managed service. For example, AWS launched the Amazon S3 Intelligent-Tiering storage class, which automatically optimizes object placement across storage tiers based on observed access frequency.

Implement Horizontal Autoscaling

Cloud platforms provide elasticity that enables applications to grow and shrink their resource footprint in response to both internal and external events. Such behavior is called "autoscaling" and is governed by metric-based policies. Leveraging autoscaling can optimize your costs because it dynamically aligns your resource footprint to workload demand. Autoscaling is either "vertical" (making a single instance bigger) or "horizontal" (adding more instances of the same type and distributing the workload across them). This section provides best practices for horizontal autoscaling. Vertical autoscaling is covered in the Rightsize Allocation-Based Services section in the Reduce component. Horizontal autoscaling requires specific design principles in the application architecture. Specifically, it requires the application to allow multiple instances to run in parallel. The application must also be able to start and shut down gracefully, and it must not rely on local dependencies. Related Gartner research explains how to design an application that allows for horizontal autoscaling. Autoscaling is triggered by policy-based thresholds that instruct a cloud platform to automatically scale applications by adding instances, typically in reaction to an increase in load. The policy includes a limit for the scaled resources (such as the maximum number of instances to automatically provision) and thresholds to remove instances once the load goes down.
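Such a policy can be sketched as a simple threshold rule. The high/low watermarks and instance limits below are illustrative defaults, not recommended values.

```python
# Threshold-based horizontal autoscaling sketch: scale out above the
# high watermark, scale in below the low watermark, within limits.
def scale_decision(current_instances, avg_load, high=0.75, low=0.30,
                   min_instances=2, max_instances=10):
    """Return the desired instance count for the next evaluation period."""
    if avg_load > high and current_instances < max_instances:
        return current_instances + 1   # scale out
    if avg_load < low and current_instances > min_instances:
        return current_instances - 1   # scale in
    return current_instances
```

Keeping the low watermark well below the high one avoids "flapping," where the group repeatedly scales out and back in around a single threshold.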
Horizontal autoscaling can be grouped into four categories that differ based on the metrics used to trigger events. These categories are:
Horizontal autoscaling in cloud platforms can function both at the IaaS level (more coarse-grained) and at the application PaaS and container as a service (CaaS) levels (more fine-grained). For certain services, autoscaling capabilities are natively built into a cloud platform and are handled automatically by the cloud provider. In such cases, this "zero-touch" autoscaling becomes an inherent characteristic of what is called "serverless computing," described in the next section. Horizontal autoscaling is an effective cost optimization practice that leverages the elasticity of cloud computing. It is in line with cloud-native architectural principles and also makes applications more resilient and scalable. Horizontal autoscaling should be used in conjunction with rightsizing, because these two techniques normally apply to different sets of applications.

Balance Usage of Consumption-Based Services

Many cloud services are billed with a consumption-based model, whereby you don't pay for provisioned capacity. Instead, you pay for each handled request and the amount of data transferred. While this is the ideal PAYG model, it also adds new challenges in predicting and controlling how much you will spend. Examples of consumption-based services are:
Optimizing consumption-based services for cost is more complex because you do not control the capacity provisioning. Because charges are directly tied to usage, you can optimize your costs by reducing the use of such services. To achieve this, you must transform your application behavior and architecture. For example, design your application to make use of compute as close as possible to where data resides. In traditional data centers, certain resources (such as network bandwidth) were considered free of charge. As a consequence, changing your application architecture to reduce the use of consumption-based services can be especially effective for applications migrated from on-premises data centers.

Use Serverless Technologies
In serverless computing, you relinquish control, flexibility and ownership of the application infrastructure to the cloud provider. In return, you get a more dynamic deployment experience, zero-touch autoscaling, increased efficiencies in resource utilization and no further need for capacity management. The unit of compute is more fine-grained than a virtual machine or a container, as it is scoped to a single unit of custom application logic. Figure 16 illustrates the key characteristics of serverless technologies. Serverless technologies introduce a microbilling model by which you pay only for the number of transactions and for the memory and CPU of the compute instance handling each transaction, for the time it takes to execute. Furthermore, you're billed for additional services that may be required to execute the function (e.g., an API gateway that collects inbound requests). Serverless computing services such as AWS Lambda, Azure Functions or Google Cloud Functions may seem like the most cost-effective solution because you pay only for what you use. However, there is a tipping point beyond which they become cost-prohibitive and you reach a position of diminishing returns.
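As a rough illustration of this tipping point, the sketch below estimates monthly fPaaS compute cost from per-request and per-GB-second billing dimensions. The prices, the $70/month instance figure and the workload profile are illustrative assumptions, and the free tier and ancillary services are ignored:

```python
def lambda_monthly_cost(requests, avg_duration_ms, memory_gb,
                        price_per_million=0.20,
                        price_per_gb_second=0.0000166667):
    """Estimate monthly fPaaS compute cost from transaction volume,
    duration and memory. Prices are illustrative; the free tier and
    ancillary services (e.g., an API gateway) are ignored, although
    the latter can dominate at high volume."""
    request_cost = requests / 1_000_000 * price_per_million
    gb_seconds = requests * (avg_duration_ms / 1000) * memory_gb
    return request_cost + gb_seconds * price_per_gb_second

# Compare against an always-on instance at a hypothetical $70/month:
# past a certain request volume, the instance becomes cheaper.
instance_cost = 70.0
for monthly_requests in (1_000_000, 50_000_000, 500_000_000):
    cost = lambda_monthly_cost(monthly_requests,
                               avg_duration_ms=200, memory_gb=0.5)
    winner = "fPaaS" if cost < instance_cost else "instance"
    print(f"{monthly_requests:>11,} req/mo -> ${cost:>9,.2f} ({winner} cheaper)")
```

At low volume the microbilling model wins easily; at sustained high volume, the always-on instance does. This is exactly the break-even analysis the calculators mentioned below help you build.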
As a consequence, organizations should adapt their application architectures to leverage serverless technologies only for appropriate use cases. Two case studies illustrate opposing returns for the use of fPaaS serverless computing:
In the cost-prohibitive case study, the self-managed option turned out to be far more affordable than using fPaaS. However, the author also had the luxury of in-house expertise and staff to manage the software components required to run the service in a self-managed environment. For organizations that don't have IT operations in place, the premium of serverless computing may still be a better choice than hiring a full team. If you're unsure whether your application will be more or less cost-effective when using serverless technologies, you can build estimates using purpose-built tools. Aside from the cloud providers' cost calculators, Serverless Cost Calculator and Servers.LOL are two community projects that help build a forecast for serverless. Use these calculators to mimic your application usage and assess whether the adoption of serverless computing may serve to optimize your cloud costs. Factor in operational costs as you make your comparison with self-managed alternatives.

Many organizations start consuming cloud computing by rehosting (aka lift and shift) applications from their on-premises data centers to a public cloud provider. A rehost migration strategy does not require changes in the application architecture. Although easier to migrate, rehosted resources are typically unable to leverage key characteristics of cloud computing, such as elasticity and on-demand provisioning. As a consequence, rehost strategies tend to have a low-to-negative ROI. provides a framework for selecting a migration strategy that aligns to your goals in terms of speed of migration, ROI and other desired benefits. Rehosted applications primarily make use of IaaS services such as virtual machines and storage volumes. These services provide dedicated allocation-based resources that organizations pay for regardless of their usage. Furthermore, IaaS services carry operational overhead.
Organizations must pay for the team in charge of managing the software running on top of the operating systems. Conversely, platform services, such as application PaaS, databases, load balancing, caching and message queuing services, include a management layer that cloud providers offer in an as-a-service model. Modernizing your applications for PaaS allows you to optimize costs due to:
To modernize your application for PaaS you can, for example, replace the instances that host load balancer virtual appliances with Amazon Elastic Load Balancing (ELB). You can use Microsoft Azure SQL Database to replace the instances that host Microsoft SQL Server and reconfigure your connection strings without changing much of the application code. Amazon Kinesis Data Streams or Microsoft Azure Event Hubs are typically more cost-effective than provisioning, maintaining and exposing APIs from self-managed Kafka clusters. Kafka is complex open-source software that requires high degrees of availability and reliability, which is difficult to set up and maintain for most IT departments. However, just as with serverless technologies, using PaaS does not imply a cost reduction compared with an equivalent self-managed option. Use cost calculators and mimic your application usage to assess whether the adoption of a PaaS may serve to optimize your cloud costs. Include an estimate of the reduction in your operational costs, as that is key to making PaaS more attractive. Take advantage of PaaS by modernizing your application to better operate in the context of cloud computing. Analyze your application dependencies and seek opportunities to replace them with PaaS where requirements and constraints allow.

Driving cost optimization through changes in your application architecture allows you to modernize your applications and better align them to cloud-native principles. Although such optimizations may take longer to materialize compared with the techniques in the Reduce component, they come with side benefits such as increased resiliency and scalability. By skipping this framework component, you will not fully maximize your savings opportunities and you may leave behind the cost benefits that derive from the adoption of cloud-native principles.
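To make the operational-cost factor in such comparisons concrete, the sketch below compares a self-managed deployment with a managed alternative. All figures are hypothetical; only the structure of the comparison (infrastructure plus licenses plus operational labor) is the point:

```python
def monthly_tco(infrastructure, licenses, ops_hours, ops_hourly_rate):
    """Monthly total cost of ownership: infrastructure plus licenses
    plus the operational labor that a managed service would absorb."""
    return infrastructure + licenses + ops_hours * ops_hourly_rate

# Hypothetical figures for a self-managed Kafka cluster vs. a managed
# streaming service: the managed option has a higher infrastructure
# price but removes most of the operational labor.
self_managed = monthly_tco(infrastructure=1200, licenses=0,
                           ops_hours=40, ops_hourly_rate=80)
managed_paas = monthly_tco(infrastructure=2000, licenses=0,
                           ops_hours=4, ops_hourly_rate=80)
print(f"self-managed: ${self_managed:,}/month")  # -> $4,400/month
print(f"managed PaaS: ${managed_paas:,}/month")  # -> $2,320/month
```

A comparison limited to the infrastructure line would favor the self-managed option; including the labor a managed service absorbs can reverse the conclusion.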
Evolve
The Evolve component of this framework (see Figure 17) illustrates the strategic capabilities needed to apply the cost management practice throughout the organization. You must adopt the right set of tools for financial management. You must drive cost optimization through optimal workload placement across multiple cloud providers. You will continue to shift budgeting accountability to your cloud consumers and incentivize them to take more financial responsibility. Ultimately, you will identify which business KPIs you can correlate with your cloud costs to measure the return on your investments in cloud services. This component brings the rest of the cost management framework to fruition and evolves the practice to achieve scale.

Adopt Tooling
To implement financial management processes, you must use purpose-built tools. The dynamism and scale of cloud deployments make cost management unsuitable for spreadsheet-based approaches. You must employ real-time tools that can read metrics from APIs and provide the automation required for this practice to scale. You must adopt the management tools that cloud providers offer natively. But you must also augment them with third-party tools and possibly develop your own extensions when necessary. See for the Gartner methodology on developing your management tooling strategy.

Adopt Native Tooling
Major public cloud platforms are equipped with a broad set of native management tools. Such tools are highly integrated with the cloud platform and provide a high depth of functionality. For example, Amazon CloudWatch and Microsoft Azure Monitor can gather unique metrics about their respective cloud platforms that no other tools can aspire to collect. Native tools are available to all client organizations with no additional deployment effort required. Some of these tools come free of charge, while others may be charged with a consumption-based model.
Cloud providers continue to invest in their native management toolsets with frequent additions of new features and services. For providers, management tools are also a vehicle to make their cloud platform stickier by improving the customer experience. Due to their depth of functionality, integration and readiness, Gartner recommends that organizations develop their cloud management strategy starting from the adoption of the cloud providers' native tools. Such tools include cost management functionality. Figure 18 provides an example list of the native cost management tools of AWS, GCP and Microsoft Azure with reference to three components of this framework. AWS provides cost management through a series of tightly scoped and loosely coupled tools. Microsoft Azure has strengthened its native functionality by acquiring the multicloud cost management tool Cloudyn in June 2017. Microsoft intends to continue migrating Cloudyn functionality into the native Azure portal and rebrand it as Azure Cost Management. However, at the time of writing, the migration hasn't been completed and Cloudyn continues to be available as a stand-alone tool. Google provides minimal tooling for cost management, and client organizations need to rely primarily on BigQuery and Data Studio to get a handle on their costs. Cloud providers' native tools come with some limitations, for instance:
Despite these limitations, native tools remain the fastest route to start controlling your costs. Prioritize the adoption of these native tools before considering the addition of third-party or in-house developed solutions. However, once you have mastered these native capabilities, you must conduct a functionality gap analysis and identify the cost management requirements that remain unaddressed. To help with this identification, Gartner has assessed and compared the native cost optimization capabilities of major cloud providers in

Adopt Third-Party Tooling
To address functionality gaps in native tools, for multicloud management or simply to gain an independent point of view, you may want to adopt a third-party cost management tool. Managing costs and reducing the cloud bill is a compelling functionality that third-party vendors have built into their product sets. Such functionality helps build a tool's ROI because it provides tangible financial returns on the investment in the tool's adoption. Due to such quantifiable returns, third-party cost management tools have achieved good market traction and have driven several M&A events (see Note 1). However, organizations must be wary of vendor promises and should thoroughly assess a tool's added value. Many third-party tools in the market provide functionality that's barely equivalent to what AWS or Microsoft Azure already provide natively. Sometimes, their multicloud capabilities suffer from poor feature parity between supported providers. Therefore, organizations must thoroughly assess the capabilities for each provider they intend to use. provides several cost management and resource optimization criteria that organizations can use to develop their evaluation framework. Organizations can find aspects of cost management functionality in the following types of tools:
To help organizations assess the depth of functionality of cloud cost optimization tools, Gartner has published a report that compares five vendors, selected based on Gartner client interest. provides an assessment based on the same set of criteria as those used for assessing the cloud providers' native functionality. Gartner clients can use the two reports jointly to determine the best combination of tools. Third-party cost management tools provide functionality that can exceed what cloud providers natively implement. Furthermore, their support for multiple cloud platforms allows organizations to implement a multicloud management strategy. The compelling story and provider independence of such tools will allow them to continue to receive investments in the near future. Assess the addition of a third-party tool as part of your management strategy to extend the cloud providers' native functionality and to gain independence.

Develop Extensions
Although rapidly moving and expanding, the cloud cost management market is far from mature. Cloud providers' native tools are just beginning to build out functionality. Sometimes, cloud providers take a "building blocks" approach, leaving it up to client organizations to develop what's necessary to glue the blocks together. Even if more advanced, third-party tools still focus primarily on IaaS and are just starting to address the PaaS space. As a consequence, you may have to develop your own extensions when functionality is not available or when it requires integration. Gartner doesn't recommend developing an entire cost management system in-house. When developing your own extensions, you must:
For example, you may want to develop code that terminates all the cost-accruing services within a development environment when it violates a budget policy. Other times, you may want to develop a policy that deletes unused capacity of a cloud service that is not supported out of the box by the tools you're using. Although I&O technical professionals have traditionally operated with point-and-click interfaces, cloud computing makes code increasingly important for cloud management. Certain cloud provider functionality may be accessible only through code, such as policies written in JSON. You can also drive automation by developing code through fPaaS as described in Although third-party tools may abstract away the need for coding, they may also introduce constraints on the available functionality. As a consequence, you must learn how to code to develop the extensions required to implement this framework in its entirety.

Onboard New Providers
Cloud providers offer similar services but with different capabilities and prices. Although their services are designed to address similar use cases, the differences in implementation may result in cost savings when running an application in one provider versus another. Organizations that want the most cost-effective provider for each workload must develop multicloud strategies, which start with the onboarding of new providers. Cloud technologies are faster to adopt than data center technologies. The adoption of cloud services does not require lengthy vendor selection, procurement processes, capacity allocation and contract negotiation. However, the adoption of a new cloud vendor suitable to run enterprise-grade workloads still requires a number of onboarding tasks, including:
Most organizations are already using or planning to use multiple providers, and Gartner expects that most enterprises will end up in a multicloud scenario for both IaaS and PaaS. Developing the competency to operate alternative providers allows you to mitigate concentration and lock-in risks. Furthermore, multicloud strategies allow you to define workload placement policies based on cost drivers, as described in the next section, Broker Cloud Services. For more information on the benefits of multicloud adoption, see

Broker Cloud Services
Multicloud strategies require you to develop a workload placement policy. This policy governs the decision of the target cloud provider for your applications. As part of this framework, you must develop the cost-based policies that allow you to place workloads in the most cost-effective platform. Comparing costs between cloud providers is no easy task. Often, an application requires a different architecture in each provider to deliver the same set of requirements in terms of performance, availability, integrity and confidentiality. This is due to different technology platforms, available services, design principles, HA strategies, security and SLAs. Before producing a comparative forecast, you may have to adapt your application architecture using the principles described in the Architect With Cost in Mind section of this framework. Sometimes, organizations also use cost drivers to govern placement decisions between public cloud providers and on-premises data centers. In such cases, to build an "apples-to-apples" comparison, Gartner recommends developing an on-premises cost model as described in Furthermore, you should not focus simply on the pure infrastructure cost comparison. Gartner recommends building a multiyear TCO and ROI that takes into account future cost savings from cloud-induced efficiencies.
This process is described in More information on developing a workload placement framework for multicloud and hybrid cloud can be found in

Shift Budget Accountability
The self-service nature of cloud services is fostering a decentralized approach to IT resource procurement. Many organizations experience scenarios in which LOBs, departments and government agencies independently start IT projects using cloud services without involving central IT. As IT service consumers become more autonomous, they also must take partial responsibility for disciplines that were once the remit of central IT only. These disciplines include monitoring and security, and should also include cost. This shift in responsibility does not mean that central IT will eventually no longer be relevant. On the contrary, it will continue to act as an enabler and as a "second line of defense" to protect the business from risk. In this scenario, more autonomous users can make procurement decisions that you don't control but that have an impact on the cloud bill. Once the bill arrives, it is central IT that gets called out to manage the economics of cloud usage. You can certainly apply a centralized cost reduction practice to remove detected waste. While effective, this practice can also be disruptive and a potential source of frustration. A centralized-only practice does not scale as more users gain the power to provision resources. As a consequence, you must shift budget accountability to cloud consumers to influence their provisioning decisions. An accountable user is motivated to size resources more precisely and to remove those that aren't necessary. Shifting accountability requires a cultural change and does not happen overnight.
To initiate this process and lay the foundations for this shift to happen, you must:

Formalize the budget approval as you enable user access to cloud services: Application or project owners who request cloud services for their workloads must commit to an amount of monthly spend and be held accountable for it. As you create cloud accounts for your consumers, help them build a spending forecast as described in the Plan component of this research. Then, formalize the budget approval based on the produced forecast. Having users ask for the authorization to spend money helps shift accountability more than having IT force a budget upon them. You can automate the budget request and approval using a service management platform as described in

Provide consumers with visibility into their spending: Provide them with dashboards and reports that help track their actual spending on a daily basis and compare it against their commitments. Set up alerts when their spend is on track to exceed the approved budget. See Alert on Anomalies in the Track component for more information on alerts that help proactively address spending issues.

Cloud consumers who feel accountable for their spend will consider cost optimization key to hitting their budget goals. In this context, central IT must not present itself as a law enforcement body. With this mindset, providing visibility into spending and recommending actions that help drive costs down will be well received by your cloud consumers. Shifting accountability does not mean you won't have to centrally control costs and reduce spending waste. But having more accountable cloud consumers will lower the need for centralized corrective actions, allowing you to drive efficiency at scale.

Incentivize Financial Responsibility
Some cloud governance bodies don't possess the authority to centrally remediate spending issues. This authority may sit exclusively with the resource owners.
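The on-track-to-exceed alerting mentioned in the Shift Budget Accountability steps above can be approximated with a simple run-rate projection. This is a minimal sketch with hypothetical figures, not a replacement for the providers' budget alert features:

```python
import calendar
from datetime import date

def budget_alert(spend_to_date, monthly_budget, today):
    """Project end-of-month spend from the month-to-date run rate and
    flag when the projection exceeds the approved budget."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    projected = spend_to_date / today.day * days_in_month
    return projected, projected > monthly_budget

# A team approved for $10,000/month has spent $6,000 by April 15
# (a 30-day month): projected $12,000, so the alert fires.
projected, over = budget_alert(6000, 10000, today=date(2020, 4, 15))
print(f"projected: ${projected:,.0f}, over budget: {over}")  # -> $12,000, True
```

Firing the alert weeks before month-end gives the accountable consumer time to act, rather than discovering the overrun on the bill.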
In other cases, your measures for shifting budget accountability may not be sufficient to make cloud consumers care about how much they spend. In such situations, or simply to accelerate the shift in budget accountability, you must further incentivize your cloud consumers to take ownership of their spend. For example, you can "gamify" the cost management practice and create healthy competition between the teams in charge of cloud provisioning. You can maintain and share leaderboards that rank teams based on their spending discipline. A team's position within the leaderboard can trigger behavioral changes, making users more attentive to what they spend and how they reduce their waste. The leaderboard should contain the following metrics for the current month and for each team:
All metrics should also include an increase/decrease indication from the previous month's value. You can also define scoring rules based on the tracked values and establish a ranking. These rules should consider that the absence of spending waste is preferable to a high number of pursued cost optimization opportunities. Lastly, you can award winners with team dinners, team-building activities and other incentives.

Correlate Costs to Business Value
The ultimate goal of a cost management practice is to correlate cloud costs to business value. Driving costs down must not come at the expense of fully supporting business goals. To avoid this, you should stop considering cloud costs as mere costs and start treating them as investments. Then, you must correlate them with business KPIs and calculate the return on those investments. For example, Netflix measures its business value by the total number of active streams; that is, how many people are currently watching content online. Correlating that KPI with its cloud costs allows Netflix to ensure that spending growth does not outpace the growth of active streams. Figure 19 shows Netflix's "normalized cost per active stream" over time and the goal of keeping that line as flat as possible. Growth in this metric would signal cost inefficiencies. A drop in this metric would be a sign of better economies of scale. Depending on your industry and organizational goals, you must identify which KPIs you can correlate with cloud costs. For example, a KPI could be the number of billable air miles per seat for an airline, the number of monetary transactions for a bank or the number of issued passports for a government agency. Even if you're adopting cloud to increase your internal efficiency, this efficiency must eventually translate into the growth of business KPIs.
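A normalized unit-cost metric of this kind is computed by dividing spend by the chosen KPI and tracking the ratio over time. The figures below are hypothetical:

```python
def cost_per_kpi(monthly_cloud_cost, kpi_value):
    """Unit economics: cloud spend normalized by a business KPI
    (active streams, transactions, issued passports, ...)."""
    return monthly_cloud_cost / kpi_value

# Hypothetical figures: spend grows month over month, but the KPI
# grows faster, so unit cost falls -- better economies of scale.
months = [("Jan", 100_000, 2_000_000),
          ("Feb", 115_000, 2_500_000),
          ("Mar", 130_000, 3_100_000)]
for name, cost, streams in months:
    unit = cost_per_kpi(cost, streams) * 1000
    print(f"{name}: ${unit:.2f} per 1,000 streams")
```

Note that absolute spend rises every month in this series; only the normalized view reveals that the investment is becoming more efficient, which is why the unit metric, not the raw bill, is the one to keep flat or falling.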
Developing the capabilities to operationalize and evolve your cost management practice is the last fundamental component of controlling your cloud costs. Defining your tooling strategy and evolving the practice to embrace multicloud, spending decentralization and correlation to business metrics will allow you to develop a more strategic approach to cost governance.

Risks and Pitfalls
I&O technical professionals in charge of managing cloud costs must be wary of the following risks and pitfalls:
Related Guidance
The following documents constitute an integral part of this research:
Gartner Recommended Reading
Some documents may not be available as part of your current Gartner subscription.

Note 1
The following M&A events have occurred in the public cloud cost management space in recent months (in chronological order of announcement):