
Operational Excellence Starts with a Strong Foundation

Why IT professionals must make reliability a number-one feature.

Cynthia Stoddard (CIO (US))
26 March, 2019 05:40

Cloud computing is growing by leaps and bounds. Over 70 percent of companies have shifted at least part of their IT operations to the cloud, according to the International Data Group.

To be competitive in the growing cloud-based business market, enterprises need to bolster their cloud services' reliability. Customer-centric attributes like continuous uptime and ease of use are table stakes. Honing these attributes isn't easy. Today, every service -- whether internal or customer-facing -- is part of a larger ecosystem. A break or outage can have a ripple effect.

The key is understanding and anticipating your internal and external customers' challenges.

Operate at Scale

As we move workloads to the cloud, we have to design for reliability from the start. At the engineering design phase, we need to think about how the service will perform at scale -- and then apply engineering principles to predict, observe, recover, and learn about products. If we put the right design principles in place, and build reliability into the code, the service will support customer growth with consistent availability.

Anticipate Failure

Given all the interdependencies between services, microservices and platforms, failure is inevitable. We need to work toward understanding what could possibly go wrong, and engineer around those failures so that services remain resilient.

Netflix has led the way in troubleshooting cloud technology. During its initial shift to streaming, Netflix purposely unleashed failures and abnormalities into its cloud infrastructure to find the best ways to respond. Inducing failures in a controlled environment prevented consumer frustration down the line.
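To make the idea concrete, here is a minimal sketch of controlled failure injection in Python. It is not Netflix's tooling. Every name in it (`with_fault_injection`, `fetch_recommendations`, `resilient_fetch`, `run_experiment`) is hypothetical, and the "service" is a stand-in function. The sketch wraps a dependency so that a chosen fraction of calls fail on purpose, then counts how often the caller had to fall back to a degraded response.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency outage."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    # Hypothetical downstream call; a real one would cross the network.
    return [f"title-for-{user_id}"]


def resilient_fetch(fetch, user_id, fallback):
    """Return the live result, or the fallback if the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFault:
        return fallback


def run_experiment(calls=1000, failure_rate=0.2, seed=42):
    """Run the experiment and return how many calls were served degraded."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_fault_injection(fetch_recommendations, failure_rate, rng)
    fallback = ["popular-title"]
    degraded = 0
    for user_id in range(calls):
        if resilient_fetch(flaky, user_id, fallback) is fallback:
            degraded += 1
    return degraded
```

Calling `run_experiment(failure_rate=0.2)` reports how many of the 1,000 simulated requests were served from the fallback. The point of the exercise is that every injected failure is absorbed by the caller and none reaches the customer as an error.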

Unleash Automation

Automation can help cloud operations teams identify and fix issues with speed and precision -- allowing humans to focus on more creative problem-solving.

At Adobe, our self-healing platform, powered by AI and machine learning, identifies patterns and learns how to solve problems based on past experiences. Failures that took IT workers 30 minutes to remediate manually are now remediated automatically in under 3 minutes.
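The principle of learning from past fixes can be illustrated with a deliberately simple Python sketch. This is not Adobe's platform and it involves no machine learning. It is a plain success-rate tally, and the class name `RemediationAdvisor` along with the symptom and action strings are all hypothetical. It records which remediation worked for which symptom, recommends the action with the best track record, and hands unseen symptoms back to a human.

```python
from collections import defaultdict


class RemediationAdvisor:
    """Recommends the fix with the best past success rate for a symptom."""

    def __init__(self):
        # symptom -> action -> [successes, attempts]
        self._history = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def record(self, symptom, action, succeeded):
        """Store the outcome of one remediation attempt."""
        stats = self._history[symptom][action]
        stats[1] += 1
        if succeeded:
            stats[0] += 1

    def recommend(self, symptom):
        """Return the best-performing action, or None for unseen symptoms."""
        actions = self._history.get(symptom)
        if not actions:
            return None  # nothing learned yet: escalate to a human
        return max(actions, key=lambda a: actions[a][0] / actions[a][1])


advisor = RemediationAdvisor()
advisor.record("disk_full", "rotate_logs", True)
advisor.record("disk_full", "rotate_logs", True)
advisor.record("disk_full", "restart_service", False)
print(advisor.recommend("disk_full"))
print(advisor.recommend("cert_expired"))
```

A production system would weigh far more signals than a success count, but the division of labor is the same: routine, well-understood failures are fixed automatically, and people are pulled in only for the cases the system has not seen before.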

Automation plays a key role in helping firms operate at scale and anticipate failures.

Build a Strong Foundation

I used to have a team in Chennai, India. Whenever I would visit them, I marveled at the Shore Temple, an eighth-century stone building on the Bay of Bengal. When I went back after the tsunami in 2004, I was certain the temple would be damaged or destroyed. To my surprise, it was untouched, thanks to its solid foundation.

That's how I like to think about cloud operations. Growing a cloud business has its challenges. But if you have the right foundation, and spend time up-front thinking about operating at scale, anticipating failure, and unleashing automation, you can handle whatever storm comes your way.