This topic is a useful jump point for conversations around identifying Best Practices, and making their application reliable and consistent across the enterprise. This is what The APM Practice is focused on – helping to ensure that your organization’s DNA lives on, as new tools and technologies are brought to bear in the evolution of your enterprise.
Originally posted on Quora
IT (Information Technology) is about managing the application of computing technology (hardware, software and networking) to business problems and environments. In the not-so-distant past, nearly every organization had some in-house capability, ranging from the college student who kept your PCs up to date (software and hardware), to hundreds of individuals overseeing and contributing to the implementation and operation of the computing resources of a large enterprise. It required significant budgets, planning and project management to keep up with the changing technology, development and testing, and operation of the resulting physical plant.
Software Engineering is about reliably building systems of software that fulfill business/commercial and personal requirements. It is largely independent of any specific hardware or environment (platform independence), and a single individual can literally “change the world” with a novel application, with little more investment than their time to gain the expertise.
With the world literally moving to cloud-based computing resources and applications, the need for private computing centers is fading, and along with it the ‘profession’ of IT. The efficiencies and automation of cloud operations mean a single individual is now responsible for work that required literally hundreds of staff – and that was only a few years ago.
Even the concept of a computer workstation is shifting to mobile and assistants like Alexa. Still early days, for sure, but the glory days of the CIO and an army of IT staff are over and done.
What remains is to preserve the useful practices of IT as a foundation for enhanced automation – the lessons learned and tools developed (that are not already commercialized or open-sourced). Everything that IT performed, as individual agents in a large and hierarchical organization, becomes subsumed into the practice and domain of the Software Engineer, and their specialist roles in Systems, Architecture and Security.
If you are going to do Enterprise APM, you need a full APM stack: logs, commands, synthetics, real user and byte code instrumentation. There is always an advantage in going with a single vendor vs. ‘best-of-breed’, and the consolidation is just a reflection of the industry maturation. Sure, it might look like the “Death Star” but it is still smaller than the hundreds of apps present in an established enterprise – even if you limit your APM to tier-1 services.
Regarding the “declining shares” of the big players, I think this has more to do with the decline in infrastructure management (IM) vs. growth of APM. You become proactive in your performance management by catching problems pre-production – rather than taking precise measurements of the crater you make in production.
Requirements are nice, but an APM initiative need not begin at the start of a new application life-cycle. The reality is that a menu of apps is available: some mature, some problematic, some of little consequence – and a very few ready to begin with a ‘greenfield’ of clean requirements. You need only assess what proportions are on the ‘menu’ and you can devise a program that will bring them all into good alignment.
Most important is to actually begin with the ‘stable’ applications so that you can practice your deployment and configuration of APM and learn what your stakeholders really want for performance metrics. Then move on to problematic and ‘greenfield’ apps.
Every client environment is different and simply taking an academic and rational approach to performance requirements is going to leave a lot of gaps – and support will evaporate. Better to get going now and implement some visibility quickly and show what you can do – rather than what would be ideal to have.
The very last thing I did at CA – and I mean right before I turned in my badge – was to do a Q&A about why I wrote the APM Book and what it means for CA, customers and the industry at large. I never checked if it was actually used… and now I find it on the YouTube! CA Press Author Series: Mike Sydor
So if you have missed having me on a conference call or leading a customer initiative, you can re-live the magic. All done in one take with the video team pleading: “please don’t move – we keep getting glare from your glasses” – Arg! I can still feel the cramps.
You can find more links on the book and related articles and publications here.
For sure, transaction volumes and response times are the first two metrics to consider.
But that is not what makes APM a complex undertaking.
Everybody, for years, has been MONITORING volumes and response time. Actually DOING SOMETHING with the data – this is the real gap.
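To make the monitoring half concrete, here is a minimal sketch of what “volumes and response time” boils down to, assuming transactions arrive as hypothetical (epoch_seconds, duration_ms) records – real APM agents emit far richer events, but the two headline metrics reduce to this:

```python
def summarize(transactions, interval_s=60):
    """Bucket raw transaction records into per-interval volume and p95.

    `transactions` is a list of (epoch_seconds, duration_ms) tuples --
    a hypothetical record format for illustration only.
    """
    buckets = {}
    for ts, dur_ms in transactions:
        # Group each transaction into its time bucket (default: 1 minute)
        buckets.setdefault(int(ts // interval_s), []).append(dur_ms)
    report = {}
    for bucket, durs in sorted(buckets.items()):
        durs.sort()
        # Nearest-rank 95th percentile, clamped to the last element
        p95 = durs[min(len(durs) - 1, int(0.95 * len(durs)))]
        report[bucket] = {"volume": len(durs), "p95_ms": p95}
    return report
```

Producing a table like this is the easy part; routing it to the people who can act on it is where the rest of this post comes in.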
There are many organizational impediments that conspire to bog down the sharing of this critical information. The real focus of APM (where the ‘M’ == management) is to facilitate collaboration on the data. Getting the data to the right people in the organization and giving them a mechanism to effect the change needed to improve or restore performance – this is the gap that the APM vendors still need to work on addressing.
Monitoring without paying attention to collaboration – and you miss the APM value. Nobody picks up a “hammer” and instantly knows how to build a house! The tools today are great ‘hammers’…
Monitoring + Organizational Processes (Collaboration) – and APM delivers. No matter whose ‘hammer’ you happen to have.
Don’t count out Synthetics from your APM strategies
For sure, synthetics are an important strategy, especially after hours when traffic is too low for real-transaction volume to be consistent. But if synthetics are all you have – you’re going to be in trouble! Those marketing sites are apparently not being managed. There isn’t much point in monitoring if there is no one to respond to the incident!
Any APM strategy has to allow for those apps/services that are initially unmanaged and could quickly be illuminated with log monitoring or synthetics. For tier-2 and tier-3 apps – this could be all they ever need. But for any app/service of significance you need to quickly follow up with deeper visibility – to whatever that implementation can support. Our first job is to get some visibility, any way we can. Our second job is to start using this information to improve the software service – not simply alert that it has ‘hit the fan’. Otherwise, you end up with a practice that measures “craters” with great precision – but never learns how to avoid the accidents in the first place.
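The “get some visibility, any way we can” step for an unmanaged tier-2/3 app can be as cheap as a single synthetic transaction – status plus response time. A sketch, with the thresholds purely illustrative and the URL an assumption, not a recommendation:

```python
import time
import urllib.request

def synthetic_check(url, timeout_s=5, slow_ms=2000):
    """Fire one synthetic transaction and classify the result.

    Returns (status, elapsed_ms) where status is "OK", "SLOW" or "DOWN".
    Thresholds are illustrative; tune them per service.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            elapsed_ms = (time.monotonic() - start) * 1000
            if resp.status >= 500:
                return ("DOWN", elapsed_ms)
            return ("SLOW" if elapsed_ms > slow_ms else "OK", elapsed_ms)
    except Exception:
        # Connection refused, DNS failure, timeout: all count as DOWN
        return ("DOWN", (time.monotonic() - start) * 1000)
```

Run on a schedule, this gives you a heartbeat even overnight when real traffic is too thin – which is exactly the after-hours gap synthetics fill.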
Completely Automatic Performance Regression!
I always use a load profile that ramps up for 5 minutes, then steady state for 10-20 minutes, and then ramps down for 5 minutes. I need to focus my analysis on that steady-state period. How do I achieve that with your tool set?
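In the absence of tool support, trimming to the steady-state window is simple to do yourself. A sketch, assuming metric samples as hypothetical (epoch_seconds, value) pairs and the ramp durations from the profile above:

```python
def steady_state(samples, test_start, ramp_up_s=300, ramp_down_s=300,
                 test_end=None):
    """Keep only samples from the steady-state window of a load test.

    `samples` is a list of (epoch_seconds, value) pairs -- a
    hypothetical format. `test_end` defaults to the last sample seen.
    Defaults match a 5-minute ramp-up and 5-minute ramp-down.
    """
    if test_end is None:
        test_end = max(ts for ts, _ in samples)
    lo = test_start + ramp_up_s   # steady state begins after ramp-up
    hi = test_end - ramp_down_s   # and ends before ramp-down
    return [(ts, v) for ts, v in samples if lo <= ts <= hi]
```

Averages and percentiles computed over the trimmed window are then free of the ramp distortion that makes whole-run statistics misleading.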
Log Data Outranks Traditional Data Sources for Network Operations Management
Key phrase: Network Operations – and that should be no surprise.
While the network is an important component of an APM initiative, it is usually the last area to participate collaboratively. (I’ve seen things!) And logs are always part of the APM toolset. But logs alone only help a few folks, leaving nothing for the application stakeholders.
Successful APM is achieved when all of the ‘traditional IT silos’ are contributing and participating – network, security, middleware, mainframe, operations. APM focuses on transaction and application characteristics but you cannot get to “root cause” unless the other silos participate – otherwise, full-on “blame game”. Getting those silos to collaborate – it’s all about the process – and not the technology – as this article reminds us.
So if your network group lives on logs – that’s cool. Can they baseline performance? Do they share the KPIs? Do they proactively notify when things ‘go south’? Do they post configuration changes so that you can correlate your transaction degradations with ongoing network issues?
Building data landfills, based on logs, and consumed by a few specialists – I don’t get the benefit. But reduce the data to KPIs and baselines, update proactively and contribute to the single view of end-2-end service performance… that’s what I would put some budget chips on! 😉
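“Reduce the data to KPIs and baselines” can be sketched in a few lines: keep a rolling window of a KPI and flag anything that strays too far from the baseline. The window size and sigma threshold below are illustrative assumptions, not recommendations:

```python
from collections import deque

class Baseline:
    """Rolling baseline for a KPI, flagging samples that 'go south'.

    Keeps the last `window` values and flags a new value more than
    `k` standard deviations from the rolling mean. Parameters are
    illustrative; tune per KPI.
    """

    def __init__(self, window=60, k=3.0):
        self.values = deque(maxlen=window)
        self.k = k

    def update(self, value):
        flagged = False
        if len(self.values) >= 10:  # need some history before judging
            n = len(self.values)
            mean = sum(self.values) / n
            var = sum((v - mean) ** 2 for v in self.values) / n
            stddev = var ** 0.5
            flagged = abs(value - mean) > self.k * stddev
        self.values.append(value)
        return flagged
```

The point is the reduction: instead of shipping a landfill of raw log lines, each silo publishes a handful of baselined KPIs that everyone can correlate against.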
From what angle should we be looking at APM as it relates to IT strategy? | response only
How about the angle of “value to the business”? It doesn’t matter if you are monitoring infrastructure, real-time performance or end-user experience – or all three – if the business is not deriving a value from that investment. DevOps especially is really about software/service quality. Monitoring is just a means of getting there.
So does infrastructure monitoring help me improve software/service quality? Nope. Just lets me measure the size (and frequency) of the crater. A lot of folks cluster applications not to get performance but because they can’t fix the memory leaks or other instabilities. So the gap is in getting code fixed – not measuring the craters!
So does real-time performance monitoring improve software/service quality? Nope. I get to see what is breaking, in great detail. But I don’t have a business-impact context to let me prioritize where I should be spending my spare software-fix budget. So I throw a few more instances into the mix… and hope I meet my SLA.
So does end-user-experience improve software quality? Nope. I finally get to see the business context, specifically which transactions and what business impact – but I don’t get much of a clue to what is broken – other than it is transaction #47… and it’s costing us a ton!
To make sense, APM needs all of the info. So let’s throw in network traffic monitoring (different from infrastructure) and logs (for those apps we can’t instrument or that are using RPCs, whatever). And hopefully I’ll have visibility into what needs to get fixed. But then you have to have processes on which to employ all these measurements, and, when you catch problems, a mechanism to correct them. And this is something best done before you even hit production – but that’s a lesson for another day.