Is Big Data Analytics Critical for Smart IT Management and Support?
With online applications all running 24x7, what does it take to solve problems faster when they do happen?
The weekend is my time to check bill payments, bank accounts, taxes, and insurance questions, and of course to connect with my friends. Everything is online, and the sheer convenience of technology makes my life simpler. But the pleasure is sometimes interrupted by “strange” errors on online portals that refuse to cooperate with my sense of urgency to get the weekend chores done.
My pet peeve is my mobile bill payment site. I click the “pay” button a few times, despite warnings from the website not to click repeatedly. I see no response and inevitably need to dial the call center to solve my problem each time. On every attempt I am given a 24-48 hour time frame within which my problem will be resolved, but by then late fee charges have already been added to my next bill, because I was making the payment on the last day of my billing cycle!
Since I work on IT management technologies and products, I often get to hear the other side of the story from clients that run these online applications. So I understand what is happening behind the scenes, and typically I am more patient and accept the long wait. But many of us must be wondering what is really going on! Technology helps us live an “online” life in most respects, so why can’t we manage it more efficiently and avoid such big glitches? With online applications all running 24x7, what does it take to solve problems faster when they do happen?
Well, just like regular life, the online world is complex too: there are glitches due to bugs in software, and there are operational issues. We just need to solve them quickly enough to hide the complexity underneath. I like the website of my mobile service provider, but I don’t want to be exposed to the complexity of its IT systems when something does not work.
Let’s continue with my favorite example, the telecom bill payment portal, not because I have the most problems with it personally, but because I have spent some time building management tools for such applications and got to know the complexity behind the scenes. These applications depend on many technologies, run across many servers, store data in many databases, interact with third-party financial systems, and are all connected by networks. Already sounding complex? With so many moving parts, anything may fail, and the failure can take time to find and fix.
Here’s a possible flow of what happens when I click the “pay” button and wait. Back-end software is invoked that takes my user details from a database, matches the amount I am trying to pay against what is due and when, asks me for further payment details, and invokes the third-party payment gateway, which processes the payment with a financial institution. Based on the result, it updates my account details in the database and finally sends me a confirmation. Now imagine thousands of users trying to pay their bills on the same billing site, and millions of transactions reaching the financial institution’s systems across many such online payment sites.
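To make that flow concrete, here is a minimal Python sketch of the steps. Everything in it is an assumption for illustration: the `Account` record, the in-memory `accounts` dictionary standing in for the database, and the `fake_gateway` function standing in for the third-party payment gateway are hypothetical names, not how any real billing portal is built.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    """Hypothetical user record; a real portal would read this from a database."""
    user_id: str
    amount_due: float
    payments: list = field(default_factory=list)


class PaymentError(Exception):
    """Raised when any step of the illustrative flow fails."""


def fake_gateway(user_id, amount, card_token):
    # Stand-in for the third-party gateway: approves any non-empty token.
    return {
        "approved": bool(card_token),
        "reference": f"REF-{user_id}-{int(amount * 100)}",
    }


def pay_bill(accounts, user_id, amount, card_token, gateway=fake_gateway):
    # Step 1: look up the user details.
    account = accounts.get(user_id)
    if account is None:
        raise PaymentError("unknown user")
    # Step 2: match the amount against what is due.
    if amount <= 0 or amount > account.amount_due:
        raise PaymentError("amount does not match what is due")
    # Step 3: hand the payment details to the gateway.
    result = gateway(user_id, amount, card_token)
    if not result["approved"]:
        raise PaymentError("gateway declined the payment")
    # Step 4: update the account and build the confirmation message.
    account.amount_due -= amount
    account.payments.append(result["reference"])
    return (
        f"Payment {result['reference']} confirmed; "
        f"remaining due: {account.amount_due:.2f}"
    )


accounts = {"u1": Account("u1", 50.0)}
print(pay_bill(accounts, "u1", 50.0, "tok-123"))
```

Even in this toy version there are four places where the transaction can fail, and each one would leave its own trace in a different system.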
Systems and software, especially for financial transactions, are well tested. They are designed to avoid incorrect transactions at all costs so that no one loses money, but they may still fail and cause irritation. I have seen applications that were configured without judging peak loads appropriately. In one case, a promotional campaign for bill payments was announced for a certain period, but the number of users who accessed the application went beyond what it was configured to handle! So the problem that fails my bill payment transaction can be anywhere in this complex system.
The key point is this: when I lodge my complaint with the call center and the application support engineer is called in, how does he detect the problem in minutes and not 24-48 hours? The IT systems produce a huge amount and wide variety of data that traces most activities, such as those in the payment transaction flow. This is much like the health monitoring parameters a human body produces. The engineer has all the data, but he struggles to access it easily and then make sense of it to find the problem. Often, only a few people know enough about these applications to find the problem patterns buried deep inside the data. Similarly, one can generate a lot of health test data and medical and family history for a patient, but it takes an experienced doctor to put it all together and diagnose the illness.
Today, I see IT support engineers do this sort of analysis manually. They collect and dig into application and system logs to look for problems that occurred around the time the end user faced a problem on the website. This takes hours and rarely captures shareable knowledge! It is like a manual crime scene investigation: the engineer is a detective searching through all sorts of data, trying to find hints to link together, eliminate alternatives, and solve a puzzle.
This is exactly what Big Data analytics tools are supposed to help with. They are supposed to make data gathering and handling simpler and reduce the time the manual analysis takes. Imagine the support engineer has a Google-like system in which all the data is collected and indexed, and a search bar is provided to ask ad-hoc questions of the data and analyze patterns. This is the way my kid now uses Google to answer all her questions before coming to me.
Given a Big Data analytics system, when I log a complaint ticket, the engineer can immediately “search” for problems around the time when Anindya (that's me) had a problem. What was the error he saw? What context did he mention in his complaint? What errors and warnings were seen across the IT systems around the same time? Did other users who were accessing the same system complain about the same problem? Big Data analytics tools help answer these ad-hoc queries and piece the answers together to solve the puzzle in a shorter time. We have seen application support engineers cut their problem diagnosis time from hours to minutes in large production systems.
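The heart of that “search” is a time-window query across events from many systems. Here is a minimal Python sketch of the idea, assuming the logs have already been collected into one list of records. The `EVENTS` data, the field names, and the `search_around` function are all made up for illustration; a real analytics tool would run this kind of query against an index, not an in-memory list.

```python
from datetime import datetime, timedelta

# Hypothetical, hand-written events standing in for indexed logs.
EVENTS = [
    {"time": datetime(2024, 1, 6, 10, 2), "source": "web",
     "level": "INFO", "message": "pay button clicked"},
    {"time": datetime(2024, 1, 6, 10, 3), "source": "gateway",
     "level": "ERROR", "message": "connection to bank timed out"},
    {"time": datetime(2024, 1, 6, 10, 4), "source": "database",
     "level": "WARNING", "message": "slow query on accounts table"},
    {"time": datetime(2024, 1, 6, 14, 30), "source": "gateway",
     "level": "ERROR", "message": "connection to bank timed out"},
]


def search_around(events, when, window_minutes=5, levels=("ERROR", "WARNING")):
    """Return errors and warnings logged within the window around `when`."""
    window = timedelta(minutes=window_minutes)
    hits = [
        e for e in events
        if e["level"] in levels and abs(e["time"] - when) <= window
    ]
    return sorted(hits, key=lambda e: e["time"])


# The time mentioned in the (hypothetical) complaint ticket.
complaint_time = datetime(2024, 1, 6, 10, 5)
for event in search_around(EVENTS, complaint_time):
    print(event["time"].isoformat(), event["source"], event["level"], event["message"])
```

The same function answers the follow-up questions too: widen the window, or change `levels`, and the engineer gets a different slice of the same data without collecting logs again.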
Besides reducing the time to diagnose issues, another goal should be to reduce the skills required and enable more people to do the job faster. As experts save their knowledge about known patterns to search for and how to resolve problems, it becomes community knowledge that others can easily find and apply. For example, when many users complain about failed payments, 90% of the time it is about the configuration of the payment gateway that connects to the banks. This sort of automatic expert guidance helps scale up the workforce that solves our IT systems problems faster.
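One simple way to picture saved expert knowledge is a list of known patterns, each paired with the advice an expert would give. The Python sketch below is only an assumption about how such a list might look: the `KNOWN_PATTERNS` entries, the regular expressions, and the advice texts are invented for illustration.

```python
import re

# Hypothetical knowledge base: each entry pairs a pattern with expert advice.
KNOWN_PATTERNS = [
    {
        "pattern": re.compile(r"gateway.*(timed out|refused)", re.IGNORECASE),
        "advice": "Check the payment gateway configuration for the bank connection.",
    },
    {
        "pattern": re.compile(r"slow query", re.IGNORECASE),
        "advice": "Review database load and indexes.",
    },
]


def suggest_fixes(messages):
    """Return the advice for every known pattern found, without duplicates."""
    advice = []
    for message in messages:
        for entry in KNOWN_PATTERNS:
            if entry["pattern"].search(message) and entry["advice"] not in advice:
                advice.append(entry["advice"])
    return advice


for tip in suggest_fixes([
    "Gateway connection refused by bank",
    "slow query on accounts table",
]):
    print(tip)
```

A less experienced engineer who runs the error messages from a ticket through such a list gets the expert's first guess immediately, and each newly solved problem can be added as one more entry.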
So, I am waiting for the day when my support call can solve the problem in minutes! Ideally, IT systems should work behind the scenes and not make their presence felt through problems that take long to fix. Big Data analytics looks very promising for taking us closer to that goal of online nirvana.