Everything you need to know about Known Error Database (KEDB)
Here you'll find everything you need to know about KEDB, along with real-life IT examples from my consulting experience.
What is KEDB?
There are three ITIL® terms that you need to be familiar with to understand KEDB. These include incident, problem and known error.
When you face an unplanned interruption to an IT service, it is referred to as an incident. For example, if your email service goes down without notice from your provider, this could be tagged as an incident.
A problem is the underlying cause of an incident. Simply put, this is the thing that caused the issue in the first place. In the example above, the reason behind the email outage is the problem. Let's say that the root cause of a problem is identified. Now it's no longer a problem, but a known error. For the email incident, the root cause is identified as a critical service on the email server that had hung. So, what was once a problem is now a known error.
A KEDB is a database of all such known errors, recorded as and when they happen and maintained over time.
Why do you need KEDB?
How exactly does a business justify expending capital and operational costs on the database?
Getting back to the email incident, let's say that the hung critical service was identified only after running a number of diagnostics and carrying out a series of tests. Once it was identified, the resolution itself was quick: the service was stopped and restarted. But getting to that resolution took plenty of effort and, more importantly, cut into precious time. While the diagnostics and resolution were being applied, the email service was down. This could result in penalties imposed by customers, and in intangible losses such as missed future business opportunities and reduced customer satisfaction.
However, this organization that provides email services to its customers maintains a KEDB, and this particular incident was recorded. When the email service goes down again, the technical support team can simply refer to the previous outage in the KEDB and start diagnosis with the service that caused the issue last time. If the same service turns out to be causing the issue, resolution now happens within a fraction of the time. As you can see, this greatly reduces downtime and all the other negative effects that stem from service outages. This is KEDB in action!
A KEDB record will have details of the incident, when the outage happened and what was done to resolve it. However, for a speedy resolution, the KEDB must be powerful enough to retrieve relevant records using filters and search keywords. Without a KEDB in place, service management organizations tend to reinvent the wheel time and again, rather than working toward building a mature organization that allocates its funds toward improving services.
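To make the idea of filters and search keywords concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular service management tool works: the field names, the record IDs, the dates and the search function are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class KnownErrorRecord:
    """One KEDB entry. Field names are illustrative assumptions."""
    record_id: str
    service: str       # the affected IT service, used as a filter
    summary: str       # details of the incident
    symptoms: str
    workaround: str    # what was done to resolve it
    occurred_on: date  # when the outage happened


def search(records, keywords, service=None):
    """Return records that contain every keyword, optionally filtered by service."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for rec in records:
        if service is not None and rec.service != service:
            continue  # filter: skip records for other services
        text = f"{rec.summary} {rec.symptoms} {rec.workaround}".lower()
        if all(k in text for k in wanted):
            hits.append(rec)
    return hits


# Sample data; the IDs and dates are made-up placeholders.
kedb = [
    KnownErrorRecord(
        "KE-001", "email", "Email service outage",
        "Users cannot send or receive email; a critical service on the server is hung",
        "Stop and restart the hung service", date(2024, 1, 15),
    ),
    KnownErrorRecord(
        "KE-002", "printing", "Office printer not printing",
        "Printer does not respond to print jobs",
        "Send files to the common printer in the foyer", date(2024, 2, 3),
    ),
]

matches = search(kedb, ["email", "hung"], service="email")
for rec in matches:
    print(rec.record_id, "->", rec.workaround)
```

A real KEDB would sit on a proper database with indexed search, but the principle is the same: the faster the support team can narrow down to the relevant record, the faster the service is restored.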
Workaround and permanent solution
When there are service outages, there are two ways of restoring service. The first, and the ideal one, is a permanent solution. A permanent solution is a fix that prevents the same outage from recurring, at least as far as that particular cause is concerned. The second, and most common, type of restoration is the workaround, which is a temporary, alternate solution. A workaround is generally followed by identifying and implementing a permanent solution at a later date.
In the email service outage, restarting the service is a workaround. The technical staff knows that this will solve the issue for the moment (which is of vast importance), but that the issue is bound to recur in the future. Before the incident occurs again, it's on the technical team to investigate why the service becomes unresponsive and to find a permanent solution.
Let's look at another classic example that I have used time and again during trainings. This one really drives home the concepts of workaround and permanent solution. Imagine that the printer in your cabin stops working and you need it right away. You log an incident with your technical staff, stating that you are about to get into a client meeting and you need to print some documents. The support person determines that he is unable to fix the printer in time and provides you with a workaround: send your files to a common printer in the foyer.
The workaround helps, as your objective is to get the prints and run into a meeting. But, there's no way you want the hassle of having to do this every time you need to print. So, when the meeting is over, you push for a permanent solution. When you return, your printer is working and there is a note from the support staff stating that the power cable was faulty and has been replaced. This is a permanent solution. And while there's a chance that the new cable could also go faulty, the odds are in your favor.
In a nutshell: A workaround is a temporary fix. A permanent solution is, as the term states, permanent.
Why did I discuss workaround and permanent solution in a post that is aimed at KEDB? Known errors exist because the fix is temporary. The known error database consists of records where a permanent solution does not exist, but a workaround does. Once a permanent solution is implemented for a known error record, the record can be expunged or archived for evidence. Known error records with an implemented permanent solution must not be a part of the KEDB in principle.
This concept is further built upon in the next section where we'll talk about the various process trees for creating, using and archiving known error records.
KEDB in action
Now that you know what a KEDB is and what it contains, let's talk about how and when it gets recorded, used and maintained.
These are the three streams where KEDB is leveraged:
1. When an incident is resolved using temporary means, a known error record is created with the incident summary, description, symptoms and all the steps involved in resolving it.
Suppose a user has reported that the MS Outlook application crashes when emails start to download. The technical staff, in order to minimize the service outage, advises the customer to access webmail until the issue is resolved. The details of the incident, along with the symptoms and temporary resolution steps, are to be recorded in a new known error record.
2. When an incident is reported, the support team refers to the KEDB first to check if a workaround exists in the database. If it does, they will refer to the known error record and follow the resolution steps involved. If the fix provided turns out to be inaccurate, the support staff can recommend alternate resolution steps so that the KEDB stays high on quality.
Let's say that at another time and place, the MS Outlook application starts to crash. The technical staff can refer to the KEDB to check what was done on previous occasions, and can recommend the workaround to the customer until a permanent solution is in place.
3. If a permanent solution to a known error is identified and implemented, the incident should not happen anymore. So, the known error record is either taken out of the KEDB or archived with a different status. This is done to ensure that the database holds only the open known errors, and that accessing records does not become cumbersome due to a high volume of known error records.
While the user is accessing the email service via webmail, the issue is investigated and a Bluetooth extension is identified as the cause of the Outlook crash. The permanent solution is to disable the extension or even uninstall it. This solution is implemented not only on the Outlook installation that crashed, but on all the systems accessing Outlook, to ensure the same incident doesn't happen again. After implementing and testing the permanent solution, the known error record can either be archived with a pre-defined status or deleted.
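The three streams above can be sketched as a small Python class. This is only a rough illustration: the class, method and status names are my own assumptions and do not come from ITIL or from any specific tool, and a real KEDB would persist its records rather than hold them in memory.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "active"      # workaround exists, no permanent solution yet
    ARCHIVED = "archived"  # permanent solution implemented and tested


@dataclass
class KnownError:
    summary: str
    symptoms: str
    workaround: str
    status: Status = Status.ACTIVE
    permanent_solution: str = ""


class KEDB:
    """Hypothetical in-memory known error database."""

    def __init__(self):
        self._records = []

    def record(self, summary, symptoms, workaround):
        """Stream 1: create a known error after a temporary resolution."""
        entry = KnownError(summary, symptoms, workaround)
        self._records.append(entry)
        return entry

    def lookup(self, keyword):
        """Stream 2: find active known errors that mention the keyword."""
        kw = keyword.lower()
        return [
            r for r in self._records
            if r.status is Status.ACTIVE
            and kw in f"{r.summary} {r.symptoms}".lower()
        ]

    def archive(self, entry, permanent_solution):
        """Stream 3: archive the record once a permanent solution is in place."""
        entry.permanent_solution = permanent_solution
        entry.status = Status.ARCHIVED


kedb = KEDB()

# Stream 1: the Outlook incident is resolved with a workaround and recorded.
ke = kedb.record(
    "MS Outlook crashes",
    "Outlook crashes when emails start to download",
    "Access email via webmail until the issue is resolved",
)

# Stream 2: the same incident is reported again; support checks the KEDB first.
found = kedb.lookup("outlook")

# Stream 3: the permanent solution is implemented, so the record is archived.
kedb.archive(ke, "Disable or uninstall the Bluetooth extension")
after = kedb.lookup("outlook")
```

In this sketch, archived records stay in the list for evidence but no longer show up in lookups, which mirrors the idea of keeping the working set limited to known errors that still need a workaround.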
ITIL® is a Registered Trade Mark of the Cabinet Office.