There is no day in the life of an engineer without problems to resolve. As engineers, we often focus on building new tools or on improving the performance of an existing function, sometimes by a factor of ten. This focus on the new and cool is great in fast-paced, emerging business lines.
However, once we develop a business function or capability, we need to quickly operationalize it and optimize it to run well. When something goes wrong with that operational function, we need to quickly identify the underlying problems and correct them, so that we stop wasting time and money on work that adds no value for the platform’s users.
In the end, if we can’t deliver on our basic processes, no one will trust us with anything else. This is where problem management is crucially important. Here are the five basic steps of problem management:
- Problem Identification
- Categorization and Prioritization
- Root Cause Analysis
- Resolution
- Closure
The first step in solving a problem is realizing you have one. Problem consciousness is really important. In the early stages of problem management, problems are everywhere, and it typically takes little effort to find something broken and attempt to fix it. Some organizations use an incident management function to feed their problem management process. They run a postmortem on every major incident to find problems, expose their root causes, and drive organizational change to fix them. Other problems become harder to find as systems grow more complex and incident volumes rise.
Some operational entities, for instance, perform about 400,000 server repairs a year. These repairs arrive as server repair tickets. The tickets are often driven by factors such as component quality, software tooling, software automation, or even transactional defects. It is challenging to spot the systemic problems when they are buried in a sea of data.
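One way to surface systemic problems in a pile of tickets is simply to count what drives them. The sketch below is a minimal illustration, not a description of any real ticketing system: the ticket records, the field names (`cause`, `component`), and the `min_share` threshold are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical repair tickets; the field names are assumptions for illustration.
tickets = [
    {"id": 1, "cause": "component quality", "component": "DIMM"},
    {"id": 2, "cause": "component quality", "component": "DIMM"},
    {"id": 3, "cause": "software tooling", "component": "NIC"},
    {"id": 4, "cause": "component quality", "component": "SSD"},
    {"id": 5, "cause": "transactional defect", "component": "DIMM"},
]


def top_drivers(tickets, key, min_share=0.2):
    """Return (value, count, share) for each value of `key` whose share of
    all tickets is at least `min_share`, largest count first."""
    counts = Counter(t[key] for t in tickets)
    total = sum(counts.values())
    return [
        (value, count, count / total)
        for value, count in counts.most_common()
        if count / total >= min_share
    ]


if __name__ == "__main__":
    for value, count, share in top_drivers(tickets, "cause", min_share=0.5):
        print(f"{value}: {count} tickets ({share:.0%})")
```

Running the same function with `key="component"` gives a second cut of the same data. A driver that dominates one of these cuts is a candidate for the problem management process.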
Once a problem is detected, one needs to quickly categorize it to help understand its significance and the appropriate urgency required to resolve it.
A few basic categorizations of a problem include:
- Impact of problem (pain point): How does this problem prevent us from operating normally?
- Value if resolved: When presented with a problem, we can either ignore it or attempt to fix it. Some problems are so small that we ignore them; they are not worth fixing. Other problems are huge and need a solution right damn now! Thus, it’s important to weigh the value of fixing the problem.
- Problem complexity: When we try to prioritize which problems to fix first, it’s really helpful to understand the complexity of the problem at hand. Some complex problems require a great deal of support and preparation to solve. Others, not so much. It is good to become a big fan of solving the high-impact, low-complexity problems first.
- Availability of required skill sets: Solving problems in a big, complex organization requires people who are knowledgeable in the key problem areas. It is super helpful to know, at a high level, what resources are required to solve the problem. Consider keeping an inventory of those resources to avoid wasting time searching for them later.
- Estimated time and cost to resolve: Often a rough order of magnitude (ROM) is all that’s needed during the categorization phase of managing a problem. Form a combat math estimate of the time and resources required to resolve the issue.
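The categories above can be turned into a rough ranking. The sketch below is one possible scoring scheme, not a standard formula: the 1-to-5 scales, the `impact * value / complexity` score, the penalty for missing skills, and the example problems are all assumptions chosen to illustrate "high impact, low complexity first".

```python
from dataclasses import dataclass


@dataclass
class Problem:
    name: str
    impact: int             # 1 (minor annoyance) to 5 (blocks normal operation)
    value: int              # 1 (not worth fixing) to 5 (fix it right now)
    complexity: int         # 1 (simple) to 5 (needs heavy support and preparation)
    skills_available: bool  # are the required people on hand?
    est_days: float         # rough order of magnitude (ROM) estimate


def priority(p: Problem) -> float:
    """Assumed score: reward impact and value, penalize complexity."""
    score = (p.impact * p.value) / p.complexity
    if not p.skills_available:
        score *= 0.5  # assumed penalty: finding the right people costs time
    return score


def prioritize(problems):
    """Return problems ordered from highest to lowest priority score."""
    return sorted(problems, key=priority, reverse=True)


# Hypothetical backlog for illustration only.
backlog = [
    Problem("label typo", 1, 1, 1, True, 0.5),
    Problem("firmware rewrite", 5, 5, 5, True, 90),
    Problem("flaky repair tooling", 4, 4, 1, True, 3),
    Problem("vendor part defect", 4, 4, 2, False, 30),
]

if __name__ == "__main__":
    for p in prioritize(backlog):
        print(f"{priority(p):5.1f}  {p.name}  (~{p.est_days} days)")
```

The time estimate is printed next to the score and deliberately kept out of it, since a ROM estimate is usually too coarse to rank on.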
Once you have the problem adequately categorized, move on to the next phase.
In this phase of the problem management process, one needs to dig deep into the problem to identify the root causes. This often requires some form of a root cause analysis process to be followed. There are all kinds of processes for this step:
- Fishbone diagrams
- Drill downs
- 5 Whys
- Brainstorming
- Chronological analysis
- Failure mode and effects analysis
The key is to make sure your approach is systematic and thorough. Separate the symptoms of the problem from the root causes of those symptoms. Treating a symptom as if it were the cause will not fix the problem; it will only mask it. Once you know the root causes of the problem, put one or more fixes in place to correct them.
Some organizations go further and define countermeasures (CM) to contain the spread of a problem. Engineers often call these countermeasures approved exceptions, workarounds, and so on. Think of a countermeasure as a cone in the road that alerts you to a pothole. Some organizations also define corrective actions, or long-term corrective actions (LTCA), which lay out the steps required to fix the root causes. Think of these as the paving crew resurfacing the road with the pothole.
In this phase of problem management, we go back to the basics of middle school science class: the scientific method. It has been a basis of science since the 17th century. We know it works.
The Scientific Method steps:
- Make an observation
- Ask a question
- Form a hypothesis
- Conduct an experiment
- Accept/reject hypothesis
Rinse and repeat this cycle until you have identified and implemented all of the changes required to properly address the root causes of the problem.
Granted, we don’t always need to follow the scientific method if the fix is obvious. Some fixes are super obvious and easy to measure for success.
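When the fix is not obvious, the experiment and accept/reject steps can be made concrete with a simple statistical check. The sketch below assumes the hypothesis is "the fix lowered the failure rate" and compares failure counts before and after the change with a one-sided two-proportion z-test. The test choice, the counts, and the 0.05 significance level are assumptions for illustration; other experiments will call for other checks.

```python
import math


def two_proportion_z_test(fail_before, n_before, fail_after, n_after):
    """One-sided two-proportion z-test.

    Null hypothesis: the failure rate did not drop after the fix.
    Returns (z, p_value); a small p_value supports the hypothesis
    that the fix reduced the failure rate.
    """
    p_before = fail_before / n_before
    p_after = fail_after / n_after
    pooled = (fail_before + fail_after) / (n_before + n_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if se == 0:
        raise ValueError("cannot test: pooled failure rate is 0 or 1")
    z = (p_before - p_after) / se
    # Upper-tail probability of the standard normal distribution: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: failures out of servers observed, before and after.
    z, p_value = two_proportion_z_test(120, 1000, 80, 1000)
    alpha = 0.05  # assumed significance level
    verdict = "accept" if p_value < alpha else "reject"
    print(f"z = {z:.2f}, p = {p_value:.4f}: {verdict} the hypothesis")
```

If the result is not convincing, that is the signal to form a new hypothesis and run the cycle again instead of closing the problem.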
Don’t be in a hurry to make changes and close a problem until you are confident that the problem is resolved. Closing a problem prematurely only yields new problems that will need to be researched. You will also lose credibility as a problem solver if your investigations fail to address the root causes.
Run a postmortem on the problem investigation. This can be helpful to guide and develop organizational learning. Keep track of your problem investigations over time. Data regarding these efforts can help continuously improve organizations.
That’s pretty much it. Good luck investigating problems. If you have any feedback or suggestions to improve this note, please share.
Credit: BecomingHuman By: The AI LAB