No two days are alike for a software engineer tackling large, complex problems in a fast-paced tech environment. Before getting into the details of solving complex problems, let’s look at the problem-solving process in general.
At a high level, there are six phases to tackling a large-scale problem:
- Identify the opportunity
- Analyze and look for patterns
- Define a high-level strategy
- Create simple, fast solutions
- Deliver, quantify, and communicate
- Refine the vision and scale the solution
We will go over each of the six phases in detail and talk about:
- The objective of each phase
- Duration and typical actions taken
- Goals we set and the expected outcome before moving on to the next phase
Rendering reliability will be used as a working example to make each phase more concrete for software engineers.
Software engineers are never short of problem-solving opportunities at their companies. The first step is to eliminate the noise and identify the fundamental problems that are most impactful to the business. In this step, you validate the list of opportunities with supporting data points such as SEVs, SLA tasks, and logging data.
There is no set duration for this phase; looking out for large classes of issues (rather than constantly fixing small bugs) should be part of our day-to-day operation.
Let’s take the following example of identifying the rendering reliability opportunity:
Product teams have usually relied on manual QA tests to prevent incorrect rendering of online ads. However, a large number of rendering variations based on device, platform, and ad product types can make it hard to ensure reliability. Even small corner cases can have a significant revenue impact, and these cases can be introduced by ad or non-ad teams alike, since rendering is built on top of a large shared codebase. Since these issues can hurt advertiser trust, revenue, and engineering productivity, rendering reliability stands out as a top problem for a tech company.
The expected outcome before moving to the next phase is to find one or a few investments that can deliver high impact for the business. Moving on without narrowing the list may lead to many parallel analyses, which is very time-consuming. Based on the example above, rendering reliability is one of the key investments that could have high business impact (improved advertiser trust, fewer refunds, and avoiding bad PR that may tarnish the company’s reputation).
It is often tempting to skip further investigation, since everyone wants to deliver a solution and show impact as soon as possible. Before diving into a solution, identify the few critical subsets of problems that make up the bigger problem. Not investing time in this phase may lead to solutions that aren’t highly impactful.
Time-boxing this phase helps avoid delving too far into tail-end issues. This phase usually takes 1–2 months. Typically we set “understand” goals and aim to develop conviction about the problem space and the impact we can generate by eliminating the problem.
Avoid jumping in and fixing problems one product, one surface, or one format at a time. Rather, spend the time using data to analyze classes of problems, their impact, and their recurrence.
Given the case of rendering reliability, below are some examples of major issues which are commonly encountered in tech companies:
- Cropping and sizing errors: These include cases where images were incorrectly cropped, resized or zoomed, or thumbnails for video ads not correctly rendered.
- Missing components or empty attachment: These include cases where attachment is empty, attachment has missing components (partial rendering) or images don’t load.
- Wrong content: These include cases where wrong creatives or thumbnails from other ad campaigns or organic posts are rendered for an ad.
The expected outcome before moving to the next phase is to dissect the opportunity further into classes of problems. In the example above, focusing on image cropping, wrong images, and missing components is likely to yield the most success.
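As a minimal sketch of this kind of analysis, the snippet below counts issue reports by class and ranks the classes by their share of the total. The records, field names, and category labels are made up for illustration; in practice they would come from whatever task or logging system the team already uses.

```python
from collections import Counter

# Hypothetical issue reports. The categories mirror the three classes
# described above; the individual records are illustrative only.
ISSUES = [
    {"id": 1, "category": "cropping_sizing"},
    {"id": 2, "category": "missing_component"},
    {"id": 3, "category": "cropping_sizing"},
    {"id": 4, "category": "wrong_content"},
    {"id": 5, "category": "cropping_sizing"},
    {"id": 6, "category": "missing_component"},
]


def rank_issue_classes(issues):
    """Return (category, count, share) tuples, largest class first."""
    counts = Counter(issue["category"] for issue in issues)
    total = sum(counts.values())
    return [(category, n, n / total) for category, n in counts.most_common()]


if __name__ == "__main__":
    for category, n, share in rank_issue_classes(ISSUES):
        print(f"{category}: {n} issues ({share:.0%})")
```

Ranking by count is the simplest possible weighting. Weighting each report by an estimate of its revenue impact or recurrence would follow the same shape.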
Instead of taking the tactical route of fixing each problem in isolation, you can invest a few weeks in developing strategies for solving the classes of problems. This includes analyzing possible ways to solve the problem, investigating the tradeoffs of different approaches, brainstorming with the broader team, and arriving at a unified approach. It is equally important to identify key metrics that will measure the impact, although it is difficult to commit to goals during this phase. In this phase we slow down a bit in order to run faster later.
Let’s take a deep dive into the rendering pipeline to understand its touch points, its code complexity, and the stages at which problems can be detected and prevented. We can broadly group the opportunities into three major stages:
- Development stage detection — Detect rendering issues at the time of coding/development
- Pre-rendering detection & prevention — Detect & prevent rendering issues at the time of ads delivery
- Post-rendering detection — Detect rendering issues on the client-side, validate the final rendering
The expected outcome before moving to the next phase is a clear strategy for how to develop a solution. In the above example, we can create a three-stage strategy to improve rendering reliability. We can then create three smaller teams to start working toward a solution.
Once the team starts producing impact, it can incrementally add layers of reliability, scalability, and monitoring.
Having concrete goals that define the failure and success states is critical. The solution most likely won’t cover the whole problem domain yet; however, the parts it does cover should yield results that help you build confidence in your solution.
The main goal is to create a solution that helps us succeed fast or fail fast. Although scalability is one of the considerations, it isn’t the primary goal for many tech companies:
- For detection during the development stage, we can start with screenshot testing that sends an automated email alert when a mismatch occurs, and have an engineer manually analyze the emails and tasks and evaluate the business value.
- For pre-rendering detection, we can leverage logging, manually analyze the false positives, and create rules to detect rendering problems on the server side. This process can be laborious, but after adding some rules we can arrive at a solution that shortens the dev cycle.
- For post-rendering detection, we can enable client-side logging, analyze those logs to identify rendering problems, eliminate false-positives, and add rules.
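To make the rule-based approach more concrete, here is a minimal sketch of what a first set of pre-rendering rules could look like. Everything in it is an assumption for illustration: the `AdPayload` fields, the two rules, and the 10% aspect-ratio tolerance are hypothetical and not taken from any real ads system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AdPayload:
    """Hypothetical subset of the data available just before delivery."""
    ad_id: str
    image_url: Optional[str]
    image_width: int
    image_height: int
    slot_width: int
    slot_height: int


# A rule returns a problem description, or None when the ad passes.
Rule = Callable[[AdPayload], Optional[str]]


def missing_image(ad: AdPayload) -> Optional[str]:
    return "missing image" if not ad.image_url else None


def aspect_ratio_mismatch(ad: AdPayload, tolerance: float = 0.1) -> Optional[str]:
    dims = (ad.image_width, ad.image_height, ad.slot_width, ad.slot_height)
    if min(dims) <= 0:
        return "non-positive dimension"
    image_ratio = ad.image_width / ad.image_height
    slot_ratio = ad.slot_width / ad.slot_height
    # A large relative difference suggests the image will be cropped badly.
    if abs(image_ratio - slot_ratio) / slot_ratio > tolerance:
        return "aspect ratio mismatch"
    return None


RULES: List[Rule] = [missing_image, aspect_ratio_mismatch]


def check_ad(ad: AdPayload, rules: List[Rule] = RULES) -> List[str]:
    """Run every rule and collect the problems found for this ad."""
    problems = []
    for rule in rules:
        problem = rule(ad)
        if problem is not None:
            problems.append(problem)
    return problems


if __name__ == "__main__":
    ad = AdPayload("ad-1", None, 100, 100, 600, 314)
    print(check_ad(ad))
```

Keeping each rule as a small function makes it cheap to add a rule after reviewing logs, and just as cheap to remove one that produces too many false positives.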
The expected outcome before moving to the next phase is to demonstrate a simple, working solution that helps address the problem. The solution at this phase is neither highly mature nor scalable. However, it is important to have basic building blocks to make it a working solution. This includes monitoring and test automation. In the above example, we can create three simple solutions just to prove that we are able to detect and prevent rendering problems.
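For the development-stage solution, a screenshot test boils down to comparing a new render against a stored baseline and flagging the change when too many pixels differ. The sketch below assumes screenshots have already been decoded into rows of pixel tuples, and the 1% threshold is an arbitrary placeholder; a real setup would use an image library and a threshold tuned on actual renders.

```python
def mismatch_ratio(baseline, candidate):
    """Fraction of pixels that differ between two same-sized images.

    Each image is a list of rows, and each row is a list of pixel tuples.
    """
    same_shape = len(baseline) == len(candidate) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(baseline, candidate)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in baseline)
    if total == 0:
        return 0.0
    differing = sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(baseline, candidate)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )
    return differing / total


def is_regression(baseline, candidate, threshold=0.01):
    """True when the share of changed pixels exceeds the threshold."""
    return mismatch_ratio(baseline, candidate) > threshold


if __name__ == "__main__":
    white = (255, 255, 255)
    black = (0, 0, 0)
    baseline = [[white, white], [white, white]]
    candidate = [[white, black], [white, white]]
    print(mismatch_ratio(baseline, candidate), is_regression(baseline, candidate))
```

An exact pixel comparison like this is noisy across devices and fonts, which is one reason the early version still relies on an engineer reviewing the alerts.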
Creating the right communication channels and cadence with customers, leadership and partners will help increase the visibility of the work and allow critical feedback to flow in. It is vital to have key metrics to measure the impact and make them a prominent part of the communication. The following are some examples of such improvement points:
- Rolling out some new rules to detect rendering problems
- Improving the impact of rendering problem detection
- Increasing partnership with related parties to increase traffic coverage for the existing rules
The expected outcome before moving to the next phase is a clear set of metrics to measure the impact, along with regular, consistent communication with partners and leadership. In the rendering reliability example above, there can be bi-weekly and monthly communication channels among partners and leaders.
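Two metrics that fit this kind of reporting are the precision of the detection rules (how many flagged ads were real problems) and the traffic coverage of the rules. The sketch below shows one way they might be computed; the function names and inputs are hypothetical, and the right metrics will depend on the team’s own goals.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of flagged ads that turned out to be real rendering problems."""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0


def coverage(checked_impressions: int, total_impressions: int) -> float:
    """Share of traffic that passed through the detection rules."""
    return checked_impressions / total_impressions if total_impressions else 0.0


if __name__ == "__main__":
    # Illustrative numbers only.
    print(f"precision: {precision(45, 5):.0%}")
    print(f"coverage: {coverage(300, 1000):.0%}")
```

Reporting both together helps avoid a misleading picture: very precise rules that see only a small slice of traffic still leave most rendering problems undetected.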
As the solution matures, ongoing investment can be reduced; however, adequate monitoring should be in place to watch for potential regressions. In many tech companies, systems evolve quickly, which may require adjusting or rethinking the solution.
In the rendering reliability example, we can define the end goals, prioritize them to drive toward completion, and reach out to other partners who can leverage the solution.
The sample chart below displays how the number of discovered issues can increase at the early stages while adding more detection. Once a solution that can help prevent issues is in place, the number starts decreasing. Increasing the coverage of the solution will usually push the number further down to a state where the problem is not “big” anymore.
Thank you for reading so far! One take-away from this article should be that given these tips and tricks, engineers should really no longer worry about tackling complex problems. So, let’s nail down your next challenge!
Credit: BecomingHuman By: The AI LAB