Troubleshoot Code Like a Homicide Detective

Troubleshoot Code Errors like a Top Homicide Detective

5 memorable steps to debug like a pro once and for all

It’s getting late, and your deadline is quickly approaching.

You’ve got to get your code checked in to source control in the next couple of hours. Then it happens…

A strange message glares at you from just feet away.

You’re in the middle of a frustrating scene again.

The message is covered in red – it’s misspelled… it’s ambiguous… it’s a fragment with poor grammar… and it’s very difficult to understand.

It just so happens it also provides your only starting point for what comes next… troubleshooting!

What is Troubleshooting?

According to Wikipedia, the answer to ‘what is troubleshooting?’ is “a systematic search for the source of a problem so that it can be solved.”

The word “systematic” is easily overlooked in this definition but it is critical.

“Systematic” implies that a set of repeatable steps is used to determine the source of the problem.

That’s exactly what’s missing from most programmers faced with troubleshooting.

They have never developed a system for troubleshooting steps.

Instead, they rush to apply a solution before absolutely identifying the source of the problem. They naively believe they’ll get it working in ‘just one more minute.’

As a result, they end up wasting hours and hours guessing about the right fix.

You see, consistently effective troubleshooting requires good habits. Those habits can shave hours off debugging, making you a much more efficient developer.

There is no doubt that it can be difficult to know which steps to take when a bug appears. Each attempt to troubleshoot seems to need a custom approach.

But by definition, troubleshooting requires a set of repeatable steps to be effective. That’s why I developed a troubleshooting framework to follow for every bug I encounter.

I have found that using this framework consistently saves me hours of frustration and makes me much more efficient when dealing with bugs.

I use a crime investigation analogy to help me remember the steps. I imagine I am a detective opening a case in pursuit of a suspect.

It works like this…

Step 1 – Visit the Crime Scene

Homicide detectives are called to the scene when a body is found – a body must be dealt with immediately. Detectives are pulled away from whatever they were doing so they can start their investigations.

Software developers have a similar experience when a bug appears – it must be dealt with right away, and they cannot move on until it’s handled.

Examine the Scene (Clarify the Issue)

I need to make sure I attempt to solve the right problem.

It is easy to waste time going off track early without a clear understanding of the real problem.

I ask myself the same 3 questions every time I start an investigation. Forcing myself to answer these questions ensures I attempt to solve the right problem.

Answer 3 Starting Questions. Answers to these 3 initial questions bring immediate focus to my investigation:

    1. What happened?

      For example, “An error message popped up.” Or, “My form did not respond when I clicked the ‘Submit’ button.”

    2. What did I expect to happen?

      For example, “My form would submit data and a success message with a welcome statement would appear.”

    3. How do the two differ?

      For example, “My form has always submitted in response to clicking on ‘Submit’ and I have never seen an error message pop-up.”

Collect Evidence (Gather Reference Material)

Next, I note anything that may be useful as a reference as I dive deeper into my investigation.

I start with the following:

      1. Note the time of the bug. Knowing the time my bug appeared will help me focus on the events that occurred around the same time. This will help me isolate relevant information later. For example, I can find relevant messages within error logs recorded around this time.
      2. Record which software component displayed the error. I pay attention to where signs of the bug appeared because I know each programming language can only communicate through specific channels. Knowing how the bug came to my attention helps me focus my investigation. For example, the full details of a “404 error” resulting from an AJAX request (its headers and response body) are found on the “Network” tab of my browser’s Developer Tools rather than on the “Console.” Likewise, an access error message from an Apache web server won’t appear in a browser’s pop-up window.
      3. Note anything familiar. I can speed up my investigation by taking a few minutes to recall whether or not I have seen something like this before. I avoid jumping to a solution, but I want to take advantage of anything that might be relevant from past experience. Sometimes I see behavior I recognize from a previous application or bug. I jot down any thoughts I have about when and what was involved.

For example, I was once troubleshooting a custom web site plugin permitting patients to compare nearby medical treatment costs. Patients selected a diagnosis code from a dropdown menu listed on a form. The database was full of records for every option listed, but no results were being returned.

I recalled a similar experience when working with an unrelated reporting application for energy industry software. That software offered its users filters to determine which data was included with each generated report. In that case, the problem had been caused by improper mapping between report filter inputs and their corresponding SQL statement parameters.

I noted the similarity in symptoms between the problems along with a note about how its cause was related to SQL – this note brought initial focus to my debugging effort later.
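
The notes from this part of the investigation can be kept in any form. As one possible sketch, assuming a Python project, they could be captured in a small record like the one below. The class name, its fields, and the sample values are hypothetical and only illustrate the three questions and the evidence items above.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class CaseFile:
    """Hypothetical note template for the first visit to the 'crime scene'."""
    what_happened: str
    what_was_expected: str
    how_they_differ: str
    observed_at: datetime      # time the bug appeared
    reported_by: str           # component that displayed the error
    familiar_notes: list[str] = field(default_factory=list)

    def summary(self) -> str:
        """Return a one-line summary that is easy to paste into a ticket."""
        return (f"[{self.observed_at:%Y-%m-%d %H:%M}] {self.reported_by}: "
                f"{self.what_happened} (expected: {self.what_was_expected})")


# Sample values are made up for illustration.
case = CaseFile(
    what_happened="Form did not respond when 'Submit' was clicked",
    what_was_expected="Form submits and a welcome message appears",
    how_they_differ="Submit has always worked; no error was shown before",
    observed_at=datetime(2024, 1, 15, 16, 42),
    reported_by="browser",
)
case.familiar_notes.append("Similar symptom once traced to SQL parameter mapping")
print(case.summary())
```

Keeping the timestamp as a real date value rather than free text makes it easier to line up against log entries later.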

Identify Initial Suspects (List All Software Components)

I make sure I know all the software components of my application. This includes the server and client components of my code as well as their dependencies.

Today, many technologies are often used to implement a single fully-functional application. This is becoming increasingly common as new frameworks and libraries are being introduced to speed up development. So I am careful to document the following:

      1. List server and client applications. I make separate lists of server-based code and client-based code that is part of my application. I list out all the server-side components (snippets, plug-ins, source files, extensions, etc.) written in PHP, ASP.NET, Java, and other server programming languages. Next, I list out all the client-side components such as JavaScript source, HTML and CSS file names.

It may seem tedious and time-consuming to take inventory of all server and client components in the application. But I have found this information to be vital when problems arise.

Once, I was developing a new PHP code snippet to collect a list of data through an AJAX request. I was sure I had properly queried the database and packaged the results in my new code. However, initial AJAX requests continued to report a response message I didn’t even recognize.

I spent hours carefully verifying each aspect of the new code snippet only to discover I had forgotten about a second PHP snippet – it was executed with every AJAX request. Sure enough, the second snippet included code returning the mystery message I saw.

If I had taken the time to complete this step, I would have saved hours of frustration.

      2. List application dependencies. I often use third-party libraries to speed up development. When a bug appears, it can be valuable to catalogue which ones I am using. It’s easy to overlook this, but sometimes an incompatibility appears between my code and the thousands of lines of code these dependencies often include. This list along with the previous list gives me a comprehensive view into all of the application code during debugging.

For example, it’s common to include the latest jQuery library early in web application development. But months into development that script reference is easy to overlook as having a role in creating errors. As new dependencies are added, they may generate competing jQuery references.

Just as a wide variety of suspects could be responsible for committing a crime, a variety of technologies could be responsible for my bug.

Having this handy list of components and their dependencies will also speed up any future communication I have about the bug.
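
Part of this inventory can sometimes be generated rather than typed. The sketch below assumes a Python codebase and uses the standard library's importlib.metadata module to list installed packages and their versions. The function name is my own, and a PHP or JavaScript project would read the equivalent list from its own package manifest instead.

```python
from importlib import metadata


def installed_dependencies() -> dict[str, str]:
    """Map each installed distribution name to its version string."""
    inventory = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if not name:
            continue  # skip broken installs that carry no name
        inventory[name] = dist.metadata.get("Version", "unknown")
    return inventory


# Print the inventory in alphabetical order so it can be pasted into notes.
for name, version in sorted(installed_dependencies().items()):
    print(f"{name}=={version}")
```

The output is only as complete as the environment it runs in, so it should be run in the same environment where the bug appeared.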

 

Step 2 – Interview Witnesses

Just after examining the body at the crime scene, a homicide detective conducts interviews. First-hand accounts from those present around the time a dead body appeared can prove invaluable in steering the investigation. Likewise, troubleshooting can benefit from information recorded about events near the time a bug appeared.

Get Eye-Witness Statements (Recall My Steps)

I need to know what happened just before the bug.

End users typically make statements like “it didn’t work” and “it just stopped working.” And when pressed for details about what they did differently, I get “nothing.”

That’s just not helpful.

So I will take the time to give myself the information I need to be effective.

I ask myself for the same information I would ideally get from users:

      1. Record my last 3 to 5 steps just before the bug appeared. In my experience, knowing those steps has almost always been the key to unlock the cause of any bug.
      2. Jot down a couple of details for each step. I go through each step and write down additional details that may prove useful during debugging later. The information helps me to set the debugging scenario correctly. For example, I may record that I filled out a form using copy-and-paste. Or, I may write down the exact option I clicked on from a dropdown just prior to a bug. Or, I may note which user I logged in as.

Get Other Witness Statements (Review History & Configuration)

My next step is to review settings that control software behavior and examine the activity reported while running.

I do the following:

      1. Check Configuration. There are two areas of configuration to consider – configuration files and software component configuration.

First, I note all files known to contain settings used by my application. For example, an ASP.NET application uses a file named “Web.config” to store application settings. A content management system like WordPress uses a file named “wp-config.php” to store system settings.

Second, I note all dependencies known to require configuration. My application may use libraries that rely upon me to specify settings in order to work. I make note of which dependencies require my configuration. I also write down where I have specified those settings. For example, whenever I use DataTables (a JavaScript UI library), I specify the configuration for each table instance in my own JavaScript file.

I have found that knowing where to find configuration that impacts my application speeds up my understanding of observations made while debugging. For example, I may see an unknown value appear while watching a variable. I quickly avoid confusion by recognizing that value has been set by default through configuration.

      2. Inspect Logs. Most software records at least some activity while running. Those records may be stored in a log file or a database.

I find all log files I can access and open them. I look inside for anything unusual reported around the time I noted in Step 1 – Visit the Crime Scene – when the bug appeared. I have consistently found invaluable clues about what happened to trigger bugs by examining logs.

Examples of log sources are:

      • The “Console” panel in web browser Developer Tools
      • The log4net file (like “log-file.txt”) in ASP.NET applications
      • Files like “php_error” and “apache_error” on web servers
      • Records in a proprietary table named “MESSAGE_LOG” from stored procedures running on a database
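
When a log is too long to scan by eye, the time noted in Step 1 can be used to narrow it down. Here is a minimal sketch in Python. It assumes every log entry starts with a timestamp in the form "YYYY-MM-DD HH:MM:SS", which will not match every log format, and the function name and sample lines are invented for illustration.

```python
from datetime import datetime, timedelta

# Assumed log line layout: "YYYY-MM-DD HH:MM:SS <message>"
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19


def lines_near(lines, bug_time, window=timedelta(minutes=5)):
    """Return the log lines stamped within `window` of `bug_time`."""
    nearby = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # no leading timestamp, e.g. a continuation line
        if abs(stamp - bug_time) <= window:
            nearby.append(line)
    return nearby


# Made-up log content for illustration.
sample_log = [
    "2024-01-15 16:30:02 INFO  user logged in",
    "2024-01-15 16:41:57 ERROR query returned no rows",
    "    continuation line without a timestamp",
    "2024-01-15 17:10:44 INFO  nightly job started",
]
bug_time = datetime(2024, 1, 15, 16, 42, 0)
for line in lines_near(sample_log, bug_time):
    print(line)
```

The five-minute window is an arbitrary default. Widening it is a reasonable first adjustment when nothing unusual turns up.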

Talk to Informants (Collect Additional Information)

Next, I see if anyone else has clues or has experienced a similar bug.

I check with others as follows:

      1. Ask coworkers and colleagues. I find out if my coworkers and colleagues have seen a similar bug. Or, they may have been involved in recent changes that may impact my code. I want to talk to them and find out – their comments may provide context for my investigation.
      2. Perform an internet search. I can use a web search engine to produce links to thousands of online pages with topics related to my questions. I use caution here, as this can be a time-consuming means of finding answers. I can get sucked into hours of scanning Q&A web sites. I often find myself jumping ahead to “trial and error” solutions before I have enough information about my issue to troubleshoot effectively. I also look up any online documentation for the libraries I use.

A successful homicide investigation is well informed by the amount of reference material collected. Similarly, I need as much information as possible to get to the true source of my bug.

 

Step 3 – Retrace the Crime

Homicide detectives know the importance of determining an accurate sequence of the events leading up to the discovery of a dead body. An accurate timeline can support the testimony of witnesses and corroborate the evidence collected. Likewise, determining which steps to take to duplicate a bug is critical to troubleshooting efficiently.

Model the Crime Scene (Attempt to Reproduce the Bug)

I have learned how valuable it is to exactly reproduce the bug during troubleshooting.

This is the step that separates great debugging from the rest.

I follow these steps to determine exactly how to reproduce the bug:

      1. Follow steps recorded in Step 2 – Recall My Steps. I start by following the sequence I wrote down in Step 2. I combine each step with the details noted and follow them as written. I expect to see the same bug appear.
      2. Make adjustments as necessary. Sometimes following my written steps does not immediately reproduce the bug. That’s okay – this is an iterative process. I expect to experiment with the sequence of steps and the data until I match the bug I am troubleshooting. To help me arrive successfully at the matching sequence, I ask myself questions such as:
      • What did I fail to record?
      • What could be different about the data I am using in my model?
      • How might I have interacted differently the first time I saw the bug?

Of note, this is really unconventional. Most developers briefly attempt to reproduce reported bugs with minimal effort. They are quick to dismiss reported bugs as being due to “user error,” leaving many bugs unresolved.

Document the Official Timeline (Document Confirmed Steps)

It is important to record the list of steps finally proven to reproduce the bug.

It is easy to lose track of details such as which user, what user permissions, which data, and what specific sequence was eventually determined to produce the bug.

I insist on doing the following:

      1. Assemble the final steps. I collect my notes from the previous task and highlight information proven to produce the bug. Then, I arrange the sequence and record notes needed to perform each step.
      2. Write down the confirmed sequence. I make sure to note those steps and follow them one more time before moving on.

I can recall several times when I wished I had completed these tasks. Times when I was able to reproduce an issue but was pulled away from my work to address something unrelated. By the time I returned to troubleshooting, I couldn’t understand my scattered notes or had forgotten what I had done. Inevitably, I spent at least another hour repeating the work I had already done.

Having a written sequence really helps speed up my next step – debugging.
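
Where the steps can be automated, the written sequence can double as a small replay script. The sketch below is one way to do that in Python. The ReproStep class, the replay function, and the toy "application" (a dictionary lookup that fails on a pasted value with a trailing space) are all invented for illustration and do not describe a real bug from my projects.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReproStep:
    description: str                 # what to do
    detail: str                      # user, data, or option needed for the step
    action: Callable[[dict], None]   # hypothetical automation of the step


def replay(steps, state=None):
    """Run every step in order and return the final application state."""
    state = {} if state is None else state
    for number, step in enumerate(steps, start=1):
        print(f"{number}. {step.description} ({step.detail})")
        step.action(state)
    return state


# Toy stand-in for an application: cost records keyed by diagnosis code.
RECORDS = {"A10": ["Clinic North: $120", "Clinic South: $95"]}


def log_in(state):
    state["user"] = "guest"


def paste_code(state):
    state["code"] = "A10 "  # pasted value carries a trailing space


def submit(state):
    state["results"] = RECORDS.get(state["code"], [])


confirmed_steps = [
    ReproStep("Log in", "user: guest", log_in),
    ReproStep("Paste diagnosis code", "copied with a trailing space", paste_code),
    ReproStep("Click 'Submit'", "no other fields changed", submit),
]
final_state = replay(confirmed_steps)
print("Bug reproduced:", final_state["results"] == [])
```

Because each step carries its own detail, the printed sequence reads the same as the handwritten one, and the same script can be run again after a fix is applied.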

 

Step 4 – Interrogate Suspects

Skilled homicide detectives make the most of the time they spend talking to suspects. Detectives combine the information they’ve collected with clever lines of questioning to separate fact from fiction. Similarly, I combine the knowledge I’ve discovered during my investigation along with strategic debugging to reveal the true bug source.

Identify Suspects (List Possible Causes)

I just reproduced the error. While attempting to do so, I built an understanding of some relevant and irrelevant factors.

I use my investigation experience so far to:

      1. Ponder possibilities. I now have at least a few ideas about potential causes. I pause to collect my thoughts, consider what might be happening, and combine those thoughts with the information from my investigation.
      2. List up to 5 possibilities to target. Next, I apply my experience to determine likely culprits. I list out up to 5 potential causes and note to which software technology or code each belongs.

One time, I was troubleshooting a problem related to a Sencha Ext JS table. The table was supposed to display a grid of database records, but it wasn’t rendering.

I developed the following list of possible causes during this step:

      • Broken query or failed AJAX interfering with display? (SQL)
      • Poor configuration? (size or field mapping)
      • Wrong layout or component type? (GridView)
      • Bad HTML structure? (no DIV tag defined)

That list became a nice reference as I began debugging.

 

Strap ’em to the Polygraph (Open the Debugger)

Now that I have potential suspects, I start to analyze each.

A debugger allows me to isolate the code execution path that produces the bug. I set breakpoints and add variable watches to create my “lie detector” as follows:

      1. Launch the debugger. This varies depending on the type of application I am building and the type of development environment in which I’m developing.

Popular integrated development environments (IDEs) like Visual Studio and Eclipse have powerful built-in debuggers. Those are accessed just by executing the application in a “debug” mode.

Popular web browsers such as Chrome, Firefox, and Internet Explorer offer “Developer Tools” accessible through their browser menus. I have found these tools to be helpful when debugging CSS and JavaScript bugs found in web applications.

      2. Set breakpoints to determine code execution path. I consider which functions (a.k.a. procedures, subroutines, etc.) are executed in response to the steps I take to reproduce the bug. I set breakpoints to stop code execution right at the start of each targeted function. This helps me determine possible differences between the expected code execution path and the actual code execution path. I record any discrepancies and identify their causes.

When I am unsure where to start setting breakpoints, I consider the error message that appeared. I locate the same message within my source code. Then, I set breakpoints around the line of code that displays the message. When code execution pauses at that breakpoint, I use the call stack to determine which functions were executed that led to the display of the error.

      3. Set variable watches to spot inconsistencies. While code execution is paused at a breakpoint, I set watches to allow me to view variable values during execution. I monitor their values to see when and how they differ from my expectation.

For example, I watch variables used to control program execution flow. I will watch variables used in the conditional statement of an “if clause” such as “if (fruit_type == 'apple') { … }.” Since the value of “fruit_type” determines whether or not a block of code is executed, it is a good candidate to watch.
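
The same two ideas, following the call stack back from the line that displays an error and watching a variable that controls a condition, can be sketched without an IDE. The Python example below uses the standard traceback module. The function names and the fruit example data are assumptions made for illustration.

```python
import traceback

seen_paths = []  # call paths collected each time an error is displayed


def show_error(message):
    """Display an error and report which functions led here."""
    # extract_stack() lists frames from the outermost caller inward;
    # the last frame is show_error itself, so it is dropped.
    stack = traceback.extract_stack()
    callers = [frame.name for frame in stack[:-1]]
    seen_paths.append(callers)
    print(f"ERROR: {message}")
    print("Reached via:", " -> ".join(callers))


def price_for(fruit_type):
    # A simple "watch": repr() exposes stray whitespace or wrong case
    # that a plain print would hide.
    print("watch fruit_type =", repr(fruit_type))
    if fruit_type == "apple":
        return 3
    show_error(f"unknown fruit {fruit_type!r}")
    return None


def handle_order():
    # Trailing space: the comparison in price_for is False.
    return price_for("apple ")


result = handle_order()
```

In an interactive session, placing Python's built-in breakpoint() call on the line that displays the error pauses execution there instead, and the debugger's "where" command shows the same call path.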

Rule Them Out (Avoid Jumping to Conclusions)

Like a homicide detective, I try to rule out every possible cause. Stopping at the first plausible suspect could lead to a wrongful conviction.

I do the following to avoid jumping to conclusions:

      1. Go through the list of possible causes from Step 4 – List Possible Causes. I use a methodical approach to ensure I reach a complete understanding of the cause of the bug I’m troubleshooting. I run each through my debugging analysis to reveal everything I need to be effective.
      2. Maintain discipline to complete the list. I commit to working through my entire list before starting. This commitment will ensure I reach the right conclusion about the source of the bug I am troubleshooting.

Let’s say I have recorded 5 potential bug sources on my list. I work through every item listed. Even if I confirm the first suspect on my list as a source, I don’t dismiss the remaining 4. I check whether any of them is also a cause. This discipline gives me confidence that I will have identified the true bug source after completing my list.
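
The discipline of finishing the list can be expressed as a loop that never exits early. The sketch below reuses the suspect list from the Ext JS example. The Suspect type, the interrogate_all function, and the True or False outcome of each check are made up to show the pattern and do not report what I actually found.

```python
from typing import Callable, NamedTuple


class Suspect(NamedTuple):
    cause: str
    component: str
    is_guilty: Callable[[], bool]  # hypothetical check; True means confirmed


def interrogate_all(suspects):
    """Check every suspect, even after one is confirmed, and return verdicts."""
    verdicts = {}
    for suspect in suspects:
        # No early exit: a confirmed cause does not clear the others.
        verdicts[suspect.cause] = suspect.is_guilty()
    return verdicts


# The lambda results are placeholders standing in for real debugging sessions.
suspects = [
    Suspect("Broken query or failed AJAX", "SQL", lambda: False),
    Suspect("Poor configuration", "size or field mapping", lambda: True),
    Suspect("Wrong layout or component type", "GridView", lambda: False),
    Suspect("Bad HTML structure", "no DIV tag defined", lambda: True),
]

verdicts = interrogate_all(suspects)
for cause, guilty in verdicts.items():
    print(f"{'CONFIRMED' if guilty else 'ruled out'}: {cause}")
```

In this invented run, two suspects are confirmed. Stopping at the first one would have left the second cause in place.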

In homicide, jumping to conclusions can put an innocent man in jail. In software programming, jumping to conclusions can lead to further disruption, corrupted data, and a whole lot of time wasted.

 

Step 5 – Close the Case

Homicide detectives close cases in one of two ways: they either get a confession from a suspect, or they collect enough damning evidence to go to trial in certain victory.

Hand Over the Evidence (Report Key Findings)

I put together all the information, conflicting testimony, and evidence I collected to give me a clear and certain explanation for the source of my bug.

      1. Isolate key findings. Now I either have a “confession” in the form of a message pulled from an error log file, or I have proven beyond doubt why my bug appeared. I’ve processed a lot of information, had a myriad of thoughts, and ruled out many possible causes of my bug. None of that matters now. I only need the relevant information to communicate about the bug and to apply a fix. I summarize my findings.
      2. Record key findings. I find a place to record those key findings and any additional thoughts to use when designing a fix. This could be in Notepad, Word, Excel, or OneNote. The exact format and recording location depend upon how I need or want to communicate my findings.

For example, I may only need something for my own future reference. I know I will want the technical details, but the format can be loose.

Or, I might want to communicate my findings to another development team. A development team will need the technical details to understand the bug, so I include more descriptive information.

Or, I may want to communicate my work to management. They are less likely to be interested in the technical details, but they will need to know about the troubleshooting work completed.

I have found that taking time to record the troubleshooting work is helpful in several ways.

First, it helps me learn. Writing it down increases the likelihood that I will remember what I’ve found and be able to apply it in future development or troubleshooting.

Second, it is helpful when communicating the value of my troubleshooting effort to others.

Third, my records have become a resource for my own reference during future troubleshooting.

Administer Justice (Apply a Fix)

Ultimately, good homicide detectives want nothing but justice for the victims of the crimes they work. They use a disciplined, systematic approach proven to reveal the identity of those truly responsible for murder.

Likewise, a systematic approach to troubleshooting will consistently expose the true sources of bugs. Once the true source of a bug is revealed through such an approach, a solution is typically apparent.

I follow these steps to move from troubleshooting to fixing bugs:

      1. Study key findings. I examine my notes regarding key findings from Step 5 – Report Key Findings.
      2. Design the fix. I combine my experience, my knowledge of the application, and the key troubleshooting findings to generate ideas on how to fix the bug.
      3. Apply the fix. I realize my design by modifying or introducing code.
      4. Verify the fix. I follow the steps recorded in Step 3 – Document Confirmed Steps after the fix has been deployed. If the bug no longer appears, the fix is confirmed.

 

Summary

Although programmers are logical people, they sometimes forget to apply logic to the process of troubleshooting.

In their haste to solve a problem or meet a deadline, they skip the systematic process crucial for effective and efficient troubleshooting.

It isn’t their fault. We all hope to fix our code quickly by skipping steps and jumping to a solution.

But the reality almost always ends up much differently. We waste hours precisely because we skipped a systematic approach to fixing our bugs.

It can be difficult to remember to apply these steps.

To overcome the tendency toward bad habits, I developed a troubleshooting framework built on a memorable analogy – as if I were a homicide detective.

I can easily recall how detectives approach each case, which helps make sure I follow the steps necessary to get me to the true source of bugs efficiently.

 

The Complete Framework

I’ve condensed all the steps of my troubleshooting framework for your easy reference.

Step 1 – Visit the Crime Scene

Part A.  Examine the Scene (Clarify the Issue)

      • Answer 3 Starting Questions.
      1. What happened?
      2. What did I expect to happen?
      3. How do the two differ?

 

Part B.  Collect Evidence (Gather Reference Material)

      • Note the time of the bug.
      • Record which software component displayed the error.
      • Note anything familiar.

 

Part C.  Identify Initial Suspects (List All Software Components)

      • List server and client applications.
      • List application dependencies.

 

Step 2 – Interview Witnesses

Part A.  Get Eye-Witness Statements (Recall My Steps)

      • Record my last 3 to 5 steps just before the bug appeared.
      • Jot down a couple of details for each step.

 

Part B.  Get Other Witness Statements (Review History & Configuration)

      • Check configuration.
      1. Configuration files
      2. Dependency setup
      • Inspect logs.
      1. Log files
      2. Console messages
      3. Database records

 

Part C.  Talk to Informants (Collect Additional Information)

      • Ask coworkers and colleagues.
      1. Ever seen this?
      2. Recent impactful changes?
      • Perform an internet search.
      1. Q&A web sites
      2. Find documentation

 

Step 3 – Retrace the Crime

Part A.  Model the Crime Scene (Attempt to Reproduce the Bug)

      • Follow steps recorded in Step 2 – Recall My Steps.
      • Make adjustments as necessary.
      1. What didn’t I record?
      2. What is different about my data?
      3. How might I have done things differently before?

 

Part B.  Document the Official Timeline (Document Confirmed Steps)

      • Assemble the final steps.
      • Write down the confirmed sequence.

 

Step 4 – Interrogate Suspects

Part A.  Identify Suspects (List Possible Causes)

      • Ponder possibilities.
      • List up to 5 possibilities to target.

 

Part B.  Strap ’em to the Polygraph (Open the Debugger)

      • Launch the debugger.
      • Set breakpoints to determine code execution path.
      • Set variable watches to spot inconsistencies.

 

Part C.  Rule Them Out (Avoid Jumping to Conclusions)

      • Go through the list of possible causes from Step 4 – List Possible Causes methodically.
      • Maintain discipline to complete the list.

 

Step 5 – Close the Case

Part A.  Hand Over the Evidence (Report Key Findings)

      • Isolate key findings.
      • Record key findings.

 

Part B.  Administer Justice (Apply a Fix)

      • Study key findings.
      • Design the fix.
      • Apply the fix.
      • Verify the fix.