Recently in an interview I was asked what the biggest bug I came across was, and the hardest to fix. I thought I'd do a write up on that bug, mainly so I can understand what lead to it myself, but also hopefully so people can prevent this happening to themselves.
The code side of the bug is actually relatively simple to fix; it was a simple inheritance problem due to a mix of inexperience on my part (I was a Junior Front-end Developer doing Back-end work), a lack of tests and the way that the Python framework we were developing on handled inheritance. The real hardship of the bug was the way that the team worked together to fix the issue.
Back in 2014 I was a Junior Front-end Developer working at my first start-up job. The team at the time consisted of two Senior Developers; One focused on Scala, the other on Odoo/Python and two Junior Developers; One focused on the front-end (JS, Play!) and the other on the back-end (Odoo). We all worked under the CTO but he never worked directly with the team on the product. I found myself drawn more to Python than to Scala so I started to do some development on the back-end. One of the things I built was a Werkzeug Web Controller and PhantomJS report printing module.
As noted in my Root Cause Analysis post the culture at this point in time was more around getting stuff out the door over approaching things properly and ensuring we specified, tested and implemented things correctly. This of course meant that when it came to create a customisation of the functionality for a client, I made my changes and assumed that everything else was going to be ok.
However, they were not, for three reasons:
- Odoo, the framework we were using handles classes and inheritance using a database driven module graph and essentially looks at what modules are installed on the instance. I made my changes in a way that meant they changed the superclass, even if you didn't have the module installed
- We had a CI, but no tests around the original functionality or the customised version
- The way we deployed the code was to download the latest version of all modules, regardless of the client, so every client had all client customisations sitting on their servers, although not installed.
I'm sure you can see how this bug was raised to us. We did a deploy to a number of clients; they all suddenly could not print their reports (which were used as a business continuity plan) and I'm sat there looking at the core functionality, wondering how something that hasn't changed in 2 months is now broken.
Luckily I was able to figure out that the one field I had added to the customisation was in all of the stack traces and started to investigate it. Ultimately, my inexperience with the way that Odoo handled inheritance led to me asking for help from the Senior Developer who worked with Odoo, this was probably the worst move I made.
Everyone in the team, apart from the Odoo based Senior Developer used a Jet Brains IDE (PyCharm/IntelliJ) and
in doing so was used to using the built-in debugger when squashing bugs, as it was much cleaner than littering
the code with
pdb statements. This, however, was the preferred method of this particular developer.
To make matters worse, as this was a web controller called by a PhantomJS subprocess; the way the developer addressed the issue was not to set a break point in the controller and directly load the page. Instead they simple had the controller return 'this is broken' only. This of course showed them that the problem existed and allowed them to verify that they had resolved the global override, but they also left this in and pushed that code to production (we only had a master branch, so this meant every push went straight into production).
I'll reiterate the bit about a CI, but not tests around the functionality and the fact that we found out about bugs when the code was already in production at this point. So 2 months later when the client has started to use the system, we receive an email stating that all the reports say 'this is broken' ; I get taken outside by the CTO to ask why this is happening, I mention how I was stumped and asked the Senior Developer to help me and ultimately the Senior Developer shrugs, telling me he solved the problem I came to him with - it's not his fault he left that 'this is broken' string in - I should have written some tests around it.
Unfortunately, he made a very valid point about the tests and as much I would love to blame it on the lack of a quality focused culture, a lack of experience, a lack of guidance from the Python Senior Developer, or the fact that he didn't write any tests either - I do ultimately have myself to blame for not writing them. This experience eventually led to me aggressively driving TDD when I started leading the team a year later.
The Real Bug
For me the real bug wasn't the issue with the reports, It was:
- The hero complex the Senior Developer exhibited
- The lack of communication around the changes being made, how the fix solved the problem and how my initial module and it's customisation fit into the product
- The lack of quality assurance we practiced as part of our development work
I'm sure everyone has read at least one article by now about how Heroes, Ninjas, Rockstars etc. don't play well in teams (here's an excellent one from Christian Heilmann).
There were a few things I feel may have led to this behaviour:
- The Senior Developer was hired for their knowledge, not for their ability to mentor
- When they started trying to improve the existing product before fully trying to understand it they spent 3 months writing their own version in isolation
- At the time there wasn't a culture that allowed them to grow into mentoring, so they had no reason to do so
- There were no introspective activities such as retrospectives, one-to-ones or appraisals
This meant that the Senior Developer turned up for work, built his masterpiece and dealt with any requests from people trying to use his solution for their means. So when asked to solve a problem with someone else's code, he couldn't look past the impact it was having on his work, instead of looking at it in a holistic manner.
Lack of Communication
One thing I've learned is that everything in work is down to communication; ensuring that people are all on the same track, building shared understanding of the problems and solutions found in a code base and making sure people are having fun doing the work.
Communication in a team suffers when:
- You allow people to work in silos
- You create hierarchical/functional barriers between roles such as Senior/Junior or Front-end/Back-end
- You don't set aside time to measure the way that communication happens inside the team
- You assume people are collaborating with others so you leave them to get on with it and don't take part
My story has many examples of poor communication:
- The Senior Developer worked in a silo for 3 months to architect the new solution for the product we were building. It was only after they had done all the work that they worked with others to get that solution back to the functional standard of the old one.
- When I built the initial module I also pretty much worked in a silo which is what ultimately led to the inheritance bug as I failed to collaborate with my colleagues to understand how to properly integrate this feature into the product
- As I realised the bug's solution was beyond my (at the time) basic grasp of Odoo's inheritance mechanism I decided to communicate the need for assistance, but due to the hierarchical and functional barriers between Junior Front-end and Senior Back-end we essentially chucked the problem over the fence to each other instead of collaborating to solve the issue
- During this time we did next to no introspecting into the way the team functioned. Instead we just shut up and got on with things. This lack of discussion on how we could work better together led to the aforementioned barriers and ultimately meant the hero complex exhibited by the Senior Developer went unchallenged
- The CTO's lack of understanding of how the bug was verified and fixed meant they didn't work with the team to discover the practices that led to the initial bug and the bug present in the solution. They assumed that the team were working together like a well oiled machine when in reality we were in need of some WD-40.
Lack of Quality Assurance
As I mentioned this experience impacted the importance I place on validating ideas. By ideas, I don't just mean features to be built but a more holistic view of validating the ideas that lead to features, validating the ideas that lead to writing code for features, validating the implementation of the features and validating the ideas of how we're going to deploy and maintain the features.
If I apply this validation to the story, I would have:
- Validated how the report functionality would have worked for clients, instead of blindly building it for the generic use case and struggling later to customise it for clients
- Validated how the report functionality would have been integrated into the product and the impact it would have had on the codebase
- Validated how the report would be tested as there were many components such as the functions to collect the data for the report, render the data in the report and print the HTML from the report into a PDF document
- Validated my implementation of the feature via a suite of unit and integration tests
- Validated my implementation of the customisations I made via a suite of unit and integration tests
- Validated the impact of having the changes I made on a number of different client configurations via automated or manual QA testing
- Validated the maintainability of the code via a code inspection tool like SonarQube and reviewing the implementation with colleagues
Then came running the tests locally while developing and on the CI as I pushed to the repo, the team and I would have been assured that everything was working as intended and we could have prevented the first bug and caught the bug created by the Senior Developer push his debugging code. Additionally this validation would have forced us to introspect on the way the team worked.
The outcome of the scenario in the story was the Senior Developer and the CTO being fired within a few months of each other and the CEO asking me to lead the team. This experience was the driving force behind the work I did to open up communication within the team, stop any hero complexes and start viewing the way the team worked as essential as the code they wrote.
As you can imagine the interview wasn't too pleased with my response, I took a question about programming and answered it with solution for running a team better. I do think it's still a valid answer. Most bugs are caused by people either by carelessness, misunderstanding how to use a technology or building the wrong thing (which in a way is another kind of carelessness).
If people cause bugs then we should look to validate their output; not only their code output but the output of the interactions they have with the team they work in, the output of the team in the organisation they work for, the output of the customers using the features they build and the output of those who supply the the team with input.
In this analogy I view regular code review, retrospectives, one-to-ones and appraisals as a form of debugging with break points being represented as time taken to inspect how things are running. To be proactive (like TDD for teams) then tools and techniques like this can help:
- Continuous integration
- A process framework that makes sure time is taken to introspect such as Scrum
- Continuous inspection
- User stories & acceptance criteria
- Team metrics such as burn down, cumulative flow and health & happiness
- Key Performance Indicators that the team's output is measured against
- Semantic monitoring and User Experience feedback sessions