A couple of years ago, a client contacted us asking for help with an app they had been developing and had recently released. The product already existed, but they were doing a full app redesign, which meant several months of hard work. It was certainly a beautiful app, but they were facing several issues. The main symptom was poor reception: two-star reviews in both the Apple App Store and Google Play Store, with complaints mostly about performance and crashes. The work had been done by an external agency, and they had been chasing their tails on app performance for a couple of months without a clear path to improvement.
When we got involved and dug deeper, we discovered several issues. The first was that there was no clear way of tracking app performance and crashes, so performance discussions were guided mostly by app reviews and by someone’s experience trying the app, something along the lines of “Alice in sales had this issue.” Not that this input is bad or that we shouldn’t pay attention to it, but there are tools that can give you a clearer and more objective picture of the situation.
When we activated tracking tools on the app, it became evident that the most pressing issue was a high crash rate (>6%); we also discovered other non-crash errors that users were seeing. After a few iterations of tracking, prioritizing, and continuously updating the app, we quickly got to a <1% crash rate and then to <0.2%. The apps now hold high review scores (4.7), and performance issues seem like a story from the past.
Over time, we’ve noticed that this is actually a very common struggle for many Product Managers. This is why we want to share our experience in the form of five easy-to-implement practices for solving and preventing high crash rates and buggy apps.
“You can’t manage what you don’t measure.” This quote is attributed to Peter Drucker, and although you can argue its validity in many scenarios, it is definitely a useful guideline when working on app performance.
There are several tools you can implement to start collecting data and thus be able to find the root of the problem. In the past, we used to work with Crashlytics, which was quite useful. Many enterprise clients already have New Relic, whose mobile tracking suite covers not only iOS and Android but OTT and Unity apps as well. Whatever tool you use, it’s normally an SDK that, once integrated, sends a report to the vendor’s site for each crash, containing the information that will help you determine what’s causing the crashes.
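To make this concrete, here is a hedged sketch of what reporting looks like with Firebase Crashlytics on Android; the repository class and the `syncAccount`/`performSync` functions are hypothetical, and crashes themselves are captured automatically once the SDK is integrated:

```kotlin
import com.google.firebase.crashlytics.FirebaseCrashlytics

class AccountRepository {

    private val crashlytics = FirebaseCrashlytics.getInstance()

    fun syncAccount(accountId: String) {
        // log() attaches context that ships with the next crash report,
        // which helps reconstruct what the app was doing.
        crashlytics.log("Starting account sync")

        try {
            performSync(accountId) // hypothetical call that may fail
        } catch (e: Exception) {
            // Non-fatal errors won't crash the app, but recording them
            // makes them visible in the dashboard alongside real crashes.
            crashlytics.recordException(e)
        }
    }

    private fun performSync(accountId: String) {
        // ... real sync logic would live here ...
    }
}
```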
Something to bear in mind is that it’s not only about crashes. Users face errors that are not necessarily crashes, so tracking user-facing errors such as error messages and API response times is also important. This type of tracking is often referred to as telemetry. A handy tool that we have come to enjoy is New Relic Insights, which lets us create custom dashboards of the app’s health. It also lets you send custom error events that you can then track. We have found this approach more customizable, surfacing API errors, server errors, and data errors that you can then use to improve the user experience.
There’s also the Android vitals section in the Google Play Console, which, as the name implies, displays the health of an application by showing its crash rates and errors.
Using tracking tools will make it easier to detect what is causing the app to crash.
“On the way to the woods, Hansel crumbled his piece in his pocket, then often stood still, and threw crumbs onto the ground.” — Hansel and Gretel
When tracking, make sure you add enough information to the events for your development team to follow in case the crash report alone is not enough. An important step in this direction is activating symbolication. Symbolication lets you not only detect the crash but also know the exact line of code where it happened and the parts of the code that were called before reaching that point. It’s essentially reverse-mapping the compiled binary back to your source. If you are using a common tracking tool, symbolication should be supported, but it’s something you normally need to enable yourself, usually as a build script, especially for iOS/tvOS.
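On Android, the analogous step is uploading your R8/ProGuard mapping file so obfuscated stack traces can be translated back. A minimal sketch with the Firebase Crashlytics Gradle plugin, assuming the plugin is already applied (verify property names against the current plugin docs):

```kotlin
// app/build.gradle.kts
import com.google.firebase.crashlytics.buildtools.gradle.CrashlyticsExtension

android {
    buildTypes {
        release {
            // Obfuscation renames classes and methods in release builds...
            isMinifyEnabled = true
            configure<CrashlyticsExtension> {
                // ...so upload the mapping file that lets Crashlytics
                // turn obfuscated stack traces back into readable ones.
                mappingFileUploadEnabled = true
            }
        }
    }
}
```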
Another useful trick we follow is to add analytics identifiers to our crash reporting tool, so we can link a crash to a specific user and, from there, follow the user path that caused the crash. You can, for example, add a user ID as metadata in New Relic and then search for that same user ID in your analytics tool, such as Amplitude.
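A minimal sketch of that linkage, assuming the New Relic and Amplitude Android SDKs are already set up; `onUserLoggedIn` and `pseudonymousId` are hypothetical names, and you should prefer a pseudonymous ID over raw personal data:

```kotlin
import com.amplitude.api.Amplitude
import com.newrelic.agent.android.NewRelic

// Hypothetical login hook: tag both tools with the same pseudonymous ID
// so a crash seen in New Relic can be cross-referenced with that user's
// session path in Amplitude.
fun onUserLoggedIn(pseudonymousId: String) {
    NewRelic.setUserId(pseudonymousId)
    Amplitude.getInstance().setUserId(pseudonymousId)
}
```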
When tracking non-crash errors, we also record the message the user saw, which endpoint caused it, and what error we got from the API or server, if any, along with any other details that will help us solve the problem; see the sketch below.
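As an illustration, such a custom error event could be sent to New Relic like this; the `UserFacingError` event type and the attribute names are our own conventions, not anything the SDK prescribes:

```kotlin
import com.newrelic.agent.android.NewRelic

// Hypothetical helper: record a user-facing (non-crash) error with
// enough context to debug it later from a custom dashboard.
fun reportUserFacingError(messageShown: String, endpoint: String, httpStatus: Int) {
    val attributes = mapOf<String, Any>(
        "messageShown" to messageShown, // what the user actually saw
        "endpoint" to endpoint,         // which API call failed
        "httpStatus" to httpStatus      // server/API status code, if any
    )
    NewRelic.recordCustomEvent("UserFacingError", attributes)
}
```

In New Relic Insights you can then chart these events with a query along the lines of SELECT count(*) FROM UserFacingError FACET endpoint.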
“80% of the effects come from 20% of the causes” — Pareto Principle
It’s common for teams to tackle bugs on a first-come, first-served basis, especially if the report comes from a loud voice or a key stakeholder. Although that makes sense in certain situations, you should normally prioritize the problems most users are facing. The Pareto Principle comes in handy here, and although it’s just a rule of thumb, it’s a good guiding principle: normally a couple of bugs or crashes are causing most of the complaints. Solve those first and iterate your priorities accordingly. You will see your crash rate and complaints decrease dramatically as you make progress, increasing user satisfaction, recovering stakeholders’ trust, and boosting the team’s morale.
Something we also evaluate is whether the problem is widespread: does it happen on all device models, or only on old ones (e.g., the iPhone 4, which is, sorry, a bit old now)? Android has come a long way on stability but is still challenging to test because of device fragmentation. A good rule of thumb is to test and certify on the most common models, such as Samsung Galaxy phones, and expand from there. You can even limit the models on which your app is available at the beginning; some teams do this, but we don’t recommend it unless there is a hardware compatibility or OS version support issue.
Software is complex, and in an environment of continuous delivery even more so. Even if you have a 24/7 platform reliability team working for you, you will need a systematic and scalable way to manage your platform’s stability. There are a couple of simple actions you can take to simplify the process.
First, create an app health dashboard that you can check at specific times of day or when incidents pop up. The dashboard can show different things, including key events from your app, crashes, API performance, and user error rates. Our product owners and lead developers keep it open in a tab that they check constantly; in some cases, you can put it on a monitor that everybody can see. The dashboard will save you time pinpointing where a problem is.
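Feeding the dashboard means instrumenting the key events you care about. A hedged sketch using New Relic breadcrumbs, where `checkout_started` and `trackCheckoutStarted` are made-up names for illustration:

```kotlin
import com.newrelic.agent.android.NewRelic

// Hypothetical instrumentation of a key user-journey step; breadcrumbs
// can be charted on the dashboard and also show up attached to crash
// reports, revealing what the user did right before a failure.
fun trackCheckoutStarted(cartSize: Int) {
    NewRelic.recordBreadcrumb(
        "checkout_started",
        mapOf<String, Any>("cartSize" to cartSize)
    )
}
```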
Setting alerts is also important. You and your team will have off-hours, weekends, and other times when few people are available, and in some situations nobody is online. Alerts with a proper baseline will help you discover issues as soon as they happen. Did the crash rate bump all of a sudden? Did the server start having too many timeouts? Did users start complaining on Twitter? Did we start getting bad reviews? All those questions can be automated with alerts, making your life and your team’s much easier.
“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.” — Norm Kerth, Project Retrospectives: A Handbook for Team Review
Having poor reviews and high crash rates can be stressful, and under stress, people tend to default to defensive and sometimes destructive behaviors that demoralize teams and get you nowhere near solving your problems.
First of all, don’t play the blame game. It’s easy to try to pinpoint the person “responsible” for a crash, but as mentioned before, software is complex and it’s a team sport. Focus on the bug or crash you are trying to solve, not on the person or team that developed that piece of code; help them understand the issue and help them solve it. After the issue is solved, run an honest retrospective where you discuss three important questions: 1. What went well? 2. What didn’t go well? 3. How can we improve? Some teams read the prime directive quoted above to set the tone at the beginning of a retro. The retro is not a support group that exists only to pat your back: the third question should result in clear actions you can take to prevent errors like this from happening again. Make sure you follow up on those actions and practice retros on a regular basis. We hold team retros every sprint and more macro-level retros every three months.
A philosophical guideline that we have found handy is Collective Code Ownership: no single person owns a part of the code; the whole team owns all of it. This way, quality becomes a team responsibility, and any member of the team can point out an issue in any part of the software and help solve it. It has helped us break down the territoriality that sometimes appears when one person is solely responsible for a module, a feature, or a product. Anyone can jump into any part of the product, and at the same time, you remain responsible for keeping the parts you develop clear and maintainable.
Another important aspect is remembering that quality doesn’t come from the QA team; it comes from development. QA helps test more thoroughly and provides more certainty about the quality and reliability of the software you are creating. I mention this because it’s common, especially among junior developers and young teams, to finish a feature, merge, and ship without any testing whatsoever. One can argue for Test-Driven Development or other methodologies, but a simple rule of thumb is this: when you finish a feature, take a pause, merge the code locally, and test the most important scenarios, not only what you just created but also the things your changes could affect. Once you consider it stable, integrate it into the main code; a branching process such as Gitflow is recommended for this.

I’ve noticed that following this rule of thumb is useful because when developing you tend to be in two modes (three if you include ideation): 1) Production and 2) Quality. Finishing a feature is pure Production mode: you are creating, but you may cut some corners in the process. Pausing and switching to Quality mode helps you look at the feature with a clearer perspective. When finishing heavy features, I normally recommend people sleep on it and come back the next day in Quality mode, to have a fresh look at the code.
Finally, as any software team would recommend, make sure you follow common good practices: defensive programming, writing test cases, naming conventions, keeping the architecture simple, using Gitflow, and following a design pattern such as MVC or similar. There is vast documentation on these subjects, but it’s always worth reminding teams about them and why they matter for building quality software; a small sketch of the first two follows.
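As a small, hedged illustration of the first two practices, here is a defensively written helper plus a matching test case; every name here is hypothetical:

```kotlin
// Hypothetical price formatter: it validates its input instead of
// trusting the caller, so a bad API payload degrades gracefully
// (a placeholder string) instead of crashing the app.
fun formatPrice(cents: Long?, currency: String = "USD"): String {
    if (cents == null || cents < 0) return "--" // defensive default
    val symbol = if (currency == "USD") "\$" else currency
    return "%s%d.%02d".format(symbol, cents / 100, cents % 100)
}

// A matching JUnit test covering the happy path and both defensive branches.
class FormatPriceTest {
    @org.junit.Test
    fun formatsValidAndInvalidInputs() {
        org.junit.Assert.assertEquals("\$12.34", formatPrice(1234L))
        org.junit.Assert.assertEquals("--", formatPrice(null))
        org.junit.Assert.assertEquals("--", formatPrice(-5L))
    }
}
```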
A common hack for getting better reviews is to encourage good reviews and capture bad ones before they happen. Tools like Apptentive can help you do this by asking users how they feel about your app: users who love it are prompted to leave a review, and users who don’t are prompted to fill out a form. This is useful because you get more information and at the same time reduce the number of bad reviews. Apptentive also provides a nice Love graph that can give you a better understanding of whether users love your app.
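For reference, triggering such a prompt with the Apptentive Android SDK has historically looked roughly like this; the `completed_purchase` event name is made up, the prompt logic itself is configured in Apptentive’s dashboard, and you should confirm the exact API against their current docs:

```kotlin
import com.apptentive.android.sdk.Apptentive

// Hypothetical post-purchase hook: engaging an event lets Apptentive
// decide, based on your dashboard rules, whether this is the moment to
// ask the user how they feel about the app.
fun onPurchaseCompleted(activity: android.app.Activity) {
    Apptentive.engage(activity, "completed_purchase")
}
```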
Be careful, though: no trick will earn you good reviews and user love if your product doesn’t solve a problem or is of poor quality, which is exactly why I think the practices above will help.
I hope you find these practices useful, and I’m happy to discuss them and hear your feedback.