We learn the most from our own mistakes. They can completely alter the perception of the work we do and teach us things we were not even aware of before.
In this article, I will share one such meaningful failure - one to which I contributed my own two cents.
To start, let’s review the definition of vendor lock-in, as found in my favorite encyclopedia.
Vendor lock-in, also known as proprietary lock-in or customer lock-in, makes a customer dependent on a vendor for products and services, unable to use another vendor without substantial switching costs. Lock-in costs that create barriers to market entry may result in antitrust action against a monopoly. Via Wikipedia
Sounds familiar? Read on…
I’ll tell you a story
In the beginning, there was an inherited application with all its nuanced architectural decisions and legacy code. As we were working under constant business pressure to rapidly add and change functionality, there was little time to improve on the technical debt we incurred. We struggled to balance what was currently important for the business and what was important for the business to develop further. We made improvements at the code architecture level in the order of our ongoing development priorities.
As we continued our efforts, implementation of a crucial architectural improvement remained ahead of us. A decision to do it should have been made early and it honestly was right at our fingertips, but for one reason or another, we let it slip. The consequences of this turned out to be dire, but more on that later.
One Tuesday our application stopped working. The situation was really bad:
- nothing worked - blank white pages appeared everywhere - on test servers, in the customer’s environment
- there were no details about what happened
- it was Tuesday, and on Wednesday there was a statutory holiday
- in two days, we were supposed to give an important project presentation to potential clients
Our initial analysis indicated a huge failure. On a closer look it turned out that the support for the old version of a library responsible for servicing one of the main elements of our application - the Bing map - was withdrawn.
To be fair, Microsoft gave us a good amount of lead time, notifying us about the situation a few months earlier. They'd also written about it on their blog and even sent reminder emails. Unfortunately, nobody supervised this on our side.
As a result, we had to rapidly perform a major upgrade, involving changes at the library API level. Some elements had to be removed, functions of others had to be changed. Fortunately, all differences between the versions were well described on Microsoft TechNet.
The root of our problem was the fact that our application, inconsistently but simultaneously, used two ways to integrate the map:
- the angular-bing-maps library
- direct calls to the native library API
The readily available library for the Angular framework gave us some abstraction over the native library, but at the cost of significantly reduced performance. So, to quickly improve performance, we bypassed the library and made direct calls to the native library.
The resulting code was a real mishmash, in which native library calls gradually became prevailing. At the time of our failure, in most of our code, the native library was used directly.
Now imagine that you must quite literally make changes to the entire application code as soon as possible because a library API has changed, and you used it everywhere directly. If only we had encapsulated the use cases in a module (or even in one class), this would have been a much easier task.
We knew we needed abstraction here, but the library we used for this was chosen poorly. In our decision log, no one even mentions the reasons for picking this external library for the project, despite the lack of active development.
By the time the old version of the Bing library was switched off, the authors of angular-bing-maps were not able to prepare their solution for the new version. This also became part of our problem.
All in all, even though we were working on an inherited codebase and development was still in progress, I feel I failed (as a Lead Dev) in a few areas:
- I didn't give high enough priority to the work of abstracting over the Bing map library
- I didn’t update the libraries we used
- I didn’t follow the development of a service on which we became highly dependent
- I didn’t verify the impact and quality of the libraries we used in the project
As I “took command”, I failed to pay enough attention to these, and the unfortunate end result was a complete stop of the frontend application. We had a big problem, a broken interface, and an important presentation the next day. We had to act fast...
Solving the problem
Our response to the situation came in two phases:
- the ad hoc phase, where we aimed to bring back the application as fast as possible
- the long-term reaction, aimed at building the abstraction for maps and removing the angular-bing-maps library
Ad hoc reaction
We created patches for the code and introduced direct changes that altered the way maps were loaded in the application. In places, we were forced to use hacks. For example, we had to put in an infinite loop to verify whether the new native library for the Bing map was initialized because the way the library was loaded did not match the way our application was built.
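
For illustration only, here is a minimal sketch of what such a readiness check can look like when it is bounded by a timeout instead of looping forever. This is not the code we shipped. The helper name `waitUntil`, the polling interval, the timeout, and the check against a `Microsoft.Maps` global are all assumptions made for this example.

```typescript
// Polls `isReady` every `intervalMs` until it returns true or `timeoutMs` elapses.
// Unlike an unbounded loop, it rejects on timeout, so a missing SDK becomes a visible error.
function waitUntil(
  isReady: () => boolean,
  intervalMs: number = 50,
  timeoutMs: number = 10000,
): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const startedAt = Date.now();
    const check = (): void => {
      if (isReady()) {
        resolve();
      } else if (Date.now() - startedAt >= timeoutMs) {
        reject(new Error(`Not ready after ${timeoutMs} ms`));
      } else {
        setTimeout(check, intervalMs);
      }
    };
    check();
  });
}

// Hypothetical readiness check: assumes the map SDK exposes a `Microsoft.Maps` global once loaded.
function mapSdkLoaded(): boolean {
  const globals = globalThis as { Microsoft?: { Maps?: unknown } };
  return globals.Microsoft?.Maps !== undefined;
}

// Hypothetical usage:
// waitUntil(mapSdkLoaded).then(initMap).catch(showMapUnavailableMessage);
```

Even as a stopgap, a timeout turns a silent blank page into an error you can log and react to.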
We did whatever was necessary to bring the application back up, under time pressure. But it was enough.
All the necessary fixes were implemented over our statutory day off (sic!) so that the presentation of the project planned for the next day could actually take place.
With no automated tests (or in fact a negligible amount that I can confidently round down to zero) and changes to the entire codebase, we had to run thorough and comprehensive manual tests over the entire functional scope of our application.
As soon as the application was back up, we started to plan for how to avoid similar errors in the future. We focused on the following elements:
- getting rid of the angular-bing-maps library
- building a module to abstract out the map use cases of the application
- verifying and introducing Leaflet, which allowed us to seamlessly switch between different map providers
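
To make the idea of such a module more concrete, here is a minimal sketch of one possible shape. It is not our actual implementation. The names `MapProvider`, `MapService`, and `InMemoryMapProvider`, as well as the chosen methods and the default zoom level, are hypothetical. The point is only that the application talks to one small interface, and a single adapter per vendor (Bing, Leaflet, or anything else) implements it.

```typescript
// Provider-agnostic types: the rest of the application only ever sees these.
interface LatLng {
  lat: number;
  lng: number;
}

interface MapProvider {
  setCenter(center: LatLng, zoom: number): void;
  addMarker(position: LatLng, label: string): string; // returns a marker id
  removeMarker(id: string): void;
}

// Application-facing facade. Map use cases live here, not scattered across components.
class MapService {
  private readonly provider: MapProvider;

  constructor(provider: MapProvider) {
    this.provider = provider;
  }

  // Centers the map on the first location and puts a marker on each one.
  showLocations(locations: Array<{ position: LatLng; name: string }>): string[] {
    const [first] = locations;
    if (first === undefined) {
      return [];
    }
    this.provider.setCenter(first.position, 12); // zoom level 12 is an arbitrary default
    return locations.map((l) => this.provider.addMarker(l.position, l.name));
  }

  clear(markerIds: string[]): void {
    markerIds.forEach((id) => this.provider.removeMarker(id));
  }
}

// Stand-in provider used for illustration and tests.
// A real adapter would wrap the vendor SDK behind the same three methods.
class InMemoryMapProvider implements MapProvider {
  center: LatLng | null = null;
  zoom = 0;
  readonly markers = new Map<string, { position: LatLng; label: string }>();
  private nextId = 1;

  setCenter(center: LatLng, zoom: number): void {
    this.center = center;
    this.zoom = zoom;
  }

  addMarker(position: LatLng, label: string): string {
    const id = `marker-${this.nextId++}`;
    this.markers.set(id, { position, label });
    return id;
  }

  removeMarker(id: string): void {
    this.markers.delete(id);
  }
}
```

With this shape, a vendor API change touches one adapter class instead of the whole codebase, and the in-memory provider makes the map use cases testable without loading any SDK.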
From that moment on, we consciously included the previously missing element in our planning of improvements and technical debt elimination - risk. We implemented technical risk analysis across the entire project.
The entire situation we went through allowed us to draw conclusions that apply regardless of the application size and magnitude of the failure. They come down to the following three high-level aspects that are worth working on and paying attention to while conducting an IT project.
In this context, risk analysis means the identification of elements that threaten the proper functioning of the project. The main purpose of this practice is to prepare a response plan for risks introduced by any component that will negatively impact our application when it’s unavailable or modified. This means all external services (AWS / Azure services, email sending, reporting systems, analytics, etc.), repositories (npm, Docker Hub, etc.) and libraries. For every such element, ask the following questions:
- What if the service stops working - temporarily or forever?
- What impact will this have on the functioning of the application?
- What is the probability of this occurring? Here it is worth understanding what SLAs are, what level is ensured by the services used, and what this means in practice. For example, did you know that 99.9% uptime means 8 hours 45 minutes and 57 seconds during which the service may be unavailable in a year? (simple SLA calculator).
- How do we protect ourselves against failure of the external service?
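
The SLA arithmetic mentioned above is easy to check yourself. The sketch below assumes an average Gregorian year of 365.2425 days, which is what yields the 8 hours 45 minutes 57 seconds figure. With a plain 365-day year, the result is about 20 seconds shorter. The function names are made up for this example.

```typescript
// Average Gregorian year in seconds (365.2425 days). This choice of year length is an assumption.
const SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60;

// Maximum downtime, in seconds, that still satisfies the given uptime percentage over the period.
function allowedDowntimeSeconds(
  uptimePercent: number,
  periodSeconds: number = SECONDS_PER_YEAR,
): number {
  if (uptimePercent < 0 || uptimePercent > 100) {
    throw new RangeError("uptimePercent must be between 0 and 100");
  }
  return ((100 - uptimePercent) / 100) * periodSeconds;
}

// Formats a duration as hours, minutes, and seconds, rounded to the nearest second.
function formatDuration(totalSeconds: number): string {
  const s = Math.round(totalSeconds);
  const hours = Math.floor(s / 3600);
  const minutes = Math.floor((s % 3600) / 60);
  const seconds = s % 60;
  return `${hours}h ${minutes}m ${seconds}s`;
}

console.log(formatDuration(allowedDowntimeSeconds(99.9))); // prints "8h 45m 57s"
```

Running the same calculation for every external service you rely on gives a quick feel for how much downtime you have implicitly agreed to accept.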
Doing this helps us realize the degree to which we are dependent on external factors, over which we have no influence. The analysis itself is not enough, though. Once we know the weaknesses of our application, we need to act on them.
Immediate and long-term response
When something breaks down, it’s invaluable to have a prepared plan of action. As you remember, we responded to our crisis on two levels:
- the ad-hoc level - where we worked fast to restore the application and minimize the impact on the users. This meant hot-fixes and patching things quickly, even if that meant using metaphorical tie-wraps
- the long-term level - where we worked out a plan for what needed to be done to prevent similar situations from happening in the future
In our situation, the most important thing was to bring the application back into operation within (more or less) 24 hours. We had to present a working application to our potential customers, not clean code. We thus chose to consciously increase short-term technical debt and repay it over the next several days. As we did this, we also introduced long-term solutions from our plan, vastly improving our application’s resistance to similar failures.
npm install library and that's it. But, how often do we analyze each library in terms of:
- maintenance - how’s the response to bugs and their elimination?
- active development - is the library more than a shared piece of code representing a temporary need of the author(s)?
- community - how many people actually use the library?
- quality - what’s the number of unresolved errors? How’s the performance? What’s the test and overall code quality?
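
One way to make this checklist harder to skip is to write it down as a small, explicit review step. The sketch below is only an illustration: the `LibraryFacts` shape, the `reviewLibrary` function, and every threshold in it are arbitrary assumptions, not established rules. You would collect the facts by hand or from your registry and issue tracker.

```typescript
// Facts gathered about a candidate dependency. All fields are hypothetical for this sketch.
interface LibraryFacts {
  name: string;
  lastReleaseDate: Date;
  openIssues: number;
  weeklyDownloads: number;
  hasAutomatedTests: boolean;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns a list of concerns to discuss before adding the dependency.
// The thresholds are arbitrary examples and should be tuned per project.
function reviewLibrary(facts: LibraryFacts, now: Date = new Date()): string[] {
  const concerns: string[] = [];
  const daysSinceRelease = (now.getTime() - facts.lastReleaseDate.getTime()) / MS_PER_DAY;

  if (daysSinceRelease > 365) {
    concerns.push(`${facts.name}: no release in over a year (maintenance, active development)`);
  }
  if (facts.weeklyDownloads < 1000) {
    concerns.push(`${facts.name}: small user base (community)`);
  }
  if (facts.openIssues > 100) {
    concerns.push(`${facts.name}: many unresolved issues (quality)`);
  }
  if (!facts.hasAutomatedTests) {
    concerns.push(`${facts.name}: no automated tests (quality)`);
  }
  return concerns;
}
```

An empty result does not prove that a library is safe. A non-empty one is simply a prompt to stop and record the decision, which is exactly what was missing from our decision log.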
Systematic updating of dependencies is also important here. A major version update may require significant changes and full regression tests - which is work that you should plan for properly.
Just as we approach solution implementation with caution, we should perform each npm install newcoollibrary with the same care.
Vendor lock-in is not just a technical issue
Vendor lock-in does not only result in technical problems like the ones we had to deal with. It also causes problems on the business side of things. When calculating the price of your product, you may take into account various factors, including the value added by the product, but also the costs of infrastructure, external services, and many more.
Now imagine that you are building a solution that is based on a map, as in our example. Let’s say you depend on Google Maps because you have learned the solution well and come to rely on its great UI and API, the large opportunities it brings, and high usage limits that make it affordable.
One day you receive an email that informs you about a drastic change in the pricing of the service. Mild chills go down your spine, as you know that the cost of the service has so far been zero but you don’t yet know how badly this is going to change. You calculate your new service cost based on the price list and your usage, and your new monthly fee is "just" $5,000...
We've shown you how a few bad practices combined can cause a near-death experience for a project.
The key takeaway is that strong reliance on one supplier may bring upfront time and financial savings, but at a cost of increased risk of complications down the line. In our case, the supplier simply updated a library to a newer version and switched off the old one.
Lack of interest in the development cycle of the APIs and SDKs your project depends on may easily become the final nail in its coffin.
Manage dependencies, do not depend on them.
Manage dependencies carefully, regardless of whether they’re programming libraries or external services. Follow and be aware of their development plans and roadmaps. Make sure you respond ahead of time to all changes that may adversely affect your application.
A healthy abstraction over an external library or service will save you from the need to refactor your entire application code when the underlying dependency changes. In our case, a clean separation of the map module would have vastly reduced the time we needed to bring back the application and saved us from significant technical debt. Repaying this debt took many times longer and cost far more than introducing the abstraction ahead of time would have.