Engineering a successful product launch

Photo by SpaceX on Unsplash.

Panic, panic!

It’s the day before launch.

The engineering team look frantic. There are empty takeaway coffee cups across their desks, in the bin, and on the floor. Kelly is slouched over her keyboard looking at her monitor through her fingers. It’s the loading prompt, and it’s still loading, even after three minutes.

“This just can’t be possible,” she remarks. “How come we’ve never had any issues with the loading speed until now?”

A Slack message from QA.

Ash: Why is it throwing an Internal Server Error every time you change the date range to last month?

Kelly: What?

Ash: Try it out, see if it happens for you too.

Kelly: ARGH, let me have a look…

Bringing up the Chrome developer console, she feels the subtle change in air movement as someone approaches her.

“I’m really sorry to interrupt you, K.”

It’s Evan, the infrastructure engineer. Kelly tries not to get frustrated, and scoots back on her chair.

“What’s up?”

“The database just isn’t coping with the number of requests from your service. I think it might need reworking. When is this meant to ship?”

Kelly feels her cheeks flush red. “Tomorrow.”

Another Slack message pops up. It’s Marketing.

Jordan: I’m just about to send out the countdown video on Twitter. App doesn’t seem to load at the moment. What’s going on?

Jordan: Are you there?

Kelly feels the world spin, and would rather be anywhere else right now.

The most wonderful time of the year

Software launches are one of the most anxiety-inducing things about being a professional developer. No other work event, apart from giving a talk to a room full of people, feels as full of terror, mishap, last-minute stress and adrenaline as the day that the new application or feature gets switched on to a fanfare.

Building software is hard enough in the first place. Building it to a deadline is even harder. Building it to a precise deadline where all of the company – and soon your whole customer base – is looking at you, is terrifying.

Nothing ever goes entirely right in software. The bits of the project that you thought were going to be difficult turn out to be straightforward, and the bit that was going to be simple takes four times as long because of some obscure networking issue.

To top it all off, the kraken-like mega problem that nobody could have predicted beforehand eats all of your contingency time, and here you are once again – and you will forever be here, no matter the project – fixing bugs and performance issues right before the deadline, overtired, overcaffeinated, and overstressed.

The big bang theory

Big bang launches are a very bad thing. We, as an industry, and as professionals working in software, need to do our best to persuade those that we work with that big bang never works.

What do I mean by big bang launches? I’m talking about launches where the application or feature:

  • Is shipped to production only as the marketing launch goes out.
  • Is enabled for all users at once.
  • Hasn’t been profiled against real load in production.

The attraction of the big bang launch to the outside observer is clear: it is the ultimate demonstration of the whole company being in tight formation. There was nothing, and now there is something big. Magic.

The engineering, the marketing, the salespeople, everything – all of the planets align at just the right time – and the curtain opens in front of the ballet to rapturous applause. What a display of coordination and synergy!

But, this just-in-time delivery never goes right. Ever.

What kind of strategy can we adopt that makes it look like we’re performing miracles, while in reality we’re being measured and safe?

How can we allow all of the space that we need to roll things out in production, test them and tweak them, whilst still factoring in contingency in such a way that allows for flexibility when things inevitably go wrong? How can we do this with our existing and prospective users being none the wiser?

Engineering strategies for a smooth launch

The best engineering strategy for shipping a big splashy feature for the first time is to make sure that when everyone starts to use it in production, you already know everything about how it runs in production. Broadly speaking, this involves:

  • Planning for usage that is many times beyond what your current system can handle.
  • Making use of feature flags in order to constantly ship the feature into production where you can see how it behaves.
  • Doing extensive load testing early enough to be sure of your architectural decisions.
  • Taking advantage of beta programs with trusted customers to get real, non-internal feedback on the production code as early as possible.
  • Using shadow loading to see the real production footprint, without those users knowing they’re using it.

Let’s visit each of these in turn.

Planning for usage

When planning the approach for the new feature, you should already be thinking of future load, rather than current load. Take the number of users, stick a couple of zeros on the end, and think about whether it will still perform.

If it doesn’t, is it possible to scale it horizontally? How would that work? More applications, more servers? How will requests be routed? Round robin, or sharded by client or user? Is it possible to get away with estimates of the data rather than needing to count and aggregate everything? Will values be computed in batch, computed in real time, or pre-computed?
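To make one of those routing questions concrete, here is a toy sketch of sharding by user, so that a given user’s requests always land on the same backend. The backend names and numbers are purely illustrative, not a recommendation for any particular system:

    // Route a user's requests to a consistent backend by hashing the user ID.
    // The backend names here are made up for illustration only.
    object UserSharding {
      private val backends = Vector("search-1", "search-2", "search-3", "search-4")

      def backendFor(userId: String): String =
        backends(Math.floorMod(userId.hashCode, backends.length))
    }

    // Every request for "user-42" now lands on the same node, which makes
    // per-user caching and capacity planning easier to reason about.
    UserSharding.backendFor("user-42")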

Get creative. There may be a really neat solution.

Asking these questions with a diagram on a piece of paper or on a whiteboard is easier, cheaper and less stressful than doing it a week before launch. I once read that most of the effort in system design should be on picking through the edge cases and the contingency plan, rather than trying to make the core design perfect.

Make sure that you have a clear route for future scale, otherwise you’re going to be replacing the wheels on a moving car, rather than on a stationary chassis.

Feature flags

Extensive use of feature flags can save all manner of headaches. Continually releasing code behind a flag means that large features don’t end up in branches of the codebase that remain unmerged for long periods of time, needing painful rebasing before they go into master. Instead, shipping code behind a flag means you can merge small increments of functionality as you go, without the user ever knowing.

Feature flags, especially highly customizable ones such as those provided by LaunchDarkly, mean that you can test functionality with a percentage of your customers, or you can enable features for internal staff who will give you valuable feedback without the feature needing to be polished. You can also use flags to coordinate beta programs. When it comes to shipping time, you can roll out new functionality to customers in incremental cohorts to measure the impact.
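As a rough sketch of what a flag check looks like in application code (this assumes the LaunchDarkly server-side Java SDK; the flag key, the “staff” attribute and the rendering strings are invented for the example, and the exact API differs between SDK versions):

    import com.launchdarkly.sdk.LDUser
    import com.launchdarkly.sdk.server.LDClient

    // Sketch only: "new-dashboard" and the "staff" attribute are hypothetical.
    val client = new LDClient(sys.env("LAUNCHDARKLY_SDK_KEY"))

    def renderDashboard(userId: String, isInternalStaff: Boolean): String = {
      val user = new LDUser.Builder(userId).custom("staff", isInternalStaff).build()

      // false is the safe default if the flag service can't be reached.
      if (client.boolVariation("new-dashboard", user, false))
        "new dashboard" // internal staff, beta customers, or a percentage rollout
      else
        "old dashboard"
    }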

Feature flags are great.

Load testing

Prototype architectures can be tested by generating simulated load to prove that what you’re building is going to take the strain of real users. I’m used to backend tools such as Gatling, which let you simulate a large number of users and usage patterns hitting your services, and easily collect the data from your tests to analyze the results.

What’s your 99th percentile case versus your 50th percentile? Is it acceptable? What will your biggest customer experience versus your mid-tier one?
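As a minimal sketch of what such a simulation could look like with Gatling’s Scala DSL (the base URL, endpoint, user counts and thresholds below are placeholders, not recommendations):

    import io.gatling.core.Predef._
    import io.gatling.http.Predef._
    import scala.concurrent.duration._

    // A minimal load test sketch: ramp up simulated users against a staging
    // environment and assert on latency and error rate. All numbers are
    // placeholders to be replaced with your own expected usage.
    class NewFeatureSimulation extends Simulation {

      private val httpProtocol = http.baseUrl("https://staging.example.com")

      private val scn = scenario("Browse the new feature")
        .exec(http("load new feature").get("/api/new-feature?range=last_month"))
        .pause(1.second, 3.seconds)

      setUp(
        scn.inject(rampUsers(1000).during(5.minutes))
      ).protocols(httpProtocol)
        .assertions(
          global.responseTime.max.lt(2000),          // nothing slower than 2 seconds
          global.successfulRequests.percent.gt(99)   // fewer than 1% of requests fail
        )
    }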

Beta programs

As well as generating simulated load, it’s always valuable to get real users involved ahead of time. Identify users of your application that would be happy to provide feedback in return for using new – but potentially buggy – software ahead of everyone else.

With the help of feature flags you can give them the unpolished functionality ahead of time, monitor how the system performs under real usage patterns, and also speak to them for qualitative feedback.

Implementing their suggested improvements makes the final product better for everyone.

Shadow loading

Before doing a general rollout of your feature, you can route all traffic to it behind the scenes, but not let the user know that it is happening. This is often called shadow loading.

For example, if your new feature is going to be shown on the top of your homepage, why not have all users unknowingly call that new endpoint on page load, with the results being discarded?
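Here’s a sketch of how that might be wired up server-side. Everything named below is hypothetical; the point is simply that the response the user sees still comes from the existing path while the new one is exercised in the background and its result thrown away:

    import scala.concurrent.{ExecutionContext, Future}
    import scala.util.{Failure, Success}

    // Hypothetical stand-ins for the existing and new code paths.
    class LegacyHomepage(implicit ec: ExecutionContext) {
      def render(userId: String): Future[String] = Future(s"homepage for $userId")
    }
    class NewHighlights(implicit ec: ExecutionContext) {
      def fetch(userId: String): Future[String] = Future(s"highlights for $userId")
    }

    class HomepageHandler(legacy: LegacyHomepage, shadow: NewHighlights)
                         (implicit ec: ExecutionContext) {

      def handle(userId: String): Future[String] = {
        // Shadow call: generate real production load on the new code path,
        // then discard the result. Failures are logged, never surfaced.
        shadow.fetch(userId).onComplete {
          case Success(_) => () // in a real system, record latency and success metrics
          case Failure(e) => println(s"shadow call failed: ${e.getMessage}")
        }

        // The response the user actually sees still comes from the existing path.
        legacy.render(userId)
      }
    }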

This way you can measure what the load on the system is going to be like during normal conditions. You can feel assured that the functionality is ready for showtime.

In summary

Think about ways in which you can ensure that your new application or feature has been planned for scale and has been subject to production load a long time before it gets shipped to real users. Use feature toggles, load testing, beta programs and shadow loading to ensure that launch day is one where you can celebrate success, rather than tend to fires.

By following some or all of the techniques above, you can ensure that on the day that you flick the big switch on for all of your users, the system has already been doing all of the work, predictably, for some time. You can go out for a celebratory lunch with the team without a feeling of paranoia that everything is about to blow up catastrophically.

How to tackle technical debt

Photo by Ehud Neuhaus on Unsplash.

Mining for debt

Recently on Slack, one of my colleagues shared a comic from Monkey User.

I thought it was a great metaphor. 

The world of software moves extremely fast. Inside a given company the codebase is constantly changing with the addition of new features. Outside the company is an entire world of open source software development, shipping updates to all of the libraries, frameworks and databases that are being used.

With time, piling on ever more code creates moments where the team needs to stop and take a step back. They will need to think of a different way of moving forward that is more maintainable, controlled and less prone to bugs. 

Even if the internal codebase changes extremely slowly, external dependencies are always releasing new versions, requiring the team to upgrade them before they reach end of life. This can create further technical debt as APIs deprecate or breaking changes are introduced.

Quite often engineers struggle to make their case for prioritizing tech debt work. Why?

  • Lack of empowerment: They might not think it is their place to speak up about it; instead, they expect more senior people to dictate when to take stock, refactor, upgrade libraries or storage, and so on.
  • Inability to persuade: They might not be able to construct an argument for spending time on it in a way that the non-technical people who dictate the work streams will understand.
  • Apathy: They may have already lost hope that Product or any higher-ups will listen to them, and therefore silently let the codebase or system degrade. “Features are more important,” they say. “They’ll never listen to us.”

All of these situations are a shame. They’re also not acceptable. But they’re fixable. Let’s have a look at them in turn.

It’s someone else’s problem

As an engineer, if you think that it is someone else’s problem to point out that there is a technical debt issue beginning to get out of hand, then – and I’m sorry to say it – you’re wrong. There are a number of traits that an excellent engineer will have, and pride in their work and a keen interest in the future state of the code are two of them.

Those committing code will know best about how the codebase is currently written and organized. They will be the first to begin to notice the bad smells. As they realize that continual dirty hacks are the only way of moving forward, it’s their duty to raise the flag.

The creation of technical debt is inevitable; as inevitable as the slow erosion of a chalk and lime coast by lapping waves, or the weathering of an old building. We should be comfortable with the fact that it is going to happen, and is likely happening right now, and we should be especially comfortable with alerting others when it starts to feel bad. We should fix the broken roof tiles before they become a leak.

Talk to your team about it. Talk to the other engineers that work on that codebase. Build consensus that there is a problem and that something should be done about it. Don’t wait for someone else to point it out. It is as much your responsibility as it is everyone else’s.

Shout.

Constructing the argument succinctly

Now that a technical debt problem has been identified, we’ll need to think about how best to argue for getting the time and space to fix it. 

Many engineering departments are building a product that makes the company money by selling it to external users. Some service internal users. I work in SaaS, and I would say that the expectations of our users are:

  • That our applications are available no matter the time of day or day of week.
  • That we’ll be continually adding new and innovative features to our products.

These expectations are pretty well understood by everyone in the business, regardless of whether they work in commercial, engineering, product, marketing, or wherever. That’s a good thing, because if you use one or both of them to construct your arguments about tackling particular pieces of technical debt, then it’s hard to be ignored.

Rephrasing the above two bullet points with a focus on thinking about technical debt:

  • The platform should be acceptably fast, correct (enough) and should have a very low likelihood of going catastrophically wrong with no prior warning. It is a very bad thing for business when this happens.
  • The codebase should be easy and efficient to work in as we continually add more stuff to it. If we can’t maintain a reasonable speed of adding new stuff, we begin to lose out to competitors, and the rest of the business wonders why we are getting slower, inviting lots of fruitless arguments about developer productivity.

We need to tie our arguments to these reasons. If engineers argue for doing technical debt work in a way that doesn’t make sense to the non-technical layperson, then it’s very hard for them to win hearts and minds in the business. People will wonder what the engineers are up to rather than shipping features.

Technical debt shouldn’t be fixed because it’s “obvious” or “the code could be better” or “it’s annoying” or a particular framework is now “the latest thing”. Those reasons may be entirely true, but the argument needs work.

Let’s have a look at some different scenarios.

  • “We need to upgrade Postgres.” OK, I totally understand. But we need to think of a better way of phrasing this to the non-technical person. What does the upgrade bring us? Is it some critical security patches? Does it have a positive effect on the speed at which the application is going to work? Does it have new features in the query language that will allow us to query the data in a new or better way?
  • “We need to refactor AnalysisPipeline.scala!” Nobody has any idea what AnalysisPipeline.scala does. Probably only a few people in the department even know. Does it lack tests, and is that causing a lot of bugs in the documents it writes, which are challenging to fix once they’re committed to storage? Is the class such a big monolithic mess that it is too hard to add new features at the rate that the business expects? Is it taking five times as long to work on as it would if it were split out into multiple classes, methods, modules or services?
  • “This service needs a rewrite.” Sure, it probably does. But what’s the real reason? Is it stuck on a framework that is now years beyond end of life and nobody knows how it works? Is it an area of the code that is going to see a lot of changes in the coming year, but the risk of breaking it is too high to keep adding to it quickly? Will the speed or stability of this particular service be much better if we just start again, taking advantage of the knowledge and technology that we have now?

Getting better at justifying why technical debt needs to be fixed isn’t just a skill that helps you get clearance from your team lead or product owner to start working on it: it can also help you make up your own mind as to whether something is a real long-term issue for the coming year or just a short-term frustration for the current sprint.

Nobody will listen

If nobody will listen to your arguments about addressing technical debt, then first check that you’re constructing those arguments properly, as mentioned in the sections above. You are? Ace.

If a common pushback is that there are too many features queued up to build, then there may be an underlying worry from your product manager or line manager that fixing the technical debt will be a slippery slope that goes on forever and destroys productivity. 

One answer to this is to try your best to estimate the effort that it will take to fix it, and, better still, break that down into phases or milestones that can be incrementally worked on.

A tactic that works well to please both Product and Engineering is to balance periods of feature delivery with periods of tidy-up and refactoring. In It Doesn’t Have To Be Crazy At Work, the creators of Basecamp pitch periods of six weeks of building followed by two weeks of paying down technical debt.

At Brandwatch we have employed similar tactics, with a period of a team delivering a big-ticket feature being followed by a fallow period where the team prioritizes and executes their most pressing technical debt concerns, such as refactoring, improving monitoring and writing documentation. The bonus to this way of doing things is that it gives your product managers and designers time to ruminate on the next big thing.

Sometimes, however, there is a massive elephant in the room: a technical debt project so big that nobody wanted to talk about it, yet the swell has grown to the point where the wave is going to break – either with the codebase continuing to become a complete mess, or the platform becoming increasingly slow and unstable.

In this situation, honesty and transparency are the best policy. It is the job of the leaders in Engineering to elevate a large technical debt problem into a separate work stream in order to give it the recognition, space, and resources that it needs; typically a dedicated team over a longer period of time.

In doing so, the principles above are just as valid: raise the flag, gain consensus, plot an approach, and make the problem understandable to the layperson. Make it clear that the future is brighter by doing this work.

Convince them that it would be silly not to do it because the future of the business depends on it. Then sort it out.

In summary

Remember that if you are an engineer, it’s your job to raise technical debt issues as early as possible, and to make sure that you are able to explain their impact in succinct and meaningful ways. Managers: it’s your job to listen and to create the space for the issues to get worked on.

Building a successful SaaS business requires a stable application and the ability to work quickly and efficiently: both of these things are impacted severely by technical debt, so don’t let it build up. Pay it down.

Switching to a remote manager

Photo by Marius Christensen on Unsplash.

git merge

In the last four weeks, I’ve made a transition from having my line manager based in the same office, the situation I’ve been used to for all of my professional life, to having them be remote. In my case this has happened because of the merger of Brandwatch and Crimson Hexagon. The CTO of the combined company is now based in Boston, and I’m in Brighton, England.

I have a VP Engineering role, which, silly job title aside, means that I have a division of the Engineering department reporting to me, focussed around building our Analytics and Audiences applications. We have other divisions of Engineering focussed around our infrastructure and compute, our data platform and the Vizia product. At the time of writing, I have 38 people in my division.

I’ve been fortunate to have always had the CTO in the same office in recent years. As the company has continued to grow at a fairly fast pace, I’ve had local support. Ideas, thoughts, gripes: they’ve been there, in the same place or at least in the same timezone.

There have been a number of benefits to having the leader of the department co-located:

  • My staff have been able to get to know him easily. We’re all just around most days. This makes them feel connected all of the way up the chain with minimal effort.
  • The general narrative of what’s going on, such as happiness, morale, stress levels, has been observable by both myself and my manager.
  • If there’s ever a crisis – of people or of production systems – then, most of the time, 35 steps is all I’ve needed to get some counsel or a second opinion.

However, things are now quite different.

After our companies merged, the CTO role was given to the Engineering department leader in the other company, putting myself in an interesting position:

  • I now have a manager who is not in the same physical location, so I lose out on all of the informal in-person contact that I had before.
  • My manager is now 5 hours behind me, meaning I have fewer hours of the day in which to speak to him.
  • The new CTO doesn’t initially know me or any of my people; only what we’re responsible for. The rest is a black box.

Letters across the pond

Over the last few weeks, as was to be expected with the merger, we’ve both been very busy, both with logistics and with traveling. Our weekly hour-long 1 to 1s often end before we’ve managed to cover everything off, and we find ourselves sliding into the next meeting with items still left on our agenda, which has been frustrating.

Because these weekly catch ups didn’t seem like enough time, and because email chains typically devolve into stasis, I started writing a weekly digest which I send each Friday afternoon. The idea was that I could take some time to properly summarize everything that was going on in my world and flag anything that I needed help with. 

This has been working really well. 

I write it in a Google Doc, which means that a lot of the smaller items can get covered off asynchronously via the comments. Larger items that are worth spending some more time on become the focus of our conversation in our 1 to 1, and that more precious face to face time is spent on the meat of the main issues, rather than on the periphery. Both of us enjoy written communication too, so this works very well. It also gives us an ideal chance to poke fun at our Britishisms and Americanisms.

Here’s roughly what I cover in the weekly document. It takes me about 30 minutes to write:

  • Any interesting developments in any of the ongoing work streams, such as new links to demos, updates on estimates, or anything particularly good or bad that’s unfolding.
  • The latest on what’s next in the project pipeline from conversations with Product.
  • The general feel within the teams, such as happiness and morale. Are any of them overworked, or, on the contrary, spinning the wheels while waiting for a decision on the next thing? Are the teams right sized and is this looking true for the coming months?
  • An in-depth look at anything that’s front of mind right now, such as hiring, or thoughts about backend architecture and scaling, or contemplations over cool ideas we could pitch to the Product team.
  • A list of “documents of interest”, such as designs for upcoming features or architecture, or the fortnightly product and engineering updates that get sent out. I don’t expect any of these to be read in detail, but they’re there to satisfy any curiosity.
  • Occasionally a light sprinkling of GIFs. Because life’s too short to not use that one of Kermit furiously slapping the typewriter.

Soap opera rather than novel

I’ve been trying to open up my black box as much as possible to give my new manager a view into the decisions that I make day to day, and to allow my thought processes to be observed and discussed. However, the style of writing was challenging at first: how do I make the digest interesting and not a labour?

Given that my new manager was taking the role of the reader and I was the author, I didn’t really know where to start or how to collate my thoughts. But then I came to realize that it wasn’t my job to be the creator of a novel, thoroughly documenting everything that happened. Instead I needed to take the position of a screenwriter of a soap opera: an inventor of a regular rolling feed of narrative that is easy to soak in, letting the reader learn the characters and plot lines gradually by osmosis.

Tuning into The Wire halfway through Season 3 can leave you feeling a little lost and overwhelmed by the detail, but switching on EastEnders a couple of times during the week allows you to (assuming you want to…) follow along pretty easily. I decided to be more EastEnders, except with less arguing and fighting in the Queen Vic.

I scatter the document with parts prefixed with “Your thoughts please…” where I’d like to get some input. We usually chat on the comments around these parts.

Getting comfortable with async await

Although I thought that the experience might be more jarring at first, I think I am adjusting well to a predominantly asynchronous relationship.

There can be some benefits to having a remote manager, after all:

  • Because our face to face time is more valuable, we prepare more for when we do talk, meaning that conversations are rewarding.
  • We do a lot of written communication, which allows us to think more deeply about what we’re saying and how we’re saying it before presenting it to one another.
  • We have to continually operate from a place of trust, since we cannot easily insert ourselves into each other’s worlds to observe and come to our own conclusions. I like this.
  • I feel like I have to step up and represent my people more, in terms of my personal accountability and in promoting their cause, which can only be a good thing.
  • The introduction of even more extreme timezone differences across the now global Engineering department means we need to get better at being a company that supports flexible remote working, fast. I would like to think that being forced to break our predominantly European timezone habits will make it easier for us, in time, to hire people remotely all over the world.

But, still, a quick chat in the kitchen is nice, and is missed.