It is important to have a shared understanding of where we are headed with the engineering practices at STORD. Whether you are a veteran team member or ramping up as a new hire, you need to understand what we are working towards as an Engineering department.
The purpose of this article is to detail our vision for STORD’s Engineering principles and practices.
Systems View or OneOps - Emphasize the flow of work through the build and deployment value stream
- OneOps means we care about the flow of value, from the company’s hypothesis about a feature until that feature is delivered to our customers and providing value.
- OneOps means we all care and think about quality, deployment, reliability and maintenance just as much as we do about meeting the functional requirements.
- We take a systems view: we care about optimizing for the whole system, not for individual parts.
- We have agile processes that allow us to stay nimble in the face of change, and we continue to look for ways to improve
Ownership - We own our code from our desktops to production, and we keep working to remove manual handoffs between teams (by automating or providing self-service tools).
- On-Call - We have an Engineering-wide system to handle, and take ownership of, production issues
- Documentation - We document our designs and interfaces to help others as well as ourselves (Entity Relationship Diagrams, Data Dictionary, API Documentation, Technical Design Notes, Playbooks, How-To Articles, etc.). We read feature documentation to best understand ‘the what’ and ‘the why’ behind our work, and we do not hesitate to ask questions when we do not understand its purpose
- SLOs and Error Budgets - We understand what is expected of our services’ availability, and when we expend our error budget, we prioritize getting the service back within its expected service levels
Continuous Delivery - continuous flow of work that is in a deployable state and constantly shipped to production
- Flow - Seek to reduce handoffs to QA or Operations or Security. Focus on automation and visibility (feedback) as code flows from desktop to production
- Enablement - Seek to give teams all the resources they need to succeed on their own and to minimize handoffs. There will be experts in certain areas (front-end, operations, database, architecture, security) who help so that team members do not have to know everything. Best practices need to be scalable and codified as turnkey services, so that a team’s work can still flow from initial branch to deployed in production
- Trunk-based development - avoid big merges and favor the flow of small batches of work by using abstractions, feature flags, and incremental database changes. Focus on small batches of work (avoid multitasking and large risks)
Feedback - Seek to provide feedback to enable flow
- Feedback is most valuable when it is fed back quickly. Getting feedback weeks or months later is less effective than right away.
- Testing at many levels (unit, integration, feature) is a great way to get feedback quickly and learn about how a change might have caused unintended consequences. How stressful would it be to work in a place that didn’t have a strong culture for testing, and you just had to ‘hope’ that your code didn’t break anything else?
- Examples of feedback - notices when tests fail, logs when there are issues, metrics on service health, current status of key performance indicators (KPIs), and PR reviews before merging into master.
- For individuals seeking feedback, the best way to get it is to ask for it! Being asked makes things much easier for the feedback giver, who otherwise has to find the right time and approach to give unsolicited feedback. Ask “In what ways do I contribute best to the team and organization?” Ask “What can I do to be even more successful in helping the team and organization reach its goals?”
- Continuous Integration is about providing fast feedback on code quality and security after changes are made, using linters and tests at multiple levels (unit, integration, and feature tests)
Empowered - emphasize experimentation and learning
- Learning is encouraged, failure is embraced, and teammates are not afraid to speak up
- We share lessons learned with one another.
- Empowered to do pair programming, mob programming, or even get several people together to collaborate on a PR review. Feel empowered to do what you think is best for the team to achieve its sprint goals
- Meet as soon as possible after a major incident for a blameless post-mortem, so we can learn and prevent it from happening again; we are there to learn from our mistakes, not to point fingers.
- Speak up about Enablers so we can keep improving and chip away at tech debt
- When failure is feared and avoided, innovation is typically the first thing to suffer. To have an innovative culture, we need to embrace failure so we can learn from it.
- You are empowered to stop the flow of work at any time so that we can come together to fix a problem and learn as a team. It is better to do that than to move forward with a defect that will only surface later and potentially make the system worse
- How do you incorporate experimentation? Focus on outcomes. Light from electricity was Edison’s desired outcome, and he went through hundreds of experiments to reach it. Also, focus on tackling high-risk efforts early and often. The first stories of the sprint should often be the riskiest, because the sooner you de-risk something, the better. Oftentimes in our line of work, it is knowledge risk that we need to de-risk.
Key practices that support our principles
Continuous Integration / Continuous Delivery
We deploy during the work day, not at night or on the weekend
- Eliminate as much risk as possible.
- Keep batches of work small
We focus on automated tests as a key way to enable continuous integration and delivery -- it’s foundational, and it delivers fast feedback.
- We know we’ve added enough tests when we’ve significantly reduced the probability of a customer finding a bug in production.
- Tests are run and must pass before changes land in master
- “Hey, if you’ve written a function, it should have a test” -Halt
When we break the build, we make it a top priority to fix it.
- The build breaking is the equivalent of pulling the andon cord, and needs to be fixed as a top priority.
When we find a bug, we create a test for it first. We “shift left”, meaning that if we have a feature test for it and still find a bug, we create an integration or unit test so that we find the issue sooner
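As a minimal sketch of writing the test first, assume a hypothetical `pallets_needed` helper (an illustration only, not taken from any STORD codebase) whose original floor division dropped the last, partially filled pallet. The regression test is written while the bug still exists, fails, and then passes once the fix lands:

```python
def pallets_needed(units, units_per_pallet):
    """Return how many pallets are needed to hold `units` items.

    Hypothetical bug being fixed: an earlier version returned
    `units // units_per_pallet`, which ignored a partially filled pallet.
    """
    if units_per_pallet <= 0:
        raise ValueError("units_per_pallet must be positive")
    # Ceiling division using only integers.
    return -(-units // units_per_pallet)


def test_partial_pallet_is_counted():
    # Written before the fix: 101 units at 50 per pallet need a third pallet.
    assert pallets_needed(units=101, units_per_pallet=50) == 3


def test_exact_multiple_needs_no_extra_pallet():
    # Guards against over-correcting the original bug.
    assert pallets_needed(units=100, units_per_pallet=50) == 2
```

Because these tests live at the unit level, the same mistake would be caught in seconds rather than by a slower feature test.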
We “Fix forward” because we have a fast MTTR (mean-time-to-resolution)
- We have fast build times (the time from code check-in on master to running in production is less than 15 minutes), so we fix forward with quick mean-times-to-resolution
We write end-to-end tests for the critical path flows -- our customers should not be finding bugs before we do in features that are in the critical path of value for them
We favor short-lived feature branches (short = lives only for a couple of days at most).
- We create smaller stories to keep the branches open for short periods of time (measured by Cycle time -- time from branch created to merged), and this results in smaller PR reviews. Smaller PR reviews often result in deeper and more meaningful reviews compared to “monster” reviews.
We avoid long-lived branches and the associated merge pains; this reduces risk and keeps the team flowing.
- Teams should become adept at the branch-by-abstraction technique
- Other techniques for making small changes that build up to a big feature:
For the code, we use feature flags in day-to-day development to allow for hedging on the order of releases. Set the feature flag at the high level of the entry point for the feature or enabler.
For the database, we make incremental changes -- it’s not easy at first and it takes practice, but it is worth the effort for the risk reduction and for keeping teams flowing. It also enables deployments during the day.
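One common way to make incremental database changes is to add first and backfill second, sketched below with Python's built-in `sqlite3` module. This is an illustration under assumptions, not a description of STORD's migration tooling: the `shipments` table and its `carrier` and `carrier_code` columns are hypothetical. The point is that each step is small and leaves the currently deployed code working.

```python
import sqlite3


def expand(conn):
    """Step 1: add the new column as nullable so existing code keeps working."""
    conn.execute("ALTER TABLE shipments ADD COLUMN carrier_code TEXT")


def backfill(conn):
    """Step 2 (separate deploy): fill the new column; safe to run repeatedly."""
    conn.execute(
        "UPDATE shipments SET carrier_code = UPPER(carrier) "
        "WHERE carrier_code IS NULL"
    )


# Hypothetical starting schema and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (id INTEGER PRIMARY KEY, carrier TEXT)")
conn.execute("INSERT INTO shipments (carrier) VALUES ('ups'), ('fedex')")

expand(conn)
# Code deployed before the change still inserts without the new column.
conn.execute("INSERT INTO shipments (carrier) VALUES ('usps')")
backfill(conn)

rows = conn.execute(
    "SELECT carrier, carrier_code FROM shipments ORDER BY id"
).fetchall()
```

Removing the old column would be a third, later step, taken only once no deployed code reads it.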
We ensure that when our pull requests get merged into master (trunk), they are in a production-ready state that can be deployed, though not necessarily released, to production.
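A feature flag checked at the entry point is one way to keep merged code deployable without releasing it. The sketch below is a hedged illustration: reading flags from environment variables, the `FEATURE_` prefix, and the rate-quote functions are all assumptions made for the example, not an existing STORD API.

```python
import os


def is_enabled(flag_name, environ=os.environ):
    """Return True when the flag's environment variable is set to "true".

    Assumption: flags live in environment variables named FEATURE_<NAME>.
    A flag service or config file would slot in the same way.
    """
    return environ.get(f"FEATURE_{flag_name.upper()}", "false").lower() == "true"


def legacy_rate_quote_cents(weight_kg):
    """Existing behavior that stays live while the new path is built."""
    return weight_kg * 150


def new_rate_quote_cents(weight_kg):
    """New behavior, merged to master in small pieces behind the flag."""
    return 500 + weight_kg * 120


def rate_quote_cents(weight_kg, environ=os.environ):
    """Entry point: the flag is checked once here, not deep inside the feature."""
    if is_enabled("new_rate_quote", environ):
        return new_rate_quote_cents(weight_kg)
    return legacy_rate_quote_cents(weight_kg)
```

With the flag unset, the new code is deployed but dormant; releasing it becomes a configuration change rather than a merge.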
Ownership - Teams Own their Services from their desktops to production
We care about deployability and releasability
- No manual steps, no asking operations “Can you just do X this one time after this goes out?” No, follow the established patterns for releases and don’t consider a body of work done until it can be released without manual intervention.
We care about maintainability
- No manual steps to resolve issues, no asking operations “Can you please run this command one time for us?” No, create tools to deal with maintenance needs
We care about performance
- We use Infrastructure as Code to build our environments, and that includes the ability to stand up and tear down performance environments for teams to test performance of their services using production-like environments
We have a system to handle On-Call issues
- Playbooks are kept for the person On-Call to perform a set of steps based on the alerts that occur for the services they own
- Incident Reviews - Culture of blameless post-mortem for the purpose of learning and to avoid repeating the same mistake (typically improved by updating processes or adding more automation to a manual process).
- On-Call contacts outside of work hours should be exceptionally rare, because we don’t lean on this system as a solution to ongoing problems. Instead, we prioritize the items coming out of the On-Call Incident Reviews to automate and improve processes, so we don’t get called again outside of regular hours
- When support issues come up within a sprint, the On-Call person is the first person that the issue should be discussed with for research and triage.
We succeed with team members that are on-site and remote.
Cross-team communication
- You are expected to communicate between teams -- not just with teams inside technology (engineering, product, design, integrations, SRE), but also outside of technology (operations, sales, supplier relations, marketing, etc.)
- We value direct communication between the people that are doing the work and know the most about it. Talk directly to the people you need to talk with, AND loop in your manager and others that should be in the know so that they can help and support this cross-team effort.
- If you see something, don’t just say something, do something -- take care of it, make a ticket, or raise it as an issue
- Research something you don’t know, take it upon yourself to understand how something works.
“Okay, I was looking at this...I saw there was an error, an idea, or a library that would make it easier…and did it.”
- There’s the work that’s in front of you, which you are expected to do, and there’s the freedom to do the right thing. This is why we don’t have a large bug backlog: we take care of our code as soon as possible.
SLOs and Error Budgets
We have Service Level Objective (SLO) targets for service availability, which is the number of successful calls to a service divided by the total calls to that service, expressed as a percentage. Scheduled maintenance that is planned at least 1 week in advance does not count towards downtime.
- The SLO targets should be set by the business owners. It is ultimately a business decision, not a technical one.
We understand that SLOs provide clear expectations from the business to the technology team on a Service’s availability requirements.
We understand that “error budgets” are the cumulative amount of time that a Service is allowed to be unavailable. Once an error budget is exceeded (or when it is about to be exceeded), the team must be given time to invest in Enablers instead of Features, so that the SLOs can be met again. Otherwise, the business needs to lower the SLOs to increase the error budget.
- New to the ‘error budget’ concept? It’s a newish thing, so here’s a quick reference for you to get up to speed on it.
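The availability and error budget definitions above can be sketched in a few lines. This is an illustration under assumptions, not a STORD tool: the function names, the 30-day window, and the example call counts are all hypothetical.

```python
def availability(successful_calls, total_calls):
    """Fraction of calls that succeeded; an idle service counts as available."""
    if total_calls == 0:
        return 1.0
    return successful_calls / total_calls


def remaining_error_budget(slo_target, successful_calls, total_calls):
    """Failed calls that can still be absorbed before the SLO is breached.

    A negative result means the error budget is already spent.
    """
    allowed_failures = (1 - slo_target) * total_calls
    actual_failures = total_calls - successful_calls
    return allowed_failures - actual_failures


def downtime_budget_minutes(slo_target, window_days=30):
    """The same budget expressed as minutes of allowed downtime in the window."""
    return (1 - slo_target) * window_days * 24 * 60


# Hypothetical window: a 99.9% target, one million calls, 600 of them failed.
observed = availability(999_400, 1_000_000)
budget_left = remaining_error_budget(0.999, 999_400, 1_000_000)
```

With these numbers about 400 failed calls remain in the budget, and a 99.9% target over 30 days corresponds to roughly 43 minutes of allowed downtime.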
We need our services to have appropriate logging messages and levels (info, warning, and error), because this helps us debug issues in production quickly.
We need our services to have operating metrics to track historic trends to help us with SLO tracking, post-mortems, and to foresee issues.
We understand that no service has 100% availability, and therefore calls to services must be retried at least 2 times (so 3 attempts overall) with exponential backoff on each retry.
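A minimal sketch of that retry rule follows. The helper name `call_with_retries`, the 0.5 second base delay, and the injectable `sleep` parameter are assumptions made for illustration; only the "2 retries, 3 attempts, exponential backoff" shape comes from the text above.

```python
import time


def call_with_retries(operation, retries=2, base_delay_seconds=0.5, sleep=time.sleep):
    """Run `operation`, retrying failed calls with exponential backoff.

    With the default of 2 retries there are at most 3 attempts, and the
    waits between them are base_delay_seconds, then 2 * base_delay_seconds.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except Exception:
            # Assumption: every exception is treated as transient. Real code
            # should retry only errors that are safe to retry.
            if attempt == retries:
                raise  # out of attempts, surface the last error
            sleep(base_delay_seconds * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. Adding random jitter to each delay is a common refinement when many callers retry at the same time.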
Documentation
We codify our technical designs with documentation to include the context and decisions made at that time.
We read product documentation (e.g., PRDs) so that we can understand the purpose of our work, and comment and ask questions when we do not understand the purpose of our work. We don’t hesitate to get items clarified, because others might have a similar question to ours.
We document APIs for services when we agree upon interfaces, because it validates one another’s understanding and allows both teams to reference this key interface that one or more teams rely on.
- Documenting service interfaces (and setting up consumer-driven tests) is the best way to ensure that teams that are working in parallel are able to stay in sync.
- When you document your understanding between teams, you have to update the interface documentation whenever the implementation changes, and you can then notify the other teams that depend on it.
Agile / Scrum
Our teams do daily standups to see how they can either ask for help or help others on the team to complete the sprint goals. Team members also report whether they are blocked or foresee any issues or risks with hitting the sprint goals.
Our teams groom stories with acceptance criteria that the team can understand well enough to point work -- there can be some unknowns, and that should be reflected in the point estimate, along with complexity.
- When the team just has no idea, a spike story can be used to capture the knowledge needed to point it.
- Teams have ‘steps of doneness’ to align on foundational acceptance criteria, what it means for a story/task to be done, that applies to all stories/tasks (rather than the team adding such foundational ACs to every story/task)
We use story points to represent the size of an effort, because humans are not good at estimating time, but are good at estimating relative size (relative, as in relation to other stories the team has done before). Story size is based on three items: level-of-effort, complexity, and risk/unknowns.
Our teams plan sprints based on the goals of the sprint, and average team velocity -- the empirical story point velocity that the team has performed in the past, on average.
- The goal is consistency in points completed per sprint, not an ever-increasing number of points completed per sprint.
- Do take into account planned time off or holidays when planning a sprint
Our teams groom entire epics of work so that they can provide an estimate of when the work will get completed based on their backlog and average sprint velocity.
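The forecast described above is simple arithmetic: divide the groomed backlog by the team's average velocity and round up to whole sprints. A minimal sketch, where the function names and the sample numbers are hypothetical:

```python
import math
from statistics import mean


def average_velocity(completed_points_per_sprint):
    """Empirical velocity: the mean of points completed in past sprints."""
    return mean(completed_points_per_sprint)


def sprints_to_complete(remaining_points, completed_points_per_sprint):
    """Whole sprints needed to finish the backlog at the average velocity."""
    velocity = average_velocity(completed_points_per_sprint)
    if velocity <= 0:
        raise ValueError("velocity must be positive to forecast")
    return math.ceil(remaining_points / velocity)


# Hypothetical team history and epic size.
recent_sprints = [21, 18, 24]
epic_backlog_points = 100
forecast = sprints_to_complete(epic_backlog_points, recent_sprints)
```

Here the average velocity is 21 points, so a 100-point epic is forecast at 5 sprints. Planned time off or holidays would lower the velocity used for the affected sprints.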
Our teams demonstrate their completed work to the key stakeholders to get feedback on working software
Our teams perform retrospectives to learn from one another.
- We try to make one improvement per sprint.
- Making a 2 percent improvement per sprint would net a 52% improvement each year (over 26 sprints) for an entire team of people, and that’s before compounding; with compounding it comes to roughly 67%.
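The arithmetic behind that improvement figure can be checked quickly. The sketch assumes 26 sprints per year (two-week sprints), which is what a 2% gain adding up to 52% implies; the function names are illustrative.

```python
def simple_improvement(rate_per_sprint, sprints_per_year):
    """Yearly improvement when each sprint's gain is just added up."""
    return rate_per_sprint * sprints_per_year


def compounded_improvement(rate_per_sprint, sprints_per_year):
    """Yearly improvement when each sprint builds on the previous one."""
    return (1 + rate_per_sprint) ** sprints_per_year - 1


SPRINTS_PER_YEAR = 26  # assumption: two-week sprints

simple = simple_improvement(0.02, SPRINTS_PER_YEAR)
compounded = compounded_improvement(0.02, SPRINTS_PER_YEAR)
print(f"simple: {simple:.0%}, compounded: {compounded:.0%}")
```

Under these assumptions the simple sum is 52% and the compounded figure is roughly 67%.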
One last thing...
It is important to have a shared understanding of where we are headed with the engineering practices at STORD. Whether you are a veteran team member or ramping up as a new hire, you need to understand what we are working towards as an Engineering department. Please see and understand where we are headed and help us get there as fast as we can!