This is the second article in our ‘Platform engineering survival: Solving the core challenges’ series.
In a Team Topologies model, stream-aligned teams are built to move fast. However, "shipping features" involves more than just writing code; it requires managing orchestration, observability, identity and access management (IAM) and secrets, among a wide set of technology capabilities.
Modern platforms act as a force multiplier for these teams. By offering a curated portfolio of self-service technology capabilities, the platform removes friction and abstracts underlying complexity. This ensures that stream-aligned teams aren't bogged down by "plumbing," but are instead empowered to focus entirely on their core domain logic.
For the promise of platform engineering to become a reality, organizations need a platform operating model optimized for fast flow, not only well understood by leadership but also adopted and practiced at scale. Of notable relevance here are DORA ‘capabilities that enable fast flow’ such as ‘loosely coupled teams’ and ‘streamlining change approval’.
As a litmus test on how well your platform operating model is functioning, consider some of these common use cases from stream-aligned teams:
Getting access to code repositories, pipelines, registries and runtime environments.
Creating a new Kubernetes cluster, or horizontally scaling compute capacity (nodes) in or out.
Creating a new database instance and wiring it up to an application.
Getting secure network connectivity to an internal system/service or to the internet.
Accessing telemetry data (logs, metrics, traces) from applications in different environments.
For these use cases, wearing the platform customer’s hat, consider the following questions:
How are platform services discovered and consumed?
How long are the feedback loops between platform services being requested and becoming usable?
Conversely, from a platform team’s perspective, how much effort does it take to fulfill these requests? How much manual intervention is involved?
The answers to these questions may reveal symptoms of a platform struggling to realize value from its investments. A likely cause, as the title states, could be an inadequate or absent platform operating model.
Key questions, signals of struggle and signals of success for assessing the state of the platform
As the term ‘operating model’ tends to be subjective and ambiguous in different contexts, let’s narrow it down to a few key elements. For each element, you will find:
Guiding questions for assessing your current state.
Signals of struggle observed in platforms that have struggled to scale beyond a few teams (over a two-to-three-year timeline).
Signals of success observed in platforms that have scaled to 50+ teams and 1000+ users (over a two-to-three-year timeline).
A short disclaimer before we dive in — your mileage may vary. The aim of describing the elements below is to equip you with a practical starting point, encouraging further exploration and research within your organization's unique context.
Team shapes and responsibility boundaries
Guiding questions
How big is the platform team and how does it organize itself?
Are responsibility boundaries clearly outlined? (e.g., the platform manages the lifecycle of infrastructure, while stream-aligned teams manage the lifecycle of their applications and data products).
How does collaboration take place when incidents occur? (e.g., when an application front-end is inaccessible, what happens next?).
Signals of struggle
The platform suffers from internal silos, with different teams managing distinct technology components. This structure results in a lack of clear accountability for the end-to-end customer/developer journey across all platform services.
Increased platform complexity forces stream-aligned teams to create shadow platform teams.
Conflicts often arise between platform and stream-aligned teams with a ‘ping-pong’ of responsibilities. Architectural decisions are frequently overshadowed by internal politics and disputes over spheres of influence.
Incident accountability devolves into a ticket-based exchange between stream-aligned and platform teams, with incident response plans rarely practiced.
Signals of success
As platform capabilities grow, specialized sub-teams may emerge (e.g., separate teams for networking, compute orchestration and tooling built on top of Kubernetes) while preserving a consistent developer journey through the platform.
Responsibility boundaries are transparently documented (e.g., in a developer portal) and regularly reviewed in a collaborative effort.
Comprehensive observability pinpoints the affected component across the tech stack, and a practiced incident response plan is followed.
Interfaces and knowledge discovery
Guiding questions
How do stream-aligned teams discover different services offered by the platform?
What are the interfaces through which platform services are requested and used?
Signals of struggle
A long, perhaps stale, document provides strict usage instructions for platform services. Any deviation from the standard requires a change request.
The majority of platform services are requested through ticket-based workflows with manual approvals and processing.
The platform becomes a bottleneck: stream-aligned teams experience ‘backlog coupling’ with the platform, leaving user stories blocked awaiting platform action.
The platform team does not feel approachable, with feature requests or issues filtered through multiple channels before reaching it.
The platform plays minimal to no part in facilitating knowledge sharing between stream-aligned teams who use the platform’s services.
Signals of success
A continuously updated developer portal provides tangible reference examples, documentation and walkthroughs for the majority of developer journeys. The platform is optimized for paved paths while providing flexibility for deviations.
Members of stream-aligned teams collaborate directly with the platform team (via chat, consultation hours or forums) – as an active user community – to submit feature requests and get on-demand support for platform services.
Stream-aligned teams regularly exchange examples of successful platform usage patterns, through demos and write-ups.
Platform services are made available through automated ‘self-service’ interfaces such as Git-based pull requests and API- or CLI-based workflows.
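To make ‘self-service’ concrete, here is a minimal sketch of what an API-based workflow might look like. The endpoint, payload fields and token handling are hypothetical and purely illustrative; the point is that a stream-aligned team can request a platform service programmatically rather than raising a ticket, and the same request could equally be expressed as a declarative manifest submitted through a Git pull request.

```python
import os
import requests

# Hypothetical platform API endpoint; a real platform would publish its own
# contract, typically alongside a CLI and a Git-based (pull request) workflow.
PLATFORM_API = "https://platform.example.internal/api/v1"


def request_database(team: str, app: str, engine: str = "postgres", size: str = "small") -> str:
    """Request a managed database instance from the platform and return the request ID."""
    response = requests.post(
        f"{PLATFORM_API}/databases",
        json={"team": team, "application": app, "engine": engine, "size": size},
        headers={"Authorization": f"Bearer {os.environ['PLATFORM_TOKEN']}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["request_id"]


if __name__ == "__main__":
    # A stream-aligned team provisions a database for its service without filing a ticket.
    print(request_database(team="payments", app="checkout-service"))
```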
Platform evolution and architectural decision making
Guiding questions
Does a platform roadmap exist? Is it validated regularly? Does it cover the emerging needs of stream-aligned teams?
Are there roles in the platform dedicated to product management and customer engagement?
When changing architectural elements such as tools, dependencies, interfaces or access controls, what is the decision-making process? Are platform customers consulted?
Signals of struggle
Stream-aligned teams request platform features repeatedly without getting a clear answer on feasibility, priority or timelines.
The platform team unilaterally makes technology and architecture decisions with minimal research of its customers’ needs. This may be compounded by a lack of dedicated product management responsibilities within the platform.
Architectural changes frequently cause disruptions to customer workflows, resulting in a restoration period that can range from several hours to days.
Signals of success
Platform roadmaps are visible to all its customers, enhanced by methods such as voting to gauge feature demand. There is dedicated responsibility for platform product management, distilling customer needs into the platform vision, roadmap and eventually the backlog.
Stream-aligned teams are consulted in the architectural decision-making process, the impact of changes is modeled and risk mitigations are put in place.
Whenever platform changes occur (e.g., patching, upgrades, tool replacements, re-architecting, migrations, or deprecations), they are executed with minimal to no disruption. Any necessary actions for stream-aligned teams are documented and communicated well ahead of time.
Funding and FinOps
Guiding questions
How is funding secured for platform evolution? Is it tied to the business priorities of stream-aligned teams?
How are platform resource costs governed? Do platform customers have visibility into their resource usage and associated costs? Is there suitable FinOps tooling in place?
Signals of struggle
The platform is treated as a cost center with upfront, ‘static’ budgeting. Platform and stream-aligned teams compete for annual budgets.
Stream-aligned teams are constantly asked to reduce resource usage and provide upfront capacity requirements. Their requests for more infrastructure capacity are subject to long budget approval cycles.
Signals of success
The platform's funding is linked to and incorporates the needs and business priorities of its customers (the stream-aligned teams).
Stream-aligned teams can access up-to-date, on-demand resource usage and cost information.
Stream-aligned teams are provided incentives (e.g., ‘showback’ or chargeback) and the ability (e.g., automated scaling) to optimize their usage and costs in line with business needs.
The platform provides accessible FinOps tooling for reviewing costs and offering optimization recommendations.
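As a rough illustration of ‘showback’, the sketch below aggregates cost records by a team label and prints a per-team summary. The record format is invented for illustration; in practice this data would come from a cloud provider’s billing export or dedicated FinOps tooling, keyed by resource tags or labels.

```python
from collections import defaultdict

# Illustrative cost records; real data would come from a cloud billing export
# or FinOps tooling, attributed to teams via resource tags/labels.
cost_records = [
    {"team": "payments", "service": "compute", "cost_usd": 1240.50},
    {"team": "payments", "service": "database", "cost_usd": 310.00},
    {"team": "search", "service": "compute", "cost_usd": 980.75},
]


def showback_report(records: list[dict]) -> dict[str, float]:
    """Sum costs per team so each stream-aligned team sees its own usage."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["team"]] += record["cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    for team, total in sorted(showback_report(cost_records).items()):
        print(f"{team}: ${total:,.2f}")
```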
Measuring success
Guiding questions
To what extent are the platform's services being adopted? Following the release of a new feature, how long does it take for the majority of teams to start using it?
How is the platform impacting the daily lives of its users? Is feedback from stream-aligned teams sought regularly?
How is the value of the platform measured? How is the ROI articulated to the business?
Signals of struggle
Despite significant investments over the years, the platform fails to deliver value: either adoption is low or, where usage is mandated, adoption is high but developer experience suffers.
A metrics strategy may exist in principle but is rarely exercised. There is no quantified measure of the ‘waste and friction’ in the SDLC that the platform seeks to reduce.
Stream-aligned teams may share feedback on an ad-hoc basis without a defined mechanism to do so.
The ROI of the platform isn’t clearly visible to the business even if it may be ‘felt’ by stream-aligned teams.
Signals of success
Platform adoption is strong and increasing, with regular feedback surveys validating that the platform meets the needs of stream-aligned teams. Measures such as an internal Net Promoter Score are consistently high.
Metrics such as time to first commit, time to launch a new service in production, and the rate of adoption of paved paths and standards measure the value delivered to stream-aligned teams, comparing ‘before and after’ the introduction of platform services.
Platform services help stream-aligned teams improve software delivery and quality, evidenced by improvements in the DORA four key metrics, such as faster deployments and quicker failure recovery (see also the SPACE framework for a more comprehensive approach to measuring efficiency); a simple sketch of deriving two of these metrics follows this list.
With less time spent on activities such as infrastructure configuration and management and a reduction in cognitive load and friction, stream-aligned teams can dedicate more time towards delivering domain-specific business features. With adoption at scale, this translates into a clear ROI towards the business.
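As a simple sketch of how two of the DORA four key metrics might be derived, the snippet below computes deployment frequency and mean lead time for changes from a list of deployment events. The event shape and sample values are hypothetical; real data would come from your CI/CD and version control systems.

```python
from datetime import datetime, timedelta

# Hypothetical deployment events; real data would come from CI/CD and VCS tooling.
deployments = [
    {"committed_at": datetime(2024, 6, 1, 9, 0), "deployed_at": datetime(2024, 6, 1, 15, 0)},
    {"committed_at": datetime(2024, 6, 2, 11, 0), "deployed_at": datetime(2024, 6, 3, 10, 0)},
    {"committed_at": datetime(2024, 6, 4, 8, 30), "deployed_at": datetime(2024, 6, 4, 12, 0)},
]


def mean_lead_time(events: list[dict]) -> timedelta:
    """Average time from commit to deployment (lead time for changes)."""
    total = sum((e["deployed_at"] - e["committed_at"] for e in events), timedelta())
    return total / len(events)


def deployment_frequency(events: list[dict], period_days: int = 7) -> float:
    """Deployments per day over a reporting period."""
    return len(events) / period_days


if __name__ == "__main__":
    print(f"Mean lead time: {mean_lead_time(deployments)}")
    print(f"Deployments per day: {deployment_frequency(deployments):.2f}")
```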
Platform success: From MVP to 1200+ users in five years
Where there are signals of a successful platform operating model, business value can be delivered quickly. At a large automotive enterprise, we built a new cloud-based developer platform using an incremental approach, backed by a deep understanding of customer needs and a transparent roadmap.
This enabled the first set of customers to go live on the platform within a year of the platform’s inception, followed by further waves of customers. This one-year phase included MVP development, running multiple technology and architecture evaluations, testing platform features with a set of ‘beta customers’ and collaboration with multiple enterprise IT departments including network, datacenter, cloud and InfoSec.
Fast forwarding five years, the platform now houses more than 80 teams and 1200 users. The platform experience is rated 4.8/5 by customers and it takes less than a day to onboard a new stream-aligned team.
By moving to the new platform, customers reported significantly shorter lead times for common use cases. For example, an accelerated path to the cloud allowed stream-aligned teams to leverage node auto-scaling, rather than waiting a month for new compute capacity. The time required to gain access to a service behind a firewall was reduced from weeks to just minutes. Investing in strengthening the platform operating model from the start continues to provide significant benefits as we scale.
While the platform operating model described above applies a broad stroke across different elements, every organization and context is unique. If your platform exhibits signals of struggle, moving towards improved alternatives should be a journey rather than a single leap — and must include a product-thinking mindset that prioritizes user needs in engineering and a culture of continuous learning driven by feedback and data.
Platform engineering's core effectiveness comes from aligning technology capabilities with the critical needs of its users, enabled by an operating model optimized for fast flow of value.
In the next article, we'll discuss overcoming political hurdles for platform engineering success.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.