Platform Engineering and Culture

NIRAJ DESAI

Hello Everyone!

In this article I would like to share my perspective on Platform Engineering, what I have learnt over the years in this field, and how much it has evolved. Platform Engineering is now at the forefront of governing, operationalizing, and managing the large-scale ecosystems responsible for rapid delivery, which have helped companies scale out and take a leap beyond the extraordinary in their growth phase.

We see the term operations everywhere. Operations is a term that can be linked with any vertical to improve and streamline its processes and delivery and make them more efficient. It is true that the “idea” sells a business, but it is also true that “operations” give the idea the fundamentals to run and sell the business, which is why we are seeing the term “Ops” everywhere. We have heard about DevOps, MLOps, FinOps, CloudOps, HROps, and so many more. But what exactly is Ops?

By the dictionary definition, Operation means “the action of functioning or the fact of being active or in effect”. To me, Operations means “Culture”. It is a discipline that the whole company adopts, and it brings changes in the way we work, communicate, govern, and deliver our ecosystem to clients in the most effective and reliable way. It grows within an organization through best practices and is constantly redefined based on how we work, develop, and release our services for adoption by the end user. It takes a whole company’s dedicated effort to achieve Operations. Operations as Culture does not happen in a day or a year; it takes time. Organizations go through several failures to find the right way to operate, and so the way they operate can differ for every organization, as it is built on the experience, collective thoughts, behaviors, patterns, and ideologies of all the people working towards it. Thus there is no golden rule for operations. They are dynamic and heterogeneous. But why are we talking about Operations when the blog is about Platform Engineering?

Platform Engineering for me is a Culture. It is the most essential part of operations when it comes to SaaS development, delivery, and governance models. It stands at the core and is most responsible for delivering the products that customers NEED, by giving developers the jump start they need to create astounding features and products that can shape the future of society, and by running those features and products in the most reliable and optimized way.

Platform Engineering 101

In the rapidly evolving realm of software development, forward-thinking organizations are embracing the transformative power of a platform engineering culture. This innovative approach emphasizes the creation of shared platforms, tools, and practices that empower developers to deliver high-quality software faster and more efficiently.

It is a process organizations can use to leverage their cloud platform as efficiently as possible so that engineers can deliver value to production quickly and reliably. To explain how platforms work, let me take an example from the automotive industry.

Imagine you are a truck manufacturer. The manufacturer’s core product is the truck, and on top of it there are OEMs and third-party players who customize the truck based on their needs. The same can be said of Platform Engineering: we give the same basic components to different product teams inside an organization, and these product teams leverage the completed parts to make their own shippable product. If instead each product team built these basic components by itself, it would be expensive and redundant, in terms of both cost and resources, compared to sharing them.

When the Platform Teams become internal or external based on the context, they can change their operations as follows:

Evolution of Platform Engineering

From the explanations above, we might think that Platform Engineering directly conflicts with SRE and DevOps. If we introduce Platform Engineering, would SRE and DevOps no longer be required? That is not true; DevOps and SRE are fields whose fundamental principles gave rise to Platform Engineering. Platform Engineering is the natural progression of DevOps, SRE, and Observability. It expands on the practices of DevOps by moving development out of silos and taking a collaborative, bird’s-eye view of an organization’s technology. Platform Engineering is built on a foundational infrastructure that enables self-service capabilities so developers can deploy code faster, more reliably, and more securely.

While platform engineering represents the next wave of change in DevOps, it’s essential for leaders and developers to understand that it isn’t a singular technology or approach. The shift to platform engineering also represents a cultural shift for developers and leaders to address the needs, challenges and rapid pace of modern software development by optimizing processes, accelerating delivery and reducing the complexity of operations. For platform engineering to succeed, it’s critical to approach it from the perspective of a technology-driven culture that creates solutions with a developer’s perspective in mind. By seeking to improve the developer experience first, enterprises can achieve better results.

The above has been made possible by recent shifts in the technology landscape, and above all by one technology that is greatly shaping our society through optimized and reproducible computing: cloud-native architecture.

Cloud-native architecture, the practice of designing and building applications for cloud environments, is fundamental to modern platform engineering. Infrastructure-as-code (IaC) is a key component, enabling the automated provisioning and management of resources with tools such as Terraform. Docker packages applications as containers and Kubernetes orchestrates and scales them, while serverless computing abstracts infrastructure management away from developers almost entirely. This trend helps platform engineering stay agile, scalable, and cost-effective, with less downtime and easier maintenance.
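To make the IaC idea concrete, here is a minimal Python sketch of the declarative pattern such tools broadly follow: the desired infrastructure is written down as data, and a planning step works out what has to change. The `Resource` type, its fields, and the `plan` function are hypothetical names chosen for illustration; this is not the API of Terraform or any other real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A hypothetical infrastructure resource described as plain data."""
    name: str
    kind: str
    size: str


def plan(desired, actual):
    """Return the (action, resource name) steps that move actual to desired."""
    desired_by_name = {r.name: r for r in desired}
    actual_by_name = {r.name: r for r in actual}
    actions = []
    # Anything declared but missing is created; anything that differs is updated.
    for name, resource in desired_by_name.items():
        if name not in actual_by_name:
            actions.append(("create", name))
        elif actual_by_name[name] != resource:
            actions.append(("update", name))
    # Anything running that is no longer declared is deleted.
    for name in actual_by_name:
        if name not in desired_by_name:
            actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    desired = [Resource("web", "vm", "small"), Resource("db", "database", "large")]
    actual = [Resource("web", "vm", "large"), Resource("cache", "vm", "small")]
    for action, name in plan(desired, actual):
        print(action, name)
```

Because the desired state lives in version control as data, the same plan can be reviewed, repeated, and rolled back like any other code change.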

Why does an Organization need Platform Engineering

Essentially, Platform Engineering takes charge of the following domains and equips the product and service teams to use them:

Abstraction and Standardization: Platform engineers adeptly abstract the complexities of underlying infrastructure and services, providing developers with simple and standardized APIs. By doing so, they empower developers to focus on building features and functionality rather than dealing with low-level implementation details.

Automation: At the core of a platform engineering culture lies a strong emphasis on automation. Automated provisioning, deployment, testing, and monitoring processes enable teams to deliver software with heightened speed, reliability, and consistency.

Self-Service: Platform engineers diligently work to empower developers with self-service capabilities, enabling them to access and manage resources effortlessly. These self-service platforms facilitate quicker iterations and reduce reliance on centralized teams for routine tasks.

Continuous Improvement: A platform engineering culture is deeply ingrained in the principles of continuous improvement. Platform engineers actively seek feedback from developers, identify pain points, and iteratively evolve their platforms to cater to evolving development needs.

Which Results In …

Accelerated Development: By providing pre-built components and services, a platform engineering culture accelerates the development process. Developers can concentrate on building application logic, thus reducing redundancy and expediting time-to-market for new features and products.

Enhanced Collaboration: A platform engineering culture fosters a spirit of cross-team collaboration. Platform teams work hand-in-hand with application development teams, promoting a sense of partnership and shared responsibility for delivering high-quality software.

Scalability and Efficiency: Shared platforms inherently reduce redundant efforts and promote efficiency in software development. As the organization grows, the platform seamlessly scales to accommodate increasing demands without necessitating a significant increase in resources.

Consistency and Reliability: Standardized platforms guarantee consistency across products, resulting in a more reliable and predictable user experience. Additionally, automated processes lead to fewer human errors and a heightened level of system reliability.

Cultivating Innovation: By entrusting common infrastructure concerns to platform teams, application developers are liberated to focus on innovation and building unique features that add substantial value to their products or services.

Risk Reduction: The standardized platform reduces the risk of system failures and outages, enhancing overall reliability and performance.
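The abstraction, standardization, and self-service domains described above can be sketched in a few lines of Python. In this illustration a developer states only what they care about, and the platform fills in standardized defaults and validates the request. All of the names here (`ServiceRequest`, `provision`, `PLATFORM_DEFAULTS`, the sizes and the region) are assumptions made up for the example, not a real platform API.

```python
from dataclasses import dataclass

# Hypothetical defaults that the platform team owns and evolves centrally.
PLATFORM_DEFAULTS = {"replicas": 2, "monitoring": True, "region": "eu-west-1"}
ALLOWED_SIZES = {"small", "medium", "large"}


@dataclass(frozen=True)
class ServiceRequest:
    """What a product team has to supply: a name, its team, and a size."""
    name: str
    team: str
    size: str = "small"


def provision(request: ServiceRequest) -> dict:
    """Expand a minimal self-service request into a full, standardized spec."""
    if request.size not in ALLOWED_SIZES:
        raise ValueError(f"unsupported size: {request.size!r}")
    return {
        "name": f"{request.team}-{request.name}",
        "size": request.size,
        **PLATFORM_DEFAULTS,
    }


if __name__ == "__main__":
    print(provision(ServiceRequest(name="checkout", team="payments")))
```

The product team never touches replicas, monitoring, or regions; when the platform team improves a default, every service provisioned this way picks it up.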

Principles Governing Successful Platform Engineering

Principle 1: It isn’t one platform, it’s layers of platforms that need different specialized knowledge so it’s usually many platform teams.
Principle 2: The platform layers are dynamic, evolve over time, and tend to move “up the stack” as they add functionality, and shed capabilities that are subsumed by lower level platforms.
Principle 3: The interface to a platform should be driven by the users of the platform. A platform team should include a product manager (or the team lead should perform that function), have a roadmap, and have mechanisms for prioritizing incoming requests.
Principle 4: A very clear distinction should be made between building internal platforms optimized to change quickly to meet specific business needs, and building externalized platforms optimized for long term stability, where you may not know who or what depends on the platform, and can’t always ask them to change with you.

How to have a successful Platform Engineering Topology in your organization

There are some topologies that we should apply and some that we should avoid in order to make Platform Engineering successful. Let me list the Good Types and the Anti-Types below.

Anti-Type A: Dev and Ops Silos

This is the classic “throw it over the wall” split between Dev and Ops. It means that story points can be claimed early (DONE means “feature-complete”, but not working in Production). The result is that software operability suffers, because Devs do not have enough context for operational features and Ops folks do not have the time or inclination to engage Devs in order to fix the problems before the software goes live.

We likely all know this topology is bad, but I think there are actually worse topologies; at least with Anti-Type A (Dev and Ops Silos), we know there is a problem.

Anti-Type B: DevOps Team Silo

The DevOps Team Silo typically results from a manager or exec deciding that they “need a bit of this DevOps thing” and starting a “DevOps team” (probably full of people known as “a DevOps”). The members of the DevOps team quickly form another silo, keeping Dev and Ops further apart than ever as they defend their corner, skills, and toolset from the “clueless Devs” and “dinosaur Ops” people.

Anti-Type C: Dev Don’t Need Ops

This topology is born of a combination of naivety and arrogance from developers and development managers, particularly when starting on new projects or systems. Assuming that Ops is now a thing of the past (“we have the Cloud now, right?”), the developers wildly underestimate the complexity and importance of operational skills and activities, and believe that they can do without them, or just cover them in spare hours.

Anti-Type D: Rebranded SysAdmin

This anti-type is typical in organizations with low engineering maturity. They want to improve their practices and reduce costs, yet they fail to see IT as a core driver of the business. Because industry successes with DevOps are now evident, they want to “do DevOps” as well. Unfortunately, instead of reflecting on the gaps in the current structure and relationships, they take the elusive path of hiring “DevOps engineers” for their Ops team(s). DevOps becomes just a rebranding of the role previously known as SysAdmin, with no real cultural/organizational change taking place. This anti-type is becoming more and more widespread as unscrupulous recruiters jump on the bandwagon searching for candidates with automation and tooling skills. Unfortunately, it’s the human communication skills that can make DevOps thrive in an organization.

Good Type A: Dev and Ops Collaboration

This is the “promised land” of DevOps: smooth collaboration between Dev teams and Ops teams, each specialising where needed, but also sharing where needed. There are likely many separate Dev teams, each working on a separate or semi-separate product stack.

My sense is that this Good Type A model needs quite substantial organizational change to be established, and a good degree of competence higher up in the technical management team. Dev and Ops must have a clearly expressed and demonstrably effective shared goal (“Delivering Reliable, Frequent Changes”, or whatever). Ops folks must be comfortable pairing with Devs and get to grips with test-driven coding and Git, and Devs must take operational features seriously and seek out Ops people for input into logging implementations, and so on, all of which needs quite a culture change from the recent past.

Good Type B: Fully Shared Ops Responsibilities

In this topology, operational engineers are integrated into product development teams. There is so little separation between Dev and Ops that all people are highly focused on a shared purpose; this is arguably a form of Good Type A (Dev and Ops Collaboration), but it has some special features.

Organizations such as Netflix and Facebook have effectively achieved this Good Type B topology with a single web-based product, but I think it’s probably not hugely applicable outside a narrow product focus, because the budgetary constraints and context-switching typically present in an organization with multiple product streams will probably force Dev and Ops further apart (say, back to a Good Type A model). This topology might also be called “NoOps”, as there is no distinct or visible Operations team (although the Netflix NoOps might also be Good Type C (Ops as IaaS)).

Good Type C: Ops as Infrastructure-as-a-Service (Platform)

For organizations with a fairly traditional IT Operations department which cannot or will not change rapidly [enough], and for organizations who run all their applications in the public cloud (Amazon EC2, Rackspace, Azure, etc.), it probably helps to treat Operations as a team who simply provides the elastic infrastructure on which applications are deployed and run; the internal Ops team is thus directly equivalent to Amazon EC2, or Infrastructure-as-a-Service.

A team (perhaps a virtual team) within Dev then acts as a source of expertise about operational features, metrics, monitoring, server provisioning, etc., and probably does most of the communication with the IaaS team. This team is still a Dev team, however, following standard practices like TDD, CI, iterative development, coaching, etc.

The IaaS topology trades some potential effectiveness (losing direct collaboration with Ops people) for easier implementation, possibly deriving value more quickly than by trying for Good Type A (Dev and Ops Collaboration), which could be attempted at a later date.

Good Type D: DevOps as an External Service

Some organizations, particularly smaller ones, might not have the finances, experience, or staff to take a lead on the operational aspects of the software they produce. The Dev team might then reach out to a service provider like Rackspace to help them build test environments and automate their infrastructure and monitoring, and advise them on the kinds of operational features to implement during the software development cycles.

What might be called DevOps-as-a-Service could be a useful and pragmatic way for a small organization or team to learn about automation, monitoring, and configuration management, and then perhaps move towards a Good Type C (Ops as IaaS) or even Good Type A (Dev and Ops Collaboration) model as they grow and take on more staff with operational focus.

Returning to the kinship between Platform Engineering and Culture that I briefly touched on in the introduction, I would like to compare what makes a culture with what makes platform engineering successful, which should make more sense now.

Just as culture is an amalgamation of values, traditions, arts, language, skills, artistry, tools, objects, food, drinks, and community, Platform Engineering is an amalgamation of best practices, use cases, tools, automation, code recipes, developer experience, knowledge, and community. That is why I believe they go hand in hand and why this culture can serve any organization going forward. Adopting a Platform Engineering culture will help companies leverage cutting-edge technologies and ship them more quickly to consumers.

I would like to end the blog with two quotes that have been essential to my journey in life and computer science:

“First make the change easy, then make the easy change.”

-Kent Beck

“Any organization that designs a system or a platform will produce a design whose structure is a copy of the organization’s communication structure.”

-Melvin Conway

Through this blog, I wanted to offer you a single brush stroke of color, and I would be happy to see you paint the whole picture with it in your teams and organizations.