A Primer On Oracle Compute Cloud@Customer

Enterprises across regulated industries, such as banking, healthcare, and the public sector, often find themselves caught in a dilemma: they want the scale and innovation of the public cloud, but they can’t move their data off-premises due to regulatory, latency, or sovereignty concerns. The answer is not one-size-fits-all, and the market reflects that through several deployment models:

  1. Public cloud vendors extending to on-premises (AWS Outposts, Azure Local + Azure Arc, Google Distributed Cloud Edge)
  2. Software vendors offering a “private cloud” (Nutanix, VMware by Broadcom)
  3. Hardware vendors offering “cloud-like” experiences (HPE GreenLake, Dell APEX, Lenovo TruScale)

Oracle Compute Cloud@Customer (C3) bridges the best of all three worlds:

  • Brings native OCI compute, storage, GPU, and PaaS services on-premises, managed from the OCI control plane
  • Keeps data resident while Oracle manages the infrastructure
  • Oracle manages hardware, software, updates, and lifecycle
  • Integration with Oracle Exadata and Autonomous Database
  • Same APIs, SDKs, CLI, and DevOps tools as OCI

Architecture

The Cloud Control Plane is an advanced software platform that operates within Oracle Cloud Infrastructure (OCI). It serves as the central management interface for deploying and operating resources, including those running on Oracle Compute Cloud@Customer. Customers access the Cloud Control Plane securely via a web browser, command-line interface (CLI), REST APIs, or language-specific SDKs, enabling flexible integration into existing IT and DevOps workflows.
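
For example, a quick look at the instances in a compartment with the OCI CLI could look like this. Treat it as a sketch: the compartment OCID is a placeholder, and the endpoint URL is a made-up value standing in for your C3 system’s own endpoint:

# List instances in a compartment (placeholder OCID); pointing the call at the
# C3 system's endpoint instead of a public region is shown with a made-up URL
oci compute instance list \
  --compartment-id ocid1.compartment.oc1..example \
  --endpoint https://c3.example.mycompany.com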

At the heart of the platform is the identity and access management (IAM) system that allows multiple teams or departments to share a single OCI tenancy while maintaining strict control over access. Using compartments, organizations can logically organize and isolate resources such as Compute Cloud@Customer instances, and enforce granular access policies across the environment.
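
As a small illustration of that model, creating a compartment and scoping a team’s permissions to it could look like this with the CLI (the group name, compartment name, and OCIDs are made up for the example):

# Create a compartment to isolate C3 resources
oci iam compartment create \
  --compartment-id ocid1.tenancy.oc1..example \
  --name c3-prod \
  --description "Production workloads on Compute Cloud@Customer"

# Grant one team management rights over instances in that compartment only
oci iam policy create \
  --compartment-id ocid1.tenancy.oc1..example \
  --name c3-prod-admins \
  --description "Admins for C3 production instances" \
  --statements '["Allow group C3-Admins to manage instance-family in compartment c3-prod"]'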

Communication between the Cloud Control Plane and the on-premises C3 system is established through a dedicated, secure tunnel. This encrypted tunnel is hosted by specialized management nodes within the rack. These nodes function as a gateway to the infrastructure, handling all control plane communications. In addition to maintaining the secure connection, they also:

  • Orchestrate cloud automation within the on-premises environment
  • Aggregate and route telemetry and diagnostic data to Oracle Support Services
  • Host software images and updates used for patching and maintenance

Figure: Your tenancy in an OCI region and how it connects to Compute Cloud@Customer in your data center.

Important: Even if connectivity between the Cloud Control Plane and the on-premises system is temporarily lost, virtual machines (VMs) and applications continue running uninterrupted on C3. This ensures high availability and operational continuity, even in isolated or restricted network environments.

Beyond deployment and orchestration, the Cloud Control Plane also handles essential lifecycle operations such as provisioning, patching, backup, and monitoring, and supports usage metering and billing.

Core Capabilities & Services

When you sign in to Oracle Compute Cloud@Customer, you gain access to the same types of core infrastructure resources available in the public Oracle Cloud Infrastructure (OCI). Here is what you can create and manage on C3:

  • Compute Instances. You can launch virtual machines (instances) tailored to your application requirements. Choose from various instance shapes based on CPU count, memory size, and network performance. Instances can be deployed using Oracle-provided platform images or custom images you bring yourself (see the CLI sketch after this list).
  • Virtual Cloud Networks (VCNs). A VCN is a software-defined, private network that replicates the structure of traditional physical networks. It includes subnets, route tables, internet/NAT gateways, and security rules. Every compute instance must reside within a VCN. On C3, you can configure the Load Balancing service (LBaaS) to automatically distribute network traffic.
  • Capacity and Performance Storage. Block Volumes, File Storage, and Object Storage.
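
To make the first item concrete, here is a sketch of launching an instance with the OCI CLI. Every OCID, the availability domain name, and the shape are placeholders; check which shapes your C3 system actually offers:

# Launch a VM instance into an existing VCN subnet (all identifiers are placeholders)
oci compute instance launch \
  --compartment-id ocid1.compartment.oc1..example \
  --availability-domain AD-1 \
  --shape VM.PCAStandard1.4 \
  --subnet-id ocid1.subnet.oc1..example \
  --image-id ocid1.image.oc1..example \
  --display-name app-server-1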

Oracle Operator Access Control

To further support enterprise-grade security and governance, Oracle Compute Cloud@Customer includes Oracle Operator Access Control (OpCtl), a sophisticated system designed to manage and audit privileged access to your on-premises infrastructure by Oracle personnel. Unlike traditional support models, where vendor access can be opaque or overly permissive, OpCtl gives customers explicit control over every support interaction.

Before any Oracle operator can access the C3 environment for maintenance, updates, or troubleshooting, the customer must approve the request, define the time window, and scope the level of access permitted. All sessions are fully audited, with logs available to the customer for compliance and security reviews. This ensures that sensitive workloads and data remain under strict governance, aligning with zero-trust principles and regulatory requirements. 

Available GPU Options on Compute Cloud@Customer

As enterprises aim to run AI, machine learning, digital twins, and graphics-intensive applications on-premises, Oracle introduced GPU expansion for Compute Cloud@Customer. This enhancement brings NVIDIA L40S GPU power directly into your data center.

Each GPU expansion node in the C3 environment is equipped with four NVIDIA L40S GPUs, and up to six of these nodes can be added to a single rack. For larger deployments, a second expansion rack can be connected, enabling support for a total of 12 nodes and up to 48 GPUs within a C3 deployment.

Oracle engineers deliver and install these GPU racks pre-configured, ensuring seamless integration with the base C3 system. These nodes connect to the existing compute and storage infrastructure over a high-speed spine-leaf network topology and are fully integrated with Oracle’s ZFS storage platform.

Platform-as-a-Service (PaaS) Offerings on C3

For organizations adopting microservices and containerized applications, Oracle Kubernetes Engine (OKE) on C3 provides a fully managed Kubernetes environment. Developers can deploy and manage Kubernetes clusters using the same cloud-native tooling and APIs as in OCI, while operators benefit from lifecycle automation, integrated logging, and metrics collection. OKE on C3 is ideal for hybrid deployments where containers may span on-prem and cloud environments.
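
As a rough sketch of that workflow, creating a cluster and pointing kubectl at it uses the same Container Engine (ce) commands as in a public OCI region; all IDs and the version string below are placeholders:

# Create an OKE cluster inside an existing VCN
oci ce cluster create \
  --compartment-id ocid1.compartment.oc1..example \
  --name demo-cluster \
  --kubernetes-version v1.28.2 \
  --vcn-id ocid1.vcn.oc1..example

# Generate a kubeconfig for the new cluster and verify access
oci ce cluster create-kubeconfig \
  --cluster-id ocid1.cluster.oc1..example \
  --file ~/.kube/config
kubectl get nodes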

The Logical Next Step After Compute Cloud@Customer?

Typically, organizations choose to move to OCI Dedicated Region when their cloud needs outgrow what C3 currently offers. As companies expand their cloud adoption, they require a richer set of PaaS capabilities, more advanced integration and analytics tools, and cloud-native services like AI and DevOps platforms that are not fully available in C3 yet. OCI Dedicated Region is designed to meet these demands by providing a comprehensive, turnkey cloud environment that is fully managed by Oracle but physically isolated within your data center.

I consider OCI Dedicated Region to be the next-generation private cloud. If you are a VMware by Broadcom customer and are looking for alternatives, have a look at 5 Strategic Paths from VMware to Oracle Cloud Infrastructure.

Final Thought – Choose the Right Model for Your Journey

Every organization is on its own digital transformation journey. For some, that means moving aggressively into the public cloud. For others, it’s about modernizing existing infrastructure or complying with tight regulations. If you need cloud-native services, enterprise-grade compute, and strong data sovereignty, Oracle Compute Cloud@Customer is one of the most complete and future-proof options available.

Open-Source Can Help With Portability And Lock-In But It Is Not A Silver Bullet

We have spent years chasing cloud portability and warning against vendor lock-in. And yet, every enterprise I have worked with is more locked in today than ever. Not because they failed to use open-source software (OSS). Not because they made bad decisions, but because real-world architecture, scale, and business momentum don’t care about ideals. They care about outcomes.

The public cloud promised freedom. APIs, managed services, and agility. Open-source added hope. Kubernetes, Terraform, Postgres. Tools that could, in theory, run anywhere. And so we bought into the idea that we were building “portable” infrastructure. That one day, if pricing changed or strategy shifted, we could pack up our workloads and move. But now, many enterprises are finding out the truth:

Portability is not a feature. It is a myth; for most large organizations, it is a unicorn: appealing in theory, elusive in reality.

Let me explain. But before I do, let us talk about interclouds again.

Remember Interclouds?

Interclouds, once hyped as the answer to cloud portability (and lock-in), promised a seamless way to abstract infrastructure across providers, enabling workloads to move freely between clouds. In theory, they would shield enterprises from vendor dependency by creating a uniform control plane and protocols across AWS, Azure, GCP, OCI and beyond.

Figure: David Bernstein’s intercloud concept.

Note: This idea and concept were already being discussed in 2012. It is 2025, and not much has happened since then.

But in practice, intercloud platforms failed to solve the lock-in problem because they only masked it; they did not remove it. Beneath the abstraction layer, each provider still has its own APIs, services, network behaviors, and operational peculiarities.

Enterprises quickly discovered that you can’t abstract your way out of data gravity, compliance policies, or deeply integrated PaaS services. Instead of enabling true portability, interclouds just delayed the inevitable realization: you still have to commit somewhere.

The Trigger Nobody Plans For

Imagine you are running a global enterprise with 500 or 1’000 applications. They span two public clouds. Some are modern, containerized, and well-defined in Terraform. Others are legacy and fragile, lifted and shifted years ago in a hurry. A few run in third-party SaaS platforms.

Then the call comes: “We need to exit one of our clouds. Legal, compliance, pricing. Doesn’t matter why. It has to go.”

Suddenly, that portability you thought you had? It is smoke. The Kubernetes clusters are portable in theory, but the CI/CD tooling, monitoring stack, and security policies are not. Dozens of apps use PaaS services tightly coupled to their original cloud. Even the apps that run in containers still need to be re-integrated, re-tested, and re-certified in the new environment.

This isn’t theoretical. I have seen it firsthand. The dream of being “cloud neutral” dies the moment you try to move production workloads – at scale, with real dependencies, under real deadlines.

Open-Source – Freedom with Strings Attached

It is tempting to think that open-source will save you. After all, it is portable, right? It is not tied to any vendor. You can run it anywhere. And that is true on paper.

But the moment you run it in production, at enterprise scale, a new reality sets in. You need observability, governance, upgrades, SLAs. You start relying on managed services for these open-source tools. Or you run them yourself, and now your internal teams are on the hook for uptime, performance, and patching.

You have simply traded one form of lock-in for another: the operational lock-in of owning complexity.

So yes, open-source gives you options. But it doesn’t remove friction. It shifts it.

The Other Lock-Ins No One Talks About

When we talk about “avoiding lock-in”, we usually mean avoiding proprietary APIs or data formats. But in practice, most enterprises are locked in through completely different vectors:

Data gravity makes it painful to move large volumes of information, especially when compliance and residency rules come into play. The real issue is the latency, synchronization, and duplication challenges that come with moving data between clouds.

Tooling ecosystems create invisible glue. Your CI/CD pipelines, security policies, alerting, cost management. These are all tightly coupled to your cloud environment. Even if the core app is portable, rebuilding the ecosystem around it is expensive and time-consuming.

Skills and culture are rarely discussed, but they are often the biggest blockers. A team trained to build in cloud A doesn’t instantly become productive in cloud B. Tooling changes. Concepts shift. You have to retrain, re-hire, or rely on partners.

So, the question becomes: is lock-in really about technology or inertia (of an enterprise’s IT team)?

Data Gravity

Data gravity is one of the most underestimated forces in cloud architecture, whether you are using proprietary services or open-source software. The idea is simple: as data accumulates, everything else (compute, analytics, machine learning, governance) tends to move closer to it.

In practice, this means that once your data reaches a certain scale or sensitivity, it becomes extremely hard to move, regardless of whether it is stored in a proprietary cloud database or an open-source solution like PostgreSQL or Kafka.

With proprietary platforms, the pain comes from API compatibility, licensing, and high egress costs. With open-source tools, it is about operational entanglement: complex clusters, replication lag, security hardening, and integration sprawl.

Either way, once data settles, it anchors your architecture, creating a gravitational pull that resists even the most well-intentioned portability efforts.

The Cost of Chasing Portability

Portability is often presented as a best practice. But there is a hidden cost.

To build truly portable applications, you need to avoid proprietary features, abstract your infrastructure, and write for the lowest common denominator. That often means giving up performance, integration, and velocity. You are paying an “insurance premium” for a theoretical future event, like a cloud exit or vendor failure, that may never come.

Worse, in some cases, over-engineering for portability can slow down innovation. Developers spend more time writing glue code or dealing with platform abstraction layers than delivering business value.

If the business needs speed and differentiation, this trade-off rarely holds up.

So… What Should We Do?

Here is the hard truth: lock-in is not the problem. Lack of intention is.

Lock-in is unavoidable. Whether it is a cloud provider, a platform, a SaaS tool, or even an open-source ecosystem. You are always choosing dependencies. What matters is knowing what you are committing to, why you are doing it, and what the exit cost will be. That is where most enterprises fail.

And let us be honest for a moment. A lot of enterprises call it lock-in because their past strategic decision doesn’t feel right anymore. And then they blame their “strategic” partner.

The better strategy? Accept lock-in, but make it intentional. Know your critical workloads. Understand where your data lives. Identify which apps are migration-ready and which ones never will be. And start building the muscle of exit-readiness. Not for all 1’000 apps, but for the ones that matter most.

True portability isn’t binary. And in most large enterprises, it only applies to the top 10–20% of apps that are already modernized, loosely coupled, and containerized. The rest? They are staying where they are until there is a budget, a compliance event, or a crisis.

Avoiding U.S. Public Clouds And The Illusion of Independence

While independence from the U.S. hyperscalers and the potential risks associated with the CLOUD Act may seem like a compelling reason to adopt open-source solutions, it is not always the silver bullet it appears to be. The idea is appealing: running your infrastructure on open-source tools in order to avoid being dependent on any single cloud provider, especially those based in the U.S., where data may be subject to foreign government access under the CLOUD Act.

However, this approach introduces its own set of challenges.

First, by attempting to cut ties with U.S. providers, organizations often overlook the global nature of the cloud. Most open-source tools still rely on cloud providers for deployment, support, and scalability. Even if you host your open-source infrastructure on non-U.S. clouds, the reality is that many key components of your stack, like databases, messaging systems, or AI tools, may still be indirectly influenced by U.S.-based tech giants.

Second, operational complexity increases as you move away from managed services, requiring more internal resources to manage security, compliance, and performance. Rather than providing true sovereignty, the focus on avoiding U.S. hyperscalers may result in an unintended shift of lock-in from the provider to the infrastructure itself, where the trade-off is a higher cost in complexity and operational overhead.

Top Contributors To Key Open-Source Projects

U.S. public cloud providers like Google, Amazon, Microsoft, Oracle, and others are not just spectators in this space. They are driving the innovation and development of key projects:

  1. Kubernetes remains the flagship project of the CNCF, offering a robust container orchestration platform that has become essential for cloud-native architectures. The project has been significantly influenced by a variety of contributors, with Google being the original creator.
  2. Prometheus, the popular monitoring and alerting toolkit, was created by SoundCloud and is now widely adopted in cloud-native environments. The project has received significant contributions from major players, including Google, Amazon, Facebook, IBM, Lyft, and Apple. 
  3. Envoy, a high-performance proxy and communication bus for microservices, was developed by Lyft, with broad support from Google, Amazon, VMware, and Salesforce.
  4. Helm is the Kubernetes package manager, designed to simplify the deployment and management of applications on Kubernetes. It has a strong community with contributions from Microsoft (via Deis, which they acquired), Google, and other cloud providers.
  5. OpenTelemetry provides a unified standard for distributed tracing and observability, ensuring applications are traceable across multiple systems. The project has seen extensive contributions from Google, Microsoft, Amazon, Red Hat, and Cisco, among others. 

While these projects are open-source and governed by the CNCF (Cloud Native Computing Foundation), the influence of these tech companies cannot be overstated. They not only provide the tools and resources necessary to drive innovation but also ensure that the technologies powering modern cloud infrastructures remain at the cutting edge of industry standards.

Final Thoughts

Portability has become the rallying cry of modern cloud architecture. But real-world enterprises are not moving between clouds every year. They are digging deeper into ecosystems, relying more on managed services, and optimizing for speed.

So maybe the conversation shouldn’t be about avoiding lock-in but about managing it. Perhaps more about understanding it. And, above all, owning it. The problem isn’t lock-in itself. The problem is treating lock-in like a disease, rather than what it really is: an architectural and strategic trade-off.

This is where architects and technology leaders have a critical role to play. Not in pretending we can design our way out of lock-in, but in navigating it intentionally. That means knowing where you can afford to be tightly coupled, where you should invest in optionality, and where it is simply not worth the effort to abstract away.

The State of Application Modernization 2025

Every few weeks, I find myself in a conversation with customers or colleagues where the topic of application modernization comes up. Everyone agrees that modernization is more important than ever. The pressure to move faster, build more resilient systems, and increase operational efficiency is not going away.

But at the same time, when you look at what has actually changed since 2020… it is surprising how much has not.

We are still talking about the same problems: legacy dependencies, unclear ownership, lack of platform strategy, organizational silos. New technologies have emerged, sure. AI is everywhere, platforms have matured, and cloud-native patterns are no longer new. And yet, many companies have not even started building the kind of modern on-premises or cloud platforms needed to support next-generation applications.

It is like we are stuck between understanding why we need to modernize and actually being able to do it.

Remind me, why do we need to modernize?

When I joined Oracle in October 2024, some people reminded me that most of us do not know why we are where we are. One could say that it is not important to know that. In my opinion, it very much is. Something fundamentally changed along the way that led us to where we are today.

In the past, when we moved from physical servers to virtual machines (VMs), apps did not need to change. You could lift and shift a legacy app from bare metal to a VM and it would still run the same way. The platform changed, but the application did not care. It was an infrastructure-level transformation without rethinking the app itself. So, the physical-to-virtual (P2V) transition of an application was very smooth and uncomplicated.

But now? The platform demands change.

Cloud-native platforms like Kubernetes, serverless runtimes, or even fully managed cloud services do not just offer a new home. They offer a whole new way of doing things. To benefit from them, you often have to re-architect how your application is built and deployed.

That is the reason why enterprises have to modernize their applications.

What else is different?

User expectations, business needs, and competitive pressure have exploded as well. Companies need to:

  • Ship features faster
  • Scale globally
  • Handle variable load
  • Respond to security threats instantly
  • Reduce operational overhead

A Quick Analogy

Think of it like this: moving from physical servers to VMs was like transferring your VHS tapes to DVDs. Same content, just a better format.

But app modernization? That is like going from DVDs to Netflix. You do not just change the format, but you rethink the whole delivery model, the user experience, the business model, and the infrastructure behind it.

Why Is Modernization So Hard?

If application modernization is so powerful, why isn’t everyone done with it already? The truth is, it is complex, disruptive, and deeply intertwined with how a business operates. Organizations often underestimate how much effort it takes to replatform systems that have evolved over decades. Here are six common challenges companies face during modernization:

  1. Legacy Complexity – Many existing systems are tightly coupled, poorly documented, and full of business logic buried deep in spaghetti code. 
  2. Skill Gaps – Moving to cloud-native tech like Kubernetes, microservices, or DevOps pipelines requires skills many organizations do not have in-house. Upskilling or hiring takes time and money.
  3. Cultural Resistance – Modernization often challenges organizational norms, team structures, and approval processes. People do not always welcome change, especially if it threatens familiar workflows.
  4. Data Migration & Integration – Legacy apps are often tied to on-prem databases or batch-driven data flows. Migrating that data without downtime is a massive undertaking.
  5. Security & Compliance Risks – Introducing new tech stacks can create blind spots or security gaps. Modernizing without violating regulatory requirements is a balancing act.
  6. Cost Overruns – It is easy to start a cloud migration or container rollout only to realize the costs (cloud bills, consultants, delays) are far higher than expected.

Modernization is not just a technical migration. It’s a transformation of people, process, and platform (technology). That is why it is hard and why doing it well is such a competitive advantage!

Technical Debt Is Also Slowing Things Down

Also known as the silent killer of velocity and innovation: technical debt

Technical debt is the cost of choosing a quick solution now instead of a better one that would take longer. We have all seen/done it. 🙂 Sometimes it is intentional (you needed to hit a deadline), sometimes it is unintentional (you did not know better back then). Either way, it is a trade-off. And just like financial debt, it accrues interest over time.

Here is the tricky part: technical debt usually doesn’t hurt you right away. You ship the feature. The app runs. Management is happy.

But over time, debt compounds:

  • New features take longer because the system is harder to change

  • Bugs increase because no one understands the code

  • Every change becomes risky because there is no test safety net

Eventually, you hit a wall where your team is spending more time working around the system than building within it. That is when people start whispering: “Maybe we need to rewrite it.”  Or they just leave your company.

Let me say it: Cloud Can Also Introduce New Debt

Cloud-native architectures can reduce technical debt, but only if used thoughtfully.

You can still:

  • Over-complicate microservices

  • Abuse Kubernetes without understanding it

  • Ignore costs and create “cost debt”

  • Rely on too many services and lose track

Use the cloud to eliminate debt by simplifying, automating, and replacing legacy patterns, not just lifting them into someone else’s data center.

It Is More Than Just Moving to the Cloud 

Modernization is about upgrading how your applications are built, deployed, run, and evolved, so they are faster, cheaper, safer, and easier to change. Here are some core areas where I have seen organizations make real progress:

  • Improving CI/CD. You can’t build modern applications if your delivery process is stuck in 2010.
  • Data Modernization. Migrate from monolithic databases to cloud-native, distributed ones.
  • Automation & Infrastructure as Code. It is the path to resilience and scale.
  • Serverless Computing. It is the “don’t worry about servers” mindset and ideal for many modern workloads.
  • Containerizing Workloads. Containers are a stepping stone to microservices, Kubernetes, and real DevOps maturity.
  • Zero-Trust Security & Cybersecurity Posture. One of the biggest priorities at the moment.
  • Cloud Migration. It is not about where your apps run; it is about how well they run there. “The cloud” should make you faster, safer, and leaner.

As you can see, application modernization is not one thing; it is many things. You do not have to do all of these at once. But if you are serious about modernizing, these points (and more) must be part of your blueprint. Modernization is a mindset.

Why (replatforming) now?

There are a few reasons why application modernization projects are increasing:

  • The maturity of cloud-native platforms: Kubernetes, managed databases, and serverless frameworks have matured to the point where they can handle serious production workloads. It is no longer “bleeding edge.”
  • DevOps and Platform Engineering are mainstream: We have shifted from siloed teams to collaborative, continuous delivery models. But that only works if your platform supports it.
  • AI and automation demand modern infrastructure: To leverage modern AI tools, event-driven data, and real-time analytics, your backend can’t be a 2004-era database with a web front-end duct-taped to it.

Conclusion

There is no longer much debate: (modern) applications are more important than ever. Yet despite all the talk around cloud-native technologies and modern architectures, the truth is that many organizations are still trying to catch up and work hard to modernize not just their applications, but also the infrastructure and processes that support them.

The current progress is encouraging, and many companies have learned from the experience of their first modernization projects.

One thing that is becoming harder to ignore is how much the geopolitical situation is starting to shape decisions around application modernization and cloud adoption. Concerns around data sovereignty, digital borders, national cloud regulations, and supply chain security are no longer just legal or compliance issues. They are shaping architecture choices.

Some organizations are rethinking their cloud and modernization strategies, looking at multi-cloud or hybrid models to mitigate risk. Others are delaying cloud adoption due to regional uncertainty, while a few are doubling down on local infrastructure to retain control. It is not just about performance or cost anymore, but also about resilience and autonomy.

The global context (suddenly) matters, and it is influencing how platforms are built, where data lives, and who organizations choose to partner with. If anything, it makes the case even stronger for flexible, portable, cloud-native architectures, so you are not locked into a single region or provider.

Evicted Pod Problem When Deploying App on Kubernetes

Recently, while deploying Longhorn, I encountered a frustrating issue: my pods always showed “Evicted”, even though it was a freshly deployed Kubernetes cluster. Since I am new to Kubernetes, I had no clue which commands would help me better understand and then solve the problem. Luckily, ChatGPT makes that process much easier, but it still took me a while to figure it out.

FYI, Longhorn is a lightweight, reliable, and powerful distributed block storage system for Kubernetes.

Understanding the “Evicted” Status

In Kubernetes, eviction is a mechanism used to maintain node stability under resource pressure. When the kubelet detects that a node is running low on resources (memory, disk space, or filesystem inodes), it may evict pods to reclaim those resources.

When you run:

kubectl get pods -A

You see something like:

instance-manager-db95e4280b535d82cde25cce5b44f97b 0/1 Evicted 0 1s

If you describe the node, you will see something like:

kubectl describe node k8s-worker1 | grep -A10 "Conditions"

Type:    DiskPressure
Status:  True
Message: kubelet has disk pressure

If you describe the pod, you will see something like:

kubectl describe pod -n longhorn-system

Status:  Failed
Reason:  Evicted
Message: The node was low on resource: ephemeral-storage.

On my node “k8s-worker1”, I then executed this command:

df -h /var/lib/kubelet

which showed:

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv 11G 8.5G 1.8G 84% /var/lib/kubelet

Conclusion: Only 1.8 GB was free, and it was already at 84% usage. Kubernetes starts evicting pods when the disk usage exceeds thresholds, typically around 85–90%, depending on your kubelet settings.

That seems to have happened in my case. I don’t know how production Kubernetes clusters are set up and which “best practices” the experts use, but you would probably at least monitor /var/lib/kubelet closely. 🙂

The Solution

I had to increase the virtual disk size in ESXi and then resize the partition, physical volume, and logical volume inside the Ubuntu VM (a sketch of those commands follows below). After resizing the volumes, Longhorn could deploy properly without eviction errors. My nodes now looked like this:

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv 71G 8.0G 60G 12% /var/lib/kubelet
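
For reference, the resize inside the VM went roughly like this. Treat it as a sketch for the default Ubuntu LVM layout; the partition number and volume path are from my lab and may differ in your environment:

# Grow the partition that backs the LVM physical volume (partition 3 on /dev/sda here)
sudo growpart /dev/sda 3
# Let LVM see the larger physical volume
sudo pvresize /dev/sda3
# Extend the logical volume into all of the new free space
sudo lvextend -l +100%FREE /dev/mapper/ubuntu--vg-ubuntu--lv
# Grow the ext4 filesystem online
sudo resize2fs /dev/mapper/ubuntu--vg-ubuntu--lv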

Kubernetes stopped evicting pods, and the Longhorn UI was finally available to me and ready to provision volumes! 😀

Figure: Longhorn running on Kubernetes.

Kubernetes Documentation – Node-pressure Eviction

Here are some snippets from the official Kubernetes documentation on node-pressure eviction.

The kubelet monitors resources like memory, disk space, and filesystem inodes on your cluster’s nodes. When one or more of these resources reach specific consumption levels, the kubelet can proactively fail one or more pods on the node to reclaim resources and prevent starvation.

You can specify custom eviction thresholds for the kubelet to use when it makes eviction decisions. You can configure soft and hard eviction thresholds.

The kubelet reports node conditions to reflect that the node is under pressure because hard or soft eviction threshold is met, independent of configured grace periods.
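
To see what this means on a node, you can check the kubelet’s configured hard thresholds. This is a sketch assuming the kubeadm default config path; the values in the comment are the documented Linux defaults, not recommendations:

# Show the eviction settings on a node (config path may differ on your distro)
grep -A5 "evictionHard" /var/lib/kubelet/config.yaml

# If nothing is set, the kubelet falls back to the documented defaults, roughly:
#   memory.available:  100Mi
#   nodefs.available:  10%
#   imagefs.available: 15%
#   nodefs.inodesFree: 5%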

Private Cloud Autarky – You Are Safe Until The World Moves On

I believe it was 2023 when the term “autarky” was first mentioned during my conversations with several customers who maintained their own data centers and private clouds. Interestingly, this word popped up again recently at work, but I only knew it from photovoltaic systems. And it kept my mind busy for several weeks.

What is autarky?

To understand autarky in the IT world and its implications for private clouds, an analogy from the photovoltaic (solar power) system world offers a clear parallel. Just as autarky in IT means a private cloud that is fully self-sufficient, autarky in photovoltaics refers to an “off-grid” solar setup that powers a home or facility without relying on the external electrical grid or outside suppliers.

Imagine a homeowner aiming for total energy independence – an autarkic photovoltaic system. Here is what it looks like:

  • Solar Panels: The homeowner installs panels to capture sunlight and generate electricity.
  • Battery: Excess power is stored in batteries (e.g., lithium-ion) for use at night or on cloudy days.
  • Inverter: A device converts solar DC power to usable AC power for appliances.
  • Self-Maintenance: The homeowner repairs panels, replaces batteries, and manages the system without calling a utility company or buying parts. 

This setup cuts ties with the power grid – no monthly bills, no reliance on power plants. It is a self-contained energy ecosystem, much like an autarkic private cloud aims to be a self-contained digital ecosystem.

Question: Which partner (installation company) has enough spare parts, and how many homeowners can repair the whole system by themselves?

Let’s align this with autarky in IT:

  • Solar Panels = Servers and Hardware: Just as panels generate power, servers (compute, storage, networking) generate the cloud’s processing capability. Theoretically, an autarkic private cloud requires the organization to build its own servers, similar to crafting custom solar panels instead of buying from any vendor.
  • Battery = Spares and Redundancy: Batteries store energy for later; spare hardware (e.g., extra servers, drives, networking equipment) keeps the cloud running when parts fail. 
  • Inverter = Software Stack: The inverter transforms raw power into usable energy, like how a software stack (OS, hypervisor) turns hardware into a functional cloud.
  • Self-Maintenance = Internal Operations: Fixing a solar system solo parallels maintaining a cloud without vendor support – both need in-house expertise to troubleshoot and repair everything.

Let me repeat it: both need in-house expertise to troubleshoot and repair everything. Everything.

The goal is self-sufficiency and independence. So, what are companies doing?

An autarkic private cloud might stockpile Dell servers or Nvidia GPUs upfront, but that first purchase ties you to external vendors. True autarky would mean mining silicon and forging chips yourself – impractical, just like growing your own silicon crystals for panels.

The problem

In practice, autarky for private clouds sounds like an extreme goal. It promises maximum control, which is ideal for scenarios like military secrecy, regulatory isolation, or distrust of global supply chains, but it clashes with the realities of modern IT:

  • Once the last spare dies, you are done. No new tech without breaking autarky.
  • Autarky trades resilience for stagnation. Your cloud stays alive but grows irrelevant.
  • Autarky’s price tag limits it to tiny, niche clouds – not hyperscale rivals.
  • Future workloads are a guessing game. Stockpile too few servers, and you can’t expand. Too many, and you have wasted millions. A 2027 AI boom or quantum shift could make your equipment useless.

But where is this idea of self-sufficiency or sovereign operations coming from? Nowadays? Geopolitical resilience.

Sanctions or trade wars will not starve your cloud. A private (hyperscale) cloud that answers to no one, free from external risks or influence. That is the whole idea.

What is the probability of such sanctions? Who knows… but this is a number that has to be defined for each case depending on the location/country, internal and external customers, and requirements.

If it happens, is it foreseeable, and what does it force you to do? Does it trigger a cloud-exit scenario?

I just know that if there are sanctions, any hyperscaler in your country has the same problems. No matter if it is a public or dedicated region. That is the blast radius. It is not only about you and your infrastructure anymore.

What about private disconnected hyperscale clouds?

When hosting workloads in the public clouds, organizations care more about data residency, regulations, and the U.S. CLOUD Act, and less about autarky.

Hyperscale clouds like Microsoft Azure and Oracle Cloud Infrastructure (OCI) are built to deliver massive scale, flexibility, and performance, but they rely on complex ecosystems that make full autarky impossible. Oracle offers solutions like OCI Dedicated Region and Oracle Alloy to address sovereignty needs, giving customers more control over their data and operations. However, even these solutions fall short of true autarky and absolute sovereign operations due to practical, technical, and economic realities.

A short explanation from Microsoft gives us a hint why that is the case:

Additionally, some operational sovereignty requirements, like Autarky (for example, being able to run independently of external networks and systems) are infeasible in hyperscale cloud-computing platforms like Azure, which rely on regular platform updates to keep systems in an optimal state.

So, what are customers asking for when they are interested in hosting their own dedicated cloud region in their data centers? Disconnected hyperscale clouds.

But hosting an OCI Dedicated Region in your data center does not change the underlying architecture of Oracle Cloud Infrastructure (OCI). Nor does it change the upgrade or patching process, or the whole operating model.

Hyperscale clouds do not exist in a vacuum. They lean on a web of external and internal dependencies to work:

  • Hardware Suppliers. For example, most public clouds use Nvidia’s GPUs for AI workloads. Without these vendors, hyperscalers could not keep up with the demand.
  • Global Internet Infrastructure. Hyperscalers need massive bandwidth to connect users worldwide. They rely on telecom giants and undersea cables for internet backbone, plus partnerships with content delivery networks (CDNs) like Akamai to speed things up.
  • Software Ecosystems. Open-source tools like Linux and Kubernetes are part of the backbone of hyperscale operations.
  • Operations. Think about telemetry data and external health monitoring.

Innovation depends on ecosystems

The tech world moves fast. Open-source software and industry standards let hyperscalers innovate without reinventing the wheel. OCI’s adoption of Linux or Azure’s use of Kubernetes shows they thrive by tapping into shared knowledge, not isolating themselves. Going it alone would skyrocket costs. Designing custom chips, giving away or sharing operational control, or skipping partnerships would drain billions – money better spent on new features, services, or lower prices.

Hyperscale clouds are global by nature; this includes Oracle Dedicated Region and Alloy. In return, you get:

  • Innovation
  • Scalability
  • Cybersecurity
  • Agility
  • Reliability
  • Integration and Partnerships

Again, by nature and design, hyperscale clouds – even those hosted in your data center as private clouds (OCI Dedicated Region and Alloy) – are still tied to a hyperscaler’s software repositories, third-party hardware, operations personnel, and global infrastructure.

Sovereignty is real, autarky is a dream

Autarky sounds appealing: a hyperscale cloud that answers to no one, free from external risks or influence. Imagine OCI Dedicated Region or Oracle Alloy as self-contained kingdoms, untouchable by global chaos.

Autarky sacrifices expertise for control, and the result would be a weaker, slower and probably less secure cloud. Self-sufficiency is not cheap. Hyperscalers spend billions of dollars yearly on infrastructure, leaning on economies of scale and vendor deals. Tech moves at lightning speed. New GPUs drop yearly, software patches roll out daily (think about 1’000 updates/patches a month). Autarky means falling behind. It would turn your hyperscale cloud into a relic.

Please note, there are other solutions like air-gapped isolated cloud regions, but those are for a specific industry and set of customers.

VMware vSphere Foundation and VMware Cloud Foundation Overview

As some of you already know, VMware by Broadcom is moving forward with two primary offers only: VMware vSphere Foundation (VVF) and VMware Cloud Foundation (VCF). If you have missed this announcement, have a look at my blog A New Era: Broadcom’s Streamlined Approach to VMware’s Product Lineup and Licensing.

Since a lot of solutions and different editions from before are included, I thought it might be helpful to summarize in a little bit more detail what is known to partners, analysts, and some customers already. I am also adding some screenshots from VMware websites and presentations, which should help everyone get a better understanding of VVF and VCF.

Figure: vSphere Foundation and VMware Cloud Foundation.

I have included the vSphere editions for smaller use cases and projects as well.

Please note that ROBO licenses are not available anymore, and I expect the edge division at VMware by Broadcom to come up with additional bundles in the future.

Figure: VVF and VCF products.

More details about the different products and the features included in the Aria suites can be found here: VMware Aria Suite Editions and Products

If you are looking for more information about the Aria Operations management packs (formerly known as True Visibility Suite or Aria Operations for Integrations), have a look here: VMware Aria Operations for Integrations Documentation

Add-ons for VVF and VCF

The table below gives you an overview of which add-ons are available at the time of writing this blog.

Make sure to contact your VMware representative to understand which add-ons are available for VVF and VCF.

Note: Available add-ons are shown in bold.

Figure: VVF and VCF add-ons.

Tanzu Guardrails (formerly VMware Aria Guardrails)

Looking at the official Tanzu Guardrails product website we can learn the following:

Figure: Tanzu Guardrails editions.

Note: It seems that Tanzu Hub is part of Tanzu Guardrails Advanced and Enterprise.

SRE Services for VMware Cloud Foundation

VMware Site Reliability Engineering (SRE) Services for VMware Cloud Foundation provide VMware expertise to create highly reliable and scalable cloud environments. The services provide a range of capabilities from patching and upgrades to security hardening to automated management and operations.

The SRE Services for VCF datasheet can be found here.

How to count cores for VVF/VCF and TiBs for vSAN add-on?

Please have a look at this updated knowledgebase article: KB95927

What about VMware Aria SaaS?

Customers no longer have the option to buy VMware Aria products as standalone products or as SaaS: https://blogs.vmware.com/management/2024/01/dramatic-simplification-of-vmware-aria-as-part-of-vmware-cloud-foundation.html

The Aria cloud management capabilities are available only as components of VMware vSphere Foundation and VMware Cloud Foundation, which are sold for deployment on-premises or on certain public cloud providers including VMware Cloud on AWS. Existing Aria SaaS subscriptions will continue through the end of their term. At time of renewal, customers should purchase VMware vSphere Foundation and VMware Cloud Foundation.

Use this KB96168 to understand which products are impacted by this new policy.

What about Tanzu products?

To some of us, it seems that the Tanzu products are only available as VVF/VCF add-ons, which is not true.

Based on the different comments on various social media platforms and the interviews we have seen from VMware by Broadcom executives, we can say the following:

  • vSphere with Tanzu (aka TKGs) with its Supervisor architecture is going to be the long-term strategy (part of VVF and VCF)
  • Heavy focus on Tanzu Application Platform (TAP) and Tanzu For Kubernetes Operations (TKO)
  • We can expect continued support for TKGm and TKGi

Tanzu Portfolio and Strategy Recap

At VMware Explore 2023, VMware presented the “develop, operate, optimize” approach when talking about platform engineering:

  1. Develop – Secure paths to production
  2. Operate – Deploy, manage, and scale applications seamlessly
  3. Optimize – Continuously tune cost, performance and security of applications at runtime

We learned that VMware (by Broadcom) is going to invest in TAP, Spring, TKO and data services. What’s the difference between TAS and TAP again?

  • Tanzu Application Service – Opinionated platform built on Cloud Foundry
  • Tanzu Application Platform – Modular and portable PaaS for any conformant Kubernetes

Figure: Tanzu portfolio, January 2024.

Tanzu for Kubernetes Operations Refresher

TKO comes in two different editions:

  • Tanzu for Kubernetes Operations Foundation (TKO-F)
    • Tanzu Mission Control (includes TMC self-managed)
    • Tanzu Service Mesh
  • Tanzu for Kubernetes Operations (TKO)
    • Tanzu Mission Control (includes TMC self-managed)
    • Tanzu Service Mesh
    • Tanzu Observability (aka Aria Operations for Apps, formerly Wavefront)
    • Antrea (CNI)
    • TKGm
    • Harbor, HA Proxy, Calico, FluentBit, Contour, Prometheus, Grafana
    • Avi Essentials (NSX ALB)

Note: NSX Advanced Load Balancer (aka NSX ALB) is no longer part of TKO since NSX ALB can be purchased as an add-on.

Last Comments

While I was waiting to publish this blog, William Lam wrote a more detailed blog about VMware vSphere Foundation and VMware Cloud Foundation as well.

It is still early days and we can expect more updates from VMware by Broadcom soon. 🙂