DevOps & SRE notes
Helpful articles and tools for DevOps & SRE

WhatsApp: https://whatsapp.com/channel/0029Vb79nmmHVvTUnc4tfp2F

For paid consultation (RU/EN), contact: @tutunak


All the ways to support the channel: https://telegra.ph/How-support-the-channel-02-19
The article features an interview with Landon Clipp, who built a multi-tenant GPU-based Containers-as-a-Service (CaaS) platform. Topics covered:
- Bypassing the NVIDIA GPU Operator
- Why gVisor Fails for GPUs
- VM Boot Delays
- Firmware and Memory Security
- Ideal Workload

https://kube.fm/gpu-containers-as-a-service-landon
The article explains that while Kubernetes excels at scheduling and isolating workloads, it lacks the context to secure Large Language Models (LLMs), which process untrusted natural language inputs. Highlighting four key risks from the OWASP Top 10 for LLMs, the author argues that security controls shouldn't live within the model runtime (like Ollama). Instead, organizations need a dedicated, LLM-aware policy layer (such as LiteLLM, Kong AI Gateway, or Portkey) in front of the model to enforce validation, filtering, and authorization.
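As a rough illustration of what such a policy layer does before a request reaches the model runtime, here is a minimal sketch in Python. Everything in it is hypothetical (the `ChatRequest` type, the role and model names, the length limit, the blocked pattern); it is not the API of LiteLLM, Kong AI Gateway, or Portkey, and a real gateway would do far more than a single regex.

```python
import re
from dataclasses import dataclass


@dataclass
class ChatRequest:
    user_role: str
    model: str
    prompt: str


# Hypothetical policy values, chosen only for illustration.
ALLOWED_MODELS = {
    "analyst": {"small-model"},
    "admin": {"small-model", "large-model"},
}
MAX_PROMPT_CHARS = 4000
BLOCKED_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]


def check_request(req: ChatRequest) -> tuple[bool, str]:
    """Return (allowed, reason); runs in front of the model, not inside it."""
    # Authorization: is this role allowed to call this model at all?
    if req.model not in ALLOWED_MODELS.get(req.user_role, set()):
        return False, "role not authorized for this model"
    # Validation: reject oversized input before it reaches the runtime.
    if len(req.prompt) > MAX_PROMPT_CHARS:
        return False, "prompt too long"
    # Filtering: a deliberately naive deny-list check on the prompt text.
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(req.prompt):
            return False, "prompt matched a blocked pattern"
    return True, "ok"
```

Only requests for which `check_request` returns `(True, "ok")` would be forwarded to the runtime; keeping the checks in a separate layer means they can change without touching the model deployment.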

https://www.cncf.io/blog/2026/03/30/llms-on-kubernetes-part-1-understanding-the-threat-model/
Uber engineered an automated approach to migrate its massive Java monorepo (over 600,000 tests, 15 million lines of code) from the deprecated JUnit 4 to JUnit 5. Facing challenges like the lack of native JUnit 5 support in their Bazel build system and custom test configurations, they successfully migrated over 75,000 test classes and 1.25 million lines of code in just four months without disrupting developer workflows.

https://www.uber.com/us/en/blog/junit-migration/
Claude Code gave me three "tickets" for a free week. You can grab them using this link: https://claude.ai/referral/NXtyf-cgbQ
The observability market is shifting from volume-based data ingestion to a value-driven model because the cost of scaling cloud-native and AI workloads has become unsustainable. Driven by innovations like Chronosphere’s "Logs 2.0" and the company's subsequent acquisition by Palo Alto Networks, the industry is prioritizing "signal discipline" (retaining only actionable telemetry) and integrating observability directly into broader AI and security platforms.

https://siliconangle.com/2026/02/05/observability-cost-ai-scale-chronosphere-opensourcesummit/
Managing expenses in the cloud requires a strategic approach beyond just looking at bills. A senior engineer shares valuable insight into optimizing costs effectively in this detailed read.
https://medium.com/@razkevich8/cloud-cost-optimization-a-senior-engineers-guide-d49ed4606de1
Many organizations are looking for more efficient logging solutions than the traditional stack. This comparison highlights a modern alternative to ELK that aims to reduce complexity and resource usage.
https://osuite.io/articles/modern-alternative-to-elk
Networking within container orchestration can often seem like a black box to developers. This explanation aims to demystify Kubernetes CNI providers and how they manage connectivity.
https://medium.com/@csinclair11/demystifying-kubernetes-cni-providers-5ed79569c797
I found a good example of why autoscaling based only on CPU utilization can cause an outage.

About a week ago, Twingate had an incident that affected us as a customer. They have published a postmortem, and it shows why CPU alone is not a reliable signal for autoscaling your services.

"The incident was triggered by elevated network latency affecting communication paths used by the Authorization service. As requests took longer to complete, individual service instances were able to process fewer requests than normal."

"This reduction in throughput exposed a limitation in our auto-scaling configuration, which primarily relied on CPU utilization to determine service capacity requirements."


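So from the CPU utilization perspective everything looked fine: instances were mostly waiting on the network rather than computing. Meanwhile the number of processed requests dropped, and the autoscaler saw no reason to add capacity.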
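
To make the failure mode concrete, here is a small toy model in Python. All numbers are invented for illustration and do not come from the Twingate postmortem, and the fixed pool of blocking workers is an assumption about the service. The `desired_replicas` function follows the rule documented for the Kubernetes Horizontal Pod Autoscaler (`ceil(currentReplicas * currentMetric / targetMetric)` with the default 10% tolerance), and worker saturation stands in for any request-based metric.

```python
import math
from dataclasses import dataclass


@dataclass
class Snapshot:
    served_rps: float         # requests per second actually completed
    cpu_utilization: float    # fraction of one core used per replica
    worker_saturation: float  # fraction of workers busy with in-flight requests


def simulate(latency_s, replicas=4, workers_per_replica=10,
             cpu_s_per_request=0.007, offered_rps=300.0):
    """Toy model: each worker blocks on the network for the whole request."""
    total_workers = replicas * workers_per_replica
    # A worker is held for the full latency, so throughput caps at workers / latency.
    served_rps = min(offered_rps, total_workers / latency_s)
    # CPU is spent only on the actual work per request, not on waiting.
    cpu = served_rps * cpu_s_per_request / replicas
    # Little's law: requests in flight = throughput * latency.
    saturation = served_rps * latency_s / total_workers
    return Snapshot(served_rps, cpu, saturation)


def desired_replicas(current, metric, target, tolerance=0.1):
    """HPA-style rule: ceil(current * metric / target), skipped within tolerance."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current
    return math.ceil(current * ratio)


REPLICAS = 4
normal = simulate(latency_s=0.05)
degraded = simulate(latency_s=0.5)  # same traffic, dependency is 10x slower

for name, snap in (("normal", normal), ("degraded", degraded)):
    by_cpu = desired_replicas(REPLICAS, snap.cpu_utilization, target=0.6)
    by_saturation = desired_replicas(REPLICAS, snap.worker_saturation, target=0.7)
    print(f"{name}: served={snap.served_rps:.0f} rps, "
          f"cpu={snap.cpu_utilization:.0%}, "
          f"busy workers={snap.worker_saturation:.0%}, "
          f"replicas by CPU={by_cpu}, by saturation={by_saturation}")
```

In the degraded case of this sketch, throughput falls from 300 to 80 requests per second while CPU drops to about 14%, so the CPU rule would shrink the deployment to 1 replica in the middle of the incident. The saturation rule sees every worker busy and asks for 6 replicas instead.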

https://status.twingate.com/incidents/49qvqk7swjpq
Forwarded from AI Vibe Notes
kagent runs your agents where your workloads already live — on Kubernetes. Deploy, observe, and govern AI agents with the tools your platform team already trusts. Open source. Production grade. Built by the founders of Istio.

https://github.com/kagent-dev/kagent
When you have special math for calculating your uptime, you always get 100%.