<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://shrikantj.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="https://shrikantj.dev/" rel="alternate" type="text/html" /><updated>2026-05-01T14:30:07+00:00</updated><id>https://shrikantj.dev/feed.xml</id><title type="html">Shrikant’s Kriti-Sangraha</title><subtitle>A personal repository of works, reflecting a journey of constant creation.</subtitle><entry><title type="html">Pre-warming Container Images on Upgraded Nodes with Kubernetes CronJob + DaemonSet</title><link href="https://shrikantj.dev/2026/04/15/image-prewarm.html" rel="alternate" type="text/html" title="Pre-warming Container Images on Upgraded Nodes with Kubernetes CronJob + DaemonSet" /><published>2026-04-15T18:30:00+00:00</published><updated>2026-04-15T18:30:00+00:00</updated><id>https://shrikantj.dev/2026/04/15/image-prewarm</id><content type="html" xml:base="https://shrikantj.dev/2026/04/15/image-prewarm.html"><![CDATA[

<h2 id="the-problem">The Problem</h2>

<p>On our hosted notebook platform, when a node undergoes an OS upgrade, it rejoins
the cluster with a fresh slate — no cached container images. The first user pod
scheduled on that node pays the full image pull penalty for the base image,
leading to slow startup times and a degraded experience.</p>

<p>We needed a way to guarantee that upgraded nodes have the base image pre-warmed
<em>before</em> any user workload lands on them.</p>

<h2 id="why-not-just-let-it-pull">Why Not Just Let It Pull?</h2>

<p>Our base images are large. A cold pull on a freshly upgraded node adds significant
latency to the first notebook spawn. In a platform where users expect near-instant
startup, that delay is noticeable. The image needs to be ready <em>before</em> the node
is open for scheduling.</p>

<h2 id="the-alternative-a-custom-controller">The Alternative: A Custom Controller</h2>

<p>The textbook Kubernetes answer would be a custom controller (operator) that watches
node events and reacts in real-time.</p>

<h3 id="how-it-would-work">How It Would Work</h3>

<p align="center">
  <img src="/assets/images/intial-diagram.png" alt="How a custom controller would work" />
</p>

<p>The controller would use the watch/reconcile pattern — subscribe to node events via
the K8s API, react to label changes in near real-time, and manage the full lifecycle
in a single reconciliation loop.</p>

<h3 id="why-we-didnt-go-this-route">Why We Didn’t Go This Route</h3>

<table>
  <thead>
    <tr>
      <th>Concern</th>
      <th>Custom Controller</th>
      <th>CronJob + DaemonSet</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>React time</strong></td>
      <td>Seconds (event-driven)</td>
      <td>Up to 1 hour (poll-based)</td>
    </tr>
    <tr>
      <td><strong>Complexity</strong></td>
      <td>Custom CRD, RBAC, leader election, error handling</td>
      <td>Two standard K8s resources + a script</td>
    </tr>
    <tr>
      <td><strong>Maintenance</strong></td>
      <td>Dedicated codebase, CI/CD, versioning</td>
      <td>ConfigMap with a Python script</td>
    </tr>
    <tr>
      <td><strong>Failure mode</strong></td>
      <td>Controller crash = no nodes processed</td>
      <td>Missed cycle = next run catches up</td>
    </tr>
    <tr>
      <td><strong>Deployment</strong></td>
      <td>Needs a long-running Deployment with HA</td>
      <td>CronJob is fire-and-forget</td>
    </tr>
    <tr>
      <td><strong>Development time</strong></td>
      <td>Weeks (with testing, CRD design)</td>
      <td>Days</td>
    </tr>
  </tbody>
</table>

<p>For our use case — a batch of nodes upgrading over hours, not minutes — the
near-instant reaction time of a controller wasn’t worth the operational overhead.
The CronJob’s hourly poll is fast enough, and the DaemonSet gives us the per-node
execution for free.</p>

<p>A custom controller becomes the right choice when you need sub-second reaction
times or complex state machines, or when the pattern has to extend to many
different reconciliation workflows.</p>

<h2 id="our-approach-taint-upgrade-pre-warm-untaint">Our Approach: Taint, Upgrade, Pre-warm, Untaint</h2>

<p>We broke the problem into two standard Kubernetes primitives working in concert —
a CronJob as the orchestrator and a DaemonSet as the per-node executor.</p>

<p align="center">
  <img src="/assets/images/sequence-diagram.png" alt="Sequence of Steps for upgrade" />
</p>

<h2 id="component-1-the-cronjob--orchestrator">Component 1: The CronJob — Orchestrator</h2>

<p>A CronJob runs every hour and executes a Python script with two responsibilities.</p>

<p><strong>Part 1 — Taint &amp; Label</strong></p>

<p>The script fetches all active node maintenance objects for the notebook pool.
For each node, it applies:</p>

<ul>
  <li><strong>Label:</strong> <code class="language-plaintext highlighter-rouge">os-upgrade/status=upgrading</code></li>
  <li><strong>Taint:</strong> <code class="language-plaintext highlighter-rouge">os-upgrade=true:NoSchedule</code></li>
</ul>

<p>The taint prevents any user pods from landing on the node. The label acts as a
targeting signal for the DaemonSet.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">batch/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">CronJob</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">os-upgrade-orchestrator</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">notebooks</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">schedule</span><span class="pi">:</span> <span class="s2">"</span><span class="s">0</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*"</span>
  <span class="na">jobTemplate</span><span class="pi">:</span>
    <span class="na">spec</span><span class="pi">:</span>
      <span class="na">backoffLimit</span><span class="pi">:</span> <span class="m">2</span>
      <span class="na">ttlSecondsAfterFinished</span><span class="pi">:</span> <span class="m">30</span>
      <span class="na">template</span><span class="pi">:</span>
        <span class="na">spec</span><span class="pi">:</span>
          <span class="na">restartPolicy</span><span class="pi">:</span> <span class="s">OnFailure</span>
          <span class="na">serviceAccountName</span><span class="pi">:</span> <span class="s">upgrade-sa</span>
          <span class="na">containers</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">orchestrator</span>
              <span class="na">image</span><span class="pi">:</span> <span class="s">my-registry/image-prewarmer:latest</span>
              <span class="na">command</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">/bin/sh"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">-c"</span><span class="pi">,</span> <span class="s2">"</span><span class="s">python3</span><span class="nv"> </span><span class="s">/scripts/orchestrate.py"</span><span class="pi">]</span>
              <span class="na">volumeMounts</span><span class="pi">:</span>
                <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">script</span>
                  <span class="na">mountPath</span><span class="pi">:</span> <span class="s">/scripts/orchestrate.py</span>
                  <span class="na">subPath</span><span class="pi">:</span> <span class="s">orchestrate.py</span>
          <span class="na">volumes</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">script</span>
              <span class="na">configMap</span><span class="pi">:</span>
                <span class="na">name</span><span class="pi">:</span> <span class="s">upgrade-orchestrator-script</span>
</code></pre></div></div>
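<p>The taint-and-label step itself boils down to one patch per node. A minimal sketch of the patch body the script could send (e.g. via <code>CoreV1Api().patch_node()</code> in the official Kubernetes Python client); the helper name is illustrative, not the production script:</p>

```python
# Sketch: strategic-merge patch that marks a node as upgrading.
# Key names match the label/taint described above.
OS_UPGRADE_TAINT = {"key": "os-upgrade", "value": "true", "effect": "NoSchedule"}

def build_cordon_patch(existing_taints):
    # Idempotent: drop any stale copy of our taint before re-adding it,
    # so repeated hourly runs don't accumulate duplicates.
    taints = [t for t in existing_taints if t.get("key") != "os-upgrade"]
    return {
        "metadata": {"labels": {"os-upgrade/status": "upgrading"}},
        "spec": {"taints": taints + [OS_UPGRADE_TAINT]},
    }

patch = build_cordon_patch([])
print(patch["metadata"]["labels"])  # {'os-upgrade/status': 'upgrading'}
```

<p>Because the patch is idempotent, the hourly CronJob can re-apply it to already-tainted nodes without side effects.</p>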

<p><strong>Part 2 — Detect &amp; Release</strong></p>

<p>On each run, the script also checks for nodes that have <em>already</em> been upgraded
(by reading the kernel version label) AND have a Running DaemonSet pod. That
intersection represents nodes where the upgrade is complete and the image is warm.
For those nodes, the script removes the taint and label.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">upgraded_nodes</span> <span class="o">=</span> <span class="n">fetch_upgraded_nodes</span><span class="p">()</span>
<span class="n">nodes_with_running_ds_pod</span> <span class="o">=</span> <span class="n">fetch_running_daemonset_pods</span><span class="p">()</span>

<span class="n">ready_nodes</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">upgraded_nodes</span><span class="p">)</span> <span class="o">&amp;</span> <span class="nb">set</span><span class="p">(</span><span class="n">nodes_with_running_ds_pod</span><span class="p">)</span>
<span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">ready_nodes</span><span class="p">:</span>
    <span class="n">remove_label_and_taint</span><span class="p">(</span><span class="n">node</span><span class="p">)</span>
</code></pre></div></div>
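<p>The two fetch helpers aren't shown in the post. A plausible shape for the DaemonSet side, filtering a pod list down to nodes whose pre-warmer pod is actually <code>Running</code> (the pod dicts mirror the Kubernetes API shape; names are illustrative):</p>

```python
# Sketch: which nodes have a Running pre-warmer pod? A pod only reaches
# Running after its image pull succeeds, so "Running" doubles as the
# "image is warm" signal the CronJob relies on.
def nodes_with_running_prewarmer(pods):
    return {
        p["spec"]["nodeName"]
        for p in pods
        if p["metadata"].get("labels", {}).get("app") == "image-prewarmer"
        and p["status"]["phase"] == "Running"
    }

pods = [
    {"metadata": {"labels": {"app": "image-prewarmer"}},
     "spec": {"nodeName": "node-a"}, "status": {"phase": "Running"}},
    {"metadata": {"labels": {"app": "image-prewarmer"}},
     "spec": {"nodeName": "node-b"}, "status": {"phase": "Pending"}},
]
print(nodes_with_running_prewarmer(pods))  # {'node-a'}
```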

<h2 id="component-2-the-daemonset--image-pre-warmer">Component 2: The DaemonSet — Image Pre-warmer</h2>

<p>A DaemonSet with tight node affinity targets only nodes satisfying <strong>all three</strong>
conditions:</p>

<ol>
  <li><strong>Upgraded kernel</strong> — e.g. <code class="language-plaintext highlighter-rouge">kernel-version.full = 5.15.173.1</code></li>
  <li><strong>Notebook pool</strong> — e.g. <code class="language-plaintext highlighter-rouge">node-pool = notebooks</code></li>
  <li><strong>Upgrade in progress</strong> — <code class="language-plaintext highlighter-rouge">os-upgrade/status = upgrading</code></li>
</ol>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">DaemonSet</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">image-prewarmer</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">notebooks</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">selector</span><span class="pi">:</span>
    <span class="na">matchLabels</span><span class="pi">:</span>
      <span class="na">app</span><span class="pi">:</span> <span class="s">image-prewarmer</span>
  <span class="na">template</span><span class="pi">:</span>
    <span class="na">metadata</span><span class="pi">:</span>
      <span class="na">labels</span><span class="pi">:</span>
        <span class="na">app</span><span class="pi">:</span> <span class="s">image-prewarmer</span>
    <span class="na">spec</span><span class="pi">:</span>
      <span class="na">tolerations</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">key</span><span class="pi">:</span> <span class="s2">"</span><span class="s">os-upgrade"</span>
          <span class="na">operator</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Equal"</span>
          <span class="na">value</span><span class="pi">:</span> <span class="s2">"</span><span class="s">true"</span>
          <span class="na">effect</span><span class="pi">:</span> <span class="s2">"</span><span class="s">NoSchedule"</span>
      <span class="na">affinity</span><span class="pi">:</span>
        <span class="na">nodeAffinity</span><span class="pi">:</span>
          <span class="na">requiredDuringSchedulingIgnoredDuringExecution</span><span class="pi">:</span>
            <span class="na">nodeSelectorTerms</span><span class="pi">:</span>
              <span class="pi">-</span> <span class="na">matchExpressions</span><span class="pi">:</span>
                  <span class="pi">-</span> <span class="na">key</span><span class="pi">:</span> <span class="s">node.kubernetes.io/kernel-version</span>
                    <span class="na">operator</span><span class="pi">:</span> <span class="s">In</span>
                    <span class="na">values</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">5.15.173.1"</span><span class="pi">]</span>
                  <span class="pi">-</span> <span class="na">key</span><span class="pi">:</span> <span class="s">node-pool</span>
                    <span class="na">operator</span><span class="pi">:</span> <span class="s">In</span>
                    <span class="na">values</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">notebooks"</span><span class="pi">]</span>
                  <span class="pi">-</span> <span class="na">key</span><span class="pi">:</span> <span class="s">os-upgrade/status</span>
                    <span class="na">operator</span><span class="pi">:</span> <span class="s">In</span>
                    <span class="na">values</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">upgrading"</span><span class="pi">]</span>
      <span class="na">containers</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">warmer</span>
          <span class="na">image</span><span class="pi">:</span> <span class="s">my-registry/notebook-base:latest</span>
          <span class="na">resources</span><span class="pi">:</span>
            <span class="na">requests</span><span class="pi">:</span>
              <span class="na">memory</span><span class="pi">:</span> <span class="s2">"</span><span class="s">100Mi"</span>
            <span class="na">limits</span><span class="pi">:</span>
              <span class="na">memory</span><span class="pi">:</span> <span class="s2">"</span><span class="s">200Mi"</span>
          <span class="na">command</span><span class="pi">:</span> <span class="pi">[</span><span class="s1">'</span><span class="s">sh'</span><span class="pi">,</span> <span class="s1">'</span><span class="s">-c'</span><span class="pi">,</span> <span class="s1">'</span><span class="s">echo</span><span class="nv"> </span><span class="s">"Image</span><span class="nv"> </span><span class="s">pulled</span><span class="nv"> </span><span class="s">—</span><span class="nv"> </span><span class="s">pre-warm</span><span class="nv"> </span><span class="s">complete"</span><span class="nv"> </span><span class="s">&amp;&amp;</span><span class="nv"> </span><span class="s">sleep</span><span class="nv"> </span><span class="s">3600'</span><span class="pi">]</span>
</code></pre></div></div>

<p>Key design choices:</p>

<ul>
  <li>The DaemonSet <strong>tolerates</strong> the <code class="language-plaintext highlighter-rouge">os-upgrade</code> taint — it can schedule where user
pods cannot</li>
  <li>The container image <strong>is the actual base image</strong> we want cached — pulling it <em>is</em>
the pre-warming</li>
  <li>The container just sleeps — its only job is to trigger the pull and serve as a
ready signal for the CronJob</li>
</ul>
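<p>A Running pre-warmer pod is the signal we act on, but the cache can also be double-checked directly: each Node object reports its cached images under <code>status.images</code>. A small sketch (the node dict mirrors the Kubernetes API shape; the helper is illustrative):</p>

```python
# Sketch: verify a node's image cache via status.images, which the
# kubelet populates with every image present on the node.
def image_is_cached(node, image_ref):
    for img in node.get("status", {}).get("images", []):
        if image_ref in (img.get("names") or []):
            return True
    return False

node = {"status": {"images": [
    {"names": ["my-registry/notebook-base:latest"], "sizeBytes": 9500000000},
]}}
print(image_is_cached(node, "my-registry/notebook-base:latest"))  # True
```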

<h2 id="node-lifecycle-through-an-upgrade">Node Lifecycle Through an Upgrade</h2>

<p align="center">
  <img src="/assets/images/node-lifecycle.png" alt="Node Lifecycle Through an Upgrade" />
</p>

<h2 id="why-this-works-well">Why This Works Well</h2>

<ul>
  <li><strong>Self-healing.</strong> The CronJob continuously reconciles. If a node is missed in one
cycle, it gets picked up in the next.</li>
  <li><strong>No user impact.</strong> The taint guarantees no user pod hits a cold node. The node only
becomes schedulable after the image is confirmed cached.</li>
  <li><strong>Native K8s primitives.</strong> DaemonSets naturally handle “run exactly one pod per
matching node.” No custom controller needed — just the right combination of
labels, taints, and affinity rules.</li>
  <li><strong>Decoupled from the upgrade pipeline.</strong> We don’t modify the OS upgrade process.
We observe its side effects (kernel version label change) and react.</li>
</ul>

<h2 id="takeaway">Takeaway</h2>

<p>Sometimes you don’t need a custom operator. A CronJob for orchestration + a
DaemonSet for per-node work, connected through labels and taints, gave us a
reliable image pre-warming pipeline with about 200 lines of Python and 90 lines
of YAML. The custom controller path would have given us faster reaction times, but
for a process that plays out over hours, polling every hour is more than adequate —
and dramatically simpler to build, deploy, and maintain.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[On our hosted notebook platform, when a node undergoes an OS upgrade, it rejoins the cluster with a fresh slate — no cached container images. The first user pod scheduled on that node pays the full image pull penalty for the base image, leading to slow startup times and a degraded experience.]]></summary></entry><entry><title type="html">Claude Code Plugin: AI-Native Flink Pipeline Orchestration</title><link href="https://shrikantj.dev/2026/04/10/claude-plugin.html" rel="alternate" type="text/html" title="Claude Code Plugin: AI-Native Flink Pipeline Orchestration" /><published>2026-04-10T18:30:00+00:00</published><updated>2026-04-10T18:30:00+00:00</updated><id>https://shrikantj.dev/2026/04/10/claude-plugin</id><content type="html" xml:base="https://shrikantj.dev/2026/04/10/claude-plugin.html"><![CDATA[<p>Creating a Flink pipeline at work used to take hours. Not because Flink is hard.
Because you had to navigate 6-8 steps across JupyterLab, terminal, wiki docs,
access control portals, and schema registry. In the right order.
Any wrong step meant oncall.</p>

<p>I automated it with an agent. Full workflow now completes in under 10 minutes.
Zero oncall escalations for this class of issue since.</p>

<p>Here’s how it works and what made it actually hard to build.</p>

<h2 id="the-problem">The Problem</h2>

<p>Engineers relied on a wiki doc to get through each pipeline creation step,
context-switching between the JupyterLab UI, terminal, access control portals,
and metadata catalogs. Build failures, stale clusters, and access control
misconfigurations routinely required escalation. Median time to a working
pipeline: hours.</p>

<p>The manual steps weren’t the real problem. The failure handling was.
Every step had a different failure mode. None of them had a standard fix.
That’s what made this interesting to build.</p>

<h2 id="how-its-built">How It’s Built</h2>

<p>A Claude Code plugin with 11 composable skills. Built in Python on top of
Jupyter kernel internals and IPython magic commands so it runs natively
inside the same environment engineers already work in. No new tooling, no
platform changes.</p>

<p>Each skill is backed by a purpose-built CLI that talks directly to platform
APIs: Flink SQL gateway, access control services, schema registry, kernel
sessions. The skills are independent: you can run the full workflow or just
one step.</p>

<p>But the CLI layer is the easy part.</p>

<h2 id="the-failure-recovery-layer">The Failure Recovery Layer</h2>

<p>This is what separates it from a CLI wrapper.</p>

<pre><code class="language-mermaid">flowchart TD
      A([▶ Skill step executes]):::start --&gt; B[Agent reads full log output]
      B --&gt; C{Success?}

      C -- Yes --&gt; D{More steps?}
      D -- Yes --&gt; A
      D -- No --&gt; Z([✓ Session complete\nAudit trail generated]):::done

      C -- No --&gt; E[Classify error]

      E --&gt; F1[Build failure\nParse Gradle error, fix build config]
      E --&gt; F2[Cluster recycled\nReprovision automatically]
      E --&gt; F3[Pod OOM or crash\nFetch K8s logs, diagnose, apply fix]
      E --&gt; F4[Access control block\nSurface approval URL, wait]
      E --&gt; F5[DDL placeholder mismatch\nValidate against config, flag proactively]

      F1 &amp; F2 &amp; F3 --&gt; R([Rerun step])
      F4 --&gt; W[Engineer approves] --&gt; R
      F5 --&gt; W2[Engineer reviews] --&gt; R

      R --&gt; B

      classDef start fill:#2d6a4f,color:#fff,stroke:#1b4332
      classDef done fill:#1d3557,color:#fff,stroke:#0d2137
</code></pre>
<p>After every skill step, a reasoning layer reads the full log output and decides
what to do. It doesn’t just check “did it succeed?”; it reads the actual error
and takes corrective action.</p>

<p><strong>Build failure:</strong> Parses the Gradle error, identifies the root cause, fixes
the build config, reruns. Engineers don’t see the failure unless the fix itself
fails.</p>

<p><strong>Cluster recycled mid-session:</strong> Platform recycles idle clusters. The agent
detects the error, reprovisions automatically, and resumes where it left off.
Previously this required manual intervention every time.</p>

<p><strong>Job pod OOM crash:</strong> Fetches Kubernetes pod logs, diagnoses whether it’s an
OOM or a misconfiguration, applies or suggests the fix depending on confidence.</p>

<p><strong>Access control pending:</strong> Identifies the pending request, surfaces the exact
approval URL, advises retry after approval. No more digging through portals to
find the right link.</p>

<p><strong>DDL placeholder mismatch:</strong> Validates substitution against app config before
execution and flags mismatches proactively. Catches a whole class of silent
failures before they happen.</p>

<p>None of these hit oncall anymore.</p>
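<p>The dispatch itself can start as simply as a pattern table over the log text. A minimal sketch of the first-pass triage (the patterns below are illustrative, not the production set, which reasons over the full log):</p>

```python
import re

# Illustrative (pattern, category) table for first-pass error triage.
FAILURE_PATTERNS = [
    (r"FAILURE: Build failed|Execution failed for task", "build_failure"),
    (r"cluster .* not found|Connection refused", "cluster_recycled"),
    (r"OOMKilled|exit code 137", "pod_oom"),
    (r"access denied|pending approval", "access_pending"),
    (r"unresolved placeholder", "ddl_mismatch"),
]

def classify(log_text):
    for pattern, category in FAILURE_PATTERNS:
        if re.search(pattern, log_text, re.IGNORECASE):
            return category
    return "unknown"  # unknown errors are surfaced to the engineer as-is

print(classify("Container terminated: OOMKilled"))  # pod_oom
```

<p>Each category then maps to a recovery handler: automatic rerun for the first three, human-in-the-loop for the rest.</p>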

<h2 id="human-in-the-loop-and-audit-trail">Human-in-the-Loop and Audit Trail</h2>

<p>Two things I spent more time on than I expected.</p>

<p>Before anything runs, the agent surfaces the exact parameters of the command
along with a 1-2 line summary for review. This wasn’t optional; it’s what made
people actually trust it.</p>

<p>There’s also a summary command. Run it at any point to get a structured view
of what has completed and what remains. Useful mid-session, and essential
when something goes wrong and you want to understand the state.</p>

<p>I also built a session summary skill that reconstructs the entire session from
conversation logs: every operation, every failure, every fix, in order. A
structured audit trail. It turned out to be one of the more-used features.</p>

<p>“Every agent action should be explainable, attributable, auditable.” For this
system, that’s not just a principle: it’s a feature.</p>

<h2 id="impact">Impact</h2>

<ul>
  <li>Full workflow in under 10 minutes. Was hours.</li>
  <li>Zero oncall escalations for this class of issue since shipping.</li>
  <li>Everything in one terminal. No context switching between UI, wiki, portals,
and catalogs.</li>
  <li>Each skill runs independently. Add a single source without rerunning the
whole workflow.</li>
  <li>No platform changes required. Integrates at the CLI layer only.</li>
</ul>

<h2 id="what-id-build-differently">What I’d Build Differently</h2>

<p>The recovery layer grew organically as I hit each failure mode in production.
Next time I’d design it upfront: define the failure taxonomy first, then build
the recovery handlers. The way it happened, each fix ended up slightly
inconsistent in how it reported back to the engineer.</p>

<p>I’d also add structured evals earlier. Right now I know it works because I
can see it working. That’s not the same as having a repeatable way to verify
it still works after changes.</p>

<hr />

<p>The gap between a demo and something engineers actually use is mostly a failure
handling problem. Happy path is easy. Recycled clusters, OOM crashes, stale
access tokens: that’s where demos die.</p>

<p>If you’re building something similar or have thoughts on the recovery layer
design, I’d be curious to hear it.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Creating a Flink pipeline at work used to take hours. Not because Flink is hard. Because you had to navigate 6-8 steps across JupyterLab, terminal, wiki docs, access control portals, and schema registry. In the right order. Any wrong step meant oncall.]]></summary></entry><entry><title type="html">Chrome Extension: Country of Origin Toast</title><link href="https://shrikantj.dev/2026/03/09/made-origin-chrome-extension.html" rel="alternate" type="text/html" title="Chrome Extension: Country of Origin Toast" /><published>2026-03-09T18:30:00+00:00</published><updated>2026-03-09T18:30:00+00:00</updated><id>https://shrikantj.dev/2026/03/09/made-origin-chrome-extension</id><content type="html" xml:base="https://shrikantj.dev/2026/03/09/made-origin-chrome-extension.html"><![CDATA[<h2 id="about-the-extension">About the Extension</h2>

<p>This post is about the development of a Chrome extension called “Made Origin”. This extension shows a small toast notification about the Country of Origin of a product when you visit apparel websites like Myntra, Uniqlo, H&amp;M, etc. Usually, the Country of Origin is not easily visible on the product page and takes multiple clicks to find. This extension makes it easier for users to find this information.</p>

<h2 id="development-process">Development Process</h2>

<h3 id="initial-prompt-to-claude">Initial Prompt to Claude</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I have an idea of a chrome extension:
When I visit any apparel site like myntra, uniqlo, h&amp;m it is tough to find the Country of Origin / Manufactured country of a product. Its usually hidden under some panel which the user has to look out for and click it.

My idea is simple, whenever I open a product on any of these sites, I should see the country as a simple toast on the top right of the site page with country name and flag. A simple thing but it should be easily visible.
</code></pre></div></div>

<p>The prompt above gave me a basic structure for the extension, which I then refined and developed further.</p>

<h3 id="working-on-the-extension-with-vscode-copilot">Working on the Extension with VS Code Copilot</h3>

<p>The initial code generated by Claude was a good starting point. But when I tried to load the extension in Chrome, it didn’t work as expected. So it was time to first understand the code at a high level and work with Copilot to make it function correctly.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Help me analyse this Chrome extension with an example and a step-by-step approach
</code></pre></div></div>

<p>The prompt above helped me understand the code and how it works.</p>

<p>So I started with Myntra products first. I inspected the product page on Myntra to see where the country-of-origin information is stored and what the manual process was to find it, such as scrolling down and clicking on <code class="language-plaintext highlighter-rouge">View Supplier Information</code>.</p>

<p align="center">
  <img src="/assets/images/initial-supplier-info.png" alt="View Supplier Information" />
</p>

<p>This opened a modal containing the country-of-origin information. So the extension had to do the same thing: click the button and then extract the country information from the modal.</p>

<p align="center">
  <img src="/assets/images/country-of-origin.png" alt="Country of Origin" />
</p>

<p>Based on this, I found the div selector that contained the country information and passed it to Copilot as:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>For myntra its not working.
The Country of Origin comes up as a &lt;div class="Modal-modalContent"&gt;&lt;div class="Modal-modalDialog"&gt;&lt;div class="details-details"&gt;&lt;span class="myntraweb-sprite Address-close-button sprites-remove"&gt;&lt;/span&gt;&lt;div&gt;&lt;h3&gt;More Information&lt;/h3&gt;&lt;p&gt;Product Code : 36879762&lt;/p&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;&lt;h4&gt;Importer Details&lt;/h4&gt;&lt;p&gt;Marks &amp; Spencer Reliance India Pvt Ltd Ground Floor Infinity Tower Cdlf Cyber City Phase Ii Gurgaon Haryana-122002&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;h4&gt;Country of Origin&lt;/h4&gt;&lt;p&gt;Bangladesh&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;

When we click on this div: class="supplier-viewmore-link" having text: View Supplier Information

#mountRoot &gt; div &gt; div:nth-child(1) &gt; main &gt; div.pdp-details.common-clearfix &gt; div.pdp-description-container &gt; div.undefined.supplier-desktopCodeSupplier &gt; div:nth-child(2) &gt; div

</code></pre></div></div>

<p>Now, when opening a product page on Myntra, the extension clicked the <code class="language-plaintext highlighter-rouge">View Supplier Information</code> button, extracted the country information from the modal, and showed it as a toast notification on the top right of the page.</p>
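<p>In content-script form, that first working version can be sketched roughly as below. Only the Myntra selectors (<code class="language-plaintext highlighter-rouge">supplier-viewmore-link</code>, <code class="language-plaintext highlighter-rouge">Modal-modalContent</code>) come from the page inspection above; the helper names are illustrative, not the extension's actual identifiers.</p>

```javascript
// Illustrative sketch, not the extension's actual code. The selectors are
// the ones found by inspecting Myntra; function names are hypothetical.

// Pull the country out of the modal's visible text, e.g.
// "... Country of Origin\nBangladesh ..."
function extractCountryFromModalText(text) {
  const match = text.match(/Country of Origin\s*:?\s*([A-Za-z ]+)/i);
  return match ? match[1].trim() : null;
}

// Render a simple toast in the top-right corner of the page.
function showOriginToast(country) {
  const toast = document.createElement("div");
  toast.textContent = `Made in ${country}`;
  toast.style.cssText =
    "position:fixed;top:16px;right:16px;z-index:99999;" +
    "padding:8px 14px;background:#222;color:#fff;border-radius:6px;";
  document.body.appendChild(toast);
  setTimeout(() => toast.remove(), 5000);
}

// Content-script glue: click the supplier link, wait for the modal
// to render, then read the country from it.
function revealAndReadCountry() {
  const link = document.querySelector(".supplier-viewmore-link");
  if (!link) return;
  link.click();
  setTimeout(() => {
    const modal = document.querySelector(".Modal-modalContent");
    if (!modal) return;
    const country = extractCountryFromModalText(modal.innerText);
    if (country) showOriginToast(country);
  }, 900);
}
```

<p>At this stage the modal still opens visibly, which is exactly the drawback addressed next.</p>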

<p>But there was a drawback: the modal visibly opened, which was annoying for the user. The extraction had to happen entirely in the background, without the modal ever being shown.</p>

<p>So this was the next prompt:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Okay, now it works, but the modal opens in the UI, which is a bad user experience. Is there any way to do it in the background?
</code></pre></div></div>

<p>Gist of the solution:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The supplier link is still clicked programmatically, but:
addMyntraModalStealthStyle() temporarily hides the modal (opacity: 0, no pointer events) while it’s open, so users should not see it.

openMyntraCountryModalIfNeeded() now:
Clicks the link once, waits ~900 ms, scans modal content via scanElements(getCurrentHostSelectors()), and returns the detected country.

Then calls closeMyntraModalIfOpen() to close/remove the modal and removeMyntraModalStealthStyle() to restore normal styles.

The run() function:
If initial selector scan finds nothing on Myntra, it calls this new “stealth” modal flow and uses the returned country.
</code></pre></div></div>

<p>This made it work as expected without opening the modal in the UI.</p>
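<p>A minimal sketch of that stealth technique, assuming the same Myntra selectors as before; the helper names here are hypothetical stand-ins for the <code class="language-plaintext highlighter-rouge">addMyntraModalStealthStyle</code>-style functions Copilot generated.</p>

```javascript
// Illustrative sketch of the "stealth" flow: keep the modal in the DOM
// but invisible while we read it. Names are hypothetical.

const STEALTH_STYLE_ID = "made-origin-stealth-style";

// CSS that hides the modal and blocks interaction without removing it.
function buildStealthCss(modalSelector) {
  return `${modalSelector} { opacity: 0 !important; pointer-events: none !important; }`;
}

function addStealthStyle() {
  const style = document.createElement("style");
  style.id = STEALTH_STYLE_ID;
  style.textContent = buildStealthCss(".Modal-modalContent");
  document.head.appendChild(style);
}

function removeStealthStyle() {
  document.getElementById(STEALTH_STYLE_ID)?.remove();
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Open the modal invisibly, read its text, close it, restore styles.
async function readCountryStealthily() {
  addStealthStyle();
  try {
    document.querySelector(".supplier-viewmore-link")?.click();
    await sleep(900); // give the modal time to render
    const modal = document.querySelector(".Modal-modalContent");
    const text = modal ? modal.innerText : "";
    // Close via the modal's own close button, if present.
    modal?.querySelector(".Address-close-button")?.click();
    return text;
  } finally {
    removeStealthStyle();
  }
}
```

<p>The <code class="language-plaintext highlighter-rouge">finally</code> block matters: even if the scan throws, the stealth style is removed so later (user-initiated) modal opens render normally.</p>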

<p align="center">
  <img src="/assets/images/country-toast.png" alt="Country of Origin" />
</p>

<p>While trying it in the UI, I found a case where, if the product size is not selected, the supplier information is not available.</p>

<p>Here was the prompt to fix it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>One behavior I observed is that when I open a product and have not selected a size, the supplier information is not present.

So now only trigger the flow when a size is selected from this div: #sizeButtonsContainer &gt; div.size-buttons-size-buttons
Then trigger our flow to find the country of origin, rather than triggering it as soon as the page opens.
</code></pre></div></div>

<p>This made the extension work only after a product size is selected, and then it shows the country of origin as a toast notification on the top right of the page.</p>
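<p>That size-gated trigger can be sketched with event delegation on the size-button container from the prompt above. Again, <code class="language-plaintext highlighter-rouge">onSizeSelected</code> and <code class="language-plaintext highlighter-rouge">shouldRunFlow</code> are illustrative names, and the 500&nbsp;ms delay is an assumed settling time, not a value from the extension.</p>

```javascript
// Illustrative sketch: run the country-of-origin flow only after the
// user picks a size. Names and the delay value are hypothetical.

// Pure guard: run the flow only once per page load.
function shouldRunFlow(sizeSelected, alreadyRan) {
  return sizeSelected && !alreadyRan;
}

let flowRan = false;

// Event delegation: one listener on the size-button container catches
// clicks on any size button, including buttons rendered later.
function onSizeSelected(callback) {
  const container = document.querySelector(
    "#sizeButtonsContainer > div.size-buttons-size-buttons"
  );
  if (!container) return;
  container.addEventListener("click", () => {
    if (!shouldRunFlow(true, flowRan)) return;
    flowRan = true;
    // Give Myntra a moment to render the supplier section that only
    // appears once a size is chosen.
    setTimeout(callback, 500);
  });
}

// usage: onSizeSelected(readCountryAndShowToast);
```
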

<p>The next step is to optimize it for other sites.</p>

<h3 id="summary">Summary</h3>

<p>Overall, getting started with boilerplate code from Claude and then using Copilot to understand and refine the code was a good experience. Obviously, you need to be specific with your prompts to get the desired output.</p>

<p>An idea turned into a product quickly with the help of AI tools. I hope this guide will be helpful for others who want to build similar extensions or products.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[About the Extension]]></summary></entry><entry><title type="html">Welcome!!</title><link href="https://shrikantj.dev/2026/02/04/welcome.html" rel="alternate" type="text/html" title="Welcome!!" /><published>2026-02-04T06:48:24+00:00</published><updated>2026-02-04T06:48:24+00:00</updated><id>https://shrikantj.dev/2026/02/04/welcome</id><content type="html" xml:base="https://shrikantj.dev/2026/02/04/welcome.html"><![CDATA[<p>Welcome to my blog! Here I will be documenting things I read: books, articles, concalls, annual reports, PDFs, etc. Things I listen to or watch: podcasts, YouTube videos, etc. And things I do: projects, work, etc.</p>

<p>This is more of a personal journal, sharing my learnings, thoughts, and observations. I hope it will be useful for someone else as well. I will try to keep it updated regularly.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Welcome to my blog! Here I will be documenting things I read: books, articles, concalls, annual reports, PDFs, etc. Things I listen to or watch: podcasts, YouTube videos, etc. And things I do: projects, work, etc.]]></summary></entry></feed>