GCP VM + Terraform Gotchas

Lessons from standing up a service on a GCP e2-micro with Terraform. Each of these bit us in production.

1. VM recreation destroys all local state

Warning

Terraform destroys first. A force-replacement apply deletes the old VM before the new one is healthy. There is no in-place swap. If your data is on local disk, it is gone.

Certain google_compute_instance fields are marked ForceNew in the provider: changing any of them triggers a destroy + create cycle instead of an in-place update. Any state on local disk (SQLite, uploaded files, credentials) goes with the old VM.

Force-replacement fields

Field	Reason
`metadata_startup_script`	Intentionally ForceNew — triggers recreation when the startup script changes
`zone`	Cannot move a VM across zones in-place
`boot_disk.initialize_params`	Disk creation params can’t be modified after provisioning
`name`	Rename = new resource

What is metadata_startup_script? A field holding a shell script that GCP runs automatically on every boot of the VM, including reboots, not only the first boot after creation. Common uses: install packages, pull secrets from Secret Manager, restore a GCS backup, start your service. It’s your VM’s bootstrap logic, roughly comparable to a user-data script on AWS EC2. Because it runs on every boot, the script should be safe to run more than once.

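metadata_startup_script = <<-EOF
  #!/bin/bash
  apt-get update                                    # refresh package lists on a fresh VM
  apt-get install -y google-cloud-cli               # provides gcloud and gsutil
  gsutil cp gs://my-bucket/backup.db /app/data.db   # restore the latest backup
  systemctl start my-service
EOF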

The provider treats any change to this script as “rebuild the VM from scratch,” since there’s no reliable way to re-run just the diff of a shell script on a live machine.
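If recreating the VM on every script edit is not what you want, the provider documentation describes an alternative: supply the script through the startup-script metadata key, which is updated in place. The sketch below assumes a hypothetical startup.sh file next to the module; the changed script only takes effect on the next boot, since nothing re-runs it on the live machine.

```hcl
resource "google_compute_instance" "vm" {
  # ...

  # In-place alternative to metadata_startup_script: editing the script
  # updates instance metadata without destroying the VM.
  # Assumption: startup.sh is a hypothetical file in this module's directory.
  metadata = {
    "startup-script" = file("${path.module}/startup.sh")
  }
}
```

The trade-off is that a running VM can drift from the script in your repository until it is rebooted, so this fits best when the script is idempotent.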

Stop-to-update fields

service_account works differently — it does not force replacement. Instead, GCP stops the VM, swaps the service account, then restarts it. Terraform won’t do this automatically unless you opt in:

resource "google_compute_instance" "vm" {
  allow_stopping_for_update = true   # lets Terraform stop → update → start
  # ...
  service_account {
    email  = google_service_account.new_sa.email
    scopes = ["cloud-platform"]
  }
}

Without allow_stopping_for_update = true, Terraform errors out and refuses to apply — it won’t silently skip the change or force-replace. The same flag is required for machine_type and min_cpu_platform changes.

The alternative is desired_status = "TERMINATED": Terraform stops the VM and applies the change, but does not restart it — you restart manually. Useful if you want to control the restart window.
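A minimal sketch of that second approach, assuming a hypothetical machine_type change as the update being applied:

```hcl
resource "google_compute_instance" "vm" {
  # ...
  machine_type = "e2-small" # hypothetical new size, for illustration only

  # Stop the VM so the change can be applied, and leave it stopped.
  desired_status = "TERMINATED"
}
```

Setting desired_status back to "RUNNING" in a later apply starts the VM again, which keeps the restart window under your control.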

Force replacement vs stop-to-update

Behaviour	Fields	Local disk
Force replacement	`metadata_startup_script`, `zone`, `boot_disk.initialize_params`, `name`	Gone: VM destroyed and recreated
Stop-to-update	`service_account`, `machine_type`, `min_cpu_platform`	Survives: VM stopped, updated, restarted
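One way to make an accidental force replacement fail loudly instead of deleting the VM is Terraform's prevent_destroy lifecycle setting. This is a sketch of a guard, not something the original setup used:

```hcl
resource "google_compute_instance" "vm" {
  # ...

  lifecycle {
    # Any plan that would destroy this VM (including a force replacement)
    # is rejected with an error instead of being applied.
    prevent_destroy = true
  }
}
```

A deliberate replacement then requires removing the flag first, which turns VM recreation into an explicit decision.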

Fix

Move all persistent state off the VM before going to production — GCS bucket, Cloud SQL, or any managed store. At startup, restore from GCS; on a schedule, snapshot back to GCS.
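The restore-then-snapshot pattern could look roughly like the sketch below. It reuses the my-bucket and /app/data.db names from the earlier example; the 15-minute interval and the cron file name are assumptions, and it presumes gsutil is already installed on the image.

```hcl
resource "google_compute_instance" "vm" {
  # ...
  metadata_startup_script = <<-EOF
    #!/bin/bash
    set -euo pipefail

    # Restore the latest backup. A first boot has no backup yet, so a failed
    # copy is tolerated here (this also hides auth errors, so check the logs).
    gsutil cp gs://my-bucket/backup.db /app/data.db || echo "no backup restored"

    # Snapshot back to GCS every 15 minutes (interval is an assumption).
    # Copying a live SQLite file can capture a partial write; a real setup
    # may prefer a database-level backup step before the upload.
    echo '*/15 * * * * root gsutil cp /app/data.db gs://my-bucket/backup.db' > /etc/cron.d/gcs-backup

    systemctl start my-service
  EOF
}
```

With this in place, a recreated VM loses at most one backup interval of data rather than everything.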

Tip

Rule — Treat the VM as cattle, not a pet. Any persistent state must live outside it (GCS, Cloud SQL, etc.) before you go to production.

2. Reserve a static IP before day one

When the VM was first recreated, its ephemeral external IP changed and broke everything downstream: SSH config, Cloudflare env vars, bot config. We had to reserve a static IP and terraform import it after the fact — a painful retrofit.

google_compute_address and google_compute_instance are two separate resources because the IP needs to outlive the VM. If the IP were defined inline inside the VM block, it would be destroyed with the VM. As a standalone resource, it survives VM recreation and the new VM just re-attaches to the same IP.

resource "google_compute_address" "vm_ip" {
  name   = "my-service-ip"
  region = var.region
}
 
resource "google_compute_instance" "vm" {
  # ...
  network_interface {
    network = "default"
    access_config {
      nat_ip = google_compute_address.vm_ip.address
    }
  }
}

Pricing: a static IP attached to a running instance costs effectively $0/month. The penalty rate (around $7.30/month) only applies to reserved IPs that are idle (reserved but not attached to any resource).

Tip

Rule — Add google_compute_address on day one, before the first production deploy. Retrofitting it requires terraform import and a config change applied while the service is live.
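If you do end up retrofitting, Terraform 1.5 and later can express the import declaratively with an import block. The sketch below assumes a hypothetical project ID (my-project) and region; the ID follows the provider's documented projects/PROJECT/regions/REGION/addresses/NAME format.

```hcl
# Adopt an address that was reserved outside Terraform.
# Assumption: project ID and region are placeholders for illustration.
import {
  to = google_compute_address.vm_ip
  id = "projects/my-project/regions/us-central1/addresses/my-service-ip"
}

resource "google_compute_address" "vm_ip" {
  name   = "my-service-ip"
  region = var.region
}
```

Running terraform plan afterwards should show the address being imported instead of created; the import block can be deleted once the apply succeeds.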

3. Zone capacity errors fail silently until apply

us-east1-b had no e2-micro capacity. Terraform only discovered this at VM creation time — after it had already destroyed the old instance:

terraform apply starts
Old VM: destroyed ✓
New VM: ZONE_RESOURCE_POOL_EXHAUSTED — creation fails
Service is down with no rollback path

terraform plan shows no capacity information. The error only surfaces at apply time.

Warning

Capacity errors only fail at apply, after destroy. There is no rollback — the old VM is already gone when the new one fails to provision.

Fix: pin the zone in a tfvar

Pinning the zone is not a GCP resource — it’s a config hygiene choice. Declare a variable and reference it so the zone is visible and intentional, and so you can switch in one line if you hit capacity limits.

# variables.tf
variable "zone" {
  type = string
}
 
# terraform.tfvars
zone = "us-central1-a"
 
# main.tf
resource "google_compute_instance" "vm" {
  zone = var.zone
  # ...
}

Zones where we found e2-micro consistently available (within the GCP Always Free tier regions): us-central1-a, us-east1-c. Capacity shifts over time, so treat this list as a starting point.

If you hit ZONE_RESOURCE_POOL_EXHAUSTED, change one line in terraform.tfvars and re-apply.
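To keep that one-line switch from landing on an untested zone, the variable can carry a validation block. This is an optional sketch; the allowed list simply mirrors the two zones named above and is an assumption about which zones you consider verified.

```hcl
# variables.tf
variable "zone" {
  type        = string
  description = "Zone for the VM; change this if the current zone runs out of capacity."

  validation {
    # Assumption: only zones that have been tried with e2-micro are allowed.
    condition     = contains(["us-central1-a", "us-east1-c"], var.zone)
    error_message = "Zone must be one of the zones verified for e2-micro."
  }
}
```

Validation runs at plan time, so a typo in terraform.tfvars is caught before anything is destroyed. It does not detect capacity problems, which still only surface at apply.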

4. google-cloud-cli takes 10+ minutes to install on e2-micro

The google-cloud-cli apt package is 479 MB and unpacks 53,000+ files. The dpkg post-install step compiles Python bytecode for every Python file; on a 2-vCPU shared-core machine, this is genuinely slow.

Option A — Skip Python compilation (fastest fix, no code change):

sudo CLOUDSDK_SKIP_PY_COMPILATION=1 apt-get install -y google-cloud-cli

Drops install time from 10+ minutes to roughly 1–2 minutes. Commands still work; they just compile on first run instead.

Option B — Use the Python client library directly (best if you only need GCS):

pip install google-cloud-storage

No gcloud binary needed. If the only use is bucket read/write (e.g. backup restore), this avoids the full SDK entirely.

Option C — Bake a custom machine image:

Build a custom GCP image with gcloud pre-installed and use it as the boot disk. Startup time drops dramatically; the install cost is paid once during image build.
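Once such an image exists, pointing the VM at it is a small change to the boot disk. The sketch below assumes a hypothetical image family named my-service-base in a project named my-project.

```hcl
resource "google_compute_instance" "vm" {
  # ...
  boot_disk {
    initialize_params {
      # Assumption: project and image family names are placeholders.
      # Referencing a family picks up the newest image in that family.
      image = "projects/my-project/global/images/family/my-service-base"
    }
  }
}
```

Because boot_disk.initialize_params is a force-replacement field, switching images recreates the VM, so this change belongs in a planned maintenance window.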

Note

google-cloud-cli-slim is not an apt package — the :slim variant only exists as a Docker image tag (google/cloud-sdk:slim). There is no slim Debian package in the Google Cloud apt repository.

Zhu Yuechen's Tech Notes
