Lessons from standing up a service on a GCP e2-micro with Terraform. Each of these bit us in production.
1. VM recreation destroys all local state
Warning
Terraform destroys first. A force-replacement apply deletes the old VM before the new one is healthy. There is no in-place swap. If your data is on local disk, it is gone.
Certain google_compute_instance fields trigger a destroy + create cycle — the old VM is deleted before the new one exists. Any state on local disk (SQLite, uploaded files, credentials) is gone.
Force-replacement fields
| Field | Reason |
|---|---|
| metadata_startup_script | Intentionally ForceNew: recreation is triggered whenever the startup script changes |
| zone | Cannot move a VM across zones in-place |
| boot_disk.initialize_params | Disk creation params can’t be modified after provisioning |
| name | Rename = new resource |
What is metadata_startup_script? A field where you write a shell script that GCP runs automatically on every boot, including the first boot after the VM is recreated. Common uses: install packages, pull secrets from Secret Manager, restore a GCS backup, start your service. It’s your VM’s bootstrap logic, roughly equivalent to a user-data script on AWS EC2.
```hcl
metadata_startup_script = <<-EOF
  #!/bin/bash
  apt-get install -y google-cloud-cli
  gsutil cp gs://my-bucket/backup.db /app/data.db
  systemctl start my-service
EOF
```

The provider treats any change to this script as “rebuild the VM from scratch,” since there’s no reliable way to re-run just the diff of a shell script on a live machine.
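If a rebuild on every script edit is not what you want, the provider also accepts the script through the generic metadata map under the startup-script key. Changes there are applied in place, and the new script runs on the next boot. A minimal sketch, assuming a hypothetical startup.sh file that sits next to the configuration:

```hcl
resource "google_compute_instance" "vm" {
  # ...

  # Same script, delivered through the generic metadata map instead of
  # metadata_startup_script. Editing it updates the VM in place, so the
  # new version only takes effect on the next boot.
  metadata = {
    startup-script = file("${path.module}/startup.sh") # hypothetical file name
  }
}
```

The trade-off is that a running VM keeps whatever the old script set up until it reboots, so this suits scripts that are safe to re-run.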
Stop-to-update fields
service_account works differently — it does not force replacement. Instead, GCP stops the VM, swaps the service account, then restarts it. Terraform won’t do this automatically unless you opt in:
```hcl
resource "google_compute_instance" "vm" {
  allow_stopping_for_update = true # lets Terraform stop → update → start
  # ...
  service_account {
    email  = google_service_account.new_sa.email
    scopes = ["cloud-platform"]
  }
}
```

Without allow_stopping_for_update = true, Terraform errors out and refuses to apply; it won’t silently skip the change or force-replace. The same flag is required for machine_type and min_cpu_platform changes.
The alternative is desired_status = "TERMINATED": Terraform stops the VM and applies the change, but does not restart it — you restart manually. Useful if you want to control the restart window.
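A minimal sketch of that alternative, assuming a machine_type change as the stop-to-update edit (the e2-small value is only illustrative):

```hcl
resource "google_compute_instance" "vm" {
  # Terraform stops the VM, applies the change, and leaves it stopped.
  desired_status = "TERMINATED"
  machine_type   = "e2-small" # illustrative stop-to-update change
  # ...
}
```

When the restart window arrives, start the VM manually, or set desired_status to "RUNNING" in a follow-up apply.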
Force replacement vs stop-to-update
| Behaviour | Fields | Local disk |
|---|---|---|
| Force replacement | metadata_startup_script, zone, boot_disk.initialize_params, name | Gone: VM destroyed and recreated |
| Stop-to-update | service_account, machine_type, min_cpu_platform | Survives: VM stopped, updated, restarted |
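One way to keep a force-replacement edit from slipping through unnoticed is Terraform's lifecycle guard prevent_destroy. A minimal sketch; the guard has to be removed deliberately when a rebuild is actually intended:

```hcl
resource "google_compute_instance" "vm" {
  # ...

  lifecycle {
    # Any plan that would destroy this VM, including a force
    # replacement, fails with an error instead of proceeding.
    prevent_destroy = true
  }
}
```

This does not protect the data itself; it only turns an accidental replacement into a failed plan that someone has to look at.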
Fix
Move all persistent state off the VM before going to production — GCS bucket, Cloud SQL, or any managed store. At startup, restore from GCS; on a schedule, snapshot back to GCS.
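The restore-at-startup, snapshot-on-a-schedule pattern could look roughly like the sketch below. The bucket name, file paths, and 15-minute interval are hypothetical. It also assumes gsutil and cron are already present on the image and that the VM's service account can read and write the bucket.

```hcl
resource "google_storage_bucket" "state" {
  name     = "my-service-state" # hypothetical; bucket names are globally unique
  location = "US"               # assumption; pick the location that suits you

  versioning {
    enabled = true # keep older snapshots in case a bad one is uploaded
  }
}

resource "google_compute_instance" "vm" {
  # ...
  metadata_startup_script = <<-EOF
    #!/bin/bash
    # Restore only when the disk has no database yet, i.e. on a fresh VM.
    # The script runs on every boot, so an unconditional restore would
    # overwrite newer local data after a plain reboot.
    if [ ! -f /app/data.db ]; then
      mkdir -p /app
      gsutil cp gs://${google_storage_bucket.state.name}/backup.db /app/data.db || true
    fi

    # Snapshot back to GCS every 15 minutes via cron.
    echo '*/15 * * * * root gsutil cp /app/data.db gs://${google_storage_bucket.state.name}/backup.db' > /etc/cron.d/backup-state

    systemctl start my-service
  EOF
}
```

Copying a live SQLite file can capture a mid-write state, so a real setup would probably run the database's own backup command first and upload that copy.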
Tip
Rule — Treat the VM as cattle, not a pet. Any persistent state must live outside it (GCS, Cloud SQL, etc.) before you go to production.
2. Reserve a static IP before day one
When the VM was first recreated, its ephemeral external IP changed and broke everything downstream: SSH config, Cloudflare env vars, bot config. We had to reserve a static IP and terraform import it after the fact — a painful retrofit.
google_compute_address and google_compute_instance are two separate resources because the IP needs to outlive the VM. If the IP were defined inline inside the VM block, it would be destroyed with the VM. As a standalone resource, it survives VM recreation and the new VM just re-attaches to the same IP.
```hcl
resource "google_compute_address" "vm_ip" {
  name   = "my-service-ip"
  region = var.region
}

resource "google_compute_instance" "vm" {
  # ...
  network_interface {
    network = "default"
    access_config {
      nat_ip = google_compute_address.vm_ip.address
    }
  }
}
```

Pricing: a static IP attached to a running instance costs effectively nothing extra; the idle-address charge (about $7.30/month) only applies to reserved IPs that are idle (reserved but not attached to any resource).
Tip
Rule: Add google_compute_address on day one, before the first production deploy. Retrofitting it requires terraform import and a config change applied while the service is live.
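If the retrofit is unavoidable, the import can be declared in configuration instead of run by hand. A minimal sketch, assuming Terraform 1.5 or newer and a hypothetical project ID and region in the address path:

```hcl
# On older Terraform versions the equivalent CLI call is:
#   terraform import google_compute_address.vm_ip \
#     projects/my-project/regions/us-central1/addresses/my-service-ip
import {
  to = google_compute_address.vm_ip
  id = "projects/my-project/regions/us-central1/addresses/my-service-ip" # hypothetical project and region
}

resource "google_compute_address" "vm_ip" {
  name   = "my-service-ip"
  region = var.region
}
```

Running terraform plan first should report the address as an import. Any other proposed change to it is a sign that the resource block does not yet match what was reserved.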
3. Zone capacity errors fail silently until apply
us-east1-b had no e2-micro capacity. Terraform only discovered this at VM creation time — after it had already destroyed the old instance:
- terraform apply starts
- Old VM: destroyed ✓
- New VM: ZONE_RESOURCE_POOL_EXHAUSTED, creation fails
- Service is down with no rollback path
terraform plan shows no capacity information. The error only surfaces at apply time.
Warning
Capacity errors only fail at apply, after destroy. There is no rollback — the old VM is already gone when the new one fails to provision.
Fix: pin the zone in a tfvar
Pinning the zone is not a GCP resource — it’s a config hygiene choice. Declare a variable and reference it so the zone is visible and intentional, and so you can switch in one line if you hit capacity limits.
```hcl
# variables.tf
variable "zone" {
  type = string
}

# terraform.tfvars
zone = "us-central1-a"

# main.tf
resource "google_compute_instance" "vm" {
  zone = var.zone
  # ...
}
```

Zones with consistent e2-micro availability (GCP Always Free tier): us-central1-a, us-east1-c.
If you hit ZONE_RESOURCE_POOL_EXHAUSTED, change one line in terraform.tfvars and re-apply.
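Because the zone is changed by hand under pressure, a typo is easy to make. A validation block on the variable catches a malformed value at plan time, before anything is destroyed. A minimal sketch; the regular expression is an assumption about the usual region-letter zone naming and checks only the shape, not capacity:

```hcl
# variables.tf (variant of the declaration above)
variable "zone" {
  type        = string
  description = "Compute zone for the VM, for example us-central1-a."

  validation {
    # Shape check only: "<region>-<letter>", e.g. us-central1-a.
    condition     = can(regex("^[a-z]+-[a-z]+[0-9]+-[a-z]$", var.zone))
    error_message = "The zone must look like <region>-<letter>, for example us-central1-a."
  }
}
```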
4. google-cloud-cli takes 10+ minutes to install on e2-micro
The google-cloud-cli apt package is 479 MB and unpacks 53,000+ files. The dpkg post-install step compiles Python bytecode for every file — on a 2-vCPU shared-core machine, this is genuinely slow.
Option A — Skip Python compilation (fastest fix, no code change):
```bash
sudo CLOUDSDK_SKIP_PY_COMPILATION=1 apt-get install -y google-cloud-cli
```

Drops install time from 10+ minutes to roughly 1–2 minutes. Commands still work; they just compile on first run instead.
Option B — Use the Python client library directly (best if you only need GCS):
```bash
pip install google-cloud-storage
```

No gcloud binary needed. If the only use is bucket read/write (e.g. backup restore), this avoids the full SDK entirely.
Option C — Bake a custom machine image:
Build a custom GCP image with gcloud pre-installed and use it as the boot disk. Startup time drops dramatically; the install cost is paid once during image build.
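On the Terraform side, Option C might be wired up as in the sketch below. It assumes the baked image is published under a hypothetical image family my-service-base in a hypothetical project my-project.

```hcl
data "google_compute_image" "base" {
  family  = "my-service-base" # hypothetical family, built with gcloud pre-installed
  project = "my-project"      # hypothetical project ID
}

resource "google_compute_instance" "vm" {
  # ...
  boot_disk {
    initialize_params {
      # Resolves to the newest image in the family at plan time.
      image = data.google_compute_image.base.self_link
    }
  }
}
```

Since boot_disk.initialize_params is a force-replacement field, publishing a new image in the family shows up as a VM replacement on the next plan, so the lessons from section 1 still apply.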
Note
google-cloud-cli-slim is not an apt package; the :slim variant only exists as a Docker image tag (google/cloud-sdk:slim). There is no slim Debian package in the Google Cloud apt repository.
See also
- terraform-variables — how variables.tf / tfvars / var.<name> fit together
- terraform-apply — terraform import, reading # forces replacement in plan output