Observability on a Budget: What Every Startup Should Monitor Before Series A
You don't need a $5K/month Datadog bill to know what's happening in your application. Here's the observability stack I install at seed-stage startups — effective, cheap, and ready to scale.
The observability conversation at a seed-stage startup usually goes one of two ways. Either the team has nothing — no structured logging, no metrics, no alerting — and they debug production by reading console output. Or they signed up for Datadog in month two and are staring at a $3K/month bill that grows faster than their revenue. Both are wrong. The first is flying blind. The second is paying enterprise prices for startup problems.
There is a middle path, and it is the one I install at nearly every engagement. A focused observability stack that covers the things you actually need to monitor at your stage, costs under $200/month, and scales cleanly when the time comes to invest more.
The three pillars, scoped for startups
Observability has three pillars: logs, metrics, and traces. At a seed-stage startup, you need all three, but you need them at different depths.
Logs are the foundation. Structured JSON logs with consistent fields — timestamp, level, request ID, user ID, service name. Not console.log("something happened"). Structured logs let you search, filter, and correlate. When a user reports a bug, you can find every log line from their session in seconds.
Metrics tell you the aggregate story. Request count, error rate, latency percentiles (p50, p95, p99), database query time, queue depth, memory and CPU utilization. Metrics are what trigger alerts. You do not need 500 custom metrics at seed stage. You need 15–20 that cover the critical paths.
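To make the percentile and error-rate terms concrete, here is a minimal sketch of how they can be derived from raw request records for one time window. It uses the nearest-rank percentile definition, which is an assumption on my part; metrics backends often interpolate or use histogram buckets instead, and in practice the backend computes these for you. The function names and the sample data are hypothetical.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(requests):
    """requests: list of (status_code, duration_ms) tuples for one window."""
    durations = sorted(duration for _, duration in requests)
    total = len(requests)
    errors_5xx = sum(1 for status, _ in requests if status >= 500)
    return {
        "request_count": total,
        "error_rate_5xx": errors_5xx / total,
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# Hypothetical sample: 100 requests taking 1..100 ms, two of them failing.
sample = [(500 if i in (7, 42) else 200, float(i)) for i in range(1, 101)]
summary = summarize(sample)
```

The gap between p50 and p99 is usually the interesting part: a healthy median with a bad p99 points at a slow path that only some requests hit.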
Traces show the journey of a single request through your system. In a monolith, traces are less critical because the entire request executes in one process. As you add external API calls, background jobs, and third-party services, traces become essential for debugging latency and failures.
At seed stage, invest 80% of your effort in logs and metrics. Add tracing when you have more than one service or when external API calls become a meaningful source of latency.
The budget stack
The specific tools I recommend at seed stage, optimized for cost:
Logging: Axiom or Better Stack. Both offer generous free tiers (at the time of writing, Axiom: 500GB ingest/month on the free plan; Better Stack: similar) and simple setup. Free-tier limits change, so confirm them on each vendor's pricing page before you commit. Ship structured JSON logs from your application, and you get search, filtering, and basic dashboards. At seed stage, this replaces the need for an ELK stack or a Datadog log subscription.
Metrics and alerting: Grafana Cloud free tier. Grafana Cloud's free tier includes 10,000 metrics series, 50GB logs, and 50GB traces. For a seed-stage application, this is more than enough. Set up a few dashboards for the critical metrics and configure PagerDuty or Opsgenie alerts for the things that should wake someone up.
Uptime monitoring: Better Uptime or UptimeRobot. Simple external checks that verify your application is responding. These catch the failure modes that internal monitoring misses: DNS issues, TLS certificate expiry, load balancer misconfiguration. Free tiers are sufficient for seed stage.
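The hosted services do this for you, but it helps to see what one of these external checks involves. The sketch below covers only the TLS certificate expiry case, using Python's standard `ssl` module. The function names, the 14-day warning window, and the return shape are illustrative assumptions, not how any particular uptime service works.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp.

    not_after uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2030 GMT".
    """
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def check_tls(host, port=443, warn_days=14, timeout=5.0):
    """Connect to host, read its certificate, and flag a near expiry."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    remaining = days_until_expiry(cert["notAfter"])
    return {"host": host, "days_left": remaining, "ok": remaining > warn_days}
```

Calling `check_tls("example.com")` makes a real network connection, so it has to run from outside your own infrastructure to count as an external check.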
Error tracking: Sentry. Sentry's free tier handles 5,000 events per month. For most seed-stage applications, this is enough to catch unhandled exceptions, group them by root cause, and track resolution. Sentry integrates with Slack, so the team sees errors in real time without checking a dashboard.
Total cost for this stack: $0–200/month depending on volume, versus $2K–5K/month for the equivalent Datadog or New Relic setup.
The 15 metrics that matter
You do not need hundreds of metrics. You need the right ones. Here are the fifteen I configure on every engagement:
Application health: request rate, error rate (4xx and 5xx separately), p50/p95/p99 latency, active connections.
Database: query count, slow query count (queries over 100ms), connection pool utilization, replication lag (if applicable).
Infrastructure: CPU utilization, memory utilization, disk usage, container restart count.
Business-critical paths: authentication success/failure rate, payment processing success/failure rate, and one metric specific to your core value prop (API calls made, documents processed, messages sent, whatever your product actually does).
Each of these should have an alert threshold. Not "alert when anything changes" — alert when a specific threshold is crossed. Error rate above 5% for 3 minutes. P99 latency above 2 seconds for 5 minutes. Disk usage above 85%. Payment failure rate above 2%.
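A rule like "error rate above 5% for 3 minutes" has two parts: a threshold and a duration the condition must hold. The sketch below shows one way to evaluate that, assuming periodic samples of the metric. The `ThresholdRule` class and `should_fire` function are hypothetical names for illustration; Grafana and similar tools implement this logic for you in their alert rules.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    name: str
    threshold: float   # fire when the value is strictly above this
    for_seconds: int   # and has stayed above it for at least this long


def should_fire(rule, samples, now):
    """samples: list of (unix_timestamp, value) pairs, oldest first.

    Fires only if the data reaches back to the start of the window and
    every sample inside the window is above the threshold.
    """
    window_start = now - rule.for_seconds
    covers_window = bool(samples) and samples[0][0] <= window_start
    recent = [value for ts, value in samples if ts >= window_start]
    return covers_window and bool(recent) and all(
        value > rule.threshold for value in recent
    )


# "Error rate above 5% for 3 minutes", sampled once a minute.
rule = ThresholdRule(name="error_rate", threshold=0.05, for_seconds=180)
sustained = [(760, 0.01), (820, 0.06), (880, 0.07), (940, 0.08), (1000, 0.09)]
brief_spike = [(760, 0.01), (820, 0.02), (880, 0.07), (940, 0.08), (1000, 0.09)]
```

Here `should_fire(rule, sustained, now=1000)` fires, while `brief_spike` does not because one sample inside the window is below the threshold. The duration requirement is what keeps a single bad minute from paging anyone.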
Do not set the thresholds once and forget them. Calibrate them over the first few weeks as you learn what normal looks like for your application.
Structured logging done right
The single highest-leverage thing a startup can do for observability is switch from unstructured to structured logging. This takes half a day and pays dividends for the life of the product.
Concretely: every log line should be a JSON object with at minimum these fields: timestamp, level (info, warn, error), message, requestId (a unique ID per incoming request), userId (if authenticated), and service (the name of the application or module).
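As a rough illustration of those fields, here is a minimal sketch built on Python's standard `logging` module only, so it runs without extra dependencies. A dedicated library such as structlog gives you the same result with less code. The service name, the logger name, and the use of `contextvars` to carry the request and user IDs are assumptions made for the example.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Per-request context; a web framework hook would set these at request start.
request_id_var = contextvars.ContextVar("request_id", default=None)
user_id_var = contextvars.ContextVar("user_id", default=None)

SERVICE_NAME = "api"  # hypothetical service name


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        # Python calls the level "WARNING"; map it to the shorter "warn".
        if record.levelno == logging.WARNING:
            level = "warn"
        else:
            level = record.levelname.lower()
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": level,
            "message": record.getMessage(),
            "requestId": request_id_var.get(),
            "userId": user_id_var.get(),
            "service": SERVICE_NAME,
        }
        if record.exc_info:
            entry["stack"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(stream=None):
    """Return a logger that writes JSON lines to stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With this in place, `build_logger().info("order created")` emits a single JSON line carrying whatever request and user IDs are set in the current context.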
For HTTP APIs, log the request (method, path, status code, duration) on every response. For errors, log the error message, the stack trace, and whatever context is needed to reproduce the issue.
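A sketch of what that per-response entry might contain, independent of any web framework. The `handler` argument is a hypothetical zero-argument callable standing in for your route handler; a real application would do this in middleware, and the field names are only a suggestion.

```python
import time
import traceback


def timed_request_log(method, path, handler):
    """Run handler() and return (status, log_entry).

    handler is a hypothetical callable that returns an HTTP status code.
    An unhandled exception is recorded as a 500 with its stack trace.
    """
    start = time.perf_counter()
    entry = {"method": method, "path": path}
    try:
        status = handler()
    except Exception as exc:
        status = 500
        entry["error"] = repr(exc)
        entry["stack"] = traceback.format_exc()
    entry["status"] = status
    entry["durationMs"] = round((time.perf_counter() - start) * 1000, 2)
    return status, entry


def failing_handler():
    """Hypothetical handler that fails, to show the error fields."""
    raise ValueError("bad input")


ok_status, ok_entry = timed_request_log("GET", "/health", lambda: 200)
err_status, err_entry = timed_request_log("POST", "/pay", failing_handler)
```

The entry is a plain dictionary, so it can be passed to whichever structured logger you use.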
The requestId is the most important field. When a user reports a bug, you search by their user ID, find the request ID, and pull every log line for that request. The debugging time drops from hours to minutes.
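Your logging service does this with a search query, but the underlying lookup is simple enough to sketch. The example assumes JSON log lines with `userId` and `requestId` fields as described above; the function name and sample lines are hypothetical.

```python
import json


def trace_user(raw_lines, user_id):
    """Group every log entry by requestId for requests made by user_id.

    Lines that lack a userId (for example, logged before authentication)
    are still included as long as they share a requestId with the user.
    """
    entries = [json.loads(line) for line in raw_lines if line.strip()]
    request_ids = {
        entry["requestId"]
        for entry in entries
        if entry.get("userId") == user_id and entry.get("requestId")
    }
    grouped = {request_id: [] for request_id in request_ids}
    for entry in entries:
        request_id = entry.get("requestId")
        if request_id in grouped:
            grouped[request_id].append(entry)
    return grouped


# Hypothetical log lines from two requests.
log_lines = [
    '{"requestId": "r1", "userId": null, "message": "request received"}',
    '{"requestId": "r1", "userId": "u42", "message": "authenticated"}',
    '{"requestId": "r2", "userId": "u7", "message": "authenticated"}',
    '{"requestId": "r1", "userId": "u42", "message": "payment failed"}',
]
by_request = trace_user(log_lines, "u42")
```

Note that the first line has no user ID yet, and it is still recovered because it shares the request ID. That is the practical reason to key on the request rather than the user.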
In Node.js, use Pino. In Python, use structlog. Both produce JSON output with minimal overhead and integrate cleanly with the logging services mentioned above.
Alerting philosophy
The wrong approach: alert on everything and create a noisy channel that everyone ignores. The right approach: alert on the things that require human action within minutes, and send everything else to a dashboard for periodic review.
Page-worthy alerts (wake someone up): application is down, error rate exceeds threshold, payment processing is failing, database is unreachable, disk is nearly full.
Dashboard-level signals (check daily): latency trending upward, slow query count increasing, memory usage trending upward, error rate slightly elevated. These are trends that need attention but not at 3 AM.
Most seed-stage teams should have 5–8 page-worthy alerts, not 50. Every additional alert reduces the team's responsiveness to all alerts. If the on-call engineer gets paged 10 times a day, they start ignoring pages.
The on-call question
At seed stage, on-call is usually informal: the founder or the first engineer checks Slack when something looks wrong. That works until it does not. The signal that you need formal on-call is the first time a production issue goes unnoticed for more than 30 minutes during business hours, or the first time a customer finds a bug before you do.
Formal does not mean heavy. At seed stage, formal on-call means: one person is designated as on-call each week, they have PagerDuty on their phone, the alerting system sends page-worthy alerts to PagerDuty, and there is a documented escalation path. That is a one-hour setup.
When to upgrade
The budget stack works until approximately Series A or 50K monthly active users, whichever comes first. The signals that you need to upgrade:
Log volume exceeds free tiers. You are ingesting more than the free tier allows and the cost of the budget tool is approaching the cost of a more capable tool.
Correlation is manual. You are copying request IDs between the logging tool, the error tracker, and the metrics dashboard to piece together what happened. This is the signal for an integrated platform.
Multiple services. You have extracted your first microservice and debugging cross-service requests without distributed tracing is painful.
Compliance requirements. SOC 2 or similar requires specific log retention periods, access controls, and audit trails that the budget tools may not support.
At that point, Datadog, Grafana Cloud paid tiers, or Honeycomb become worth the investment. The structured logging and metrics you set up at seed stage migrate cleanly to any of these platforms.
Counterpoint: sometimes Datadog is right from day one
If your application handles financial transactions, health data, or anything else where a missed error has regulatory or legal consequences, the budget stack may not be sufficient. The compliance requirements for log retention, access controls, and audit trails may require an enterprise tool from the start. Factor the observability cost into your infrastructure budget and treat it as non-negotiable.
Your next step
This week, do three things. First, add structured JSON logging to your application if you do not have it. Second, set up Sentry for error tracking. Third, configure one external uptime check. Those three steps take a few hours total and give you the foundation for everything else.
Where I come in
Setting up the observability stack is a standard part of the infrastructure review I do at fractional CTO engagements. It is usually a 2–3 day effort that gives the team visibility into production without the enterprise price tag. Book a call if your team is debugging by reading console output or paying too much for monitoring you do not fully use.