Building High-Load APIs: Problems, Solutions, and Practical Advice
A practical deep-dive into the eight most common bottlenecks that emerge as APIs scale from hundreds to tens of thousands of requests per second. Covers caching strategies, async queues, observability, security, and more — with real code examples.
When a system grows from a hundred to ten thousand requests per second, problems appear that are invisible at smaller scale. In nine out of ten cases the database is the first suspect — but it is rarely the only one. This article walks through eight recurring problems in high-load API development and the concrete solutions that actually work.
Problem 1: The Database Is Always the Suspect
As load increases, the database becomes the bottleneck. The solution is layered caching applied at every level.
In-Memory Cache. Use an LRU cache for frequently requested data such as reference tables and category lists. Do not forget mutexes and negative caching — storing the fact that a key does not exist, to avoid hammering the database with repeated misses.
Distributed Cache. Redis or Memcached for sessions and the results of expensive queries.
Two caching strategies worth knowing:
- Cache-Aside: The application checks the cache first. On a miss it goes to the database, loads the data, and stores it in the cache for next time.
- Cache Stampede Protection: When the TTL on a popular key expires, many requests arrive simultaneously and all try to reload from the database at once. Use a distributed lock so only one request rebuilds the cache entry while the others wait.
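The cache-aside path with stampede protection can be sketched in-process like this. It is a minimal sketch, not a production cache: the `Loader` callback stands in for the database query, and the single coarse `load_mutex_` stands in for a per-key distributed lock (e.g. `SET NX` in Redis). Caching `std::nullopt` gives us negative caching for free.

```cpp
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// Sketch of cache-aside with single-flight stampede protection.
class CacheAside {
public:
    using Loader = std::function<std::optional<std::string>(const std::string&)>;

    explicit CacheAside(Loader loader) : loader_(std::move(loader)) {}

    std::optional<std::string> get(const std::string& key) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = cache_.find(key);
            if (it != cache_.end()) {
                // Negative caching: a cached nullopt means "known missing",
                // so we do not hit the database again for this key.
                return it->second;
            }
        }
        // Single-flight: only one thread at a time rebuilds an entry.
        std::lock_guard<std::mutex> load_lock(load_mutex_);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = cache_.find(key);
            if (it != cache_.end()) return it->second;  // another thread filled it
        }
        std::optional<std::string> value = loader_(key);  // the "database" hit
        std::lock_guard<std::mutex> lock(mutex_);
        cache_[key] = value;  // store hits and misses alike
        return value;
    }

private:
    Loader loader_;
    std::unordered_map<std::string, std::optional<std::string>> cache_;
    std::mutex mutex_;
    std::mutex load_mutex_;  // coarse stand-in for a per-key distributed lock
};
```

The double-check after acquiring `load_mutex_` is the essential move: a waiting thread re-reads the cache before loading, so the herd collapses into a single database query.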
Database optimization. Use EXPLAIN ANALYZE to understand query plans. Learn the difference between B-Tree, GIN, and GiST indexes and apply them appropriately. Watch for the N+1 problem in ORM usage by logging SQL queries in development. Pre-compute aggregates at write time rather than recalculating them on every read.
Scaling the database. Add read replicas and route all SELECT statements to them, keeping writes on the primary. For larger systems, consider sharding to distribute data physically across servers.
Problem 2: Synchronous and Blocking Operations
When a single API call must charge a card, create a database record, send an email, and dispatch an SMS — all synchronously — users wait ten seconds or more. The solution is asynchronous processing via message queues.
The pattern:
- The API performs only the minimum: input validation and the single critical operation.
- All remaining tasks are enqueued (RabbitMQ, Kafka, AWS SQS).
- The API returns `202 Accepted` immediately.
- Separate worker processes consume the queue asynchronously.
Reliability requires several supporting patterns:
- Acknowledgements: A worker sends an `ack` only after it has successfully completed the task, not before.
- Idempotent consumers: Processing the same task twice must not produce duplicate side effects.
- Dead-Letter Queues (DLQ): Tasks that repeatedly fail are moved to a DLQ for investigation rather than looping forever.
- Transactional Outbox: Save the database record and the outgoing message in a single transaction so they either both commit or both roll back.
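The ack and DLQ mechanics can be sketched in a few lines. This is an illustrative in-memory model, not a broker client: real systems (RabbitMQ, SQS) handle redelivery and retry counting for you, and the `Task` type and `max_attempts` value here are assumptions made for the example.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <vector>

// A task carries its payload plus how many delivery attempts it has used.
struct Task {
    std::string payload;
    int attempts = 0;
};

class Queue {
public:
    explicit Queue(int max_attempts) : max_attempts_(max_attempts) {}

    void enqueue(std::string payload) { tasks_.push_back({std::move(payload), 0}); }

    // Drains the queue with a handler that returns true on success (the ack).
    void drain(const std::function<bool(const std::string&)>& handler) {
        while (!tasks_.empty()) {
            Task task = tasks_.front();
            tasks_.pop_front();
            ++task.attempts;
            if (handler(task.payload)) continue;   // ack only after success
            if (task.attempts >= max_attempts_) {
                dead_letter_.push_back(task);      // park for investigation
            } else {
                tasks_.push_back(task);            // nack: redeliver later
            }
        }
    }

    const std::vector<Task>& dead_letter() const { return dead_letter_; }

private:
    std::deque<Task> tasks_;
    std::vector<Task> dead_letter_;
    int max_attempts_;
};
```

Note that the ack happens strictly after the handler succeeds; a crash mid-task means the task is redelivered, which is exactly why consumers must be idempotent.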
Problem 3: Monolithic Code and Single Points of Failure
Application-level async. Instead of executing three downstream requests sequentially (300 ms + 200 ms + 500 ms = 1000 ms total), launch them in parallel and finish in 500 ms — the duration of the slowest one.
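The fan-out can be sketched with `std::async`. The `slow_call` helper is a stand-in for a real HTTP client, and the service names and delays are illustrative; the point is that all three futures are launched before any `get()` blocks.

```cpp
#include <chrono>
#include <future>
#include <string>
#include <thread>

// Simulates a downstream call that takes `ms` milliseconds.
std::string slow_call(const std::string& name, int ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
    return name + "-data";
}

std::string aggregate() {
    // Launch all three calls immediately; total latency is roughly the
    // slowest call, not the sum of all three.
    auto profile = std::async(std::launch::async, slow_call, "profile", 30);
    auto orders  = std::async(std::launch::async, slow_call, "orders", 20);
    auto recs    = std::async(std::launch::async, slow_call, "recs", 50);
    return profile.get() + "|" + orders.get() + "|" + recs.get();
}
```

`std::launch::async` forces each call onto its own thread; without it the implementation may defer execution until `get()`, which would silently serialize the calls again.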
Horizontal scaling with stateless architecture. Run 20 instances of the application. The fundamental rule: a service must not remember anything about a specific user between requests. Store sessions in Redis and files in S3. Any instance must be able to handle any request.
API Gateway. Use a single entry point (Kong, Tyk, or similar) to centralize authentication, authorization, rate limiting, routing, and response aggregation. This removes cross-cutting concerns from individual services.
Problem 4: Network and Protocol Inefficiencies
Modern protocols.
- HTTP/2 solves Head-of-Line blocking through request multiplexing over a single connection.
- HTTP/3 (QUIC) goes further: losing one packet no longer blocks all other streams in the connection.
- gRPC uses Protocol Buffers binary encoding, which is faster and more compact than JSON, and natively supports streaming.
CDN. Modern CDNs can cache API responses at edge locations worldwide, drastically reducing latency for users who are geographically distant from your origin servers.
Problem 5: Risky Deployments
Zero-downtime deployment strategies:
- Blue-Green Deployment: Maintain two identical environments. Deploy to the inactive one, verify it, then switch all traffic over. Roll back instantly by switching back.
- Canary Releasing: Deploy the new version to 5% of servers. Monitor error rates automatically and roll back if they climb, or gradually expand to 100%.
Feature Flags. Decouple deployment from release. Ship the code to production hidden behind a flag, then enable it for specific users, regions, or percentages of traffic — with no new deployment required.
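A percentage rollout can be as simple as hashing each user into a stable bucket. This `FeatureFlags` class is a hypothetical sketch: real flag systems add targeting rules, persistence, and audit trails, but the bucketing trick below is the core idea.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Each (flag, user) pair hashes into a stable bucket 0-99, so a user's
// decision never flips as long as the percentage stays the same, while the
// percentage itself can change at runtime with no redeploy.
class FeatureFlags {
public:
    void set_rollout(const std::string& flag, int percent) {
        rollout_[flag] = percent;
    }

    bool enabled(const std::string& flag, const std::string& user_id) const {
        auto it = rollout_.find(flag);
        if (it == rollout_.end()) return false;  // unknown flag: off
        std::uint64_t h = std::hash<std::string>{}(flag + ":" + user_id);
        return static_cast<int>(h % 100) < it->second;
    }

private:
    std::unordered_map<std::string, int> rollout_;
};
```

Hashing the flag name together with the user ID keeps rollouts independent: being in the first 10% for one flag says nothing about your bucket for another.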
Problem 6: Blindness in Production (Observability)
You cannot fix what you cannot see. The three pillars of observability are:
- Logs: Use structured JSON logs and attach a `trace_id` to every request so you can follow it through every service it touches.
- Metrics: Rely on percentiles — p50, p95, p99, p99.9 — rather than averages. A mean response time of 100 ms is meaningless if 1% of users wait 30 seconds. Apply the RED methodology: Rate, Errors, Duration.
- Traces: A distributed trace is a map of a request's journey through the system, composed of spans (individual operations). Tools like Jaeger or Zipkin visualize these end-to-end.
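For a batch of latency samples, a nearest-rank percentile is only a few lines. This is a sketch for clarity; production metric pipelines use streaming structures such as HDRHistogram or t-digest rather than sorting every sample.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the value below which p% of samples fall.
double percentile(std::vector<double> samples, double p) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    // 1-based rank of the ceil(p/100 * N)-th sorted sample.
    std::size_t rank = static_cast<std::size_t>(
        std::ceil(p / 100.0 * samples.size()));
    if (rank == 0) rank = 1;
    return samples[rank - 1];
}
```

Run this over real latencies and the article's point becomes concrete: a handful of 30-second outliers barely move the mean's neighbor p50 but dominate p99.9.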
Problem 7: Security
Attack protection:
- Rate Limiting: Implement the Token Bucket algorithm, keyed on IP address, API token, or a combination. Return `429 Too Many Requests` to abusive clients.
- WAF: A Web Application Firewall with custom rules blocks common attack patterns before they reach your application.
- CAPTCHA: For endpoints exposed to end users, protect against automated abuse.
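A single-node token bucket is a short sketch. In a fleet, the bucket state would live in Redis keyed by IP or API token; the capacity and refill rate below are illustrative.

```cpp
#include <algorithm>
#include <chrono>

// Token bucket: `capacity` bounds bursts, `refill_per_sec` bounds the
// sustained rate. allow() == false maps to a 429 response.
class TokenBucket {
public:
    TokenBucket(double capacity, double refill_per_sec)
        : capacity_(capacity), tokens_(capacity),
          refill_per_sec_(refill_per_sec), last_(Clock::now()) {}

    bool allow(double cost = 1.0) {
        auto now = Clock::now();
        double elapsed = std::chrono::duration<double>(now - last_).count();
        last_ = now;
        // Refill lazily based on elapsed time, capped at capacity.
        tokens_ = std::min(capacity_, tokens_ + elapsed * refill_per_sec_);
        if (tokens_ < cost) return false;  // bucket empty: reject (429)
        tokens_ -= cost;
        return true;
    }

private:
    using Clock = std::chrono::steady_clock;
    double capacity_;
    double tokens_;
    double refill_per_sec_;
    Clock::time_point last_;
};
```

The lazy refill on each call avoids any background timer: the bucket only needs to know how much time passed since it was last consulted.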
Authentication best practices:
- Always verify the JWT digital signature — never trust the payload without validation.
- Prefer RS256 (asymmetric) over HS256 (symmetric) so the verification key can be public.
- Use short-lived access tokens (5–15 minutes) paired with long-lived refresh tokens.
Problem 8: Configuration and Secret Management
Centralization. Environment variables are the minimum viable approach. For anything beyond a single service, use Consul or etcd for centralized, versioned configuration that updates without redeploys.
Secrets management. Never store passwords, API keys, or certificates in source code or environment variables visible in process listings. Use HashiCorp Vault or AWS Secrets Manager — both provide encryption at rest, audit logs, and automatic secret rotation.
Choosing the Right Tool: Polyglot Persistence
A relational database should not be used for everything. Match the storage engine to the access pattern:
- Elasticsearch for full-text search
- ClickHouse for analytics and columnar queries
- InfluxDB for time-series data and metrics
- Neo4j for graph data
Code Example: Idempotency Manager (C++)
The following is an in-process idempotency manager that guarantees each operation executes at most once per process, even under concurrent duplicate requests. (Across multiple instances, the same state would need to live in shared storage such as Redis; the structure of the solution is the same.)
```cpp
#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// The HTTP response stored for replay to duplicate requests.
struct StoredResponse {
    int http_code;
    std::string body;
};

class IdempotencyManager {
private:
    enum class Status { PROCESSING, COMPLETED, FAILED };

    struct RequestState {
        Status status;
        std::optional<StoredResponse> response;
        // Recorded so a background sweeper can evict stale entries by TTL
        // (eviction is omitted here for brevity).
        std::chrono::steady_clock::time_point creation_time;
        std::condition_variable cv;
        std::mutex state_mutex;
    };

    std::unordered_map<std::string, std::shared_ptr<RequestState>> requests;
    mutable std::mutex map_mutex;

public:
    enum class CheckResult {
        PROCEED,    // first request with this key: execute the operation
        WAIT,       // another request is still processing; client should retry
        COMPLETED   // already done: replay the stored response
    };

    CheckResult check(const std::string& key, StoredResponse& out_response) {
        std::shared_ptr<RequestState> state;
        {
            std::lock_guard<std::mutex> lock(map_mutex);
            auto it = requests.find(key);
            if (it == requests.end()) {
                // First time we see this key: claim it and let the caller proceed.
                state = std::make_shared<RequestState>();
                state->status = Status::PROCESSING;
                state->creation_time = std::chrono::steady_clock::now();
                requests[key] = state;
                return CheckResult::PROCEED;
            }
            state = it->second;
        }
        std::unique_lock<std::mutex> state_lock(state->state_mutex);
        // Wait up to 5 s for the in-flight request to finish. The predicate
        // form guards against spurious wakeups.
        if (!state->cv.wait_for(state_lock, std::chrono::seconds(5),
                                [&] { return state->status != Status::PROCESSING; })) {
            return CheckResult::WAIT;
        }
        if (state->status == Status::COMPLETED) {
            out_response = *state->response;
            return CheckResult::COMPLETED;
        }
        return CheckResult::WAIT;  // the original attempt failed; let the client retry
    }

    void complete(const std::string& key, const StoredResponse& response) {
        std::shared_ptr<RequestState> state;
        {
            std::lock_guard<std::mutex> lock(map_mutex);
            auto it = requests.find(key);
            if (it == requests.end()) return;
            state = it->second;
        }
        {
            std::lock_guard<std::mutex> state_lock(state->state_mutex);
            state->status = Status::COMPLETED;
            state->response = response;
        }
        state->cv.notify_all();  // wake any duplicates waiting in check()
    }

    // Mark the attempt as failed and release the key so a retry can proceed.
    void fail(const std::string& key) {
        std::shared_ptr<RequestState> state;
        {
            std::lock_guard<std::mutex> lock(map_mutex);
            auto it = requests.find(key);
            if (it == requests.end()) return;
            state = it->second;
            requests.erase(it);
        }
        {
            std::lock_guard<std::mutex> state_lock(state->state_mutex);
            state->status = Status::FAILED;
        }
        state->cv.notify_all();
    }
};
```
Practical Principles
- Set timeouts on every network call — no exceptions.
- Implement back-pressure: return `429 Too Many Requests` when overloaded rather than accepting unbounded work.
- Design for cost: use auto-scaling and Spot instances to avoid paying for idle capacity.
- Do not optimize prematurely, but design for evolution — make it easy to swap components later.
- Idempotency is critical in unreliable networks. Build it in from the start.
- Design for failure. Implement the Circuit Breaker pattern so a failing downstream dependency does not cascade.
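A minimal circuit breaker looks like this. The threshold and cooldown values are illustrative, and mature libraries (resilience4j, Polly) add half-open trial limits and metrics; this sketch shows only the core state machine.

```cpp
#include <chrono>

// After `threshold` consecutive failures the circuit opens and calls fail
// fast; once `cooldown` has elapsed, a trial call is let through (half-open)
// and the circuit closes again on success.
class CircuitBreaker {
public:
    CircuitBreaker(int threshold, std::chrono::steady_clock::duration cooldown)
        : threshold_(threshold), cooldown_(cooldown) {}

    bool allow() {
        if (failures_ < threshold_) return true;             // closed
        auto now = std::chrono::steady_clock::now();
        if (now - opened_at_ >= cooldown_) return true;      // half-open trial
        return false;                                        // open: fail fast
    }

    void record_success() { failures_ = 0; }                 // close the circuit

    void record_failure() {
        if (++failures_ == threshold_) {
            opened_at_ = std::chrono::steady_clock::now();   // trip the breaker
        }
    }

private:
    int threshold_;
    std::chrono::steady_clock::duration cooldown_;
    int failures_ = 0;
    std::chrono::steady_clock::time_point opened_at_;
};
```

The caller wraps every downstream request: check `allow()` first, and report the outcome with `record_success()` or `record_failure()`. Failing fast while open is what stops a dying dependency from tying up every thread in your service.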
The best architecture is the one that solves today's business and load problems without painting you into a corner tomorrow. It is a continuous process, not a one-time decision.