Building High-Performance REST APIs in Java

A slow API is not just a bad user experience. It costs money. Amazon calculated that every additional 100ms of latency reduces revenue by 1%. Google found that being 500ms slower led to 20% less traffic. Shopify handles hundreds of thousands of requests per minute during flash sales. That kind of throughput does not come from faster machines. It comes from understanding exactly where the bottlenecks are in the system and eliminating them deliberately.

Performance is not something you add after the fact. It is the result of every design decision. This post does not teach you to add caching to everything or raise the thread pool to 1000. It teaches you to read the system, find the real bottleneck, and fix the right thing, the way a senior engineer approaches the problem.

Part 1: Why Is My API Slow?

The full request lifecycle

Before optimizing anything, you need to understand where a request goes and where it spends its time.

Client
  |-- DNS lookup (1-100ms, cached after first request)
  |-- TCP handshake (1 RTT ~= 0.5-50ms depending on geography)
  |-- TLS handshake (1-2 additional RTTs)
  v
Load Balancer
  |-- Health check routing
  |-- SSL termination (if not offloaded)
  |-- Connection overhead (negligible with keep-alive)
  v
API Server (JVM)
  |-- Thread pool wait (0 if available, up to seconds if exhausted)
  |-- Request deserialization -- JSON parsing (1-50ms depending on payload size)
  |-- Authentication/Authorization (token validation, database check)
  |-- Business logic (depends on complexity)
  |-- Serialization -- response to JSON
  v
Database
  |-- Connection pool acquisition (0-30s if exhausted)
  |-- Query execution (1ms-10s depending on query complexity and indexes)
  |-- Network round-trip (0.5-2ms within the same datacenter)
  v
Cache (Redis)
  |-- Network round-trip (0.5-1ms)
  |-- Cache hit: return data
  |-- Cache miss: fall through to DB
  v
External Services
  |-- DNS + TCP + TLS (if not reusing connection)
  |-- External API latency (10ms-5s, out of your control)
  |-- Timeout handling

Request latency equals the sum of every step in the path, so the slowest step dominates the total. If the external payment API takes 800ms, optimizing the database from 50ms down to 5ms only moves total latency from roughly 850ms to roughly 805ms.

This is why profiling matters more than guessing. Senior engineers do not guess the bottleneck. They measure it.
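Before reaching for a full profiler, per-stage timing inside the request path is often enough to see which step dominates. The sketch below is a minimal illustration using only the JDK; `StageTimer` and the stage names are hypothetical, and a production service would more likely publish these timings through its metrics library.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper: accumulates wall-clock time per named stage of one request. */
final class StageTimer {
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    /** Runs the stage, records how long it took, and returns its result. */
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            // Recorded even when the stage throws, so failures are still attributed.
            nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    /** Name of the stage that consumed the most time, or null if nothing was timed. */
    String slowestStage() {
        return nanosByStage.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElse(null);
    }

    /** Read-only view of the recorded timings, in the order stages first ran. */
    Map<String, Long> nanosByStage() {
        return Collections.unmodifiableMap(nanosByStage);
    }
}
```

Wrapping each step of a handler, for example `timer.time("db", () -> repo.findById(id))`, and logging `slowestStage()` at the end of the request shows where to look first. A timer instance is meant to live for one request on one thread; it is not thread-safe.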

Realistic latency contributions in production

Here is a typical latency breakdown for an API endpoint that reads from a database and cache:

Latency of a typical cache-miss request = 53ms

Breakdown:
  Network (client to LB):    8ms   (15%)
  Auth token validation:     3ms   (6%)
  Cache lookup (Redis):      2ms   (4%)  -- cache miss
  Serialization:             5ms   (9%)
  Business logic:            2ms   (4%)
  DB query (after the miss): 25ms  (47%)
  Network (LB to client):    8ms   (15%)

The database dominates most API endpoints. Start optimizing there.

Thread contention: the hidden bottleneck

// Real production code that caused an outage:
@RestController
public class ReportController {
    private static final Map<String, Report> reportCache = new HashMap<>(); // NOT thread-safe!

    @GetMapping("/reports/{id}")
    public Report getReport(@PathVariable String id) {
        // ConcurrentModificationException in production
        // Or worse: stale reads silently returning wrong data
        return reportCache.computeIfAbsent(id, this::generateReport);
    }
}

Thread contention happens when multiple threads compete for the same resource: a lock, I/O, or CPU. Symptoms include low CPU but low throughput, and many threads in BLOCKED state in a thread dump.
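One way to repair the shared-map bug in the controller above is to switch to `ConcurrentHashMap`, whose `computeIfAbsent` is atomic per key. This is a sketch rather than the fix the original incident used; `SafeReportCache` is a hypothetical name, and an unbounded map is still a memory leak waiting to happen, so a real cache would also need a size limit or expiry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Sketch: a thread-safe replacement for the shared HashMap cache. */
final class SafeReportCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> generator;

    SafeReportCache(Function<K, V> generator) {
        this.generator = generator;
    }

    /**
     * ConcurrentHashMap.computeIfAbsent applies the function at most once per key,
     * even when many threads request the same missing key at the same moment.
     * The generator must not modify this cache itself, and it should be short:
     * other updates that land in the same bin block while it runs.
     */
    V get(K key) {
        return cache.computeIfAbsent(key, generator);
    }
}
```

The blocking behavior noted in the comment is exactly the contention described above: if report generation is slow, threads asking for the same report queue behind the first one, which is usually preferable to generating it many times.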

Garbage Collection: unpredictable latency spikes

GC pauses are the enemy of consistent low latency. With G1GC (the JVM's default collector since JDK 9, not a Spring Boot setting), a minor GC typically takes a few milliseconds. A full GC can pause the entire JVM for hundreds of milliseconds.

Timeline:
| request | request | request | request | request | request |
|  45ms   |  42ms   |  48ms   | [GC 200ms pause] |  44ms   |
                                    ^
                    This request sees 5x latency with no code change

A P99 that is far above P50 usually points to a tail problem rather than uniformly slow code: GC pressure, thread pool or connection pool contention, or a minority of slow queries.

Part 2: Understanding Performance Metrics Correctly

Why average latency is meaningless

Scenario: 100 requests in one minute
- 95 requests: 50ms
- 4 requests: 200ms
- 1 request: 5,000ms (database query missing an index)

Average: (95*50 + 4*200 + 1*5000) / 100 = 105.5ms

You report "average latency is about 105ms". The reality:
- 1% of users wait 5 seconds
- With 1 million requests per day: 10,000 requests daily take 5 seconds

Averages are skewed by outliers in a way that does not represent the user experience. Users experience the distribution, not the average.

Percentiles: how senior engineers think

P50 (Median): 50% of requests complete at or below this value. The “typical user” experience.

P95: 95% of requests complete within this value. 5% are slower. Used for setting SLAs.

P99: 99% of requests complete within this value. Important for heavy users (who send the most requests), peak traffic, and operations with fan-out.

System A:   P50=50ms  P95=100ms  P99=200ms  -- consistent and predictable
System B:   P50=40ms  P95=500ms  P99=3000ms -- bimodal, serious problems

System B looks better at P50 but is completely unacceptable at P99.
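To make the difference between the mean and the percentiles concrete, the sketch below recomputes the 100-request scenario from earlier in this part. It uses the nearest-rank definition of a percentile, which is one of several common definitions; `LatencyStats` is a hypothetical helper, not a library class.

```java
import java.util.Arrays;

final class LatencyStats {
    /** Arithmetic mean of the samples in milliseconds. */
    static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(Double.NaN);
    }

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    /** The scenario from the text: 95 requests at 50ms, 4 at 200ms, 1 at 5000ms. */
    static long[] scenario() {
        long[] samples = new long[100];
        Arrays.fill(samples, 0, 95, 50L);
        Arrays.fill(samples, 95, 99, 200L);
        samples[99] = 5000L;
        return samples;
    }

    public static void main(String[] args) {
        long[] s = scenario();
        System.out.println("mean=" + mean(s)
            + " p50=" + percentile(s, 50)
            + " p95=" + percentile(s, 95)
            + " p99=" + percentile(s, 99)
            + " max=" + percentile(s, 100));
    }
}
```

On this data the mean is 105.5ms, a value no request actually experienced, while P50 and P95 are both 50ms and P99 is 200ms. The single 5-second request only shows up at the maximum, which is why it helps to track the maximum or P99.9 next to P99.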

Why P99 matters especially in microservices:

Service A calls Service B calls Service C calls Service D

If each service has P99 latency = 100ms (and latencies are independent):
P(at least one hop exceeds its P99) = 1 - (0.99^4) ~= 3.9% of requests
Worst case when every hop hits its P99: 4 * 100ms = 400ms

With 10 services: 1 - (0.99^10) ~= 9.6% of requests hit at least one slow hop, and the worst-case path approaches 1 second even though each service's P99 is only 100ms
Microservices amplify tail latency.
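The amplification is easy to check numerically. The sketch below assumes the hops are sequential and their latencies independent, which real systems only approximate; `TailAmplification` is a hypothetical name.

```java
final class TailAmplification {
    /**
     * Probability that a request crossing `hops` independent services sees at least
     * one hop slower than that hop's given percentile (for example 99 for P99).
     */
    static double probabilityOfSlowHop(int hops, double percentile) {
        if (hops < 0 || percentile < 0 || percentile > 100) {
            throw new IllegalArgumentException("hops >= 0 and 0 <= percentile <= 100 required");
        }
        double fastFraction = percentile / 100.0;
        return 1.0 - Math.pow(fastFraction, hops);
    }

    public static void main(String[] args) {
        for (int hops : new int[] {1, 4, 10, 50}) {
            System.out.printf("%d hops: %.1f%% of requests hit a slow hop%n",
                hops, 100 * probabilityOfSlowHop(hops, 99));
        }
    }
}
```

With 4 hops about 3.9% of requests are affected, and with 10 hops about 9.6%, so a per-service P99 becomes roughly a P90 for the chain as a whole.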

Throughput vs latency: the fundamental trade-off

// Scenario: batch vs individual processing

// Option A: process each request immediately -- low latency, lower throughput
@PostMapping("/orders")
public OrderResponse createOrder(@RequestBody OrderRequest request) {
    Order order = orderService.create(request); // commit immediately
    return OrderResponse.from(order);
    // Latency: 50ms, Throughput: 200 RPS
}

// Option B: buffer and batch -- fast acknowledgement and much higher throughput,
// but the order is only processed later
@PostMapping("/orders")
public OrderResponse createOrder(@RequestBody OrderRequest request) {
    orderQueue.enqueue(request);
    return OrderResponse.accepted(); // return immediately, process async
    // Response latency: 5ms, Throughput: 5000 RPS
    // Trade-off: end-to-end latency until the order is committed is higher, and
    // the user does not get immediate success or failure confirmation
}

Good throughput and good latency do not come for free. You have to decide what to prioritize based on the use case.

Saturation: the signal before the crash

Saturation is how much of a resource is being used relative to its capacity. A connection pool 90% full is a dangerous signal even if things are currently working.

HikariCP pool: 10 connections
8 in use (80% saturation)  -- still has buffer
9 in use (90% saturation)  -- warning: a small traffic spike will exhaust the pool
10 in use (100% saturation) -- requests are queuing for a connection

Monitor saturation, not just current usage. Saturation above 70-80% on any resource needs attention.
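The thresholds above can be turned into a small classification function that an alerting job could call for any bounded resource. This is only a sketch: `Saturation` is a hypothetical class, and the 75% attention threshold is an assumption chosen from the middle of the 70-80% range mentioned in the text.

```java
final class Saturation {
    enum Level { OK, ATTENTION, WARNING, EXHAUSTED }

    /** Classifies usage of a bounded resource such as a connection pool. */
    static Level classify(int inUse, int capacity) {
        if (capacity <= 0 || inUse < 0 || inUse > capacity) {
            throw new IllegalArgumentException("expected 0 <= inUse <= capacity and capacity > 0");
        }
        long percent = 100L * inUse / capacity; // integer percent, rounded down
        if (percent >= 100) return Level.EXHAUSTED; // callers are queuing
        if (percent >= 90) return Level.WARNING;    // a small spike exhausts the resource
        if (percent >= 75) return Level.ATTENTION;  // assumed threshold, tune per resource
        return Level.OK;
    }
}
```

For the 10-connection pool in the example, 8 connections in use classifies as ATTENTION, 9 as WARNING, and 10 as EXHAUSTED.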

Part 3: The Database Is the Most Common Bottleneck

N+1 queries: how Hibernate kills performance

// Controller returning merchants with their orders
@GetMapping("/merchants")
public List<MerchantResponse> getMerchants() {
    List<Merchant> merchants = merchantRepo.findAll(); // 1 query

    return merchants.stream()
        .map(merchant -> MerchantResponse.builder()
            .id(merchant.getId())
            .name(merchant.getName())
            .orderCount(merchant.getOrders().size()) // N queries! Lazily loads each merchant's orders
            .totalRevenue(merchant.getOrders().stream()
                .mapToDouble(o -> o.getTotal().doubleValue())
                .sum())
            .build())
        .toList();
    // With 100 merchants: 1 + 100 = 101 queries
    // Each query at 2ms: 202ms of pure N+1 overhead
}

Detecting N+1 in development:

// Use datasource-proxy to count queries per request
@Bean
public DataSource dataSource(DataSourceProperties properties) {
    HikariDataSource ds = properties.initializeDataSourceBuilder()
        .type(HikariDataSource.class).build();

    return ProxyDataSourceBuilder.create(ds)
        .name("Query-Counter")
        .countQuery() // accumulates per-thread counts in QueryCountHolder
        .logSlowQueryBySlf4j(50, TimeUnit.MILLISECONDS)
        .build();
}

// afterQuery(...) fires once per statement execution, so it cannot see a per-request total.
// Read the accumulated count when the request ends instead, for example in a servlet filter:
QueryCount count = QueryCountHolder.getGrandTotal();
if (count.getTotal() > 10) {
    log.warn("N+1 suspected: {} queries in one request", count.getTotal());
}
QueryCountHolder.clear(); // reset before the thread serves its next request

Fix: aggregate in the database:

// One query that fetches all required data
@Query("""
    SELECT new com.example.dto.MerchantStats(
        m.id,
        m.name,
        COUNT(o.id),
        COALESCE(SUM(o.total), 0)
    )
    FROM Merchant m
    LEFT JOIN m.orders o
    GROUP BY m.id, m.name
    """)
List<MerchantStats> findAllWithStats();
// 1 query, the database aggregates, no lazy loading needed

Fix 2: JOIN FETCH when you need full entities:

@Query("""
    SELECT DISTINCT m FROM Merchant m
    LEFT JOIN FETCH m.orders o
    WHERE m.active = true
    """)
List<Merchant> findActiveWithOrders();

SELECT *: not just wasteful

// Imagine a Product entity with 30 fields including:
@Entity
public class Product {
    // ... 25 normal fields ...
    @Lob
    private byte[] fullDescription; // HTML content, 50KB per product
    @Lob
    private byte[] technicalManual; // PDF, 5MB per product
}

// API listing 50 products:
Page<Product> products = productRepo.findAll(pageable); // SELECT * loads up to 5MB * 50 = 250MB!

// Fix: projection fetches only what is needed
public interface ProductSummary {
    Long getId();
    String getName();
    String getSku();
    BigDecimal getPrice();
    // fullDescription and technicalManual excluded
}

List<ProductSummary> products = productRepo.findAllProjectedBy(pageable);
// SELECT id, name, sku, price FROM products -- under 1KB per product

Quarkus Panache with projection:

@ApplicationScoped
public class ProductRepository implements PanacheRepository<Product> {
    public List<ProductSummaryDTO> findSummaries(int page, int pageSize) {
        return find("active = true")
            .page(page, pageSize)
            .project(ProductSummaryDTO.class)
            .list();
    }
}

Over-fetching and under-fetching in API design

Over-fetching: the client receives more data than it needs. Impact: bandwidth, serialization time, memory.

Under-fetching: the client must make multiple requests to get enough data. Impact: additional network round-trips, higher latency.

// Over-fetching: one endpoint returns everything
@GetMapping("/users/{id}")
public User getUser(@PathVariable Long id) {
    return userRepo.findById(id).orElseThrow();
    // Returns: profile, settings, 200 orders, 50 reviews, payment methods...
    // A mobile app only needs name and avatar for the list view
}

// Better pattern: sparse fieldsets or separate endpoints
@GetMapping("/users/{id}")
public UserResponse getUser(
    @PathVariable Long id,
    @RequestParam(required = false) Set<String> fields
) {
    User user = userRepo.findById(id).orElseThrow();
    return UserResponse.of(user, fields); // Only include requested fields
}

// Client calls: GET /users/123?fields=id,name,avatarUrl

Part 4: Connection Pool Optimization

Why opening a database connection is expensive

Creating a new JDBC connection to PostgreSQL involves:

- TCP handshake (0.5-2ms)
- TLS handshake if encrypted (additional 1-2ms)
- PostgreSQL authentication protocol (1-2 round-trips)
- Session parameter negotiation
- Server-side memory allocation for the session

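Total: 5-15ms of overhead to get a single connection. At 1000 requests per second, creating a new connection for each request adds up to 5-15 seconds of connection setup work for every second of traffic, split between the application and the database server, on top of the extra latency each request pays. Unless that work is spread across many cores, the system can never catch up.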

Connection pooling keeps connections warm and reuses them, eliminating this overhead entirely.

HikariCP sizing: a practical formula

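From the HikariCP documentation ("About Pool Sizing"), where core_count is the number of cores on the database server, not the application server:

pool_size = (core_count * 2) + effective_spindle_count

With SSD: effective_spindle_count = 1
4-core database server with SSD: pool_size = (4 * 2) + 1 = 9

Treat the result as a starting point for load testing, not a final answer.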

Why “more connections = better” is wrong:

PostgreSQL server: max_connections = 100
Application: 20 instances, each with pool_size = 20
Total connections = 400 -- PostgreSQL rejects connections!

Correct: pool_size = (100 - 20 reserved) / 20 instances = 4 connections per instance
         (The 20 reserved connections cover admin, monitoring, and migrations)
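Both calculations fit in a few lines of code, which makes them easy to keep next to deployment configuration. This is a sketch under the assumptions stated in the text (database core count, one effective spindle for SSD, a fixed number of reserved connections); `PoolSizing` and its method names are hypothetical.

```java
final class PoolSizing {
    /** Starting-point formula from the HikariCP pool sizing guidance. */
    static int recommendedPoolSize(int dbCoreCount, int effectiveSpindleCount) {
        if (dbCoreCount <= 0 || effectiveSpindleCount < 0) {
            throw new IllegalArgumentException("core count must be positive, spindles non-negative");
        }
        return dbCoreCount * 2 + effectiveSpindleCount;
    }

    /** Connections each app instance may hold without exceeding the server's limit. */
    static int perInstanceBudget(int maxConnections, int reservedForOps, int instances) {
        if (instances <= 0 || reservedForOps < 0 || maxConnections <= reservedForOps) {
            throw new IllegalArgumentException("need instances > 0 and maxConnections > reserved");
        }
        return (maxConnections - reservedForOps) / instances; // rounded down on purpose
    }

    public static void main(String[] args) {
        System.out.println("formula: " + recommendedPoolSize(4, 1));
        System.out.println("budget per instance: " + perInstanceBudget(100, 20, 20));
    }
}
```

The effective pool size per instance is the smaller of the two numbers: the formula says how many connections the database can use productively, and the budget says how many it will accept at all. With max_connections = 100, 20 reserved, and 20 instances, the budget is 4 per instance.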

# application.yml -- Production-ready HikariCP config
spring:
  datasource:
    hikari:
      maximum-pool-size: 10
      minimum-idle: 5
      connection-timeout: 5000        # 5s -- fail fast, do not make users wait
      idle-timeout: 300000            # 5 minutes
      max-lifetime: 1800000           # 30 minutes -- recycle before firewall timeout
      keepalive-time: 60000           # 1-minute ping
      leak-detection-threshold: 10000 # 10s -- warn on connection leaks
      validation-timeout: 3000        # 3s timeout to validate a connection

Pool exhaustion: symptoms and diagnosis

Symptoms:
- API latency spikes from 50ms to 5+ seconds
- Error: "Unable to acquire JDBC Connection" or
          "Connection is not available, request timed out after 5000ms"
- Database CPU is low (the DB is not busy, the app is waiting for connections)
- hikaricp_connections_pending > 0

Typical timeline:
T=0s   Traffic spike (flash sale, new deploy)
T=30s  Pool hits max, requests start queuing
T=35s  connection-timeout begins expiring -- 503 errors appear
T=40s  Error rate above 50%
T=45s  Traffic self-reduces (clients give up) -- or worse, a retry storm begins

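// Detect pool exhaustion before it becomes an incident
@Component
public class ConnectionPoolHealthCheck {
    private static final Logger log = LoggerFactory.getLogger(ConnectionPoolHealthCheck.class);

    @Autowired private HikariDataSource dataSource;

    @PostConstruct
    void registerGauge() {
        // Register the gauge once with a value function. Micrometer keeps only a weak
        // reference to a gauge's state object, so re-registering a fresh boxed Double
        // on every tick would not update the metric.
        Metrics.gauge("hikaricp.utilization", dataSource, ds -> {
            HikariPoolMXBean pool = ds.getHikariPoolMXBean();
            return pool == null ? 0.0
                : (double) pool.getActiveConnections() / ds.getMaximumPoolSize();
        });
    }

    @Scheduled(fixedRate = 5000)
    public void checkPool() {
        HikariPoolMXBean pool = dataSource.getHikariPoolMXBean();
        if (pool == null) return; // pool has not been started yet
        int pending = pool.getThreadsAwaitingConnection();

        if (pending > 0) {
            log.error("CONNECTION POOL: {} requests waiting. Active={}, Idle={}, Total={}, Max={}",
                pending,
                pool.getActiveConnections(),
                pool.getIdleConnections(),
                pool.getTotalConnections(),
                dataSource.getMaximumPoolSize());
        }
    }
}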

PgBouncer: connection pooling at the database level

When many application instances need more connections than PostgreSQL can handle:

Without PgBouncer:
20 app instances * 10 connections = 200 connections to PostgreSQL
PostgreSQL overhead: ~5-10MB memory per connection = 1-2GB just for connections

With PgBouncer (transaction pooling mode):
20 app instances * 10 = 200 connections to PgBouncer
PgBouncer: 10-20 connections to PostgreSQL
PostgreSQL sees only 10-20 connections, scaling to thousands of app connections

# pgbouncer.ini
# Comments must sit on their own lines: PgBouncer does not treat "#" or ";" as a
# comment marker when it appears after a value.
[databases]
mydb = host=localhost dbname=mydb

[pgbouncer]
# Reuse the server connection after each transaction
pool_mode = transaction
# Max client connections to PgBouncer
max_client_conn = 1000
# Connections from PgBouncer to PostgreSQL
default_pool_size = 20
min_pool_size = 5
# Emergency connections
reserve_pool_size = 5
server_idle_timeout = 300
Trade-off: Transaction pooling mode breaks features that depend on session state: session-level advisory locks, SET parameters, and, before PgBouncer 1.21, protocol-level prepared statements. Verify compatibility before adopting it.

Part 5: Serialization and JSON Performance

Why serialization is expensive at scale

At 10,000 requests per second, each serializing a 10KB response:

10,000 RPS * 10KB = 100MB per second of JSON serialization
+ Jackson uses reflection to read field names and values
+ Object allocation for intermediate representation
+ GC pressure from short-lived objects

Jackson reflection can consume 10-30% of CPU in high-throughput services.

Jackson internals and tuning

Jackson has two layers when serializing:

- ObjectMapper: the high-level API, caches type information
- JsonSerializer: per-type serializer, generated or reflection-based

// ObjectMapper is EXPENSIVE to create -- create once and reuse
// Spring Boot manages this automatically, but understanding it matters

@Configuration
public class JacksonConfig {
    @Bean
    @Primary
    public ObjectMapper objectMapper() {
        return JsonMapper.builder()
            // Performance: skip null fields to reduce payload size
            .serializationInclusion(JsonInclude.Include.NON_NULL)
            // Robustness: ignore unknown fields instead of failing on them
            .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
            // Readability: write dates as ISO-8601 strings, not numeric timestamps or arrays
            .configure(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS, false)
            // Register type handlers
            .addModule(new JavaTimeModule())
            // Performance: AfterburnerModule uses bytecode generation instead of reflection
            .addModule(new AfterburnerModule())
            .build();
    }
}

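AfterburnerModule matters most for throughput: it replaces reflection with generated bytecode for property access. The size of the gain depends on payload shape and JDK version, so benchmark your own DTOs before counting on a 2-5x improvement. On Java 11 and later, Jackson's BlackbirdModule is positioned as its successor.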

MapStruct instead of BeanUtils:

// BeanUtils.copyProperties: uses reflection, slow
BeanUtils.copyProperties(order, orderDto);

// MapStruct: compile-time code generation, zero reflection overhead
@Mapper(componentModel = "spring")
public interface OrderMapper {
    OrderDto toDto(Order order);
    Order toEntity(OrderDto dto);
}

// MapStruct generates code like:
public OrderDto toDto(Order order) {
    OrderDto dto = new OrderDto();
    dto.setId(order.getId());
    dto.setStatus(order.getStatus().name());
    // ... pure Java, no reflection
    return dto;
}

Circular references and object graph explosion

// Circular reference: causes StackOverflowError or infinite JSON output
@Entity
public class Order {
    @ManyToOne
    private Customer customer; // Customer has List<Order>
}

@Entity
public class Customer {
    @OneToMany
    private List<Order> orders; // Orders have Customer -- infinite loop
}

// Jackson annotation fix:
@Entity
public class Customer {
    @OneToMany
    @JsonManagedReference // serialize this side
    private List<Order> orders;
}

@Entity
public class Order {
    @ManyToOne
    @JsonBackReference // do not serialize this side
    private Customer customer;
}

// Better: create a dedicated DTO and break the circular reference explicitly
public record OrderDto(Long id, String status, Long customerId) {}
// Never serialize entities directly

Response compression: 60-90% bandwidth reduction

# application.yml -- Enable compression
server:
  compression:
    enabled: true
    mime-types: application/json,text/html,text/plain
    min-response-size: 1024  # Only compress responses larger than 1KB

Without compression:  100KB JSON response -> 100KB transferred
With gzip:            100KB JSON response -> ~10-15KB (85-90% smaller)
With brotli:          100KB JSON response -> ~8-12KB (88-92% smaller)

Trade-off: Compression uses CPU. For responses under 1KB, the compression overhead exceeds the bandwidth savings. A CDN generally handles compression better, so offload it there when possible.
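The effect is easy to measure with the JDK's own gzip classes. The sketch below compresses a synthetic, repetitive JSON array; `GzipDemo` and the sample payload are made up for illustration, and real responses will compress more or less depending on how repetitive they are.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipDemo {
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the trailer, so read the buffer afterwards.
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a JSON array of made-up order summaries with repeated keys, like a list endpoint. */
    static String sampleJson(int items) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < items; i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"id\":").append(i)
              .append(",\"status\":\"PAID\",\"currency\":\"USD\",\"totalAmount\":")
              .append(100 + i).append('}');
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        byte[] raw = sampleJson(2000).getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.printf("raw=%d bytes, gzip=%d bytes, saved=%.0f%%%n",
            raw.length, packed.length, 100.0 * (raw.length - packed.length) / raw.length);
    }
}
```

Running the same measurement on a captured production response is a quick way to decide whether the CPU cost is worth it for a given endpoint.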

Streaming for large responses

// Instead of loading everything into memory then serializing:
@GetMapping("/reports/export")
public ResponseEntity<List<SalesRecord>> exportReport() {
    List<SalesRecord> allRecords = reportRepo.findAll(); // 1GB into heap!
    return ResponseEntity.ok(allRecords);
}

// Use StreamingResponseBody: write directly to the output stream
@GetMapping(value = "/reports/export", produces = MediaType.APPLICATION_JSON_VALUE)
public ResponseEntity<StreamingResponseBody> exportReport() {
    StreamingResponseBody stream = outputStream -> {
        // try-with-resources closes the generator and the DB cursor even if writing fails
        try (JsonGenerator generator = objectMapper.getFactory().createGenerator(outputStream);
             Stream<SalesRecord> records = reportRepo.findAllAsStream()) { // stream from DB
            generator.writeStartArray();

            records.forEach(record -> {
                try {
                    objectMapper.writeValue(generator, record);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });

            generator.writeEndArray();
        }
    };

    return ResponseEntity.ok()
        .contentType(MediaType.APPLICATION_JSON)
        .body(stream);
}
// Memory usage: roughly constant, independent of data size, provided the repository
// streams rows inside an open read-only transaction and entities are not retained

Part 6: REST API Design for Performance

Offset pagination: why it does not scale

-- Page 1: fast
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 0;

-- Page 500: PostgreSQL must:
-- 1. Scan from the beginning (may use an index)
-- 2. Skip 9,980 rows
-- 3. Return 20 rows
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 9980;
-- Cost is linear with offset -- page 5000 scans and discards 100,000 rows

Keyset pagination: O(log n) regardless of page number:

// Request: GET /orders?limit=20&after_created=2026-06-01T10:00:00Z&after_id=9876
@GetMapping("/orders")
public PageResponse<OrderDto> getOrders(
    @RequestParam int limit,
    // Bind the snake_case query parameters explicitly; the Java names alone would not match
    @RequestParam(name = "after_created", required = false) Instant afterCreated,
    @RequestParam(name = "after_id", required = false) Long afterId
) {
    List<Order> orders;

    if (afterCreated == null || afterId == null) {
        // First page
        orders = orderRepo.findFirstPage(limit + 1);
    } else {
        // Subsequent pages -- cursor-based; fetch one extra row to detect a further page
        orders = orderRepo.findNextPage(afterCreated, afterId, PageRequest.of(0, limit + 1));
    }

    boolean hasMore = orders.size() > limit;
    List<Order> page = hasMore ? orders.subList(0, limit) : orders;

    String nextCursor = hasMore
        ? buildCursor(page.get(page.size() - 1))
        : null;

    return PageResponse.of(page.stream().map(mapper::toDto).toList(), nextCursor);
}

// Repository:
@Query("""
    SELECT o FROM Order o
    WHERE (o.createdAt < :afterCreated)
       OR (o.createdAt = :afterCreated AND o.id < :afterId)
    ORDER BY o.createdAt DESC, o.id DESC
    """)
List<Order> findNextPage(
    @Param("afterCreated") Instant afterCreated,
    @Param("afterId") Long afterId,
    Pageable pageable
);
// Index: (created_at DESC, id DESC) -- O(log n) traversal

Response format with cursor:

{
  "data": [...],
  "pagination": {
    "hasMore": true,
    "nextCursor": "eyJjcmVhdGVkQXQiOiIyMDI2LTA2LTAxVDEwOjAwOjAwWiIsImlkIjo5ODc2fQ==",
    "limit": 20
  }
}
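The `nextCursor` value in the response is the Base64 encoding of a small JSON object holding the last row's sort keys. A minimal sketch of such a codec follows; `CursorCodec` is a hypothetical stand-in for the `buildCursor` helper, the JSON is assembled by hand to keep the example dependency-free, and a production cursor would usually also be signed or encrypted so clients cannot tamper with it.

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CursorCodec {
    /** Sort keys of the last row on the current page. */
    record Cursor(Instant createdAt, long id) {}

    private static final Pattern SHAPE =
        Pattern.compile("\\{\"createdAt\":\"([^\"]+)\",\"id\":(\\d+)\\}");

    /** Encodes the cursor as URL-safe Base64 so it can travel in a query parameter. */
    static String encode(Cursor cursor) {
        String json = "{\"createdAt\":\"" + cursor.createdAt() + "\",\"id\":" + cursor.id() + "}";
        return Base64.getUrlEncoder().encodeToString(json.getBytes(StandardCharsets.UTF_8));
    }

    /** Decodes a cursor, rejecting anything that does not have the expected shape. */
    static Cursor decode(String token) {
        try {
            String json = new String(Base64.getUrlDecoder().decode(token), StandardCharsets.UTF_8);
            Matcher m = SHAPE.matcher(json);
            if (!m.matches()) {
                throw new IllegalArgumentException("unexpected cursor payload");
            }
            return new Cursor(Instant.parse(m.group(1)), Long.parseLong(m.group(2)));
        } catch (RuntimeException e) {
            // Covers bad Base64, bad timestamps, and numeric overflow alike.
            throw new IllegalArgumentException("Malformed cursor", e);
        }
    }
}
```

The client treats the token as opaque and sends it back unchanged. Keeping it opaque leaves the server free to change the sort keys later without breaking clients.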

Filtering and sorting: direct impact on the database

// Dynamic filter builder -- no string concatenation (SQL injection risk!)
@GetMapping("/orders")
public Page<OrderDto> searchOrders(
    @RequestParam(required = false) String status,
    @RequestParam(required = false) Long merchantId,
    @RequestParam(required = false) @DateTimeFormat(iso = DateTimeFormat.ISO.DATE) LocalDate dateFrom,
    @RequestParam(required = false) @DateTimeFormat(iso = DateTimeFormat.ISO.DATE) LocalDate dateTo,
    @RequestParam(defaultValue = "createdAt") String sortBy,
    @RequestParam(defaultValue = "DESC") String sortDir,
    Pageable pageable
) {
    Specification<Order> spec = Specification.where(null);

    if (status != null) spec = spec.and(OrderSpecs.hasStatus(status));
    if (merchantId != null) spec = spec.and(OrderSpecs.forMerchant(merchantId));
    if (dateFrom != null) spec = spec.and(OrderSpecs.createdAfter(dateFrom));
    if (dateTo != null) spec = spec.and(OrderSpecs.createdBefore(dateTo));

    // Validate sort column -- prevent injection, prevent sorting on non-indexed columns
    Sort sort = buildSafeSort(sortBy, sortDir);

    return orderRepo.findAll(spec, PageRequest.of(pageable.getPageNumber(),
                                                   pageable.getPageSize(), sort))
                    .map(mapper::toDto);
}

private Sort buildSafeSort(String sortBy, String sortDir) {
    Set<String> allowedFields = Set.of("createdAt", "totalAmount", "status");
    if (!allowedFields.contains(sortBy)) {
        sortBy = "createdAt"; // default safe fallback
    }
    Sort.Direction direction = sortDir.equalsIgnoreCase("ASC")
        ? Sort.Direction.ASC : Sort.Direction.DESC;
    return Sort.by(direction, sortBy);
}

Performance consideration for dynamic filters: Each filter combination may need its own index. With 5 filterable fields, there can be many combinations. Solutions:

- Index the highest-cardinality, most-common filter columns
- Composite indexes for the most common combinations
- Elasticsearch or PostgreSQL full-text search for complex queries

Sparse fieldsets: reducing payload and database fetch

// Client requests only the fields it needs
// GET /orders/123?fields=id,status,totalAmount

@GetMapping("/orders/{id}")
public Map<String, Object> getOrder(
    @PathVariable Long id,
    @RequestParam(required = false) Set<String> fields
) {
    Order order = orderRepo.findById(id).orElseThrow();

    if (fields == null || fields.isEmpty()) {
        return mapper.toFullMap(order);
    }

    return mapper.toPartialMap(order, fields);
}

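Sparse fieldsets reduce payload size. The version above still loads the whole row through findById, so the database fetch only shrinks when the requested fields are pushed down into the query as a projection. When that projection matches an index, the query can be served by a covering index.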
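The partial-map step can be sketched with an allow-list of field extractors, so that a client can never request a field the API does not mean to expose. Everything here is illustrative: `OrderFieldSelector`, the `Order` record, and the `internalNotes` field are assumptions standing in for the entity and mapper used above.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class OrderFieldSelector {
    /** Simplified stand-in for the Order entity; internalNotes must never be exposed. */
    record Order(long id, String status, BigDecimal totalAmount, String internalNotes) {}

    /** Allow-list: only these fields can appear in a response, in this order. */
    private static final Map<String, Function<Order, Object>> EXPOSED = new LinkedHashMap<>();
    static {
        EXPOSED.put("id", Order::id);
        EXPOSED.put("status", Order::status);
        EXPOSED.put("totalAmount", Order::totalAmount);
    }

    /** Returns the requested exposed fields, or all exposed fields when none are requested. */
    static Map<String, Object> toPartialMap(Order order, Set<String> fields) {
        boolean all = fields == null || fields.isEmpty();
        Map<String, Object> out = new LinkedHashMap<>();
        EXPOSED.forEach((name, getter) -> {
            if (all || fields.contains(name)) {
                out.put(name, getter.apply(order));
            }
        });
        return out; // unknown or non-exposed field names are silently ignored
    }
}
```

Silently ignoring unknown names keeps old clients working when fields are removed. Returning a 400 for unknown names is the stricter alternative.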

Part 7: Caching Strategies

Why caching exists

The database is expensive: disk I/O, query planning, lock acquisition. For read-heavy workloads (most web APIs are 80-95% reads), serving data from memory instead of disk has one of the highest ROIs of any optimization.

Caching only makes sense when data is read more often than it is written. Caching frequently-changing data creates consistency problems without delivering meaningful benefits.

Cache-aside (lazy loading): the most common pattern

@Service
public class ProductService {
    @Autowired private ProductRepository repo;
    @Autowired private RedisTemplate<String, Product> redisTemplate;
    private static final Duration TTL = Duration.ofMinutes(10);

    public Product getProduct(Long id) {
        String key = "product:" + id;

        // 1. Check cache
        Product cached = redisTemplate.opsForValue().get(key);
        if (cached != null) {
            return cached; // Cache hit -- ~1ms
        }

        // 2. Cache miss: load from DB
        Product product = repo.findById(id).orElseThrow(); // ~5-50ms

        // 3. Populate cache
        redisTemplate.opsForValue().set(key, product, TTL);

        return product;
    }

    public void updateProduct(Product product) {
        repo.save(product);
        // Invalidate cache
        redisTemplate.delete("product:" + product.getId());
        // Alternative: write the new data to cache immediately (write-through)
    }
}

Spring Cache abstraction, cleaner:

@Configuration
@EnableCaching
public class CacheConfig {
    @Bean
    public RedisCacheManager cacheManager(RedisConnectionFactory factory) {
        RedisCacheConfiguration config = RedisCacheConfiguration.defaultCacheConfig()
            .entryTtl(Duration.ofMinutes(10))
            .serializeValuesWith(RedisSerializationContext.SerializationPair.fromSerializer(
                new GenericJackson2JsonRedisSerializer()
            ))
            .disableCachingNullValues();

        return RedisCacheManager.builder(factory)
            .cacheDefaults(config)
            .withCacheConfiguration("products", config.entryTtl(Duration.ofHours(1)))
            .withCacheConfiguration("orders", config.entryTtl(Duration.ofMinutes(5)))
            .build();
    }
}

@Service
public class ProductService {
    @Cacheable(value = "products", key = "#id")
    public ProductDto getProduct(Long id) {
        return mapper.toDto(repo.findById(id).orElseThrow());
    }

    @CacheEvict(value = "products", key = "#product.id")
    @Transactional
    public ProductDto updateProduct(ProductDto product) {
        return mapper.toDto(repo.save(mapper.toEntity(product)));
    }

    @CacheEvict(value = "products", allEntries = true)
    @Scheduled(fixedRate = 3600000) // Clear all entries every hour
    public void evictAllProducts() {}
}

Write-through: higher consistency

// Write-through: update cache and DB at the same time
@Transactional
public Product updateProductWriteThrough(Product product) {
    Product saved = repo.save(product); // DB first
    redisTemplate.opsForValue().set(
        "product:" + saved.getId(),
        saved,
        Duration.ofMinutes(10)
    ); // Cache updated immediately after DB
    return saved;
}

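// Advantage: cache is always fresh after a write
// Disadvantage: write latency increases (DB + Redis), cache holds data that is rarely read
// Caveat: inside @Transactional the Redis write runs before the commit, so a rollback
// leaves uncommitted data in the cache until the TTL expires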

Cache stampede: when cache misses overwhelm the database

Cache key "popular-products" expires at 14:00:00.000
14:00:00.001: 1000 concurrent requests hit cache -- all miss
14:00:00.002: 1000 requests simultaneously query DB -- DB overloaded

// Fix 1: Probabilistic Early Recomputation (XFetch)
public Product getProductWithPER(Long id) {
    String key = "product:" + id;
    ValueOperations<String, CachedValue<Product>> ops = redisTemplate.opsForValue();

    CachedValue<Product> cached = ops.get(key);
    Long remainingTtl = redisTemplate.getExpire(key, TimeUnit.SECONDS);
    if (cached != null && remainingTtl != null && remainingTtl > 0) {
        double beta = 1.0; // tuning parameter: above 1 recomputes earlier, below 1 later
        double fetchSeconds = cached.getFetchTimeMs() / 1000.0; // cost of the last recompute
        // -ln(u) for u in (0, 1] is a non-negative random head start, scaled by fetch cost
        double u = 1.0 - ThreadLocalRandom.current().nextDouble();
        double headStart = -fetchSeconds * beta * Math.log(u);

        if (headStart < remainingTtl) {
            return cached.getValue(); // cache hit, still fresh
        }
        // Otherwise recompute early to warm the cache before expiry
    }

    return recomputeAndCache(id);
}

// Fix 2: Distributed lock -- only one request recomputes, rest wait
public Product getProductWithLock(Long id) throws InterruptedException {
    String key = "product:" + id;
    String lockKey = "lock:product:" + id;

    // Bounded loop instead of unbounded recursion: at most ~5s of waiting
    for (int attempt = 0; attempt < 50; attempt++) {
        Product cached = redisTemplate.opsForValue().get(key);
        if (cached != null) return cached;

        // Try to acquire lock
        Boolean locked = redisTemplate.opsForValue()
            .setIfAbsent(lockKey, "1", Duration.ofSeconds(10));

        if (Boolean.TRUE.equals(locked)) {
            try {
                Product product = repo.findById(id).orElseThrow();
                redisTemplate.opsForValue().set(key, product, Duration.ofMinutes(10));
                return product;
            } finally {
                redisTemplate.delete(lockKey);
            }
        }

        // Wait for the lock holder to finish, then re-check the cache
        Thread.sleep(100);
    }

    // The lock holder is too slow or has failed: fall back to the database
    return repo.findById(id).orElseThrow();
}
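Within a single JVM, the same "one loader, everyone else waits" idea can be had without Redis by sharing an in-flight future per key. The sketch below is a complement to the distributed lock, not a replacement for it, because it only deduplicates loads inside one application instance; `SingleFlight` is a hypothetical name.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Sketch: collapses concurrent loads of the same key into a single loader call. */
final class SingleFlight<K, V> {
    private final ConcurrentMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    V load(K key, Function<K, V> loader) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            // Another thread is already loading this key: wait for its result.
            // If that load failed, join() rethrows the failure as a CompletionException.
            return existing.join();
        }
        try {
            V value = loader.apply(key);
            mine.complete(value);
            return value;
        } catch (RuntimeException | Error e) {
            mine.completeExceptionally(e); // waiters see the failure instead of hanging
            throw e;
        } finally {
            inFlight.remove(key, mine); // later calls start a fresh load
        }
    }
}
```

Wrapping the database call from the cache-miss path in `load(key, ...)` means that 1000 simultaneous misses on one instance produce one query instead of 1000. Across 20 instances that is still up to 20 queries, which is where the Redis lock above comes in.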

Cache penetration: querying data that does not exist

Attacker or bug: continuously queries IDs that do not exist
Each request: cache miss -> DB query -> not found -> not cached (null)
Every request hits the DB -> DB overloaded

// Problem: unless = "#result == null" tells Spring not to cache misses
@Cacheable(value = "products", key = "#id", unless = "#result == null")
public ProductDto getProduct(Long id) {
    return repo.findById(id).map(mapper::toDto).orElse(null);
}
// Null is never cached, so every lookup of a missing ID still reaches the database

// Better: explicitly cache null with a short TTL
public Optional<Product> getProduct(Long id) {
    String key = "product:" + id;
    Object cached = redisTemplate.opsForValue().get(key);

    if (cached != null) {
        return cached instanceof NullMarker ? Optional.empty() : Optional.of((Product) cached);
    }

    Optional<Product> product = repo.findById(id);
    if (product.isPresent()) {
        redisTemplate.opsForValue().set(key, product.get(), Duration.ofMinutes(10));
    } else {
        redisTemplate.opsForValue().set(key, new NullMarker(), Duration.ofMinutes(1));
        // Cache the miss with a shorter TTL
    }
    return product;
}

Cache avalanche: all keys expiring at the same time

// With a fixed TTL:
redisTemplate.opsForValue().set(key, value, Duration.ofMinutes(10));
// If all keys were cached at 2:00 AM, they all expire at 2:10 AM -- stampede

// Fix: TTL with jitter
Duration baseTtl = Duration.ofMinutes(10);
Duration jitter = Duration.ofSeconds(ThreadLocalRandom.current().nextInt(0, 300));
redisTemplate.opsForValue().set(key, value, baseTtl.plus(jitter));
// Keys expire spread across a 10-15 minute window, not simultaneously

Local cache: Caffeine before Redis

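For read-heavy data that rarely changes (config, currencies, categories), a local in-memory cache avoids the network round-trip entirely. A Caffeine lookup is an in-process map read, typically well under a microsecond, compared with roughly 0.5-1ms for a Redis call.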

@Configuration
public class LocalCacheConfig {
    @Bean
    public CacheManager localCacheManager() {
        CaffeineCacheManager manager = new CaffeineCacheManager();
        manager.setCaffeine(Caffeine.newBuilder()
            .maximumSize(1000)
            .expireAfterWrite(5, TimeUnit.MINUTES)
            .recordStats() // enable metrics
        );
        return manager;
    }
}

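// Layered caching: L1 (local Caffeine) -> L2 (Redis) -> DB
// Caveat: on an L2 hit Spring returns the value without writing it back to L1,
// so measure the L1 hit rate you actually get before relying on this setup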
@Service
public class CurrencyService {
    @Caching(cacheable = {
        @Cacheable(cacheManager = "localCacheManager", value = "currencies", key = "#code"),
        @Cacheable(cacheManager = "redisCacheManager", value = "currencies", key = "#code")
    })
    public Currency getCurrency(String code) {
        return currencyRepo.findByCode(code);
    }
}

Trade-off: Local cache data can be stale at different times across nodes. When config changes, you must invalidate all instances. Use Redis pub/sub to broadcast invalidation:

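@Component
public class CacheInvalidationListener implements MessageListener {
    @Autowired private CacheManager localCacheManager;

    // Register this listener on a RedisMessageListenerContainer
    // for new ChannelTopic("cache:invalidate")
    @Override
    public void onMessage(Message message, byte[] pattern) {
        // Payload format: "<cacheName>:<key>", for example "currencies:USD"
        String cacheKey = new String(message.getBody(), StandardCharsets.UTF_8);
        String[] parts = cacheKey.split(":", 2);
        if (parts.length < 2) return; // ignore malformed messages
        Cache cache = localCacheManager.getCache(parts[0]);
        if (cache != null) cache.evict(parts[1]);
    }
}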

Part 8: Threading and Concurrency

Tomcat thread pool: the default mechanism

Spring Boot with Tomcat handles each HTTP request with one thread from the pool. The thread is blocked while waiting for I/O (DB queries, external calls).

Max threads = 200 (Tomcat default)
Each request holds a thread for 100ms
Max throughput = 200 threads / 100ms = 2,000 RPS

But if a request takes 500ms due to an external service call:
Max throughput = 200 threads / 500ms = 400 RPS
Thread pool exhausts at 400 RPS even though the server is not busy
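
The arithmetic above is Little's Law applied to a blocking thread pool: the throughput ceiling is the number of threads divided by the time each request holds one. A minimal sketch (ThreadPoolThroughput is a hypothetical helper name, not part of Tomcat or Spring):

```java
// Throughput ceiling of a blocking thread pool:
// requests per second = threads / seconds each request holds a thread.
final class ThreadPoolThroughput {

    static double maxRequestsPerSecond(int threads, double holdTimeMillis) {
        if (threads <= 0 || holdTimeMillis <= 0) {
            throw new IllegalArgumentException("threads and holdTimeMillis must be positive");
        }
        return threads * 1000.0 / holdTimeMillis;
    }

    public static void main(String[] args) {
        System.out.println(maxRequestsPerSecond(200, 100)); // fast requests
        System.out.println(maxRequestsPerSecond(200, 500)); // slow external call
    }
}
```

Plugging in measured P50 and P99 hold times gives a quick estimate of how much headroom the pool has before requests start queueing.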

# application.yml -- Tomcat thread pool tuning
server:
  tomcat:
    threads:
      max: 200              # Increase if workload is I/O-bound
      min-spare: 20         # Minimum threads always ready
    accept-count: 100       # Queue size when all threads are busy
    connection-timeout: 5000 # ms to wait for request data after a connection is accepted (not a total request timeout)
    max-connections: 8192   # Max TCP connections

Why not increase max threads to 1000?

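Each platform thread reserves its own stack, up to 1MB by default on 64-bit Linux (tunable with -Xss), so 1000 threads can reserve up to 1GB for stacks alone. Context-switching overhead also grows as more threads compete for the same cores. This is the problem that non-blocking I/O and virtual threads address.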

Virtual threads (Java 21): a game changer

Virtual threads (Project Loom) allow millions of threads without the overhead of platform threads.

# application.yml (Spring Boot 3.2+) -- enable virtual threads
spring:
  threads:
    virtual:
      enabled: true

// With virtual threads:
// Each request still uses a "thread" but it is a virtual thread
// When a virtual thread blocks (waiting for DB, an HTTP call), it unmounts from the carrier thread
// The carrier thread can serve another virtual thread
// Result: for I/O-bound work, throughput close to non-blocking I/O while keeping the blocking code style

// Before virtual threads -- Reactive style:
@GetMapping("/orders/{id}")
public Mono<OrderDto> getOrder(@PathVariable Long id) {
    return orderRepository.findById(id) // Returns Mono
        .map(mapper::toDto)
        .switchIfEmpty(Mono.error(new NotFoundException()));
    // Hard to debug, stack traces are meaningless
}

// With virtual threads -- simpler:
@GetMapping("/orders/{id}")
public OrderDto getOrder(@PathVariable Long id) {
    return orderRepository.findById(id) // Blocking style
        .map(mapper::toDto)
        .orElseThrow(NotFoundException::new);
    // Normal blocking-style code that scales like non-blocking
}

// Custom executor with virtual threads when needed
@Bean
public Executor virtualThreadExecutor() {
    return Executors.newVirtualThreadPerTaskExecutor();
}

// Async with virtual threads
@Async("virtualThreadExecutor")
public CompletableFuture<Report> generateReport(Long merchantId) {
    // Runs on a virtual thread -- does not block a platform thread
    Report report = expensiveReportGeneration(merchantId);
    return CompletableFuture.completedFuture(report);
}

When virtual threads are NOT sufficient:

CPU-bound tasks (no I/O blocking): virtual threads do not help
Tasks that need explicit backpressure: reactive streams are still more appropriate
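Pinning: on JDK 21, a virtual thread that blocks inside a synchronized block or method stays pinned to its carrier thread. Run with -Djdk.tracePinnedThreads=full or watch the JFR event jdk.VirtualThreadPinned to find these spots (JDK 24 removes most synchronized pinning via JEP 491)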

Backpressure: avoiding overload

// Without backpressure: the server accepts as many requests as the client sends
// Result: memory exhaustion, GC pressure, OOM

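// Rate limiting at the API level (Resilience4j).
// The instance is created through the registry so the annotation below resolves the same configuration.
@Bean
public RateLimiter rateLimiter(RateLimiterRegistry registry) {
    RateLimiterConfig config = RateLimiterConfig.custom()
        .limitRefreshPeriod(Duration.ofSeconds(1))
        .limitForPeriod(1000)        // Max 1000 requests per second
        .timeoutDuration(Duration.ofMillis(100)) // Wait at most 100ms for a permit
        .build();
    return registry.rateLimiter("api-rate-limiter", config);
}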

@GetMapping("/orders")
@RateLimiter(name = "api-rate-limiter", fallbackMethod = "rateLimitFallback")
public Page<OrderDto> getOrders(Pageable pageable) {
    return orderService.findAll(pageable);
}

public Page<OrderDto> rateLimitFallback(Pageable pageable, RequestNotPermitted e) {
    throw new TooManyRequestsException("Rate limit exceeded. Retry after 1 second.");
}
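
A rate limiter caps requests per second. A complementary form of backpressure caps requests in flight and rejects the excess immediately instead of queueing it. The sketch below uses only java.util.concurrent.Semaphore; ConcurrencyLimiter and RejectedException are hypothetical names, and in a real controller the rejection would be mapped to HTTP 429 or 503.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Caps the number of requests being processed at the same time (hypothetical helper).
final class ConcurrencyLimiter {
    private final Semaphore permits;

    ConcurrencyLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the task if a permit is free; otherwise rejects at once instead of queueing.
    <T> T call(Supplier<T> task) {
        if (!permits.tryAcquire()) {
            throw new RejectedException("too many concurrent requests");
        }
        try {
            return task.get();
        } finally {
            permits.release(); // always return the permit, even if the task throws
        }
    }

    int available() {
        return permits.availablePermits();
    }

    static final class RejectedException extends RuntimeException {
        RejectedException(String message) {
            super(message);
        }
    }
}
```

Rejecting early keeps memory and queue length bounded under overload, which is the failure mode described above.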

Part 9: Async Processing

Not everything needs to happen in the request

// Synchronous: user waits for everything:
@PostMapping("/orders")
public OrderResponse placeOrder(@RequestBody OrderRequest request) {
    Order order = orderService.create(request);         // 50ms
    emailService.sendConfirmation(order);               // 300ms -- SMTP
    pushNotification.send(order.getUserId(), order);    // 200ms -- FCM
    analyticsService.track("order_placed", order);      // 100ms -- analytics DB
    // Total: 650ms, user waits for all of it
    return OrderResponse.from(order);
}

// Async: user gets a response immediately:
@PostMapping("/orders")
public OrderResponse placeOrder(@RequestBody OrderRequest request) {
    Order order = orderService.create(request);         // 50ms -- critical path
    // Fire and forget -- does not block the response
    CompletableFuture.runAsync(() -> emailService.sendConfirmation(order));
    CompletableFuture.runAsync(() -> pushNotification.send(order.getUserId(), order));
    CompletableFuture.runAsync(() -> analyticsService.track("order_placed", order));
    // Total user-facing time: 50ms, the rest happens in the background
    return OrderResponse.from(order);
}

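Pitfall of CompletableFuture.runAsync(): without an executor argument it runs on ForkJoinPool.commonPool(), which is shared by all code in the JVM (including parallel streams) and sized to the CPU count. A few blocking SMTP or HTTP calls can occupy every worker and starve unrelated tasks.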

// Dedicated executor:
@Bean(name = "asyncExecutor")
public Executor asyncExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(10);
    executor.setMaxPoolSize(50);
    executor.setQueueCapacity(500);
    executor.setThreadNamePrefix("async-");
    executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
    executor.initialize();
    return executor;
}

@Async("asyncExecutor")
public CompletableFuture<Void> sendConfirmationEmail(Order order) {
    emailService.sendConfirmation(order);
    return CompletableFuture.completedFuture(null);
}

Message queues for durability

CompletableFuture.runAsync() is not durable. If the server crashes before the email is sent, the email is lost.

// Use a message queue for critical async tasks:
@PostMapping("/orders")
@Transactional
public OrderResponse placeOrder(@RequestBody OrderRequest request) throws JsonProcessingException {
    Order order = orderService.create(request);

    // Write the event to the outbox table in the same DB transaction as the order (Transactional Outbox pattern)
    outboxRepo.save(OutboxMessage.builder()
        .eventType("ORDER_CONFIRMATION_EMAIL")
        .payload(objectMapper.writeValueAsString(new EmailRequest(order)))
        .build());

    return OrderResponse.from(order);
}

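// A separate relay (a poller or a change-data-capture process) reads the outbox table and publishes to Kafka.
// Consumer: runs separately from the request path
@KafkaListener(topics = "order.email.requests")
public void processEmailRequest(EmailRequest request) {
    emailService.sendConfirmation(request);
    // If this throws, the listener's error handler redelivers the record (configure backoff and a dead-letter topic)
    // The event survives a server crash because it is persisted in the outbox table and then in Kafka
    // Delivery is at-least-once, so the email send should be idempotent
}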
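
The piece between the outbox table and the consumer is the relay that publishes pending rows. A minimal in-memory sketch of its polling loop follows; the List stands in for the outbox table and the Consumer for a Kafka producer, and OutboxRelay, OutboxRow and pollOnce are hypothetical names. A real relay would select unpublished rows with SQL and mark them in a transaction.

```java
import java.util.List;
import java.util.function.Consumer;

final class OutboxRelay {

    static final class OutboxRow {
        final long id;
        final String eventType;
        final String payload;
        boolean published; // set once the broker has accepted the event

        OutboxRow(long id, String eventType, String payload) {
            this.id = id;
            this.eventType = eventType;
            this.payload = payload;
        }
    }

    private final List<OutboxRow> table;         // stand-in for the outbox table
    private final Consumer<OutboxRow> publisher; // stand-in for a message producer

    OutboxRelay(List<OutboxRow> table, Consumer<OutboxRow> publisher) {
        this.table = table;
        this.publisher = publisher;
    }

    // Publishes pending rows in insertion order and returns how many were sent.
    // Stops at the first failure so ordering is preserved; the failed row stays
    // unpublished and is retried on the next poll (at-least-once delivery).
    int pollOnce() {
        int sent = 0;
        for (OutboxRow row : table) {
            if (row.published) {
                continue;
            }
            try {
                publisher.accept(row);
            } catch (RuntimeException e) {
                break;
            }
            row.published = true;
            sent++;
        }
        return sent;
    }
}
```

Because a row is marked only after the publish succeeds, a crash between the two steps causes a duplicate publish rather than a lost event, which is why consumers need to be idempotent.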

Parallel fan-out: aggregating from multiple services

// Sequential: slow
@GetMapping("/dashboard/{merchantId}")
public DashboardData getDashboard(@PathVariable Long merchantId) {
    MerchantStats stats = statsService.get(merchantId);         // 50ms
    List<Order> recentOrders = orderService.getRecent(merchantId); // 80ms
    RevenueChart chart = chartService.get(merchantId);          // 60ms
    // Total: 190ms sequential
    return new DashboardData(stats, recentOrders, chart);
}

// Parallel: much faster
@GetMapping("/dashboard/{merchantId}")
public DashboardData getDashboard(@PathVariable Long merchantId) {
    CompletableFuture<MerchantStats> statsFuture =
        CompletableFuture.supplyAsync(() -> statsService.get(merchantId), asyncExecutor);

    CompletableFuture<List<Order>> ordersFuture =
        CompletableFuture.supplyAsync(() -> orderService.getRecent(merchantId), asyncExecutor);

    CompletableFuture<RevenueChart> chartFuture =
        CompletableFuture.supplyAsync(() -> chartService.get(merchantId), asyncExecutor);

    CompletableFuture.allOf(statsFuture, ordersFuture, chartFuture).join();

    // Total: max(50, 80, 60) = 80ms parallel vs 190ms sequential
    return new DashboardData(
        statsFuture.join(),
        ordersFuture.join(),
        chartFuture.join()
    );
}

With a timeout to prevent hanging:

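try {
    CompletableFuture.allOf(statsFuture, ordersFuture, chartFuture)
        .get(2, TimeUnit.SECONDS); // Overall timeout
} catch (TimeoutException e) {
    // Mark pending futures as cancelled. Note: CompletableFuture.cancel(true) does not
    // interrupt a task that is already running, so downstream calls still need their own timeouts.
    statsFuture.cancel(true);
    ordersFuture.cancel(true);
    chartFuture.cancel(true);
    throw new ServiceUnavailableException("Dashboard data timeout");
} catch (ExecutionException e) {
    throw new ServiceUnavailableException("Dashboard data failed: " + e.getCause());
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // preserve the interrupt flag
    throw new ServiceUnavailableException("Dashboard request interrupted");
}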
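
An overall timeout fails the whole dashboard when one dependency is slow. An alternative is a deadline per dependency with a default value, so the page degrades instead of failing. The sketch below uses CompletableFuture.completeOnTimeout (Java 9+); FanOutWithFallback, slowCall and the "-unavailable" placeholder strings are illustrative assumptions, with Thread.sleep standing in for the remote calls.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class FanOutWithFallback {

    // Simulates a downstream call that takes delayMillis to answer.
    static String slowCall(String value, long delayMillis) {
        try {
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return value;
    }

    // Each dependency gets its own deadline; a slow one degrades to a default value.
    static String dashboard(long statsDelayMillis, long chartDelayMillis, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(2); // dedicated pool, not commonPool
        try {
            CompletableFuture<String> stats = CompletableFuture
                .supplyAsync(() -> slowCall("stats", statsDelayMillis), pool)
                .completeOnTimeout("stats-unavailable", timeoutMillis, TimeUnit.MILLISECONDS);
            CompletableFuture<String> chart = CompletableFuture
                .supplyAsync(() -> slowCall("chart", chartDelayMillis), pool)
                .completeOnTimeout("chart-unavailable", timeoutMillis, TimeUnit.MILLISECONDS);
            return stats.thenCombine(chart, (s, c) -> s + "," + c).join();
        } finally {
            pool.shutdownNow(); // interrupts calls that are still running
        }
    }
}
```

In a long-lived service the executor would be a shared bean rather than created per call; it is local here only to keep the sketch self-contained.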

Part 10: HTTP-Level Optimizations

Keep-alive and connection reuse

Without Keep-Alive:
Client -> [TCP handshake] -> Request -> Response -> [TCP close]
         3ms overhead                              3ms overhead

With Keep-Alive (HTTP/1.1 default):
Client -> [TCP handshake] -> Request 1 -> Response 1
                          -> Request 2 -> Response 2  (no handshake!)
                          -> Request N -> Response N
         3ms overhead once

Spring Boot (Tomcat) enables Keep-Alive by default. Make sure the client (RestTemplate, HttpClient) also supports it:

// RestTemplate with connection pooling (Apache HttpClient 5, as used by Spring Boot 3):
@Bean
public RestTemplate restTemplate() {
    PoolingHttpClientConnectionManager connectionManager =
        PoolingHttpClientConnectionManagerBuilder.create()
            .setMaxConnTotal(100)      // illustrative value: total pooled connections
            .setMaxConnPerRoute(20)    // illustrative value: connections per target host
            .build();

    CloseableHttpClient httpClient = HttpClients.custom()
        .setConnectionManager(connectionManager)
        .build(); // persistent connections are reused by default

    return new RestTemplate(new HttpComponentsClientHttpRequestFactory(httpClient));
}

// Or WebClient (Spring WebFlux) -- handles connection pooling better:
@Bean
public WebClient webClient() {
    HttpClient httpClient = HttpClient.create()
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 3000)
        .responseTimeout(Duration.ofSeconds(5))
        .doOnConnected(conn -> conn
            .addHandlerLast(new ReadTimeoutHandler(5))
            .addHandlerLast(new WriteTimeoutHandler(5)));

    return WebClient.builder()
        .clientConnector(new ReactorClientHttpConnector(httpClient))
        .build();
}

HTTP/2: multiplexing and header compression

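HTTP/1.1: a connection carries one request at a time, so parallel requests need parallel connections (pipelining exists but has limited support). Browsers open at most about 6 connections per host.

HTTP/2: many concurrent requests are multiplexed as streams over a single connection, which removes HTTP-level head-of-line blocking. TCP-level head-of-line blocking remains: one lost packet stalls every stream until it is retransmitted.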

# Spring Boot -- Enable HTTP/2 with the embedded server
server:
  http2:
    enabled: true
  ssl:
    enabled: true  # HTTP/2 requires TLS in most browsers
    key-store: classpath:keystore.p12
    key-store-password: ${SSL_PASSWORD}
    key-store-type: PKCS12

HTTP/1.1 (6 parallel connections):
Conn1: GET /api/orders    -> 80ms
Conn2: GET /api/users     -> 60ms
Conn3: GET /api/products  -> 90ms
... (3 more connections)

HTTP/2 (1 connection, multiplexed):
Stream1: GET /api/orders   -|
Stream2: GET /api/users    -+-> All on the same connection
Stream3: GET /api/products -|
+ Header compression (HPACK): repeated headers (Authorization, Content-Type) compressed

Impact: Small for server-to-server communication (which usually has connection pooling already). Large for browser-to-server traffic (many parallel requests, header compression, no HTTP-level HOL blocking).

ETag and conditional requests: avoiding unnecessary data transfer

@GetMapping("/products/{id}")
public ResponseEntity<ProductDto> getProduct(
    @PathVariable Long id,
    @RequestHeader(value = "If-None-Match", required = false) String ifNoneMatch
) {
    Product product = productService.findById(id);
    String etag = "\"" + product.getVersion() + "\""; // or a content hash

    if (etag.equals(ifNoneMatch)) {
        // 304: no body transferred, client uses its cached version.
        // Saves bandwidth and serialization. The product is still loaded above, so the
        // DB fetch is only saved if the version is looked up from a cache first.
        return ResponseEntity.status(HttpStatus.NOT_MODIFIED).eTag(etag).build();
    }

    return ResponseEntity.ok()
        .eTag(etag)
        .cacheControl(CacheControl.maxAge(60, TimeUnit.SECONDS))
        .body(mapper.toDto(product));
}
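
The equals() check above handles the common case of a single strong ETag. Clients may also send several ETags, weak validators (W/"..."), or "*". A sketch of a more tolerant comparison follows, using the weak comparison that HTTP specifies for If-None-Match; IfNoneMatch is a hypothetical helper, and splitting on commas is a simplification that assumes ETag values contain no commas.

```java
final class IfNoneMatch {

    // True when the If-None-Match header matches the current ETag,
    // in which case the server may answer 304 Not Modified.
    static boolean matches(String headerValue, String currentEtag) {
        if (headerValue == null || currentEtag == null) {
            return false;
        }
        String header = headerValue.trim();
        if (header.equals("*")) {
            return true; // "*" matches any existing representation
        }
        String current = stripWeakPrefix(currentEtag.trim());
        for (String candidate : header.split(",")) {
            if (stripWeakPrefix(candidate.trim()).equals(current)) {
                return true;
            }
        }
        return false;
    }

    // Weak comparison ignores the W/ prefix and compares the quoted value only.
    private static String stripWeakPrefix(String etag) {
        return etag.startsWith("W/") ? etag.substring(2) : etag;
    }
}
```

Spring also offers built-in conditional-request handling (for example WebRequest.checkNotModified), which may be preferable to hand-written matching in production code.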

Cache-Control headers: browser and CDN caching

@GetMapping("/static/product-catalog")
public ResponseEntity<List<ProductDto>> getProductCatalog() {
    List<ProductDto> catalog = catalogService.getAll();

    return ResponseEntity.ok()
        .cacheControl(CacheControl
            .maxAge(1, TimeUnit.HOURS)                    // Browser cache for 1 hour
            .staleWhileRevalidate(5, TimeUnit.MINUTES)    // Serve stale for 5 minutes while revalidating
            .staleIfError(1, TimeUnit.DAYS)               // Serve stale for 1 day if origin is down
        )
        .body(catalog);
}

@GetMapping("/user/{id}/profile")
public ResponseEntity<UserProfile> getProfile(@PathVariable Long id) {
    UserProfile profile = userService.getProfile(id);

    return ResponseEntity.ok()
        .cacheControl(CacheControl.noStore()) // Sensitive data -- do not cache
        .body(profile);
}

Part 11: External Service Optimization

Timeouts: the first line of defense

Without timeouts, a single slow downstream service can hold all threads indefinitely.

// RestTemplate with timeouts:
@Bean
public RestTemplate restTemplate() {
    HttpComponentsClientHttpRequestFactory factory =
        new HttpComponentsClientHttpRequestFactory();
    factory.setConnectTimeout(3000);  // 3s to establish connection
    factory.setReadTimeout(5000);     // 5s to read the response
    // If payment API takes more than 5s: timeout, fail fast
    return new RestTemplate(factory);
}

// WebClient (preferred):
@Bean
public WebClient paymentClient() {
    return WebClient.builder()
        .baseUrl(paymentServiceUrl)
        .clientConnector(new ReactorClientHttpConnector(
            HttpClient.create()
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 3000)
                .responseTimeout(Duration.ofSeconds(5))
        ))
        .build();
}

Timeout strategy:

Connect timeout should be less than read timeout: connection establishment is usually faster
Read timeout = downstream P99 latency + a buffer
Total chain timeout should be less than your endpoint’s SLA
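
One way to keep the total chain inside the endpoint's SLA is to carry a deadline through the request and derive each call's timeout from what is left. A minimal sketch, assuming a hypothetical TimeoutBudget helper; the clock is injected so the behaviour can be tested without sleeping.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

// Tracks how much of a request's total time budget remains (hypothetical helper).
final class TimeoutBudget {
    private final long deadlineNanos;
    private final LongSupplier nanoClock;

    private TimeoutBudget(long deadlineNanos, LongSupplier nanoClock) {
        this.deadlineNanos = deadlineNanos;
        this.nanoClock = nanoClock;
    }

    static TimeoutBudget start(Duration total, LongSupplier nanoClock) {
        return new TimeoutBudget(nanoClock.getAsLong() + total.toNanos(), nanoClock);
    }

    static TimeoutBudget start(Duration total) {
        return start(total, System::nanoTime);
    }

    // Timeout for the next downstream call: the per-call cap or the
    // remaining budget, whichever is smaller. Never negative.
    Duration nextCallTimeout(Duration perCallCap) {
        long remainingNanos = Math.max(0L, deadlineNanos - nanoClock.getAsLong());
        Duration remaining = Duration.ofNanos(remainingNanos);
        return remaining.compareTo(perCallCap) < 0 ? remaining : perCallCap;
    }
}
```

With a 1 second budget and a 500ms per-call cap, a call issued 800ms into the request gets only 200ms, and a call issued after the deadline gets zero and can be skipped entirely.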

Circuit breaker: fail fast instead of slow fail

A circuit breaker prevents cascade failure: when a downstream service is continuously failing, stop calling it instead of wasting resources on hopeless requests.

CLOSED state (normal):
  Calls pass through -- track failure rate

If failure rate exceeds threshold -- OPEN state:
  All calls fail immediately (no network call) -- saves resources
  Wait for cooldown period

After cooldown -- HALF-OPEN state:
  Allow limited calls to test if the service has recovered
  If success: back to CLOSED
  If fail: back to OPEN
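
The three states can be sketched as a small state machine. This is a deliberately simplified illustration, not how Resilience4j is implemented: it counts consecutive failures instead of a failure rate over a sliding window, and it does not limit the number of trial calls in HALF_OPEN. SimpleCircuitBreaker and its parameters are hypothetical names.

```java
import java.util.function.LongSupplier;

final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures before opening
    private final long cooldownMillis;   // time to stay OPEN before a trial call
    private final LongSupplier clockMillis;

    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAtMillis;

    SimpleCircuitBreaker(int failureThreshold, long cooldownMillis, LongSupplier clockMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
        this.clockMillis = clockMillis;
    }

    // Call before each downstream request; false means fail fast without a network call.
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clockMillis.getAsLong() - openedAtMillis >= cooldownMillis) {
            state = State.HALF_OPEN; // cooldown elapsed: let a trial call through
        }
        return state != State.OPEN;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtMillis = clockMillis.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller wraps each downstream request: check allowRequest(), make the call, then report recordSuccess() or recordFailure().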

// Resilience4j Circuit Breaker:
@Bean
public CircuitBreaker paymentCircuitBreaker(CircuitBreakerRegistry registry) {
    CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
        .slidingWindowSize(10)               // Last 10 calls
        .failureRateThreshold(50)            // Open if >= 50% fail
        .waitDurationInOpenState(Duration.ofSeconds(30))  // 30s cooldown
        .permittedNumberOfCallsInHalfOpenState(3)
        .slowCallRateThreshold(80)           // Also open if >= 80% of calls are slow
        .slowCallDurationThreshold(Duration.ofSeconds(2)) // A call counts as slow above 2s
        .build();

    return registry.circuitBreaker("payment-service", config);
}

@Service
public class PaymentService {
    @Autowired private CircuitBreaker paymentCircuitBreaker;

    public PaymentResult charge(PaymentRequest request) {
        return paymentCircuitBreaker.executeSupplier(
            () -> paymentApiClient.charge(request)
        );
    }
}

Fallback strategies:

@CircuitBreaker(name = "recommendation-service", fallbackMethod = "getDefaultRecommendations")
public List<ProductDto> getRecommendations(Long userId) {
    return recommendationClient.getForUser(userId);
}

// Fallback: degrade gracefully -- do not fail the entire page
public List<ProductDto> getDefaultRecommendations(Long userId, Exception e) {
    log.warn("Recommendation service unavailable, serving popular products. Error: {}", e.getMessage());
    return popularProductsCache.getTopProducts(10); // Serve popular products instead
}

Retry storms: when retries become the problem

Payment service returns 503
1000 clients retry immediately
1000 requests hit the payment service at once
Payment service is overloaded by retries
503 continues
Clients retry again
Infinite loop (Retry Storm)

// Retry with exponential backoff and jitter:
@Bean
public Retry paymentRetry() {
    RetryConfig config = RetryConfig.custom()
        .maxAttempts(3) // Counts the initial call, so at most 2 retries
        // No waitDuration here: the intervalFunction below defines the wait between attempts
        .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
            500,   // Initial interval ms
            2.0,   // Multiplier
            0.5,   // Randomization factor (jitter)
            10000  // Max interval ms
        ))
        // Only retry specific errors:
        .retryOnException(e -> e instanceof ConnectTimeoutException
                             || e instanceof ServiceUnavailableException)
        // Do not retry client errors:
        .ignoreExceptions(BadRequestException.class, UnauthorizedException.class)
        .build();

    return Retry.of("payment-retry", config);
}

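Retry timeline with jitter (maxAttempts = 3 counts the first call):
Attempt 1: 503
Attempt 2: after waiting 250-750ms (500ms base +/- 50% jitter)
Attempt 3: after waiting 500-1500ms (1000ms base +/- 50% jitter)
Fail (max attempts reached)
Retries from different clients spread out instead of arriving simultaneously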
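
The wait window for each retry follows from the initial interval, the multiplier, and the randomization factor. The sketch below assumes the usual "base times (1 +/- factor)" jitter model and caps the base at the maximum interval; it is an illustration of the arithmetic, not the library's own code, and BackoffBounds is a hypothetical name.

```java
final class BackoffBounds {

    // Returns {minMillis, maxMillis} for the wait before the given retry (1-based).
    static long[] waitBoundsMillis(int retry, long initialMillis, double multiplier,
                                   double randomizationFactor, long maxMillis) {
        if (retry < 1) {
            throw new IllegalArgumentException("retry is 1-based");
        }
        // Exponential growth of the base interval, capped at maxMillis
        double base = Math.min((double) maxMillis, initialMillis * Math.pow(multiplier, retry - 1));
        long min = Math.round(base * (1 - randomizationFactor));
        long max = Math.round(base * (1 + randomizationFactor));
        return new long[] { min, max };
    }

    public static void main(String[] args) {
        for (int retry = 1; retry <= 3; retry++) {
            long[] bounds = waitBoundsMillis(retry, 500, 2.0, 0.5, 10_000);
            System.out.println("retry " + retry + ": " + bounds[0] + "-" + bounds[1] + "ms");
        }
    }
}
```

A wider randomization factor spreads clients further apart at the cost of less predictable individual latency.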

Bulkhead: isolation between downstream services

// Without a bulkhead:
// Payment service slows down -> consumes all threads -> Inventory and Shipping cannot be served either

// With bulkhead: dedicated thread pool per downstream service
@Bean
public ThreadPoolBulkhead paymentBulkhead() {
    ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
        .maxThreadPoolSize(5)         // Only 5 threads for payment calls
        .coreThreadPoolSize(3)
        .queueCapacity(10)            // Queue up to 10 requests
        .keepAliveDuration(Duration.ofSeconds(30))
        .build();

    return ThreadPoolBulkhead.of("payment", config);
}

Part 12: JVM-Level Optimizations

Garbage collection: choosing the right GC for your workload

G1GC (default since Java 9): Balanced latency and throughput. A good default for most applications.

ZGC (Java 15+ production-ready): Sub-millisecond pause times. Best for latency-sensitive applications.

Shenandoah: Similar to ZGC, concurrent and low-pause.

Workload              Recommended GC
----------------------------------------------
General OLTP API      G1GC (default)
Low-latency API       ZGC or Shenandoah
High-throughput batch ParallelGC
Large heap (> 32GB)   ZGC

# JVM flags for a production API server:

# G1GC (default):
-XX:+UseG1GC
-Xms4g -Xmx4g              # Heap: min = max avoids resizing at runtime
-XX:MaxGCPauseMillis=200   # Target max pause (G1 optimizes toward this)
-XX:G1HeapRegionSize=8m
-XX:+G1UseAdaptiveIHOP

# ZGC for low latency:
-XX:+UseZGC
-Xms4g -Xmx4g
-XX:+ZGenerational          # ZGC generational mode (opt-in on Java 21-22, default from Java 23)

Object allocation: the invisible cost

// High-allocation code: creates many short-lived objects
@GetMapping("/orders")
public List<OrderDto> getOrders() {
    return orderRepo.findAll().stream()
        .map(order -> {
            // Allocates a new SimpleDateFormat (and its internal Calendar) per element.
            // Often written this way because SimpleDateFormat is NOT thread-safe and cannot be shared.
            String formattedDate = new SimpleDateFormat("yyyy-MM-dd")
                .format(order.getCreatedAt());
            return new OrderDto(
                order.getId(),
                order.getStatus(),
                formattedDate,
                // Concatenation allocates one String per element: cheap, but it adds up on large lists
                "Order-" + order.getId() + "-" + order.getMerchantId()
            );
        })
        .collect(Collectors.toList());
}

// Better:
private static final DateTimeFormatter FORMATTER = DateTimeFormatter.ofPattern("yyyy-MM-dd");
// DateTimeFormatter is thread-safe, created once

@GetMapping("/orders")
public List<OrderDto> getOrders() {
    return orderRepo.findAll().stream()
        .map(order -> new OrderDto(
            order.getId(),
            order.getStatus(),
            FORMATTER.format(order.getCreatedAt()),  // reuse the formatter; assumes createdAt is a java.time type
            "Order-" + order.getId() + "-" + order.getMerchantId() // plain concatenation is cheaper than String.format
        ))
        .toList(); // Java 16+: returns an unmodifiable List
}

String allocation: the most common GC culprit

// String concatenation in a loop creates many String objects:
public String buildQuery(List<Long> ids) {
    String query = "SELECT * FROM orders WHERE id IN (";
    for (Long id : ids) {
        query += id + ","; // Each += copies the whole string so far (O(n^2) garbage) and leaves a trailing comma
    }
    return query + ")";
}

// Better:
public String buildQuery(List<Long> ids) {
    return "SELECT * FROM orders WHERE id IN (" +
        ids.stream()
           .map(Object::toString)
           .collect(Collectors.joining(",")) +
        ")";
}

// Or use a parameterized query (preferred for SQL):
// "SELECT * FROM orders WHERE id IN (:ids)"
// With Spring Data: findAllById(ids)

Escape analysis: automatic JVM optimization

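The JIT compiler can avoid heap allocation for some short-lived objects. When escape analysis proves an object never "escapes" the method, HotSpot keeps its fields in registers or stack slots (scalar replacement), so there is nothing for the GC to collect.

// Allocation can be eliminated (does not escape), provided iterator() gets inlined: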
public int sumOrderItems(Order order) {
    Iterator<OrderItem> iter = order.getItems().iterator(); // iterator is local
    int sum = 0;
    while (iter.hasNext()) {
        sum += iter.next().getQuantity();
    }
    return sum; // iterator does not escape the method
}

// Object escapes (must be heap-allocated):
public List<OrderItem> getExpensiveItems(Order order) {
    List<OrderItem> result = new ArrayList<>();
    for (OrderItem item : order.getItems()) {
        if (item.getPrice().compareTo(THRESHOLD) > 0) {
            result.add(item); // result escapes the method
        }
    }
    return result; // must be heap-allocated
}

This is why JVM performance does not always match intuition: the optimizer does many things you cannot see. Profile before optimizing manually.

Part 13: Observability and Performance Debugging

The three pillars of observability

Metrics: Aggregated numbers: throughput, latency percentiles, error rate. Best for alerting and trend analysis.

Traces: End-to-end request flow across multiple services. Best for finding bottlenecks in distributed systems.

Logs: Detailed event records. Best for debugging specific incidents.

Micrometer + Prometheus + Grafana stack

// Spring Boot exposes metrics automatically at /actuator/prometheus
// Adding custom metrics:
@RestController
public class OrderController {
    private final Counter orderCounter;
    private final Timer orderLatencyTimer;
    private final DistributionSummary payloadSizeDistribution;

    public OrderController(MeterRegistry registry) {
        this.orderCounter = Counter.builder("api.orders.created")
            .tag("environment", "production")
            .description("Total orders created")
            .register(registry);

        this.orderLatencyTimer = Timer.builder("api.orders.latency")
            .publishPercentiles(0.5, 0.95, 0.99) // Publish P50, P95, P99
            .publishPercentileHistogram(true)
            .register(registry);

        this.payloadSizeDistribution = DistributionSummary.builder("api.response.size")
            .baseUnit("bytes")
            .register(registry);
    }

    @PostMapping("/orders")
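    public ResponseEntity<OrderResponse> createOrder(@RequestBody OrderRequest request) throws Exception {
        // recordCallable rethrows checked exceptions, hence "throws Exception"
        return orderLatencyTimer.recordCallable(() -> {
            orderCounter.increment();
            OrderResponse response = orderService.create(request);
            // Note: this serializes the response a second time just to measure it; sample or drop on hot paths
            payloadSizeDistribution.record(objectMapper.writeValueAsBytes(response).length);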
            return ResponseEntity.ok(response);
        });
    }
}

Grafana dashboard queries:

# API latency P99 (PromQL), aggregated across instances
histogram_quantile(0.99, sum by (le) (rate(api_orders_latency_seconds_bucket[5m])))

# Error rate (fraction of requests returning 5xx)
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m]))

# Throughput
rate(api_orders_created_total[5m])

# HikariCP saturation
hikaricp_connections_active / hikaricp_connections_max

OpenTelemetry distributed tracing

# application.yml -- Spring Boot 3 auto-instruments with Micrometer Tracing
spring:
  application:
    name: order-service

management:
  tracing:
    sampling:
      probability: 0.1  # Sample 10% of requests (100% is usually too expensive at high traffic)
  # Send traces to an OTLP endpoint (for example Jaeger)
  otlp:
    tracing:
      endpoint: http://jaeger:4318/v1/traces  # OTLP over HTTP; 4317 is the gRPC port

// Custom span for important operations:
@Service
public class OrderService {
    @Autowired private Tracer tracer;

    public Order processOrder(OrderRequest request) {
        Span span = tracer.nextSpan().name("order.process")
            .tag("merchant.id", request.getMerchantId().toString())
            .start();

        try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
            Order order = createOrder(request); // auto-traced if using an instrumented DB
            Span paymentSpan = tracer.nextSpan().name("payment.charge").start();
            try (Tracer.SpanInScope ps = tracer.withSpan(paymentSpan)) {
                paymentService.charge(order);
            } finally {
                paymentSpan.end();
            }
            return order;
        } finally {
            span.end();
        }
    }
}

Java Flight Recorder: production profiling

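JFR is a profiler built into the JVM. With the default settings its overhead is typically around 1-2%, low enough for always-on production recording (the more detailed profile settings cost more).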

# Start a JFR recording (can attach to a running JVM):
jcmd <PID> JFR.start duration=60s filename=/tmp/recording.jfr settings=profile

# Analyze with JDK Mission Control:
jmc /tmp/recording.jfr

// Automated with the JFR API in code:
@Component
public class PerformanceRecorder {
    @Scheduled(fixedDelay = 3600000) // Every hour
    public void captureProfile() throws Exception {
        Path file = Path.of("/tmp/profile-" + System.currentTimeMillis() + ".jfr");
        try (Recording recording = new Recording()) { // close() releases the recording's resources
            recording.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
            recording.enable("jdk.GarbageCollection");
            recording.enable("jdk.SocketRead").withThreshold(Duration.ofMillis(10));
            recording.start();
            Thread.sleep(60000); // Record for 1 minute (this blocks a scheduler thread)
            recording.stop();
            recording.dump(file);
        }
        s3Client.upload(file); // Upload for analysis
    }
}

Thread dump analysis

# Capture a thread dump when the API is slow:
jstack <PID> > thread-dump.txt

# Or with jcmd:
jcmd <PID> Thread.print > thread-dump.txt

# Analysis: find BLOCKED threads
grep -A 5 "BLOCKED" thread-dump.txt

# Find threads waiting on a lock:
grep -B 2 "waiting to lock" thread-dump.txt

Dangerous patterns in a thread dump:

Many threads in WAITING state       -> Often all waiting on the same resource (e.g. connection pool exhaustion)
BLOCKED "waiting to lock <0x...>"   -> Lock contention
TIMED_WAITING "parking"             -> Usually idle pool threads waiting for work; a problem only if request threads are stuck here
RUNNABLE with "socketRead"          -> Thread waiting on network I/O (blocking)

Async Profiler: CPU hotspot detection

# Async Profiler: accurate CPU profiling without JVM safepoint bias
./profiler.sh -d 30 -f /tmp/flamegraph.html -e cpu <PID>
# To capture CPU and allocations in one run, write JFR output instead: -e cpu,alloc -f /tmp/profile.jfr

# Flame graph: wider bars mean more CPU time
# Look for:
# - Wide bars in serialization code  -> optimize DTOs
# - Wide bars in reflection          -> add Jackson's Afterburner module (or Blackbird on Java 11+)
# - Wide bars in GC                  -> reduce allocation
# - Wide bars in JDBC                -> N+1 queries, missing indexes

Part 14: Production Performance Incidents

Incident 1: API latency spikes from 80ms to 4 seconds

Symptoms: At 2:00 PM, P99 latency on /api/orders jumped from 80ms to 4,000ms. P50 was still around 90ms. Error rate was near zero. CPU and memory were normal.

Investigation:

-- Check for long-running queries
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle' AND now() - query_start > INTERVAL '1 second'
ORDER BY duration DESC;

-- Result: 1 query running for 3.5 seconds:
-- SELECT * FROM orders WHERE merchant_id = 123 ORDER BY created_at DESC

EXPLAIN ANALYZE SELECT * FROM orders
WHERE merchant_id = 123
ORDER BY created_at DESC
LIMIT 20;

-- Seq Scan on orders (cost=0.00..850000.00 rows=10000000)
-- Filter: merchant_id = 123
-- MISSING INDEX!

Root cause: Merchant 123 saw a 10x traffic increase after a marketing campaign. The query was fast while the table was small. With 10 million rows and no index on merchant_id, it falls back to a sequential scan that takes 3-4 seconds.

Why P50 was still OK: Most merchants have few orders, so their queries are fast. P99 reflects the largest merchants.
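
The gap between P50 and P99 is easy to reproduce with a handful of numbers. The sketch below uses the nearest-rank definition of a percentile on made-up sample data (98 fast requests and 2 slow ones); Percentiles is a hypothetical helper, and monitoring systems usually estimate percentiles from histograms rather than raw samples.

```java
import java.util.Arrays;

final class Percentiles {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] latenciesMillis = new long[100];
        Arrays.fill(latenciesMillis, 90L);   // 98 typical requests
        latenciesMillis[98] = 4000L;         // 2 requests for the largest merchant
        latenciesMillis[99] = 4000L;
        System.out.println("P50 = " + percentile(latenciesMillis, 50) + "ms");
        System.out.println("P99 = " + percentile(latenciesMillis, 99) + "ms");
    }
}
```

Averages and medians hide exactly the requests that hurt the most, which is why alerting on P99 caught this incident and P50 did not.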

Fix:

CREATE INDEX CONCURRENTLY idx_orders_merchant_created
ON orders (merchant_id, created_at DESC);
-- CONCURRENTLY: does not block production traffic
-- Query time: 3.5s -> 3ms

Prevention:

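# Slow query logging is mandatory:
# postgresql.conf
log_min_duration_statement = 500           # Log queries taking more than 500ms
shared_preload_libraries = 'auto_explain'  # auto_explain must be loaded for the next setting to apply
auto_explain.log_min_duration = 1000       # Log the execution plan of queries taking more than 1s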

Incident 2: connection pool exhausted during a flash sale

Symptoms: 3:00 PM flash sale begins. Within 90 seconds, all requests fail with HikariPool: Connection is not available, request timed out after 5000ms. Database CPU is at 30%, meaning the DB is not busy.

Investigation:

// Metrics during the incident:
// hikaricp_connections_active = 10 (MAX)
// hikaricp_connections_pending = 847
// api.orders.latency.p99 = 5000ms (timeout)

// Searching for why connections are held so long:
// Thread dump shows all 10 threads stuck at:
// at com.example.NotificationService.sendPushNotification(NotificationService.java:45)
// at com.example.OrderService.placeOrder(OrderService.java:78)
// Push notification call (Google FCM) is blocking the transaction!

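Root cause: placeOrder() calls FCM inside a @Transactional method, so each request holds its database connection for the entire FCM call. FCM takes 2-10 seconds during a flash sale. With only 10 connections, the pool can complete at most 1-5 orders per second; everything beyond that queues up and times out after 5 seconds.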
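
A rough capacity check makes the exhaustion concrete. By Little's Law, a pool of N connections that are each held for T per request completes at most N / T requests per second. The pool size (10) and the 2-10 second hold times come from the incident; the 5ms hold time for a transaction without the FCM call is an assumed figure, and the names are illustrative.

```java
class PoolCapacitySketch {
    // Little's Law rearranged: max throughput = connections / hold time per request.
    static double maxRequestsPerSecond(int poolSize, long holdMillis) {
        return poolSize * 1000.0 / holdMillis;
    }

    public static void main(String[] args) {
        int poolSize = 10;
        // FCM call inside the transaction: connection held for 2-10 seconds.
        System.out.println(maxRequestsPerSecond(poolSize, 10_000)); // 1.0
        System.out.println(maxRequestsPerSecond(poolSize, 2_000));  // 5.0
        // Without the FCM call, assuming the transaction holds the connection ~5ms:
        System.out.println(maxRequestsPerSecond(poolSize, 5));      // 2000.0
    }
}
```

The database was idle at 30% CPU because the limit was never query speed. It was how long each request kept a connection checked out.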

Fix:

// Before:
@Transactional
public OrderResponse placeOrder(OrderRequest request) {
    Order order = createOrder(request);
    notificationService.sendPushNotification(order); // holds connection for 2-10 seconds!
    return OrderResponse.from(order);
}

// After:
@Transactional
public OrderResponse placeOrder(OrderRequest request) {
    Order order = createOrder(request);
    outboxRepo.save(OutboxMessage.forNotification(order)); // under 1ms
    return OrderResponse.from(order);
    // Transaction commits here, connection is released
}
// Outbox poller sends the notification asynchronously
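
The poller side of the outbox pattern can be sketched as follows. This is an in-memory stand-in, not the post's implementation: a real outbox is a database table written in the same transaction as the order, and the class and method names here are hypothetical. The sketch is not thread-safe and assumes a single poller.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

class OutboxPollerSketch {
    // Stands in for the outbox table (assumption: one poller, no concurrency).
    private final Queue<String> outbox = new ArrayDeque<>();

    // Called inside the order transaction: a cheap local write, no network call.
    void enqueue(String message) {
        outbox.add(message);
    }

    // Called by a scheduler, outside any order transaction.
    // Returns how many messages were delivered in this pass.
    int pollOnce(Consumer<String> sender, int batchSize) {
        int delivered = 0;
        for (int i = 0; i < batchSize && !outbox.isEmpty(); i++) {
            String message = outbox.peek();
            try {
                sender.accept(message); // slow external call, e.g. a push notification
                outbox.remove();        // mark as sent only after the call succeeded
                delivered++;
            } catch (RuntimeException e) {
                break;                  // leave it queued; the next pass retries it
            }
        }
        return delivered;
    }

    int pending() {
        return outbox.size();
    }
}
```

Because a message is removed only after a successful send, delivery is at-least-once: a crash between the send and the removal would resend it, so the receiver should tolerate duplicates.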

Incident 3: Redis outage crashes the database

Symptoms: Redis cluster fails at 11:00 PM. Within 10 seconds, database CPU spikes from 20% to 100%. The database starts dropping connections. The entire platform is down for 20 minutes.

Root cause: Cache avalanche. Every cache read fails because Redis is down, so every request falls through to the database. The database receives 10x its normal traffic instantly. There was no circuit breaker in front of the database and no rate limiting.

Investigation timeline:

11:00:00  Redis master failure
11:00:01  Application detects Redis connection errors
11:00:02  All cache reads fail -> all requests hit DB
11:00:05  DB connection pool exhausted
11:00:10  DB CPU at 100%, queries timing out
11:00:15  Application failures cascade across services
11:00:30  On-call engineer paged

Fix:

// 1. Graceful degradation when Redis fails:
public ProductDto getProduct(Long id) {
    try {
        ProductDto cached = redisTemplate.opsForValue().get("product:" + id);
        if (cached != null) return cached;
    } catch (RedisConnectionFailureException e) {
        // Redis is down: log and continue to DB (do not rethrow!)
        log.warn("Redis unavailable, falling back to DB for product {}", id);
        redisDownMeter.increment();
    }
    return repo.findById(id).map(mapper::toDto).orElseThrow();
}

// 2. Local cache as a buffer when Redis is down:
@Cacheable(cacheManager = "localCacheManager", value = "products-local", key = "#id")
public ProductDto getProduct(Long id) {
    // Caffeine local cache reduces DB hits when Redis is down;
    // this body runs only on a local cache miss
    return repo.findById(id).map(mapper::toDto).orElseThrow();
}

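// 3. Circuit breaker for DB queries when Redis is down:
@CircuitBreaker(name = "database", fallbackMethod = "getDatabaseFallback")
public ProductDto getProductFromDb(Long id) {
    return repo.findById(id).map(mapper::toDto).orElseThrow();
}
// getDatabaseFallback must be defined in the same class, with the same return type and
// the original arguments plus a trailing exception parameter: (Long id, Throwable cause)
// If the DB is also overwhelmed: circuit breaker opens -> fail fast instead of cascading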
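
To make "opens -> fail fast" concrete, here is a minimal circuit breaker state machine in plain Java. It is a simplification for illustration, not how Resilience4j is implemented: it counts consecutive failures rather than a failure rate over a sliding window, and it does not limit the number of trial calls in the half-open state. All names are hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures before opening
    private final long openMillis;      // how long to fail fast before a trial call
    private final LongSupplier clock;   // injectable clock, so tests need no sleeping
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    CircuitBreakerSketch(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized State state() {
        return state;
    }

    // Runs the action unless the breaker is open; on rejection or failure, returns the fallback.
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // fail fast: the protected resource is not touched
        }
        try {
            T result = action.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }

    private synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openMillis) {
                return false;
            }
            state = State.HALF_OPEN; // wait elapsed: let a trial call through
        }
        return true;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

While the breaker is open, callers get the fallback immediately and the database receives no traffic from this path, which gives it room to recover instead of being hit by every timed-out request again.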

Incident 4: retry storm brings down the platform

Symptoms: A payment service deploy fails: some instances return 500. Within 30 seconds, all services start failing. The platform is unreachable for 45 minutes.

Root cause:

Payment service instances: 3
1 instance has a bad deploy -> returns 500

Order Service: calls Payment -> 500 -> retries 3 times (100ms wait each)
Each failing order request generates 4 payment requests (1 + 3 retries)
Traffic to payment increases by up to 4x

Extra retry traffic overloads the 2 healthy instances -> they slow down and time out too
Now calls to the healthy instances are retried as well
If a caller above Order Service also retries 3 times, the attempts multiply:
  up to 16x (4 attempts * 4 attempts)
All payment instances down
Order service threads stuck in retries
All services cascade fail
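
Retry amplification compounds per layer: if every layer in a call chain makes up to k attempts, the service at the bottom can see up to k^n calls for one user request across n retrying layers. A minimal sketch of that worst case, with illustrative names:

```java
class RetryAmplificationSketch {
    // Worst-case calls reaching the bottom service for one user request,
    // when each of `layers` callers makes up to `attemptsPerLayer` attempts.
    static long worstCaseCalls(int attemptsPerLayer, int layers) {
        long calls = 1;
        for (int i = 0; i < layers; i++) {
            calls = Math.multiplyExact(calls, (long) attemptsPerLayer);
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseCalls(4, 1)); // 4  (1 call + 3 retries, one layer)
        System.out.println(worstCaseCalls(4, 2)); // 16 (two layers, each retrying 3 times)
        System.out.println(worstCaseCalls(2, 2)); // 4  (two layers, each retrying once)
    }
}
```

Dropping from 3 retries to 1 at each of two layers cuts the worst case from 16x to 4x, which is why a low retry count matters more than a clever backoff curve.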

Fix:

// 1. Exponential backoff with jitter (reduces thundering herd)
// 2. Low max retry count (2-3, not 5-10)
// 3. Circuit breaker: stop retrying when failure rate is high
// 4. Total request timeout: even with retries, the total time must be capped

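RetryConfig config = RetryConfig.custom()
    .maxAttempts(2)                    // 2 attempts in total: the initial call plus 1 retry
    // Exponential backoff with jitter: 200ms initial, x2 multiplier, 0.5 randomization, 2s cap.
    // The interval function already defines the wait, so no separate waitDuration is set.
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(200, 2.0, 0.5, 2000))
    .build();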

// 5. Server-side rate limiting to protect the payment service
// 6. Separate retry budget: do not retry if there are already too many failures
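
A retry budget (item 6) can be sketched as a token bucket: every original request earns a fraction of a retry, and every retry spends a whole one. When most calls fail, the budget runs dry and retries stop, no matter what the per-request retry count allows. This is a hand-rolled illustration with hypothetical names, not a Resilience4j feature. It uses integer units (100 units = 1 retry) to avoid floating-point drift.

```java
class RetryBudgetSketch {
    private static final int RETRY_COST = 100; // one retry costs 100 units

    private final int retryPercent; // each original request earns this many units
    private final int maxBalance;   // cap, so a quiet period cannot bank unlimited retries
    private int balance;

    RetryBudgetSketch(int retryPercent, int maxBalance) {
        this.retryPercent = retryPercent;
        this.maxBalance = maxBalance;
    }

    // Call once per original (non-retry) request.
    synchronized void recordRequest() {
        balance = Math.min(maxBalance, balance + retryPercent);
    }

    // Call before each retry; false means the budget is spent and the caller should give up.
    synchronized boolean tryAcquireRetry() {
        if (balance < RETRY_COST) {
            return false;
        }
        balance -= RETRY_COST;
        return true;
    }
}
```

With retryPercent = 10, retries are limited to roughly 10% of original traffic over time, so the worst-case load on the payment service is about 1.1x instead of 4x.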

Incident 5: serialization produces a 50MB response

Symptoms: An API endpoint returns responses extremely slowly (30+ seconds) and occasionally causes OOM. Memory spikes whenever the endpoint is called.

Investigation:

// Endpoint:
@GetMapping("/merchants/{id}/full-report")
public MerchantReport getFullReport(@PathVariable Long id) {
    return merchantService.getFullReport(id);
}

// Service:
public MerchantReport getFullReport(Long id) {
    Merchant merchant = merchantRepo.findById(id).orElseThrow();
    // Hibernate EAGER loads everything:
    // merchant.getOrders() -> 50,000 orders
    // each order.getItems() -> 10 items
    // each item.getProduct() -> full product with binary data
    // Total: 50,000 * 10 * product (~100KB) = 50GB in the worst case
    return new MerchantReport(merchant);
}

Root cause: Object graph explosion. An entity with EAGER relationships combined with a serializer that traverses the entire graph.

Fix:

// 1. Never serialize entities directly -- always use DTOs
public record MerchantReportDto(
    Long merchantId,
    String merchantName,
    long orderCount,   // just a count, not a full list
    BigDecimal totalRevenue,
    List<OrderSummaryDto> recentOrders // only the 20 most recent orders
) {}

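// 2. Database aggregation instead of loading all data
//    (the constructor expression below needs a matching 4-argument constructor on the DTO;
//    recentOrders is filled by a separate query limited to 20 rows)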
@Query("""
    SELECT new com.example.MerchantReportDto(
        m.id, m.name,
        COUNT(o.id),
        COALESCE(SUM(o.total), 0)
    )
    FROM Merchant m LEFT JOIN m.orders o
    WHERE m.id = :id
    GROUP BY m.id, m.name
    """)
MerchantReportDto findReportById(@Param("id") Long id);

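// 3. Streaming for large responses
@GetMapping(value = "/merchants/{id}/orders/export",
            produces = MediaType.APPLICATION_NDJSON_VALUE)
public ResponseEntity<StreamingResponseBody> exportOrders(@PathVariable Long id) {
    // newline-delimited JSON streaming: one order per line, written as rows are read,
    // so memory use stays flat regardless of how many orders the merchant has
    StreamingResponseBody body = out -> orderExportService.writeNdjson(id, out); // hypothetical helper
    return ResponseEntity.ok(body);
}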
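
The writing side of an NDJSON export can be sketched without any framework. The sketch below writes one JSON object per line from an iterator, so only one row is in memory at a time. The OrderSummary shape is hypothetical, and the JSON is hand-rolled only to keep the example dependency-free; a real service would use its JSON library for this step.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

class NdjsonExportSketch {
    // Hypothetical row shape for the export.
    record OrderSummary(long id, String status, String total) {}

    // Minimal JSON encoding: escapes only backslashes and double quotes.
    static String toJson(OrderSummary o) {
        return "{\"id\":" + o.id()
            + ",\"status\":\"" + escape(o.status())
            + "\",\"total\":\"" + escape(o.total()) + "\"}";
    }

    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Writes each order as one line and flushes every 1,000 rows,
    // so the client receives data while the export is still running.
    static int writeNdjson(Iterator<OrderSummary> orders, OutputStream out) throws IOException {
        int written = 0;
        while (orders.hasNext()) {
            out.write(toJson(orders.next()).getBytes(StandardCharsets.UTF_8));
            out.write('\n');
            written++;
            if (written % 1000 == 0) {
                out.flush();
            }
        }
        out.flush();
        return written;
    }
}
```

In a Spring controller, the lambda passed as StreamingResponseBody could call a method like writeNdjson with an iterator backed by a database cursor. The key property is that nothing ever holds the full result set.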

Part 15: Senior Engineer Performance Checklist

Before launch: API design review

[] Pagination: use keyset/cursor, not OFFSET for data with more than 10K rows
[] Response size: are sparse fieldsets or projections available?
   No endpoint should be able to return unbounded data
[] Filtering: does every filter parameter have a corresponding index?
[] Sorting: only allow sort on indexed columns
[] Bulk operations: does a batch endpoint exist for use cases that need many items?
[] Idempotency: does every POST/PUT support an idempotency key?
[] Rate limiting: do abuse-prone endpoints have rate limiting?
[] Cache-Control headers set correctly for every endpoint?
[] Sensitive endpoints use no-store?

Before launch: database review

[] EXPLAIN ANALYZE run for every query with production data volume
[] No Seq Scan on any table with more than 100K rows
[] N+1 queries: verified with query counter in integration tests
[] All foreign keys have indexes
[] No SELECT * in production code
[] Connection pool sized correctly for the number of instances
[] Statement timeout is set (prevents hung queries)
[] Read-only transactions use @Transactional(readOnly=true)
[] Batch operations clear the persistence context periodically

Before launch: load testing

[] Load test with production-realistic data volume (not 100 rows)
[] Tested P50, P95, P99 -- not just the average
[] Tested at 1x, 2x, 5x expected peak traffic
[] Tested sustained load (not just spikes)
[] Tested graceful degradation: shut down Redis, shut down external services
[] Verified connection pool behavior under load
[] Verified GC behavior under load (no GC pauses above 200ms)
[] Tested retry behavior with slow or failing downstream services

Monitoring checklist

[] Latency: P50, P95, P99 alerting per endpoint
[] Error rate: alert if above 0.1%
[] Throughput: alert if it drops by more than 20% (a sudden drop usually means requests are failing or not arriving)
[] HikariCP: pending connections, timeout count
[] JVM: heap usage, GC pause duration and frequency
[] External services: P99 latency, error rate, circuit breaker state
[] Database: slow queries (above 500ms), connection count, deadlocks
[] Cache: hit rate, eviction rate, memory usage
[] Thread pool: active threads, queue size

Incident response: performance triage

Latency spike:
1. Check error rate -- is there a correlation?
2. Check database slow queries (pg_stat_activity)
3. Check connection pool metrics (pending > 0?)
4. Check external service latency (traces)
5. Check GC activity (JVM metrics)
6. Check thread pool -- any blocked threads?
7. Check recent deploys -- who deployed what?

If DB:
  -> EXPLAIN ANALYZE the slow query
  -> Check for missing index
  -> Check lock contention (pg_locks)

If External Service:
  -> Check circuit breaker state
  -> Check retry behavior -- is it a retry storm?

If JVM:
  -> Thread dump -> find BLOCKED threads
  -> GC logs -> is major GC too frequent?
  -> Async profiler -> CPU hotspot
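
For the "find BLOCKED threads" step, a thread dump from jstack or jcmd is the usual tool. The same information is available in-process through the standard ThreadMXBean API, which can be handy behind an admin endpoint. A minimal sketch, with an illustrative class name:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

class BlockedThreadFinder {
    // Returns one line per thread currently in the BLOCKED state,
    // i.e. waiting to enter a synchronized block or method held by another thread.
    static List<String> findBlockedThreads() {
        List<String> blocked = new ArrayList<>();
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
        for (ThreadInfo info : infos) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked.add(info.getThreadName()
                    + " blocked on " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        findBlockedThreads().forEach(System.out::println);
    }
}
```

Many threads blocked on the same lock, all held by one owner, point at the owner's stack trace as the place to look next.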

Conclusion

Performance engineering is not about adding cache to everything or raising thread counts. It is the ability to read a system: knowing where latency comes from, why P99 diverges so far from P50, why the database becomes a bottleneck past a certain scale, and why a retry storm can bring down an entire platform.

Three things that are consistently right for most systems:

1. Profile before you optimize. Guessing the bottleneck is usually wrong. JFR, Async Profiler, and distributed tracing tell you where time actually goes. Never optimize something you have not measured.

2. The database is usually the bottleneck. Fix it there first. Missing indexes, N+1 queries, and over-fetching data account for 80% of performance problems in new APIs. EXPLAIN ANALYZE is the most important tool you have.

3. Design for failure, not just the happy path. Timeouts, circuit breakers, retry with backoff, graceful degradation when the cache is down: these do not affect normal-case performance, but they determine your P99 and your system’s behavior under stress. Your P99 in production is your P50 during an incident.

A senior engineer does not memorize every optimization technique. They have a mental model of the request lifecycle, know how to ask the right questions, and know how to read data to find the bottleneck. That is what this post tries to convey.