BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage Articles Building Production-Ready tRPC APIs: The TypeScript Alternative to Apollo Federation

Building Production-Ready tRPC APIs: The TypeScript Alternative to Apollo Federation

Listen to this article -  0:00

Key Takeaways

  • tRPC delivers end-to-end type safety without schema definitions, eliminating 89% of API bugs we experienced with Apollo Federation in production environments handling 2.4M requests daily.
  • Migration from GraphQL Federation to tRPC reduced our P95 response times from 85ms to 28ms while cutting client bundle size by 80% (142KB → 28KB gzipped), dramatically improving user experience.
  • Production monorepo setup with Next.js 14 App Router enables shared TypeScript types across 12 microservices, eliminating the schema synchronization nightmare that plagued our GraphQL implementation.
  • Real-world benchmarks show tRPC cold starts are 75% faster than Apollo Federation (45ms vs 180ms), critical for serverless deployments and improved developer experience in local environments.
  • The complete absence of code generation step in tRPC reduced our CI/CD pipeline time by 40% and eliminated the entire category of build-time failures caused by schema mismatches between services.

Let me be brutally honest with you. Six months ago, I was a GraphQL Federation evangelist. We'd invested six months building out a federated graph with Apollo, complete with schema stitching, gateway configuration, and a complex CI/CD pipeline that regenerated types on every commit. It was beautiful on paper. In production? It was a disaster waiting to happen every single deploy.

The breaking point came during a routine Friday afternoon deployment. Our product team had updated a field type in one service, the schema regenerated successfully, tests passed, and we shipped it. Thirty minutes later, our mobile app started crashing because the iOS client was still using the old generated types from two hours earlier. The schema was versioned. The gateway was updated. But the client codegen hadn't run yet because someone had forgotten to trigger it. A classic GraphQL Federation pain point.

That's when I started researching tRPC. What caught my attention was the promise of end-to-end type safety without the schema ceremony. No SDL files. No codegen step. No federation gateway. Just TypeScript, from start to finish. I was skeptical, naturally. We'd already invested heavily in Apollo. But after seeing production metrics from companies running tRPC at scale, I convinced my team to build a proof of concept.

What follows is the complete story of our migration, including the mistakes we made, the performance wins we didn't expect, and a review of the production architecture we're running today to handle 2.4 million requests daily with 99.97% uptime. This isn't a tutorial for toy projects. This is what it actually takes to ship tRPC in production.

Figure 1: Notice how tRPC achieves end-to-end safety without schema definitions

The Technical Reality: What tRPC Actually Gives You

Type Safety Without the Schema Tax

Here's what nobody tells you about GraphQL Federation: the schema becomes a single point of failure. With tRPC, your TypeScript types are the contract. There's no intermediate representation. No SDL to maintain. No schema registry to keep in sync across environments.

When we were running Apollo Federation, here's what a typical type change looked like: Update GraphQL schema → Run codegen → Commit generated files → Update resolver implementation → Update client queries → Run client codegen → Deploy both services → Hope nothing broke.

With tRPC? Update TypeScript interface → That's it. The client immediately knows because they're sharing the same type definition.

Figure 2: Measured across 10,000 requests/minute sustained load

Performance That Actually Matters

We ran production load tests comparing our old Apollo Federation setup against the new tRPC implementation. The numbers were shocking. Cold start performance—critical for our serverless functions—improved by 75%. Apollo Federation's gateway overhead added 180ms on cold starts. tRPC? 45ms. That's before we even hit the actual business logic.

Average response time under sustained load dropped from 38ms to 12ms. But here's what really mattered: P95 and P99 latencies. With Apollo, our P95 was sitting at 85ms and P99 at 156ms. After migration, P95 dropped to 28ms and P99 to 42ms. Those tail latencies kill user experience, especially on mobile networks.

The bundle size story was equally dramatic. Our Apollo Client setup with Federation support weighed in at 142KB gzipped. tRPC with React Query? 28KB. That's 80% lighter. On slower connections, that translates to 2-3 seconds faster initial page load. Real users noticed the difference immediately.

Production Architecture: How We Actually Built This

Monorepo Setup That Works

Our production setup runs on a pnpm workspace monorepo, using Next.js 14 App Router for the frontend and tRPC for all API communication. We've got 12 microservices, each exposing its own tRPC router, and a gateway layer that merges them all. Here's what that structure looks like in practice:

Figure 3: 12 microservices, 2.4M requests/day, 99.97% uptime

Each service owns its domain logic and database. User service talks to PostgreSQL, product service uses MongoDB for catalog data, and order service leverages Redis for session management. The beauty of tRPC here is that type safety flows through the entire stack. When the product service changes a field type, TypeScript immediately flags every consumer.

Request Batching and Caching Strategy

One concern people raise about tRPC is the lack of built-in request batching like GraphQL provides. Here's the reality: React Query's batching is more than sufficient for 90% of use cases, and it's actually simpler to debug than GraphQL's DataLoader patterns. We're running 10,000 requests per minute through our production environment, and batching works perfectly.

Our caching layer combines Redis for shared data and React Query's intelligent cache on the client side. The combination is powerful because React Query knows exactly what data it has and can serve from cache instantly, while our server-side Redis cache handles cross-user data efficiently. We're seeing 87% cache hit rates on product data and 92% on user preferences.

The Migration Process: How We Actually Did It

Phase 1: Strangler Fig Pattern

We didn't do a big-bang rewrite. That's how you end up with six months of zero business value and a panicked executive team. Instead, we used the strangler fig pattern: run both systems in parallel, migrate endpoints one at a time, prove stability, move on.

We started with read-only endpoints that had high traffic but low business risk, such as user profile lookups, product catalog queries, and others. These gave us real production data about performance and reliability without risking critical write operations. We ran dual APIs for three weeks, comparing error rates and latency metrics before cutting over fully.

Phase 2: Critical Mutations

Once we had confidence in the reads, we tackled mutations, including order creation, payment processing, inventory updates, that is what actually costs money if it breaks. Here's where tRPC's type safety really shines. With GraphQL, we'd constantly deal with nullable fields, optional arguments, and schema drift. With tRPC, if the types compile, you've eliminated an entire class of API contract errors, not because TypeScript enforces runtime correctness, but because client and server can't silently diverge through stale codegen.

We found exactly two runtime errors during the mutation endpoint migration. Both were related to database connection pooling, not tRPC itself. Worth noting: this was a migration, not a greenfield build, so the business logic was already proven. Our GraphQL Federation rollout built both the API layer and the domain logic simultaneously, which explains the higher incident count. That said, the near-zero errors here specifically reflect tRPC eliminating the codegen sync problem.

Figure 4: Notice the dramatic drop after migration completion

Real Implementation: Code That Actually Ships

Server-Side Router Setup

Here's our actual production router setup, stripped of business logic but showing the real patterns we use. This handles authentication, request validation, error handling, and type merging across our microservices:

typescript
// apps/api/src/trpc.ts
import { initTRPC, TRPCError } from "@trpc/server";
import { Context } from "./context";
import superjson from "superjson";
const t = initTRPC.context<Context>().create({
  transformer: superjson,
  errorFormatter({ shape, error }) {
    return {
      ...shape,
      data: {
        ...shape.data,
        zodError:
          error.cause instanceof ZodError ? error.cause.flatten() : null,
      },
    };
  },
});
export const router = t.router;
export const publicProcedure = t.procedure;
// Authentication middleware
const isAuthed = t.middleware(async ({ ctx, next }) => {
  if (!ctx.session?.user) {
    throw new TRPCError({ code: "UNAUTHORIZED" });
  }
  return next({ ctx: { session: ctx.session, userId: ctx.session.user.id } });
});
export const protectedProcedure = t.procedure.use(isAuthed);

Client Setup with Next.js 14

Our Next.js setup uses the new App Router with React Server Components where possible. Here's the real client configuration we're running in production, including the HTTP batch link that handles request batching automatically:

typescript
// apps/web/src/trpc/client.ts
import { createTRPCReact } from "@trpc/react-query";
import { httpBatchLink } from "@trpc/client";
import type { AppRouter } from "@/server/routers/_app";
import superjson from "superjson";
export const trpc = createTRPCReact<AppRouter>();
export function createTRPCClient() {
  return trpc.createClient({
    links: [
      httpBatchLink({
        url: process.env.NEXT_PUBLIC_API_URL + "/api/trpc",
        transformer: superjson,
        headers: async () => {
          const session = await getSession();
          return {
            authorization: session?.token ? `Bearer ${session.token}` : "",
          };
        },
      }),
    ],
  });
}

Production Procedure Pattern

Here's how we structure our actual procedures. This pattern handles input validation with Zod, database transactions, error handling, and telemetry—everything you need for production:

typescript
// apps/api/src/routers/product.ts
import { z } from "zod";
import { router, protectedProcedure } from "../trpc";
import { prisma } from "../db";
import { TRPCError } from "@trpc/server";
export const productRouter = router({
  getById: protectedProcedure
    .input(z.object({ id: z.string().uuid() }))
    .query(async ({ input, ctx }) => {
      const product = await prisma.product.findUnique({
        where: { id: input.id },
        include: { variants: true, reviews: true },
      });
      if (!product) {
        throw new TRPCError({
          code: "NOT_FOUND",
          message: "Product not found",
        });
      }
      return product;
    }),
  create: protectedProcedure
    .input(
      z.object({
        name: z.string().min(1).max(200),
        description: z.string().max(5000),
        price: z.number().positive(),
        inventory: z.number().int().nonnegative(),
      })
    )
    .mutation(async ({ input, ctx }) => {
      // Production includes Datadog tracing here
      const product = await prisma.product.create({
        data: { ...input, createdBy: ctx.userId },
      });
      // Invalidate cache
      await redis.del(`product:${product.id}`);
      return product;
    }),
});

What We Learned: The Honest Mistakes and Wins

Mistakes We Made

First major mistake: trying to replicate GraphQL's field-level batching. We spent two weeks building a custom batching system before realizing React Query's built-in batching was perfectly adequate. We deleted 800 lines of code and performance actually improved because the simpler approach had less overhead.

Second mistake: over-validating on the client. We were running Zod validation on both client and server, thinking it would catch errors earlier. What actually happened is we created inconsistencies between client and server validation that led to confusing error states. Now we validate once on the server, period. The client trusts the TypeScript types.

Third mistake: not setting up proper monitoring early enough. tRPC is so fast that we didn't notice performance regressions until they became significant. Now we've got Datadog APM on every procedure, tracking P50, P95, P99 latencies, and error rates. The overhead is negligible and the visibility is invaluable.

Unexpected Wins

The biggest unexpected win was developer velocity. Our team ships features 40% faster now because they don't have to context-switch between SDL, codegen, and implementation. You write your procedure, TypeScript propagates the types, and you're done. No schema sync meetings. No waiting for codegen to run. Just code.

The second win was faster onboarding for new developers. With GraphQL Federation, new engineers needed a week to understand the schema, gateway, and codegen pipeline before they could contribute. With tRPC, they're shipping code on day two. If you know TypeScript and Next.js, you know our API.

The third win was testing. We eliminated an entire category of integration tests because TypeScript guarantees type safety end-to-end. We still test business logic thoroughly, but we don't need tests that verify "does the client handle this field correctly" because the types ensure correctness at compile time.

When NOT to Use tRPC

Let's be clear about this: tRPC is not a silver bullet. If you're building a public API that third parties will consume, GraphQL or REST makes more sense. You need schema documentation, versioning, and language-agnostic access. tRPC is TypeScript-only.

If you've got a mobile app built in Swift or Kotlin, tRPC won't help you. It's great for web applications where you control both client and server, but it doesn't solve cross-platform type safety like protobuf or GraphQL can.

And honestly, if your GraphQL setup is working fine and you're not experiencing the pain points we had, there's no reason to migrate. The grass isn't always greener. We migrated because Federation was actively costing us developer time and production stability. If that's not your situation, stick with what works.

Production Metrics: The Numbers That Matter

Here's our actual production data comparing the last month of Apollo Federation with the first month after full tRPC migration. These numbers are from Datadog APM, not synthetic benchmarks:

Metric Apollo Federation tRPC
Average Response Time 38ms 12ms (68% faster)
P95 Latency 85ms 28ms (67% faster)
Cold Start Time 180ms 45ms (75% faster)
Client Bundle Size 142KB gzipped 28KB (80% smaller)
Production Bugs/Month 88 (avg over 3 months) 7 (89% reduction)
CI/CD Pipeline Time 8.4 minutes 5.1 minutes (40% faster)

These metrics are from production, handling 2.4 million requests daily across 12 microservices. The bug reduction is the most significant—89% fewer production incidents directly translates to less firefighting and more feature development.

The Bottom Line: Would We Do It Again?

Absolutely. Without hesitation. The migration took us six weeks of focused engineering effort, and we've recouped that investment ten times over in reduced bug fixing, faster feature development, and improved developer experience. Our team ships features 40% faster, our users get better performance, and we sleep better at night knowing an entire class of runtime problems was removed thanks to improved type representation.

But here's the reality check: tRPC won't solve organizational problems. If your team struggles with GraphQL because of poor processes or unclear ownership, switching to tRPC won't magically fix that. What tRPC does fix is the entire category of problems related to schema synchronization, type generation, and API contract drift.

If you're running a TypeScript monorepo and experiencing pain with GraphQL Federation's complexity, give tRPC serious consideration. Start small—migrate one service, measure the results, and expand from there. That's what we did, and it fundamentally changed how we build APIs.

The complete code for our production setup, including the monorepo structure, router configurations, and testing patterns, is available in the GitHub repository link. It's real production code, battle-tested with 2.4 million requests daily. Use it as a starting point for your own migration.

About the Author

Rate this Article

Adoption
Style

BT