Blog

Palantir for Product

2026-03-22

Observation #1

Palantir started as a fraud detection company. At PayPal, Peter Thiel & co had built systems to spot fake transactions in a stream of millions of real ones. In 2003, Thiel took that same idea and pointed it at a different problem: finding threats in intelligence data.

The first product was boring. Analysts at government agencies had data spread across dozens of incompatible databases. Palantir helped them link records and surface patterns.

It was basically a search tool.

But the bet was never on search.

Palantir's bet was that once you help an organization understand its own data, you become the system through which that organization "thinks". Palantir started by helping analysts find needles in haystacks → then helped them understand the shape of the haystack → then became how those organizations made decisions.

Each step followed naturally from the last. Each step made the next one possible. And the data integrations, the institutional trust, the domain-specific calibration that accumulated over years could not be replicated by a competitor showing up later with a better algorithm.

The pattern is worth studying. You start with something concrete and immediately useful. You earn trust by solving a problem the customer already knows they have. Then you use that trust, and the data that comes with it, to solve problems the customer didn't know were solvable.

Observation #2

As models become more intelligent, the bottleneck shifts.

Today the bottleneck is implementation: can agents reliably write the code? That capability is improving rapidly. Agents are running longer and producing orders of magnitude more tokens (see METR).

What remains are the two ends: the input and the output.

Input is deciding what to work on.

Today, the process looks like a product manager doing user interviews, identifying a problem, proposing a solution, then getting an engineering team to execute and test it. This process is slow, lossy, and political. In a 100+ person company, most of the signal is already lost by the time it reaches an engineer.

The output is knowing whether a change was valuable. Did it break something? Did it improve retention? Did users even notice? Today, for larger changes, you either wait weeks for an A/B test (hoping for statistical significance), or you guess.

Observation #3

A practical vision for us as a company must meet three conditions.

  1. It must be ambitious enough that compounding effort creates compounding advantage. A problem space deep enough to sustain 1–5 years of focused work.
    e.g. a skill file to improve design is too short; a generic cure for cancer (with no biology background) is probably psychosis
  2. The market must be a race to the top for the winners. The problem we work on must reward the winners with durable, compounding value (defensible moats).
    e.g. agent orchestrator is a race to the bottom — thousands of competitors will exist and compete on price until the margin disappears
  3. It must be winnable by a small & focused team of experts. Model/agent labs will not prioritize unglamorous plumbing that requires deep domain knowledge (analytics integrations, CI/CD hooks, team workflows). No single tool exists today for broad product review. The first team to own this surface credibly wins the category.

The Vision

1. Feature Verification

Agents must reliably test whether a newly implemented feature works. This is the most immediate unlock: a tight verification feedback loop that lets agents fully implement features end to end.

With it, an agent becomes a closed system (build → verify → ship → repeat).
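A minimal sketch of that closed loop, with hypothetical `implement`, `verify`, and `ship` hooks standing in for real agent calls (none of these names are an actual API):

```python
def run_feature_loop(feature, implement, verify, ship, max_attempts=3):
    """Build -> verify -> ship -> repeat, as a single closed loop.

    `implement(feature, feedback)` returns a candidate change;
    `verify(change)` returns (passed, feedback); failing feedback is
    fed back into the next implementation attempt. All three hooks
    are stand-ins for agent calls.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        change = implement(feature, feedback)
        passed, feedback = verify(change)
        if passed:
            ship(change)
            return {"shipped": True, "attempts": attempt}
    # Give up after max_attempts and surface the last verifier feedback.
    return {"shipped": False, "attempts": max_attempts, "last_feedback": feedback}
```

The point of the sketch is the shape, not the hooks: verification is the step that turns an open-ended code generator into a loop that can terminate on its own.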

2. Workflow Integrations

Testing is only useful if it lives inside a team's actual workflow (GitHub bot, Slack bot, recordings, pass/fail status on PRs).
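The pass/fail status on a PR is the most mechanical of these. A sketch of the payload GitHub's commit status API (`POST /repos/{owner}/{repo}/statuses/{sha}`) accepts; the `context` string here is our own naming convention, not anything GitHub prescribes:

```python
def build_status_payload(passed: bool, run_url: str) -> dict:
    """Build a GitHub commit-status payload for a verification run.

    `state`, `context`, `description`, and `target_url` are the fields
    the commit status API accepts; `run_url` links back to the
    recording of the verification run.
    """
    return {
        "state": "success" if passed else "failure",
        "context": "product-verification",  # our convention, shown in PR checks
        "description": "All flows verified" if passed else "Verification failed",
        "target_url": run_url,
    }
```

Posting this payload against the PR's head commit is what makes the verification result show up as an ordinary check, in the place the team already looks.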

3. Regression Testing

Reliable, robust, deterministic, non-flaky coverage across the whole product. This is the hard engineering problem.

How do you ensure that building with agents does not cause regressions, and, when regressions do happen, how do you feed the failures back to the agents to fix them?

The loop closes tighter. Agents break something, agents learn about it, agents fix it.
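That tighter loop can be sketched as a regression run whose failures become fix tasks; `run_test` and `dispatch_fix` are hypothetical stand-ins for the test runner and the agent dispatcher:

```python
def close_regression_loop(test_suite, run_test, dispatch_fix):
    """Run every regression test; hand each failure back as a fix task.

    `run_test(test)` returns {"passed": bool, "trace": str};
    `dispatch_fix(test, trace)` is a stand-in for giving an agent the
    failing trace to repair. Returns the list of failing tests.
    """
    failures = []
    for test in test_suite:
        result = run_test(test)
        if not result["passed"]:
            failures.append(test)
            # The trace is the signal: agents fix what agents broke.
            dispatch_fix(test, result["trace"])
    return failures
```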

4. Product Intelligence

Integrate analytics, session data, error logs (PostHog, Sentry, Zendesk) and start surfacing insights that go beyond "does it work."

Steps 1–3 answer whether the code functions correctly.

Step 4 asks whether it was the right product decision/implementation to make.

The gap between step 4 and step 5 is the largest. But the underlying technology is the same at every step: an agent that can navigate a browser, evaluate outcomes, and report what it found. The difference is what you ask it to evaluate.

5. User Simulation

Simulate users running through flows to surface product insights.

Give an agent a persona (a returning customer on a slow connection using an Android phone) and ask it to attempt your checkout as that person would. Multiply by a hundred personas and you see the shape of your product as your users see it.
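A sketch of what "multiply by a hundred personas" means structurally; `run_as` is a stand-in for an agent driving a browser as that persona, and every field name here is illustrative:

```python
from dataclasses import dataclass

@dataclass
class Persona:
    label: str
    device: str        # e.g. "android-phone"
    connection: str    # e.g. "slow-3g"
    familiarity: str   # "first-time" or "returning"

def simulate_flow(personas, run_as):
    """Run one flow attempt per persona and aggregate the outcomes.

    `run_as(persona)` stands in for an agent attempting the flow as
    that person would; it returns {"completed": bool, "friction": [...]}.
    """
    outcomes = {p.label: run_as(p) for p in personas}
    completion = sum(o["completed"] for o in outcomes.values()) / len(outcomes)
    return {"completion_rate": completion, "outcomes": outcomes}
```

The aggregate is the product insight: not whether one run passed, but which slice of the persona population got stuck, and where.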

There's early signal from research that this kind of simulation is feasible.

Imagine redesigning an onboarding flow. Instead of shipping and waiting three weeks for data, you run a hundred simulated users through both versions overnight. The new flow is faster for power users but confusing for first-time visitors. You go to the PM: according to our simulations, this change is net positive. Ship it. No real sessions risked. No weeks waiting for statistical significance.

Defensibility

Simulation is genuinely hard, and that's the moat. Naively prompting a model a thousand times produces systematic miscalibration: overestimating performance on moderate tasks, underestimating on hard ones, with demographic biases that compound across runs. Accurate simulation requires real behavioral data, careful calibration against production outcomes, and domain-specific tuning. You can't solve this with a better prompt.

The personas and systems we build along this path are highly defensible. A simulation of Stripe's checkout flow is not transferable to Figma's onboarding. The personas are constructed from that company's PostHog sessions, calibrated against that company's conversion funnels, validated against that company's real A/B test outcomes.

Every simulation run generates feedback: this persona predicted a drop-off that matched production data, this one didn't, and that feedback tunes the next run.

Over months, the system accumulates these prediction-outcome pairs across every flow the company cares about. A competitor can replicate the PostHog integration; they can ingest the same session data. What they cannot replicate is the validation history: which personas were accurate, which were noise, which edge cases turned out to matter, which behavioral patterns were artifacts of the model versus genuine signals from the user base.

That history is what separates a simulation that is directionally useful from one a PM actually trusts enough to ship against.
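One simple way to keep score of that validation history is a Brier score per persona over its accumulated prediction-outcome pairs. This is an illustrative metric choice on our part, not something the system prescribes:

```python
def brier_score(pairs):
    """Mean squared error between predicted drop-off probabilities and
    observed outcomes (1 = dropped off, 0 = converted).
    Lower is better; always guessing 0.5 scores 0.25.
    """
    return sum((pred - obs) ** 2 for pred, obs in pairs) / len(pairs)

def rank_personas(history):
    """history: {persona_label: [(predicted_prob, observed), ...]}.

    Returns persona labels ordered from best- to worst-calibrated, so
    well-validated personas can be weighted up in the next run and
    noisy ones retired.
    """
    return sorted(history, key=lambda label: brier_score(history[label]))
```

The ranking, not any single score, is what feeds back: it decides which personas the next simulation run should trust.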

Simulation also has honest limits: a persona is only as trustworthy as the behavioral data and validation history behind it.

The team that earns trust with regression testing earns the right to sell product intelligence. The team that earns trust with product intelligence earns the right to simulate. Each step is a product. Each product funds the next step.

Simulation is the natural evolution of testing.

Palantir started by helping analysts search databases. They ended up becoming the system through which entire organizations make decisions.

We are starting by helping agents verify that code works.

The question is how far up the stack that technology can reach.

We think the answer is: all the way to knowing your product better than you do.