H2 Framework: How We Test Our Automated Investing Infrastructure

At Wealthfront, we build automated investing systems that manage client portfolios with meticulous care by incorporating a large set of rules, optimizations, and input data. Testing such complex systems is inherently challenging; small code changes can interact in unexpected ways, which makes it crucial to have a robust framework that validates our investing decisions.

To monitor the behavior of our investing systems, we developed our H2 testing framework. H2 is a historical and hypothetical system that compares investing decisions we made in the past against the decisions we would make with an updated version of the investing evaluation code. We use this framework in two different testing pipelines:

H2-Historical, which checks that the investing output generated after every code change matches a static file of investing decisions we made in the past.
H2-Hypothetical, which dynamically verifies that investing decisions remain identical during large-scale migrations by running two systems side by side and comparing the results.

By integrating H2-Historical into our automated deployment pipeline and H2-Hypothetical into our engineering workflow, we can develop new code quickly without worrying about unintentionally changing automated investing behavior.

Motivation: The Challenges of Testing in a Real-World Environment

In an ideal world, testing financial software would be straightforward. We’d write unit, functional, and integration tests, confirm that our code behaves as expected, and deploy with confidence. But in reality, the complexity of investment systems and the unpredictability of real-world data mean that these carefully packaged tests often aren’t enough.

The interactions between all of the variables associated with an investing decision form an intricate puzzle, which means it’s impractical to expect engineers to write test cases for the seemingly infinite array of possible variable combinations.

This is where our H2 testing framework comes in!

Here at Wealthfront, our belief in continuous deployment is supported and empowered by better testing. For us, this means that instead of maintaining a static staging environment, our tests leverage a high-coverage, living framework backed by historical and hypothetical comparisons to ensure that our investing decisions remain consistent even as we iterate on our financial infrastructure. These dynamic testing pipelines allow us to confidently deploy changes straight to production.

By designing our testing strategy around real-world data from snapshots of our own investing history, we intentionally seek out the inherently unpredictable intricacies of financial systems, rather than trying to simplify them away. We make an effort to capture a wide range of real-world cases (dirty, messy, unpredictable data) and validate that our investing behavior remains stable over time, unless we’re intentionally expecting otherwise. This approach has allowed us to build a more resilient and reliable platform, one that doesn’t just pass engineer-written tests but holds up against the intricacies of actual investing behavior.

H2-Historical: The Backbone of Production Testing

H2-Historical is the more widely used of our H2 testing strategies. This testing pipeline is run on every single merge to production and must pass before any investing services can be deployed. At its core, it ensures that every code change we deploy maintains the same investing behavior as before. Any intentional changes to our investing behavior will break H2-Historical, and new snapshots will have to be taken.

The key to H2-Historical is our investing snapshots. In an effort to capture a broad range of data, we select both typical investing days and days with a unique investing environment that may help stress test edge cases, such as:

Days with very high investing volume, which gives us insight into how our system handles scaling.
Days with a lot of dividend payouts, which could increase our volume of buy transactions when we reinvest these dividends.
Days where the market drops significantly, which could increase tax loss harvesting transactions.

On these selected days, we capture a representative set of accounts and record their investing activity as JSON files. For comprehensive coverage, our sampling methodology for accounts includes:

Completely random accounts – for general representativeness
Partially random accounts – like accounts that actively traded that day
Targeted small samples – selected to stress-test edge cases (such as accounts with a PLOC or a recent ACATS transfer)
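
A minimal sketch of how the three sampling strategies above could be combined, written in Python for brevity. The `Account` fields, the `sample_accounts` function, and the sample sizes are assumptions made for illustration; the production sampling code is not shown in this post.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    # Hypothetical fields; a real account record carries far more detail.
    account_id: str
    traded_today: bool = False
    has_ploc: bool = False
    recent_acats: bool = False


def sample_accounts(accounts, n_random, n_active, seed=0):
    """Combine the three sampling strategies into one de-duplicated list."""
    rng = random.Random(seed)  # fixed seed so a snapshot sample is reproducible

    # 1. Completely random accounts, for general representativeness.
    fully_random = rng.sample(accounts, min(n_random, len(accounts)))

    # 2. Partially random: random picks among accounts that traded that day.
    active = [a for a in accounts if a.traded_today]
    partially_random = rng.sample(active, min(n_active, len(active)))

    # 3. Targeted: every account matching an edge-case predicate.
    targeted = [a for a in accounts if a.has_ploc or a.recent_acats]

    # An account can qualify more than once; keep its first occurrence only.
    unique = {}
    for account in fully_random + partially_random + targeted:
        unique.setdefault(account.account_id, account)
    return list(unique.values())
```

Seeding the generator is a design choice in this sketch: it lets the same set of accounts be re-selected if a snapshot needs to be retaken.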

The actual snapshots include, but aren’t limited to:

Account details
The general investing intent(s) for each account, like REBALANCE, INVEST, or WITHDRAWAL
The actual actions we generated for each account
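
To make the shape of a snapshot concrete, here is a sketch of one possible JSON layout, again in Python. The class names, field names, and the example date are hypothetical; only the three categories (account details, intents, generated actions) come from the list above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Action:
    side: str        # "buy" or "sell"
    iid: str         # instrument identifier
    quantity: float  # number of shares


@dataclass
class AccountSnapshot:
    account_id: str
    snapshot_date: str     # placeholder ISO date string
    account_details: dict  # whatever inputs the evaluation needs
    intents: list          # e.g. ["REBALANCE"], ["INVEST"], ["WITHDRAWAL"]
    actions: list = field(default_factory=list)


def to_json(snapshot):
    """Serialize a snapshot; asdict() also converts the nested Action objects."""
    return json.dumps(asdict(snapshot), sort_keys=True, indent=2)


def from_json(text):
    """Rebuild a snapshot, restoring Action objects from plain dicts."""
    raw = json.loads(text)
    raw["actions"] = [Action(**a) for a in raw["actions"]]
    return AccountSnapshot(**raw)
```

Sorting keys on write keeps the files stable under version control, which matters when snapshots are regenerated after an intentional behavior change.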

Here is an example of what we might see when there is a failure in H2-Historical testing:

buy iid: BMY, expected: 2.000000, actual: 0
buy iid: JNJ, expected: 1.000000, actual: 0
buy iid: KHC, expected: 1.000000, actual: 0
buy iid: VB, expected: 5.000000, actual: 0
buy iid: VIG, expected: 23.000000, actual: 1.000000
buy iid: IEMG, expected: 86.000000, actual: 5.000000
buy iid: SCHF, expected: 195.000000, actual: 10.000000
buy iid: VTEB, expected: 203.000000, actual: 11.000000
sell iid: WFRPX, expected: 3343.000000, actual: 0
sell iid: OXY, matched: 3.000000
sell iid: VWO, matched: 12.000000

This failure was caused by the removal of code related to a discontinued investment strategy following the liquidation of its associated fund. The expected quantities are what we actually traded on the day we took the snapshot, while the actual values are the output of running the new code with the proposed changes. The output tells us that, with the code changes in place, we would have traded this account differently than we did on the snapshot day, which is expected because we no longer trade that fund. We have since updated our snapshots to reflect these new expectations.
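
The comparison behind a report like this can be sketched in a few lines of Python. The function names, the dictionary representation of trades, and the alphabetical ordering are assumptions for illustration; only the line format mirrors the output shown above.

```python
def format_quantity(quantity):
    """Render a quantity the way the report above does; a missing trade prints as 0."""
    return "0" if quantity is None else f"{quantity:.6f}"


def diff_actions(expected, actual):
    """Compare two {(side, iid): quantity} mappings.

    `expected` holds the trades recorded in the snapshot; `actual` holds the
    trades produced by the code under test. Returns (report_lines, passed).
    """
    lines = []
    passed = True
    for side, iid in sorted(set(expected) | set(actual)):
        exp = expected.get((side, iid))
        act = actual.get((side, iid))
        if exp == act:
            lines.append(f"{side} iid: {iid}, matched: {format_quantity(exp)}")
        else:
            passed = False
            lines.append(
                f"{side} iid: {iid}, expected: {format_quantity(exp)}, "
                f"actual: {format_quantity(act)}"
            )
    return lines, passed
```

Exact equality is used here because the quantities are whole shares; a comparison over fractional quantities would likely need a tolerance or a decimal type instead.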

H2-Historical is not considered just another test suite, and failures are handled as seriously as if they were production errors. Engineers here at Wealthfront are expected to conduct a postmortem after any unexpected failure in H2-Historical. In these postmortems, we dive deep into the root cause of the unexpected trading behavior change and document action items to prevent the same issue from happening again.

While the above is an example where H2-Historical flagged an intentional change, we’ve also seen it correctly catch unintended changes. A recent postmortem highlighted an issue where a change to symbol resolution logic would have inadvertently caused trades to fail altogether for certain accounts. In this case, H2-Historical caught the issue before it could impact real client accounts. Here’s what happened:

A change removed logic that returned an old, potentially outdated symbol reference for a stock when no current valid symbol was available.
When this update reached the deployment pipeline, H2-Historical flagged multiple failures where trades didn’t execute for accounts with certain ETFs and stocks.
The issue stemmed from the fact that some instruments had undergone both a symbol and a CUSIP change before the snapshot was taken, meaning they were missing valid identifiers at test time.
Since this would have been unexpected and incorrect investing behavior, the team immediately reverted the change before it was deployed (blocked by the H2 failure) and ultimately implemented a fix that correctly handled symbol resolution.
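
The fallback that was removed can be pictured roughly as follows. This is a hypothetical reconstruction in Python: the `Instrument` fields, `resolve_symbol`, and the `allow_stale_fallback` flag are invented for illustration, and the real resolution logic (which also involves CUSIP changes) is not shown in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instrument:
    instrument_id: str
    current_symbol: Optional[str]  # None when no valid symbol is known
    previous_symbols: tuple = ()   # older symbols, oldest first


def resolve_symbol(instrument, allow_stale_fallback=True):
    """Return a tradable symbol for the instrument, or None if unresolved.

    With allow_stale_fallback=True this mimics the original behavior: when
    no current symbol exists, the most recent old symbol is returned.
    Passing False mimics the change that removed the fallback.
    """
    if instrument.current_symbol is not None:
        return instrument.current_symbol
    if allow_stale_fallback and instrument.previous_symbols:
        return instrument.previous_symbols[-1]
    return None
```

In this sketch, an instrument with no current symbol resolves to `None` once the fallback is disabled, so no trade can be generated for it. A difference of that kind is what a snapshot comparison surfaces.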

This is exactly the kind of issue that H2-Historical is designed to catch. By flagging unexpected changes before they reach production, it helps ensure that every trade executes as intended, maintaining the reliability and consistency of our investing platform.

Because this test runs as part of our deployment pipeline, any failure will block releases. However, H2-Historical is not a development tool – we don’t test new code against it during development. Instead, we rely on a robust suite of unit, functional, and integration tests to catch most unintentional investing changes early. We intentionally treat H2-Historical as a final safeguard rather than a primary testing mechanism. This ensures we maintain rigorous testing throughout development and prevents overreliance on a sampling-based approach that, while powerful, isn’t exhaustive.

The H2-Historical testing pipeline ensures that even as we evolve our platform to add new features and improve existing infrastructure, we maintain stability in the real-world and real-time investing decisions we make for clients.

H2-Hypothetical: For a Different Testing Need

H2-Hypothetical, while still able to catch unintentional investing behavior changes, is primarily designed for a different use case: system migrations and large infrastructure changes.

When we’re migrating core parts of our investing system, such as moving to a new execution platform that defines a similar, but ultimately different set of Java classes, we need a way to verify that these changes don’t impact client investing outcomes. Unlike H2-Historical, we don’t use pre-recorded snapshots for comparison. Instead, it is more effective and relevant to use live comparisons between the old and new systems, especially when the system we’re migrating off of is being incrementally updated and improved by other engineers at the same time.

Here’s how this works:

First, we take a large sample of real-world accounts, capturing all the relevant details that we need to evaluate an account for investing.
We run the same flow for evaluating accounts for investing using first the old system, then the new system we’re planning on migrating to.
We compare the outputs to confirm that they match. If they don’t, we investigate failures and identify the fix, unless the new system actually made an improvement.
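
The three steps above can be sketched as a small comparison harness. This is an illustrative Python outline, not the actual Java/Scala tooling: `compare_systems`, the evaluator callables, and the dictionary shapes are assumptions.

```python
def format_quantity(quantity):
    """A trade missing from one side is rendered as [nothing]."""
    return "[nothing]" if quantity is None else f"{quantity:.6f}"


def describe(expected, actual):
    """List every (symbol, side) pair with the old and new quantities."""
    lines = []
    for symbol, side in sorted(set(expected) | set(actual)):
        exp = expected.get((symbol, side))
        act = actual.get((symbol, side))
        lines.append(
            f"({symbol},{side}) expected: {format_quantity(exp)} "
            f"actual: {format_quantity(act)}"
        )
    return lines


def compare_systems(accounts, evaluate_old, evaluate_new):
    """Run both evaluators on each sampled account and collect mismatches.

    Each evaluator takes an account and returns {(symbol, side): quantity}.
    The old system's output is treated as the expected result.
    """
    failures = {}
    for account in accounts:
        expected = evaluate_old(account)
        actual = evaluate_new(account)
        if expected != actual:
            failures[account["id"]] = describe(expected, actual)
    return failures
```

Because both evaluators run against the same captured inputs at comparison time, the old system's ongoing changes are picked up automatically, which a pre-recorded snapshot could not do.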

Since H2-Hypothetical operates at a larger scale, it can surface subtle behavioral differences that arise from rare interactions between variables and that small-sample testing might miss. This allows us to catch tiny discrepancies before making a full migration.

This is an example of what a failed H2-Hypothetical test looks like:

java.lang.AssertionError: discrepancies for account expecting withdrawal
(AAPL,SELL) expected: 2.000000 actual: 2.000000
(STX,SELL) expected: 2.000000 actual: 2.000000
(IEMG,SELL) expected: 32.000000 actual: 32.000000
(VTEB,SELL) expected: 17.000000 actual: 17.000000
(VIG,SELL) expected: 13.000000 actual: 13.000000
(AMD,SELL) expected: 2.000000 actual: 2.000000
(VEA,SELL) expected: 29.000000 actual: 29.000000
(ANET,SELL) expected: 1.000000 actual: 1.000000
(GOOG,SELL) expected: 3.000000 actual: [nothing]
(MSFT,SELL) expected: 7.000000 actual: 6.000000

At first glance, it looks like the new system produced nearly identical investing decisions, except that it chose not to sell any GOOG stock and sold one fewer share of MSFT. Upon investigation, we realized this was actually an investing improvement resulting from our migration from classic Java classes to a phased Scala pipeline. The new pipeline dynamically adjusted the remaining withdrawal amount between ETF and stock liquidation phases, allowing us to sell fewer securities while still raising the necessary funds for the withdrawal. This change slightly reduced the number of instruments sold and ensured clients didn’t end up with excess uninvested cash. Since we confirmed that this is an improvement rather than a regression, we added this account and other related cases to a list of ignored accounts that we know are failing, but don’t need to be fixed. This allowed us to proceed with the migration while ensuring that true discrepancies would still be flagged for investigation.
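
A toy model helps show why recomputing the remaining amount between phases can sell fewer securities. Everything in this Python sketch is an assumption made for illustration: the fixed 50/50 split attributed to the older flow, the whole-share rounding, the placeholder instruments, and the prices. The actual liquidation logic is not described in this post.

```python
import math


def sell_whole_shares(target, holdings):
    """Sell whole shares, in order, until `target` dollars are raised.

    holdings: list of (symbol, price, shares_held).
    Returns ({symbol: shares_sold}, dollars_raised).
    """
    sales, raised = {}, 0.0
    for symbol, price, held in holdings:
        if raised >= target:
            break
        needed = math.ceil((target - raised) / price)
        quantity = min(needed, held)
        if quantity > 0:
            sales[symbol] = quantity
            raised += quantity * price
    return sales, raised


def fixed_split(withdrawal, etfs, stocks, etf_fraction):
    """Each phase gets a target computed up front and never revised."""
    etf_sales, _ = sell_whole_shares(withdrawal * etf_fraction, etfs)
    stock_sales, _ = sell_whole_shares(withdrawal * (1 - etf_fraction), stocks)
    return {**etf_sales, **stock_sales}


def phased(withdrawal, etfs, stocks, etf_fraction):
    """The stock phase only raises what the ETF phase left uncovered."""
    etf_sales, etf_raised = sell_whole_shares(withdrawal * etf_fraction, etfs)
    remaining = max(0.0, withdrawal - etf_raised)
    stock_sales, _ = sell_whole_shares(remaining, stocks)
    return {**etf_sales, **stock_sales}


# Placeholder instruments and prices, not real market data.
ETFS = [("ETF_A", 300, 10)]
STOCKS = [("STOCK_A", 100, 4), ("STOCK_B", 150, 5)]
```

For a 1,000 withdrawal with `etf_fraction=0.5`, the ETF phase overshoots its 500 target by selling two shares for 600. The fixed split still raises a further 500 from stocks, which requires a share of `STOCK_B` and yields 1,150 in total. The phased version only needs the remaining 400, never touches `STOCK_B`, and raises exactly 1,000.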

While H2-Hypothetical caught an improvement in this example, it more often flags unexpected deviations from the behavior we want: tiny discrepancies that indicate something didn’t migrate correctly. By catching these issues before a full rollout, H2-Hypothetical helps ensure that even large-scale infrastructure changes don’t introduce unintended client-facing differences.
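
The list of ignored accounts mentioned earlier can be sketched as a filter applied before the final assertion. The function names and data shapes below are hypothetical; the idea is simply that reviewed improvements are excluded while anything else still fails the run.

```python
def unexplained_failures(failures, known_improvements):
    """Drop accounts whose discrepancies were reviewed and accepted."""
    return {
        account_id: lines
        for account_id, lines in failures.items()
        if account_id not in known_improvements
    }


def assert_migration_parity(failures, known_improvements):
    """Fail only on discrepancies that nobody has reviewed yet.

    failures: {account_id: [report lines]} from comparing old and new systems.
    known_improvements: set of account ids on the ignore list.
    """
    remaining = unexplained_failures(failures, known_improvements)
    if remaining:
        raise AssertionError(
            f"discrepancies for {len(remaining)} account(s): {sorted(remaining)}"
        )
```

Keeping the ignore list explicit and keyed by account means a new, unrelated discrepancy on any other account is still reported.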

While the output does look similar, H2-Hypothetical, unlike H2-Historical, is not an automated, required part of our deployment pipeline. Instead, it’s an optional but powerful tool that engineers use at their discretion when we anticipate infrastructure changes that could have unexpected side effects. H2-Hypothetical provides assurance in situations where we need to verify that a new platform behaves identically to the old one, and we can use it to gauge how complete an infrastructure migration is.

Key Takeaways & Conclusion

Testing financial systems is complex. It’s important to maintain consistency and stability.
H2-Historical keeps day-to-day investing safe while we make code changes. Engineers take these failures seriously.
H2-Hypothetical uses real-time flow comparisons to help with big changes. Engineers use this to prevent unintended side effects during large-scale migrations.
Capturing real-world messiness is crucial. The best test coverage comes from having unpredictable data that captures unique interactions between account variables.

At Wealthfront, we embrace innovation – but not at the expense of reliability. Our H2 testing framework provides the guardrails that allow us to continuously evolve our systems while ensuring that every trade decision remains consistent and intentional.

Using these two testing pipelines allows us to strike a balance between development velocity and investing integrity. Engineers can ship with confidence, knowing that our testing framework is always working behind the scenes to maintain the high standard of trust that our clients expect.

Disclosures

The information contained in this communication is provided for general informational purposes only, and should not be construed as investment or tax advice. Nothing in this communication should be construed as a solicitation or offer, or recommendation, to buy or sell any security. Any links provided to other server sites are offered as a matter of convenience and are not intended to imply that Wealthfront Advisers or its affiliates endorses, sponsors, promotes and/or is affiliated with the owners of or participants in those sites, or endorses any information contained on those sites, unless expressly stated otherwise.

All investing involves risk, including the possible loss of money you invest, and past performance does not guarantee future performance. Please see our Full Disclosure for important details.

Investment management and advisory services are provided by Wealthfront Advisers LLC (“Wealthfront Advisers”), an SEC-registered investment adviser, and brokerage related products, including the Cash Account, are provided by Wealthfront Brokerage LLC (“Wealthfront Brokerage”), a Member of FINRA/SIPC. Financial planning tools are provided by Wealthfront Software LLC (“Wealthfront Software”).

Wealthfront Advisers, Wealthfront Brokerage, and Wealthfront Software are wholly-owned subsidiaries of Wealthfront Corporation.

Engineering Blog – Wealthfront