52,000 Stars and a Disclaimer: What People Actually Do With These Tools — aniketkarneai.com

The disclaimer on ai-hedge-fund is one of the most prominent things in the README:

Note: the system does not actually make any trades.

## Disclaimer
This project is for educational and research purposes only.
- Not intended for real trading or investment
- No investment advice or guarantees provided
- Creator assumes no liability for financial losses

TradingAgents says it too. Both repos have 50,000+ stars. Both repos are growing.

The question nobody in the comments asks: what happens when the people who star these repos actually use them?

The Gap Between “Educational” and “I Put It on My Brokerage Account”

I’ve been building multi-agent systems long enough to know how this goes. The pattern repeats in every domain:

Someone builds a tool with an “educational” disclaimer
10,000 people star it
Some percentage — maybe 5%, maybe 20% — remove the safety wiring and use it for real
Nobody talks about the failures because failures are embarrassing and sometimes illegal

With trading bots, the stakes are different. With money, people have strong incentives to push past disclaimers:

FOMO on gains: “The AI said buy NVDA and it went up 20% last quarter!”
Publication bias: Success stories get blog posts and YouTube videos. Failures get buried.
Complexity theater: The more complex the system looks, the more people trust it without evidence.
Backtesting illusion: If it backtested well on 2020-2023 data, it must work, right?

That last one is the most dangerous.

The Backtesting Problem

This is the part that worries me most.

Both repos have backtesting engines. ai-hedge-fund’s is particularly sophisticated — it simulates portfolio allocation, margin requirements, and computes real risk metrics (Sharpe ratio, Sortino ratio, max drawdown).
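
For readers who haven't computed these metrics by hand, here is a minimal sketch of the three using only the standard library. It is not code from either repo. The 252-day annualization factor, the zero risk-free rate, and the sample returns are my assumptions for illustration.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed annualization factor for daily returns


def sharpe_ratio(returns, risk_free_daily=0.0):
    """Annualized Sharpe ratio: mean excess return over its standard deviation."""
    excess = [r - risk_free_daily for r in returns]
    sd = stdev(excess)
    if sd == 0:
        return float("nan")
    return mean(excess) / sd * math.sqrt(TRADING_DAYS)


def sortino_ratio(returns, target_daily=0.0):
    """Annualized Sortino ratio: like Sharpe, but only downside deviation counts."""
    excess = [r - target_daily for r in returns]
    downside_sq = [min(0.0, e) ** 2 for e in excess]
    downside_dev = math.sqrt(sum(downside_sq) / len(excess))
    if downside_dev == 0:
        return float("nan")
    return mean(excess) / downside_dev * math.sqrt(TRADING_DAYS)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


# Made-up daily returns, purely for illustration.
daily = [0.01, -0.02, 0.015, -0.005, 0.03, -0.04, 0.02]
print(sharpe_ratio(daily), sortino_ratio(daily), max_drawdown(daily))
```

All three numbers describe the return series you feed in. If that series comes from a flawed backtest, the metrics inherit the flaw.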

But backtesting is notoriously misleading:

Survivorship bias: If you backtest against “stocks that exist today,” you’re excluding stocks that went to zero and no longer exist. A backtest over today’s S&P 500 constituents makes 2008 look milder than it was, because the companies that failed or were dropped from the index are missing from the data.
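
A toy illustration of the effect, with made-up tickers and made-up total returns (nothing here is real market data): averaging only over the names still listed today gives a very different answer than averaging over everything that was tradable at the start.

```python
from statistics import mean

# Hypothetical total returns over some past period, as fractions.
# CCC went to zero and DDD was delisted after a large loss.
period_returns = {"AAA": 0.50, "BBB": 0.20, "CCC": -1.00, "DDD": -0.60}

# Hypothetical set of tickers that still exist today.
listed_today = {"AAA", "BBB"}

survivors_only = mean(r for t, r in period_returns.items() if t in listed_today)
full_universe = mean(period_returns.values())

print(f"survivors only: {survivors_only:+.3f}")
print(f"full universe:  {full_universe:+.3f}")
```

The fix is a data problem, not a code problem: the backtest needs the constituent list as it stood on each historical date, including names that later disappeared.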

Look-ahead bias: If you use fundamental data that’s reported quarterly, there’s a delay between the end of the reporting period and when the data becomes public. A backtest that trades on Q1 earnings on March 31 is cheating: companies typically report Q1 results weeks later, in April or May, so investors did not have those numbers yet.
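
A minimal sketch of the difference, assuming each filing record carries both the fiscal period end and the date it became public. The filings, dates, and EPS values below are hypothetical, and the function names are mine, not either repo's.

```python
from datetime import date

# Hypothetical filings: (fiscal_period_end, date_made_public, eps)
FILINGS = [
    (date(2023, 12, 31), date(2024, 2, 1), 2.18),
    (date(2024, 3, 31), date(2024, 5, 2), 1.53),
]


def latest_eps_naive(as_of):
    """Look-ahead bug: selects by fiscal period end, not by publication date."""
    known = [f for f in FILINGS if f[0] <= as_of]
    return max(known, key=lambda f: f[0])[2] if known else None


def latest_eps_point_in_time(as_of):
    """Only uses filings that were already public on the as_of date."""
    known = [f for f in FILINGS if f[1] <= as_of]
    return max(known, key=lambda f: f[0])[2] if known else None


decision_day = date(2024, 4, 15)
print(latest_eps_naive(decision_day))          # 1.53, not yet public on this day
print(latest_eps_point_in_time(decision_day))  # 2.18, the latest figure actually known
```

The only change is which date the filter uses, which is why this bug is easy to introduce and hard to spot in a backtest that otherwise looks careful.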

Transaction costs: Backtests often ignore slippage, bid-ask spreads, and the market impact of large orders. A backtest that recommends buying 10% of daily volume will look very different from the reality of executing that trade.
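
To get a feel for the size of the effect, here is a rough cost sketch. The linear market-impact term and every number in it are illustrative assumptions, not a calibrated model and not what either repo does.

```python
def round_trip_cost_bps(half_spread_bps, slippage_bps, impact_coeff_bps, participation):
    """Cost in basis points for buying and later selling a position.

    participation is order size as a fraction of average daily volume.
    The linear impact term is an illustrative assumption, not a calibrated model.
    """
    one_way = half_spread_bps + slippage_bps + impact_coeff_bps * participation
    return 2 * one_way


def net_return(gross_return, cost_bps):
    """Subtract costs (in basis points) from a gross return expressed as a fraction."""
    return gross_return - cost_bps / 10_000


# Same 1% gross return, traded at 0.1% versus 10% of daily volume.
small = round_trip_cost_bps(half_spread_bps=1, slippage_bps=1, impact_coeff_bps=100, participation=0.001)
large = round_trip_cost_bps(half_spread_bps=1, slippage_bps=1, impact_coeff_bps=100, participation=0.10)
print(net_return(0.01, small))
print(net_return(0.01, large))
```

Under these assumed parameters the large order gives up roughly a quarter of the gross return to costs. A strategy that trades often pays this on every round trip.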

Regime dependence: An equity long-only strategy backtested from 2010-2021 looks incredible. The same strategy from 2000-2010 looks terrible. Performance is as much about market conditions as about the strategy itself.

Overfitting: A complex multi-agent system has many tunable parameters. Given enough parameters and enough backtest data, you can make anything look good. The Sharpe ratio in the backtest is not the Sharpe ratio you’ll get forward.
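
You can see the selection effect with pure noise. The sketch below generates strategies whose returns are random draws with zero true edge, picks the one with the best in-sample Sharpe ratio, then checks the same strategy on fresh data. The distribution, the sizes, and the seed are arbitrary choices of mine.

```python
import random
from statistics import mean, stdev


def sharpe(returns):
    """Per-period Sharpe ratio (not annualized), zero risk-free rate."""
    return mean(returns) / stdev(returns)


def best_of_random_strategies(n_strategies, n_days, seed=0):
    """Generate strategies with no real edge, pick the best in-sample one,
    and return its (in-sample, out-of-sample) Sharpe ratios."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_strategies):
        in_sample = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        out_sample = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        results.append((sharpe(in_sample), sharpe(out_sample)))
    return max(results, key=lambda pair: pair[0])


in_sharpe, out_sharpe = best_of_random_strategies(n_strategies=200, n_days=250)
print(f"in-sample: {in_sharpe:.3f}, out-of-sample: {out_sharpe:.3f}")
```

The winner's in-sample Sharpe is positive by construction, because it is the maximum of many noisy draws. Its out-of-sample Sharpe is just another draw centered on zero. Tuning many parameters against one backtest is the same selection process with extra steps.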

Who Is Actually Using These?

Based on GitHub activity, issues, and Discord servers:

Tier 1: Curious developers (majority) — Clone it, read the code, maybe run it with fake tickers, never connect real money. Learn something. Move on.

Tier 2: Researchers (significant minority) — Run backtests, publish findings, build on the architecture for academic projects. These are the people the “educational” disclaimer actually targets.

Tier 3: Indie traders (small but real) — Connect to a paper trading account, test with play money. Some percentage of these graduate to real accounts when the paper results hold.

Tier 4: Full automation (smallest but most dangerous) — Set up the system, connect to a brokerage API, let it run. This is what the disclaimer says not to do. This is what some people are definitely doing right now.

The number in Tier 4 is unknowable. But with 52,000+ stars, even 1% is 520 people running a system against real money despite its explicit “not for real trading” disclaimer.

The AI Credibility Problem

There’s a second-order effect: these systems make AI look credible in domains where credibility is actually dangerous.

When a system has 19 investor agents named after famous investors, a sophisticated backtesting engine, and well-commented Python code, it looks like it works. The visual design — charts, dashboards, multi-agent visualizations — reinforces this.

But looking like it works and working are different things.

The underlying assumption — that the aggregate judgment of 19 AI agents will outperform a simple index fund — has never been demonstrated empirically. These are not hedge funds. They’re prototypes.

This isn’t unique to trading bots. I’ve seen it in every AI application domain: medical diagnosis, legal research, code generation, writing. When the AI looks sophisticated, people trust it beyond what it deserves.

The financial domain is particularly dangerous because losses are quantifiable and public, there’s strong incentive to take credit for wins and hide losses, and “the AI was wrong” doesn’t satisfy regulators or spouses.

What the Developers Can Do

The developers of these systems are in a bind. They can’t prevent misuse. They can:

Keep the disclaimers prominent — which both do
Not build brokerage integrations, which neither repo does
Build in guardrails — TradingAgents’ risk management veto is interesting; it makes the system refuse trades in some conditions
Publish transparent failure modes — “here’s when this system will lose money”

The third option is underused. The best way to reduce misuse is to make the system fail visibly and clearly when conditions are wrong. A system that says “I don’t know” is safer than a system that always has an opinion.

Both repos currently have opinions. They always produce a signal. That’s the product design choice that enables misuse.
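
What an abstaining design could look like, as a hypothetical sketch: an aggregator that returns "no_signal" unless enough agents voted and enough of them agree. The function name, the labels, and the thresholds are my own inventions, not the API of either repo.

```python
from collections import Counter


def aggregate_signals(signals, min_agreement=0.7, min_votes=5):
    """Combine per-agent signals ("buy", "sell", "hold") into one decision.

    Returns "no_signal" when there are too few votes or agreement is weak,
    instead of always emitting an opinion. Thresholds are illustrative.
    """
    if len(signals) < min_votes:
        return "no_signal"
    label, count = Counter(signals).most_common(1)[0]
    if count / len(signals) < min_agreement:
        return "no_signal"
    return label


print(aggregate_signals(["buy"] * 8 + ["sell"] * 2))  # buy
print(aggregate_signals(["buy"] * 5 + ["sell"] * 5))  # no_signal
print(aggregate_signals(["buy"] * 3))                 # no_signal
```

Agreement among agents is not evidence of being right, so this only makes disagreement visible. It does not make the signal trustworthy.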

What Users Can Do

If you’ve starred one of these repos and are tempted to run it against real money:

Paper trade first. Use the backtesting engine against historical data. Then use a brokerage paper trading mode for 3-6 months. Track your results honestly.
Understand what you’re trusting. The multi-agent structure doesn’t add intelligence. It adds latency and cost. The question is whether the individual agents are making good decisions, not whether 19 agents are better than 1.
Read the backtest limitations. If you’re sharing backtest results, disclose the look-ahead bias, survivorship bias, and transaction cost assumptions explicitly.
Size small. If you must use it, use it with money you can lose entirely. Position size is risk management.
Diversify your inputs. Don’t rely on any single system’s output, AI or human.
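
On sizing, one common way to turn "size small" into a number is fixed-fractional sizing: decide the most you are willing to lose on one trade, then back out the share count from the distance to your exit price. This is a generic sketch with hypothetical numbers, it assumes a long position with a stop price, and it ignores gaps, fees, and slippage.

```python
def position_size(account_equity, risk_fraction, entry_price, stop_price):
    """Number of whole shares such that hitting the stop loses at most
    risk_fraction of the account. Ignores gaps, fees, and slippage."""
    if not 0 < risk_fraction <= 1:
        raise ValueError("risk_fraction must be in (0, 1]")
    risk_per_share = entry_price - stop_price
    if risk_per_share <= 0:
        raise ValueError("stop_price must be below entry_price for a long position")
    max_loss = account_equity * risk_fraction
    by_risk = int(max_loss // risk_per_share)
    by_cash = int(account_equity // entry_price)  # no leverage
    return min(by_risk, by_cash)


# Risking 1% of a 10,000 account with a stop 5.00 below entry.
print(position_size(10_000, 0.01, entry_price=100.0, stop_price=95.0))  # 20
```

A stop does not guarantee the exit price, so treat the computed loss as a floor on what can go wrong, not a ceiling.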

The Honest Version

Here’s what I’d say to the 520 people probably running this against real money right now:

The multi-agent architecture is intellectually interesting. The backtesting engine is a solid piece of engineering. The 19 investor personas are a fun storytelling device.

But the AI doesn’t know something the market doesn’t know. It doesn’t have insider information. It doesn’t have a model of your personal risk tolerance. It doesn’t know that next quarter’s earnings will surprise to the downside.

What it has is a set of financial screens that worked in the past, run through an LLM that generates plausible-sounding explanations for whatever numbers it sees.

That can be useful for research. It’s not a substitute for judgment.

Aniket Karne

DevOps & AI Engineer · Amsterdam
