The Data Problem Nobody Talks About in AI Trading Bots — aniketkarneai.com

Everyone wants to talk about the AI. Nobody wants to talk about the data.

I spent a week reading the code behind the two most popular open-source AI trading frameworks — ai-hedge-fund and TradingAgents — and the most interesting thing I found wasn’t in the agent code at all. It was in the data fetching.

The AI is the easy part. The data is the actual problem.

What These Systems Actually Need

Before we talk costs, let’s understand what data these systems consume:

Price data: Daily open/high/low/close/volume for any ticker. Seems simple. It’s not.

Fundamental data: Income statement (revenue, net income, EPS), balance sheet (assets, liabilities, equity, shares outstanding), cash flow statement (operating cash flow, capex, free cash flow). This is what the Warren Buffett and Ben Graham agents are built on.

Market cap and share count: Current shares outstanding × current price = market cap. This changes constantly via buybacks and dilution.

Valuation metrics: P/E ratio, P/B ratio, EV/EBITDA, ROE, debt-to-equity, operating margin. Calculated from the above.

News and sentiment: Headlines, social media, analyst ratings. Scraped or API-fetched.

Technical indicators: RSI, MACD, moving averages. Calculated from price data.
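The "calculated from the above" fields are plain arithmetic once the inputs exist. Here's a minimal sketch, assuming a hypothetical Fundamentals record. The field names are mine for illustration, not either framework's schema, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Fundamentals:
    """Hypothetical container; field names are illustrative only."""
    price: float               # latest close
    shares_outstanding: float
    net_income: float          # trailing twelve months
    total_equity: float
    total_debt: float


def derived_metrics(f: Fundamentals) -> dict:
    """Compute market cap and a few ratios from raw fundamentals."""
    market_cap = f.price * f.shares_outstanding
    eps = f.net_income / f.shares_outstanding
    has_equity = f.total_equity > 0
    return {
        "market_cap": market_cap,
        "pe": f.price / eps if eps > 0 else None,  # P/E is undefined for losses
        "pb": market_cap / f.total_equity if has_equity else None,
        "roe": f.net_income / f.total_equity if has_equity else None,
        "debt_to_equity": f.total_debt / f.total_equity if has_equity else None,
    }


# Made-up numbers for a hypothetical company.
sample = Fundamentals(price=50.0, shares_outstanding=1_000_000,
                      net_income=2_500_000, total_equity=10_000_000,
                      total_debt=4_000_000)
metrics = derived_metrics(sample)
```

Every ratio here inherits the errors of its inputs: a stale share count flows straight into market cap, EPS, and P/E.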

For ai-hedge-fund, each investor agent fetches a subset of this. The Warren Buffett agent alone makes 3 API calls per ticker: financial metrics, financial line items, and market cap. For 3 tickers × 19 agents, that’s 57 agent-ticker pairs per run: 57 API calls even at one call per pair, and potentially 171 if every agent made three like the Buffett agent.

The Free Option: yfinance

Both frameworks support yfinance — a Python package that scrapes Yahoo Finance data for free. No API key needed.

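# yfinance in TradingAgents default config
from tradingagents.default_config import DEFAULT_CONFIG

config = DEFAULT_CONFIG.copy()  # start from the defaults, then pick vendors
config["data_vendors"] = {
    "core_stock_apis": "yfinance",       # OHLCV price data
    "technical_indicators": "yfinance",  # RSI, MACD, moving averages
    "fundamental_data": "yfinance",      # income statement, balance sheet, cash flow
    "news_data": "yfinance",             # headlines
}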

Sounds great. Free data. But yfinance comes with serious caveats:

Rate limiting: Yahoo Finance will rate-limit or ban your IP if you query too aggressively. Backtesting over 10 years of daily data for multiple tickers can trigger this in minutes.

Data quality: Yahoo Finance data is occasionally wrong. Revenue figures sometimes don’t match SEC filings, shares outstanding can be stale, and dividends occasionally get misattributed.

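Coverage gaps: For insider trading data, institutional ownership, and short interest, yfinance offers at most current snapshots (for example, Ticker.institutional_holders), not the point-in-time history a backtest needs. The Burry-style “find the short” analysis can’t happen with Yahoo Finance data alone.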

No guarantee of uptime: Yahoo Finance isn’t an API. It’s a website. They change their backend without notice. yfinance breaks periodically and needs patching.
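For the rate-limiting caveat, the usual mitigation is to retry with exponential backoff. Here's a minimal sketch, not taken from either framework. fetch_with_backoff is a hypothetical helper, and fetch stands for any zero-argument callable you wrap around a data request. Catching every exception is a simplification; real code would catch only the rate-limit error its client raises.

```python
import random
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, sleeping longer after each failure.

    fetch is a hypothetical zero-argument callable wrapping one data request.
    sleep is injectable so the retry logic can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            # 1s, 2s, 4s, ... plus jitter so parallel workers don't sync up
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Backoff slows a backtest down but keeps it from hammering the site. It does nothing for the data-quality and coverage caveats.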

The Paid Option: This Is Where It Gets Expensive

The serious data providers:

Provider	What’s Available	Rough Cost
Polygon.io	Real-time + historical, fundamentals, news	$200-2,000/month
Alpha Vantage	Stock data, FX, crypto, technical indicators	$49-250/month
Financial Datasets API	Balance sheet, income, cash flow, metrics	$50-500/month
Bloomberg	Everything, but terminal is ~$25K/year	$2,000+/month
SEC EDGAR	Free, but raw — needs parsing	Your time

For ai-hedge-fund’s recommended FINANCIAL_DATASETS_API_KEY, expect to pay at minimum $50-100/month for the tier that covers all the line items these agents need.

TradingAgents is smarter about this: it caches aggressively. The first run fetches live data; subsequent runs read from the ~/.tradingagents/ cache. But the initial fetch still costs.
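The fetch-once-then-read-from-disk pattern looks roughly like this. It's a generic illustration, not TradingAgents' actual cache code: the cached_fetch helper, the key format, and the JSON layout are all my assumptions, and the sample payload is placeholder data.

```python
import json
import tempfile
from pathlib import Path


def cached_fetch(cache_dir, key, fetch):
    """Return cached JSON for key if present; otherwise call fetch() and store it."""
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    data = fetch()  # the only step that costs an API call
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data))
    return data


# Demo with a counting stand-in for a paid API call (placeholder values).
api_calls = []


def fake_fetch():
    api_calls.append(1)
    return {"ticker": "AAPL", "close": [1.0, 2.0]}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch(tmp, "AAPL_prices", fake_fetch)
    second = cached_fetch(tmp, "AAPL_prices", fake_fetch)
```

The second call never touches the API. The trade-off is that a cache with no expiry will happily serve stale data forever.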

What the AI Actually Does With Bad Data

This is the part nobody writes blog posts about.

Most of the investor agents in ai-hedge-fund do quantitative screening first, then hand off to the LLM. The quantitative screens — ROE > 15%, debt-to-equity < 0.5, operating margin > 15% — are applied to the raw data.

If the data is wrong, the screen is wrong. A stock with fabricated revenue figures will pass a revenue growth screen. A company with hidden liabilities will look healthy on a debt ratio check.
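To make that concrete, here's a sketch of such a screen using the thresholds quoted above. The function name and dictionary keys are hypothetical, not ai-hedge-fund's code, and the figures are invented.

```python
def passes_quality_screen(m):
    """Thresholds from the text: ROE > 15%, D/E < 0.5, operating margin > 15%."""
    return (
        m["roe"] > 0.15
        and m["debt_to_equity"] < 0.5
        and m["operating_margin"] > 0.15
    )


# Made-up numbers for one hypothetical company.
true_figures = {"roe": 0.18, "debt_to_equity": 0.9, "operating_margin": 0.20}
# Same company, but the data vendor missed a chunk of its debt.
vendor_figures = {**true_figures, "debt_to_equity": 0.4}

print(passes_quality_screen(true_figures))    # False
print(passes_quality_screen(vendor_figures))  # True
```

The screen has no way to know which dictionary is right. It just compares numbers.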

The LLM is downstream of the data. Garbage in, garbage out — but with more expensive compute in the middle.

There’s also the timeliness problem. Most fundamental data is quarterly. A DCF model using 3-month-old financials can miss a sudden earnings drop. The market can move significantly in a quarter.

The Sentiment Data Problem Is Worse

Technical analysis (RSI, MACD) is mathematical — you apply it to price data and it gives you a number. Reproducible. Testable.
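As an illustration of "apply it to price data and it gives you a number", here's a sketch of RSI using Wilder's smoothing in plain Python. This is the textbook formulation as I understand it, not code lifted from either framework, and the flat-price handling is my own choice.

```python
def rsi(closes, period=14):
    """Relative Strength Index with Wilder's smoothing; returns the latest value."""
    if len(closes) <= period:
        raise ValueError("need more than `period` closing prices")
    changes = [b - a for a, b in zip(closes, closes[1:])]
    gains = [max(c, 0.0) for c in changes]
    losses = [max(-c, 0.0) for c in changes]
    # Seed with a simple average over the first `period` changes...
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    # ...then apply Wilder's recursive smoothing to the rest.
    for gain, loss in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + gain) / period
        avg_loss = (avg_loss * (period - 1) + loss) / period
    if avg_loss == 0:
        return 100.0  # no losses in the window (flat prices also land here)
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

Same closes in, same number out, every time. That's what makes it testable, and it's also why wrong closes produce a confidently wrong RSI.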

Sentiment analysis is harder. “The news was positive about Apple today” requires fetching relevant news articles, identifying which are about the company (not competitors, not industry-wide), classifying tone, weighing by source credibility, and aggregating across time.

All of this is noisy. News about Apple’s supply chain issues could be classified as negative even if it implies higher margins. A positive article written by a known bull could be weighted too heavily.
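The last two steps, weighing by source credibility and aggregating across time, can be sketched as a weighted average with time decay. This is one possible approach under my own assumptions, not what either framework does. The Article fields, the credibility scores, and the 3-day half-life are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Article:
    tone: float         # -1.0 (negative) .. +1.0 (positive), from some classifier
    credibility: float  # 0.0 .. 1.0, a judgment call per source
    age_days: float     # how old the article is


def aggregate_sentiment(articles, half_life_days=3.0):
    """Credibility-weighted, time-decayed mean tone; 0.0 if there is no signal."""
    weighted_sum = 0.0
    total_weight = 0.0
    for a in articles:
        decay = 0.5 ** (a.age_days / half_life_days)  # halves every half-life
        weight = a.credibility * decay
        weighted_sum += weight * a.tone
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else 0.0


# Made-up inputs: a fresh positive article and an older negative one.
score = aggregate_sentiment([
    Article(tone=1.0, credibility=0.9, age_days=0.0),
    Article(tone=-1.0, credibility=0.9, age_days=3.0),
])
```

Every parameter in there is a guess. Change the half-life or bump one source's credibility and the score moves, which is exactly the noise problem.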

Both frameworks use simple heuristics for sentiment scoring. Neither does it well, and both acknowledge this in their disclaimers.

What I Learned Building Data Pipelines

I maintain a multi-agent system for software development (aco-system). It doesn’t deal with financial data — it deals with code. But the data problem is analogous:

Dealing with garbage data dominates the cost of the AI. My agents spend non-trivial time handling malformed responses, missing context, and stale caches. The AI logic is maybe 20% of the complexity. The data plumbing is 80%.

For financial systems, this ratio is probably worse. Code has a well-defined syntax; financial data has well-defined schemas, but the real-world values are dirty, late, and sometimes wrong.

The Real Cost of “Free” Data

Let’s add it up for a hobby project running one of these systems:

yfinance (free): $0/month, but rate-limited, potentially wrong, no guarantees
Alpha Vantage (cheapest paid tier): $49/month for 75 requests/minute
Financial Datasets API (what ai-hedge-fund recommends): $50-200/month
LLM costs (Groq free tier, rate limited; OpenAI GPT-4o): $0-50/month

For a serious hobby project: $50-250/month. For a research project that actually backtests across decades: $500+/month easily.

The developers who build these systems know this. The disclaimers about “educational use only” aren’t just legal CYA — they’re an acknowledgment that the infrastructure to run this seriously costs real money, and the free version is a demo.

Why Does This Keep Getting Left Out?

Two reasons:

The AI is sexier. Blog posts about multi-agent systems get clicks. Blog posts about API rate limiting get none.

Data quality is hard to show. You can’t screenshot “my data was wrong for 3 hours last Tuesday.” It’s an invisible failure.

But if you’re building anything that relies on data — not just trading bots, but any AI system — the data layer is where you’ll spend most of your debugging time.

The best systems are designed with data quality in mind from the start: aggressive caching, fallback providers, explicit data freshness timestamps, validation checks against known-good benchmarks. Both repos do some of this, but neither does it comprehensively.
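Two of those ideas, fallback providers and explicit freshness timestamps, fit in a few lines. Here's a minimal sketch under my own assumptions: fetch_with_fallback is a hypothetical helper, and each provider is a (name, fetch) pair whose fetch() returns a dict with a "value" and a timezone-aware "as_of" datetime. Neither repo uses this exact shape.

```python
from datetime import datetime, timedelta, timezone


def fetch_with_fallback(providers, max_age, now=None):
    """Try providers in order; return the first result that is fresh enough.

    providers: list of (name, fetch) pairs. fetch() returns a dict with a
    "value" and a timezone-aware "as_of" datetime (hypothetical shape).
    """
    now = now or datetime.now(timezone.utc)
    errors = []
    for name, fetch in providers:
        try:
            record = fetch()
        except Exception as exc:  # simplification: treat any error as "provider down"
            errors.append(f"{name}: {exc}")
            continue
        age = now - record["as_of"]
        if age > max_age:
            errors.append(f"{name}: stale by {age - max_age}")
            continue
        return {**record, "source": name}  # record which provider answered
    raise RuntimeError("no usable provider: " + "; ".join(errors))
```

Failing loudly when every provider is down or stale is the design choice that matters here. A silent fallback to old numbers is how the invisible failures get in.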

Aniket Karne

DevOps & AI Engineer · Amsterdam
