The Regex That Silently Killed My JSON Arrays — aniketkarneai.com

Every few weeks, a bug teaches you something you already knew but kept forgetting: regex is not the right tool for structured text parsing. Here’s one that cost me a few hours of debugging in the ACO system’s JSON extraction utility.

The Problem

The ACO system runs LLMs through multi-stage pipelines — PM, Planner, Architect, Developer, QA. Each stage returns a JSON response that the next stage needs to parse. LLMs, however, don’t just return JSON. They return JSON wrapped in reasoning blocks, explanatory text, code fences, and occasionally a wall of <thinking> tags.

The extract_json function was supposed to handle this. It uses depth-counting — count opening { and [ characters, track nesting depth, stop when you hit the matching close. This is the correct approach for JSON embedded in messy text. It correctly handles {"key": "value with } inside"} because it only counts { and } when outside of strings.

The bug lived in Step 1: stripping LLM reasoning/thinking blocks before depth-counting begins.

The Investigation

The pipeline worked perfectly in direct tests. Architect responses parsed correctly every time. But in production — when the orchestrator spawns agents as subprocesses and reads their stdout — Architect would return a clean approval response and the system would crash with JSONDecodeError.

The JSON was definitely there. I added debug prints and watched it arrive. The text started with {"approved": true. But extract_json returned None.

Iter 1 changed: '{'

That debug line showed the text starting with { after Step 1 ran. So the depth-counting should have found the JSON immediately. But it wasn’t.

The problem: the original Step 2 used this pattern to strip thinking blocks:

text = re.sub(r'\[([^\]]*)\]', '', text)  # Old buggy code

[^\]]* matches anything inside [] that’s not a ]. Sounds fine. But it matches anything inside square brackets — including valid JSON array markers. When an LLM response contained "concerns": [], this regex turned it into "concerns": — a dangling key with no value. Now the JSON was syntactically broken, and the depth-counter found nothing to parse.

Why did tests pass? Because the test harness called agents directly in the same process. The LLM returned JSON with reasoning blocks, but those blocks didn’t happen to contain [] pairs in the right places. The subprocess isolation of production exposed it; the test session didn’t.

The Fix

The fix has two parts:

1. Stop using regex to strip thinking blocks.

# Step 1 now does targeted stripping only:
text = re.sub(r'<[^/>]*(?:/>|>.*?</[^/>]*>)', '', text, flags=re.DOTALL)  # paired tags
text = re.sub(r'〗.*?『', '', text, flags=re.DOTALL)  # Unicode bracket thinking
text = re.sub(r'OKEN', '', text)  # bare bracket thinking (MiniMax-style)
text = re.sub(r'</?[a-zA-Z][^>]*>', '', text)  # bare closing tags

Each pattern targets a specific known format. None of them match arbitrary [] pairs.

2. The depth-counter does the real work.

Once thinking blocks are stripped, depth-counting takes over:

first_char = text[0]
if first_char == '{':
    opening_char, closing_char = '{', '}'
elif first_char == '[':
    opening_char, closing_char = '[', ']'
else:
    # find first { or [ in text
    ...

# Depth-counting scan
for i in range(start, len(text)):
    if ch == '"':
        in_string = not in_string
        continue
    if in_string:
        continue
    if ch == opening_char:
        depth += 1
    elif ch == closing_char:
        depth -= 1
        if depth == 0:
            return json.loads(text[start:i+1])

This finds the actual JSON boundaries regardless of what surrounds them. The thinking block stripping is conservative cleanup — it removes known noise, not structural delimiters.

What This Taught Me

Two rules I keep relearning:

Test at the integration level, not just the unit level. Unit tests for extract_json passed because they used clean fixture data. The bug only surfaced in the full subprocess pipeline. The test_water_reminder_workflow.py integration test — which actually spawns all agents end-to-end — caught it.
Regex for structure is a liability. [^\]]* looks like it’s matching “non-] characters” but it’s really “grab everything between brackets, including brackets that are JSON.” If you need to parse structured text, use a parser. Depth-counting is a parser for JSON. For XML, use an XML parser. Don’t use regex as a substitute.

The ACO system now has 8 integration tests covering the full pipeline. The extract_json function has a note: “Do NOT use regex to strip thinking blocks — it can destroy JSON arrays.” Future me will appreciate that.

Aniket Karne

DevOps & AI Engineer · Amsterdam

Back to all posts