Summary of "GLM 5.1 vs MiniMax M2.7 — The Brutal Coding Test via OpenClaw"

GLM 5.1 (Z.ai) vs MiniMax M2.7 (MiniMax) — OpenClaw live coding comparison

Context / setup

Both models were driven through OpenClaw on the same set of realistic engineering tasks, with speed, token usage, and output quality compared for each.

Planned tests & outcomes

The session ran a sequence of realistic engineering tasks. Each item below lists what was tested, the outcome, and the winner.

  1. Code creation — Draw a world map in raw SVG + Bezier arc animation (vanilla JS, no libraries)

    • What was tested: pure generation from a single prompt, path animation, SVG map accuracy (illustrative sketch after this list).
    • Outcome:
      • GLM 5.1 produced the more realistic, responsive result: animated arcs that read like live data feeds over a correct-looking world map.
      • MiniMax produced a reasonable attempt, but it looked less “real-time.”
    • Winner: GLM 5.1
  2. COBOL modernization — Translate a JSON parser module from a legacy COBOL Minecraft server to idiomatic Python with tests

    • What was tested: reading an unfamiliar legacy codebase, translating it to clean Python, and following the repo's existing patterns (illustrative sketch after this list).
    • Outcome:
      • GLM 5.1 progressed much faster and produced a substantive translation and refactor, demonstrating strong legacy-code handling.
      • MiniMax was slower on this task.
    • Winner: GLM 5.1
  3. Feature addition (FastAPI threat intel platform) — Add a watchlist system (save IOCs to named watchlists + alerting)

    • What was tested: understanding a complex modern stack, adding model/repo/service/schema/routes, Alembic migrations, hooks into the Celery pipeline, and the full CRUD and alerting flow (illustrative migration sketch after this list).
    • Outcome:
      • GLM 5.1: fast and architecturally sound; created the model/repo/service/schema/route layers plus a handwritten Alembic migration, and implemented alerting and CRUD.
      • MiniMax: more thorough and modular — separated alert routes, a separate alert schema file, explicit Alembic migration with upgrade/downgrade, FK ordering, indexing, and guards to avoid duplicate alerts. Slower and used more tokens.
    • Notes: Token usage — MiniMax ≈ 62K tokens vs GLM ≈ 32K for this task.
    • Winner: MiniMax M2.7 (for architectural completeness and correctness)
  4. Bug finding & fixing — Find and fix the single most critical production bug (no hints)

    • What was tested: code comprehension, reasoning, fixes, and secondary issue detection (illustrative retry sketch after this list).
    • Outcome:
      • Both models found the same primary bug: an exception silently swallowed behind a failure flag in task.py.
      • GLM’s fix: flipped the failure flag; it was faster and used fewer tokens, but did not take full advantage of Celery’s retry semantics.
      • MiniMax’s fix: more complete — activated self.retry with exponential backoff, fixed a dead max-retry code path, and also flagged an unrelated JWT ‘nbf’ security issue (secondary bug). Produced deeper output at higher token cost.
    • Winner: MiniMax M2.7 (for depth and additional security discovery)
  5. Refactoring — Break apart a “god” module and restructure problematic areas

    • What was tested: high-level architecture, query reduction, making scoring injectable, introducing interfaces/protocols, and efficient upsert vs ORM loops (illustrative refactor sketch after this list).
    • Outcome:
      • GLM 5.1: split the god module into focused files (ingestion, bulk ingestion, context loading, facade), kept routes unchanged, and made scoring injectable. Clean and fast but shallower.
      • MiniMax: took a more senior-engineer approach — created a package, formal protocol interfaces, injectable classes, used native upsert for bulk paths, and caught more issues. More comprehensive structural changes.
    • Winner: MiniMax M2.7
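
The sketches below illustrate the techniques named in the tasks above; they are reconstructions under stated assumptions, not the models' actual output. For task 1 the prompt asked for vanilla JS with no libraries; for consistency with the other sketches, this one uses Python only to emit a standalone SVG showing the core idea: a quadratic Bezier "arc" between two map points, animated by sweeping stroke-dashoffset. All coordinates, colors, and timings are placeholders.

```python
# Quadratic Bezier "flight path" arc between two map points, animated by
# sweeping stroke-dashoffset so the line appears to draw itself live.
# Everything here (coordinates, sizes, timing) is an arbitrary placeholder.

def bezier_arc(x1, y1, x2, y2, lift=0.25):
    """SVG path: quadratic Bezier whose control point is lifted above the
    midpoint, giving the arc look used for live-feed lines on a map."""
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2 - lift * abs(x2 - x1)
    return f"M {x1} {y1} Q {cx} {cy} {x2} {y2}"

svg = f"""<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1000 500">
  <rect width="1000" height="500" fill="#0b1020"/>
  <!-- a real solution would also draw country outlines here as <path> elements -->
  <path d="{bezier_arc(180, 300, 720, 180)}"
        fill="none" stroke="#4fd1c5" stroke-width="2"
        stroke-dasharray="600" stroke-dashoffset="600">
    <animate attributeName="stroke-dashoffset" from="600" to="0"
             dur="2s" repeatCount="indefinite"/>
  </path>
</svg>"""

with open("arcs.svg", "w") as f:
    f.write(svg)
```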
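
For task 2, the actual COBOL module is not shown in the summary, so the fragment below is purely hypothetical. It only illustrates the shape of the requested translation: a buffer scan that COBOL would drive with index arithmetic and PERFORM loops, rewritten as idiomatic Python with a pytest-style test.

```python
# Purely hypothetical flavor of the task: the COBOL original walks a buffer
# with index arithmetic and PERFORM loops; the idiomatic Python port does the
# same "scan a JSON string literal" job with a plain loop plus a pytest test.
# Only a handful of escape sequences are handled -- enough for the sketch.

def scan_string(buf: str, start: int) -> tuple[str, int]:
    """Return the decoded string literal beginning at buf[start] (a '"')
    and the index just past its closing quote."""
    assert buf[start] == '"'
    out, i = [], start + 1
    while i < len(buf):
        ch = buf[i]
        if ch == '"':
            return "".join(out), i + 1
        if ch == "\\":                              # escape sequence
            out.append({"n": "\n", "t": "\t", '"': '"', "\\": "\\"}[buf[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated string literal")


def test_scan_string():
    assert scan_string('"hi"', 0) == ("hi", 4)
    assert scan_string('{"k": "a\\nb"}', 6) == ("a\nb", 12)
```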
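
For task 3, this is a sketch of the kind of Alembic migration the summary credits to MiniMax: explicit upgrade/downgrade, FK ordering, an index, and a guard against duplicate alerts. Every table, column, and revision name here is an assumption, not the repo's actual schema.

```python
# Sketch of a migration with the properties the summary describes: explicit
# upgrade/downgrade, parent-before-child FK ordering, an index on the IOC
# column, and a uniqueness guard against duplicate alerts. All names are
# assumptions, not the platform's real schema.
from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4e5f6"   # placeholder revision id
down_revision = None


def upgrade():
    # Parent table first so the FK below has something to reference.
    op.create_table(
        "watchlists",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("name", sa.String(255), nullable=False, unique=True),
        sa.Column("created_at", sa.DateTime, server_default=sa.func.now()),
    )
    op.create_table(
        "watchlist_alerts",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("watchlist_id", sa.Integer,
                  sa.ForeignKey("watchlists.id", ondelete="CASCADE"),
                  nullable=False),
        sa.Column("ioc_value", sa.String(1024), nullable=False),
        sa.Column("created_at", sa.DateTime, server_default=sa.func.now()),
        # Guard against duplicate alerts for the same IOC on the same watchlist.
        sa.UniqueConstraint("watchlist_id", "ioc_value",
                            name="uq_alert_watchlist_ioc"),
    )
    op.create_index("ix_watchlist_alerts_ioc_value",
                    "watchlist_alerts", ["ioc_value"])


def downgrade():
    # Reverse order: drop children before parents.
    op.drop_index("ix_watchlist_alerts_ioc_value", table_name="watchlist_alerts")
    op.drop_table("watchlist_alerts")
    op.drop_table("watchlists")
```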
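
For task 4, this sketches the retry pattern described in MiniMax's fix: instead of swallowing the exception and flipping a failure flag, the task re-raises through Celery's self.retry with exponential backoff, which also makes a max-retry code path reachable. The task name, exception type, and enrichment call are placeholders.

```python
# Retry pattern described in the fix: re-raise through self.retry with an
# exponential backoff instead of swallowing the error and flipping a flag.
# The task name, exception type, and enrich_indicator() are placeholders.
from celery import Celery

app = Celery("threat_intel")


class TransientUpstreamError(Exception):
    """Stand-in for whatever transient failure the real pipeline raises."""


def enrich_indicator(ioc_id: int) -> dict:
    """Placeholder for the real enrichment call."""
    return {"ioc_id": ioc_id, "enriched": True}


@app.task(bind=True, max_retries=5)
def enrich_ioc(self, ioc_id: int):
    try:
        return enrich_indicator(ioc_id)
    except TransientUpstreamError as exc:
        # Exponential backoff: 2s, 4s, 8s, ... until max_retries, after which
        # Celery raises MaxRetriesExceededError instead of failing silently.
        raise self.retry(exc=exc, countdown=2 ** (self.request.retries + 1))
```

On the secondary JWT finding: the 'nbf' (not-before) claim is typically only validated when it is present in the token; in PyJWT, for example, enforcing it means passing options={"require": ["exp", "nbf"]} to jwt.decode. Whether the platform actually uses PyJWT is not stated in the summary.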
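
For task 5, this sketches the two refactor patterns the comparison highlights: a formal typing.Protocol so the scorer is injectable, and a native PostgreSQL upsert on the bulk-ingestion path instead of per-row ORM loops. Model, table, and column names are assumptions.

```python
# Two of the refactor patterns the comparison judged: a Protocol so scoring is
# injectable (and easily stubbed in tests), and a native PostgreSQL upsert for
# bulk ingestion instead of a per-row ORM loop. Model/column names are made up.
from typing import Iterable, Protocol

from sqlalchemy import Column, Float, Integer, String
from sqlalchemy.dialects.postgresql import insert as pg_insert
from sqlalchemy.orm import DeclarativeBase, Session


class Base(DeclarativeBase):
    pass


class Indicator(Base):
    __tablename__ = "indicators"
    id = Column(Integer, primary_key=True)
    value = Column(String(1024), unique=True, nullable=False)
    score = Column(Float, nullable=False, default=0.0)


class Scorer(Protocol):
    """Anything that can score an IOC value; tests can inject a stub."""
    def score(self, value: str) -> float: ...


class BulkIngestionService:
    def __init__(self, session: Session, scorer: Scorer):
        self.session = session
        self.scorer = scorer          # injected dependency, not hard-coded

    def ingest(self, values: Iterable[str]) -> None:
        rows = [{"value": v, "score": self.scorer.score(v)} for v in values]
        if not rows:
            return
        # Native upsert: one INSERT ... ON CONFLICT DO UPDATE statement
        # instead of an ORM query/merge round-trip per row.
        insert_stmt = pg_insert(Indicator).values(rows)
        upsert = insert_stmt.on_conflict_do_update(
            index_elements=["value"],
            set_={"score": insert_stmt.excluded.score},
        )
        self.session.execute(upsert)
        self.session.commit()
```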

Overall analysis / verdict

GLM 5.1 took the first two tasks on speed and legacy-code handling; MiniMax M2.7 took the remaining three on architectural completeness, depth of bug analysis, and refactoring rigor, consistently at a higher token cost. The trade-off the session surfaces is GLM's speed and efficiency versus MiniMax's thoroughness, with MiniMax winning 3 of the 5 tasks overall.
