Joe Resident on Nostr: o3 isn't as good as I hoped, but it's still an increment in the SOTA. 69% on ...
o3 isn't as good as I hoped, but it's still an increment in the SOTA.
69% on SWE-Bench Verified! The regression line over the past 2 years still points to 100% by year end!
Frankly, I think the real story is how cheaply Gemini 2.5 is delivering 64% on SWE-Bench.
Exciting times! Coding with Gemini 2.5 is so satisfying, a big step up from DeepSeek V3.1, which is what I was using before.
#ai #llm #o3
Published at 2025-04-16 20:24:33 UTC
Event JSON
{
"id": "7a68e09f8126af89131671c5530c27619de9baad9401e239c818a94748b31fa7",
"pubkey": "a43b0118fd72492f2ba11290cccb27418b1fdbb7ce3a122d229404e57a75975a",
"created_at": 1744835073,
"kind": 1,
"tags": [
[
"t",
"ai"
],
[
"t",
"llm"
],
[
"t",
"o3"
]
],
"content": "o3 isn't as good as I hoped, but it's still an increment in the SOTA. \n\n69% on SWE-Bench Verified! The regression line over the past 2 years still points to 100‰ by year end!\n\nFrankly I think the real story is how cheaply Gemini 2.5 is delivering 64% on SWE-Bench\n\nExciting times! Coding with Gemini 2.5 is so satisfying, a big step up from deepseek V3.1, which is what I was using before.\n\n#ai #llm #o3",
"sig": "a82858309a655a85207cd2ea1b075dc18ef0b98c6f8bc6b334bb40dd5208ab56a28abd85948978921f854bdead35a56653704abddb53f73a55da0604c8cf2a10"
}
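The event JSON above carries enough information to re-derive the displayed timestamp and hashtags, and to recompute the event id. The following is a minimal sketch, assuming the NIP-01 rule that the id is the SHA-256 of the compact JSON array `[0, pubkey, created_at, kind, tags, content]`. The helper names (`compute_event_id`, `published_at`, `hashtags`) are hypothetical and not part of any Nostr library, and the sketch does not check the Schnorr signature in `sig`, which would need a secp256k1 library.

```python
import hashlib
import json
from datetime import datetime, timezone

# Fields copied verbatim from the event JSON above ("sig" omitted: not verified here).
event = {
    "id": "7a68e09f8126af89131671c5530c27619de9baad9401e239c818a94748b31fa7",
    "pubkey": "a43b0118fd72492f2ba11290cccb27418b1fdbb7ce3a122d229404e57a75975a",
    "created_at": 1744835073,
    "kind": 1,
    "tags": [["t", "ai"], ["t", "llm"], ["t", "o3"]],
    "content": (
        "o3 isn't as good as I hoped, but it's still an increment in the SOTA. \n\n"
        "69% on SWE-Bench Verified! The regression line over the past 2 years "
        "still points to 100‰ by year end!\n\n"
        "Frankly I think the real story is how cheaply Gemini 2.5 is delivering "
        "64% on SWE-Bench\n\n"
        "Exciting times! Coding with Gemini 2.5 is so satisfying, a big step up "
        "from deepseek V3.1, which is what I was using before.\n\n"
        "#ai #llm #o3"
    ),
}


def compute_event_id(ev: dict) -> str:
    """Hash the serialized event (assumed NIP-01 layout) and return hex."""
    payload = [0, ev["pubkey"], ev["created_at"], ev["kind"], ev["tags"], ev["content"]]
    # Compact separators and raw UTF-8 output, i.e. no extra whitespace
    # and no \uXXXX escapes for non-ASCII characters.
    serialized = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def published_at(ev: dict) -> str:
    """Render the Unix timestamp in created_at as a UTC date string."""
    moment = datetime.fromtimestamp(ev["created_at"], tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


def hashtags(ev: dict) -> list[str]:
    """Collect the values of all "t" (hashtag) tags."""
    return [tag[1] for tag in ev["tags"] if len(tag) >= 2 and tag[0] == "t"]


if __name__ == "__main__":
    print("published at:", published_at(event))
    print("hashtags:", hashtags(event))
    print("recomputed id matches stored id:", compute_event_id(event) == event["id"])
```

The id comparison only holds if the content string is byte-for-byte what the author signed, so the original text (including the trailing space after "SOTA." and the "‰" character) is kept unchanged in the dictionary.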