Crypto M - Crypto News

🚀 OpenAI Introduces New Benchmark Test for AI Information Retrieval

According to PANews, OpenAI has released BrowseComp, a benchmark that evaluates how well AI agents can locate hard-to-find information on the internet. It comprises 1,266 challenging questions designed to simulate an 'online treasure hunt' across complex information networks, where answers are hard to find but easy to verify. The questions span fields including film, technology, and history, and are significantly more difficult than those in existing tests such as SimpleQA.

The AIGC Open Community reports that the benchmark is highly challenging: OpenAI's own GPT-4o and GPT-4.5 models achieve accuracy rates of only 0.6% and 0.9%, respectively, and even GPT-4o with browsing enabled reaches just 1.9%. By contrast, OpenAI's newly released agent model, Deep Research, achieves a much higher accuracy rate of 51.5%.

#OpenAI #BrowseComp #AI #InformationRetrieval #BenchmarkTest #GPT4o #GPT45 #DeepResearch #Accuracy #OnlineTreasureHunt



🚀 AI TRENDS | Zhipu AI Increases Price of New GLM-5.1 Model by 10%

Zhipu AI (02513.HK) has released its new flagship model, GLM-5.1, with a 10% price increase. According to Jin10, GLM-5.1 is the only open-source model capable of sustaining continuous operation for eight hours. On SWE-bench Pro, a benchmark that closely simulates real software development, GLM-5.1 became the first domestic (Chinese) model to surpass Opus 4.6. OpenRouter indicates that, with this release, Zhipu AI has once again raised the price of its GLM models by 10%. After the adjustment, the cache-hit token price of GLM-5.1 in the Coding scenario is comparable to that of Anthropic's Claude Sonnet 4.6. This marks the first time a domestic large model has reached price parity with leading overseas vendors in a core scenario.

#AI #ZhipuAI #GLM51 #PriceIncrease #OpenSourceModel #BenchmarkTest #SWEbenchPro #Opus46 #CodingScenario #ClaudeSonnet46 #Anthropic #DomesticModel #TechTrends

