Published on 04 Feb 2025

DeepSeek vs ChatGPT: The verdict

DeepSeek has made its debut, and it has been a “deep surprise”. This rising Chinese artificial intelligence (AI) company is said to be capable of training new models that rival existing large language models at a very low cost.

The strategic timing of its release just days before the Year of the Wood Snake was no coincidence. With the long Chinese New Year holiday ahead, idle Chinese users eager for something new would be tempted to install the application and try it out, quickly spreading the word through social media. But launching ahead of the Year of the Snake was not enough — DeepSeek soon made another audacious move. Two days before Chinese New Year, it released its text-to-image model, Janus-Pro 7B, to capitalise on the optimal Chinese New Year time window.

Global users of other major AI models were eager to see if Chinese claims that DeepSeek V3 (DS-V3) and R1 (DS-R1) could rival OpenAI’s ChatGPT-4o (CG-4o) and o1 (CG-o1) were true. Social media was flooded with test posts, but many users could not even tell V3 and R1 apart, let alone figure out how to switch between them. Cross-platform comparisons were mostly random, with users drawing conclusions based on gut feelings. Testing methods also varied, leading to different conclusions.  

DeepSeek’s official website lists benchmark inference efficiency scores comparing DS-V3 with CG-4o and other mainstream models, showing that DS-V3 performs reliably, even surpassing some competitors in certain metrics. However, these “exam scores” only reflect models’ average performance in multiple-choice or constrained Q&A tasks, where models can be specifically optimised, much like “teaching to the test”. But real-world AI tasks go far beyond answering exam questions — the real challenge lies in breadth of knowledge, flexible retrieval and deep investigation. High scores in a controlled environment do not guarantee dominance in the real world; an AI’s true capabilities are seen when it faces unpredictable, real-life task prompts.

Battle of the AIs

To find out the strengths, weaknesses and suitable applications of each model, we conducted three rounds of tests from a scientific perspective on the first two days of Chinese New Year. CG-4o and DS-V3 are all-rounders, excelling in general knowledge and reasoning, making them suitable for a variety of tasks. CG-o1 and DS-R1, meanwhile, shine in specific tasks but have varying strengths and weaknesses when handling more complex or open-ended problems.

Three rounds of testing were conducted around the themes of “cultural research”, “creative writing” and “planning and decision-making”, spanning multidimensional abilities such as knowledge accuracy, command of language style, logical reasoning and task execution.

For each round of testing, each of the four models generated two responses. We selected the better response from each model as its “final submission” for comparison, and scored it against six criteria: accuracy of content, structural coherence, completeness of expression, clarity of language, relevance to the theme, and innovativeness.
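The selection-and-scoring step described above can be sketched in a few lines of code. This is purely illustrative: the criterion keys and the example scores below are our own assumptions, not the actual data from the tests.

```python
# Hypothetical sketch of the article's scoring scheme: each model submits two
# responses; the better one is kept as the "final submission" and rated on a
# 0-5 scale across six criteria. All scores below are made-up examples.

CRITERIA = [
    "accuracy", "coherence", "completeness",
    "clarity", "relevance", "innovativeness",
]

def best_submission(responses):
    """Pick the response with the higher total score across all criteria."""
    return max(responses, key=lambda r: sum(r[c] for c in CRITERIA))

def average_score(response):
    """Mean score over the six criteria, on the 0-5 scale."""
    return sum(response[c] for c in CRITERIA) / len(CRITERIA)

# Illustrative pair of responses from one model (invented numbers)
responses = [
    {"accuracy": 4, "coherence": 4, "completeness": 3,
     "clarity": 4, "relevance": 5, "innovativeness": 3},
    {"accuracy": 5, "coherence": 4, "completeness": 4,
     "clarity": 4, "relevance": 5, "innovativeness": 4},
]
best = best_submission(responses)
print(round(average_score(best), 2))
```

In this sketch, the second response wins on total score and its per-criterion mean becomes the model’s headline rating for the round.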

Test 1: names of horses

This test requires the models to verify the ancient names and definitions of 41 horse types (for example, zhui (骓) refers to a horse with a pale/greyish-white coat; while ju (驹) refers to a horse under two years old).

DS-V3 merely repeated the list item by item, correcting some errors. Its scores across all six evaluation criteria ranged from 2/5 to 3.5/5. CG-4o, DS-R1 and CG-o1 all provided additional historical context, modern applications and sentence examples. Among them, DS-R1 and CG-o1 even conducted in-depth scholarly research, with detailed content and clear logic.  

The strongest performer overall was CG-o1, which demonstrated a thorough thought process and precise analysis, earning a perfect score of 5/5. DS-R1 was better in research but had a more academic tone, resulting in a slightly lower clarity of expression (3.5/5) compared to CG-o1’s 4.5/5. CG-4o demonstrated fluent language and rich cultural supplementary information, making it suitable for the general reader.

Test 2: an imitation of Lu Xun’s diction

The four models were asked to write a satirical essay in the style of Chinese writer and literary critic Lu Xun’s prose, avoiding internet slang and limiting themselves to literary expression. The essays were also expected to demonstrate Lu Xun’s critical spirit, writing style and mode of thought.

DS-R1’s “The True Story of a Screen Slave” came closest to capturing Lu Xun’s style. It was rich in symbolism and allegory, satirising phone worship through the fictional deity “Instant Manifestation of the Great Joyful Celestial Lord” and incorporating symbolic settings like the “Phone Abstinence Society”, earning a perfect 5/5 for creativity and depth of expression. Reading it was like seeing Lu Xun reborn, with a pen in hand satirising humanity.

CG-o1’s “The Cage of Freedom” offered a solemn and analytical critique of social media addiction. It was logically sound and philosophically rich, but less symbolic, while still maintaining a certain degree of Lu Xun’s style (depth of expression: 4.5/5). CG-4o’s “The Biography of the Heads-Down Tribe” delivered a powerful critique with a proper structure, suitable for modern essay styles. However, its level of satire was relatively mild (depth of expression: 4/5). DS-V3’s “Diary of a Smartphone Madman” was relatively plain, resembling a personal reflection, with a weaker level of satire and depth of critique (depth of expression: 3/5).

Overall, DS-R1 most successfully captured Lu Xun’s style and excelled in allegorical satire; CG-o1 leaned more towards rational analysis, while CG-4o was best suited to a general audience. DS-V3, on the other hand, lacked distinctiveness.

Test 3: crafting a spring cleaning plan

The four AI models were challenged to create a seven-day Chinese New Year cleaning plan, progressing from easier to harder tasks, and offering advice on overcoming hoarding tendencies.

DS-R1 gamified decluttering with features like reminder cards and celebratory music, emphasising psychological growth and mindset shifts. CG-o1 offered a pragmatic, logically rigorous approach based on three decluttering principles. CG-4o provided a structured daily cleaning plan targeting specific areas, effectively integrating psychological advice with practical application. DS-V3 presented a sound structure but lacked detail; its task arrangements were haphazard and its psychological guidance was weak.

Rated on a five-point scale, DS-R1 came out on top in both psychological adjustment and creativity (both 5/5). CG-o1 was best when it came to execution and logic (both 5/5). CG-4o balanced psychological construction and operability (both 5/5), whereas DS-V3 served as a “summary” suitable for users who only need a rough guideline (execution and psychological adjustment both 3/5). Overall, DS-R1 made decluttering more immersive, CG-o1 was ideal for efficient execution, while CG-4o struck a compromise between the two.

No such thing as the ‘best’ AI

The three rounds of testing revealed the different focuses of the four models, emphasising that task suitability is an important consideration when choosing which model to use.

CG-4o is an all-rounder, suitable for broad application, while CG-o1 is clear in logic and well-researched, ideal for precise task execution. Meanwhile, DS-R1 excels in cultural expression and the use of symbols and allegories, thus making it suitable for creative tasks. DS-V3 is better for information organisation or general direction guidance, ideal for those needing a TL;DR (too long; didn’t read — a quick summary, in other words).

Additionally, the training data of each model affects its performance on specific tasks. For instance, DS-R1 performed well in tests imitating Lu Xun’s style, possibly due to its rich Chinese literary corpus, but if the task was changed to something like “write a job application letter for an AI engineer in the style of Shakespeare”, ChatGPT might outshine it.

It is important to note that these test results merely reflect present circumstances — the situation could change with model upgrades. Furthermore, this test is only applicable to Chinese text generation tasks, and does not cover programming, mathematics or multilingual capabilities. Different users have different needs; the best AI model is the one most suited to users’ requirements.

The generative AI competition is certainly unfolding with great anticipation. Alibaba swiftly announced Qwen2.5-Max on the first day of Chinese New Year, clearly not wanting DeepSeek to steal all the limelight. If DeepSeek is indeed able to create an AI comparable to top-tier models at just 10% of the training cost, it would undoubtedly be a disruptive breakthrough in cost-effectiveness that could potentially reshape the entire industrial ecosystem.

The AI face-off is just beginning to unfold. Rather than being swayed by marketing and reviews, why not try them yourself? Ultimately, the strengths and weaknesses of a model can only be verified through practical application. More importantly, AI evolution never stops; the standing of a model today does not determine its prospects tomorrow. Instead of clinging to outdated assumptions, it would be better to approach AI with an open mind by testing and experimenting with various models to truly make AI a helpful assistant.

This article was first published in Lianhe Zaobao as “当DeepSeek遇上ChatGPT”.

Copyright © 2025 SPH Media Limited.