Startup World

The success of OpenAI's o1 series and DeepSeek-R1 has demonstrated the power of large-scale reinforcement learning (RL) in eliciting sophisticated reasoning behaviors and significantly enhancing the capabilities of large language models (LLMs). However, the core training methodologies behind these reasoning models are often left undisclosed in their technical reports.
Recent community efforts have predominantly focused on mathematical reasoning, leaving the challenge of cross-domain generalization largely unexplored.
Furthermore, standard Group Relative Policy Optimization (GRPO) training is plagued by common issues such as performance bottlenecks, inefficient sample utilization, and difficulties in cultivating specialized reasoning skills when dealing with mixed-domain datasets.
These challenges complicate the effective scaling of RL methods for LLMs.

Addressing these limitations, researchers from the Kwaipilot team at Kuaishou have introduced a novel reinforcement learning framework: Two-Staged history-Resampling Policy Optimization (SRPO).
This innovative approach is designed to systematically tackle the aforementioned training challenges across multiple dimensions.
The team has publicly released a technical report detailing their training method and has also open-sourced the SRPO-Qwen-32B model.

Notably, this work marks the first instance of achieving DeepSeek-R1-Zero-level performance concurrently in both mathematical and code domains.
Using the same base model as DeepSeek-R1-Zero-32B (Qwen2.5-32B) and a purely reinforcement learning training approach, SRPO scores 50 on AIME24 and 41.6 on LiveCodeBench, surpassing the performance of DeepSeek-R1-Zero-32B.

Even more remarkably, SRPO achieves this level of performance with only one-tenth of the training steps required by R1-Zero.

Challenges with Vanilla GRPO

In their initial explorations, the Kwaipilot team experimented with the standard GRPO algorithm.
However, they quickly encountered bottlenecks that prevented the model from reaching the desired R1-Zero performance levels.
These issues included:

Cross-Domain Optimization Conflicts (Math vs. Code): Mathematical problems tend to elicit longer and more detailed reasoning trajectories (Long CoT), while code data shows a weaker tendency to do so. Directly mixing the two data types led to conflicts, resulting in suboptimal performance in both domains.

Reduced Training Efficiency due to Similar Group Rewards: The GRPO algorithm relies on the variance of non-zero rewards within a sampled group to calculate the advantage. When rollouts within a group yield nearly identical reward values, the calculated advantage approaches zero. If a significant portion of the training batch exhibits this phenomenon, effective gradient contributions become minimal, drastically reducing training efficiency.

Premature Performance Saturation: GRPO training encountered early performance plateaus and reward saturation on benchmark evaluations. This issue was partly attributed to insufficient data quality. When the training data lacks sufficient complexity or diversity, particularly with an abundance of simpler problems, the model tends to conservatively maintain its performance on easier tasks, hindering its ability to develop the complex and in-depth reasoning required for challenging problems.

Two-Staged Training

To address the inherent response length conflicts between mathematical and code domains, the Kwaipilot team implemented a two-stage training paradigm:

Stage 1: Eliciting Reasoning Abilities: This initial training phase focuses exclusively on challenging mathematical data.
The primary goal is to fully incentivize the model's test-time scaling, fostering capabilities such as reflective pausing, backtracking, and step-by-step decomposition.

Stage 2: Skill Integration: In this stage, code data is introduced into the training process. Building upon the reasoning foundation established in Stage 1, this phase aims to further enhance coding abilities while progressively strengthening procedural thinking, recursion, and tool-calling capabilities.

Comparative Analysis of Training Strategies

The impact of different training data strategies on response length was analyzed, revealing the following insights:

Mixed Training: Models trained on a mixture of math and code data showed limited growth in response length and poor benchmark performance.
While math problems elicited some reasoning patterns, code problems often resulted in short, direct responses focused on immediate code output with minimal preliminary analysis or planning.

Math-Only Training: Training solely on mathematical data led to a stable increase in response length and excellent performance on math benchmarks. Crucially, it fostered strong and generalizable reasoning abilities: when faced with programming tasks, the model attempted detailed, step-by-step reasoning, including the meticulous checking and revisiting of steps typical of mathematical problem-solving.

Code-Only Training: Performance on code benchmarks improved, but explicit reasoning behavior developed only minimally, and significant increases in response length proved difficult to achieve. Responses to both code and math problems were noticeably shorter than with math-only training, and code solutions were often generated directly, without substantial step-by-step reasoning or initial analysis.

Staged Training: The two-stage training approach proposed by the Kwaipilot team yielded superior results in both mathematical and programming domains.
The model consistently generated detailed step-by-step reasoning for math problems and structured reasoning patterns for programming tasks.
Notably, complex behaviors emerged, such as the model spontaneously utilizing code to assist in mathematical reasoning.

History Resampling

The Kwaipilot team observed that during the mid-to-late stages of training, nearly 50% of the sampled groups within a batch produced identical rewards.
This often occurred when the model consistently succeeded on easier problems, leading to minimal reward variance and ineffective gradient updates.

To address this inefficiency and improve the quality of the gradient signal, they introduced History Resampling.
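Why identical rewards yield no gradient signal can be seen in a minimal sketch of the group-normalized advantage commonly used in GRPO, where each rollout's reward is centered on the group mean and divided by the group standard deviation. The function name `group_advantages`, the binary 0/1 rewards, and the `eps` stabilizer are assumptions made for illustration, not details taken from the SRPO report.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: (r - group mean) / (group std + eps).

    `eps` is an illustrative stabilizer (an assumption) that avoids
    division by zero when every reward in the group is the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with mixed outcomes carries a usable learning signal.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# A group where every rollout is correct has zero variance,
# so every advantage is exactly zero and the group adds no gradient.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
```

With nearly half of the groups in a batch behaving like `uniform`, roughly half of the rollout compute produces no update, which is the inefficiency the resampling step targets.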
During training, they recorded the reward outcomes of all rollouts within each epoch. At the end of an epoch, they reconstructed the dataset for the next epoch based on the following criteria:

Filtering Overly Simple Samples: Samples where all rollouts resulted in correct answers were excluded, as they provided no informative signal for policy improvement.

Retaining Informative Samples: Samples with diverse outcomes (both correct and incorrect) were retained, since they generate positive reward variance and therefore non-zero advantages and effective gradient signals. Difficult samples where all rollouts were incorrect in the current epoch were also kept. The rationale is that these initially challenging problems might become relatively easier for the updated policy, thus generating effective gradients in subsequent training.
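The two criteria above can be sketched as a filter over the rewards recorded during an epoch. This is a minimal illustration under stated assumptions: rewards are binary (1.0 for a correct rollout, 0.0 otherwise), and the function name `resample_for_next_epoch` and the `history` mapping are hypothetical, not the team's actual implementation.

```python
from typing import Dict, List

def resample_for_next_epoch(history: Dict[str, List[float]],
                            correct: float = 1.0) -> List[str]:
    """Return the ids of prompts to keep for the next epoch.

    `history` maps a prompt id to the rewards of all its rollouts in the
    epoch that just finished (hypothetical structure, binary rewards assumed).
    """
    kept = []
    for prompt_id, rewards in history.items():
        # Overly simple: every rollout was correct, so the group has
        # zero reward variance and contributes no gradient. Drop it.
        if rewards and all(r == correct for r in rewards):
            continue
        # Mixed outcomes (informative now) and all-incorrect prompts
        # (possibly informative once the policy improves) are both kept.
        kept.append(prompt_id)
    return kept

history = {
    "easy": [1.0, 1.0, 1.0, 1.0],
    "mixed": [1.0, 0.0, 0.0, 1.0],
    "hard": [0.0, 0.0, 0.0, 0.0],
}
next_epoch = resample_for_next_epoch(history)
```

Because the filter runs once per epoch on rewards that were already collected, it needs no extra rollouts, only the bookkeeping of per-prompt outcomes.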
This strategy aligns with the principle of curriculum learning, gradually exposing the model to samples that are, on average, increasingly challenging in order to enhance training efficiency.

Compared to the Dynamic Sampling method proposed in DAPO, History Resampling significantly improved computational efficiency and resulted in more stable response length growth.

Data

The Kwaipilot team performed meticulous data cleaning and filtering on publicly available code and math datasets.
They applied heuristic rules to filter out irrelevant URLs and formatting noise, and verified the completeness of core fields (question and answer ground truth) in the original data.
Following the data cleaning approach of PRIME for mathematical data, they removed multi-part questions, pure proof-based problems, and those requiring image or table understanding.
For code data, they excluded problems dependent on specific environments, file I/O, or network interactions, focusing on algorithmic logic.

Before data ingestion, they conducted correctness verification for both math and code problems to ensure the accuracy and solvability of the answers, discarding those with incorrect or ambiguous solutions.
Subsequently, they assessed the difficulty of each problem, categorizing them into easy, medium, and hard levels based on their pass rate (Pass@k).

Experimental Results

This section details the experimental results obtained using the SRPO method.
The Kwaipilot team focused on observing the changes in reward and metrics such as response length during training.

Training Process

The figure above illustrates the complete reward curve and response length curve during SRPO training.
After the initial reward growth began to plateau, the training transitioned into the second stage.
At the beginning of the second stage, the overall reward decreased because the model had not previously been trained on code, then rose steadily during subsequent training.
Integrating code data did not significantly increase the response length, which aligned with their expectations.
Simultaneously, benchmark results indicated a continuous and stable improvement in both the mathematical and coding abilities of the model, demonstrating the effectiveness of the new method.

Specifically, History Resampling ensured that gradient updates remained effective at each training step, directly increasing the proportion of informative gradients.
This enhanced sampling efficiency led to stable reward growth, clearly showcasing the improved training efficiency achieved by the resampling strategy.

Reasoning Behaviors

The Kwaipilot team identified three representative reflective patterns: recheck, hesitation, and exploration.
They statistically analyzed responses containing these patterns and recorded the average response length for each.
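A rough sketch of this kind of analysis follows: it tags responses by pattern and reports how often each pattern appears along with the average length of the matching responses. The trigger phrases in `PATTERNS`, the function name `pattern_stats`, and the use of whitespace-separated words as the length measure are all assumptions made for illustration; the article does not say how the team detected the patterns or measured length.

```python
import re
from statistics import mean

# Hypothetical trigger phrases for each pattern (assumptions, not from the report).
PATTERNS = {
    "recheck": re.compile(r"\b(recheck|double-check|verify again)\b", re.I),
    "hesitation": re.compile(r"\b(wait|hmm)\b", re.I),
    "exploration": re.compile(r"\b(alternatively|another approach)\b", re.I),
}

def pattern_stats(responses):
    """Per pattern: share of responses containing it and their average length in words."""
    stats = {}
    for name, rx in PATTERNS.items():
        hits = [r for r in responses if rx.search(r)]
        stats[name] = {
            "frequency": len(hits) / len(responses) if responses else 0.0,
            "avg_length": mean(len(h.split()) for h in hits) if hits else 0.0,
        }
    return stats

responses = [
    "Wait, let me recheck the sum.",
    "The answer is 4.",
    "Alternatively, try induction here now.",
]
stats = pattern_stats(responses)
```

Running the same function on responses sampled at different training steps would give the frequency trend described in the following sentences.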
During RL training, they observed a gradual increase in the frequency of the model's self-reflection, correction, and backtracking, indicating the emergence of a self-verification ability.
They posit that this reflection, akin to human cognitive processes, emerges during RL as an adaptive behavior resulting from the policy optimization process.

As shown in the figure above, the model exhibited almost no proactive checking of or reflection on previous reasoning steps in the early stages of training.
However, as training progressed, the model displayed significant reflective and backtracking behaviors, forming response patterns such as step-by-step reasoning, numerical substitution, step-by-step verification, and self-optimization.

Interestingly, they also discovered that the model learned to spontaneously use program code for verification when solving mathematical problems.
It would first provide a solution process through mathematical reasoning and then proactively write program code to verify the correctness of the solution.
These instances demonstrated the model's ability to leverage procedural thinking for self-correction and multiple attempts, further indicating that in the later stages of training, the model had mastered broad thinking and the integrated application of various code-based reasoning approaches for problem-solving.

The paper SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM is on arXiv.

Try the SRPO-Qwen-32B model on HuggingFace.




