DeepSeek-affiliated Hangzhou DeepSeek AI Fundamental Technology Research Co., Ltd. today filed a patent for a new web data collection system designed to improve crawling efficiency and data quality. The patent outlines a method for discovering more webpage links while minimizing the traffic load placed on websites. It assesses already-downloaded content to predict the quality of links that have not yet been fetched, prioritizing high-value data and reducing redundant downloads. Efficient web data collection is crucial for training large language models (LLMs), which power AI systems like ChatGPT. Existing techniques struggle with incomplete link retrieval, excessive downloads that can crash websites, and poor filtering of low-quality data. DeepSeek’s proposed system aims to solve these issues by optimizing data allocation and maintaining metadata accuracy. [iThome, in Chinese]
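The report gives no implementation details, so the following is only a rough sketch of the general idea described above: score pages that have already been downloaded, use that score as a guess at the quality of the links they contain, fetch the most promising links first, and cap requests per site. It is not the method claimed in the patent. The class `CrawlFrontier`, the function `page_quality`, the `max_per_host` limit, and the toy scoring rule are all hypothetical names and choices made for illustration.

```python
import heapq
from urllib.parse import urlparse


def page_quality(text):
    """Toy quality score (an assumption): share of distinct words in the page."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


class CrawlFrontier:
    """Queue of not-yet-downloaded links, ordered by predicted quality."""

    def __init__(self, max_per_host=2):
        self._heap = []           # entries: (-score, insertion_order, url)
        self._seen = set()        # URLs already queued, to avoid repeat downloads
        self._host_counts = {}    # URLs handed out so far, per host
        self._max_per_host = max_per_host
        self._counter = 0         # tie-breaker that keeps insertion order

    def add_links(self, links, parent_quality):
        """Queue links found on a page, scored by that page's quality."""
        for url in links:
            if url in self._seen:
                continue          # skip duplicates instead of re-queuing them
            self._seen.add(url)
            # heapq is a min-heap, so negate the score to pop the best first.
            heapq.heappush(self._heap, (-parent_quality, self._counter, url))
            self._counter += 1

    def next_url(self):
        """Return the best remaining URL whose host is still under its cap."""
        while self._heap:
            _, _, url = heapq.heappop(self._heap)
            host = urlparse(url).netloc
            if self._host_counts.get(host, 0) >= self._max_per_host:
                continue          # host budget used up; drop this link
            self._host_counts[host] = self._host_counts.get(host, 0) + 1
            return url
        return None               # nothing left to fetch


if __name__ == "__main__":
    frontier = CrawlFrontier(max_per_host=1)
    frontier.add_links(["https://a.example/1", "https://a.example/2"],
                       page_quality("spam spam spam spam"))
    frontier.add_links(["https://b.example/1"],
                       page_quality("a detailed unique article"))
    print(frontier.next_url())
```

In this sketch a link inherits the score of the page it was found on, which is the simplest possible predictor; a real system would presumably use a richer model. The per-host cap stands in for the goal of limiting traffic impact on any single website.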