
All Our N-gram are Belong to You


Google's Very Large 5-gram Language Model

----------------------------------

Chapter 14 of "Beautiful Data" covers Google's very large 5-gram language model.

Readers interested in this model can refer to the post reproduced below.

----------------------------------

The Google Research Blog post "Official Google Research Blog: All Our N-gram are Belong to You"

-----------------------------------

Posted by Alex Franz and Thorsten Brants, Google Machine Translation Team


Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google's datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more - resulting in a training corpus of one trillion words from public Web pages.

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That's why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear fewer than 200 times.
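
To make the frequency cutoffs concrete, here is a minimal single-machine sketch of counting n-grams and discarding the rare ones. It is only an illustration of the idea: the function name `count_ngrams`, the toy sentences, and the cutoff of 2 are assumptions for the example, and the real counts were produced on distributed infrastructure that this post does not describe.

```python
from collections import Counter

def count_ngrams(sentences, n, min_count):
    """Count n-grams over tokenized sentences and keep those seen
    at least min_count times (hypothetical helper, not Google's code)."""
    counts = Counter()
    for tokens in sentences:
        # Slide a window of n tokens across each sentence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy corpus; the published data uses a cutoff of 40 for five-word sequences.
sentences = [
    "the cat sat on the mat".split(),
    "the cat sat on the sofa".split(),
    "the dog sat on the mat".split(),
]
kept = count_ngrams(sentences, n=2, min_count=2)

for gram, c in sorted(kept.items()):
    print(" ".join(gram), c)
```

With a cutoff, bigrams such as "the sofa" that occur only once drop out of the table, which is the same effect the 40-occurrence threshold has on the published data.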

Watch for an announcement at the Linguistic Data Consortium (LDC), which will be distributing it soon, and then order your set of 6 DVDs. And let us hear from you - we're excited to hear what you will do with the data, and we're always interested in feedback about this dataset, or other potential datasets that might be useful for the research community.

Update (22 Sept. 2006): The LDC now has the data available in their catalog. The counts are as follows:
File sizes: approx. 24 GB compressed (gzip'ed) text files

Number of tokens:    1,024,908,267,229
Number of sentences:    95,119,665,584
Number of unigrams:         13,588,391
Number of bigrams:         314,843,401
Number of trigrams:        977,069,902
Number of fourgrams:     1,313,818,354
Number of fivegrams:     1,176,470,663

The following is an example of the 3-gram data contained in this corpus:
ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection </S> 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
ceramics collection | 59
ceramics collections , 66
ceramics collections . 60
ceramics combined with 46
ceramics come from 69
ceramics comes from 660
ceramics community , 109
ceramics community . 212
ceramics community for 61
ceramics companies . 53
ceramics companies consultants 173
ceramics company ! 4432
ceramics company , 133
ceramics company . 92
ceramics company </S> 41
ceramics company facing 145
ceramics company in 181
ceramics company started 137
ceramics company that 87
ceramics component ( 76
ceramics composed of 85
ceramics composites ferrites 56
ceramics composition as 41
ceramics computer graphics 51
ceramics computer imaging 52
ceramics consist of 92
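
Each line in the excerpt is an n-gram followed by its count, so a few lines of Python are enough to read it. The sketch below parses such lines and computes how the listed counts for one two-word context are split among its continuations. The names `parse_ngram_line` and `continuation_shares` are made up for this example, and whitespace separation is assumed from the excerpt as shown here. Because n-grams below the count cutoff are not published, these shares are relative to the listed continuations only and are not true conditional probabilities.

```python
def parse_ngram_line(line):
    """Split 'w1 w2 ... wn count' into (tuple of tokens, int count)."""
    words, count = line.rsplit(None, 1)
    return tuple(words.split()), int(count)

def continuation_shares(lines, context):
    """Relative frequency of each listed word following `context`."""
    counts = {}
    for line in lines:
        gram, count = parse_ngram_line(line)
        if gram[:-1] == context:
            counts[gram[-1]] = count
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: c / total for word, c in counts.items()}

# Lines copied from the 3-gram excerpt above.
SAMPLE = """\
ceramics collection , 144
ceramics collection . 247
ceramics collection </S> 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
ceramics collection | 59
"""

shares = continuation_shares(SAMPLE.splitlines(), ("ceramics", "collection"))
for word, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{share:.3f}")
```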

The following is an example of the 4-gram data in this corpus:
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
serve as the informal 102
serve as the information 838
serve as the informational 41
serve as the infrastructure 500
serve as the initial 5331
serve as the initiating 125
serve as the initiation 63
serve as the initiator 81
serve as the injector 56
serve as the inlet 41
serve as the inner 87
serve as the input 1323
serve as the inputs 189
serve as the insertion 49
serve as the insourced 67
serve as the inspection 43
serve as the inspector 66
serve as the inspiration 1390
serve as the installation 136
serve as the institute 187
serve as the institution 279
serve as the institutional 461
serve as the instructional 173
serve as the instructor 286
serve as the instructors 161
serve as the instrument 614
serve as the instruments 193
serve as the insurance 52
serve as the insurer 82
serve as the intake 70
serve as the integral 68
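
The excerpts appear to be listed in sorted order. If the full files are sorted the same way (an assumption, not something this post states), every n-gram starting with a given prefix sits in one contiguous run, and binary search can find it without scanning the whole table. A rough sketch, using a handful of rows from the 4-gram excerpt and a hypothetical helper `lookup_prefix`:

```python
from bisect import bisect_left

# A few rows from the 4-gram excerpt, kept in the order shown there.
ROWS = [
    (("serve", "as", "the", "initial"), 5331),
    (("serve", "as", "the", "initiating"), 125),
    (("serve", "as", "the", "initiation"), 63),
    (("serve", "as", "the", "initiator"), 81),
    (("serve", "as", "the", "injector"), 56),
    (("serve", "as", "the", "inlet"), 41),
    (("serve", "as", "the", "inner"), 87),
    (("serve", "as", "the", "input"), 1323),
    (("serve", "as", "the", "inputs"), 189),
]
KEYS = [gram for gram, _ in ROWS]
COUNTS = [count for _, count in ROWS]

def lookup_prefix(keys, counts, prefix):
    """Return (ngram, count) pairs whose leading tokens equal `prefix`.
    Assumes `keys` is sorted as tuples of tokens."""
    i = bisect_left(keys, prefix)  # first key that is >= prefix
    found = []
    while i < len(keys) and keys[i][:len(prefix)] == prefix:
        found.append((keys[i], counts[i]))
        i += 1
    return found

print(lookup_prefix(KEYS, COUNTS, ("serve", "as", "the", "input")))
print(len(lookup_prefix(KEYS, COUNTS, ("serve", "as", "the"))))
```

For the full 24 GB dataset the table would not fit in a Python list like this; the same idea would need an on-disk index, which is outside the scope of this sketch.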

Reposted from: https://my.oschina.net/nord/blog/109747
