Query optimization and pagination tests on a MySQL database with tens of millions of rows

http://blog.sina.com.cn/s/blog_438308750100im0b.html

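My previous company was an online game company. The website's trading system was tied to the game database through web services (ws), but the transaction records were stored on the website side, in a MySQL database holding tens of millions of rows.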

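Some people may ask whether MySQL can handle tens of millions of rows, and why a company with that much data, which is presumably not short of money, would use MySQL instead of Oracle. Here are my answers.
1. MySQL can definitely handle tens of millions of rows.
2. Why choose MySQL?
1> First and most important, MySQL can do the job.
2> Given the first point, the rest matters less: MySQL is relatively simple to operate, easy to test, and comparatively easy to configure and tune.
3> Our data exists only to record transactions, so that every transaction is guaranteed to be logged. Queries are relatively rare; only the admin back end needs to query the database.
4> The data structure is simple and every row is very small. This matters because query speed depends not only on the number of rows but also directly on the size of the data file.
5> We use a small-table/large-table approach. Several million rows are inserted per day. Some may still doubt this, but it is not a problem: with batch inserts, in my test on an ordinary PC with roughly a single thread, inserting 60 million rows reported "JDBC insert of 60 million rows took: 9999297 ms" (about 2.8 hours). The small table holds recently inserted rows, and rows from a few days ago are moved to the large table. The large table I am talking about holds about 60 to 70 million rows.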

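With these doubts and some curiosity, let's run a test. At that time I was not a DBA and did not know how the DBAs managed such a volume of data, though we often discussed related topics.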

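1. In MySQL, large fields and small fields should be stored separately. This is worth doing unless your queries only touch indexed columns rather than table data, for example when you only select id.
2. Query speed depends heavily on indexes, and the size of an index directly affects query performance. Columns used in query conditions must be indexed, but do not index too many columns: the index file becomes large and searches only get slower.
3. Fetch specific records with an IN query on id. This is not merely the best way, it is a must: first query the list of matching ids, then use IN to fetch the actual data.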

  Let's run a test on the ipdatas table:
  CREATE TABLE `ipdatas` (
   `id` INT(11) NOT NULL AUTO_INCREMENT,
   `uid` INT(8) NOT NULL DEFAULT '0',
   `ipaddress` VARCHAR(50) NOT NULL,
   `source` VARCHAR(255) DEFAULT NULL,
   `track` VARCHAR(255) DEFAULT NULL,
   `entrance` VARCHAR(255) DEFAULT NULL,
   `createdtime` DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00',
   `createddate` DATE NOT NULL DEFAULT '0000-00-00',
   PRIMARY KEY (`id`),
   KEY `uid` (`uid`)
  ) ENGINE=MYISAM AUTO_INCREMENT=67086110 DEFAULT CHARSET=utf8;
This is the table that records promotion IP data for our ad affiliate network. Since I am not a MySQL DBA, this is only a test.
  The table originally held about 7,015,291 rows.

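We then inserted 60 million rows into this table with JDBC batch inserts: "JDBC insert of 60 million rows took: 9999297 ms".
That is a bit over two and a half hours. The batch size was a little over 10,000 rows per commit, and each row was very small. The table uses MyISAM because I wanted to see the size of the data file and the index file. The result:
ipdatas.MYD 3.99 GB (4,288,979,008 bytes)
ipdatas.MYI 1.28 GB (1,377,600,512 bytes)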
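
The batched insert described above can be sketched roughly as follows. This is only an illustration, not the original JDBC program: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, and the helper name insert_in_batches, the reduced column set and the generated sample rows are assumptions made for the example.

```python
import sqlite3
from itertools import islice


def insert_in_batches(conn, rows, batch_size=10_000):
    """Insert rows in fixed-size batches, committing once per batch."""
    sql = "INSERT INTO ipdatas (uid, ipaddress, createdtime) VALUES (?, ?, ?)"
    rows = iter(rows)
    total = 0
    while True:
        # Pull at most batch_size rows from the iterator.
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        conn.executemany(sql, batch)
        conn.commit()
        total += len(batch)
    return total


# SQLite in memory stands in for MySQL here (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ipdatas ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " uid INTEGER NOT NULL DEFAULT 0,"
    " ipaddress TEXT NOT NULL,"
    " createdtime TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_uid ON ipdatas (uid)")

# Hypothetical sample data: 25,000 small rows.
sample_rows = (
    (i % 100, "10.0.%d.%d" % (i // 256 % 256, i % 256), "2010-01-01 00:00:00")
    for i in range(25_000)
)
inserted = insert_in_batches(conn, sample_rows)
row_count = conn.execute("SELECT COUNT(*) FROM ipdatas").fetchone()[0]
print(inserted, row_count)
```

Committing once per batch rather than once per row keeps the number of round trips and commits low, which is the point of the roughly 10,000-row batches mentioned above.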
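One point I want to make here: with really large data, if a time column needs an index, it is better to store it as a numeric column. Both the index size and the query speed come out better than with a time-typed column.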
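
One way to read the advice about numeric time columns is to store the time as integer Unix seconds and index that column. The sketch below assumes exactly that; the table visits, the column created_ts, the helper to_epoch_seconds and the choice to treat the stored times as UTC are all hypothetical, and SQLite (Python's built-in sqlite3) is used as a stand-in for MySQL.

```python
import sqlite3
from datetime import datetime, timezone


def to_epoch_seconds(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return integer Unix seconds."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (id INTEGER PRIMARY KEY, created_ts INTEGER NOT NULL)"
)
# The index is built on the integer column instead of a DATETIME column.
conn.execute("CREATE INDEX idx_created_ts ON visits (created_ts)")

for text in ("2010-01-01 00:00:00", "2010-01-01 12:30:00", "2010-01-02 00:00:00"):
    conn.execute(
        "INSERT INTO visits (created_ts) VALUES (?)", (to_epoch_seconds(text),)
    )

# A date range becomes a plain integer range on the indexed column.
start = to_epoch_seconds("2010-01-01 00:00:00")
end = to_epoch_seconds("2010-01-02 00:00:00")
same_day = conn.execute(
    "SELECT COUNT(*) FROM visits WHERE created_ts >= ? AND created_ts < ?",
    (start, end),
).fetchone()[0]
print(start, same_day)
```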

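  Now to the main topic:
  1. Full-table counts
Result: 67,015,297 rows
   SELECT COUNT(id) FROM ipdatas;
   SELECT COUNT(uid) FROM ipdatas;
   SELECT COUNT(*) FROM ipdatas;
  First, these full-table count queries are all very fast. MySQL presumably keeps the table's row count in its metadata (MyISAM does store an exact row count, so an unfiltered COUNT does not need to scan the table).
Counting with a condition on an indexed column
   SELECT COUNT(*) FROM ipdatas WHERE uid=1;  time: 2 min 31.594 s
   SELECT COUNT(id) FROM ipdatas WHERE uid=1;  time: 1 min 29.609 s
   SELECT COUNT(uid) FROM ipdatas WHERE uid=1; time: 2 min 41.813 s
  The second run of each query is much faster because MySQL has caches, so enlarging the caches can solve many query performance problems. Caching really is everywhere: in application development too, there is a cache at every layer.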
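Querying rows
   Starting from the first row
   SELECT * FROM ipdatas ORDER BY id DESC LIMIT 1,10 ; 31 ms
   SELECT * FROM ipdatas LIMIT 1,10 ; 15 ms

   Starting from row 10,000
   SELECT * FROM ipdatas ORDER BY id ASC LIMIT 10000,10 ; 266 ms
   SELECT * FROM ipdatas LIMIT 10000,10 ; 16 ms

   Starting from row 5,000,000
   SELECT * FROM ipdatas LIMIT 5000000,10 ; 11.312 s
   SELECT * FROM ipdatas ORDER BY id ASC LIMIT 5000000,10 ; 221.985 s
  These two queries return exactly the same rows, which suggests the default order here is ascending id (in practice MyISAM returns rows in storage order, and without ORDER BY the order is not guaranteed). The times, however, differ enormously.

   Starting from row 50,000,000
   SELECT * FROM ipdatas LIMIT 60000000,10 ; 66.563 s (compare with the tests below)
   SELECT * FROM ipdatas ORDER BY id ASC LIMIT 50000000,10; 1060.000 s
   SELECT * FROM ipdatas ORDER BY id DESC LIMIT 17015307,10; 434.937 s
  The third query returns the same rows as the second, only with the opposite sort direction, yet the times differ considerably. In this respect MySQL seems to lag behind many commercial databases: Oracle and SQL Server may be slow in the middle of a table but are fine at both ends, whereas MySQL gets slower the further back the start row is. The conclusion is to avoid sorting whenever you can. The performance gap is huge, about 20 times at the 5-million offset (221.985 s versus 11.312 s).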

Querying only the list of ids
   Starting from the first row
   SELECT id FROM ipdatas ORDER BY id ASC LIMIT 1,10; 31 ms
   SELECT id FROM ipdatas LIMIT 1,10 ; 0 ms

   Starting from row 10,000
   SELECT id FROM ipdatas ORDER BY id ASC LIMIT 10000,10; 68 ms
   SELECT id FROM ipdatas LIMIT 10000,10; 0 ms

   Starting from row 5,000,000
   SELECT id FROM ipdatas LIMIT 5000000,10; 1.750 s
   SELECT id FROM ipdatas ORDER BY id ASC LIMIT 5000000,10; 14.328 s

   Starting from row 60,000,000
   SELECT id FROM ipdatas LIMIT 60000000,10; 116.406 s
   SELECT id FROM ipdatas ORDER BY id ASC LIMIT 60000000,10; 136.391 s

   select id from ipdatas limit 10000002,10; 29.032s
   select id from ipdatas limit 20000002,10; 24.594s
   select id from ipdatas limit 30000002,10; 24.812s
   select id from ipdatas limit 40000002,10; 28.750s  84.719s
   select id from ipdatas limit 50000002,10; 30.797s  108.042s
   select id from ipdatas limit 60000002,10; 133.012s  122.328s

   select * from ipdatas limit 10000002,10; 27.328s
   select * from ipdatas limit 20000002,10; 15.188s
   select * from ipdatas limit 30000002,10; 45.218s
   select * from ipdatas limit 40000002,10; 49.250s  50.531s
   select * from ipdatas limit 50000002,10; 73.297s  56.781s
   select * from ipdatas limit 60000002,10; 67.891s  75.141s

   select id from ipdatas order by id asc limit 10000002,10; 29.438s
   select id from ipdatas order by id asc limit 20000002,10; 24.719s
   select id from ipdatas order by id asc limit 30000002,10; 25.969s
   select id from ipdatas order by id asc limit 40000002,10; 29.860s
   select id from ipdatas order by id asc limit 50000002,10; 32.844s
   select id from ipdatas order by id asc limit 60000002,10; 34.047s

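   As for SELECT * FROM ipdatas ORDER BY id ASC, I did not run the full series; each query takes somewhere over ten minutes.
   So compared with SELECT *, selecting only id makes little difference without sorting, but a huge difference once sorting is added.
   Now look at this statement:
   SELECT * FROM ipdatas WHERE id IN (10000,100000,500000,1000000,5000000,10000000,2000000,30000000,40000000,50000000,60000000,67015297);
   Time: 0.094 ms
  So the cost of an IN query on id is negligible, even against more than 60 million rows. This is why many Lucene or Solr based search setups return only ids and then go back to the database for the actual data. Lucene/Solr plus MySQL is a good combination and fits front-end search very well: paginated front-end search performs very well this way, grouped result sets are well supported, and the real record data is then fetched by id for display. The results are really good. Not just tens of millions of rows, even hundreds of millions are no problem. Highly recommended.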
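
The two-step pattern recommended above (page over ids first, then fetch the full rows with IN) might look roughly like this. It is a minimal sketch under stated assumptions: Python's built-in sqlite3 with an in-memory database stands in for MySQL, the table has a reduced column set, and fetch_page is a hypothetical helper name.

```python
import sqlite3


def fetch_page(conn, offset, page_size):
    """Step 1: page over ids only. Step 2: fetch the full rows with IN."""
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM ipdatas ORDER BY id ASC LIMIT ? OFFSET ?",
            (page_size, offset),
        )
    ]
    if not ids:
        return []
    # One placeholder per id keeps the query parameterized.
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT id, uid, ipaddress FROM ipdatas "
        "WHERE id IN (%s) ORDER BY id" % placeholders
    )
    return conn.execute(sql, ids).fetchall()


# SQLite in memory stands in for MySQL (assumption); sample data is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ipdatas ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " uid INTEGER NOT NULL,"
    " ipaddress TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO ipdatas (uid, ipaddress) VALUES (?, ?)",
    [(i % 7, "10.0.0.%d" % (i % 256)) for i in range(1000)],
)

page = fetch_page(conn, offset=500, page_size=10)
page_ids = [row[0] for row in page]
print(page_ids)
```

The first query touches only the primary key, which is where the timings above show the smallest cost; the wide rows are read only for the handful of ids on the page.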

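The tests above do not yet include queries with conditions; they only cover ORDER BY and LIMIT. Please watch for my next article, which tests conditional queries against 100 million rows.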

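I recently worked on a project that had to insert into and query a table holding tens of millions of rows. I had assumed that inserting into and querying a database was easy, but once the data reaches millions or even tens of millions of rows, everything seems to become quite difficult. After a lot of trial and error, I finally got the job done.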

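1. Avoid the Hibernate framework

Hibernate is convenient to use, but it struggles with operations on massive amounts of data.

On inserts:

I tried inserting about 50,000 rows in one go with Hibernate. With the ID generated by a sequence, Hibernate fetched 50,000 sequence values from the database in 50,000 separate calls, built the corresponding objects, and then saved them to the database in another 50,000 calls. It took ten minutes. Most of the time was spent not on the inserts but on the 50,000 round trips to fetch sequence values, which was quite frustrating. Changing the ID generation strategy to increment later solved the problem, but I still shudder at that ten-minute wait.

On queries:

Hibernate's approach to database queries is fundamentally object-oriented, which means a lot of data we do not need is loaded and takes up considerable system resources (both database resources and local resources). Out of fondness for Hibernate, and unwilling to give up on it, I tried quite a lot of things, including hand-written SQL and tuning the SQL, but all of them failed, and I had to reluctantly let it go.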

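2. List the queried columns one by one in your SELECT statements

Do not write queries like select * from x_table. Prefer select id,name from x_table, so that you do not waste resources fetching data you do not need. With massive data, the resources and query time taken by a single column are considerable.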

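3. Remove unnecessary query conditions

A typical query flow is that the front end submits a search form, and the back end parses the form and runs the query. When parsing the form, for convenience we often replace conditions that were not filled in with always-true conditions (for example: select count(id) from x_table where name like '%'). Such SQL wastes a frightening amount of resources. In my test against the same table of nearly ten million rows, select count(id) from x_table took 11 seconds, while select count(id) from x_table where name like '%' took 33 seconds.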
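
One way to avoid the always-true condition is to build the WHERE clause only from the form fields that were actually filled in. The sketch below illustrates the idea; the function name build_count_query, the status field and the whitelist of conditions are hypothetical and not taken from the original project.

```python
def build_count_query(filters):
    """Build a COUNT query containing only the conditions the user supplied."""
    # Whitelist of filterable fields; "status" is a made-up example column.
    allowed = {"name": "name LIKE ?", "status": "status = ?"}
    clauses = []
    params = []
    for field, value in filters.items():
        # Skip empty inputs and the match-everything pattern entirely,
        # instead of turning them into an always-true condition.
        if value in (None, "", "%"):
            continue
        if field not in allowed:
            raise ValueError("unsupported filter: %s" % field)
        clauses.append(allowed[field])
        params.append(value)
    sql = "SELECT COUNT(id) FROM x_table"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


no_filter = build_count_query({"name": "%", "status": ""})
one_filter = build_count_query({"name": "abc%", "status": None})
print(no_filter)
print(one_filter)
```

With no usable filters the function returns the plain count query, which is the faster of the two statements measured above.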

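4. Avoid table joins in queries

When querying massive data, avoid joins as far as possible, especially left and right joins. If a join is unavoidable, the other table being joined must not be large. If the joined table also holds tens of thousands of rows, you should probably consider redesigning the schema, because the waiting time will be more than any normal user can tolerate.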

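5. In nested queries, narrow the range as much as possible in the first select

When several selects are nested, narrow the range down as far as possible in the innermost query, and paginate there if you can. Very often, simply moving the pagination into the inner query makes a qualitative difference to query performance.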
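
Moving the pagination into the innermost query can be sketched as follows: the inner select picks one page of ids, and only those rows are joined to the other table. This is an illustration under stated assumptions: the orders and users tables and their data are hypothetical, and Python's built-in sqlite3 with an in-memory database stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL,"
    " amount INTEGER NOT NULL)"
)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(i, "user%d" % i) for i in range(1, 11)],
)
conn.executemany(
    "INSERT INTO orders (user_id, amount) VALUES (?, ?)",
    [(i % 10 + 1, i) for i in range(1, 1001)],
)

# The innermost query paginates over ids only; the join then touches
# just the page_size rows it returns instead of the whole table.
PAGED_SQL = """
SELECT o.id, u.name, o.amount
FROM (
    SELECT id FROM orders ORDER BY id LIMIT ? OFFSET ?
) AS page
JOIN orders AS o ON o.id = page.id
JOIN users AS u ON u.id = o.user_id
ORDER BY o.id
"""

rows = conn.execute(PAGED_SQL, (5, 100)).fetchall()
print(rows)
```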

That is all. I hope it helps anyone who runs into similar problems!

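I. Nathan's reflections

I am quite fond of a blog post by Nathan (the creator of Apache Storm), "You should blog even if you have no readers", which can serve as an answer to this question. It reads as follows: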

Spencer Fry wrote a great post on "Why entrepreneurs should write." I would further add that the benefits of writing are so extraordinary that you should write a blog even if you have no readers (and regardless of whether you're an entrepreneur).
I have over 50 unfinished drafts. Some of them are just a few ideas scribbled down arguing with myself. Most of them will never be published, yet I got value out of writing all of them.
Writing makes you a better reader
Blogging has changed how I read other people's writing.
In struggling to find the right ways to structure and present my posts, I am much more attuned to what makes a good argument and what makes a bad argument. I am better at seeing holes in other people's reasoning.
At the same time, when reading I am less likely to fall into the trap of discrediting a post with weak counterclaims. In most any post, there are likely to be counterclaims that are based on exceptional cases. Internet commenters love to point these out. However, these exceptional cases miss the main thrust of the post, and by understanding the implicit backdrop behind a post's argument, I get a lot more value out of reading.
I'm also more aware of the style of good writers. I mentally take note of the ways good writers phrase their ideas. I'd always enjoyed Paul Graham's writing, but now I really appreciate how he organizes his posts. He has an awesome ability to suck you into his world and show you what it looks like from his perspective. I've learned a lot about good writing from reading Bradford Cross's blog; his posts have a clear arc and make excellent use of short paragraphs to keep the posts flowing.
Writing makes you smarter
Writing reveals holes in your thinking. When your ideas are written and looking back at you, they're a lot less convincing than when they're just in your head. Writing forces you to mature your ideas by thinking through counterarguments.
Writing helps you organize your thoughts in a coherent way. This makes you a much better conversationalist when these topics come up. I can't count the number of times I've had deeper conversations with people because I had matured my ideas offline.
Consider anything else a side benefit
Everything else writing gives you -- personal branding, networking, inbound opportunities -- are just side benefits. They're potentially very large side benefits, but they are not the main reason you should write.
You should write because writing makes you a better person.