(本来只是随手记录没想到成了博客里最受欢迎的文章… 不过看看这个问题里排名第一的回答,这才叫大数据,不知道比我们高到哪里去了 ╮(╯_╰)╭)


之前下载过一些裤子,但是没怎么用过……一来是维护自己的社工库费时费力,二来在线的社工库也基本够用了。

后来导了一个10多G的裤子进 Mysql ,单表,数据有近2亿条,即使加了索引查起来也慢的不行,有时候还没有用暴力搜索文本内容的小工具查的快。于是查了下资料,试了试用 Coreseek 做全文索引,速度不错(毫秒级)。

官网下载最新版(4.1)的 Coreseek ,解压到某个目录,打开 etc 目录,里面有个 csft_mysql.conf 配置示例文件,贴一下我的配置文件(命名为 sed.conf ):

#源定义
source sed
{
    type                    = mysql

    sql_host                = localhost
    sql_user                = test
    sql_pass                = test
    sql_db                    = sed
    sql_port                = 3306
    sql_query_pre            = SET NAMES utf8

    sql_query                = SELECT `id`, `username`, `password`, `email`, `salt`, `order` FROM md5_sgk    #sql_query第一列id需为整数,username、password、email等作为字符串/文本字段,被全文索引
    sql_attr_uint            = id           #从SQL读取到的值必须为整数
    #sql_field_string    = username
    #sql_field_string    = password
    #sql_field_string    = email
    #sql_attr_timestamp        = date_added  #从SQL读取到的值必须为整数,作为时间属性

    sql_query_info_pre      = SET NAMES utf8                                        #命令行查询时,设置正确的字符集
    sql_query_info            = SELECT `id`, `username`, `password`, `email`, `salt`, `order` FROM md5_sgk WHERE id=$id #命令行查询时,从数据库读取原始数据信息
}

#index定义
index sed
{
    source            = sed           #对应的source名称
    path            = E:/SQL_DATA/coreseek/var/data/sed #请修改为实际使用的绝对路径,例如:/usr/local/coreseek/var/...
    docinfo            = extern
    mlock            = 0
    morphology        = none
    min_word_len        = 1
    ondisk_dict     = 1   #索引不载入内存而是保存在硬盘上
    html_strip                = 0

    #中文分词配置,详情请查看:http://www.coreseek.cn/products-install/coreseek_mmseg/
    #charset_dictpath = /usr/local/mmseg3/etc/ #BSD、Linux环境下设置,/符号结尾
    charset_dictpath = E:/SQL_DATA/coreseek/etc/ #Windows环境下设置,/符号结尾,最好给出绝对路径,例如:C:/usr/local/coreseek/etc/...
    charset_type        = zh_cn.utf-8
}

#全局index定义
indexer
{
    mem_limit            = 1024M
}

#searchd服务定义
searchd
{
    listen                  =   9312
    read_timeout        = 5
    max_children        = 30
    max_matches            = 1000
    seamless_rotate        = 0
    preopen_indexes        = 0
    unlink_old            = 1
    pid_file = E:/SQL_DATA/coreseek/var/log/searchd_mysql.pid  #请修改为实际使用的绝对路径,例如:/usr/local/coreseek/var/...
    log = E:/SQL_DATA/coreseek/var/log/searchd_mysql.log        #请修改为实际使用的绝对路径,例如:/usr/local/coreseek/var/...
    query_log = E:/SQL_DATA/coreseek/var/log/query_mysql.log #请修改为实际使用的绝对路径,例如:/usr/local/coreseek/var/...
    binlog_path =                                #关闭binlog日志
}  

具体的配置信息可以看看官方手册

有些资料里使用了 sql_field_string,但是据测试很可能出现 too many string attributes (current index format allows up to 4 GB) 这个错误,不知是不是因为是32位程序的原因,所以最好不要用,或者可以试试64位的 Sphinx

然后把 searchd 注册成服务 searchd --install --config ..\etc\sed.conf,以免每次使用都要在命令行窗口打开。

接着用 indexer -c ..\etc\sed.conf --all --rotate 建立全文索引。
如果有增量索引的还要合并索引(我测试的时候并没有做增量索引,可以参考这篇文章。)

索引建立完后试试能否正常搜索: search -c ..\etc\sed.conf adwin ,如果出现 search error: failed to open xxx.sph: No such file or directory 这个错误,可能原因有:

It seems that
a) the lock files were lost somehow, and
b) you ran indexer without --rotate switch.
Normally, searchd would create .spl lock files on startup (and kill them on shutdown),
and indexer would then refuse to reindex without --rotate switch.
indexers sends SIGHUP to searchd when completed (and it should print that out). searchd
then does the following:
1) waits for all children to exit
2) renames current index files to .old
3) renames .new index files to current
4) tries to load current index files
5) if something failed, it rollbacks current files from .old and new to .new
In your case I would start with double checking directory access rights, users which both
indexer and searchd run under, etc.

据我的经验一般把 data 目录下的文件名里多出来的.new去掉就好。

最后调用自带的 sphinxapi.php 即可,我直接用的 Ph4nt0m 的源码(界面用 Bootstrap 重写了下):

<?php
// 引用sphinxapi类
require "sphinxapi.php";
//关闭错误提示
error_reporting(E_ALL & ~E_NOTICE);
$num = 0;
if (!empty($_GET) && !empty($_GET['q'])) {
    $Keywords = strip_tags(trim($_GET['q']));
    if (!empty($_GET['m']) && 1 == $_GET['m']) {
        $Keywords = substr(md5($Keywords), 8, 16);
    }
    if (!empty($_GET['m']) && 2 == $_GET['m']) {
        $Keywords = md5($Keywords);
    }
    $cl = new SphinxClient();
    // 返回结果设置
    $cl->SetServer('127.0.0.1', 9312);
    $cl->SetConnectTimeout(3);
    $cl->SetArrayResult(true);
    // 设置是否全文匹配
    if (!empty($_GET) && !empty($_GET['f'])) {
        $cl->SetMatchMode(SPH_MATCH_ALL);
    } else {
        $cl->SetMatchMode(SPH_MATCH_ANY);
    }
    if (!empty($_GET) && !empty($_GET['p'])) {
        $p = !intval(trim($_GET['p'])) == 0 ? intval(trim($_GET['p'])) - 1 : 0;
        $p = $p * 20;
        // 我在sed.conf 设置了最大返回结果数1000。但是我在生成页码的时候最多生成20页,我想能满足大部分搜索需求了。
        // 以下语句表示从P参数偏移开始每次返回20条。
        $cl->setLimits($p, 20);
    } else {
        $cl->setLimits(0, 20);
    }

    $res = $cl->Query(".$Keywords.", "*");

    @mysql_connect("localhost", "test", "test"); //数据库账号密码
    mysql_select_db("sed"); //数据库库名名

    if (is_array($res["matches"])) {
        foreach ($res["matches"] as $docinfo) {
            $ids = $ids . $docinfo[id] . ',';
        }
        $ids = rtrim($ids, ',');
        $sql = "select * from md5_sgk where id in($ids)"; //注意修改表名
        mysql_query("set names utf8");
        $ret = mysql_query($sql);
        $num = mysql_num_rows($ret);
    }
}
?>
<!DOCTYPE html>
<html>
<head>
   <title>The Web of Answers</title>
   <meta charset="UTF-8">
   <meta name="viewport" content="width=device-with,initial-scal=1">
   <link href="css/bootstrap.min.css" rel="stylesheet">
   <script src="js/jquery.js"></script>
   <script src="js/bootstrap.min.js"></script>
   <script>
    function check(form){
        if(form.q.value==""){
          alert("Not null !");
          form.q.focus();
          return false;
        }
    }
    </script>
   <style>
        h1 {
            font-family: Times New Roman, Lucida Handwriting;
        }
        body {
            background-image: url(img/bg.jpg);
        }
   </style>
</head>
<body>
    <div class="container" id="container">
        <div id="page-header">
            <h1 class="text-center"> The Web of Answers </h1>
        </div>

        <div class="row">
        <form action="" method="get" class="form-horizontal" role="form">

            <div id="checkbox" class="col-md-6 col-md-offset-3">
                <label class="checkbox-inline">
                    <input type="checkbox" id="full" name="f" value="1">   完整匹配
                </label>
                <label class="checkbox-inline">
                      <input type="checkbox" id="md5_16" name="m" value="1">
                       MD5匹配(16位)
                </label>
                <label class="checkbox-inline">
                      <input type="checkbox" id="md5_32" name="m" value="2">
                       MD5匹配(32位)
                </label>
            </div>

            <div class="input-group col-md-6 col-md-offset-3">
                <input type="text" class="form-control" name="q" placeholder="请输入" value="<?php echo strip_tags(trim($_GET['q']));?>">
                    <div class="input-group-btn">
                        <button type="submit" class="btn btn-primary" onclick="check(form)">Search</button>
                    </div>
             </div>

        </form>
    </div>
    <br>
<?php
if (0 == !$num) {
    echo "<div class=\"row\">
    <div class=\"alert alert-success alert-dismissible col-md-10 col-md-offset-1\" role=\"alert\">
    <button type=\"button\" class=\"close\" data-dismiss=\"alert\"><span aria-hidden=\"true\">&times;</span><span class=\"sr-only\">Close</span></button>
    找到与<b>&nbsp{$Keywords}&nbsp</b>相关的结果 {$res[total_found]} 个。用时 {$res[time]} 秒。</div>";
    echo "<div class=\"table-responsive col-md-10 col-md-offset-1\">
        <table class=\"table table-striped table-hover\">
          <tr>
          <th>Username</th>
          <th>Email</th>
          <th>Password</th>
          <th>Salt</th>
          <th>From</th>
          </tr>";
    while ($row = mysql_fetch_assoc($ret)) {
        echo "<tr><td>" . $row['username'] . "</td>";
        echo "<td>" . $row['email'] . "</td>";
        echo "<td>" . $row['password'] . "</td>";
        echo "<td>" . $row['salt'] . "</td>";
        echo "<td>" . $row['order'] . "</td></tr>";
    }
    echo "</table></div></div>";
} else {
    if (!empty($_GET) && !empty($_GET['q'])) {
        echo "<div class=\"alert alert-warning alert-dismissible col-md-10 col-md-offset-1\" role=\"alert\">
        <button type=\"button\" class=\"close\" data-dismiss=\"alert\"><span aria-hidden=\"true\">&times;</span><span class=\"sr-only\">Close</span></button>
        找不到与<b>&nbsp{$Keywords}&nbsp</b>相关的结果。请更换其他关键词试试。</div></div>";
    }
}
?>
    <div id="pages">
    <center>
        <nav>
            <ul class="pagination">
<?php
if (0 == !$num) {
    $pagecount = (int) ($res[total_found] / 20);
    if (!($res[total_found] % 20) == 0) {
        $pagecount = $pagecount + 1;
    }
    if ($pagecount > 20) {
        $pagecount = 20;
    }
    $highlightid = !intval(trim($_GET['p'])) == 0 ? intval(trim($_GET['p'])) : 1;

    for ($i = 1; $i <= $pagecount; $i++) {
        if ($highlightid == $i) {
            echo "<li class=\"active\"><a href=\"#\">{$i}<span class=\"sr-only\">(current)</span></a></li>";
        } else {
            echo "<li><a href=\"index.php?q={$Keywords}&p={$i}\">{$i}</a></li>";
        }

    }
}
?>
            </ul>
        </nav>
    </center>
    </div>

    <div id="footer">
        <p class="text-center">
            The Web of Answers &copy;2010-2015 | Powered by b0rg
        </p>
    </div>

    </div>
</body>
</html>

效果:
sed.png

参考资料:

MySQL+Sphinx+CoreSeek数据库查询脚本
经验分享:社会工程学数据库搭建TIPS
搭建coreseek(sphinx+mmseg3)详细安装配置+php之sphinx扩展安装+php调用示例
Coreseek 4.1 参考手册