Building a Social-Engineering Database (社工库) with PHP + MySQL + Coreseek

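(This started as a quick note, and I did not expect it to become the most popular post on the blog... But take a look at the top-ranked answer to this question: that is what real big data looks like, far beyond anything we have here ╮(╯_╰)╭)

I had downloaded some database dumps before but hardly ever used them... For one thing, maintaining your own social-engineering database takes a lot of time and effort; for another, the online ones are mostly good enough.

Later I imported a dump of more than 10 GB into MySQL as a single table with nearly 200 million rows. Even with indexes, queries were painfully slow, sometimes slower than a small tool that brute-force searches the raw text. So I did some reading and tried Coreseek for full-text indexing; the speed is good (queries return in milliseconds).

Download the latest version of Coreseek (4.1) from the official site, extract it to a directory, and open the etc directory. It contains a sample configuration file, csft_mysql.conf. Here is my configuration file (named sed.conf):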

#Source definition
source sed
{
    type                    = mysql

    sql_host                = localhost
    sql_user                = test
    sql_pass                = test
    sql_db                    = sed
    sql_port                = 3306
    sql_query_pre            = SET NAMES utf8

    sql_query                = SELECT `id`, `username`, `password`, `email`, `salt`, `order` FROM md5_sgk    #The first column of sql_query (id) must be an integer; username, password, email, etc. are string/text fields and are full-text indexed
    sql_attr_uint            = id           #The value read from SQL must be an integer
    #sql_field_string    = username
    #sql_field_string    = password
    #sql_field_string    = email
    #sql_attr_timestamp        = date_added  #The value read from SQL must be an integer; used as a timestamp attribute

    sql_query_info_pre      = SET NAMES utf8                                        #Set the correct character set for command-line queries
    sql_query_info            = SELECT `id`, `username`, `password`, `email`, `salt`, `order` FROM md5_sgk WHERE id=$id #For command-line queries, fetch the original row from the database
}

#Index definition
index sed
{
    source            = sed           #Name of the corresponding source
    path            = E:/SQL_DATA/coreseek/var/data/sed #Change to the actual absolute path, e.g. /usr/local/coreseek/var/...
    docinfo            = extern
    mlock            = 0
    morphology        = none
    min_word_len        = 1
    ondisk_dict     = 1   #Keep the index dictionary (.spi) on disk instead of caching it in memory
    html_strip                = 0

    #Chinese word segmentation settings; for details see http://www.coreseek.cn/products-install/coreseek_mmseg/
    #charset_dictpath = /usr/local/mmseg3/etc/ #Use on BSD and Linux; must end with /
    charset_dictpath = E:/SQL_DATA/coreseek/etc/ #Use on Windows; must end with /, preferably an absolute path, e.g. C:/usr/local/coreseek/etc/...
    charset_type        = zh_cn.utf-8
}

#Global indexer settings
indexer
{
    mem_limit            = 1024M
}

#searchd service definition
searchd
{
    listen                  =   9312
    read_timeout        = 5
    max_children        = 30
    max_matches            = 1000
    seamless_rotate        = 0
    preopen_indexes        = 0
    unlink_old            = 1
    pid_file = E:/SQL_DATA/coreseek/var/log/searchd_mysql.pid  #Change to the actual absolute path, e.g. /usr/local/coreseek/var/...
    log = E:/SQL_DATA/coreseek/var/log/searchd_mysql.log        #Change to the actual absolute path, e.g. /usr/local/coreseek/var/...
    query_log = E:/SQL_DATA/coreseek/var/log/query_mysql.log #Change to the actual absolute path, e.g. /usr/local/coreseek/var/...
    binlog_path =                                #Disable the binlog
}

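See the official manual for details on each setting.

Some guides use sql_field_string, but in my tests it is likely to trigger the error too many string attributes (current index format allows up to 4 GB). I am not sure whether this is because the program is 32-bit, so it is best to avoid it, or to try 64-bit Sphinx.

Next, register searchd as a service with searchd --install --config ..\etc\sed.conf so that you do not have to start it from a command prompt every time.

Then build the full-text index with indexer -c ..\etc\sed.conf --all --rotate.
If you use incremental indexes you also need to merge them (I did not set up incremental indexing in my tests; see this article for reference).

Once the index is built, check that searching works: search -c ..\etc\sed.conf adwin. If you get the error search error: failed to open xxx.sph: No such file or directory, possible causes are: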

It seems that
a) the lock files were lost somehow, and
b) you ran indexer without --rotate switch.
Normally, searchd would create .spl lock files on startup (and kill them on shutdown),
and indexer would then refuse to reindex without --rotate switch.
indexers sends SIGHUP to searchd when completed (and it should print that out). searchd
then does the following:
1) waits for all children to exit
2) renames current index files to .old
3) renames .new index files to current
4) tries to load current index files
5) if something failed, it rollbacks current files from .old and new to .new
In your case I would start with double checking directory access rights, users which both
indexer and searchd run under, etc.

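In my experience, removing the extra .new from the file names in the data directory is usually enough.

Finally, call the bundled sphinxapi.php. I used Ph4nt0m's source code directly (with the interface rewritten in Bootstrap):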

<?php
// Load the SphinxClient class bundled with Coreseek
require "sphinxapi.php";
// Hide notices
error_reporting(E_ALL & ~E_NOTICE);
$num = 0;
if (!empty($_GET['q'])) {
    $Keywords = strip_tags(trim($_GET['q']));
    // m=1: search for the 16-character MD5 of the keyword; m=2: the 32-character MD5
    if (!empty($_GET['m']) && 1 == $_GET['m']) {
        $Keywords = substr(md5($Keywords), 8, 16);
    }
    if (!empty($_GET['m']) && 2 == $_GET['m']) {
        $Keywords = md5($Keywords);
    }
    $cl = new SphinxClient();
    // Connection and result settings
    $cl->SetServer('127.0.0.1', 9312);
    $cl->SetConnectTimeout(3);
    $cl->SetArrayResult(true);
    // f set: every word must match; otherwise any word may match
    if (!empty($_GET['f'])) {
        $cl->SetMatchMode(SPH_MATCH_ALL);
    } else {
        $cl->SetMatchMode(SPH_MATCH_ANY);
    }
    // sed.conf caps results at 1000 (max_matches), but the pager shows at most 20 pages,
    // which should cover most searches. Return 20 rows starting at the offset
    // derived from the p parameter (clamped to pages 1..20).
    $page = !empty($_GET['p']) ? min(20, max(1, intval(trim($_GET['p'])))) : 1;
    $cl->SetLimits(($page - 1) * 20, 20);

    $res = $cl->Query($Keywords, "*");

    // Note: the mysql_* functions were removed in PHP 7; on newer PHP use mysqli or PDO instead
    @mysql_connect("localhost", "test", "test"); // database user and password
    mysql_select_db("sed"); // database name

    if (is_array($res) && !empty($res["matches"])) {
        // Collect the matched document ids as integers
        $ids = array();
        foreach ($res["matches"] as $docinfo) {
            $ids[] = intval($docinfo['id']);
        }
        $ids = implode(',', $ids);
        $sql = "select * from md5_sgk where id in($ids)"; // change the table name as needed
        mysql_query("set names utf8");
        $ret = mysql_query($sql);
        $num = mysql_num_rows($ret);
    }
}
?>
<!DOCTYPE html>
<html>
<head>
   <title>The Web of Answers</title>
   <meta charset="UTF-8">
   <meta name="viewport" content="width=device-width,initial-scale=1">
   <link href="css/bootstrap.min.css" rel="stylesheet">
   <script src="js/jquery.js"></script>
   <script src="js/bootstrap.min.js"></script>
   <script>
    function check(form){
        if(form.q.value==""){
          alert("Not null !");
          form.q.focus();
          return false;
        }
    }
    </script>
   <style>
        h1 {
            font-family: Times New Roman, Lucida Handwriting;
        }
        body {
            background-image: url(img/bg.jpg);
        }
   </style>
</head>
<body>
    <div class="container" id="container">
        <div id="page-header">
            <h1 class="text-center"> The Web of Answers </h1>
        </div>

        <div class="row">
        <form action="" method="get" class="form-horizontal" role="form">

            <div id="checkbox" class="col-md-6 col-md-offset-3">
                <label class="checkbox-inline">
                    <input type="checkbox" id="full" name="f" value="1">   Match all words
                </label>
                <label class="checkbox-inline">
                      <input type="checkbox" id="md5_16" name="m" value="1">
                       MD5 match (16 characters)
                </label>
                <label class="checkbox-inline">
                      <input type="checkbox" id="md5_32" name="m" value="2">
                       MD5 match (32 characters)
                </label>
            </div>

            <div class="input-group col-md-6 col-md-offset-3">
                <input type="text" class="form-control" name="q" placeholder="Enter a keyword" value="<?php echo htmlspecialchars(isset($_GET['q']) ? trim($_GET['q']) : '', ENT_QUOTES, 'UTF-8');?>">
                    <div class="input-group-btn">
                        <button type="submit" class="btn btn-primary" onclick="return check(form)">Search</button>
                    </div>
             </div>

        </form>
    </div>
    <br>
<?php
// Escape a value before echoing it into HTML
function h($s) { return htmlspecialchars($s, ENT_QUOTES, 'UTF-8'); }
$kw = isset($Keywords) ? h($Keywords) : '';
if ($num > 0) {
    echo "<div class=\"row\">
    <div class=\"alert alert-success alert-dismissible col-md-10 col-md-offset-1\" role=\"alert\">
    <button type=\"button\" class=\"close\" data-dismiss=\"alert\"><span aria-hidden=\"true\">&times;</span><span class=\"sr-only\">Close</span></button>
    Found {$res['total_found']} results for <b>&nbsp;{$kw}&nbsp;</b> in {$res['time']} seconds.</div>";
    echo "<div class=\"table-responsive col-md-10 col-md-offset-1\">
        <table class=\"table table-striped table-hover\">
          <tr>
          <th>Username</th>
          <th>Email</th>
          <th>Password</th>
          <th>Salt</th>
          <th>From</th>
          </tr>";
    while ($row = mysql_fetch_assoc($ret)) {
        echo "<tr><td>" . h($row['username']) . "</td>";
        echo "<td>" . h($row['email']) . "</td>";
        echo "<td>" . h($row['password']) . "</td>";
        echo "<td>" . h($row['salt']) . "</td>";
        echo "<td>" . h($row['order']) . "</td></tr>";
    }
    echo "</table></div></div>";
} else {
    if (!empty($_GET['q'])) {
        echo "<div class=\"alert alert-warning alert-dismissible col-md-10 col-md-offset-1\" role=\"alert\">
        <button type=\"button\" class=\"close\" data-dismiss=\"alert\"><span aria-hidden=\"true\">&times;</span><span class=\"sr-only\">Close</span></button>
        No results found for <b>&nbsp;{$kw}&nbsp;</b>. Try a different keyword.</div>";
    }
}
?>
    <div id="pages">
    <center>
        <nav>
            <ul class="pagination">
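<?php
if ($num > 0) {
    // At most 20 page links, 20 results per page
    $pagecount = (int) min(20, ceil($res['total_found'] / 20));
    $highlightid = !empty($_GET['p']) ? min(20, max(1, intval(trim($_GET['p'])))) : 1;
    // Carry the keyword as typed and the search options over to every page link
    $params = array('q' => trim($_GET['q']));
    if (!empty($_GET['m'])) { $params['m'] = $_GET['m']; }
    if (!empty($_GET['f'])) { $params['f'] = $_GET['f']; }

    for ($i = 1; $i <= $pagecount; $i++) {
        if ($highlightid == $i) {
            echo "<li class=\"active\"><a href=\"#\">{$i}<span class=\"sr-only\">(current)</span></a></li>";
        } else {
            $params['p'] = $i;
            $href = htmlspecialchars('index.php?' . http_build_query($params), ENT_QUOTES, 'UTF-8');
            echo "<li><a href=\"{$href}\">{$i}</a></li>";
        }
    }
}
?>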
            </ul>
        </nav>
    </center>
    </div>

    <div id="footer">
        <p class="text-center">
            The Web of Answers &copy;2010-2015 | Powered by b0rg
        </p>
    </div>

    </div>
</body>
</html>
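
The keyword hashing and paging arithmetic in the page above can be checked in isolation. The following is a minimal sketch, not part of the original page: the helper names prepareKeyword, pageOffset and pageCount are hypothetical, and the defaults (20 results per page, at most 20 pages) are simply the values the page uses.

```php
<?php
// Hypothetical helpers (names are not from the original page) that isolate
// the keyword hashing and paging arithmetic used by the search page.

/**
 * Returns the term to send to the search daemon.
 * $mode 1: 16-character MD5 (characters 8..23 of the 32-character digest).
 * $mode 2: full 32-character MD5. Any other value: the trimmed keyword.
 */
function prepareKeyword($keyword, $mode)
{
    $keyword = strip_tags(trim($keyword));
    if ($mode === 1) {
        return substr(md5($keyword), 8, 16);
    }
    if ($mode === 2) {
        return md5($keyword);
    }
    return $keyword;
}

/** Zero-based result offset for a 1-based page number, clamped to 1..$maxPages. */
function pageOffset($page, $perPage = 20, $maxPages = 20)
{
    $page = min($maxPages, max(1, (int) $page));
    return ($page - 1) * $perPage;
}

/** Number of page links to render for $totalFound results, capped at $maxPages. */
function pageCount($totalFound, $perPage = 20, $maxPages = 20)
{
    return (int) min($maxPages, ceil($totalFound / $perPage));
}

echo prepareKeyword('a', 1), "\n"; // c0f1b6a831c399e2
echo pageOffset(3), "\n";          // 40
echo pageCount(45), "\n";          // 3
```

Clamping the page number keeps the offset between 0 and 380, well inside the max_matches = 1000 limit set in sed.conf, so a hand-edited p parameter cannot request an offset the daemon would reject.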

Result: [screenshot of the search page]

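References:

MySQL+Sphinx+CoreSeek数据库查询脚本 (MySQL + Sphinx + CoreSeek database query script)
经验分享：社会工程学数据库搭建TIPS (Experience sharing: tips on building a social-engineering database)
搭建coreseek(sphinx+mmseg3)详细安装配置+php之sphinx扩展安装+php调用示例 (Detailed coreseek (sphinx + mmseg3) installation and configuration, installing the PHP sphinx extension, and PHP usage examples)
Coreseek 4.1 参考手册 (Coreseek 4.1 Reference Manual)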

43 comments

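oott123

February 5th, 2015 at 11:37 am

You really have a lot of free time...

1. admin

  February 5th, 2015 at 12:44 pm

  I tinkered with it for a few days... This thing is too much hassle; better to just use the online ones Orz

  1. 你好

    April 1st, 2015 at 08:51 pm

    Hi, how do I deduplicate query results on the front end? Thanks

    你好

    April 1st, 2015 at 10:00 pm

    Also, could I have the complete source code for this? Could you send it to my email if convenient?

    admin

    April 1st, 2015 at 11:01 pm

    Do the front-end deduplication in JS. There seem to be ready-made libraries for it, but I am not very familiar with them.
    As for the complete source code, it is all above; you only need to download Bootstrap separately.

    你好

    April 12th, 2015 at 10:36 pm

    OK, thanks
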
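城市

April 14th, 2015 at 12:30 pm

The 32-bit build of Coreseek can index at most about 4 billion. Is there any way around this?

1. admin

  April 14th, 2015 at 02:41 pm

  There is no way around it. It seems to be because a 32-bit program can use at most 4 GB of memory, so once the index gets large it runs out of memory. So wait for the legendary CoreSeek5 (2333), or try Sphinx.
  Another option is distributed querying with Python; search for the paper 《基于Coreseek+Python的分布式全文检索方法》 (A Distributed Full-Text Retrieval Method Based on Coreseek + Python).

  1. 你好

    April 14th, 2015 at 02:48 pm

    Oh, all right then...
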
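小白

April 24th, 2015 at 03:36 am

I do not understand this. Could you make a tutorial?

1. admin

  April 24th, 2015 at 10:49 am

  No ╮(╯_╰)╭

looder

May 2nd, 2015 at 11:50 pm

Hi, with php + coreseek + mysql I have already indexed table A in my database and can search it normally. After adding table B, what do I need to do to search both table A and table B at the same time and return the results? Thanks

1. admin

  May 3rd, 2015 at 03:30 pm

  Have a look at the next post...
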
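awk

June 14th, 2015 at 10:56 pm

With /usr/local/coreseek/etc search <keyword> I can find results,
so why does the query return nothing from the web page?

1. awk

  June 14th, 2015 at 11:21 pm

  Started /usr/local/coreseek/bin/searchd to open port 9312
  and it works now. This thing really is a lot of hassle.

  1. admin

    June 15th, 2015 at 09:59 pm

    Glad you got it working. Installing this on Linux seems to be even more of a hassle...

  2. ok

    March 17th, 2017 at 10:50 am

    l/coreseek/etc search <keyword> finds results for me, but the web page finds nothing
    I have the same problem as you, but netstat -na shows that port 9312 is already open. What should I do?

awk

June 20th, 2015 at 12:04 pm

Admin, another question: does a Sphinx index have a 2 GB limit?
Starting the ./searchd service always fails with an error saying something exceeds 2 GB

awk

June 20th, 2015 at 12:10 pm

index 'mydb1': search error: mmap() failed: Cannot allocate memory (length=2020043912).
So frustrating. Can the index really not be larger than physical memory?
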
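Max

September 1st, 2015 at 11:09 pm

How do I search for email addresses containing @? Whenever I search for anything containing @, the page comes up blank

1. admin

  September 2nd, 2015 at 11:40 am

  See http://stackoverflow.com/questions/15744134/how-to-allow-special-characters-as-search-keyword-in-sphinx

manx

September 15th, 2015 at 10:03 pm

Please package the source code, CSS and other files of your search page and send them to me, thanks. [email address removed]
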
redboy

October 12th, 2015 at 05:54 pm

Change the code as follows and you no longer need to download Bootstrap yourself; it works right away
// change to the official remote JS URL
// change to the official remote JS URL

redboy

October 12th, 2015 at 05:56 pm

href="//cdn.bootcss.com/bootstrap/3.3.5/css/bootstrap.min.css" rel="stylesheet"
src="//cdn.bootcss.com/jquery/1.11.3/jquery.min.js"
src="//cdn.bootcss.com/bootstrap/3.3.5/js/bootstrap.min.js"

阳光梦

April 14th, 2016 at 09:41 pm

Hi, I am new to full-text search. Do you have an article that covers setting up the environment, the more detailed the better? Thanks!

JIAYNG

October 15th, 2016 at 11:54 am

I think I have solved the problem of indexes over 4 GB, haha

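ChunSource

December 30th, 2016 at 11:33 pm

I downloaded more than 200 GB of data and have been studying this for ages but still do not get it. Does anyone know how? Call me

zero

January 20th, 2017 at 04:35 pm

The front end looks great. Could you send me a copy?
[email address removed]

ok

March 12th, 2017 at 04:08 pm

Hi, I have read several articles about sphinx and their first step is always installing mmseg3. If I follow your tutorial, do I still need to install mmseg3?

1. 0x0d

  March 12th, 2017 at 09:34 pm

  No
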
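wayne0426

April 25th, 2017 at 11:02 pm

/usr/local/coreseek/bin/search abcd works in testing, but searching from the PHP page returns no results and the page shows no errors. What is usually the cause?

1. wayne0426

  April 25th, 2017 at 11:08 pm

  Figured it out. If PHP has the sphinx extension loaded, including sphinxapi.php raises an error. Without including sphinxapi.php there is no error, but queries return no results. Uninstall the sphinx extension from PHP, then include sphinxapi.php, and queries return results normally.

  1. wing

    August 20th, 2017 at 06:21 pm

    How do I uninstall PHP's sphinx extension?
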
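a

May 20th, 2017 at 08:23 pm

Requesting the front-end source code

claysec

June 20th, 2017 at 07:48 pm

How can the PHP file query multiple tables in a single database?

wing

August 15th, 2017 at 11:48 am

indexer -c ..etcsed.conf --all --rotate indexr command not find

beng

October 22nd, 2017 at 10:54 am

I ran into problems with the configuration. Could you add me on QQ and help me out?

xing

December 22nd, 2017 at 11:26 pm

Hi, I have a small problem: search on the command line returns results, but the PHP page shows no results and no errors. Any ideas?

ka

December 28th, 2017 at 01:43 pm

With more than 200 million rows the index builds successfully, but starting coreseekbinsearchd -c reports out of memory. Both a dual-core 8 GB machine and a 24-core 24 GB machine (memory usage only reaches 6 GB and stops, with 19 GB still free) show out of memory. Do I need coreseek 4.1 x64? Googling turns up nothing about an x64 build. Does one exist, and what should I do?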

岚光

Geek & Hack


Posted 2015-02-03 in Windows, Network Security
