
Project Diary (3): Boost Search Engine

Source: original, 2024/7/5 2:21:25

1. Preparation

2. Search initialization

3. Search

4. Processing the content

5. Writing the server

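Preface: the previous entry, Project Diary (2), built the index, so this time we can implement search. Let's get straight to it. If this helps, please like, favorite, and share. Thanks!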

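1. Preparation

Search needs a struct for merged inverted-index results, so define it first. words holds the query keywords that hit a given document.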

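    // One merged inverted-index result per matched document
    struct InvertedElemPrint
    {
        uint64_t doc_id;       // document id
        int weight;            // accumulated weight of the document
        vector<string> words;  // keywords that hit this document
        InvertedElemPrint() : doc_id(0), weight(0) {}
    };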

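2. Search initialization

The searcher uses the index from the previous entry as a singleton. InitSearcher initializes the search: it obtains the index singleton and builds the index from the input file. Both pieces were implemented in the index entry, so here we simply call them.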

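    class Searcher
    {
    private:
        // the index singleton
        ns_index::Index* index;

    public:
        Searcher() {}
        ~Searcher() {}

    public:
        // Initialize the search: obtain the index singleton and build the index.
        // input is the path of the file that holds the parsed documents.
        void InitSearcher(const string& input)
        {
            // 1. Get or create the index object.
            index = ns_index::Index::GetInstance();
            cout << "Got the index singleton..." << endl;
            // 2. Build the forward and inverted indexes from the input.
            index->BuildIndex(input);
            cout << "Built the forward and inverted indexes..." << endl;
        }

        // The class continues with Search() and GetDesc() in sections 3 and 4.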
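GetInstance comes from the previous entry and is not shown here. As a rough illustration of what "obtaining the index singleton" means, the sketch below uses a function-local static. ToyIndex and BuildCount are hypothetical names made up for this example, and the real ns_index::Index may well be implemented differently (for instance with a mutex-guarded pointer).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for ns_index::Index; not the author's class.
class ToyIndex
{
public:
    // A function-local static is initialized exactly once, and that
    // initialization is thread-safe since C++11.
    static ToyIndex* GetInstance()
    {
        static ToyIndex instance;
        return &instance;
    }

    // Pretend to build the index: this toy version only records the input path.
    bool BuildIndex(const std::string& input)
    {
        if (input.empty()) return false;
        inputs_.push_back(input);
        return true;
    }

    // Number of successful BuildIndex calls (assumption: for illustration only).
    std::size_t BuildCount() const { return inputs_.size(); }

    // A singleton must not be copied.
    ToyIndex(const ToyIndex&) = delete;
    ToyIndex& operator=(const ToyIndex&) = delete;

private:
    ToyIndex() = default;
    std::vector<std::string> inputs_;
};
```

Every call to ToyIndex::GetInstance() returns the same pointer, so state built through one caller is visible to every other caller. That is the property InitSearcher relies on.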

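3. Search

1. Tokenize the query and store the tokens in words. The CutString helper written earlier in util.hpp does the tokenizing.

2. Look up the index for each token. Searches should match regardless of case, so each token is lowercased with boost::to_lower (from the Boost string algorithms library, not the standard library) before the inverted-index lookup. The lookup results fill in the InvertedElemPrint entries.

3. Merge and sort. One keyword may map to several documents, and one document may be hit by several keywords, so the hits are merged per document and then sorted by weight in descending order.

4. Build the JSON: serialize the results into a JSON string with jsoncpp.

5. Extract the text around the keyword from content. GetDesc does this job.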
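The case folding in step 2 can be tried without Boost. The helper below is a std-only stand-in, not boost::to_lower itself. ToLowerAscii is a hypothetical name, and under the default "C" locale it only folds ASCII letters and leaves all other bytes (such as UTF-8 encoded Chinese) untouched.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: returns an ASCII-lowercased copy of a token.
std::string ToLowerAscii(std::string token)
{
    // Go through unsigned char: passing a negative char value to
    // std::tolower is undefined behavior.
    std::transform(token.begin(), token.end(), token.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return token;
}
```

For example, ToLowerAscii("FileSystem") yields "filesystem". The index side has to apply the same folding when it is built, otherwise lowercased query tokens will not match.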

    // query is the search keyword; json_string returns the results to the browser.
    void Search(const string& query, string* json_string)
    {
        // 1. Tokenize: split the query and store the tokens in words.
        vector<string> words;
        ns_util::JiebaUtil::CutString(query, &words);

        // 2. Trigger: look up the index for every token, ignoring case.
        vector<InvertedElemPrint> inverted_list_all;
        // document id -> merged inverted entry
        unordered_map<uint64_t, InvertedElemPrint> tokens_map;
        for(string word : words)
        {
            boost::to_lower(word);
            // Fetch the inverted list of this token.
            ns_index::InvertedList* inverted_list = index->GetInvertedIndex(word);
            // No hit for this token: move on to the next one.
            if(nullptr == inverted_list)
            {
                continue;
            }
            // Merge every hit into the entry of its document.
            for(const auto& elem : *inverted_list)
            {
                auto& item = tokens_map[elem.doc_id];
                item.doc_id = elem.doc_id;
                item.weight += elem.weight;
                item.words.push_back(elem.word); // keyword that hit the document
            }
        }
        // Non-const reference, so that move can actually take the entry.
        for(auto& item : tokens_map)
        {
            inverted_list_all.push_back(move(item.second));
        }

        // 3. Sort by weight in descending order.
        sort(inverted_list_all.begin(), inverted_list_all.end(),
            [](const InvertedElemPrint& e1, const InvertedElemPrint& e2){
                return e1.weight > e2.weight;
            });

        // 4. Build the JSON string from the results with jsoncpp.
        Json::Value root;
        for(auto& item : inverted_list_all)
        {
            // Forward-index lookup by document id.
            ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id);
            if(nullptr == doc)
            {
                continue;
            }
            Json::Value elem;
            elem["title"] = doc->title;
            elem["desc"] = GetDesc(doc->content, item.words[0]);
            elem["url"] = doc->url;
            elem["id"] = (int)item.doc_id;
            elem["weight"] = item.weight;
            root.append(elem);
        }
        Json::FastWriter writer;
        *json_string = writer.write(root);
    }
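To see the merge-and-sort step in isolation, here is a small std-only sketch. ToyHit, ToyResult, ToyInvertedIndex and MergeAndRank are hypothetical names invented for this example. They only mimic the shape of the index types from the previous entry and assume the tokens are already lowercased.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the index types; not the author's definitions.
struct ToyHit
{
    std::uint64_t doc_id;
    int weight;
    std::string word;
};

struct ToyResult
{
    std::uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

using ToyInvertedIndex = std::unordered_map<std::string, std::vector<ToyHit>>;

// Merge the hits of all tokens per document, then rank by total weight.
std::vector<ToyResult> MergeAndRank(const ToyInvertedIndex& index,
                                    const std::vector<std::string>& tokens)
{
    std::unordered_map<std::uint64_t, ToyResult> merged;
    for (const std::string& token : tokens)
    {
        auto it = index.find(token);
        if (it == index.end()) continue; // token is not indexed
        for (const ToyHit& hit : it->second)
        {
            ToyResult& r = merged[hit.doc_id]; // created on first hit
            r.doc_id = hit.doc_id;
            r.weight += hit.weight;            // weights add up per document
            r.words.push_back(hit.word);
        }
    }

    std::vector<ToyResult> ranked;
    ranked.reserve(merged.size());
    for (auto& kv : merged) ranked.push_back(std::move(kv.second));

    // Highest total weight first.
    std::sort(ranked.begin(), ranked.end(),
              [](const ToyResult& a, const ToyResult& b) { return a.weight > b.weight; });
    return ranked;
}
```

With "split" hitting documents 1 (weight 10) and 2 (weight 3), and "string" hitting documents 2 (weight 9) and 3 (weight 1), document 2 appears once with weight 12 and ranks first. Without the map, it would show up twice in the result page.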

4. Processing the content

GetDesc extracts the text before and after the keyword. The lookup itself uses std::search from the <algorithm> header.

    string GetDesc(const string& html_content, const string& word)
    {
        // Locate the first occurrence of word in html_content and keep up to
        // 50 bytes before it and 100 bytes from its start.
        const int prev_step = 50;
        const int next_step = 100;
        // 1. Find the first occurrence; tolower makes the comparison case-insensitive.
        //    unsigned char keeps tolower well-defined for non-ASCII bytes.
        auto iter = search(html_content.begin(), html_content.end(), word.begin(), word.end(),
            [](unsigned char x, unsigned char y){
                return (tolower(x) == tolower(y));
            });
        if(iter == html_content.end())
        {
            return "None1";
        }
        // distance gives the offset between the two iterators.
        int pos = distance(html_content.begin(), iter);
        // 2. Compute the window: from 50 before the keyword to 100 after its start.
        int start = 0;
        int end = html_content.size() - 1;
        if(pos > start + prev_step) start = pos - prev_step;
        if(pos < end - next_step) end = pos + next_step;
        // 3. Cut out the substring between start and end.
        if(start >= end) return "None2";
        string desc = html_content.substr(start, end - start);
        desc += "...";
        return desc;
    }
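The window arithmetic is easier to follow with small numbers. The sketch below mirrors the logic of GetDesc but takes the two step sizes as parameters. MakeSnippet is a hypothetical name used only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <iterator>
#include <string>

// Hypothetical, parameterized variant of the snippet logic for experimenting.
std::string MakeSnippet(const std::string& content, const std::string& word,
                        int prev_step, int next_step)
{
    // Case-insensitive search for the first occurrence of word.
    auto iter = std::search(content.begin(), content.end(), word.begin(), word.end(),
                            [](unsigned char x, unsigned char y) {
                                return std::tolower(x) == std::tolower(y);
                            });
    if (iter == content.end()) return "None1";

    int pos = static_cast<int>(std::distance(content.begin(), iter));
    int start = 0;
    int end = static_cast<int>(content.size()) - 1;
    if (pos > start + prev_step) start = pos - prev_step; // clip the left side
    if (pos < end - next_step) end = pos + next_step;     // clip the right side
    if (start >= end) return "None2";

    return content.substr(start, end - start) + "...";
}
```

For the 26-byte content "0123456789abcDEF0123456789", the word "ABCdef" and steps 3 and 8, the match starts at offset 10, the window is [7, 18), and the result is "789abcDEF01...". Two details are worth knowing. When the window runs to the end of the content, the last byte is left out, because end starts at size() - 1 (so MakeSnippet("abc", "abc", 50, 100) gives "ab..."). The offsets also count bytes, so a window boundary can fall in the middle of a multi-byte UTF-8 character.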

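5. Writing the server

The server uses the cpp-httplib library. You can find it on Gitee, download it to the Linux host you connect to through Xshell, and it is ready to use.

First initialize the searcher. Then create the server with httplib: the server reads the keyword from the request, calls Search to produce the JSON string, and sends that JSON back to the client.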

    #include <iostream>
    #include "searcher.hpp"
    #include "cpp-httplib/httplib.h"

    // Path of the parsed raw data.
    const string input = "data/raw_html/raw.txt";
    // Web root directory for static files.
    const string root_path = "./wwwroot";

    int main()
    {
        ns_searcher::Searcher search;
        search.InitSearcher(input);

        // Create the server with httplib.
        httplib::Server svr;
        svr.set_base_dir(root_path.c_str());
        // Handle GET /s: read the keyword and answer with the JSON results.
        // search is captured by reference; it lives until main returns.
        svr.Get("/s", [&search](const httplib::Request& req, httplib::Response& rsp) {
            // No search keyword was supplied.
            if(!req.has_param("word"))
            {
                rsp.set_content("A search keyword is required!", "text/plain; charset=utf-8");
                return;
            }
            // Read the keyword.
            string word = req.get_param_value("word");
            cout << "User is searching for: " << word << endl;
            string json_string;
            // Run the search.
            search.Search(word, &json_string);
            // Send the JSON string back in the response.
            rsp.set_content(json_string, "application/json");
        });
        cout << "Server set up successfully..." << endl;
        svr.listen("0.0.0.0", 8081);
        return 0;
    }