
Keyword Extraction with NLTK and Lightweight Search Engines (Whoosh, Elasticsearch)

Background

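Sometimes you want to query a keyword-based search engine with a full sentence or a whole passage of text. Passing the entire text in as the query tends to give poor results, because the engine matches on keywords rather than doing semantic search. So how do you extract the keywords?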

Extracting keywords with NLTK

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')


def extract_keywords(text):
    # Tokenize the text
    words = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]
    print('filtered words:', filtered_words)

    # Calculate word frequency
    freq_dist = FreqDist(filtered_words)

    # Extract keywords based on frequency or other criteria
    keywords = [word for word, freq in freq_dist.most_common(10)]  # Adjust the number of keywords as needed
    return keywords


if __name__ == '__main__':
    text = """Elasticsearch provides powerful search capabilities and is commonly used in production environments for large-scale document search and retrieval. However, it might be overkill for small projects or scenarios where simpler solutions like Whoosh are sufficient. Choose the solution that best fits your needs."""
    keywords = extract_keywords(text)
    print(keywords)

Output

filtered words: ['elasticsearch', 'provides', 'powerful', 'search', 'capabilities', 'commonly', 'used', 'production', 'environments', 'document', 'search', 'retrieval', 'however', 'might', 'overkill', 'small', 'projects', 'scenarios', 'simpler', 'solutions', 'like', 'whoosh', 'sufficient', 'choose', 'solution', 'best', 'fits', 'needs']
['search', 'elasticsearch', 'provides', 'powerful', 'capabilities', 'commonly', 'used', 'production', 'environments', 'document']
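
To make the ranking step concrete, here is a minimal standard-library sketch of the same frequency-based idea. It is not the NLTK implementation: the regex tokenizer, the tiny `STOP_WORDS` set, and the function name `top_keywords` are assumptions made for illustration. It shows that words with equal counts come out in first-seen order, which is why the result above starts with 'search' and then simply follows the order of the text.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, far smaller than NLTK's English list.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "for", "in", "of", "to", "it", "what"}


def top_keywords(text, limit=10):
    """Return up to `limit` words ranked by frequency (hypothetical helper)."""
    # Lowercase and keep runs of letters/digits; this is cruder than word_tokenize.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    filtered = [token for token in tokens if token not in STOP_WORDS]
    # Counter.most_common keeps first-seen order among words with equal counts.
    return [word for word, _ in Counter(filtered).most_common(limit)]


if __name__ == "__main__":
    print(top_keywords("Search engines index documents, and search is fast."))
```

With short inputs most words occur only once, so the "top" keywords are mostly the first non-stopwords in the text; a smaller `limit` may give a tighter query.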

Keyword-based search with Whoosh

import os

# Assumes the NLTK script above is saved as keywords_extractor.py
from keywords_extractor import extract_keywords
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in, open_dir
from whoosh.qparser import QueryParser

# Define the schema for the index
schema = Schema(question=TEXT(stored=True))

# Create or open the index
INDEX_DIR = "indexdir"
os.makedirs(INDEX_DIR, exist_ok=True)  # create_in expects the directory to exist
ix = create_in(INDEX_DIR, schema)  # Use create_in for creating a new index or open_dir for opening an existing one

# Index your documents (replace questions with the actual content of your documents)
writer = ix.writer()
doc_content = "what is angular"  # the text to search for
questions = ["How to implement autocomplete, I don't know?", "How does Angular work?", "how Python programming language", "Example question", "Another question"]
for question in questions:
    writer.add_document(question=question)
writer.commit()

# Search using keywords
search_keywords = extract_keywords(doc_content)
query_str = " OR ".join(search_keywords)
print(query_str)

with ix.searcher() as searcher:
    query_parser = QueryParser("question", ix.schema)
    query = query_parser.parse(query_str)
    results = searcher.search(query)
    for result in results:
        print(result)

Output

filtered words: ['angular']
angular
<Hit {'question': 'How does Angular work?'}>
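
Conceptually, a keyword engine such as Whoosh looks terms up in an inverted index and, for an OR query, returns every document that contains at least one of them. The following is a toy standard-library sketch of that idea, not Whoosh's actual implementation; `build_index` and `search_any` are hypothetical names, and the sketch ignores scoring and any text analysis beyond lowercasing.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased term to the set of document positions containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index


def search_any(index, documents, keywords):
    """Return documents matching at least one keyword (OR semantics), in input order."""
    matched = set()
    for keyword in keywords:
        # .get avoids creating empty entries for unknown terms.
        matched |= index.get(keyword.lower(), set())
    return [documents[doc_id] for doc_id in sorted(matched)]


if __name__ == "__main__":
    questions = [
        "How to implement autocomplete, I don't know?",
        "How does Angular work?",
        "how Python programming language",
        "Example question",
        "Another question",
    ]
    index = build_index(questions)
    print(search_any(index, questions, ["angular"]))
```

An empty keyword list (for example, when the query text consists only of stopwords) matches nothing here, so it may be worth checking for that case before running a search.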

Keyword-based search with Elasticsearch

from elasticsearch import Elasticsearch

# Assumes the NLTK script above is saved as keywords_extractor.py
from keywords_extractor import extract_keywords

# Connect to the Elasticsearch server (make sure it's running)
# Note: newer client versions may expect a URL instead, e.g. Elasticsearch("http://localhost:9200")
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Create an index
index_name = "your_index_name"
if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name, ignore=400)

# Index a document (replace doc_content with the actual content of your documents)
doc_content = "This is the content of your document."
document = {"content": doc_content}
# refresh=True makes the document visible to the search that follows immediately
es.index(index=index_name, body=document, refresh=True)

# Search using keywords
search_keywords = extract_keywords(doc_content)
query_body = {
    "query": {
        "terms": {
            "content": search_keywords
        }
    }
}
results = es.search(index=index_name, body=query_body)
for hit in results['hits']['hits']:
    print(hit['_source'])
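
The query body can also be built by a small helper, which makes the empty-keyword case explicit. This is a sketch under assumptions: `build_terms_query` is a hypothetical name, and the field name `content` follows the example above. A `terms` query compares the supplied values against indexed terms without analyzing them, so the keywords are lowercased here on the assumption that the field uses the default standard analyzer, which lowercases tokens.

```python
import json


def build_terms_query(keywords, field="content"):
    """Build an Elasticsearch request body matching any of the keywords (hypothetical helper)."""
    # Lowercase and de-duplicate while keeping the original order.
    terms = list(dict.fromkeys(keyword.lower() for keyword in keywords))
    if not terms:
        # No keywords survived extraction: return a query that matches nothing.
        return {"query": {"match_none": {}}}
    return {"query": {"terms": {field: terms}}}


if __name__ == "__main__":
    print(json.dumps(build_terms_query(["content", "document"]), indent=2))
```

The returned dictionary could be passed as the `body` argument of `es.search` in the script above.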
