A Hands-On Guide to Building an ES Search Engine with Flask (Preparation)

1 Preface

Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search engine library.

So the question we care about is how to hook Elasticsearch up to Python (why does everything have to be tied to Python?).

2 Interacting from Python

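Naturally, there is a Python client library for talking to Elasticsearch. Note that the examples below pass doc_type, which the 8.x client no longer accepts, so they assume a 7.x (or earlier) client.

  pip install elasticsearch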

Initialize a connection to obtain an Elasticsearch client object.

  def __init__(self, index_type: str, index_name: str, ip="127.0.0.1"):
      # With authentication enabled, connect like this instead:
      # self.es = Elasticsearch([ip], http_auth=('username', 'password'), port=9200)
      self.es = Elasticsearch("localhost:9200")
      self.index_type = index_type
      self.index_name = index_name

The default port is 9200. Before initializing, make sure an Elasticsearch environment is already running locally.

Get a document by ID

  def get_doc(self, uid):
      return self.es.get(index=self.index_name, id=uid)

Insert documents

  def insert_one(self, doc: dict):
      self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

  def insert_array(self, docs: list):
      # One request per document: simple, but slow for large batches
      for doc in docs:
          self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

Search documents

  def search(self, query, count: int = 30):
      # Match the query against three fields and highlight hits in the title
      dsl = {
          "query": {
              "multi_match": {
                  "query": query,
                  "fields": ["title", "content", "link"]
              }
          },
          "highlight": {
              "fields": {
                  "title": {}
              }
          }
      }
      match_data = self.es.search(index=self.index_name, body=dsl, size=count)
      return match_data

  def __search(self, query: dict, count: int = 20):  # count: number of hits to return
      results = []
      params = {
          'size': count
      }
      match_data = self.es.search(index=self.index_name, body=query, params=params)
      # Keep only the stored document of each hit
      for hit in match_data['hits']['hits']:
          results.append(hit['_source'])

      return results
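The raw response from search is a nested dictionary. As a minimal sketch, assuming the usual `hits.hits` layout with a `_source` and an optional `highlight` entry per hit, the helper below flattens it into a list that a web view could render directly. The name `flatten_hits`, the `title_highlight` key and the sample response are hypothetical and not part of the original code.

```python
def flatten_hits(match_data):
    """Turn a search response into a flat list of dicts (hypothetical helper)."""
    results = []
    for hit in match_data.get("hits", {}).get("hits", []):
        item = dict(hit.get("_source", {}))
        # Highlighted fields arrive as a list of fragments; keep the first one
        fragments = hit.get("highlight", {}).get("title")
        if fragments:
            item["title_highlight"] = fragments[0]
        results.append(item)
    return results


# Hand-written sample shaped like a search response (assumed data)
sample = {
    "hits": {
        "hits": [
            {
                "_source": {"title": "Python", "link": "https://example.com/python"},
                "highlight": {"title": ["<em>Python</em>"]},
            },
            {"_source": {"title": "Flask", "link": "https://example.com/flask"}},
        ]
    }
}

rows = flatten_hits(sample)
```

Using `.get` with defaults means an empty or partial response yields an empty list instead of a KeyError.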

Delete data (this drops the whole index)

  def delete_index(self):
      try:
          self.es.indices.delete(index=self.index_name)
      except Exception:
          # Ignore failures, for example when the index does not exist
          pass

Wrapping all of this in a search class makes it easier to call, so here is the whole thing.

  from elasticsearch import Elasticsearch


  class elasticSearch():

      def __init__(self, index_type: str, index_name: str, ip="127.0.0.1"):
          # With authentication enabled, connect like this instead:
          # self.es = Elasticsearch([ip], http_auth=('elastic', 'password'), port=9200)
          self.es = Elasticsearch("localhost:9200")
          self.index_type = index_type
          self.index_name = index_name

      def create_index(self):
          # Start from a clean index: drop it first if it already exists
          if self.es.indices.exists(index=self.index_name) is True:
              self.es.indices.delete(index=self.index_name)
          self.es.indices.create(index=self.index_name, ignore=400)

      def delete_index(self):
          try:
              self.es.indices.delete(index=self.index_name)
          except Exception:
              # Ignore failures, for example when the index does not exist
              pass

      def get_doc(self, uid):
          return self.es.get(index=self.index_name, id=uid)

      def insert_one(self, doc: dict):
          self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

      def insert_array(self, docs: list):
          for doc in docs:
              self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

      def search(self, query, count: int = 30):
          # Match the query against three fields and highlight hits in the title
          dsl = {
              "query": {
                  "multi_match": {
                      "query": query,
                      "fields": ["title", "content", "link"]
                  }
              },
              "highlight": {
                  "fields": {
                      "title": {}
                  }
              }
          }
          match_data = self.es.search(index=self.index_name, body=dsl, size=count)
          return match_data

Now let's try inserting the data stored in MongoDB into ES.

  from datetime import datetime
  import pymongo
  from app.elasticsearchClass import elasticSearch

  client = pymongo.MongoClient('127.0.0.1', 27017)
  db = client['spider']
  # Read every record, leaving out MongoDB's own _id field
  sheet = db.get_collection('Spider').find({}, {'_id': 0, })

  es = elasticSearch(index_type="spider_data", index_name="spider")
  es.create_index()

  for i in sheet:
      data = {
          'title': i["title"],
          'content': i["data"],
          'link': i["link"],
          'create_time': datetime.now()
      }

      es.insert_one(doc=data)

To check the result in ES, start the elasticsearch-head plugin.

If you installed it with npm, cd into its root directory and run npm run start to bring it up.

Then open http://localhost:9100/ locally.

The newly added spider documents are indeed there.

3 Crawling data into storage

An ES search needs data behind it, and large volumes of data usually come from crawlers.

To save time, we write the simplest possible crawler and scrape Baidu Baike.

Keeping it simple and crude, first collect lots and lots of URLs recursively.

  import re
  import requests

  exist_urls = set()  # URLs that have already been requested
  headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
  }

  def get_link(url):
      try:
          response = requests.get(url=url, headers=headers)
          response.encoding = 'UTF-8'
          html = response.text
          # NOTE: the original pattern was damaged when this page was extracted.
          # This is a reconstruction (an assumption) that captures the part
          # after /item/ in each entry link.
          link_lists = re.findall(r'href="/item/([^":#=<>]*?)"', html)
          return link_lists
      except Exception:
          # A failed request simply yields no links
          return None
      finally:
          exist_urls.add(url)


  # While the crawl depth is below 10, call main recursively to crawl every link on the next level
  def main(start_url, depth=1):
      link_lists = get_link(start_url)
      if link_lists:
          # Build full URLs first so they can be compared with exist_urls
          full_urls = {'https://baike.baidu.com/item/' + link for link in link_lists}
          unique_lists = list(full_urls - exist_urls)
          for unique_url in unique_lists:
              with open('url.txt', 'a+') as f:
                  f.write(unique_url + '\n')
              if depth < 10:
                  main(unique_url, depth + 1)

  if __name__ == '__main__':
      start_url = 'https://baike.baidu.com/item/%E7%99%BE%E5%BA%A6%E7%99%BE%E7%A7%91'
      main(start_url)
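The crawl-and-deduplicate idea above can also be written without recursion. The following is only a sketch under stated assumptions: it swaps the recursive descent for a breadth-first queue, the link pattern is an assumed stand-in for the article's expression, and `extract_links`, `crawl` and the fake `pages` dictionary are hypothetical names used so the logic can run without touching the network.

```python
import re
from collections import deque

BASE = "https://baike.baidu.com/item/"
# Assumed pattern: capture whatever follows /item/ in an href attribute
ITEM_LINK = re.compile(r'href="/item/([^":#=<>]*?)"')


def extract_links(html):
    """Return absolute entry URLs found in html, in page order, without repeats."""
    links = []
    for fragment in ITEM_LINK.findall(html):
        url = BASE + fragment
        if url not in links:
            links.append(url)
    return links


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl; fetch(url) returns HTML and is injected by the caller."""
    visited = {start_url}
    queue = deque([(start_url, 1)])
    found = []
    while queue:
        url, depth = queue.popleft()
        for link in extract_links(fetch(url)):
            if link in visited:
                continue
            visited.add(link)
            found.append(link)
            # Only pages above the depth limit are fetched in turn
            if depth < max_depth:
                queue.append((link, depth + 1))
    return found


# Tiny fake site standing in for real pages (assumed data)
pages = {
    BASE + "A": '<a href="/item/B">B</a> <a href="/item/C">C</a>',
    BASE + "B": '<a href="/item/A">A</a> <a href="/item/D">D</a>',
    BASE + "C": '<a href="/item/B">B</a>',
}

urls = crawl(BASE + "A", lambda u: pages.get(u, ""), max_depth=2)
```

Passing `fetch` in as a parameter keeps the traversal testable; in a real run it could wrap `requests.get` with the headers shown earlier.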

Once all the URLs have been saved to url.txt, start the tasks.

  # parse.py
  from celery import Celery
  import requests
  from lxml import etree
  import pymongo

  app = Celery('tasks', broker='redis://localhost:6379/2')
  client = pymongo.MongoClient('localhost', 27017)
  db = client['baike']

  @app.task
  def get_url(link):
      item = {}
      headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'}
      res = requests.get(link, headers=headers)
      res.encoding = 'UTF-8'
      doc = etree.HTML(res.text)
      # Text nodes of the paragraphs in the entry summary
      content = doc.xpath("//div[@class='lemma-summary']/div[@class='para']//text()")
      print(res.status_code)
      print(link, '\t', '++++++++++++++++++++')
      item['link'] = link
      # Join the fragments and strip spaces, tabs and line breaks
      data = ''.join(content).replace(' ', '').replace('\t', '').replace('\n', '').replace('\r', '')
      item['data'] = data
      # insert_one is used because Collection.insert was removed in PyMongo 4
      if db['Baike'].insert_one(dict(item)).inserted_id:
          print("is OK ...")
      else:
          print('Fail')
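The text clean-up step in parse.py can be isolated and checked on its own. Here is a small sketch, assuming the same rule of joining the text nodes and deleting spaces, tabs and line breaks; `clean_text`, `build_item` and the sample fragments are hypothetical names and data, not part of the original task.

```python
# Translation table that deletes spaces, tabs, newlines and carriage returns
_STRIP_WHITESPACE = str.maketrans("", "", " \t\n\r")


def clean_text(fragments):
    """Join text nodes and remove the whitespace characters listed above."""
    return "".join(fragments).translate(_STRIP_WHITESPACE)


def build_item(link, fragments):
    """Return the record to store, or None when the page produced no text."""
    data = clean_text(fragments)
    if not data:
        return None
    return {"link": link, "data": data}


# Fragments shaped like the output of an xpath //text() query (assumed data)
sample_fragments = ["\n百度百科 ", "是一部\t", "网络百科全书\r\n"]
item = build_item("https://baike.baidu.com/item/example", sample_fragments)
```

Removing every space suits Chinese text, but it would glue English words together, so a page with mixed content might call for a gentler rule such as collapsing runs of whitespace.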

Now let run.py fly.

  from parse import get_url

  def main(url):
      # Queue the URL as a Celery task instead of fetching it here
      result = get_url.delay(url)
      return result

  def run():
      with open('./url.txt', 'r') as f:
          for url in f.readlines():
              main(url.strip('\n'))

  if __name__ == '__main__':
      run()
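Because the crawler appends to url.txt on every pass, the file may contain blank lines and repeated URLs. The following sketch shows one way to filter them before queuing tasks; `iter_urls` and the in-memory sample file are hypothetical, used here so the example runs without the real file.

```python
import io


def iter_urls(lines):
    """Yield each non-empty URL once, in the order it first appears."""
    seen = set()
    for line in lines:
        url = line.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        yield url


# In-memory stand-in for url.txt (assumed data)
sample_file = io.StringIO(
    "https://baike.baidu.com/item/A\n"
    "\n"
    "https://baike.baidu.com/item/B\n"
    "https://baike.baidu.com/item/A\n"
)

urls = list(iter_urls(sample_file))
```

In run.py the same generator could wrap the open file object, so each distinct URL is handed to `get_url.delay` only once.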

In a terminal, type:

  celery -A parse worker -l info -P gevent -c 10

Whoa!! You are now running a Celery task queue in gevent mode. -c 10 sets the concurrency to 10, so ten greenlets get straight to work, and it is fast!!

What, distributed? Just add more machines: copy the code to the target servers and let them crawl together through the shared redis queue.

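Here the data goes into MongoDB first (a personal habit). You could also write straight to ES, but inserting documents one at a time is worryingly slow (optimization is coming up later, haha).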

Use the earlier example to batch-import the data from Mongo into ES, OK!!!
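For a batch import, one common alternative to indexing record by record is to describe each document as an action dictionary and send the actions in bulk. The sketch below only builds those actions; `to_actions` is a hypothetical helper, the field mapping mirrors the import script above, and the sample records are assumed data.

```python
def to_actions(records, index_name):
    """Yield one bulk action per MongoDB record (hypothetical helper)."""
    for record in records:
        yield {
            "_index": index_name,
            "_source": {
                # Missing fields fall back to an empty string
                "title": record.get("title", ""),
                "content": record.get("data", ""),
                "link": record.get("link", ""),
            },
        }


# Records shaped like the crawler output (assumed data)
records = [
    {"title": "A", "data": "text a", "link": "https://baike.baidu.com/item/A"},
    {"data": "text b", "link": "https://baike.baidu.com/item/B"},
]

actions = list(to_actions(records, "spider"))
```

Actions in this shape could then be handed to `elasticsearch.helpers.bulk(client, actions)`, which sends many documents per request instead of one.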

And with that, a simple data crawl is done.

Alright, ES now has data in it, so next comes the Flask web side. Of course, Django and FastAPI are excellent too. Hehe, whichever you like!!
