Scraping the Search Index from Baidu Index with Python

This post grew out of a real need to collect data with a crawler and then analyze it. It is for learning purposes only and is recorded here for reference.

1. Environment: Python 3.7 + PyCharm. 1.1 Required libraries: datetime, requests, execjs (optional; the execjs module is provided by the PyExecJS package).

1.2 To view the JSON responses more comfortably, it is recommended to install the JSONView extension in Chrome (https://github.com/gildas-lormeau/JSONView-for-Chrome).

2. Why the Baidu Index data is hard to get: 2.1 The data returned by the Baidu Index request URL is not JSON that can be parsed and used directly. Instead it is encrypted data plus a uniqid; the uniqid has to be sent in a second request (to the URL described later) to obtain the decryption key, after which the front end decrypts the data and renders it into the line chart.

2.2 You must be logged into a Baidu account on the Baidu Index page. For lack of time, all of the scraping in this post was done while logged in.

2.3 The front-end decryption code needs to be converted into Python, although using the front-end code directly also works.

2.3.1 Without converting it, you can still decrypt as shown below by letting execjs execute the JavaScript code directly.

# One of Python's strengths is its rich ecosystem of third-party libraries: you can execute JS code directly. If you are not familiar with the decryption algorithm and cannot convert it to Python, just run the JS code as-is
    js = execjs.compile('''
            function decryption(t, e){
                for(var a=t.split(""),i=e.split(""),n={},s=[],o=0;o<a.length/2;o++)
                    n[a[o]]=a[a.length/2+o]
                for(var r=0;r<e.length;r++)
                    s.push(n[i[r]])
                return s.join("")
            }
    ''')
    res = js.call('decryption', key, source)  # decrypt via the JS function (in the full script below this block is commented out; uncomment it to use this approach)

2.3.2 The Python equivalent of the front-end JavaScript code

# Decrypt the search index data
def decryption(keys, data):
    dec_dict = {}
    for j in range(len(keys) // 2):
        dec_dict[keys[j]] = keys[len(keys) // 2 + j]

    dec_data = ''
    for k in range(len(data)):
        dec_data += dec_dict[data[k]]
    return dec_data
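
To see how the mapping works, here is a tiny made-up key/data pair (illustration only, not real Baidu Index output): the first half of the key lists the cipher characters and the second half lists the character each one decodes to.

```python
# Hypothetical example: with key 'ab12', 'a' maps to '1' and 'b' maps to '2',
# so the encrypted string 'ba' decrypts to '21'.
assert decryption('ab12', 'ba') == '21'
```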

2.4 Get your own Cookie after logging in (this is required; without it no data can be retrieved). How to grab the Cookie is shown in the screenshot below; pay close attention to the parts I marked in red.
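
As a side note (my own habit, not part of the original steps), you can keep the copied Cookie out of the source code by reading it from an environment variable:

```python
import os

# BAIDU_INDEX_COOKIE is a hypothetical environment variable name; paste the
# Cookie string copied from the browser into it before running the script.
cookie = os.environ.get('BAIDU_INDEX_COOKIE', '')
if not cookie:
    raise RuntimeError('Set the BAIDU_INDEX_COOKIE environment variable first')
```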

3. Steps to scrape the data. 3.1 Build the request headers, which every crawler needs; simply copy the entire set of request headers from step 2.4.

header = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Cookie': 'your Cookie after logging in',
        'Host': 'index.baidu.com',
        'Referer': 'https://index.baidu.com/v2/main/index.html',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
        'sec-ch-ua-mobile': '?0',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
        'Cipher-Text': 'your Cipher-Text after logging in'
    }


3.2 Analyze the URLs

3.2.1 The URL that requests the data (already shown in 2.4)

```python
dataUrl = 'https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22%E4%B8%BD%E6%B1%9F%E5%8F%A4%E5%9F%8E%22,%22wordType%22:1%7D]]&days=30'
```

In this URL the Chinese characters and some symbols are percent-encoded; you only need to locate the part that corresponds to the keyword. %22 is the double quote ("), so comparing with the browser's address bar shows exactly where the Chinese characters are. The trailing days=30 means one month of data, counted backwards from the day before the current date; change days to fetch a longer or shorter range. Opening dataUrl in a browser returns the following data.
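
If you do not want to hand-encode the keyword, the same URL can be built programmatically; a minimal sketch (keeping [ ] : , unescaped reproduces the URL shown above):

```python
from urllib.parse import quote

scenicName = '丽江古城'
# percent-encode the braces, quotes and Chinese characters, but keep [ ] : , as-is
word = quote('[[{"name":"%s","wordType":1}]]' % scenicName, safe='[]:,')
dataUrl = 'https://index.baidu.com/api/SearchApi/index?area=0&word=%s&days=30' % word
```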

Decrypting the data under all, pc, and wise and comparing it with the values shown on the search index line chart shows that the all section is the search index data. Everything this request returns is here; note the uniqid, and note that both the encrypted data and the uniqid change on every refresh.

3.2.2 The URL for fetching the decryption key

After repeated analysis, I found that the uniqid returned by the data request URL appears in the URL below.

So the flow is: first request the data URL and parse out the encrypted search index data and the uniqid, then append the uniqid to the key URL to fetch the decryption key, and finally call the decryption function to recover the search index values.

keyUrl = 'https://index.baidu.com/Interface/ptbk?uniqid='

3.2.3 With both URLs identified, the crawler is essentially done: send the requests, parse the responses, and decrypt the data.

4. Complete code

import datetime

import requests
import execjs


# Decrypt the search index data
def decryption(keys, data):
    dec_dict = {}
    for j in range(len(keys) // 2):
        dec_dict[keys[j]] = keys[len(keys) // 2 + j]

    dec_data = ''
    for k in range(len(data)):
        dec_data += dec_dict[data[k]]
    return dec_data


if __name__ == "__main__":
    scenicName = '丽江古城'

    dataUrl = 'https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22' + scenicName + '%22,%22wordType%22:1%7D]]&days=30'
    keyUrl = 'https://index.baidu.com/Interface/ptbk?uniqid='
    header = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Cookie': 'your Cookie after logging in',
        'Host': 'index.baidu.com',
        'Referer': 'https://index.baidu.com/v2/main/index.html',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
        'sec-ch-ua-mobile': '?0',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
        'Cipher-Text': 'your Cipher-Text after logging in'
    }
    # request timeout is 30 seconds
    resData = requests.get(dataUrl, timeout=30, headers=header)

    uniqid = resData.json()['data']['uniqid']
    print("uniqid:{}".format(uniqid))
    keyData = requests.get(keyUrl + uniqid, timeout=30, headers=header)
    keyData.raise_for_status()
    keyData.encoding = resData.apparent_encoding

    # parse the JSON response
    startDate = resData.json()['data']['userIndexes'][0]['all']['startDate']
    print("startDate:{}".format(startDate))
    endDate = resData.json()['data']['userIndexes'][0]['all']['endDate']
    print("endDate:{}".format(endDate))
    source = (resData.json()['data']['userIndexes'][0]['all']['data'])  # raw encrypted data
    print("encrypted data:{}".format(source))
    key = keyData.json()['data']  # decryption key
    print("key:{}".format(key))

    # One of Python's strengths is its rich ecosystem of third-party libraries: you can execute JS code directly. If you are not familiar with the decryption algorithm and cannot convert it to Python, just run the JS code as-is
    # js = execjs.compile('''
    #         function decryption(t, e){
    #             for(var a=t.split(""),i=e.split(""),n={},s=[],o=0;o<a.length/2;o++)
    #                 n[a[o]]=a[a.length/2+o]
    #             for(var r=0;r<e.length;r++)
    #                 s.push(n[i[r]])
    #             return s.join("")
    #         }
    # ''')
    # res = js.call('decryption', key, source)  # to decrypt this way, uncomment the block above

    res = decryption(key, source)
    # print(type(res))
    resArr = res.split(",")

    dateStart = datetime.datetime.strptime(startDate, '%Y-%m-%d')
    dateEnd = datetime.datetime.strptime(endDate, '%Y-%m-%d')
    dataLs = []
    while dateStart <= dateEnd:
        dataLs.append(str(dateStart))
        dateStart += datetime.timedelta(days=1)
        # print(dateStart.strftime('%Y-%m-%d'))

    ls = []
    for i in range(len(dataLs)):
        ls.append([scenicName, dataLs[i], resArr[i]])

    for i in range(len(ls)):
        print(ls[i])

5. Summary. Overall, this crawler is basically complete. While writing the code I looked up a Python implementation of the decryption algorithm and read a few blog posts on date handling; all of the blog URLs are listed below:

https://blog.csdn.net/weixin_41074255/article/details/90579939
https://blog.csdn.net/junli_chen/article/details/52944724
https://blog.csdn.net/lilongsy/article/details/80242427
https://blog.csdn.net/philip502/article/details/14004815/

Thanks to the authors of those blogs; without them I could not have finished this post. This article only records the problems I ran into in practice and how I solved them. Please forgive any shortcomings, and if you have a better approach, feel free to share it in the comments so everyone can learn from it.

The index-scraping program I actually use

  • baidu.py

# coding: utf-8

# import execjs
import threading
import queue
from _env import _global, _proxies, _open, _from_file_name, _to_file_name, _error_file_name, _read_time, _startDate, _endDate, _threads
import requests
import time
import pandas as pd
import random
from fake_useragent import UserAgent
import re


num_of_threads = _threads  # number of worker threads (set in _env)
q = queue.Queue()  # create a FIFO task queue with no size limit
threads = []   # the thread pool


# Decrypt the search index data
def decryption(keys, data):
    dec_dict = {}
    for j in range(len(keys) // 2):
        dec_dict[keys[j]] = keys[len(keys) // 2 + j]

    dec_data = ''
    for k in range(len(data)):
        dec_data += dec_dict[data[k]]
    return dec_data


# Request the data
def response(word, Cookie, Cipher_Text):
    scenicName = word

    dataUrl = 'https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22' + \
        scenicName + '%22,%22wordType%22:1%7D]]&startDate=' + _startDate  + '&endDate=' + _endDate
    # keyUrl = 'https://index.baidu.com/Interface/ptbk?uniqid='
    header = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Cookie': Cookie,
        'Host': 'index.baidu.com',
        'Referer': 'https://index.baidu.com/v2/main/index.html',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
        'sec-ch-ua-mobile': '?0',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        # 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
        'User-Agent': UserAgent().random,
        'Cipher-Text': Cipher_Text
    }
    
    # check whether proxy mode is enabled
    if _open:
        # request timeout is 30 seconds
        resData = requests.get(dataUrl, timeout=30, headers=header, proxies=get_proxy())
    else:
        resData = requests.get(dataUrl, timeout=30, headers=header)

    _res = resData.json()

    _search = re.search('uc_login_unique=.*?;', Cookie)
    # print(Cookie)
    # print("调用Cookies: {0}".format(_search.group()))

    if _res['status'] == 10018:
        print("="*60)
        print("\n")
        print("Cookies: {0}\n警告: {1}\n\n如何处理?: 先暂停此账户,过一会再重新使用!!!".format(_search.group(), _res['message']))
        print("\n")
        print("="*60)

    if _res['status'] == 10000:
        print("="*60)
        print("\n")
        print("Cookies: {0}\n警告: {1}!".format(_search.group(), _res['message']))
        print("\n")
        print("="*60)

    return _res

    # print(resData.json())


def get_data(_res_data):
    _start_time = _res_data['userIndexes'][0]['all']['startDate']  # start date
    _end_time = _res_data['userIndexes'][0]['all']['endDate']     # end date

    _search_word = _res_data['generalRatio'][0]['word'][0]['name']  # keyword
    _all_avg = _res_data['generalRatio'][0]['all']['avg']      # overall daily average
    _all_yoy = _res_data['generalRatio'][0]['all']['yoy']      # overall year-on-year
    _all_qoq = _res_data['generalRatio'][0]['all']['qoq']      # overall period-on-period

    _pc_avg = _res_data['generalRatio'][0]['pc']['avg']        # PC daily average
    _pc_yoy = _res_data['generalRatio'][0]['pc']['yoy']        # PC year-on-year
    _pc_qoq = _res_data['generalRatio'][0]['pc']['qoq']        # PC period-on-period

    _wise_avg = _res_data['generalRatio'][0]['wise']['avg']    # mobile daily average
    _wise_yoy = _res_data['generalRatio'][0]['wise']['yoy']    # mobile year-on-year
    _wise_qoq = _res_data['generalRatio'][0]['wise']['qoq']    # mobile period-on-period

    return [_search_word, _all_avg, _all_yoy,
          _all_qoq, _pc_avg, _pc_yoy, _pc_qoq, _wise_avg, _wise_yoy, _wise_qoq, _start_time, _end_time]

# Create the Excel output file (header row only)
def create_form(excel_file_name):
    form_header = ['关键词', '整体日均值', '整体同比',
                   '整体环比', 'PC日均值', 'PC同比', 'PC环比', '移动日均值', '移动同比', '移动环比', '开始时间','结束时间']
    df = pd.DataFrame(columns=form_header)
    df.to_excel(excel_file_name, index=False)

# Append one row of data to the Excel file
def add_info_to_form(excel_file_name, data=[]):
    df = pd.read_excel(excel_file_name)
    row_index = len(df) + 1  # index label for the new row
    df.loc[row_index] = data
    df.to_excel(excel_file_name, index=False)


# Write keywords that returned no results to a text file
def error_to_txt(_txt):
    fp = open(_error_file_name, 'a+', encoding='utf8')
    fp.write(_txt+"\n")
    fp.close()


def worker(i):
    while True:
        item = q.get()
        if item is None:
            print("线程%s: 消息队列发现了一个None,可以休息了^-^" % i)
            break
        # do_work(item): do the actual work
        time.sleep(random.randint(0, int(_read_time)))

        # pick a random account's cookie
        _cookie = get_cookie()
        # search the keyword
        _res = response(word=str(item.replace(" ", "")),Cookie=_cookie['Cookie'], Cipher_Text=_cookie['Cipher_Text'])

        # stop this worker if the cookie is invalid or the account is not logged in
        if _res['status'] == 10018 or _res['status'] == 10000:
            q.task_done()  # mark the item as done so q.join() does not hang, then stop this thread
            break

        # check the returned status
        if _res['status'] == 10002:
            print("Thread %s: Baidu Index search NOTFOUND <%s>" % (i, item))
            error_to_txt(item)

        
        try:
            if _res['status'] == 0:
                data = get_data(_res['data'])
                add_info_to_form(_to_file_name, data)
                print("线程%s: 百度指数搜索 SUCCESS <%s>" % (i, item))
        except Exception as e:
            print("线程%s: 百度指数搜索 ERROR <%s> " % (i, item))
            error_to_txt(item)



        # signal that this task is done, then move on to the next one
        q.task_done()


# Read the keyword txt file and return all of its lines, to feed the task queue
def read_filename(fromFileName):
    _source = []
    with open(fromFileName, 'r', encoding='utf-8') as file:
        _source = file.read().splitlines()

    print('Number of keywords: <%s>' % len(_source))
    return _source



def main():
    print('='*60)
    print('\n')
    print('Starting up......')
    create_form(_to_file_name)
    print('Reading: %s' % _from_file_name)
    _source = read_filename(fromFileName=_from_file_name)

    # start the worker threads and add them to the thread pool
    for i in range(1, num_of_threads+1):
        t = threading.Thread(target=worker, args=(i,))
        threads.append(t)
        t.start()

    # publish a new task every 0.1 seconds
    for item in _source:
        time.sleep(0.1)
        q.put(item)

    q.join()
    print("-----搜索都完成了-----")

    # stop the worker threads
    for i in range(num_of_threads):
        q.put(None)
    for t in threads:
        t.join()




# Pick a proxy
def get_proxy():
    proxy_ip = random.choice(_proxies)

    # use the same randomly chosen proxy for http and https
    proxies = {'http': proxy_ip, 'https': proxy_ip}
    return proxies

# Randomly pick one account's Cookie / Cipher-Text
def get_cookie():
    _index = random.randint(0, len(_global) - 1)    # random index; list indices start at 0, so subtract 1
    _keys = list(_global.keys())                    # collect the account keys into a list

    _use_key = _keys[_index]                        # pick one key at random
    _cookie = _global[_use_key]                     # look up the cookie dict for that key
    return _cookie


if __name__ == "__main__":
    main()

  • _env.py

_threads = 8    # number of concurrent Baidu Index search threads; your choice, ideally 2x your CPU core count
_read_time = 3  # maximum random delay (seconds) between searches in each thread; must be greater than 0
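
# An alternative (my own suggestion, not part of the original config): derive the
# thread count from the machine instead of hard-coding it, e.g.
#   import os
#   _threads = (os.cpu_count() or 4) * 2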

# ===========================================
#       Baidu Index keyword file, output data file, and error file
# ===========================================
_from_file_name = '关键词.txt'
_to_file_name = '关键词.xls'
_error_file_name = '未搜索关键词.txt'


# ===========================================
#        Baidu Index search date range and Cookies; fill these in yourself
# ===========================================
# Search statistics date range: 2022-06-29 ~ 2022-07-28
_startDate = '2022-06-29'
_endDate = '2022-07-28'


_global = {
    # Baidu account 1: xxxx
    '_var1': {
        'Cookie': 'BAIDUID=FEB688DCA7A2A3F745D140835D5A17EB:FG=1; Hm_lvt_d101ea4d2a5c67dab98251f0b5de24dc=1659071713; SIGNIN_UC=70a2711cf1d3d9b1a82d2f87d633bd8a04090841788b8wIk4yyecOyxgTYqPRVIrUOHjSmrWkAkMU5fLxWMCeGis%2BOwrGDihIlQvNTg1fn8%2BLE67Y9HcwzSOn1vWDxCGsrDS1JrRDiS4BVAK88wNX39zALNPBS9MwkJ8x%2F5Ksx40d313q1IV3O0MIQ2wWYutUVxJckUKcJaG7uk6rTRcIMcKgCzfLwNoC8lP2Sv%2FzZmjIOJ6L0LsGXV%2FPbww0IHBVs4S5o%2FX9D%2FRKwiSHBgnR2bF7XsEnNgJh2QhiymdGzSR6FyUM%2BEIEbGLAyl%2Bn7tw%3D%3D30907553640450600111758967547676; uc_login_unique=d9d6d671b925d6f40d189e40315a24b5; uc_recom_mark=cmVjb21tYXJrXzM0MDM0OTMz; __cas__st__212=a8b1ee2ce91f6a8df98168f341232674e8fb167b5268f51e7b71b5d6af13e60af6cbc529f24d7913566891f4; __cas__id__212=34034933; __cas__rn__=409084178; CPID_212=34034933; CPTK_212=2037297134; Hm_up_d101ea4d2a5c67dab98251f0b5de24dc=%7B%22uid_%22%3A%7B%22value%22%3A%2234034933%22%2C%22scope%22%3A1%7D%7D; bdindexid=9b6n8dbcfl9pvf2d3re6so5g22; Hm_lpvt_d101ea4d2a5c67dab98251f0b5de24dc=1659071753; ab_sr=1.0.1_MzBhYWRiMzhmMzk3ZTJmZmU1NDAwOTc3ODExY2U4NmVjOTc5NzEwODc5NmMyNGZiZWY1NWY0YmQzMmUyZmE4OGJiNDdlM2ZlYTNjMDllNDMxNjAwMjA3NmVkODU3YjlhNTQ2NmRmNGNlMjZiNmZlNDI4ZGVlZmRlMDYwYmZmZjljMmExYTdlZTM3ZGMyMTUyNWYzZjU1NWVmY2U1NmQ0MQ==; RT="z=1&dm=baidu.com&si=b6a4f9c4-9cee-44bd-b25b-ae8b15118980&ss=l660gkx5&sl=e&tt=c93&bcn=https%3A%2F%2Ffclog.baidu.com%2Flog%2Fweirwood%3Ftype%3Dperf&ld=z00"',
        'Cipher_Text': '1658991609296_1659071764244_q1ZmVh/bml4j1LkN4yaPImQ9fVSrYOaycdWSxqCcP4BAfWhr0X8ZYdmQrvYh28BM1jwSi6br4M+biqxnh6revxn1AgSmy86omMWbiaS7geWFgQxm4/8/fYxAD2rf0lxDJLQyTXM7YxoKw3KHuu1QmAcjUUEjSvxsIccRfsPZcMLYrwZwBkya7uVS4zhC2CWj45aiXwKW7T+fdgBOwFEPCkkyEq1lQzRYMOJNdfKpsVRECtxU33x6HB4Z6+qh1HFnPdEQO/HTTesIXlNKKA9J/h31W5Ro4+jAjuiObCU+B5qFmRFEx9TWmkHJKOQNPysB6+klClikwc0151OMWYp38EBNCrQ0MTWL60th/5w+8N0P1oM8AicfVg6v/PngW1qj'
    },
    # Baidu account 2: xxxx
    '_var2': {
        'Cookie': 'BIDUPSID=4A348FD99309494578B4C851F95560EC; PSTM=1655810429; BAIDUID=4A348FD9930949454B2582A91BC3CF79:FG=1; BAIDUID_BFESS=4A348FD9930949454B2582A91BC3CF79:FG=1; ZFY=1xUxhTYbUvlKMnXO:A:Az0ZauuAj4:BYq9R6mvvllP2LmA:C; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BA_HECTOR=8kak842120aga5048k8ip2br1he4l8816; uc_login_unique=d7ac76c88f88040ec973600869743eb1; H_PS_PSSID=36554_36625_36255_36726_36414_36840_36954_36165_36917_36569_36652_36745_26350_36865_36649; delPer=0; PSINO=2; BCLID=10649298767364973054; BDSFRCVID=ZIkOJexroG0leprDIU_7DRpU-rpWxY5TDYrELPfiaimDVu-VJeC6EG0Pts1-dEu-EHtdogKKymOTHrAF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF=tR30WJbHMTrDHJTg5DTjhPrMW4rWWMT-MTryKKJs54JKshTaBTJU0R8Aqq5jLbvkJGnRh4oNBUJtjJjYhfO45DuZyxomtfQxtNRJQKDE5p5hKq5S5-OobUPU2fc9LUvH0mcdot5yBbc8eIna5hjkbfJBQttjQn3hfIkj2CKLK-oj-D_GDjuM3e; BCLID_BFESS=10649298767364973054; BDSFRCVID_BFESS=ZIkOJexroG0leprDIU_7DRpU-rpWxY5TDYrELPfiaimDVu-VJeC6EG0Pts1-dEu-EHtdogKKymOTHrAF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF_BFESS=tR30WJbHMTrDHJTg5DTjhPrMW4rWWMT-MTryKKJs54JKshTaBTJU0R8Aqq5jLbvkJGnRh4oNBUJtjJjYhfO45DuZyxomtfQxtNRJQKDE5p5hKq5S5-OobUPU2fc9LUvH0mcdot5yBbc8eIna5hjkbfJBQttjQn3hfIkj2CKLK-oj-D_GDjuM3e; Hm_lvt_d101ea4d2a5c67dab98251f0b5de24dc=1658991306,1659010575,1659066565,1659071534; SIGNIN_UC=70a2711cf1d3d9b1a82d2f87d633bd8a04090840555%2BfYDAukYB0EMdCSB9fNk5K6A%2F8XyKPwSTZLSK85wB4q6qDU0HGagkjS%2FHKxEETVXQedXr%2FwFXNFFGkQdUhi2h33LFrF2HZlGju4d40SFcrO3wd%2FD6grexJwmOlJ%2BLY4dKhzqHz1b4xsL1wiwaO6FqLkiF%2Bev4PfkcvQQq1c6UGDX5TClH4ovXx4fZwjj1g0lpsB2ug7bL6ttrCkusbyXoaHZjPnBLmPwDbHp9j0rC2q2Qzm74GQIMKL%2BSaXFvVuoqrpdZ%2Bno4Qtm9jD3sI%2FYyQ%3D%3D47412425789526699753615829911668; uc_recom_mark=cmVjb21tYXJrXzMzOTc5MDkz; __cas__st__212=c38f91a94579d3d6b084f8424371621cc67b2872a920b6ed2b2c88f23632c18195469761ebc925e6132dd0a2; __cas__id__212=33979093; __cas__rn__=409084055; CPID_212=33979093; CPTK_212=1949800301; Hm_up_d101ea4d2a5c67dab98251f0b5de24dc=%7B%22uid_%22%3A%7B%22value%22%3A%2233979093%22%2C%22scope%22%3A1%7D%7D; bdindexid=dm4rlcp0va4a5s6un38sao3uq3; Hm_lpvt_d101ea4d2a5c67dab98251f0b5de24dc=1659071625; ab_sr=1.0.1_MzA2NjNkNzEzOGIxMDE4YWY4MzM4ZWVhOWY5ODY5N2VhZmMxZGY1MzNmOTZmYTBjN2RhYTNlMmFmYTkyODE0ZTJmZDdhNzRiNmY4YmM1YzZiZTQ3NDc2ODEzYWY0Y2Y1ZTg5YzNjNDE1OWMxOTJmNTM0YjRkYTgwZTc2Y2QwYjRkNTI0YTc1Y2EwMmJiNDJlODA0YTQ2MWU3OGU0NmEwZA==; RT="z=1&dm=baidu.com&si=38baa5bf-fbb6-4170-8899-a33a051ee109&ss=l660csaw&sl=n&tt=gic&bcn=https%3A%2F%2Ffclog.baidu.com%2Flog%2Fweirwood%3Ftype%3Dperf&ld=2cb4"',
        'Cipher_Text': '1658991609296_1659071645467_q1ZmVh/bml4j1LkN4yaPImQ9fVSrYOaycdWSxqCcP4BAfWhr0X8ZYdmQrvYh28BM1jwSi6br4M+biqxnh6revxn1AgSmy86omMWbiaS7geXyJkXG40GYuO0UsL0vlybzUNPaKbgBwFL5F0qZr38iTTzfnLER1MdHJ/Atj9DKQYIKctYJoJOV/v5PFXN+2d7Wq12fcOk3Ch27M8MYmZdby/vtt0QlG2zwTtedpQ57S5HfE30ikEHoKdHJtv0UdREVLC9DulieA4U0+qUK9HB0P9CAOz2HztqRy7ty1jjOPLlcQr+OOUeOq1n21O/Qne49hk0L60fuqahLFxLfUMCYcEXBgnxf1cCClaz0a69WQB1YOvQhbO5UWlCwlP81X9/xTovWhJIpW+y9qyZ2ro6ECw=='
    },
}

# ===========================================
#        Whether the Baidu Index crawler should use a proxy
#        HTTP proxies have been tested; other proxy types are untested
# ===========================================
_open = False   # proxy switch; False means no proxy (off by default)

_proxies = [
    "http://proxy.xxx.com:60001",
]


  • Keywords go in: 关键词.txt (format shown below)
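
read_filename() simply splits this file into lines, so the expected format is one keyword per line; a hypothetical example of 关键词.txt (the first entry is the keyword used earlier in this post, the others are made up):

```
丽江古城
大理古城
玉龙雪山
```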

Searching a keyword on the Baidu homepage and fetching the result page

# -*- coding:utf-8 -*-

# Fetch web page content and save it to a file
import requests
from lxml import etree


class MySpider(object):
    def __init__(self):
        self.url = "http://www.baidu.com/s?wd={name}"
        # note where the headers came from (copied from the browser)
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36"
        }


    def get(self):
        """Send the request and return the page's HTML source."""
        response = requests.get(self.url, headers=self.headers)
        if response.status_code == 200:
            return response.text

    def write(self, text):
        # print(text)
        with open("%s.html" % self.target, "w", encoding="utf-8") as f:
            f.write(text)

    def parse(self):
        # parse a previously saved result page ("python.html" is hard-coded in the original)
        with open("python.html", "r", encoding="utf-8") as f:
            self.html = etree.HTML(f.read())
            # extract the result titles
            h3_tags = self.html.xpath('//h3[contains(@class,"t")]//text()')
            h3_tags = [i.strip() for i in h3_tags]
            print(h3_tags)


    def main(self):
        # build the url
        self.target = input("Enter a topic you are interested in: ")
        self.url = self.url.format(name=self.target)
        # send the request
        text = self.get()
        # write the page to a file
        self.write(text)

if __name__ == "__main__":
    spider = MySpider()
    spider.main()


# References: https://www.yht7.com/news/130433
# https://www.helloworld.net/p/9491097957
