Scraping YouTube Playlist Information with Scrapy
Published: 2019-06-27



Scraping YouTube playlist data requires that the machine where Scrapy is deployed can reach the YouTube website.

The data saved to MongoDB looks like this (excerpt):

{
    "playlist_id" : "PLEbPmOCXPYV67l45xFBdmodrPkhzuwSe9",
    "videos" : [
        {
            "playlist_id" : "PLEbPmOCXPYV67l45xFBdmodrPkhzuwSe9",
            "video_id" : "9pTwztLOvj4",
            "thumbnail" : [
                {
                    "url" : "https://i.ytimg.com/vi/9pTwztLOvj4/hqdefault.jpg?sqp=-oaymwEZCPYBEIoBSFXyq4qpAwsIARUAAIhCGAFwAQ==&rs=AOn4CLCmUXUPe-HgXiie0SRfL5cYz0JRrg",
                    "width" : 245,
                    "height" : 137
                }
            ],
            "title" : "Legend of the galactic heroes (1988) episode 1",
            "index" : 1,
            "length_seconds" : 1445,
            "is_playable" : true
        },
        {
            "playlist_id" : "PLEbPmOCXPYV67l45xFBdmodrPkhzuwSe9",
            "video_id" : "zzD1xU37Vtc",
            "thumbnail" : [
                {
                    "url" : "https://i.ytimg.com/vi/zzD1xU37Vtc/hqdefault.jpg?sqp=-oaymwEZCPYBEIoBSFXyq4qpAwsIARUAAIhCGAFwAQ==&rs=AOn4CLCnLCYaZVBeHnZR0T73rfEd_Dbyew",
                    "width" : 245,
                    "height" : 137
                }
            ],
            "title" : "Legend of the galactic heroes (1988) episode 2",
            "index" : 2,
            "length_seconds" : 1447,
            "is_playable" : true
        },
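The post does not show how the items reach MongoDB. One way this might be done is with a Scrapy item pipeline, sketched here under assumptions: the class name `MongoPlaylistPipeline`, the settings keys `MONGO_URI` and `MONGO_DATABASE`, the database name `knowsmore`, the collection name `playlists`, and the choice to upsert by `playlist_id` are all hypothetical and are not taken from the author's project.

```python
from collections.abc import Mapping


def to_plain(value):
    """Recursively convert Scrapy items (which are mappings) into plain dicts."""
    if isinstance(value, Mapping):
        return {key: to_plain(val) for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(val) for val in value]
    return value


class MongoPlaylistPipeline:
    """Hypothetical pipeline: stores one MongoDB document per playlist."""

    collection_name = "playlists"  # assumed collection name

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed names in settings.py.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "knowsmore"),
        )

    def open_spider(self, spider):
        import pymongo  # imported lazily so the module loads without pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Nested video items must become plain dicts before insertion.
        document = to_plain(item)
        # Upsert so that re-crawling a playlist replaces the old document.
        self.collection.replace_one(
            {"playlist_id": document["playlist_id"]}, document, upsert=True
        )
        return item
```

Such a pipeline would be enabled through the `ITEM_PIPELINES` setting; the module path to use depends on where the class lives in the project.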

The code is as follows:

# -*- coding: utf-8 -*-
import scrapy
import re
import json
from scrapy import Selector
from knowsmore.items import YoutubePlaylistItem, YoutubePlaylistVideoItem
from ..common import *


class YoutubeListSpider(scrapy.Spider):
    name = 'youtube_list'
    allowed_domains = ['www.youtube.com']
    start_urls = ['https://www.youtube.com/playlist?list=PLEbPmOCXPYV67l45xFBdmodrPkhzuwSe9']

    def parse(self, response):
        # Extract the embedded JSON data with a regular expression.
        # r1 is provided by ..common. response.text (the decoded body) is used
        # because a str pattern cannot be matched against bytes on Python 3.
        ytInitialData = r1(r'window\["ytInitialData"\] = (.*?)}};', response.text)
        if ytInitialData:
            # Restore the closing braces consumed by the regex terminator.
            ytInitialData = '%s}}' % ytInitialData
            ytInitialDataObj = json.loads(ytInitialData)

            # Walk down to the renderer that holds the playlist's video list.
            playListInfo = ytInitialDataObj['contents']['twoColumnBrowseResultsRenderer']['tabs'][0]['tabRenderer']['content']['sectionListRenderer']['contents'][0]['itemSectionRenderer']['contents'][0]['playlistVideoListRenderer']

            # Build the Scrapy item for the playlist.
            playList = YoutubePlaylistItem(
                playlist_id=playListInfo['playlistId'],
                videos=[]
            )

            # Append one video item per entry to the playlist's videos field.
            for videoInfo in playListInfo['contents']:
                videoInfo = videoInfo['playlistVideoRenderer']
                videoItem = YoutubePlaylistVideoItem(
                    playlist_id=playListInfo['playlistId'],
                    video_id=videoInfo['videoId'],
                    thumbnail=videoInfo['thumbnail']['thumbnails'],
                    title=videoInfo['title']['simpleText'],
                    index=videoInfo['index']['simpleText'],
                    length_seconds=videoInfo['lengthSeconds'],
                    is_playable=videoInfo['isPlayable']
                )
                playList['videos'].append(videoItem)

            yield playList
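The `r1` helper comes from the project's `common` module, which the post does not include. The sketch here assumes it returns the first capture group of a regex match, or `None` when nothing matches, and runs the same extraction steps on a tiny made-up page. The body of `r1`, the function `extract_playlist`, and `SAMPLE_HTML` are hypothetical. Real YouTube pages are far larger, and their markup may have changed since the post was written, so the pattern may no longer match.

```python
import json
import re


def r1(pattern, text):
    """Assumed behaviour of the helper: first capture group, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Made-up data mimicking only the nesting that the spider relies on.
_SAMPLE_DATA = {"contents": {"twoColumnBrowseResultsRenderer": {"tabs": [
    {"tabRenderer": {"content": {"sectionListRenderer": {"contents": [
        {"itemSectionRenderer": {"contents": [
            {"playlistVideoListRenderer": {
                "playlistId": "PL_demo",
                "contents": [
                    {"playlistVideoRenderer": {
                        "videoId": "vid1",
                        "title": {"simpleText": "Episode 1"},
                    }},
                ],
            }},
        ]}},
    ]}}}},
]}}}

SAMPLE_HTML = (
    '<script>window["ytInitialData"] = '
    + json.dumps(_SAMPLE_DATA)
    + ";</script>"
)


def extract_playlist(html):
    """Return the playlist id and video ids found in html, or None."""
    raw = r1(r'window\["ytInitialData"\] = (.*?)}};', html)
    if raw is None:
        return None
    # The non-greedy group stops before the final "}};", so put "}}" back.
    data = json.loads(raw + "}}")
    tab = data["contents"]["twoColumnBrowseResultsRenderer"]["tabs"][0]
    section = tab["tabRenderer"]["content"]["sectionListRenderer"]["contents"][0]
    renderer = section["itemSectionRenderer"]["contents"][0]["playlistVideoListRenderer"]
    return {
        "playlist_id": renderer["playlistId"],
        "video_ids": [
            entry["playlistVideoRenderer"]["videoId"]
            for entry in renderer["contents"]
        ],
    }


if __name__ == "__main__":
    print(extract_playlist(SAMPLE_HTML))
    # {'playlist_id': 'PL_demo', 'video_ids': ['vid1']}
```

Returning `None` when the pattern is absent mirrors the spider's `if ytInitialData:` guard, so a page without the embedded JSON yields nothing instead of raising.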

Reposted from: http://bhmil.baihongyu.com/
