Web Crawl-Python for Informatics

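Read a file, look for patterns, and extract the parts of the lines we are interested in.

So far, extracting pieces of a line relied on string methods such as split and find, plus list and string slicing.

Text search and extraction: Python's regular expression library, re, is a small programming language of its own for searching and parsing strings.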

http://en.wikipedia.org/wiki/Regular_expression

http://docs.python.org/library/re.html

1. search()

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print line

Open the file, loop over each line, and use search() to print the lines that contain "From:". line.find() could do the same job.
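To check that re.search() and str.find() pick the same lines for a plain substring, here is a minimal sketch. It is written for Python 3 (the listings in these notes are Python 2), and the sample lines are made up for illustration, not taken from mbox-short.txt.

```python
import re

# Hypothetical sample lines standing in for the contents of mbox-short.txt.
lines = [
    "From: alice@example.com",
    "Subject: hello",
    "X-Note: reply sent From: the office",
]

# re.search() returns a match object (truthy) or None.
with_search = [line for line in lines if re.search("From:", line)]

# str.find() returns the index of the substring, or -1 when it is absent.
with_find = [line for line in lines if line.find("From:") != -1]

print(with_search)
print(with_find)
```

Both lists contain the first and third lines, because "From:" may appear anywhere in the line.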

1.1 The power of re: special characters can be added to the search string to control precisely which lines match

e.g. ^ in a regular expression matches the beginning of a line

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print line

Only lines that start with "From:" are printed. The string method startswith() could do the same job.
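A small sketch of the equivalence between the ^ anchor and startswith(), written for Python 3 with made-up sample lines:

```python
import re

# Hypothetical sample lines; only the first one begins with "From:".
lines = [
    "From: alice@example.com",
    "X-Note: reply sent From: the office",
]

anchored = [line for line in lines if re.search("^From:", line)]
prefixed = [line for line in lines if line.startswith("From:")]

print(anchored)
print(prefixed)
```

The second line contains "From:" but not at the start, so neither approach selects it.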

1.2 The commonly used regular-expression character "." matches any single character

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m', line):
        print line

1.3 * and + indicate that a character can be repeated: * means zero or more times, + means one or more times

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:.+@', line):
        print line

Matches lines that start with "From:", followed by one or more characters (.+), followed by an @ sign.
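The following Python 3 sketch, using made-up input strings, shows two consequences of '.+': it needs at least one character before the @, and it stretches as far as it can, so the match runs up to the last @ on the line.

```python
import re

line = "From: alice@example.com, bob@example.org"
match = re.search("^From:.+@", line)
matched_text = match.group()
print(matched_text)   # the matched text ends at the second @

# "+" demands at least one character between "From:" and "@"; "*" does not.
short = "From:@"
plus_match = re.search("^From:.+@", short)
star_match = re.search("^From:.*@", short)
print(plus_match, star_match is not None)
```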

2. findall()

findall() extracts the matching strings and returns them as a list, one element per match.

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('\S+@\S+', line)
    if len(x) > 0:
        print x

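The pattern matches two character sequences with an @ between them. '\S' matches one non-whitespace character, and '\S+' matches as many non-whitespace characters as possible.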
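A quick way to try the pattern without the mbox file is to run it on a single string. This is a Python 3 sketch; the addresses are invented for the example.

```python
import re

line = "Mail from alice@example.com to bob@example.org today"

# Every run of non-whitespace characters that contains an @ is returned.
addresses = re.findall(r"\S+@\S+", line)
print(addresses)

# When nothing matches, findall() returns an empty list rather than None.
no_match = re.findall(r"\S+@\S+", "no address here")
print(no_match)
```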

2.1 Square brackets [] list the set of characters that are acceptable at one position

'[a-zA-Z0-9]\S*@\S*[a-zA-Z]'

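This looks for a substring that starts with a single lowercase letter, uppercase letter, or digit, followed by zero or more non-whitespace characters (\S*), then an @ sign, then zero or more non-whitespace characters,

and finally a single uppercase or lowercase letter. The match stops at the last letter it can find.

'[a-zA-Z0-9]' itself matches one non-whitespace character; * and + apply only to the single character (or bracketed set) immediately to their left.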
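To see what the bracketed version buys us, the sketch below (Python 3, with an invented input line) compares it with the plain '\S+@\S+' pattern on text where the addresses are wrapped in punctuation.

```python
import re

line = "From <alice@example.com>; cc: bob@example.org."

# The loose pattern keeps the surrounding punctuation.
loose = re.findall(r"\S+@\S+", line)

# The stricter pattern must start with a letter or digit and end with a letter.
strict = re.findall(r"[a-zA-Z0-9]\S*@\S*[a-zA-Z]", line)

print(loose)
print(strict)
```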

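Network programming

1. A simple web browser

Pretend to be a web browser: use the HyperText Transfer Protocol (HTTP) to retrieve web pages, then read and parse the page data.

Python has a built-in socket library that lets a Python program open network connections and retrieve data through sockets.

A socket is much like a file, except that it provides a two-way connection between two programs: you can both read from and write to the same socket.

If you write to one end, the socket sends the data to the application at the other end; if you read from the socket, you get the data the other application has sent. If you try to read when the other end has not sent anything, you simply wait. If the programs at both ends wait for data without sending anything, they will wait forever.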
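The two-way nature of a socket can be tried locally without any server. The sketch below (Python 3) uses socket.socketpair(), which returns two already-connected sockets; treating them as "two programs" is a simplification for illustration.

```python
import socket

# socketpair() returns two connected sockets, standing in here for the
# two programs at either end of a network connection.
left, right = socket.socketpair()

# Data written at one end can be read at the other end...
left.sendall(b"hello from left")
received_by_right = right.recv(512)

# ...and the same connection carries data in the opposite direction.
right.sendall(b"hello from right")
received_by_left = left.recv(512)

print(received_by_right, received_by_left)

left.close()
right.close()
```

Calling recv() on either socket before the other side has sent anything would block, which is the waiting behaviour described above.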

http://www.w3.org/Protocols/rfc2616/rfc2616.txt

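The HTTP specification (p. 36) gives the syntax of the GET request. To request a document from a web server, connect to the server on port 80 and send a line of the following form:

GET http://www.py4inf.com/code/romeo.txt HTTP/1.0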

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n')

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print data

mysock.close()

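Port: a number that generally indicates which application you are contacting when you make a socket connection to a server. Web traffic usually uses port 80; email traffic uses port 25.

Socket: a network connection between two applications over which each can send and receive data. The program first makes a connection to port 80 on the server www.py4inf.com; our program plays the role of the web browser.

The HTTP protocol requires us to send the GET command followed by a blank line. After sending the blank line, we write a loop that receives data from the socket in chunks of up to 512 characters and prints them, until there is no more data to read, that is, until recv() returns an empty string.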

PC:send -> socket 80 web server--www.py4inf.com--web pages ->PC:recv

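Sockets establish low-level network connections and can be used to talk to web servers, mail servers, and many other kinds of servers. Find the document that describes the protocol, then write code that sends and receives data according to it.

The protocol we use most often is HTTP, the web protocol, so Python has a dedicated library that supports retrieving documents and data over the web.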
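The receive loop (keep calling recv() until it returns an empty result) can be exercised without a real web server. In this Python 3 sketch the function name recv_all, the connected socket pair, and the canned response are all assumptions made for illustration; a real server's reply would of course differ.

```python
import socket

def recv_all(sock, chunk_size=512):
    """Read from sock until the peer closes; return all bytes received."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)
        if len(data) < 1:      # empty result: the other end has closed
            break
        chunks.append(data)
    return b"".join(chunks)

# A connected pair of sockets stands in for the browser and the web server.
client, server = socket.socketpair()

# The "browser" sends a GET request followed by a blank line.
client.sendall(b"GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n")
request = server.recv(512)

# The "server" answers with a made-up response and closes its end.
fake_response = b"HTTP/1.0 200 OK\r\n\r\n" + b"x" * 2000
server.sendall(fake_response)
server.close()

response = recv_all(client)
client.close()
print(len(response))
```

Collecting the chunks in a list and joining them once avoids repeatedly copying a growing string, a small design choice that matters more for large downloads.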

2. Retrieving an image over HTTP

Accumulate the data in a string, trim off the headers, and save the image data to a file.

import socket
import time

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/cover.jpg HTTP/1.0\n\n')

count = 0
picture = ""
while True:
    data = mysock.recv(5120)
    if len(data) < 1: break
    # time.sleep(0.25)
    count = count + len(data)
    print len(data), count
    picture = picture + data

mysock.close()

# look for the end of the header (2 CRLF)
pos = picture.find("\r\n\r\n")
print 'Header length', pos
print picture[:pos]

# skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")
fhand.write(picture)
fhand.close()

output

$ python urljpeg.py
2920 2920
1460 4380
1460 5840
1460 7300
...
1460 62780
1460 64240
2920 67160
1460 68620
1681 70301
Header length 240
HTTP/1.1 200 OK
Date: Sat, 02 Nov 2013 02:15:07 GMT
Server: Apache
Last-Modified: Sat, 02 Nov 2013 02:01:26 GMT
ETag: "19c141-111a9-4ea280f8354b8"
Accept-Ranges: bytes
Content-Length: 70057
Connection: close
Content-Type: image/jpeg

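Each time we call recv() we get more data from the server over the network, 1460 or 2920 characters at a time in this run, even though we ask for up to 5120 characters.

Results vary with network speed. The last call to recv() returns 1681 characters, which is the end of the stream,

and the next call to recv() returns a zero-length string: the server has called close() on its end of the socket and there is no more data to come.

Uncommenting the time.sleep() call slows down our successive recv() calls: we wait a quarter of a second after each one, so the server can get ahead of us and send more data before we call recv() again.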

$ python urljpeg.py
1460 1460
5120 6580
5120 11700
...
5120 62900
5120 68020
2281 70301
Header length 240
HTTP/1.1 200 OK
Date: Sat, 02 Nov 2013 02:22:04 GMT
Server: Apache
Last-Modified: Sat, 02 Nov 2013 02:01:26 GMT
ETag: "19c141-111a9-4ea280f8354b8"
Accept-Ranges: bytes
Content-Length: 70057
Connection: close
Content-Type: image/jpeg

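There is a buffer between the server making send() requests and our program making recv() requests.

When the program runs with the delay, at some point the server may fill the buffer in the socket and be forced to pause until our program starts to empty it.

This pausing of either the sending or the receiving application is called flow control.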
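The header-trimming step of the image program (find the first blank line, then slice past it) can be tried on its own. In this Python 3 sketch the function name split_response and the sample bytes are invented for illustration; unlike the listing, it also checks that the end-of-header marker was actually found.

```python
def split_response(raw):
    """Split raw HTTP response bytes into (header_text, body_bytes)."""
    pos = raw.find(b"\r\n\r\n")      # headers end at the first blank line
    if pos == -1:
        raise ValueError("no end-of-header marker found")
    header = raw[:pos].decode("iso-8859-1")
    body = raw[pos + 4:]             # skip the four marker bytes
    return header, body

# Made-up response: two header lines, then a few bytes of "image" data.
raw = b"HTTP/1.1 200 OK\r\nContent-Type: image/jpeg\r\n\r\n\xff\xd8\xff\xe0"
header, body = split_response(raw)

print(header.split("\r\n"))
print(len(body))
```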

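3. Retrieving web pages with urllib

Compared with sending and receiving data by hand through the socket library, urllib is much simpler. Once a page has been opened with urlopen, it can be treated like a file: you only say which page to retrieve, and urllib handles all of the HTTP protocol and header details.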

import urllib
fhand = urllib.urlopen('http:...')
for line in fhand:
    print line.strip()

When the program runs, we see only the contents of the file in the output.

import urllib

counts = dict()
fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')
for line in fhand:
    words = line.split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print counts
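The counting logic does not depend on the network, so it can be checked offline. In this Python 3 sketch an io.BytesIO object stands in for the file-like object returned by urlopen, and the text is a made-up sample, not the contents of romeo.txt.

```python
import io

# io.BytesIO stands in for the file-like response object; like a real
# response in Python 3, iterating over it yields bytes lines.
fhand = io.BytesIO(b"the quick fox\nthe lazy dog\nthe end\n")

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        # get() returns 0 the first time a word is seen
        counts[word] = counts.get(word, 0) + 1

print(counts)
```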

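4. Parsing HTML and web scraping

A common use of urllib is web scraping: writing a program that pretends to be a web browser, retrieves pages, and searches those pages for patterns.

A search engine such as Google looks at the source of a web page, extracts the links to other pages, retrieves those pages, extracts their links,

and so on. In this way the Google crawler works its way through nearly all of the pages on the web.

One way to parse HTML is to search it repeatedly with regular expressions, extracting the substrings that match a particular pattern.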

Sample web page:

<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>

href="http://.+?"

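This pattern looks for strings that start with href="http://, followed by one or more characters (.+?), followed by a closing double quote. The ? after the + makes the match non-greedy.

A non-greedy match tries to find the smallest possible matching string; a greedy match tries to find the largest possible one.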
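The difference between greedy and non-greedy matching shows up as soon as a line holds more than one link. This is a Python 3 sketch; the HTML fragment and its example URLs are invented.

```python
import re

html = ('<a href="http://a.example/1">one</a> '
        '<a href="http://b.example/2">two</a>')

# Greedy: ".+" runs to the LAST double quote on the line, giving one long match.
greedy = re.findall('href="http://.+"', html)

# Non-greedy: ".+?" stops at the FIRST double quote it can, one match per link.
non_greedy = re.findall('href="http://.+?"', html)

print(len(greedy), greedy)
print(len(non_greedy), non_greedy)
```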

import urllib
import re

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
# the parentheses select the part of each match that findall() returns
links = re.findall('href="(http://.*?)"', html)
for link in links:
    print link

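findall() returns a list of all the strings that match the regular expression; here it returns only the link text between the double quotes.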

python urlregex.py

Enter - http://www.dr-chuck.com/page1.htm

http://www.dr-chuck.com/page2.htm

python urlregex.py

Enter - http://www.py4inf.com/book.htm

http://www.greenteapress.com/thinkpython/thinkpython.html

http://allendowney.com/

http://www.py4inf.com/code

http://www.lib.umich.edu/espresso-book-machine

http://www.py4inf.com/py4inf-slides.zip
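When a pattern contains one parenthesized group, findall() returns only the text matched by that group, which is how a program can get back just the URL between the quotes. A Python 3 sketch on a fixed string, so no network access is needed:

```python
import re

html = '<p><a href="http://www.dr-chuck.com/page2.htm">\nSecond Page</a>.</p>'

# Without parentheses, the whole match (including href="...") comes back.
whole = re.findall('href="http://.*?"', html)

# With parentheses, only the grouped part comes back.
inner = re.findall('href="(http://.*?)"', html)

print(whole)
print(inner)
```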

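Because there are many broken HTML pages out there, an approach based only on regular expressions may miss some valid links or end up with bad data.

HTML parsing libraries: one choice in Python is BeautifulSoup, a library for parsing HTML documents and extracting data from them. It copes with most of the flawed HTML that browsers silently tolerate.

The code can be downloaded from http://www.crummy.com

HTML looks a lot like XML, and some pages are carefully constructed to be valid XML.

In general, though, most HTML is malformed enough that an XML parser rejects the whole page and parsing fails.

BeautifulSoup tolerates flawed HTML and still lets you extract the data you need.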

import urllib
from BeautifulSoup import *

url1 = raw_input('Enter - ')
html = urllib.urlopen(url1).read()
soup = BeautifulSoup(html)

# retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print tag.get('href', None)

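The program prompts for a web address, opens the page, reads the data and passes it to the BeautifulSoup parser, then retrieves all of the anchor tags and prints the href attribute of each one.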

python urllinks.py

Enter - http://www.dr-chuck.com/page1.htm

http://www.dr-chuck.com/page2.htm

python urllinks.py

Enter - http://www.py4inf.com/book.htm

http://www.greenteapress.com/thinkpython/thinkpython.html

http://allendowney.com/

http://www.si502.com/

http://www.lib.umich.edu/espresso-book-machine

http://www.py4inf.com/code

http://www.pythonlearn.com/
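For readers without BeautifulSoup installed, the same idea (collect the href of every anchor tag, even in sloppy HTML) can be sketched with the standard library's html.parser module in Python 3. This is a swapped-in technique, not the BeautifulSoup API; the class name LinkCollector and the page3.htm link are made up for the example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self.links.append(dict(attrs).get("href"))

# Deliberately sloppy HTML: unclosed <p> tags and a missing </a> at the end.
html = """<h1>The First Page</h1>
<p>If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.
<p><a href="page3.htm">Third"""

collector = LinkCollector()
collector.feed(html)
collector.close()
print(collector.links)
```

The parser reports each start tag as it is encountered, so the missing closing tags do not prevent the links from being found.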

BeautifulSoup (http://www.crummy.com) can also pull out the different parts of each tag:

import urllib
from BeautifulSoup import *

url1 = raw_input('Enter - ')
html = urllib.urlopen(url1).read()
soup = BeautifulSoup(html)

# retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # look at the parts of a tag
    print 'TAG:', tag
    print 'URL:', tag.get('href', None)
    print 'Content:', tag.contents[0]
    print 'Attrs:', tag.attrs

output:

Enter - http://www.dr-chuck.com/page1.htm

TAG: <a href="http://www.dr-chuck.com/page2.htm">

Second Page</a>

URL: http://www.dr-chuck.com/page2.htm

Content: [u'\nSecond Page']

Attrs: [(u'href', u'http://www.dr-chuck.com/page2.htm')]

Reading binary files with urllib

For images or video, urllib can save the file that a URL points to onto the local disk.

import urllib

img = urllib.urlopen("http://www.py4inf.com/cover.jpg").read()
# open in binary mode so the image bytes are written unchanged
fhand = open('cover.jpg', 'wb')
fhand.write(img)
fhand.close()

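The program reads all of the data across the network into the variable img in memory, then opens the file cover.jpg and writes the data to disk.

For a large audio or video file this could exhaust memory. Instead, retrieve the data in blocks (buffers) and write each block to disk before retrieving the next one.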

import urllib

img = urllib.urlopen('http://www.py4inf.com/cover.jpg')
fhand = open('cover.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1: break
    size = size + len(info)
    fhand.write(info)

print size, 'characters copied'
fhand.close()

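The program reads up to 100000 characters at a time and writes them to the file cover.jpg before retrieving the next 100000 characters from the network.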

output

568248 characters copied
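The block-by-block copy can be tested without downloading anything. In this Python 3 sketch the function name copy_in_chunks is hypothetical, and two io.BytesIO objects stand in for the network response and the output file; the data is just filler bytes.

```python
import io

def copy_in_chunks(source, dest, chunk_size=100000):
    """Copy source to dest one block at a time; return the bytes copied."""
    size = 0
    while True:
        info = source.read(chunk_size)
        if len(info) < 1:          # nothing left to read
            break
        size = size + len(info)
        dest.write(info)           # write this block before reading the next
    return size

# In-memory stand-ins for the urlopen() response and the file on disk.
source = io.BytesIO(b"\x00" * 250000)
dest = io.BytesIO()

copied = copy_in_chunks(source, dest)
print(copied, "bytes copied")
```

Only one block is held in memory at a time, so the memory needed depends on chunk_size rather than on the size of the file.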

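On Unix or Mac, a command built into the operating system performs this operation:

curl -O http://www.py4inf.com/cover.jpg

curl is short for "copy URL".

At http://www.py4inf.com/code: curl1.py, curl2.py; curl3.py reads and writes the binary file more efficiently.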
