Posts

Showing posts from April 23, 2017

Web Crawl-Python for Informatics

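Reading a file, looking for a pattern, and extracting the parts of a line that are of interest. Lines can be pulled apart with the string methods split and find, together with list and string slicing. For text search and extraction, Python provides the regular expression library re; regular expressions are in effect a small programming language for searching and parsing strings.

http://en.wikipedia.org/wiki/Regular_expression
http://docs.python.org/library/re.html

1. search()

    import re
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if re.search('From:', line):
            print(line)

This opens the file, loops over each line, and uses search() to print the lines that contain "From:". line.find() can do the same.

1.1 The power of re is that special characters can be added to the search string to make the match more precise. For example, ^ in a regular expression matches the beginning of a line.

    import re
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if re.search('^From:', line):
            print(line)

Only lines that begin with "From:" are printed. The string method startswith() can do the same.

1.2 A commonly used regular expression character is ".", which matches any character.

    import re
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if re.search('^F..m', line):
            print(line)

1.3 * and + indicate that a character may be repeated any number of times: * means zero or more times, + means one or more times. impo...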
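The patterns in this excerpt can be tried without the mbox-short.txt file. The following is a minimal sketch that runs re.search over a few made-up lines; the sample lines, the matching() helper, and the final X- header pattern are illustrative assumptions, not part of the original post.

```python
import re

# Hypothetical sample lines standing in for the contents of mbox-short.txt.
lines = [
    "From: alice@example.com",
    "X-From: relay@example.com",
    "Fxxm: not a real header",
    "Subject: hello",
]


def matching(pattern, lines):
    """Return the lines in which re.search finds the pattern."""
    return [line for line in lines if re.search(pattern, line)]


contains_from = matching("From:", lines)        # "From:" anywhere in the line
starts_with_from = matching("^From:", lines)    # "From:" only at the start
wildcard = matching("^F..m", lines)             # F, any two characters, then m
one_or_more = matching(r"^X-\S+:", lines)       # X-, one or more non-blanks, colon

for line in starts_with_from:
    print(line)
```

The unanchored pattern also picks up the X-From: line, which is why anchoring with ^ matters when only real From: headers are wanted.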

NoSQL - MongoDB

Introduction: for this part of the content, we can get more practice with mongo on EC2.
– https://docs.mongodb.com/v3.3/tutorial/iterate-a-cursor/
– https://docs.mongodb.com/v3.4/reference/method/js-cursor/

Key points:
– Managing JSON documents
– Key concepts: document, collection, primary key (_id)
– Query language: insert, find, update, remove, aggregate
– Sharding

Document store
MongoDB is a document database. A document is similar to a JSON object: it consists of field-value pairs, and a value may be another document, an array, a string, or a number. A document corresponds to a record/row in an RDBMS.

Databases
There is no need to create a database explicitly; just use it. It is created automatically once a collection (i.e. a table) is added to it.

    use inf551
    show databases
    db.createCollection('person')

Collections
Documents are stored in a...
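To make the document idea concrete, here is a small sketch that uses plain Python dictionaries and the standard json module. It does not talk to MongoDB: the person fields and the find() helper are hypothetical stand-ins that only mimic the shape of a document and of an equality query.

```python
import json

# Hypothetical document for a 'person' collection; all field names are made up.
person = {
    "_id": 1,                                            # primary key
    "name": "Ann",                                       # string value
    "age": 30,                                           # number value
    "address": {"city": "Los Angeles", "zip": "90007"},  # nested document
    "hobbies": ["hiking", "chess"],                      # array value
}

# A document round-trips through JSON text unchanged.
as_json = json.dumps(person, sort_keys=True)
restored = json.loads(as_json)

# A toy "collection" is just a list of documents with different shapes.
people = [person, {"_id": 2, "name": "Bob"}]


def find(collection, query):
    """Toy equality match on top-level fields; not the MongoDB API."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in query.items())
    ]


print(find(people, {"name": "Bob"}))
```

Note that the two documents in the list do not share the same fields, which is the flexibility a document store allows and a fixed relational schema does not.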

NoSQL - Cassandra & CQL

Cassandra: an extensible record (wide column) store

General idea...
• NoSQL
  – Different types
  – Scale up vs. scale out
• Key features
  – Flexible data model
  – High availability & scalability
• Amazon DynamoDB
  – Data model, partition & sort key
  – Data types (string, number, set, map, list)
  – Consistent hashing
• Apache Cassandra
  – Write & read path
  – Upsert
  – Minor & major compaction
• Apache Hive
  – HiveQL: SQL-like language
  – Analyze data stored in HDFS
  – Queries compiled into MapReduce jobs
• Cassandra & DynamoDB
  – Key-based (~ OLTP)
  – Processing a small amount of data per query
• Hive
  – Analytical workload (~ OLAP)
  – A query may need to process terabytes of data

Model: Cassandra uses the data model of BigTable, which was designed by Google. Unlike row-oriented relational databases or key-value stores, Cassandra is a wide column store: each row is uniquely identified by a row key and can hold up to 2 billion columns, each column being defined by column ...
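The wide-column model and the upsert behaviour listed above can be pictured with nested dictionaries. This is only an in-memory sketch, not Cassandra or CQL: the table layout, the upsert() helper, and the row keys are assumptions made for illustration.

```python
# Toy wide-column table: row key -> {column name -> value}.
table = {}


def upsert(table, row_key, columns):
    """Create the row if it is absent, otherwise overwrite only the given columns."""
    table.setdefault(row_key, {}).update(columns)


# The first write creates the row.
upsert(table, "user:1", {"name": "Ann", "city": "LA"})
# The second write to the same row key updates one column and adds another.
upsert(table, "user:1", {"city": "NYC", "email": "ann@example.com"})
# Rows are free to carry different sets of columns.
upsert(table, "user:2", {"name": "Bob"})

for row_key, columns in sorted(table.items()):
    print(row_key, columns)
```

Because a write never checks whether the row already exists, the same operation serves as both insert and update, which is what the term upsert refers to.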