Stepping

Posts

Currently showing posts from April 23, 2017

Web Crawl-Python for Informatics

April 29, 2017

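Reading a file, looking for patterns, and extracting the parts of each line we care about. Text can be pulled out of lines with the string methods split and find, together with list and string slicing. For text search and extraction, Python also provides a regular expression library: a small programming language for searching and parsing strings.

http://en.wikipedia.org/wiki/Regular_expression
http://docs.python.org/library/re.html

1. search()

    import re
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if re.search('From:', line):
            print(line)

This opens the file, loops over each line, and uses search() to print the lines that contain "From:". line.find() can do the same.

1.1 The power of re is that special characters can be added to the search string to control exactly which lines match. For example, ^ in a regular expression matches the beginning of a line.

    import re
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if re.search('^From:', line):
            print(line)

This prints only the lines that start with "From:". The string method startswith() can do the same.

1.2 A commonly used regular expression character is ".", which matches any character.

    import re
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if re.search('^F..m', line):
            print(line)

1.3 * and + indicate that a character may be repeated: * means zero or more times, + means one or more times.

impo...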
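As a quick check of the patterns in this excerpt, here is a minimal sketch that runs them against a few in-memory lines instead of mbox-short.txt. The sample lines and the helper name `matching` are made up for illustration; only `re.search`, `str.find`, and `str.startswith` come from the post.

```python
import re

# Hypothetical sample lines standing in for the contents of mbox-short.txt.
sample_lines = [
    "From: alice@example.com",
    "Reply-To: From: bob@example.com",
    "Form letter attached",
    "Subject: hello",
]

def matching(pattern, lines):
    """Return the lines in which re.search finds the pattern (helper for this sketch)."""
    return [line for line in lines if re.search(pattern, line)]

# 'From:' anywhere in the line.
contains_from = matching("From:", sample_lines)

# '^' anchors the match to the start of the line.
starts_with_from = matching("^From:", sample_lines)

# '.' matches any single character, so '^F..m' accepts both "From" and "Form".
wildcard = matching("^F..m", sample_lines)

# The string-method equivalents mentioned in the post.
find_from = [line for line in sample_lines if line.find("From:") != -1]
startswith_from = [line for line in sample_lines if line.startswith("From:")]

for line in wildcard:
    print(line)
```

The regular expression versions and the string-method versions select the same lines here; the regular expression form starts to pay off once the pattern needs wildcards such as "." that find() and startswith() cannot express.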


NoSQL - MongoDB

April 26, 2017

Introduction: for this part of the content, we can get more practice with MongoDB on EC2.
– https://docs.mongodb.com/v3.3/tutorial/iterate-a-cursor/
– https://docs.mongodb.com/v3.4/reference/method/js-cursor/

The points of abstraction:
– Manage JSON documents
– Key concepts: document, collection, primary key (_id)
– Query language: insert, find, update, remove, aggregate
– Sharding

Document store
– MongoDB is a document database
– A document is similar to a JSON object
– Consists of field-value pairs
– A value may be another document, an array, a string, or a number
– Document = record/row in an RDBMS

Databases
– No need to create a database explicitly; just use it
– It is created automatically once a collection (i.e. table) is added to it

    use inf551
    show databases
    db.createCollection('person')

Collections
Documents are stored in a...
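To make the document model concrete, here is a minimal sketch in plain Python that mirrors the concepts listed above: a document as field-value pairs, a nested document, an array, and `_id` as the primary key. This is not the MongoDB API; the `insert` and `find` helpers, the field names, and the sample values are all assumptions made up for illustration.

```python
import json

# A document is a set of field-value pairs, much like a JSON object.
person = {
    "_id": 1,                            # primary key
    "name": "Alice",                     # string value
    "age": 30,                           # number value
    "address": {"city": "Los Angeles"},  # value that is another document
    "scores": [90, 85],                  # value that is an array
}

# A collection, modelled here as a dict keyed by _id (illustration only).
collection = {}

def insert(coll, doc):
    """Add a document, refusing a duplicate primary key."""
    if doc["_id"] in coll:
        raise KeyError("duplicate _id: %r" % doc["_id"])
    coll[doc["_id"]] = doc

def find(coll, **criteria):
    """Return the documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in coll.values()
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

insert(collection, person)
print(json.dumps(find(collection, name="Alice"), indent=2))
```

The dict-of-dicts layout is only a teaching aid: it shows why a document maps naturally to a record or row, while leaving out everything a real server adds, such as indexes, cursors, and sharding.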


NoSQL - Cassandra & CQL

April 26, 2017

Cassandra: an extensible record (wide column) store

General idea...
• NoSQL
– Different types
– Scale up vs. scale out
• Key features
– Flexible data model
– High availability & scalability
• Amazon DynamoDB
– Data model, partition & sort key
– Data types (string, number, set, map, list)
– Consistent hashing
• Apache Cassandra
– Write & read path
– Upsert
– Minor & major compaction
• Apache Hive
– HiveQL: SQL-like language
– Analyze data stored in HDFS
– Queries compiled into MapReduce jobs
• Cassandra & DynamoDB
– Key-based (~ OLTP)
– Processing a small amount of data per query
• Hive
– Analytical workload (~ OLAP)
– A query may need to process terabytes of data

Data model: Cassandra uses the BigTable data model designed by Google. Unlike a row-oriented relational database or a key-value store, Cassandra is a wide column store: each row of data is uniquely identified by a row key and can hold up to 2 billion columns, and each column is defined by a column ...
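The wide-column idea and the upsert behaviour listed above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not Cassandra's storage engine or CQL: a table is modelled as a dict from row key to a dict of columns, each column is reduced to a name and a value, and the `upsert` helper and sample rows are hypothetical.

```python
# A table modelled as: row key -> {column name -> value} (illustration only).
table = {}

def upsert(table, row_key, columns):
    """Insert the row if the key is new, otherwise merge columns into the existing row."""
    table.setdefault(row_key, {}).update(columns)

# Rows are identified by their row key and need not share the same columns.
upsert(table, "user1", {"name": "Alice"})
upsert(table, "user2", {"name": "Bob", "city": "LA"})

# Writing to an existing row key adds new columns and overwrites existing ones.
upsert(table, "user1", {"email": "alice@example.com"})
upsert(table, "user1", {"name": "Alice B."})

for row_key, columns in sorted(table.items()):
    print(row_key, columns)
```

Because insert and update collapse into the same write, the caller never checks whether the row already exists, which is the point of the upsert item in the outline.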
