SQLite
SQLite,是一款轻型的关系数据库,单文件,比较好用,读写效率一般,建立索引后查询速度不错. 现在我们使用它来保存空气质量数据.
首先我们使用Perl脚本建立数据库(使用strawberry perl,自带DBI模块,如果使用其他的版本Perl,请自行安装DBI模块):
- createdb.pl
#!/usr/bin/perl use strict; use v5.10; use DBI; my $driver = "SQLite"; my $database = "air.db"; my $dsn = "DBI:$driver:dbname=$database"; my $userid = ""; my $password = ""; my $dbh = DBI->connect($dsn, $userid, $password, { RaiseError => 1 }) or die $DBI::errstr; print "Opened database successfully\n"; my $stmt = qq( CREATE TABLE LOCATION (AREA TEXT , POSITION_NAME TEXT , Latitude REAL , Longitude REAL , STATION_CODE text PRIMARY KEY); ); my $rv = $dbh->do($stmt); if($rv < 0){ print $DBI::errstr; } else { print "Table created successfully\n"; } my $stmt = qq(CREATE TABLE CONTENT (AQI INT , AREA TEXT , CO REAL , CO_24H REAL , NO2 INT , NO2_24H INT , O3 INT , O3_24H INT , O3_8H INT , O3_8H_24H INT , PM10 INT , PM10_24H INT , PM2_5 INT , PM2_5_24H INT , POSITION_NAME TEXT , PRIMARY_POLLUTANT TEXT , QUALITY TEXT , SO2 INT , SO2_24H INT , STATION_CODE TEXT , TIME_POINT TEXT , constraint pk_t2 primary key (STATION_CODE,TIME_POINT));); my $rv = $dbh->do($stmt); if($rv < 0){ print $DBI::errstr; } else { print "Table created successfully\n"; } $dbh->disconnect();
然后我们先将csv统一格式
- format.pl
#!perl -w use strict;#将不同格式的文件统一为固定格式 use v5.14; use utf8; my @files = glob "*.csv"; #获取所有csv文件 foreach my $file (@files) { # 对csv文件逐一处理 open FILE,"<",$file; say $file; my $outFile = $file; $outFile =~ s/.csv/Nor.csv/; open OUT,">",$outFile; say OUT '$+{aqi},$+{area},$+{co},$+{co_24h},$+{no2},$+{no2_24h},$+{o3},$+{o3_24h},$+{o3_8h},$+{o3_8h_24h},$+{pm10},$+{pm10_24h},$+{pm2_5},$+{pm2_5_24h},$+{position_name},$+{primary_pollutant},$+{quality},$+{so2},$+{so2_24h},$+{station_code},$+{time_point}'; { my $Line = readline(FILE); while(<FILE>) { my $line = $_; s/\n//; chomp $line; my @codes = split /\,/; #next if scalar @codes <10; say OUT "$codes[2],$codes[0],$codes[3],$codes[4],$codes[8],$codes[9],$codes[10],$codes[11],$codes[12],$codes[13],$codes[15],$codes[16],$codes[17],$codes[18],$codes[1],$codes[19],$codes[20],$codes[21],$codes[22],$codes[23],$codes[24]" if scalar @codes == 26; say OUT "$codes[5],$codes[0],$codes[6],$codes[7],$codes[8],$codes[9],$codes[10],$codes[11],$codes[12],$codes[13],$codes[14],$codes[15],$codes[16],$codes[17],$codes[1],$codes[24],$codes[21],$codes[18],$codes[19],$codes[2],$codes[20]" if scalar @codes == 25; say OUT $line if scalar @codes == 21; } } close OUT; close FILE; }
最后我们表中插入数据,使用如下脚本向 content 表中导入数据,因为设置了时间和站点编号为主键,所以理论上不会出现重复:
- insert.pl
#!/usr/bin/perl use strict; use v5.10; use utf8; use DBI; #文件路径 my $path = "d:/data/perl/total"; my $dbargs = {AutoCommit => 0,PrintError => 1}; #air.db为数据库名 my $db = DBI->connect("dbi:SQLite:dbname=air.db","","",$dbargs); #insert databses. my $insert_sth = $db->prepare( q{ INSERT INTO content values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)}); #DBI->trace(1,"dbitrace.log"); #插入csv文件 my @csvfiles = <$path/*.csv>; foreach my $csvfile (@csvfiles) { # body... say $csvfile; open(CSV,"<$csvfile") or warn "Can't open file:$!"; while(<CSV>) { chomp; my($aqi,$area,$co,$co_24h,$no2,$no2_24h,$o3,$o3_24h,$o3_8h,$o3_8h_24h,$pm10,$pm10_24h,$pm2_5,$pm2_5_24h,$position_name,$primary_pollutant,$quality,$so2,$so2_24h,$station_code,$time_point) = split /,/; $aqi =~ s/"//g;; $area =~ s/"//g;; $co =~ s/"//g;; $co_24h =~ s/"//g;; $no2 =~ s/"//g;; $no2_24h =~ s/"//g;; $o3 =~ s/"//g;; $o3_24h =~ s/"//g;; $o3_8h =~ s/"//g;; $o3_8h_24h =~ s/"//g;; $pm10 =~ s/"//g;; $pm10_24h =~ s/"//g;; $pm2_5 =~ s/"//g;; $pm2_5_24h =~ s/"//g;; $position_name =~ s/"//g;; $primary_pollutant =~ s/"//g;; $quality =~ s/"//g;; $so2 =~ s/"//g;; $so2_24h =~ s/"//g;; $station_code =~ s/"//g;; $time_point =~ s/"//g;; $insert_sth->bind_param(1,$aqi); $insert_sth->bind_param(2,$area); $insert_sth->bind_param(3,$co); $insert_sth->bind_param(4,$co_24h); $insert_sth->bind_param(5,$no2); $insert_sth->bind_param(6,$no2_24h); $insert_sth->bind_param(7,$o3); $insert_sth->bind_param(8,$o3_24h); $insert_sth->bind_param(9,$o3_8h); $insert_sth->bind_param(10,$o3_8h_24h); $insert_sth->bind_param(11,$pm10); $insert_sth->bind_param(12,$pm10_24h); $insert_sth->bind_param(13,$pm2_5); $insert_sth->bind_param(14,$pm2_5_24h); $insert_sth->bind_param(15,$position_name); $insert_sth->bind_param(16,$primary_pollutant); $insert_sth->bind_param(17,$quality); $insert_sth->bind_param(18,$so2); $insert_sth->bind_param(19,$so2_24h); $insert_sth->bind_param(20,$station_code); $insert_sth->bind_param(21,$time_point); $insert_sth->execute ;#or warn $db->errstr; } $db->commit ;#or warn $db->errstr; close(CSV); } $db->disconnect(); print "sucess for import csv!";
插入完成后,建立索引,使用sqlite提供的shell工具,可以在SQLite官网下载,下载Precompiled Binaries for Windows版本,或者使用修改的cmder mini ,其中自带sqlite3
注意:表中的数据都是utf8编码,如果直接输出在命令行中可能为乱码,且windows自带的cmd中查询也有可能为乱码,最好输出到文件!如果想在命令行中直接查看,可以使用上面提到的cmder或者使用msys2,安装sqlite后使用sqlite3工具查询
打开命令行,或者cmder.exe,转入文件所在目录,输入 sqlite3.exe air.db命令,可以看到如下信息:
c:\>d: d:\>cd data d:\data\>cd perl D:\data\perl>sqlite3.exe air.db SQLite version 3.8.8 2015-01-16 12:08:06 Enter ".help" for usage hints. sqlite>
我们进行查询/输出/建立索引等功能将都在此进行
下面对time_point建立索引,请在数据插入完成后建立索引,索引会影响插入速度:
--sqlite> 是命令行显示,不需要输入 --建立索引 sqlite>CREATE INDEX time_index ON content(time_point); --将结果输出到文件中,以逗号分隔(默认输出以 | 分隔) sqlite>.separator ',' --将结果输出到文件中 sqlite>.output RESULT.txt --按日期选择 sqlite>SELECT * FROM content WHERE time_point BETWEEN DATE('2013-05-22') AND DATE('2013-05-24'); --按站点选择 sqlite>SELECT * FROM content WHERE station_code = "1001A"; --按地区选择 sqlite>SELECT * FROM content WHERE area LIKE "北京%" LIMIT 10; --选择同时输出经纬度 sqlite>SELECT content.*,location.latitude,location.longitude FROM content,location WHERE content.station_code = "2011A" AND content.station_code = location.station_code;
可以按照上面的描述,自己进行查询.