SQLite

SQLite,是一款轻型的关系数据库,单文件,比较好用,读写效率一般,建立索引后查询速度不错. 现在我们使用它来保存空气质量数据.

首先我们使用Perl脚本建立数据库(使用strawberry perl,自带DBI模块,如果使用其他的版本Perl,请自行安装DBI模块):

createdb.pl
#!/usr/bin/perl
use strict;
use v5.10;
use DBI;
my $driver   = "SQLite";
my $database = "air.db";
my $dsn = "DBI:$driver:dbname=$database";
my $userid = "";
my $password = "";
my $dbh = DBI->connect($dsn, $userid, $password, { RaiseError => 1 })
                      or die $DBI::errstr;
print "Opened database successfully\n";
 
my $stmt = qq(
CREATE TABLE LOCATION
      (AREA              TEXT   ,
POSITION_NAME      TEXT   ,
Latitude          REAL   ,
Longitude         REAL   ,
STATION_CODE       text PRIMARY KEY);
);
my $rv = $dbh->do($stmt);
if($rv < 0){
   print $DBI::errstr;
} else {
   print "Table created successfully\n";
}
 
my $stmt = qq(CREATE TABLE CONTENT
      (AQI                  INT  ,
AREA                 TEXT  ,
CO                   REAL  ,
CO_24H               REAL  ,
NO2                  INT  ,
NO2_24H              INT  ,
O3                   INT  ,
O3_24H               INT  ,
O3_8H                INT  ,
O3_8H_24H            INT  ,
PM10                 INT  ,
PM10_24H             INT  ,
PM2_5                INT  ,
PM2_5_24H            INT  ,
POSITION_NAME        TEXT  ,
PRIMARY_POLLUTANT    TEXT  ,
QUALITY              TEXT  ,
SO2                  INT  ,
SO2_24H              INT  ,
STATION_CODE         TEXT  ,
TIME_POINT           TEXT  ,
constraint pk_t2 primary key (STATION_CODE,TIME_POINT)););
 
my $rv = $dbh->do($stmt);
if($rv < 0){
   print $DBI::errstr;
} else {
   print "Table created successfully\n";
}
 
$dbh->disconnect();

然后我们先将csv统一格式

format.pl
#!perl -w
use strict;#将不同格式的文件统一为固定格式
use v5.14;
use utf8;
my @files = glob "*.csv"; #获取所有csv文件
 
 
foreach my $file (@files) {
	# 对csv文件逐一处理
	open FILE,"<",$file;
	say $file;
	my $outFile = $file;
	$outFile =~ s/.csv/Nor.csv/;
	open OUT,">",$outFile;
 
	say OUT '$+{aqi},$+{area},$+{co},$+{co_24h},$+{no2},$+{no2_24h},$+{o3},$+{o3_24h},$+{o3_8h},$+{o3_8h_24h},$+{pm10},$+{pm10_24h},$+{pm2_5},$+{pm2_5_24h},$+{position_name},$+{primary_pollutant},$+{quality},$+{so2},$+{so2_24h},$+{station_code},$+{time_point}';
	{
	  my $Line = readline(FILE);
		while(<FILE>)
		{
		  my $line = $_;
		  s/\n//;
		  chomp $line;
			my @codes = split /\,/;
			#next if scalar @codes <10;
			say OUT "$codes[2],$codes[0],$codes[3],$codes[4],$codes[8],$codes[9],$codes[10],$codes[11],$codes[12],$codes[13],$codes[15],$codes[16],$codes[17],$codes[18],$codes[1],$codes[19],$codes[20],$codes[21],$codes[22],$codes[23],$codes[24]" if scalar @codes == 26;
			say OUT "$codes[5],$codes[0],$codes[6],$codes[7],$codes[8],$codes[9],$codes[10],$codes[11],$codes[12],$codes[13],$codes[14],$codes[15],$codes[16],$codes[17],$codes[1],$codes[24],$codes[21],$codes[18],$codes[19],$codes[2],$codes[20]" if scalar @codes == 25;
			say OUT $line if scalar @codes == 21;
		}
	}
	close OUT;
	close FILE;
}

最后我们表中插入数据,使用如下脚本向 content 表中导入数据,因为设置了时间和站点编号为主键,所以理论上不会出现重复:

insert.pl
#!/usr/bin/perl
use strict;
use v5.10;
use utf8;
use DBI;
#文件路径
my $path = "d:/data/perl/total";
my $dbargs = {AutoCommit => 0,PrintError => 1};
#air.db为数据库名
my $db = DBI->connect("dbi:SQLite:dbname=air.db","","",$dbargs);
#insert databses.
my $insert_sth =  $db->prepare( q{ INSERT INTO content values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)});
#DBI->trace(1,"dbitrace.log");
#插入csv文件
my @csvfiles = <$path/*.csv>;
foreach my $csvfile (@csvfiles) {
  # body...
  say $csvfile;
  open(CSV,"<$csvfile") or warn "Can't open file:$!";
  while(<CSV>)
  {
      chomp;
      my($aqi,$area,$co,$co_24h,$no2,$no2_24h,$o3,$o3_24h,$o3_8h,$o3_8h_24h,$pm10,$pm10_24h,$pm2_5,$pm2_5_24h,$position_name,$primary_pollutant,$quality,$so2,$so2_24h,$station_code,$time_point) = split /,/;
      $aqi  =~ s/"//g;;
      $area  =~ s/"//g;;
      $co  =~ s/"//g;;
      $co_24h  =~ s/"//g;;
      $no2  =~ s/"//g;;
      $no2_24h  =~ s/"//g;;
      $o3  =~ s/"//g;;
      $o3_24h  =~ s/"//g;;
      $o3_8h  =~ s/"//g;;
      $o3_8h_24h  =~ s/"//g;;
      $pm10  =~ s/"//g;;
      $pm10_24h  =~ s/"//g;;
      $pm2_5  =~ s/"//g;;
      $pm2_5_24h  =~ s/"//g;;
      $position_name  =~ s/"//g;;
      $primary_pollutant  =~ s/"//g;;
      $quality  =~ s/"//g;;
      $so2  =~ s/"//g;;
      $so2_24h  =~ s/"//g;;
      $station_code  =~ s/"//g;;
      $time_point  =~ s/"//g;;
      $insert_sth->bind_param(1,$aqi);
      $insert_sth->bind_param(2,$area);
      $insert_sth->bind_param(3,$co);
      $insert_sth->bind_param(4,$co_24h);
      $insert_sth->bind_param(5,$no2);
      $insert_sth->bind_param(6,$no2_24h);
      $insert_sth->bind_param(7,$o3);
      $insert_sth->bind_param(8,$o3_24h);
      $insert_sth->bind_param(9,$o3_8h);
      $insert_sth->bind_param(10,$o3_8h_24h);
      $insert_sth->bind_param(11,$pm10);
      $insert_sth->bind_param(12,$pm10_24h);
      $insert_sth->bind_param(13,$pm2_5);
      $insert_sth->bind_param(14,$pm2_5_24h);
      $insert_sth->bind_param(15,$position_name);
      $insert_sth->bind_param(16,$primary_pollutant);
      $insert_sth->bind_param(17,$quality);
      $insert_sth->bind_param(18,$so2);
      $insert_sth->bind_param(19,$so2_24h);
      $insert_sth->bind_param(20,$station_code);
      $insert_sth->bind_param(21,$time_point);
      $insert_sth->execute ;#or warn $db->errstr;
	}
	$db->commit ;#or warn $db->errstr;
	close(CSV);
}
$db->disconnect();
print "sucess for import csv!";

插入完成后,建立索引,使用sqlite提供的shell工具,可以在SQLite官网下载,下载Precompiled Binaries for Windows版本,或者使用修改的cmder mini ,其中自带sqlite3

注意:表中的数据都是utf8编码,如果直接输出在命令行中可能为乱码,且windows自带的cmd中查询也有可能为乱码,最好输出到文件!如果想在命令行中直接查看,可以使用上面提到的cmder或者使用msys2,安装sqlite后使用sqlite3工具查询

打开命令行,或者cmder.exe,转入文件所在目录,输入 sqlite3.exe air.db命令,可以看到如下信息:

c:\>d:
d:\>cd data
d:\data\>cd perl
D:\data\perl>sqlite3.exe air.db
SQLite version 3.8.8 2015-01-16 12:08:06
Enter ".help" for usage hints.
sqlite>

我们进行查询/输出/建立索引等功能将都在此进行

下面对time_point建立索引,请在数据插入完成后建立索引,索引会影响插入速度:

--sqlite> 是命令行显示,不需要输入
--建立索引
sqlite>CREATE INDEX time_index ON content(time_point);
--将结果输出到文件中,以逗号分隔(默认输出以 | 分隔)
sqlite>.separator ','
--将结果输出到文件中
sqlite>.output RESULT.txt
--按日期选择
sqlite>SELECT * FROM content WHERE time_point BETWEEN DATE('2013-05-22') AND DATE('2013-05-24');
--按站点选择
sqlite>SELECT * FROM content WHERE station_code = "1001A";
--按地区选择
sqlite>SELECT * FROM content WHERE area LIKE "北京%" LIMIT 10;
--选择同时输出经纬度
sqlite>SELECT content.*,location.latitude,location.longitude FROM content,location WHERE content.station_code = "2011A" AND content.station_code = location.station_code;

可以按照上面的描述,自己进行查询.