Python数据分析02-Excel文件处理

Python处理Excel的包

在PYthon中处理Excel文件(扩展名为.xls.xlsx的文件)的扩展包为xlrdxlwt,这两个包对Excel中的日期型数据支持非常好。而在Anaconda Python中,已经含有了这两个包,不用额外安装。如果是最原始版本的Python,则需要下载这两个包。

书中关于这两个包的介绍并不多,因此一些参数只是告诉你了要么用,至于这个参数具体是做什么的,这个包的某些函数还有没有其它的参数,作者并没有多讲,因此可以检索一下xlrdxlwt这两个包的文档。

xlrd文档

xlwt文档

练习数据下载

《Python数据分析基础》这本书中的源数据和源代码已经上传至Github上,自己可以下载下来。

下载后的Excel练习数据一共是3个,名称分别为:sales_2013sales_2014sales_2015。其中以sales_2013为例说明一下,这个文件中一共有3个Excel表格,名称分别为january_2013february_2013march_2013

Excel与csv文件的不同之处在于,一个Excel文件(后文称为一个Exce工作薄)含有多个工作表(sheet)。通过工作簿的内省可以获取工作表的信息:

Python的内省(Introspection)是一种自我检查行为。在编程过程中,自省能够获取对象的类型与属性。

注:在这篇笔记中,工作簿是指一个Excel文件,而一个工作簿中则中含有多个工作表。

获取工作表的信息

下面的代码命名为1excel_introspect_workbook.py,它的功能在于获取一个Excel文件中工作表的数量、名称和每个工作表中行与列的数量,如下所示:

1
2
3
4
5
6
7
8
9
10
#!/usr/bin/env python3
import sys
from xlrd import open_workbook
input_file = sys.argv[1]
workbook = open_workbook(input_file)
print('Number of worksheets:', workbook.nsheets)
for worksheet in workbook.sheets():
print("Worksheet name:", worksheet.name, '\tRows:', \
worksheet.nrows, '\tColumns:', worksheet.ncols)

运行如下所示:

1
2
3
4
5
C:\Users\20161111\Desktop>python 1excel_introspect_workbook.py sales_2013.xlsx
Number of worksheets: 3
Worksheet name: january_2013 Rows: 7 Columns: 5
Worksheet name: february_2013 Rows: 7 Columns: 5
Worksheet name: march_2013 Rows: 7 Columns: 5

解释代码:

  1. 使用open_workbook函数打开了一个Excel主文件,并赋给了一个名为workbook的对象,这个对象中包含了工作簿中的所有可用的信息,可以查看一下,如下所示:
1
2
>>> dir(workbook)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_all_sheets_count', '_all_sheets_map', '_externsheet_info', '_externsheet_type_b57', '_extnsht_count', '_extnsht_name_from_num', '_repr_these', '_resources_released', '_rich_text_runlist_map', '_sh_abs_posn', '_sharedstrings', '_sheet_list', '_sheet_names', '_sheet_num_from_name', '_sheet_visibility', '_sheethdr_count', '_supbook_addins_inx', '_supbook_count', '_supbook_locals_inx', '_supbook_types', '_xf_epilogue_done', '_xf_index_to_xl_type_map', 'actualfmtcount', 'addin_func_names', 'biff2_8_load', 'biff_version', 'builtinfmtcount', 'codepage', 'colour_map', 'countries', 'datemode', 'derive_encoding', 'dump', 'encoding', 'fake_globals_get_sheet', 'filestr', 'font_list', 'format_list', 'format_map', 'formatting_info', 'get2bytes', 'get_record_parts', 'get_record_parts_conditional', 'get_sheet', 'get_sheets', 'getbof', 'handle_boundsheet', 'handle_builtinfmtcount', 'handle_codepage', 'handle_country', 'handle_datemode', 'handle_externname', 'handle_externsheet', 'handle_filepass', 'handle_name', 'handle_obj', 'handle_sheethdr', 'handle_sheetsoffset', 'handle_sst', 'handle_supbook', 'handle_writeaccess', 'initialise_format_info', 'load_time_stage_1', 'load_time_stage_2', 'logfile', 'mem', 'name_and_scope_map', 'name_map', 'name_obj_list', 'names_epilogue', 'nsheets', 'on_demand', 'palette_record', 'parse_globals', 'props', 'ragged_rows', 'raw_user_name', 'read', 'release_resources', 'sheet_by_index', 'sheet_by_name', 'sheet_loaded', 'sheet_names', 'sheets', 'style_name_map', 'unload_sheet', 'use_mmap', 'user_name', 'verbosity', 'xf_list', 'xfcount']
  1. name属性则是获取每个sheet的名称,nrowncol则是获取每个sheet的行与列的数目

处理单个sheet

读写Excel文件

下面的代码功能在于读取一个excel文件,名称为2excel_parsing_and_write.py,代码如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
#!/usr/bin/env python3
import sys
from xlrd import open_workbook
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
# Create an Workbook object
# Workbook object is intended for output
output_worksheet = output_workbook.add_sheet('jan_2013_output')
# add_sheet function is adding an sheet
with open_workbook(input_file) as workbook:
worksheet = workbook.sheet_by_name('january_2013')
for row_index in range(worksheet.nrows):
for column_index in range(worksheet.ncols):
output_worksheet.write(row_index, column_index, \
worksheet.cell_value(row_index, column_index))
# write function is use for writing data into output files
output_workbook.save(output_file)

运行结果如下所示:

1
C:\Users\20161111\Desktop>python 2excel_parsing_and_write.py sales_2013.xlsx 2output.xls

打开2output.xls后如下所示:

其中第5列,也就是Purchase Date这一列是数字,而不是日期。Excel会将日期和时间保存为浮点数,这个浮点数代表从1900年1月0日起经过的日期数,加上一个24小时的小数部分。例如数值1表示1900年1月1日,因为从1900年1月0日是过去了1天,因此这一列中的数值代表日期,只是这是没有经过格式化的日期数据。

xlrd能够格式化这种日期。

格式化日期

常规方法

格式化日期的Python代码保存为3excel_parsing_and_write_keep_dates.py,代码内容如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
#!/usr/bin/env python3
import sys
from datetime import date
# date function is used to format date data
from xlrd import open_workbook, xldate_as_tuple
# xldate_as_tuple is to transform data representing date, time into tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('jan_2013_output')
with open_workbook(input_file) as workbook:
worksheet = workbook.sheet_by_name('january_2013')
for row_index in range(worksheet.nrows):
row_list_output = []
for col_index in range(worksheet.ncols):
if worksheet.cell_type(row_index, col_index) == 3:
# value 3 represents cell that include date data
date_cell = xldate_as_tuple(worksheet.cell_value\
(row_index, col_index), workbook.datemode)
# datemode determine date is based on 1900 or 1904
date_cell = date(*date_cell[0:3]).strftime\
('%m/%d/%Y')
row_list_output.append(date_cell)
output_worksheet.write(row_index, col_index, date_cell)
else:
non_date_cell = worksheet.cell_value\
(row_index, col_index)
row_list_output.append(non_date_cell)
output_worksheet.write(row_index, col_index, \
non_date_cell)
output_workbook.save(output_file)

运行结果如下所示:

1
C:\Users\20161111\Desktop>python 3excel_parsing_and_write_keep_dates.py sales_2013.xlsx date_output.xls

使用pandas读取Excel文件

使用pandas也可以读取Excel文件,将下面的代码保存为pandas_read_and_write_excel.py,如下所示:

1
2
3
4
5
6
7
8
9
10
11
#!/usr/bin/env python3
import sys
import pandas as pd
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, sheetname = 'january_2013')
writer = pd.ExcelWriter(output_file)
data_frame.to_excel(writer, sheet_name = 'jan_13_output', index = False)
writer.save()

运行结果如下所示:

1
PS D:\netdisk\python_data> python .\pandas_read_and_write_excel.py .\sales_2013.xlsx pandas_output.xls

选取特定的行

现在挑出那些Sale Amount超过1400的行。

常规Python选择某一行

保存下面代码,命名为4excel_value_meets_condition.py,代码如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
#!/usr/bin/env python3
import sys
from datetime import date
from xlrd import open_workbook,xldate_as_tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('jan_2013_output')
sale_amount_column_index = 3
with open_workbook(input_file) as workbook:
worksheet = workbook.sheet_by_name('january_2013')
data = []
header = worksheet.row_values(0)
data.append(header)
for row_index in range(1, worksheet.nrows):
row_list = []
sale_amount = worksheet.cell_value\
(row_index, sale_amount_column_index)
if sale_amount > 1400:
for column_index in range(worksheet.ncols):
cell_value = worksheet.cell_value\
(row_index, column_index)
cell_type = worksheet.cell_type\
(row_index, column_index)
if cell_type == 3:
date_cell = xldate_as_tuple\
(cell_value, workbook.datemode)
date_cell = date(*date_cell[0:3])\
.strftime('%m/%d/%Y')
row_list.append(date_cell)
else:
row_list.append(cell_value)
if row_list:
data.append(row_list)
for list_index, output_list in enumerate(data):
for element_index, element in enumerate(output_list):
output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)

运行结果如下所示:

1
PS D:\netdisk\python_data> python .\4excel_value_meets_condition.py .\sales_2013.xlsx 4output.xls

pandas包选取某一行

在下面的案例中,使用pandas挑出那些Sale Amount列大于1400的行。如果需要多个条件,则需要添加&|符号,代码文件保存为pandas_value_meets_condition.py

1
2
3
4
5
6
7
8
9
10
11
12
13
#!/usr/bin/env python3
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, 'january_2013', index_col = None)
data_frame_value_meets_condition = \
data_frame[data_frame['Sale Amount'].astype(float) > 1400]
writer = pd.ExcelWriter(output_file)
data_frame_value_meets_condition.to_excel(writer, sheet_name = 'jan_13_output',\
index = False)
writer.save()

运行代码如下所示:

1
PS D:\netdisk\python_data> python .\pandas_value_meets_condition.py .\sales_2013.xlsx pandas_output.xlsx

打开pandas_output.xlsx,如下所示:

挑选满足多个条件的行

例如在前面的数据中,我们要挑选01/24/201301/31/2013这两个购买日期的行,代码保存为5excel_value_in_set.py,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
#!/usr/bin/env python3
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, 'january_2013', index_col=None)
important_dates = ['01/24/2013','01/31/2013']
data_frame_value_in_set = data_frame[data_frame['Purchase Date'].isin(important_dates)]
writer = pd.ExcelWriter(output_file)
data_frame_value_in_set.to_excel(writer, sheet_name='jan_13_output', index=False)
writer.save()

运行以下代码即可:

1
D:\netdisk\python_data>python 5excel_value_in_set.py sales_2013.xlsx pandas_output.xls

挑选满足某一特定模式的行

任务:选择那些顾客的名字里,首字母是J的行,使用pandas来实现,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
#!/usr/bin/env python3
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, 'january_2013', index_col=None)
data_frame_value_matches_pattern = data_frame[data_frame['Customer Name']\
.str.startswith("J")]
writer = pd.ExcelWriter(output_file)
data_frame_value_matches_pattern.to_excel(writer, sheet_name='jan_13_output',\
index=False)
writer.save()

按如下方式运行代码:

1
λ python pandas_value_matches_pattern.py sales_2013.xlsx output_files\pandsa_output.xls

选取特定列

使用Python选取特定的列的时,可以采用列索引值,也可以采用列标题来实现。

按索引值来提取列。**

现在假设我们想保留Customer Name和Purchase Date这两列,使用Pandas来实现,保存代码为pandas_column_by_index.py,如下所示:

1
2
3
4
5
6
7
8
9
10
11
#!/usr/bin/env python3
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, 'january_2013', index_col = None)
data_frame_column_by_index = data_frame.iloc[:, [1,4]]
writer = pd.ExcelWriter(output_file)
data_frame_column_by_index.to_excel(writer, sheet_name = 'jan_13_output',index=False)
writer.save()

参考资料