Python数据分析02-Excel文件处理

Python处理Excel的包

在PYthon中处理Excel文件（扩展名为.xls和.xlsx的文件）的扩展包为xlrd和xlwt，这两个包对Excel中的日期型数据支持非常好。而在Anaconda Python中，已经含有了这两个包，不用额外安装。如果是最原始版本的Python，则需要下载这两个包。

书中关于这两个包的介绍并不多，因此一些参数只是告诉你了要么用，至于这个参数具体是做什么的，这个包的某些函数还有没有其它的参数，作者并没有多讲，因此可以检索一下xlrd和xlwt这两个包的文档。

xlrd文档

xlwt文档

练习数据下载

《Python数据分析基础》这本书中的源数据和源代码已经上传至Github上，自己可以下载下来。

下载后的Excel练习数据一共是3个，名称分别为：sales_2013，sales_2014和sales_2015。其中以sales_2013为例说明一下，这个文件中一共有3个Excel表格，名称分别为january_2013，february_2013和march_2013。

Excel与csv文件的不同之处在于，一个Excel文件（后文称为一个Exce工作薄）含有多个工作表（sheet）。通过工作簿的内省可以获取工作表的信息：

Python的内省（Introspection）是一种自我检查行为。在编程过程中，自省能够获取对象的类型与属性。

注：在这篇笔记中，工作簿是指一个Excel文件，而一个工作簿中则中含有多个工作表。

获取工作表的信息

下面的代码命名为1excel_introspect_workbook.py，它的功能在于获取一个Excel文件中工作表的数量、名称和每个工作表中行与列的数量，如下所示：

#!/usr/bin/env python3
import sys
from xlrd import open_workbook
input_file = sys.argv[1]
workbook = open_workbook(input_file)
print('Number of worksheets:', workbook.nsheets)
for worksheet in workbook.sheets():
    print("Worksheet name:", worksheet.name, '\tRows:', \
          worksheet.nrows, '\tColumns:', worksheet.ncols)

运行如下所示：

C:\Users\20161111\Desktop>python 1excel_introspect_workbook.py sales_2013.xlsx
Number of worksheets: 3
Worksheet name: january_2013    Rows: 7         Columns: 5
Worksheet name: february_2013   Rows: 7         Columns: 5
Worksheet name: march_2013      Rows: 7         Columns: 5

解释代码：

使用open_workbook函数打开了一个Excel主文件，并赋给了一个名为workbook的对象，这个对象中包含了工作簿中的所有可用的信息，可以查看一下，如下所示：

1
2

>>> dir(workbook)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_all_sheets_count', '_all_sheets_map', '_externsheet_info', '_externsheet_type_b57', '_extnsht_count', '_extnsht_name_from_num', '_repr_these', '_resources_released', '_rich_text_runlist_map', '_sh_abs_posn', '_sharedstrings', '_sheet_list', '_sheet_names', '_sheet_num_from_name', '_sheet_visibility', '_sheethdr_count', '_supbook_addins_inx', '_supbook_count', '_supbook_locals_inx', '_supbook_types', '_xf_epilogue_done', '_xf_index_to_xl_type_map', 'actualfmtcount', 'addin_func_names', 'biff2_8_load', 'biff_version', 'builtinfmtcount', 'codepage', 'colour_map', 'countries', 'datemode', 'derive_encoding', 'dump', 'encoding', 'fake_globals_get_sheet', 'filestr', 'font_list', 'format_list', 'format_map', 'formatting_info', 'get2bytes', 'get_record_parts', 'get_record_parts_conditional', 'get_sheet', 'get_sheets', 'getbof', 'handle_boundsheet', 'handle_builtinfmtcount', 'handle_codepage', 'handle_country', 'handle_datemode', 'handle_externname', 'handle_externsheet', 'handle_filepass', 'handle_name', 'handle_obj', 'handle_sheethdr', 'handle_sheetsoffset', 'handle_sst', 'handle_supbook', 'handle_writeaccess', 'initialise_format_info', 'load_time_stage_1', 'load_time_stage_2', 'logfile', 'mem', 'name_and_scope_map', 'name_map', 'name_obj_list', 'names_epilogue', 'nsheets', 'on_demand', 'palette_record', 'parse_globals', 'props', 'ragged_rows', 'raw_user_name', 'read', 'release_resources', 'sheet_by_index', 'sheet_by_name', 'sheet_loaded', 'sheet_names', 'sheets', 'style_name_map', 'unload_sheet', 'use_mmap', 'user_name', 'verbosity', 'xf_list', 'xfcount']

name属性则是获取每个sheet的名称，nrow与ncol则是获取每个sheet的行与列的数目

处理单个sheet

读写Excel文件

下面的代码功能在于读取一个excel文件，名称为2excel_parsing_and_write.py，代码如下所示：

#!/usr/bin/env python3
import sys
from xlrd import open_workbook
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
# Create an Workbook object
# Workbook object is intended for output
output_worksheet = output_workbook.add_sheet('jan_2013_output')
# add_sheet function is adding an sheet
with open_workbook(input_file) as workbook:
    worksheet = workbook.sheet_by_name('january_2013')
    for row_index in range(worksheet.nrows):
        for column_index in range(worksheet.ncols):
            output_worksheet.write(row_index, column_index, \
                                   worksheet.cell_value(row_index, column_index))
            # write function is use for writing data into output files
output_workbook.save(output_file)

运行结果如下所示：

1	C:\Users\20161111\Desktop>python 2excel_parsing_and_write.py sales_2013.xlsx 2output.xls

打开2output.xls后如下所示：

其中第5列，也就是Purchase Date这一列是数字，而不是日期。Excel会将日期和时间保存为浮点数，这个浮点数代表从1900年1月0日起经过的日期数，加上一个24小时的小数部分。例如数值1表示1900年1月1日，因为从1900年1月0日是过去了1天，因此这一列中的数值代表日期，只是这是没有经过格式化的日期数据。

xlrd能够格式化这种日期。

格式化日期

常规方法

格式化日期的Python代码保存为3excel_parsing_and_write_keep_dates.py，代码内容如下所示：

#!/usr/bin/env python3
import sys
from datetime import date
# date function is used to format date data
from xlrd import open_workbook, xldate_as_tuple
# xldate_as_tuple is to transform data representing date, time into tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('jan_2013_output')
with open_workbook(input_file) as workbook:
    worksheet = workbook.sheet_by_name('january_2013')
    for row_index in range(worksheet.nrows):
        row_list_output = []
        for col_index in range(worksheet.ncols):
            if worksheet.cell_type(row_index, col_index) == 3:
                # value 3 represents cell that include date data
                date_cell = xldate_as_tuple(worksheet.cell_value\
                                                (row_index, col_index), workbook.datemode)
                # datemode determine date is based on 1900 or 1904 
                date_cell = date(*date_cell[0:3]).strftime\
                    ('%m/%d/%Y')
                row_list_output.append(date_cell)
                output_worksheet.write(row_index, col_index, date_cell)
            else:
                non_date_cell = worksheet.cell_value\
                    (row_index, col_index)
                row_list_output.append(non_date_cell)
                output_worksheet.write(row_index, col_index, \
                                       non_date_cell)
output_workbook.save(output_file)

运行结果如下所示：

1	C:\Users\20161111\Desktop>python 3excel_parsing_and_write_keep_dates.py sales_2013.xlsx date_output.xls

使用`pandas`读取Excel文件

使用pandas也可以读取Excel文件，将下面的代码保存为pandas_read_and_write_excel.py，如下所示：

#!/usr/bin/env python3
import sys
import pandas as pd
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, sheetname = 'january_2013')
writer = pd.ExcelWriter(output_file)
data_frame.to_excel(writer, sheet_name = 'jan_13_output', index = False)
writer.save()

运行结果如下所示：

1	PS D:\netdisk\python_data> python .\pandas_read_and_write_excel.py .\sales_2013.xlsx pandas_output.xls

选取特定的行

现在挑出那些Sale Amount超过1400的行。

常规Python选择某一行

保存下面代码，命名为4excel_value_meets_condition.py，代码如下所示：

#!/usr/bin/env python3
import sys
from datetime import date
from xlrd import open_workbook,xldate_as_tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('jan_2013_output')
sale_amount_column_index = 3
with open_workbook(input_file) as workbook:
    worksheet = workbook.sheet_by_name('january_2013')
    data = []
    header = worksheet.row_values(0)
    data.append(header)
    for row_index in range(1, worksheet.nrows):
        row_list = []
        sale_amount = worksheet.cell_value\
            (row_index, sale_amount_column_index)
        if sale_amount > 1400:
            for column_index in range(worksheet.ncols):
                cell_value = worksheet.cell_value\
                    (row_index, column_index)
                cell_type = worksheet.cell_type\
                    (row_index, column_index)
                if cell_type == 3:
                    date_cell = xldate_as_tuple\
                        (cell_value, workbook.datemode)
                    date_cell = date(*date_cell[0:3])\
                    .strftime('%m/%d/%Y')
                    row_list.append(date_cell)
                else:
                    row_list.append(cell_value)
        if row_list:
            data.append(row_list)
    for list_index, output_list in enumerate(data):
        for element_index, element in enumerate(output_list):
            output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)

运行结果如下所示：

1	PS D:\netdisk\python_data> python .\4excel_value_meets_condition.py .\sales_2013.xlsx 4output.xls

`pandas`包选取某一行

在下面的案例中，使用pandas挑出那些Sale Amount列大于1400的行。如果需要多个条件，则需要添加&与|符号，代码文件保存为pandas_value_meets_condition.py。

#!/usr/bin/env python3
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, 'january_2013', index_col = None)
data_frame_value_meets_condition = \
    data_frame[data_frame['Sale Amount'].astype(float) > 1400]
writer = pd.ExcelWriter(output_file)
data_frame_value_meets_condition.to_excel(writer, sheet_name = 'jan_13_output',\
                                          index = False)
writer.save()

运行代码如下所示：

1	PS D:\netdisk\python_data> python .\pandas_value_meets_condition.py .\sales_2013.xlsx pandas_output.xlsx

打开pandas_output.xlsx，如下所示：

挑选满足多个条件的行

例如在前面的数据中，我们要挑选01/24/2013与01/31/2013这两个购买日期的行，代码保存为5excel_value_in_set.py，如下所示：

#!/usr/bin/env python3
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, 'january_2013', index_col=None)
important_dates = ['01/24/2013','01/31/2013']
data_frame_value_in_set = data_frame[data_frame['Purchase Date'].isin(important_dates)]
writer = pd.ExcelWriter(output_file)
data_frame_value_in_set.to_excel(writer, sheet_name='jan_13_output', index=False)
writer.save()

运行以下代码即可：

1	D:\netdisk\python_data>python 5excel_value_in_set.py sales_2013.xlsx pandas_output.xls

挑选满足某一特定模式的行

任务：选择那些顾客的名字里，首字母是J的行，使用pandas来实现，如下所示：

#!/usr/bin/env python3
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, 'january_2013', index_col=None)
data_frame_value_matches_pattern = data_frame[data_frame['Customer Name']\
    .str.startswith("J")]
writer = pd.ExcelWriter(output_file)
data_frame_value_matches_pattern.to_excel(writer, sheet_name='jan_13_output',\
                                          index=False)
writer.save()

按如下方式运行代码：

1	λ python pandas_value_matches_pattern.py sales_2013.xlsx output_files\pandsa_output.xls

选取特定列

使用Python选取特定的列的时，可以采用列索引值，也可以采用列标题来实现。

按索引值来提取列。**

现在假设我们想保留Customer Name和Purchase Date这两列，使用Pandas来实现，保存代码为pandas_column_by_index.py，如下所示：

#!/usr/bin/env python3
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, 'january_2013', index_col = None)
data_frame_column_by_index = data_frame.iloc[:, [1,4]]
writer  = pd.ExcelWriter(output_file)
data_frame_column_by_index.to_excel(writer, sheet_name = 'jan_13_output',index=False)
writer.save()

Python处理Excel的包

练习数据下载

获取工作表的信息

处理单个sheet

读写Excel文件

格式化日期

常规方法

使用pandas读取Excel文件

选取特定的行

常规Python选择某一行

pandas包选取某一行

挑选满足多个条件的行

挑选满足某一特定模式的行

选取特定列

参考资料

使用`pandas`读取Excel文件

`pandas`包选取某一行