Preface
Part I Introduction: Theory and Tools
Chapter 1 Hadoop Basics
Chimpanzee and Elephant Start a Business
Map-Only Jobs: Processing Records Individually
Pig Latin Map-Only Job
Setting Up a Docker Hadoop Cluster
    Running the Job
Summary
Chapter 2 MapReduce
Chimpanzee and Elephant Save Christmas
    Trouble in Toyland
    Chimpanzees Turn Letters into Labeled Toy Forms
Pygmy Elephants Carry Each Toy Form to the Appropriate Workbench
Example: Reindeer Games
    UFO Data
    Grouping UFO Sightings by Reporting Delay
    Mapper
    Reducer
    Visualizing the Data
    Reindeer Summary
Hadoop Versus Traditional Databases
The MapReduce Haiku
    The Map Phase in Brief
    The Group-Sort Phase in Brief
    The Reduce Phase in Brief
Summary
Chapter 3 A Quick Look at the Baseball Dataset
The Data
Acronyms and Terminology
Rules and Goals
Performance Metrics
Summary
Chapter 4 Introduction to Pig
Pig Helps Hadoop Work with Tables, Not Records
    Wikipedia Visitor Counts
Fundamental Data Operations
    Control Operations
    Pipeline Operations
    Structural Operations
LOAD Locates and Describes Your Data
    Simple Types
    Complex Type 1, Tuples: Fixed-Length Sequences of Typed Fields
    Complex Type 2, Bags: Unbounded Collections of Tuples
    Defining the Schema of a Transformed Record
STORE Writes Data to Disk
Helper Commands
    DESCRIBE
    DUMP
    SAMPLE
    ILLUSTRATE
    EXPLAIN
Pig Functions
Piggybank
Apache DataFu
Summary
Part II Tactics: Analytic Patterns
Chapter 5 Map-Only Operations
    Pattern in Use
Eliminating Data
Selecting Records That Satisfy a Condition: FILTER and Friends
    Selecting Records That Satisfy Multiple Conditions
    Selecting or Rejecting Records with Null Values
    Selecting Records That Match a Regular Expression (MATCHES)
    Matching Records Against a Fixed List of Values
Projecting Fields by Name
    Using FOREACH to Select, Rename, and Reorder Fields
    Extracting a Random Sample of Records
    Extracting a Consistent Sample by Key
    Sampling Roughly by Loading Only Some part-Files
    Selecting a Fixed Number of Records with LIMIT
    Other Data Elimination Patterns
Transforming Records
    Transforming Records Individually with FOREACH
    A Nested FOREACH Allows Intermediate Expressions
    Formatting a String According to a Template
    Assembling Literals with Complex Types
    Manipulating the Type of a Field
    Integers, Floats, and Rounding
    Calling a User-Defined Function from an External Package
Operations That Split One Table into Many
    Directing Data Conditionally into Multiple Dataflows (SPLIT)
Operations That Combine Several Tables into One
    Merging Several Pig Relations into a Single Table (Stacking Rowsets)
Summary
Chapter 6 Grouping Operations
Grouping Records into a Bag by Key
    Pattern in Use
    Counting Occurrences of a Key
    Representing a Collection of Values with a Delimited String
    Representing a Complex Data Structure with a Delimited String
    Representing a Complex Data Structure with a JSON-Encoded String
Grouping and Aggregating
    Aggregating Statistics of a Group
    Completely Summarizing a Field
    Summarizing Aggregate Statistics for an Entire Table
    Summarizing a String Field
Calculating the Distribution of Numeric Values with a Histogram
    Pattern in Use
    Binning Data for a Histogram
    Choosing a Bin Size
    Interpreting Histograms and Quantiles
    Binning Data into Exponentially Sized Buckets
    Creating Pig Macros for Common Stanzas
    Distribution of Games Played
    Extreme Cases and Confounding Factors
    Don't Trust the Tails of a Distribution
    Calculating a Relative Distribution Histogram
    Reinjecting Global Values
    Calculating a Histogram Within a Group
    Exporting Readable Results
The Summing Trick
    Counting Conditional Subsets of a Group: The Summing Trick
    Summarizing Multiple Subsets of a Group Simultaneously
    Testing Whether a Value Is Absent Within a Group
Summary
References
Chapter 7 Joining Tables
Matching Records Between Tables (Inner Join)
    Joining Records in One Table with Directly Matching Records in Another (Direct Inner Join)
How a Join Works
    A Join Is a COGROUP+FLATTEN
    A Join Is a MapReduce Job with a Secondary Sort on the Table Name
    Handling Nulls and Nonmatches in Joins and Groups
Enumerating a Many-to-Many Relationship
Joining a Table with Itself (Self-Join)
Joining While Keeping Nonmatching Records (Outer Join)
    Pattern in Use
    Joining Tables That Have No Foreign-Key Relationship
    Joining on an Integer Table to Fill Gaps in a List
Selecting Only Records That Have No Match in Another Table (Anti-Join)
Selecting Only Records That Have a Match in Another Table (Semi-Join)
    An Alternative to the Anti-Join: Using COGROUP
Summary
Chapter 8 Ordering Operations
Preparing Career Epochs
Sorting All Records in Total Order
    Sorting by Multiple Fields
    Sorting on an Expression (It Doesn't Work)
    Case-Insensitive String Sorting
    Handling Nulls When Sorting
    Placing Values at the Top or Bottom of the Sort Order
Sorting Within a Group
    Pattern in Use
    Selecting Rows with the Top-K Values of a Field
    Top-K Within a Group
Numbering Records in Sort Order
    Finding the Records Associated with Maximum Values
    Shuffling a Set of Records
Summary
Chapter 9 Duplicate and Unique Records
Handling Duplicates
    Eliminating Duplicate Records from a Table
    Eliminating Duplicate Records Within a Group
    Eliminating Duplicates Based on a Key
    Selecting Unique (or Duplicate) Records Based on a Key
Set Operations
    Set Operations on Full Tables
    Distinct Union
    Distinct Union (Alternative Method)
    Set Intersection
    Set Difference
    Symmetric Difference: (A-B)+(B-A)
    Set Equality
    Set Operations Within Groups
    Constructing a Sequence of Sets
    Set Operations Within a Single Group
Summary
Index