LearnBasicBigDataTech
:rocket: Some projects on big data analysis (Spark, Hive, Presto) and data visualization (Superset)
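To the 1Point3Acres (一亩三分地) community: I chose this repository to show my ongoing learning process. Although I have few chances at work to use the big data technology stack, I keep studying it in my spare time and stay eager to learn. Thank you!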
Big Data Learning on Hive, Spark, Presto, Superset (Data Visualization)
Learning Part 1
Basic operations on Hive
Create table and load data
- Create a table without partitions, stored as a text file:

      hive> create table student_nopart(name string, age int, gpa double, gender string, state string) row format delimited fields terminated by '\t' stored as textfile;

  Check whether the table was created:

      hive> show tables;
      2018-03-30T19:07:18,959 INFO [pool-11-thread-1] org.apache.hadoop.hive.metastore.HiveMetaStore - 1: source:127.0.0.1 get_database: default
      2018-03-30T19:07:19,105 INFO [pool-11-thread-1] org.apache.hadoop.hive.metastore.HiveMetaStore - 1: source:127.0.0.1 get_database: default
      2018-03-30T19:07:19,122 INFO [pool-11-thread-1] org.apache.hadoop.hive.metastore.HiveMetaStore - 1: source:127.0.0.1 get_tables: db=default pat=.*
      OK
      student_nopart
      Time taken: 0.229 seconds, Fetched: 1 row(s)

  The table student_nopart was created successfully.
- Load local data into Hive:

      hive> load data local inpath 'studentgenderstatetab10k' into table student_nopart;
      ....
      OK
      Time taken: 1.505 seconds

  Check that the data was loaded into the Hive warehouse:

      hive> dfs -ls /user/hive/warehouse/student_nopart;
      Found 1 items
      -rwxr-xr-x 1 hadoop supergroup 268219 2018-03-30 19:10 /user/hive/warehouse/student_nopart/studentgenderstatetab10k

  Another way to check is to count the rows:

      hive> select COUNT(*) from student_nopart;
      .....
      Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.55 sec HDFS Read: 276442 HDFS Write: 105 SUCCESS
      Total MapReduce CPU Time Spent: 4 seconds 550 msec
      OK
      10000
      Time taken: 35.192 seconds, Fetched: 1 row(s)
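To make the `fields terminated by '\t'` clause concrete, here is a minimal Java sketch of how one line of the text file maps onto the five columns of student_nopart. The `StudentRow` class and its `parse` method are hypothetical names introduced only for illustration; they are not part of Hive or of this repository. The sample line reuses a row that appears later in the join output.

```java
// Hypothetical helper for illustration only; not part of Hive or this repository.
class StudentRow {
    final String name;
    final int age;
    final double gpa;
    final String gender;
    final String state;

    StudentRow(String name, int age, double gpa, String gender, String state) {
        this.name = name;
        this.age = age;
        this.gpa = gpa;
        this.gender = gender;
        this.state = state;
    }

    // Splits one line on tabs, mirroring "fields terminated by '\t'".
    static StudentRow parse(String line) {
        String[] f = line.split("\t", -1); // -1 keeps trailing empty fields
        if (f.length != 5) {
            throw new IllegalArgumentException("expected 5 fields, got " + f.length);
        }
        return new StudentRow(f[0], Integer.parseInt(f[1]), Double.parseDouble(f[2]), f[3], f[4]);
    }

    public static void main(String[] args) {
        StudentRow r = parse("luke king\t65\t0.73\tF\tAZ");
        System.out.println(r.name + " | " + r.age + " | " + r.gpa + " | " + r.gender + " | " + r.state);
    }
}
```

This sketch is stricter than Hive itself: it throws on a malformed line, whereas Hive's text format is generally lenient and tends to return NULL for a field it cannot parse.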
Dynamic partition during insertion
- Create a new table partitioned by the columns gender and state:

      hive> create table student_part(name string, age int, gpa double) partitioned by (gender string, state string) row format delimited fields terminated by '\t' stored as textfile;
      ....
      2018-03-30T19:48:16,514 INFO [pool-11-thread-4] org.apache.hadoop.hive.common.FileUtils - Creating directory if it doesn't exist: hdfs://localhost:9000/user/hive/warehouse/student_part
      OK
      Time taken: 0.698 seconds

- Load data from student_nopart into student_part using dynamic partitioning. Dynamic partition columns must come last in the select list; select * works here because gender and state are the last two columns of student_nopart.

      hive> set hive.exec.dynamic.partition.mode=nonstrict;
      hive> insert into student_part partition(gender,state) select * from student_nopart;

  Check the directory structure in HDFS:

      hive> dfs -ls /user/hive/warehouse/student_part;
      Found 2 items
      drwxr-xr-x - hadoop supergroup 0 2018-03-30 20:01 /user/hive/warehouse/student_part/gender=F
      drwxr-xr-x - hadoop supergroup 0 2018-03-30 20:01 /user/hive/warehouse/student_part/gender=M
      hive> dfs -ls /user/hive/warehouse/student_part/gender=M;
      Found 4 items
      drwxr-xr-x - hadoop supergroup 0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=M/state=AZ
      drwxr-xr-x - hadoop supergroup 0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=M/state=CA
      drwxr-xr-x - hadoop supergroup 0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=M/state=FL
      drwxr-xr-x - hadoop supergroup 0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=M/state=WA
      hive> dfs -ls /user/hive/warehouse/student_part/gender=F;
      Found 4 items
      drwxr-xr-x - hadoop supergroup 0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=F/state=AZ
      drwxr-xr-x - hadoop supergroup 0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=F/state=CA
      drwxr-xr-x - hadoop supergroup 0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=F/state=FL
      drwxr-xr-x - hadoop supergroup 0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=F/state=WA

- Do a simple query:

      hive> select COUNT(name) from student_part where gender = 'F' and state = 'AZ';
      ......
      MapReduce Jobs Launched:
      Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.92 sec HDFS Read: 45557 HDFS Write: 104 SUCCESS
      Total MapReduce CPU Time Spent: 4 seconds 920 msec
      OK
      1693
      Time taken: 25.845 seconds, Fetched: 1 row(s)
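The HDFS listing shows that each distinct (gender, state) pair gets its own `key=value` directory. As a rough illustration of that layout, the Java sketch below derives the relative partition path for each row and counts rows per path. `PartitionLayout`, `partitionPath`, and `countByPartition` are hypothetical names, and the three input rows are made up for the demo; this only mimics the naming scheme and is not how Hive implements partitioning.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the gender=.../state=... directory naming.
class PartitionLayout {
    // Builds the relative directory for one row, in the key=value form
    // seen under /user/hive/warehouse/student_part.
    static String partitionPath(String gender, String state) {
        return "gender=" + gender + "/state=" + state;
    }

    // Counts rows per partition directory; each input row is {gender, state}.
    static Map<String, Integer> countByPartition(String[][] rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] row : rows) {
            counts.merge(partitionPath(row[0], row[1]), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample rows, only to exercise the grouping.
        String[][] rows = { {"F", "AZ"}, {"M", "CA"}, {"F", "AZ"} };
        countByPartition(rows).forEach((path, n) -> System.out.println(path + " -> " + n));
    }
}
```

A filter such as `gender = 'F' and state = 'AZ'` corresponds to exactly one of these directories, which is consistent with the smaller HDFS Read figure reported for the partitioned query than for the full-table count.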
Store file
- Store the table as an ORC file:

      hive> create table student_orc(name string, age int, gpa double, gender string, state string) row format delimited fields terminated by '\t' stored as orcfile;

  Insert data into the table:

      hive> insert into table student_orc select * from student_part;

- Check the ORC file:

      hadoop@1053d6dea9e8:/src/week3/lab1/data$ hive --orcfiledump /user/hive/warehouse/student_orc/000000_0
Hive UDF
- Write the UDFs in Eclipse. There are two UDFs, both in the package com.example.hive.evalfunc, and both need to import org.apache.hadoop.hive.ql.exec.UDF and org.apache.hadoop.io.Text:
  - Hello, from Daniel's example:

        public class Hello extends UDF {
            // Prepends "Hello " to the input value.
            public Text evaluate(Text input) {
                return new Text("Hello " + input.toString());
            }
        }

  - MyUpper, written by myself:

        public class MyUpper extends UDF {
            // Converts the input value to upper case.
            public Text evaluate(Text input) {
                return new Text(input.toString().toUpperCase());
            }
        }

- Use the UDFs in Hive. First, add the jar to the classpath and create the temporary functions:

      hive> add jar target/evalfunc-0.0.1-SNAPSHOT.jar;
      ....
      Added resources: [target/evalfunc-0.0.1-SNAPSHOT.jar]
      hive> create temporary function hello as 'com.example.hive.evalfunc.Hello';
      OK
      Time taken: 0.558 seconds
      hive> create temporary function myupper as 'com.example.hive.evalfunc.MyUpper';
      OK
      Time taken: 0.011 seconds

  Then use the UDFs in a query:

      hive> select hello(myupper(name)) from student_nopart limit 10;
      .....
      OK
      Hello ULYSSES THOMPSON
      Hello KATIE CARSON
      Hello LUKE KING
      Hello HOLLY DAVIDSON
      Hello FRED MILLER
      Hello HOLLY WHITE
      Hello LUKE STEINBECK
      Hello NICK UNDERHILL
      Hello HOLLY DAVIDSON
      Hello CALVIN BROWN
      Time taken: 3.137 seconds, Fetched: 10 row(s)
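The composition `hello(myupper(name))` can be checked without a Hive installation. The sketch below restates the two `evaluate` bodies on plain `String` values instead of Hadoop's `Text`, so it runs with the JDK alone. The class name `UdfLogic` is hypothetical, and the null guards and the fixed `Locale.ROOT` are additions assumed here for safety: the original UDFs call `input.toString()` directly, so they would likely fail on a NULL column value.

```java
import java.util.Locale;

// Plain-Java restatement of the two UDF bodies, for testing the logic
// outside Hive. Hypothetical class; not part of the repository's jar.
class UdfLogic {
    // Mirrors MyUpper.evaluate, with an added null guard (an assumption).
    static String myUpper(String input) {
        return input == null ? null : input.toUpperCase(Locale.ROOT);
    }

    // Mirrors Hello.evaluate, with an added null guard (an assumption).
    static String hello(String input) {
        return input == null ? null : "Hello " + input;
    }

    public static void main(String[] args) {
        // Same nesting order as the Hive query: hello(myupper(name)).
        System.out.println(hello(myUpper("luke king")));
    }
}
```

Keeping the string logic in static methods like these would also make it easy to unit test before packaging the jar.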
HiveServer2
- Start the HiveServer2 service:

      hadoop@1053d6dea9e8:/src/week3/lab1/evalfunc$ hive --service hiveserver2 &

- Connect to HiveServer2 using beeline:

      beeline> !connect jdbc:hive2://localhost:10000 hadoop hadoop
      Connecting to jdbc:hive2://localhost:10000
      2018-03-30T21:04:15,672 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: source: 127.0.0.1 get_database: default
      2018-03-30T21:04:15,713 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
      Connected to: Apache Hive (version 2.3.2)
      Driver: Hive JDBC (version 2.3.2)
      Transaction isolation: TRANSACTION_REPEATABLE_READ

- Show tables:

      0: jdbc:hive2://localhost:10000> show tables;
      2018-03-30T21:04:24,479 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: source:127.0.0.1 get_database: default
      2018-03-30T21:04:24,847 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: source:127.0.0.1 get_database: default
      2018-03-30T21:04:24,862 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: source:127.0.0.1 get_tables: db=default pat=.*
      OK
      +-----------------+
      |    tab_name     |
      +-----------------+
      | student_nopart  |
      | student_orc     |
      | student_part    |
      +-----------------+
      3 rows selected (1.858 seconds)

  I have to say, the neatly formatted table makes me comfortable and happy. :)
- Create a new table:

      jdbc:hive2://localhost:10000> create table voter(name string, age int, registration string, contribution double) row format delimited fields terminated by '\t' stored as textfile;
      ......
      OK
      No rows affected (0.188 seconds)

  Load data into the table (the path must be a quoted string literal):

      jdbc:hive2://localhost:10000> load data inpath 'votertab10k' into table voter;

- Do some simple queries
  - Count how many distinct student names come from California:

        jdbc:hive2://localhost:10000> select COUNT(DISTINCT(name)) as numOfCaliPerson from student_part where state = 'CA';
        ......
        OK
        +------------------+
        | numofcaliperson  |
        +------------------+
        | 595              |
        +------------------+
        1 row selected (23.091 seconds)

  - Join the tables student_orc and voter on name:

        jdbc:hive2://localhost:10000> select * from student_orc s join voter v on s.name = v.name limit 10;
        ......
        OK
        +------------------+--------+--------+-----------+----------+------------------+--------+-----------------+-----------------+
        |      s.name      | s.age  | s.gpa  | s.gender  | s.state  |      v.name      | v.age  | v.registration  | v.contribution  |
        +------------------+--------+--------+-----------+----------+------------------+--------+-----------------+-----------------+
        | luke king        | 65     | 0.73   | F         | AZ       | luke king        | 26     | libertarian     | 604.71          |
        | luke king        | 65     | 0.73   | F         | AZ       | luke king        | 62     | democrat        | 281.26          |
        | luke king        | 65     | 0.73   | F         | AZ       | luke king        | 56     | libertarian     | 177.84          |
        | luke king        | 65     | 0.73   | F         | AZ       | luke king        | 61     | republican      | 239.53          |
        | holly davidson   | 57     | 2.43   | F         | AZ       | holly davidson   | 59     | democrat        | 957.45          |
        | holly davidson   | 57     | 2.43   | F         | AZ       | holly davidson   | 62     | green           | 934.61          |
        | holly davidson   | 57     | 2.43   | F         | AZ       | holly davidson   | 18     | democrat        | 980.63          |
        | holly davidson   | 57     | 2.43
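The same student row repeats in the join output because name is not unique in voter: an inner join emits one output row per matching pair. The Java sketch below shows that fan-out with a naive nested-loop join over a few rows copied from the output above. `NameJoin` and `join` are hypothetical names, and a nested loop is used only for clarity; Hive plans joins very differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of inner-join fan-out on a non-unique key.
class NameJoin {
    // Nested-loop inner join on element 0 (name) of each row.
    // Each output row is the student columns followed by the voter columns.
    static List<String[]> join(List<String[]> students, List<String[]> voters) {
        List<String[]> out = new ArrayList<>();
        for (String[] s : students) {
            for (String[] v : voters) {
                if (s[0].equals(v[0])) {
                    String[] row = new String[s.length + v.length];
                    System.arraycopy(s, 0, row, 0, s.length);
                    System.arraycopy(v, 0, row, s.length, v.length);
                    out.add(row);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> students = Arrays.asList(
            new String[] {"luke king", "65", "0.73", "F", "AZ"},
            new String[] {"holly davidson", "57", "2.43", "F", "AZ"});
        List<String[]> voters = Arrays.asList(
            new String[] {"luke king", "26", "libertarian", "604.71"},
            new String[] {"luke king", "62", "democrat", "281.26"},
            new String[] {"holly davidson", "59", "democrat", "957.45"});
        for (String[] row : join(students, voters)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

With two voter rows named luke king, the single luke king student row appears twice in the result, which is the same pattern as in the Hive output.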
-
