LearnBasicBigDataTech

:rocket: Some projects on big data analysis tools such as Spark, Hive, and Presto, and on data visualization with Superset

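To 一亩三分地 (1Point3Acres): I chose this repository to show my ongoing learning process. Although I do not get many chances to use big data technologies at work, I still study them in my spare time and stay eager to learn. Thank you!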

Big Data Learning on Hive, Spark, Presto, Superset (Data Visualization)


Learning Part 1

Basic operations on Hive

Create table and load data

  1. Create an unpartitioned table stored as a text file

    hive> create table student_nopart(name string, age int, gpa double, gender string, state string) row format delimited fields terminated by '\t' stored as textfile;
    

    Check whether the table was created:

    hive> show tables;
    2018-03-30T19:07:18,959 INFO [pool-11-thread-1] org.apache.hadoop.hive.metastore.HiveMetaStore - 1: source:127.0.0.1 get_database: default
    2018-03-30T19:07:19,105 INFO [pool-11-thread-1] org.apache.hadoop.hive.metastore.HiveMetaStore - 1: source:127.0.0.1 get_database: default
    2018-03-30T19:07:19,122 INFO [pool-11-thread-1] org.apache.hadoop.hive.metastore.HiveMetaStore - 1: source:127.0.0.1 get_tables: db=default pat=.*
    OK
    student_nopart
    Time taken: 0.229 seconds, Fetched: 1 row(s)
    

    The output shows that the table student_nopart was created successfully.

  2. Load local data into Hive

    hive> load data local inpath 'studentgenderstatetab10k' into table student_nopart;  
    ....  
    OK
    Time taken: 1.505 seconds
    

    Check that the data was loaded into the Hive warehouse:

    hive> dfs -ls /user/hive/warehouse/student_nopart;
    Found 1 items
    -rwxr-xr-x   1 hadoop supergroup     268219 2018-03-30 19:10 /user/hive/warehouse/student_nopart/studentgenderstatetab10k
    

    Another way to check is to count the rows:

    hive> select COUNT(*) from student_nopart;  
    .....
    Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.55 sec   HDFS Read: 276442 HDFS Write: 105 SUCCESS
    Total MapReduce CPU Time Spent: 4 seconds 550 msec
    OK
    10000
    Time taken: 35.192 seconds, Fetched: 1 row(s)
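
The `row format delimited fields terminated by '\t'` clause above tells Hive how to cut each line of the text file into columns. A minimal sketch of that idea in plain Java follows. It is not Hive's own parsing code: the class and method names are hypothetical, and the sample line is assembled from a row that appears in the join output later in this README.

```java
final class DelimitedRowSketch {
    // Column order copied from the student_nopart DDL above.
    static final String[] COLUMNS = {"name", "age", "gpa", "gender", "state"};

    // Splits one line of the text file on the tab delimiter.
    // The -1 limit keeps trailing empty fields instead of dropping them.
    static String[] splitFields(String line) {
        return line.split("\t", -1);
    }

    // Parses an int column. Returns null when the text is not a number,
    // mirroring the NULL that Hive shows for a value it cannot parse.
    static Integer parseAge(String field) {
        try {
            return Integer.valueOf(field);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Sample line built from a row shown in this README's join output.
        String line = "luke king\t65\t0.73\tF\tAZ";
        String[] fields = splitFields(line);
        for (int i = 0; i < COLUMNS.length; i++) {
            System.out.println(COLUMNS[i] + " = " + fields[i]);
        }
        System.out.println("age as int: " + parseAge(fields[1]));
    }
}
```

The point of the sketch is that the table definition only describes how to read the file. `load data` moves the file into the warehouse directory as-is, which is why the file name is unchanged in the `dfs -ls` output.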
    

Dynamic partition during insertion

  1. Create a new table partitioned by the columns gender and state.

    hive> create table student_part(name string, age int, gpa double) partitioned by (gender string, state string) row format delimited fields terminated by '\t' stored as textfile;  
    ....  
    2018-03-30T19:48:16,514 INFO [pool-11-thread-4] org.apache.hadoop.hive.common.FileUtils - Creating directory if it doesn't exist: hdfs://localhost:9000/user/hive/warehouse/student_part  
    OK  
    Time taken: 0.698 seconds  
    
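  2. Load data from student_nopart into student_part with dynamic partitioning. Nonstrict mode is needed because both partition columns are dynamic. Hive takes the partition values from the last columns of the select, so gender and state must come last, as they do in student_nopart.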

    hive> set hive.exec.dynamic.partition.mode=nonstrict;  
    hive> insert into student_part partition(gender,state) select * from student_nopart;
    

    Check the directory structure in HDFS:

    hive> dfs -ls /user/hive/warehouse/student_part;
    Found 2 items
    drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:01 /user/hive/warehouse/student_part/gender=F
    drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:01 /user/hive/warehouse/student_part/gender=M
    hive> dfs -ls /user/hive/warehouse/student_part/gender=M;
    Found 4 items
    drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=M/state=AZ
    drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=M/state=CA
    drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=M/state=FL
    drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=M/state=WA
    hive> dfs -ls /user/hive/warehouse/student_part/gender=F;
    Found 4 items
    drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=F/state=AZ
    drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=F/state=CA
    drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=F/state=FL
    drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=F/state=WA
    
  3. Do a simple query

    hive> select COUNT(name) from student_part where gender = 'F' and state = 'AZ'; 
    ......
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.92 sec   	HDFS Read: 45557 HDFS Write: 104 SUCCESS
    Total MapReduce CPU Time Spent: 4 seconds 920 msec
    OK
    1693
    Time taken: 25.845 seconds, Fetched: 1 row(s)
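
The directory listing above shows one directory per (gender, state) pair. A minimal sketch of how a row's partition values map to such a directory is given below in plain Java. It only illustrates the naming scheme and is not how Hive implements it. The class and method names are hypothetical, the table directory is copied from the `dfs -ls` output, and the second sample row is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartitionPathSketch {
    // Table directory copied from the dfs -ls output above.
    static final String TABLE_DIR = "/user/hive/warehouse/student_part";

    // Builds the directory name for one (gender, state) pair.
    static String partitionDir(String gender, String state) {
        return TABLE_DIR + "/gender=" + gender + "/state=" + state;
    }

    // Groups student names by target directory.
    // Each row is {name, age, gpa, gender, state}, as in student_nopart.
    static Map<String, List<String>> route(List<String[]> rows) {
        Map<String, List<String>> byDir = new TreeMap<>();
        for (String[] row : rows) {
            String dir = partitionDir(row[3], row[4]);
            byDir.computeIfAbsent(dir, k -> new ArrayList<>()).add(row[0]);
        }
        return byDir;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        // Row taken from the join output shown later in this README.
        rows.add(new String[] {"luke king", "65", "0.73", "F", "AZ"});
        // Made-up row, for illustration only.
        rows.add(new String[] {"example student", "20", "3.10", "M", "CA"});
        route(rows).forEach((dir, names) -> System.out.println(dir + " <- " + names));
    }
}
```

A filter such as `gender = 'F' and state = 'AZ'` names exactly one of these directories, so Hive can read that directory alone. That fits the smaller HDFS Read figure in the partitioned query (45557 bytes) compared with the full-table count (276442 bytes).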
    

Store file

  1. Store table as ORC file

    hive> create table student_orc(name string, age int, gpa double, gender string, state string) row format delimited fields terminated by '\t' stored as orcfile;
    

    Insert data into table

    hive> insert into table student_orc select * from student_part;
    
  2. Check ORC file

    hadoop@1053d6dea9e8:/src/week3/lab1/data$ hive --orcfiledump /user/hive/warehouse/student_orc/000000_0
    

Hive UDF

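  1. Write the UDFs in Eclipse. There are two UDFs:

    1. Hello, from Daniel's example:

    package com.example.hive.evalfunc;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;
    // Prepends "Hello " to the input; a NULL input stays NULL.
    public class Hello extends UDF {
      public Text evaluate(Text input) {
        if (input == null) return null;
        return new Text("Hello " + input.toString());
      }
    }
    
    2. MyUpper, written by me:

    package com.example.hive.evalfunc;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;
    // Upper-cases the input; a NULL input stays NULL.
    public class MyUpper extends UDF {
      public Text evaluate(Text input) {
        if (input == null) return null;
        return new Text(input.toString().toUpperCase());
      }
    }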
    
  2. Use the UDFs in Hive. First, add the jar to the classpath and create the temporary functions:

    hive> add jar target/evalfunc-0.0.1-SNAPSHOT.jar;
    ....
    Added resources: [target/evalfunc-0.0.1-SNAPSHOT.jar]
    hive> create temporary function hello as 'com.example.hive.evalfunc.Hello';
    OK
    Time taken: 0.558 seconds
    hive> create temporary function myupper as 'com.example.hive.evalfunc.MyUpper';
    OK
    Time taken: 0.011 seconds
    

    Use the UDFs in a query:

    hive> select hello(myupper(name)) from student_nopart limit 10;
    .....
    OK
    Hello ULYSSES THOMPSON
    Hello KATIE CARSON
    Hello LUKE KING
    Hello HOLLY DAVIDSON
    Hello FRED MILLER
    Hello HOLLY WHITE
    Hello LUKE STEINBECK
    Hello NICK UNDERHILL
    Hello HOLLY DAVIDSON
    Hello CALVIN BROWN
    Time taken: 3.137 seconds, Fetched: 10 row(s)
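
The query above nests the two functions, so `myupper` runs first and `hello` receives its result. A minimal sketch of that composition follows. It uses plain `String` in place of Hadoop's `Text` so that it runs without the Hive jars on the classpath. The class name is hypothetical, and passing `null` through unchanged is an assumption of this sketch about how a SQL NULL should be treated.

```java
final class UdfCompositionSketch {
    // Stand-in for MyUpper.evaluate, using String instead of Text.
    static String myUpper(String input) {
        return input == null ? null : input.toUpperCase();
    }

    // Stand-in for Hello.evaluate, using String instead of Text.
    static String hello(String input) {
        return input == null ? null : "Hello " + input;
    }

    public static void main(String[] args) {
        // Inner call first, then the outer call, as in hello(myupper(name)).
        System.out.println(hello(myUpper("ulysses thompson")));
        // prints "Hello ULYSSES THOMPSON"
    }
}
```

Because `hello` is applied last, its prefix keeps its original casing. Swapping the order to `myupper(hello(name))` would upper-case the prefix as well.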
    

HiveServer2

  1. Start the HiveServer2 service

    hadoop@1053d6dea9e8:/src/week3/lab1/evalfunc$ hive --service hiveserver2 &
    
  2. Connect to HiveServer2 using Beeline

    beeline> !connect jdbc:hive2://localhost:10000 hadoop hadoop
    Connecting to jdbc:hive2://localhost:10000
    2018-03-30T21:04:15,672 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: source:127.0.0.1 get_database: default
    2018-03-30T21:04:15,713 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
    Connected to: Apache Hive (version 2.3.2)
    Driver: Hive JDBC (version 2.3.2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    
  3. Show tables

    0: jdbc:hive2://localhost:10000> show tables;
    2018-03-30T21:04:24,479 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: source:127.0.0.1 get_database: default
    2018-03-30T21:04:24,847 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: source:127.0.0.1 get_database: default
    2018-03-30T21:04:24,862 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: source:127.0.0.1 get_tables: db=default pat=.*
    OK
    +-----------------+
    |    tab_name     |
    +-----------------+
    | student_nopart  |
    | student_orc     |
    | student_part    |
    +-----------------+
    3 rows selected (1.858 seconds)
    

    I have to say, the nicely formatted table makes me comfortable and happy. :)

  4. Create a new table

    jdbc:hive2://localhost:10000> create table voter(name string, age int, registration string, contribution double) row format delimited fields terminated by '\t' stored as textfile;
    ......
    OK
    No rows affected (0.188 seconds)
    

    Load data into the table (the path must be a quoted string):

    jdbc:hive2://localhost:10000> load data inpath 'votertab10k' into table voter;
    
  5. Do some simple queries

    1. Count how many distinct student names come from California

      jdbc:hive2://localhost:10000> select COUNT(DISTINCT(name)) as numOfCaliPerson from student_part where state = 'CA';
      ......
      OK
      +------------------+
      | numofcaliperson  |
      +------------------+
      | 595              |
      +------------------+
      1 row selected (23.091 seconds)
      
    2. Join the tables student_orc and voter

      jdbc:hive2://localhost:10000> select * from student_orc s join voter v on s.name = v.name limit 10;
      ......
      OK
      +------------------+--------+--------+-----------+----------+------------------+--------+-----------------+-----------------+
      |      s.name      | s.age  | s.gpa  | s.gender  | s.state  |      v.name      | v.age  | v.registration  | v.contribution  |
      +------------------+--------+--------+-----------+----------+------------------+--------+-----------------+-----------------+
      | luke king        | 65     | 0.73   | F         | AZ       | luke king        | 26     | libertarian     | 604.71          |
      | luke king        | 65     | 0.73   | F         | AZ       | luke king        | 62     | democrat        | 281.26          |
      | luke king        | 65     | 0.73   | F         | AZ       | luke king        | 56     | libertarian     | 177.84          |
      | luke king        | 65     | 0.73   | F         | AZ       | luke king        | 61     | republican      | 239.53          |
      | holly davidson   | 57     | 2.43   | F         | AZ       | holly davidson   | 59     | democrat        | 957.45          |
      | holly davidson   | 57     | 2.43   | F         | AZ       | holly davidson   | 62     | green           | 934.61          |
      | holly davidson   | 57     | 2.43   | F         | AZ       | holly davidson   | 18     | democrat        | 980.63          |
      | holly davidson   | 57     | 2.43   
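
Beeline talks to HiveServer2 over JDBC, so the same connection can be opened from Java. The sketch below is a minimal example under some assumptions: HiveServer2 is listening on localhost:10000 as in the session above, the user and password are the ones passed to `!connect`, and the Hive JDBC driver jar is available on the classpath at run time. The class name and the `jdbcUrl` helper are hypothetical. If the driver or the server is missing, the sketch prints the error instead of failing.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class HiveJdbcSketch {
    // Builds a URL of the same form as the one given to beeline's !connect,
    // with the database name appended.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        String url = jdbcUrl("localhost", 10000, "default");
        // Same count as the California query in this README.
        String sql = "select count(distinct name) from student_part where state = 'CA'";

        // try-with-resources closes the result set, statement and connection.
        try (Connection conn = DriverManager.getConnection(url, "hadoop", "hadoop");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println("distinct names in CA: " + rs.getLong(1));
            }
        } catch (SQLException e) {
            // Reached when no driver is on the classpath or the server is down.
            System.err.println("Could not query HiveServer2: " + e.getMessage());
        }
    }
}
```

Only the standard `java.sql` interfaces appear in the code, so the Hive driver is needed when the program runs but not when it is compiled.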
      