Because it's never too late to start taking notes and make them 'public'...

This repository contains my random (but important) technical notes that I have found useful.
See the individual sub-folders for details on a specific technology.
Notes so far cover Hadoop, Spark, Machine Learning, Java, Python, unit testing, Clean Code, and Python project management.

Follow me on LinkedIn, GitHub, Kaggle

get HDFS file size

$ hdfs dfs -du -s -h  hdfs://hadoop-cluster/user/hive/warehouse/hive_schema.db/table
655.2 M  1.9 G  hdfs://hadoop-cluster/user/hive/warehouse/hive_schema.db/table

[size]     [disk space consumed]
655.2 M  * 3 (replication factor)  = 1.9 G
-s : aggregate summary of file lengths
-h : human-readable sizes instead of a long number in bytes

# Look for specific keywords (glob patterns) in schema and table names
hdfs dfs -du -s -h  /user/hive/warehouse/*hive_schema_name*/*hive_table_name* 

# print total size in bytes (drop -h so that awk sums raw byte counts instead of mixed K/M/G values)
hdfs dfs -du -s /user/hive/warehouse/*hive_schema_name*/*hive_table_name* | awk '{ total += $1 } END { print total }'
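The byte total and the replication arithmetic can also be sketched in Python. This is a minimal sketch, assuming the output of `hdfs dfs -du -s` (without -h) has been captured as text, with the size in bytes in the first column and the consumed disk space in the second. The function names and the sample numbers are hypothetical.

```python
def total_bytes(du_output: str) -> int:
    """Sum the first column (size in bytes) of `hdfs dfs -du -s` output."""
    total = 0
    for line in du_output.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            total += int(fields[0])
    return total


def human_readable(num_bytes: float) -> str:
    """Format a byte count with binary units, similar in spirit to -h."""
    for unit in ("B", "K", "M", "G", "T"):
        if num_bytes < 1024 or unit == "T":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024


# Hypothetical sample: size, disk space consumed, path
sample = """\
1024 3072 /user/hive/warehouse/schema_a.db/table_1
2048 6144 /user/hive/warehouse/schema_a.db/table_2
"""

print(total_bytes(sample))                  # 3072
print(human_readable(total_bytes(sample)))  # 3.0 K

# Replication check from the note above: 655.2 M * 3 is about 1.9 G
print(human_readable(655.2 * 1024 ** 2 * 3))  # 1.9 G
```

Summing raw bytes first and formatting once at the end avoids adding values that carry different units.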

find and delete HIVE tables matching pattern with beeline

beeline -u $BEELINE_URL --showHeader=false --outputformat=tsv2 -e "show tables from $HIVE_SCHEMA like $PATTERN ;" | xargs -I '{}' beeline -u $BEELINE_URL --showHeader=false --outputformat=tsv2 -e " drop table $HIVE_SCHEMA.{} ;" 

xargs - reads data from standard input (stdin) and executes the command (supplied to it as an argument) one or more times based on the input read. By default, blanks and newlines in the input are treated as delimiters, while blank lines are ignored. With -I '{}', each input line replaces {} and the command runs once per line.
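Before dropping anything, it can help to preview the statements that would be generated. The following is a minimal sketch of the one-statement-per-line behaviour, assuming the table names arrive one per line as in the tsv2 output above. The function name `drop_statements` and the sample table names are hypothetical, and nothing here talks to Hive.

```python
from typing import List


def drop_statements(schema: str, table_names_output: str) -> List[str]:
    """Build one DROP TABLE statement per non-blank input line."""
    statements = []
    for line in table_names_output.splitlines():
        table = line.strip()
        if table:  # blank lines are ignored, as xargs does
            statements.append(f"drop table {schema}.{table};")
    return statements


# Hypothetical output of the "show tables" step
listing = "tmp_orders\n\ntmp_customers\n"
for statement in drop_statements("my_schema", listing):
    print(statement)
```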

HIVE: select column names based on regular expression

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification

set hive.support.quoted.identifiers=none;
select `<regular expression>` from  hive_table ;
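To preview which columns a pattern would pick before running the query, the matching can be approximated in Python. This is only a sketch: it assumes the pattern is matched against the whole column name and case-insensitively, and Hive uses Java regular expressions, whose syntax differs from Python's `re` for advanced constructs. The column names and the helper `matching_columns` are hypothetical.

```python
import re
from typing import List


def matching_columns(pattern: str, columns: List[str]) -> List[str]:
    """Return the columns whose full name matches the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [name for name in columns if regex.fullmatch(name)]


# Hypothetical table columns
columns = ["id", "price_eur", "price_usd", "created_at"]
print(matching_columns(r"price_.*", columns))  # ['price_eur', 'price_usd']
```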

hdfs dfs -checksum

https://community.hortonworks.com/questions/19239/hadoop-checksum-calculation-doubts.html

hdfs dfs -checksum <hdfs url>
<hdfs url>        MD5-of-0MD5-of-512CRC32C        00000200000000000000000024c3cf9f64d08eaafeb25bb9776f793c




- Datanodes are responsible for verifying the data they receive before storing the data and its checksum
- When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanodes
- 'get' command : HDFS computes a checksum for each block of each file. The checksums for a file are stored separately in a hidden file. When a file is read from HDFS, the checksums in that hidden file are used to verify the file’s integrity. For the get command, the -crc option will copy that hidden checksum file. The -ignorecrc option will skip the checksum checking when copying
- A separate checksum is created for every dfs.bytes-per-checksum bytes of data. The default is 512 bytes
- Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified.
- "MD5-of-0MD5-of-512CRC32C" : http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201508.mbox/%3CCAMm20=5K+f3ArVtoo9qMSesjgd_opdcvnGiDTkd3jpn7SHkysg@mail.gmail.com%3E
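The per-chunk idea from the notes above can be sketched in a few lines of Python. This is an illustration only: it uses `zlib.crc32` (plain CRC-32) as a stand-in, not the CRC32C that HDFS uses, and it does not reproduce the value printed by `hdfs dfs -checksum`. The function names are hypothetical.

```python
import zlib
from typing import List

BYTES_PER_CHECKSUM = 512  # default chunk size mentioned in the notes above


def chunk_checksums(data: bytes, chunk_size: int = BYTES_PER_CHECKSUM) -> List[int]:
    """Compute one CRC-32 per chunk_size bytes of data."""
    return [
        zlib.crc32(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


def corrupt_chunks(data: bytes, expected: List[int]) -> List[int]:
    """Return the indices of chunks whose checksum no longer matches."""
    actual = chunk_checksums(data)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = bytes(1300)               # 1300 zero bytes -> 3 chunks
stored = chunk_checksums(original)   # what would be kept next to the data

damaged = bytearray(original)
damaged[600] = 1                     # flip a byte inside the second chunk
print(len(stored))                             # 3
print(corrupt_chunks(bytes(damaged), stored))  # [1]
```

Keeping one checksum per small chunk means a reader can tell which part of a block is damaged instead of only knowing that the block as a whole is bad.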

installing jupyter on windows

# On command prompt
C:\> python -m pip install jupyter

# It creates an entry in C:\Python36\Scripts
C:\Python36\Scripts\jupyter.exe

C:\>jupyter notebook
[I 12:36:39.808 NotebookApp] Serving notebooks from local directory: C:\
[I 12:36:39.808 NotebookApp] 0 active kernels
[I 12:36:39.808 NotebookApp] The Jupyter Notebook is running at:
[I 12:36:39.808 NotebookApp] http://localhost:8888/?token=1d3293c485f32b492a93cf6dae1088d51d6d4635dff7630d
[I 12:36:39.808 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 12:36:39.808 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=1d3293c485f32b492a93cf6dae1088d51d6d4635dff7630d
[I 12:36:39.976 NotebookApp] Accepting one-time-token-authenticated connection from ::1

Test your Knowledge with Stack Overflow

https://medium.com/dunder-data/how-to-learn-pandas-108905ab4955

You don’t really know a Python library if you cannot answer the majority of questions on it that are asked on Stack Overflow. This statement might be a little too strong, but in general, Stack Overflow provides a great testing ground for your knowledge of a particular library. There are over 50,000 questions tagged as pandas, so you have an endless test bank to build your pandas knowledge.

If you have never answered a question on Stack Overflow, I would recommend looking at older questions that already have answers and attempting to answer them by only using the documentation. After you feel like you can put together high-quality answers, I would suggest making attempts at unanswered questions. Nothing improved my pandas skills more than answering questions on Stack Overflow.

Alter HIVE table name

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/Alter/UseDatabase
This statement lets you change the name of a table to a different name. As of version 0.6, a rename on a managed table moves its HDFS location. Rename has been changed as of version 2.2.0 (HIVE-14909) so that a managed table's HDFS location is moved only if the table is created without a LOCATION clause and under its database directory. Hive versions prior to 0.6 just renamed the table in the metastore without moving the HDFS location.

ALTER TABLE old_table RENAME TO new_table;

Records of a renamed table are not visible in PySpark! The PySpark API reads the data location from a table property ('path' in the SerDe properties). The Hive rename command fails to update this property, so queries from PySpark return no records. The resolution is to alter the table and point the 'path' entry in 'serdeproperties' at the new location.

alter table my_schema.new_table set serdeproperties ('path'='hdfs://hadoop-cluster/user/hive/warehouse/my_schema.db/new_table')

The GIT stuff

commonly used commands in the order that I follow mostly :-)

https://www.youtube.com/watch?v=47uih9Tp6H8
https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow

git clone <git dir path>
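git init # only when starting a brand-new local repository; not needed after git clone
git pull
git checkout -b feature/XYZ # create the feature branch and switch to it
git add . # . stages all new and changed files under the current directory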
git status
git commit -m "comments"
git push origin <branch name> 

git rm file1.txt
# But if you want to remove the file only from the Git repository and keep it on the filesystem, use:
git rm --cached ipmvp-nifi-custom.iml

If git pull asks you to specify the exact remote branch
Ex: There is no tracking information for the current branch. Please specify which branch you want to merge with.

    git branch --set-upstream-to=origin/<branch> feature/XXX-100

Create new branch and push to git

git branch feature/ABC-123
git checkout feature/ABC-123
git add .
git commit -m "bla bla"
git push origin feature/ABC-123

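Clear all local commits!

git reset --hard HEAD^ # drops only the most recent commit, plus any uncommitted changes
git reset --hard origin/<branch> # drops every local commit that is not on the remote branch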

git checkout develop
git pull
git checkout - # switch to feature branch
git reset --hard HEAD # only if you have to clean up local changes
git rebase develop

git submodule

https://blog.github.com/2016-02-01-working-with-submodules/
https://git-scm.com/book/de/v1/Git-Tools-Submodule

git release with git-flow extensions
- release/merge feature branch to develop and master
```
git flow release start 0.1.0
git checkout master
git merge release/0.1.0
git flow release finish '0.1.0'
```
- 'tags' - once master is updated with the new changes, it should be tagged with the updated version number.

git setup the connection or 407 error

https://stackoverflow.com/questions/24907140/git-returns-http-error-407-from-proxy-after-connect

git config --global http.proxy http://username:password@proxyURL:proxyPort

git config --global http.sslVerify false
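One thing to watch in the proxy URL: characters such as @ or : inside the username or password have to be percent-encoded, otherwise the URL is parsed incorrectly. A minimal sketch of building such a URL, assuming hypothetical credentials and a hypothetical proxy host and port:

```python
from urllib.parse import quote


def proxy_url(username: str, password: str, host: str, port: int) -> str:
    """Build an http proxy URL with percent-encoded credentials."""
    user = quote(username, safe="")  # safe="" also encodes '/' and ':'
    secret = quote(password, safe="")
    return f"http://{user}:{secret}@{host}:{port}"


# Hypothetical values, for illustration only
print(proxy_url("jane", "p@ss:word", "proxy.example.com", 8080))
# http://jane:p%40ss%3Aword@proxy.example.com:8080
```

The resulting string is what would be passed as the value of http.proxy.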

Git Rebase

https://medium.com/@fredrikmorken/why-you-should-stop-using-git-rebase-5552bee4fed1
https://www.jetbrains.com/help/pycharm/apply-changes-from-one-branch-to-another.html

Rebase the feature branch onto develop, i.e. pull the updates from develop into the feature branch.

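# starting on the feature branch
  git checkout develop
  git pull origin develop # to update local develop
  git checkout - # switch back to the feature branch
  git pull --rebase origin develop # replay the feature commits on top of develop
  git rebase --continue # only after resolving conflicts, to resume the rebase
  git status # to confirm that the rebase completed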

git flow

https://danielkummer.github.io/git-flow-cheatsheet/

Merge develop into master by creating a release branch.
The version of the new release is the same as the release branch name.

steps using git flow:

git flow for feature branch

git flow init  
git flow feature start MYFEATURE
git flow feature finish MYFEATURE
git flow feature publish MYFEATURE
git flow feature pull origin MYFEATURE
git flow feature track MYFEATURE

git flow for release

git flow release start RELEASE  
git flow release publish RELEASE
git flow release finish RELEASE
git push origin --tags

git merge conflicts

https://www.atlassian.com/git/tutorials/using-branches/merge-conflicts

When a feature branch needs to be merged into a develop branch that has already been updated with other changes

Steps:

first update feature from develop.
resolve merge conflicts
then merge feature into develop
resolve merge conflicts

AWS CLI

Install CLI : https://docs.aws.amazon.com/cli/latest/userguide/awscli-install-windows.html#awscli-install-windows-path

Create a profile

C:\Users\vkbomb>aws configure --profile test
AWS Access Key ID []: XXX
AWS Secret Access Key []: YY
Default region name []: eu-central-1
Default output format [None]:

List S3 objects

C:\Users>aws s3 ls s3:
