MyLearningNotes
Because it's never too late to start taking notes and make them public...
- This repository contains my random (but important) technical notes that I found useful.
- See the individual sub-folders for details on a specific technology.
- Notes so far include topics like Hadoop, Spark, Machine Learning, Java, Python, Unit testing, Clean Code, and Python project management.
get HDFS file size
$ hdfs dfs -du -s -h hdfs://hadoop-cluster/user/hive/warehouse/hive_schema.db/table
655.2 M 1.9 G hdfs://hadoop-cluster/user/hive/warehouse/hive_schema.db/table
[size] [disk space consumed across all replicas]
655.2 M * 3 (replication factor) ≈ 1.9 G
-s : show an aggregate summary of file lengths instead of one line per file
-h : print sizes in human-readable units (K, M, G) instead of raw bytes
# Look for specific keywords
hdfs dfs -du -s -h /user/hive/warehouse/*hive_schema_name*/*hive_table_name*
# Print the total size in bytes (drop -h so awk sums raw byte counts rather than mixed units)
hdfs dfs -du -s /user/hive/warehouse/*hive_schema_name*/*hive_table_name* | awk '{ total += $1 } END { print total }'
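The replication arithmetic above can also be checked from raw byte counts. This is a minimal sketch, assuming the two-number output format shown above (size, then disk space consumed); the `du_summary` name and the sample byte values are hypothetical, and real input would come from `hdfs dfs -du -s <path>` without `-h`.

```shell
# Summarise `hdfs dfs -du -s` output (without -h) read from stdin.
# Assumes each line is: <size in bytes> <disk space consumed in bytes> <path>
du_summary() {
  awk '{ size += $1; disk += $2 }
       END {
         if (size > 0)
           printf "size=%.0f disk=%.0f replication=%.1f\n", size, disk, disk / size
       }'
}

# Hypothetical sample lines standing in for real hdfs output.
printf '%s\n' \
  '1000 3000 hdfs://hadoop-cluster/user/hive/warehouse/hive_schema.db/table_a' \
  '500 1500 hdfs://hadoop-cluster/user/hive/warehouse/hive_schema.db/table_b' \
  | du_summary
```

The ratio of the two totals should come out close to the cluster's replication factor; a noticeably different value may point to files written with a non-default replication setting.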
find and delete HIVE tables matching pattern with beeline
beeline -u $BEELINE_URL --showHeader=false --outputformat=tsv2 -e "show tables from $HIVE_SCHEMA like $PATTERN ;" | xargs -I '{}' beeline -u $BEELINE_URL --showHeader=false --outputformat=tsv2 -e " drop table $HIVE_SCHEMA.{} ;"
xargs reads items from standard input (stdin) and runs the given command one or more times with them. By default, blanks and newlines delimit the items and blank lines are ignored; with -I, as used here, each input line is passed as a single item and substituted for {}.
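Because the pipeline above drops tables, it may be worth previewing the generated statements first. This is a small sketch that swaps beeline for `echo`; the schema and table names are hypothetical and stand in for the output of the `show tables` query.

```shell
# Dry run: print the statement that would be generated for each table name
# instead of sending it to beeline. With -I, xargs runs the command once per
# input line and substitutes the whole line for {}.
HIVE_SCHEMA=my_schema   # hypothetical schema name

printf '%s\n' tmp_orders tmp_users tmp_events \
  | xargs -I '{}' echo "drop table $HIVE_SCHEMA.{} ;"
```

Once the printed statements look right, `echo` can be replaced by the real `beeline ... -e` call.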
HIVE: select column names based on a regular expression
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification
set hive.support.quoted.identifiers=none;
select `<regular expression>` from hive_table ;
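A pattern can be tried out against a list of column names before running the query. This is only a rough sketch: Hive uses Java regex syntax, while `grep -E` uses POSIX extended syntax, so it is an approximation for simple patterns. The `preview_columns` helper and the column names are hypothetical.

```shell
# Preview which column names a pattern would select.
# -x makes grep match the whole line, i.e. the whole column name.
preview_columns() {
  pattern=$1
  grep -E -x -- "$pattern"
}

# Hypothetical column list, e.g. pasted from "describe hive_table".
printf '%s\n' id price_usd price_eur created_at | preview_columns 'price_.*'
```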
hdfs dfs -checksum
https://community.hortonworks.com/questions/19239/hadoop-checksum-calculation-doubts.html
hdfs dfs -checksum <hdfs url>
<hdfs url> MD5-of-0MD5-of-512CRC32C 00000200000000000000000024c3cf9f64d08eaafeb25bb9776f793c
- Datanodes are responsible for verifying the data they receive before storing the data and its checksum
- When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanodes
- 'get' command : HDFS computes a checksum for each block of each file. The checksums for a file are stored separately in a hidden file. When a file is read from HDFS, the checksums in that hidden file are used to verify the file's integrity. For the get command, the -crc option also copies that hidden checksum file, and the -ignorecrc option skips checksum verification when copying.
- A separate checksum is created for every dfs.bytes-per-checksum bytes of data. The default is 512 bytes.
- Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified.
- "MD5-of-0MD5-of-512CRC32C" : http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201508.mbox/%3CCAMm20=5K+f3ArVtoo9qMSesjgd_opdcvnGiDTkd3jpn7SHkysg@mail.gmail.com%3E
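The per-chunk checksum rule above translates into a small storage overhead, and the third field of the `-checksum` output is what gets compared between two files. The sketch below assumes a 128 MiB block size and 4-byte CRC32C checksums, neither of which is stated in these notes; `same_checksum` is a hypothetical helper.

```shell
# Checksum overhead for one block, assuming a 128 MiB block size and
# 4 bytes per CRC32C checksum (both are assumptions, adjust as needed).
block_bytes=$((128 * 1024 * 1024))
bytes_per_checksum=512
checksums_per_block=$((block_bytes / bytes_per_checksum))
overhead_bytes=$((checksums_per_block * 4))
echo "checksums per block: $checksums_per_block"
echo "checksum bytes per block: $overhead_bytes"

# Compare two lines of `hdfs dfs -checksum` output by their third field
# (<path> <algorithm> <hex digest>). Succeeds when the digests are equal.
same_checksum() {
  [ "$(printf '%s\n' "$1" | awk '{ print $3 }')" = \
    "$(printf '%s\n' "$2" | awk '{ print $3 }')" ]
}
```

Digests are only comparable when the algorithm field is identical for both files, since that name encodes the checksum settings used.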
installing jupyter on windows
# On command prompt
C:\> python -m pip install jupyter
# This creates an entry in C:\Python36\Scripts
C:\Python36\Scripts\jupyter.exe
C:\>jupyter notebook
[I 12:36:39.808 NotebookApp] Serving notebooks from local directory: C:\
[I 12:36:39.808 NotebookApp] 0 active kernels
[I 12:36:39.808 NotebookApp] The Jupyter Notebook is running at:
[I 12:36:39.808 NotebookApp] http://localhost:8888/?token=1d3293c485f32b492a93cf6dae1088d51d6d4635dff7630d
[I 12:36:39.808 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 12:36:39.808 NotebookApp]
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8888/?token=1d3293c485f32b492a93cf6dae1088d51d6d4635dff7630d
[I 12:36:39.976 NotebookApp] Accepting one-time-token-authenticated connection from ::1
Test your Knowledge with Stack Overflow
https://medium.com/dunder-data/how-to-learn-pandas-108905ab4955
You don’t really know a Python library if you cannot answer the majority of questions on it that are asked on Stack Overflow. This statement might be a little too strong, but in general, Stack Overflow provides a great testing ground for your knowledge of a particular library. There are over 50,000 questions tagged as pandas, so you have an endless test bank to build your pandas knowledge.
If you have never answered a question on Stack Overflow, I would recommend looking at older questions that already have answers and attempting to answer them by only using the documentation. After you feel like you can put together high-quality answers, I would suggest making attempts at unanswered questions. Nothing improved my pandas skills more than answering questions on Stack Overflow.
Alter HIVE table name
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/Alter/UseDatabase
This statement lets you change the name of a table to a different name.
As of version 0.6, a rename on a managed table moves its HDFS location.
Rename has been changed as of version 2.2.0 (HIVE-14909) so that a managed table's HDFS location is moved only if the table is created without a LOCATION clause and under its database directory.
Hive versions prior to 0.6 just renamed the table in the metastore without moving the HDFS location.
ALTER TABLE old_table RENAME TO new_table;
Records of a renamed table are not visible in pyspark! The pyspark API reads the data location from a 'path' entry in the table's serde properties. The Hive rename command does not update this property, so a query from pyspark returns no records. The resolution is to alter the table and point the 'path' serde property at the new location.
alter table my_schema.new_table set serdeproperties ('path'='hdfs://hadoop-cluster/user/hive/warehouse/my_schema.db/new_table');
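The rename and the follow-up property fix can be generated together so the second step is not forgotten. This is a sketch under assumptions: the `rename_table_sql` helper is hypothetical, and `WAREHOUSE` assumes a managed table stored under the default warehouse directory used in the example above.

```shell
# Print the two statements for a rename that stays visible to pyspark.
# Arguments: schema, old table name, new table name.
# WAREHOUSE is an assumed warehouse root; adjust it to the cluster in use.
WAREHOUSE=hdfs://hadoop-cluster/user/hive/warehouse

rename_table_sql() {
  schema=$1 old=$2 new=$3
  printf 'ALTER TABLE %s.%s RENAME TO %s.%s;\n' "$schema" "$old" "$schema" "$new"
  printf "ALTER TABLE %s.%s SET SERDEPROPERTIES ('path'='%s/%s.db/%s');\n" \
    "$schema" "$new" "$WAREHOUSE" "$schema" "$new"
}

rename_table_sql my_schema old_table new_table
```

The printed statements can be reviewed and then passed to beeline with `-e`, in the same way as the other commands in these notes.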
The GIT stuff
commonly used commands in the order that I follow mostly :-)
https://www.youtube.com/watch?v=47uih9Tp6H8
https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow
git clone <git dir path>
git init
git pull
git checkout -b feature/XYZ # create and switch to a feature branch
git add . # . stages all changed files under the current directory
git status
git commit -m "comments"
git push origin <branch name>
git rm file1.txt
# But if you want to remove the file only from the Git repository and not from the filesystem, use:
git rm --cached ipmvp-nifi-custom.iml
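The difference between `git rm` and `git rm --cached` is easy to verify in a throwaway repository. This is a minimal sketch: the `demo_rm_cached` function and the `project.iml` file name are hypothetical, and the identity passed with `-c` exists only so the commit works on a machine without git configuration.

```shell
# Show that `git rm --cached` untracks a file but leaves it on disk.
# Runs in a temporary repository so nothing real is touched.
demo_rm_cached() (
  repo=$(mktemp -d) && cd "$repo" || exit 1
  git init -q .
  echo "local settings" > project.iml        # hypothetical file name
  git add project.iml
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "add file"
  git rm -q --cached project.iml
  [ -f project.iml ] && echo "still on disk"
  git ls-files --error-unmatch project.iml >/dev/null 2>&1 || echo "no longer tracked"
  rm -rf "$repo"
)
```

Calling `demo_rm_cached` should report that the file is still on disk and no longer tracked; adding the file name to `.gitignore` afterwards keeps it from being staged again.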
- If git pull asks you to specify the exact remote branch
Ex: There is no tracking information for the current branch. Please specify which branch you want to merge with.
git branch --set-upstream-to=origin/<branch> feature/XXX-100
- Create new branch and push to git
git branch feature/ABC-123
git checkout feature/ABC-123
git add .
git commit -m "bla bla"
git push origin feature/ABC-123
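- Clear local commits! (HEAD^ drops only the most recent one)
git reset --hard HEAD^ # discards the last commit and all uncommitted changes
git checkout develop
git pull
git checkout - # switch back to the previous (feature) branch
git reset --hard HEAD # only if you have to clean up local changes
git rebase develop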
- git submodule
https://blog.github.com/2016-02-01-working-with-submodules/
https://git-scm.com/book/de/v1/Git-Tools-Submodule
- git release with git-flow extensions
- release/merge feature branch to develop and master
git flow release start 0.1.0
git checkout master
git merge release/0.1.0
git flow release finish '0.1.0'
- 'tags' - once master is updated with new changes, it should be tagged with the updated version number.
- git setup the connection or 407 error
https://stackoverflow.com/questions/24907140/git-returns-http-error-407-from-proxy-after-connect
git config --global http.proxy http://username:password@proxyURL:proxyPort
# Disables TLS certificate verification; use only as a temporary workaround
git config --global http.sslVerify false
- Git Rebase
https://medium.com/@fredrikmorken/why-you-should-stop-using-git-rebase-5552bee4fed1
https://www.jetbrains.com/help/pycharm/apply-changes-from-one-branch-to-another.html
Rebase the feature branch onto develop, i.e. pull the updates from develop into the feature branch.
# on the feature branch
git checkout develop
git pull origin develop # update local develop
git checkout - # switch back to the feature branch
git pull --rebase origin develop # replay the feature commits on top of develop
git rebase --continue # only after resolving conflicts, to resume the rebase
git status
git flow
https://danielkummer.github.io/git-flow-cheatsheet/
- merge develop to master by creating release branch.
- version of new release is same as release branch name.
steps using git flow:
- git flow for feature branch
git flow init
git flow feature start MYFEATURE
git flow feature finish MYFEATURE
git flow feature publish MYFEATURE
git flow feature pull origin MYFEATURE
git flow feature track MYFEATURE
- git flow for release
git flow release start RELEASE
git flow release publish RELEASE
git flow release finish RELEASE
git push origin --tags
git merge conflicts
https://www.atlassian.com/git/tutorials/using-branches/merge-conflicts
- When a feature branch needs to be merged into a develop branch that has already been updated with other changes
Steps:
- first update feature from develop.
- resolve merge conflicts
- then merge feature into develop
- resolve any remaining merge conflicts
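The steps above can be rehearsed in a temporary repository. This is a sketch with hypothetical branch and file names; the `demo_merge_conflict` function and the `g` wrapper, which only supplies a commit identity, are illustrative and not part of git.

```shell
# Rehearse: update feature from develop, resolve, then merge into develop.
demo_merge_conflict() (
  repo=$(mktemp -d) && cd "$repo" || exit 1
  g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
  g init -q .
  g checkout -q -b develop
  echo "v1" > notes.txt && g add notes.txt && g commit -q -m "base"
  g checkout -q -b feature/XYZ
  echo "feature change" > notes.txt && g commit -q -am "feature edit"
  g checkout -q develop
  echo "develop change" > notes.txt && g commit -q -am "develop edit"

  # Step 1: update feature from develop; this is where the conflict shows up.
  g checkout -q feature/XYZ
  g merge -q develop >/dev/null 2>&1 || echo "conflict on feature"
  # Step 2: resolve by editing the file, then stage and commit the merge.
  echo "feature change + develop change" > notes.txt
  g add notes.txt && g commit -q -m "merge develop into feature"
  # Step 3: merge feature into develop; the conflict is already resolved.
  g checkout -q develop
  g merge -q feature/XYZ >/dev/null 2>&1 && echo "develop merged cleanly"
  rm -rf "$repo"
)
```

Resolving the conflict on the feature branch first keeps develop free of half-resolved merges, which is the point of the ordering in the list above.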
AWS CLI
Install CLI : https://docs.aws.amazon.com/cli/latest/userguide/awscli-install-windows.html#awscli-install-windows-path
Create a profile
C:\Users\vkbomb>aws configure --profile test
AWS Access Key ID []: XXX
AWS Secret Access Key []: YY
Default region name []: eu-central-1
Default output format [None]:
List S3 objects
C:\Users>aws s3 ls s3:
