Recently I have actively participated two team projects (PAWS and 2020datahack) which involves multiple team members and a lot of group decisions. For the first time, I realized how important to use github as the platform for code sharing and communication. Here I am going to share several commands that I frequently used at this process and hope it will help people quickly pick up this useful collaboration tool.
1. initiate a new repo at github
Repository, aka repo, is a collection of codes, data and documentation designated for project(s). As far as you have github account, you can create public repo(s) through github webpage easily. Follow the step 1-5 on this website, then you will create a repo with a few clicks. New repo usually comes with a README.md file. Using markdown format, you can describe the project in this README.md file which will be loaded to your repo main page when you open it. Here is the repo I created for Rladies-Philly PAWS projects.
2. Local vs. remote
One concept need to be clarified here is local vs. remote. Github is the most popular cloud-based service hosting repos. Those repo is managed by git installed at remote (aka, github here). So what is git? Git is a version control system that lets you manage and keep track of your source code history. It can also be installed at local and work as local version control system. In that case, your snapshot of each version will be saved at local instead of cloud.
Code
# to initiate repo at local# suppose you have a project working now called repoX# all scripts/data/documentation are saved in a folder in your computer called repoX. # Now you want to start git version control for this projectcd repoX/git initgit add -Agit commit -m "initiate version control"
The above code can be run on any shell-like terminal. And congrats, you have sucessefully create version control at local folder repoX/. git commit basically create a snapshot of this folder. If you want to change it back to this moment in the future, you can do it with commit number (it is hash code). It is important to write a meaningful message (like here “initiate version control”) to remind yourself what the snapshot is like. We will discuss how to recover using commit number in a little bit.
3. clone, pull and push
To communicate between your local and remote github, you can access through git by downloading (pull) remote to local and uploading (push) local repo to the cloud (github).
If you initiate repo from github first (step 1 above), you can first clone that repo to local. This repo will remember the remote address and allow you later pull from and push to the remote
Code
# here I use rladiesPHL/PAWS_return_noshow.git as my example# clone the repo to localgit clone https://github.com/rladiesPHL/PAWS_return_noshow.git
Anyone can clone a public repo to their local computer. However, to be able to pull and push, you need to be included as collaborators for that repo specially, or you are the repo owner yourself. To add someone as collaborator, follow the steps by clicking on the webpage. Once you are the owner/collaborator, you can do following command to download and upload.
Code
# initiate clone will create a folder at local called "PAWS_return_noshow", go to that foldercd PAWS_return_noshow/# pull (the update) from PAWS_return_noshow (since the clone remember the remote address)git pull# you can do your analysis, do your update at local now# when it is time to upload your analysis to cloud, you first want to take a snapshot of what you have done so fargit add -Agit commit -m "my update"# now you can push your analysis to the githubgit push origin master
4. fork
Above example is to push your analysis directly to the origin’s master branch. So what is origin? (what is the master branch will be explained in the next). Put it simple, you can consider origin as the place where is first downloaded. For example, I download PAWS_return_noshow repo from rladiesPHL account and the origin here will be rladiesPHL repo address (https://github.com/rladiesPHL/PAWS_return_noshow.git).
Code
# to quick check your repo remote origingit remote -v
Why is origin important? The origin determines which repo push and pull will go to/from. Some repo won’t allow you to push and pull because you are not the owner or collaborator. If you do not want request pull and push permission from the owner, you can fork the repo to your own github account. Here fork is like to clone a remote repo belonging to other poeple’s account at that snapshot to your own account. You can develope/make commits on repo without any push/pull permission obstables.
The easiest way to fork a repo is from webpage. You can follow the instruction on this help page.
Be aware, if you git clone the forked repo from your github account, the “origin” is your own account repo. This repo is functionally independent from the upstream repo, although at top of your own repo page it will show “This branch is X commits ahead of/behind XXX:master.” when the upstream repo makes commits after forking. What if you want your own “forked” repo remember where it comes so that you can merge the future changes from the upstream repo to your “forked” repo?
Code
# here I show an example of a forked repo at my own account (sckinta/datahack2020) linking back to the upstream account (CodeForPhilly/datahack2020)# add a repo description called "upstream" and associated this name with upstream repo. Here "upstream" can be any name (eg. up, ori, ...)git remote add upstream https://github.com/CodeForPhilly/datahack2020.git# check remote info again you will find now repo have two remote associated with it. one is called "origin" and another is called "upstream"git remote -v
To update your forked repo at github, you need three steps: 1) fetch the upstream repo to your local repo; 2) merge updated fetch content into the main branch at local; 3) push updated local to remote forked repo
Code
# fecth upstream to localgit fetch upstream# Merge the updated fetch content into the main branch at localgit merge upstream/master# Update, push to remote(fork) master branchgit push origin master
I highly suggest pull your forked repo to local first before fetch upstream. It will guarantee when you merge the upstream it will not cause the conflicts.
5. branch
After fork origin, another useful tool for collobarative project is using branch. “Branch”, as it is named, means a branch of analysis derived from the mainstream (which is by default named “master”). You can create branches on your own repo or the repo you have been invited as collaborator. This is the biggest difference between branch and a repo fork. To add a branch at local, using code git checkout.
Code
# Create and switch to a new branch (say, branch "chun")git checkout -b chun# go back to the mastergit checkout master
Now you can do your analysis in the repo fold. When you are ready to commit your new analysis, how will your repo know this analysis added to branch “chun”? Simple, using git checkout switch to chun branch and commit there. You can also push your new branch to remote, where the branch will show up under the //tree
Code
# for example I push my new analysis to branch "chun" and finally push it to sckinta/datahack2020# switch to branch chungit checkout chun# make your new commitgit add -Agit commit -m "new analysis"# push it to github branchgit push origin chun
If you want to continue on other collaborator’s branch (say “abc”), you can pull that branch to local.
Code
# download branch abc to your analysisgit pull origin abc# check how many branches current local repo containsgit branch
Code
# *chun# master# abc
After everybody did their analysis on their own branch, your group finally determine we are going to merge branch “abc” to master and delete the branch “abc”.
Code
# go to the master firstgit checkout master# merge branch "abc" ingit merge abc# delete old branchgit checkout -d abc
Occasionally, this process doesn’t go smoothly. Conflicts may occur when you try to merge multiple branches in. Then you may need advance tools like mergetool and opendiff. Here I won’t explain them. Please refer togit tutorial page for further reading. All the simple branch and merge has also been best explained on git tutorial.
6. Recover a certain commit
One major reason we want to use version control is that we can revert to a old snapshot/commit if we want. To check the commits done to the current repo, you can try git log. The log is reported in reverse chronical order.
# since we are currently at chun branch, we better go back to master where "3c659ec" is atgit checkout master# revert to a commitgit revert 3c659ec# the above command can also begit reset --hard 3c659ec
Remember all of above is only updated at local. If you want to make it show up at github, do a add, commit and push series.
7. link to your remote account at local
After introduce all above basic commands for git, the last thing I want to share is to set up the local git remote account. I probably shoud put it at #2.remote vs. local# part.
To globally set github account at local can save your effort to put account name and password everytime you want to push/pull to your own account.
Code
# for example I set global account as rladiesPHL. This will save the global configuration to a ~/.gitconfig file. It will prompt password for you to inputgit config --global user.email "philly@rladies.org"git config --global user.name "rladiesPHL"
Howver, sometimes I want to switch back to my personal account temperally to do a quick push. I wish git will prompt account and password for me to input
Code
# reset global account a little bitgit config --local credential.helper ""# when you push, it will prompt account and password for me to inputgit push origin master
All above are the frequently used git commands I used. Hope it will help anyone who is willing to use git version in their future project.