Git for Data Science
This document helps you get started using Git for data science projects. It is not a self-contained tutorial; it aims to build on existing excellent resources to help you make effective use of Git.
What is Git? Why?
Serious programming and scientific endeavors usually benefit from some kind of version control: a record of past versions of the project so that you can recover old code and know exactly what version of code or analysis scripts you are working with. Git also provides a very good way to ship code between computers, so you can move things from your local computer to a compute server.
Installing Git
We will need both git and an add-on package called git-lfs. On Debian or Ubuntu:
sudo apt-get install git git-lfs
On Red Hat-family distributions that use yum:
sudo yum install git git-lfs
For at least Windows, and likely other platforms, git and git-lfs are also available via Conda:
conda install git
conda install -c conda-forge git-lfs
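After installing, it may be worth confirming that both tools are on your PATH. A minimal check, assuming a POSIX-style shell:

```shell
# Print the installed Git version.
git --version

# git-lfs is a separate package, so check for it explicitly.
if git lfs version >/dev/null 2>&1; then
    git lfs version
else
    echo "git-lfs not found"
fi
```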
Introducing Git
There are many, many tutorials and resources to get started with Git. Here are a few:
Git has extensive built-in reference documentation, but it can be quite difficult to parse. Run e.g. git help commit to read it.
You may also find the Git manpage generator entertaining.
GUIs
Git is primarily a command-line tool. If you like a GUI, SourceTree works on Windows and Mac.
Directory Layout
It helps to have a planned layout for your projects. Particular tools may encourage different layouts, but I usually use something like this:
- Notebooks go in the root directory
- Input data goes in a directory called data
- Output files go in a directory called output (this may change depending on what other tools I use)
- Sometimes I have a directory called temp for temporary or scratch files
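As a rough sketch, this layout could be created from the shell as follows. The project name my-analysis is only a placeholder, not something this document prescribes.

```shell
# Hypothetical project name; substitute your own.
project="my-analysis"

# Notebooks live in the project root; data, output, and temp get their own directories.
mkdir -p "$project/data" "$project/output" "$project/temp"

# Start tracking the project with Git.
git init -q "$project"
```

Git does not track empty directories, so data, output, and temp will only show up in the repository once they contain files that are not ignored.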
Ignoring Files
We don’t want to commit everything to Git.
A key rule for happy Git usage: commit inputs, not outputs.
Data science work will break this rule a little bit, but in general we don’t want to commit our output files or temporary working files. We want to commit analysis code and documentation, and often final analysis output, but that’s it.
Git lets us tell it to ignore files by listing them in a file called .gitignore. In the top level of your project, create such a file with the following contents:
# ignore Jupyter checkpoint files
.ipynb_checkpoints
# if you use node.js, exclude the node libraries
node_modules
# exclude editor temp files
.*.swp
*~
# we usually don't want log files
*.log
# if our outputs go in a directory 'output', exclude it too
/output
# likewise for scratch files in 'temp'
/temp
Depending on the other tools you use, you may want to exclude additional files.
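To confirm that a pattern does what you expect, git check-ignore reports which rule, if any, matches a path. The sketch below assumes it is run in an empty scratch directory; the repository name ignore-demo and the file names are made up for illustration.

```shell
# Create a throwaway repository with two of the patterns from above.
git init -q ignore-demo
cd ignore-demo
printf '%s\n' '*.log' '/output' > .gitignore

# Hypothetical files: one output, one log, one piece of analysis code.
mkdir -p output
touch output/results.csv run.log analysis.py

# -v prints the .gitignore line that matched each ignored path.
git check-ignore -v output/results.csv run.log

# A path that is not ignored produces a non-zero exit status.
git check-ignore -q analysis.py || echo "analysis.py is not ignored"
```

Running git status in the same directory is a quick second check: ignored files should not appear in the list of untracked files.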
Working with Data
Git works well for small files, particularly source code; it is ok for images and small binary files as well.
However, it does not work as well on its own for large files, such as input data sets. For that, we have two options:
- Ignore data in .gitignore and manage it separately (e.g. have a script that downloads and unpacks it)
- Store data in git-lfs (Large File Storage)
Either of these works; which one we should use depends on the project and our available resources (Bitbucket provides 1GB of large file storage for free, which may not be enough if you have a lot of projects).
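For the first option, a small script can fetch and unpack the data so that collaborators can rebuild the data directory themselves. This is only a sketch: the function name fetch_data and the URL are placeholders, and it assumes the data is published as a gzipped tarball and that curl and tar are installed.

```shell
# fetch_data URL DIR: download a gzipped tarball and unpack it into DIR.
# The function name is made up for this sketch.
fetch_data() {
    url="$1"
    dir="$2"
    mkdir -p "$dir" || return 1

    # -f fails on HTTP errors, -sS hides progress but shows errors, -L follows redirects.
    curl -fsSL -o "$dir/dataset.tar.gz" "$url" || return 1

    # Unpack into the data directory, then drop the archive.
    tar -xzf "$dir/dataset.tar.gz" -C "$dir" || return 1
    rm -f "$dir/dataset.tar.gz"
}

# Example call with a placeholder URL (replace with the real location of your data):
# fetch_data "https://example.com/dataset.tar.gz" data
```

With this approach the script itself is committed, while the data directory is listed in .gitignore.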
Using git-lfs
To use git-lfs, you need to first install the software, and then set it up for your repository:
git lfs install
Then we track files with git-lfs prior to adding them:
git lfs track 'data/*.csv'
Then we can add the files:
git add data/*.csv
And see that they’re added to LFS:
git lfs ls-files
And finally commit:
git commit
When we git push, it will push the LFS files separately to the LFS store.
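One detail worth knowing: git lfs track records its patterns in a file called .gitattributes, which should be committed along with the data so that other clones apply the same rules. The sketch below uses a scratch repository to show this; the repository name lfs-demo and the file data/ratings.csv are made up for illustration, and the LFS steps are skipped if git-lfs is not installed.

```shell
# Scratch repository with one hypothetical data file.
git init -q lfs-demo
cd lfs-demo
mkdir -p data
printf 'user,item,rating\n1,10,4.5\n' > data/ratings.csv

if command -v git-lfs >/dev/null 2>&1; then
    # Set up the LFS hooks for this repository only.
    git lfs install --local

    # Quote the pattern so the shell does not expand it; this writes a rule to .gitattributes.
    git lfs track 'data/*.csv'

    # Stage the tracking rule together with the data.
    git add .gitattributes data/*.csv

    # Ask Git which filter applies to the file; an LFS-tracked path reports "filter: lfs".
    git check-attr filter data/ratings.csv
else
    echo "git-lfs is not installed; skipping LFS steps" >&2
fi
```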