Learning DVCS Workflow - 3

Recently, for work, I wrote out the steps to get from a Subversion (SVN) repository containing large files to a git repository that uses git Large File Storage (LFS) to house the big files separately.

A customer had used git svn to export a Subversion repository into a local git repository, and then, after adding their GitLab project as the remote origin, attempted to upload this to gitlab.com with:

git push -u origin --all

The push failed with HTTP 413 because it exceeded a 5GB maximum transfer limit: the SVN repository contained a very large file. One solution is to move such large files into LFS, which bypasses this limit.

It helps to read about LFS first: what it does and how it works. The announcement blog post from 2017 explains what LFS is and why git (and also GitLab, GitHub, Bitbucket and so on) use it, while GitLab's documentation is fairly technical. The quick summary is:

  • Git LFS is a technology to overcome git's shortcomings for working with large files. I also covered some of this in Learning DVCS Workflow - 2
  • It is an extension to git (much like git-svn is)
  • It works by "tracking" the "large file types" in a storage that is separate to the normal "blob" storage used by git to track changes
  • From then on, changes to these tracked files are treated by copying the whole file and not attempting to do diff processing on them. Instead git tracks the different versions with small "pointers" into the LFS
  • The main advantage of LFS is to improve git clone speed by only cloning the latest version of LFS files, and getting older versions only on request, by following the pointers
  • In GitLab, the 5GB file transfer limit is lifted for LFS files
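To make the "pointer" idea concrete, here is a minimal sketch that prints the pointer text for a given file: a version line, the SHA-256 of the contents, and the size in bytes, following the Git LFS pointer format. The function name lfs_pointer_for is made up for illustration, and the sketch assumes sha256sum from GNU coreutils is available.

```shell
# Sketch: print the text of a Git LFS pointer for a file.
# lfs_pointer_for is a hypothetical helper name, not part of git or git-lfs.
# Assumes sha256sum (GNU coreutils); macOS users would typically use `shasum -a 256`.
lfs_pointer_for() {
  file=$1
  # The object ID is the SHA-256 hash of the file contents.
  oid=$(sha256sum "$file" | cut -d ' ' -f 1)
  # Size in bytes; tr strips the padding some wc implementations add.
  size=$(wc -c < "$file" | tr -d ' ')
  printf 'version https://git-lfs.github.com/spec/v1\n'
  printf 'oid sha256:%s\n' "$oid"
  printf 'size %s\n' "$size"
}
```

Git stores this three-line text in place of the real file, which is why a pointer stays tiny no matter how big the file is.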

Migrating SVN to git

Follow these high-level steps:

  1. Convert the SVN repository into a new git repository (see the manual for git svn)
  2. On gitlab.com, create a new, empty project to push into
  3. Add the GitLab project as the origin remote
  4. Make sure to enable LFS in the project's settings
  5. Install the Git LFS client on your workstation(s)
  6. (the tricky part) locate the large files which came from SVN
  7. Use git-lfs-migrate to move the large files and their history over to git LFS
  8. git push -u origin --all
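Steps 1 and 3 can be sketched as a small shell function. This is only an outline under some assumptions: the function name and its three arguments are placeholders, and --stdlayout assumes the SVN repository uses the conventional trunk/branches/tags layout, which may not match yours.

```shell
# Sketch of steps 1 and 3. svn_to_git_setup and its arguments are
# hypothetical names used for illustration only.
svn_to_git_setup() {
  svn_url=$1      # URL of the SVN repository root (placeholder)
  gitlab_url=$2   # URL of the new, empty GitLab project (placeholder)
  workdir=$3      # local directory for the new git repository

  # Step 1: convert SVN history into a local git repository.
  # --stdlayout assumes the usual trunk/branches/tags structure.
  git svn clone --stdlayout "$svn_url" "$workdir" || return 1

  # Step 3: add the GitLab project as the origin remote.
  git -C "$workdir" remote add origin "$gitlab_url"
}
```

The push in step 8 is deliberately left out, since it should only happen after the LFS migration.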

Step 7, using git lfs migrate, is what git literature usually calls a "history rewrite". That can disrupt shared git workflows, but since we are migrating from Subversion and no one shares the git repository yet, we do not need to be concerned. It is still worth getting this migration done early, to avoid trouble later.

Let's look at steps 5, 6 and 7 in detail.

Locate large files which came from SVN

This Stack Overflow answer has a nice shell one-liner that uses git to list all the files in a repository, sorted by size:

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

Use it to see which are the big files. There may be a pattern in the files it finds, such as media files (with names ending in .jpg, .wav, .mp4, or .png).
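If the list is long, it can be easier to spot the pattern by totalling sizes per file extension. The following is a rough sketch, not part of the Stack Overflow answer: size_by_extension is a made-up helper that reads "objectname size path" lines (the output of the pipeline above up to and including the sed step) and prints total bytes per extension, largest first.

```shell
# Sketch: sum blob sizes per file extension.
# size_by_extension is a hypothetical helper; it expects lines of the form
# "<objectname> <size> <path>" on standard input.
size_by_extension() {
  awk '{
    size = $2
    # Rebuild the path from field 3 onward, since paths may contain spaces.
    path = $3
    for (i = 4; i <= NF; i++) path = path " " $i
    # Take the last path component, then the text after its last dot.
    n = split(path, parts, "/")
    base = parts[n]
    ext = "(none)"
    # RSTART > 1 keeps dotfiles such as .gitignore in the "(none)" bucket.
    if (match(base, /\.[^.]+$/) && RSTART > 1) ext = tolower(substr(base, RSTART))
    total[ext] += size
  }
  END { for (e in total) printf "%.0f %s\n", total[e], e }' | sort -rn
}

# Possible usage inside a repository:
# git rev-list --objects --all |
#   git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#   sed -n 's/^blob //p' | size_by_extension
```

Note that this counts every historical version of a file, so a frequently changed binary will weigh more than its current size suggests.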

Use this to build a list of "file types" to track, then tell git lfs about them. First, let's install the lfs extension:

git lfs install

This only needs to be done once per workstation, but all developers sharing the new git repo will need it, along with the git command line tool.
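Since every developer needs the client, a quick check like the one below may be handy in a setup script. It is a small sketch: check_git_lfs is a hypothetical name, and it only tests whether the git-lfs binary is on the PATH.

```shell
# Sketch: report whether the Git LFS client is installed.
# check_git_lfs is a hypothetical helper name.
check_git_lfs() {
  if command -v git-lfs >/dev/null 2>&1; then
    # Print the client version so mismatches between workstations are visible.
    echo "git-lfs found: $(git-lfs version)"
  else
    echo "git-lfs not found on PATH; install the client first"
  fi
}

check_git_lfs
```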

Now, tell git lfs to track the file types you found. For instance, this tracks common image formats and compressed files:

git lfs track "*.jpg" "*.jpeg" "*.gif" "*.png" "*.zip" "*.gz"  

The git lfs track command creates (or updates) a .gitattributes file, which git uses for future commits involving these types of files. Add that file and commit it now:

git add .gitattributes
git commit -m "Track large binary files in LFS"

This doesn't move the existing files into LFS, though; that is the next step.

Migrate the tracked files' history

Having identified which files need to be moved, you can use git lfs migrate to move them.

Start with a dry run of the migration. Use the --everything option to cover all branches, and the --include option to migrate only the file types you passed to git lfs track.

git lfs migrate info --everything --include="*.jpg,*.jpeg,*.gif,*.png,*.zip,*.gz"

This reports which file types match and how much data would move into LFS. If this checks out, perform the migration:

git lfs migrate import --everything --include="*.jpg,*.jpeg,*.gif,*.png,*.zip,*.gz"
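Typing the same pattern list for git lfs track and again for --include invites the two drifting apart. One possible way around that, sketched here with a made-up helper called lfs_include_list, is to build the --include value from the patterns already recorded in .gitattributes. It assumes each tracked line has the form that git lfs track writes, with the pattern first and filter=lfs among the attributes.

```shell
# Sketch: build a comma-separated --include value from .gitattributes.
# lfs_include_list is a hypothetical helper; the optional argument is the
# attributes file to read (defaults to .gitattributes in the current directory).
lfs_include_list() {
  # Keep only lines that carry the filter=lfs attribute, print the pattern
  # (first field), then join the patterns with commas.
  awk '/(^|[ \t])filter=lfs([ \t]|$)/ { print $1 }' "${1:-.gitattributes}" |
    paste -s -d , -
}

# Possible usage:
# git lfs migrate info --everything --include="$(lfs_include_list)"
```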

Now, when you push to GitLab, the large files bypass the 5GB restriction and go to LFS storage. And because the patterns are tracked in .gitattributes, any later changes to these files, or new files matching the patterns, will also go to LFS.