Learning DVCS Workflow - 3
Recently, for work, I wrote out the steps for getting from a Subversion (SVN) repository containing large files to a git repository that uses git Large File Storage (LFS) to house the big files separately.
A customer had used git svn to export a Subversion repository into a local git repository, and then, after adding their GitLab project as the remote origin, attempted to upload this to gitlab.com with:
git push -u origin --all
It failed with HTTP 413 because of a 5GB max transfer limit. There is a very large file from the SVN repository! One solution is to move these large files into LFS, which bypasses this limit.
It may help to read about LFS first: what it does and how it works. The announcement blog post from 2017 explains what LFS is and why git (and also GitLab/GitHub/Bitbucket and so on) use it, while GitLab's documentation is fairly technical. The quick summary is:
- Git LFS is a technology to overcome git's shortcomings for working with large files. I also covered some of this in Learning DVCS Workflow - 2
- It is an extension to git (much like git-svn is)
- It works by "tracking" the "large file types" in a storage that is separate to the normal "blob" storage used by git to track changes
- From then on, changes to these tracked files are treated by copying the whole file and not attempting to do diff processing on them. Instead, git tracks the different versions with small "pointers" into the LFS
- The main advantage of LFS is to improve git clone speed by only cloning the latest version of LFS files, and getting older versions only on request, by following the pointers
- In GitLab, the 5GB file transfer limit is lifted for LFS files
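To make the "pointer" idea concrete, here is a minimal sketch of what git stores in place of a tracked file: a tiny text file holding a spec version line, the SHA-256 of the real content, and its size in bytes. The function name lfs_pointer_sketch is my own, and the sketch assumes sha256sum from GNU coreutils is available. With the LFS client installed, git lfs pointer --file=<path> should produce the real thing.

```shell
# Illustrative only: print an LFS-style pointer for the file given as $1.
# lfs_pointer_sketch is a hypothetical helper name, not part of git or git-lfs.
lfs_pointer_sketch() {
  # Object ID: SHA-256 of the file contents (assumes GNU sha256sum).
  oid=$(sha256sum "$1" | cut -d ' ' -f 1)
  # Size of the real file in bytes.
  size=$(wc -c < "$1" | tr -d ' ')
  printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size"
}

# Usage:
#   lfs_pointer_sketch path/to/large-file.mp4
```

However big the original file is, the pointer stays at three short lines, which is why the normal git history stays small.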
Migrating SVN to git
Follow these high-level steps:
1. Convert the SVN repository into a new git repository (see the manual for git svn)
2. On gitlab.com, create a new, empty project to push into
3. Add the GitLab project as the origin remote
4. Make sure to enable LFS in the project's settings
5. Install the Git LFS client on your workstation(s)
6. (the tricky part) Locate the large files which came from SVN
7. Use git-lfs-migrate to migrate the large files, and their history, over to git LFS
8. git push -u origin --all
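As a rough sketch of the conversion and the remote setup from the list above, the two small functions below wrap the commands involved. The URLs and function names are placeholders of mine, and --stdlayout assumes the SVN repository uses the conventional trunk/branches/tags layout; drop it if yours does not.

```shell
# Hypothetical locations; substitute your own.
SVN_URL="https://svn.example.com/repos/project"
GITLAB_URL="git@gitlab.com:example-group/project.git"

# Convert the SVN history into a new git repository in directory $2.
# --stdlayout assumes a trunk/branches/tags layout in SVN.
convert_svn() {
  git svn clone --stdlayout "$1" "$2"
}

# Point the git repository in directory $1 at the remote URL $2.
add_origin() {
  git -C "$1" remote add origin "$2"
}

# Usage:
#   convert_svn "$SVN_URL" project-git
#   add_origin project-git "$GITLAB_URL"
#   git -C project-git remote -v    # confirm the remote before pushing
```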
Step 7, using git lfs migrate, is usually discussed in git literature as a "history rewrite", which can cause disruption in shared git workflows. Since we are migrating from Subversion and no-one shares the git repository yet, we do not need to be concerned. It is still worth getting this migration done early, to avoid potential trouble in the future.
Let's look at steps 5, 6 and 7 in detail.
Locate large files which came from SVN
This Stack Overflow answer has a nice shell one-liner that uses git to list all the files in a repository, sorted by size:
git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sed -n 's/^blob //p' | sort --numeric-sort --key=2 | cut -c 1-12,41- | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
Use it to see which are the big files. There may be a pattern in the files it finds, such as media files (with names ending in .jpg, .wav, .mp4, or .png).
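If the list is long, it can be easier to spot the pattern by totalling sizes per file extension. The function below is a sketch of that idea, not something from the Stack Overflow answer: sizes_by_extension is a name I made up, and it treats everything after the last dot in the file name as the extension (so a dotfile such as .gitattributes counts as its own "extension").

```shell
# Sum blob sizes per file extension across all history, largest first.
# Run inside the repository; prints "<total bytes> <extension>" per line.
sizes_by_extension() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" {
      path = $0
      sub(/^blob [0-9]+ /, "", path)   # keep the full path, spaces included
      n = split(path, parts, "/")
      name = parts[n]                  # file name without directories
      ext = "(none)"
      if (match(name, /\.[^.]+$/)) ext = substr(name, RSTART)
      total[ext] += $2
    }
    END { for (e in total) printf "%.0f %s\n", total[e], e }' |
    sort -rn
}

# Usage:
#   sizes_by_extension | head
```

The extensions at the top of the output are the natural candidates for LFS.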
Use this to build a list of "file types" to track, then tell git lfs about them. First, let's install the lfs extension:
git lfs install
This only needs to be done once per workstation, but all developers sharing the new git repo will need it, along with the git command line tool.
Now, tell git lfs to track the file types you found. For instance, this tracks common image formats and compressed files:
git lfs track "*.jpg" "*.jpeg" "*.gif" "*.png" "*.zip" "*.gz"
The git lfs track command creates (or updates) a .gitattributes file that is used for future commits involving these types of files. Add that file and commit it now:
git add .gitattributes
git commit -m "Track large binary files in LFS"
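Each tracked pattern ends up in .gitattributes as a line like *.png filter=lfs diff=lfs merge=lfs -text. To check that a given file name would be picked up, plain git can evaluate the attributes for you. Here is a small sketch using git check-attr; the is_lfs_tracked name is my own invention.

```shell
# Succeed if the path in $1 would go through the LFS filter per .gitattributes.
# The path does not need to exist; git only evaluates the patterns.
# is_lfs_tracked is a hypothetical helper name.
is_lfs_tracked() {
  [ "$(git check-attr filter -- "$1")" = "$1: filter: lfs" ]
}

# Usage (run inside the repository):
#   is_lfs_tracked photo.png && echo "photo.png will go to LFS"
#   is_lfs_tracked notes.txt || echo "notes.txt stays in normal git storage"
```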
Tracking the patterns doesn't move the existing files into LFS, though; that is the next step.
Migrate the tracked files' history
Having identified which files need to be moved, you can use git lfs migrate to move them.
First, do a dry run of the migration. Use the --everything option to cover all branches, and the --include option to only migrate the files identified for git lfs track:
git lfs migrate info --everything --include="*.jpg,*.jpeg,*.gif,*.png,*.zip,*.gz"
This reports which file types match and how much data they account for, without changing anything, so you can see what will be moved into LFS. If this checks out, perform the migration:
git lfs migrate import --everything --include="*.jpg,*.jpeg,*.gif,*.png,*.zip,*.gz"
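The --include list has to match the patterns given to git lfs track, and it is easy for the two to drift apart. One way to avoid retyping it, sketched below, is to derive the list from .gitattributes. The lfs_include_list function is a hypothetical helper of mine, and it assumes every LFS pattern sits on its own line containing filter=lfs, which is how git lfs track writes them.

```shell
# Build a comma-separated --include value from the LFS patterns
# recorded in a .gitattributes file (defaults to ./.gitattributes).
# lfs_include_list is a hypothetical helper name.
lfs_include_list() {
  awk '/filter=lfs/ { print $1 }' "${1:-.gitattributes}" | paste -sd, -
}

# Usage:
#   git lfs migrate info --everything --include="$(lfs_include_list)"
#   git lfs migrate import --everything --include="$(lfs_include_list)"
```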
Now, when pushing to GitLab, the large files will bypass the 5GB restriction and go to LFS storage. Also, because the patterns are tracked in .gitattributes, any changes to these files, or new files matching the patterns, will also go to LFS.