Learning DVCS Workflow - 2

I hacked on my web site project this Easter long-weekend, and learnt how to split the existing repository into separate projects, and then glue it back together again.

I also learnt about Git Large File Storage (LFS), how to set it up, and how to migrate certain file types to use this for more efficient handling of binary files.


My website is made up of several things:

  • A blog, which is generated using the Nikola static blog generator
  • Some pages which are not blogs, also generated using Nikola
  • Some photos, which Nikola automatically generates a viewer for
  • Some web hacks to play with learning JavaScript, CSS, and hand-written HTML
  • A mirror of the Jargon File

Since the beginning it has all been a single project, and since 2015, when I moved it into Nikola, it has all been a single git "monorepo". I just dumped the files that Nikola should copy as-is into /files/ and left everything in the same Git repository. That meant that when I moved from GitHub to GitLab last year it was a single project to import, and everything was fabulous.

Except that it's all a jumble of stuff in the git history.

So, to keep these separate things apart, I decided that I would learn about using git for one of the things that it's famous for: rewriting history.

Splitting a project into parts

There is a really good extension for git called git filter-repo, which is a modern, fast replacement for git filter-branch and older tools like the BFG Repo-Cleaner. I used it to break out the separate parts of my milosophical-me project, according to their file paths.

First, I cloned the project from GitLab, and then I made two more copies locally (rsync -HPvax). Then in each repository copy, I filtered just the paths I wanted to keep:

git filter-repo --subdirectory-filter files/jargon/
git filter-repo --subdirectory-filter files/hax/
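
For completeness, the clone-and-copy preparation before this filtering looked roughly like the following; the clone URL and local directory names are my assumptions, based on the project and group names used later:

git clone https://gitlab.com/milohax-net/milosophical-me.git
rsync -HPvax milosophical-me/ jargon/     # copy for the Jargon File mirror
rsync -HPvax milosophical-me/ web-hax/    # copy for the web hacks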

For the copy of the repository that was to be my Nikola site, I instead removed those same paths, keeping everything else:

git filter-repo --path files/hax/ --path files/jargon/ --invert-paths
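
A quick sanity check in each copy (just a sketch) is to list what git still tracks and confirm only the intended paths remain:

git ls-files | head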

So now I have three different local git repositories, each with just the files and commits for its own project. I also cleaned up the branches for the hax and jargon projects:

git branch -m src master
git branch -d hax

All that remained was to make two new Projects in GitLab and import these repositories. I put them into my own Group, milohax-net.
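
One way to do that import (a sketch, assuming you push from the local copies rather than use GitLab's import tools; the remote URL matches the project names used below) is:

git remote add origin https://gitlab.com/milohax-net/jargon.git    # or set-url, if origin still exists
git push -u origin master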

Putting the parts back together again

I can't publish these projects to a package registry as some kind of code module to include; instead, their files must be copied into my milosophical-me project.

Traditionally, the way to do this is with git sub-modules. But I've had bad experiences with sub-modules, and I don't want to go there again if I don't have to. Basically they're broken, in my opinion.

Next I learnt about git subtrees. These are a bit better, but you still have to be careful with them.

Finally I learnt about git sub-repos. These are excellent, because using them requires nothing special, not even installing the plugin (you do need the plugin to pull updates from the sub-repo, or to push to it from your super-repo, but not just to clone and use the super-repo).

So with sub-repos, you just clone them in and the whole history is "squashed" (as with subtrees), but merging and rebasing in the super-repo Just Works™. You don't have to forever manage which ref you linked, as with sub-modules, or be careful about what you commit, as with sub-trees.

Just do this from the root of the super-repo:

git subrepo clone https://gitlab.com/milohax-net/web-hax.git files/hax
git subrepo clone https://gitlab.com/milohax-net/jargon.git files/jargon

The relationship to the sub-repo is still tracked. This means that later on, to update one of them (say, if I make a new hack in web-hax), I can pull the latest into my main repo:

git subrepo pull files/hax/

(you can also specify a tag or a ref, rather than just the latest)
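
For example, to pull a particular branch, or to push changes made under files/hax/ back up to the web-hax project, a sketch would be:

git subrepo pull files/hax -b master
git subrepo push files/hax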

Working with Large File Storage

Git was traditionally designed for working with text files, such as computer source code. It doesn't work well with binary files like images or audio. If you have a lot of images in a git repository, then when you clone it, every version of every image is downloaded so that you can visit their history. That makes the clone very large and very slow, and it's usually a waste, because old versions of images are rarely needed.

Git Large File Storage (LFS) was invented to work efficiently with these files. It keeps a "pointer" to each file in the git repository, and lazy-loads the actual content from remote storage. You only download the version that you ask for (typically the latest), which is cached under a hashed name in .git/lfs/ and copied into your working copy on checkout.
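
The normal, planned-ahead setup is just a couple of commands; this is a sketch, and the JPEG pattern is only an example:

git lfs install                        # install the LFS filters and hooks
git lfs track "*.jpg"                  # writes the pattern into .gitattributes
git add .gitattributes
git commit -m "Track JPEG images with LFS"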

You're expected to plan ahead like that, installing LFS in your repository and tracking certain file types before you commit them. But what if you didn't do that…?

Well not to worry, it's very simple to fix:

git lfs migrate import --everything --include="*.jpg,*.jpeg,*.gif,*.png,*.zip,*.gz"

Now, as well as tracking all future files of these types, the existing ones are migrated to LFS. Note that this command rewrites your commit history; if you don't want that, read the man page.
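
Because the history changes, the rewritten branch has to be force-pushed back to the remote afterwards; a sketch, assuming the master branch from earlier:

git push --force origin master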