How Microsoft’s Git fork scales for massive monorepos

Microsoft has rolled its Git Virtual File System and Scalar optimizations into a fork of Git designed to support enormous repos and large distributed teams.

Building applications at scale is nothing compared to building an operating system like Windows, especially when it comes to source code control. How do you manage the repository (or repositories) for such a software behemoth, with thousands of developers and testers, and with a complex build pipeline that’s continuously delivering fresh code?

Microsoft’s history with internal source control systems is convoluted. You might think it used the now discontinued Visual SourceSafe, but that was most appropriate for local file systems and smaller projects. Instead, Microsoft used many different tools over the years, initially an internal fork of the familiar Unix Revision Control System, before standardizing on Perforce Source Depot.

Git hits a wall in Redmond

Meanwhile some parts of the business used Visual Studio’s Team Foundation Server, before switching to using Git as the foundation of a common engineering platform for the entire company. Team Foundation Server supported Git, and the mix of a visual tool and the command line supported lots of different use cases across Microsoft.

That shift made a lot of sense, as Git was designed to deal with the complexities of managing an enormous code base with a huge number of globally distributed developers. It’s not surprising that there are a lot of similarities between how Windows and Linux are built, and Git has features that work well for both.

However, there’s one big problem for a massive repository like Windows. For all their complexity and many moving parts, products like Windows and Office are developed in single repositories, massive monorepos that take up vast amounts of storage space, some 300GB and 3.5 million files for Windows alone. The problem stems from how Git treats repositories: replicating them, and every change, to every single copy. For Windows, the size of the repo would quickly overwhelm developer PCs and clog up the developer network.

Enter GVFS – the Git Virtual File System

A massive repo might be workable if all your developers worked on a single ultrafast communications network and high-speed storage network, but it certainly isn’t when you’re a globally distributed team that mixes offices and home workers. Microsoft needed to develop a way to treat a Git repository as a virtual file system, creating local files only when they’re needed, instead of copying the entire repository over an unknown network.

The resulting tool balances the capabilities of Git with Microsoft’s development needs. It doesn’t change Git at all, though it sacrifices Git’s offline capabilities. That was a good decision, back when the vast majority of Microsoft’s developers worked in Redmond.

Git Virtual File System, GVFS, which ships as a Windows file system driver, is designed to monitor your working directory and your .git folder, pulling down only what’s needed for the work you’re doing and checking out only the files you need. You can still see the contents of the repository, as if it were an extension of your PC’s file system, much like the way OneDrive files are downloaded only when you explicitly select them.
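To make the idea concrete, here is a minimal Python sketch of on-demand file creation. It is a toy model rather than a description of how the GVFS driver is built: the class name LazyWorkingTree, the fetch callback, and the sample paths are all hypothetical. It shows only the behavior described above, where every path is visible up front but content is downloaded the first time a file is read.

```python
from typing import Callable, Dict, List


class LazyWorkingTree:
    """Toy model of a virtualized working directory (hypothetical, not GVFS code)."""

    def __init__(self, index: List[str], fetch: Callable[[str], bytes]) -> None:
        self._index = set(index)             # every path in the repo is known up front
        self._fetch = fetch                  # assumed callback that downloads one file
        self._local: Dict[str, bytes] = {}   # only files that have actually been read

    def list_paths(self) -> List[str]:
        """Listing the tree never triggers a download."""
        return sorted(self._index)

    def read(self, path: str) -> bytes:
        """Download a file on first access, then serve it from local storage."""
        if path not in self._index:
            raise FileNotFoundError(path)
        if path not in self._local:
            self._local[path] = self._fetch(path)
        return self._local[path]

    def hydrated(self) -> List[str]:
        """Paths whose content exists locally."""
        return sorted(self._local)


# Stand-in for a network fetch; records which paths were requested.
fetch_log: List[str] = []


def fake_fetch(path: str) -> bytes:
    fetch_log.append(path)
    return f"contents of {path}".encode()


tree = LazyWorkingTree(["src/kernel.c", "src/shell.c", "docs/readme.md"], fake_fetch)
tree.read("src/kernel.c")
tree.read("src/kernel.c")  # second read is served locally, no new fetch
```

After the two reads, all three paths are still listed, but only src/kernel.c has been fetched, and it was fetched once.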

As Microsoft began using GVFS it noticed various edge cases that showed that Git was doing unnecessary work on files, so its engineers moved to providing fixes for these issues to the Git project. These fixes were designed to improve Git performance for large repositories, allowing Microsoft to shift to one enormous internal monorepo for source control.

Scaling up Git with Scalar

Things didn’t stop there. Now we’re on the third public version of Microsoft’s work on scaling Git, this time as part of the company’s own fork of Git—a special-purpose Git distribution designed to support monorepos.

The current release builds on work released in 2020 as Scalar. Scalar is an application that accelerates any Git repository, no matter where it’s hosted. It requires Microsoft’s own custom Git implementation, though the long-term aim is for much of the necessary server-side code to become part of the official Git release. Scalar is an opinionated tool, with a focus on improving Git performance.

Scalar is a .NET command line application that runs in the background, managing registered repositories. You can use it alongside GVFS, or as a stand-alone accelerator, taking advantage of recent Git features. Microsoft uses Scalar with GVFS internally, placing cache servers between its repositories and developer PCs. GVFS isn’t essential for Scalar, but it certainly helps.

Once installed and running, Scalar can be used alongside a traditional Git client, cloning repositories using a local cache or a remote cache server and managing your local repository. The default is to make a sparse checkout, which allows Scalar to, as Microsoft put it in the announcement blog post, “focus on the files that matter.”

Scalar sets up the local clones, then developers can use Git as normal. This is handled by offering a tiered approach to file management: a high-level index of all the files in a repository (which can be many millions), a sparse working directory of the files you might need for the task you’re working on, and finally a set of the files you have modified.
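The three tiers can be pictured with a small Python sketch. This is an illustration under simplifying assumptions, not Scalar’s data structures: the TieredView class, its field names, and the sample paths are hypothetical, and the sparse tier is modeled as a plain directory-prefix match.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class TieredView:
    """Toy model of the three tiers (hypothetical names, not Scalar internals)."""

    index: Set[str]                                   # tier 1: every tracked path
    cones: List[str]                                  # directories chosen for the sparse checkout
    modified: Set[str] = field(default_factory=set)   # tier 3: files you have changed

    def working_set(self) -> Set[str]:
        """Tier 2: indexed paths that fall under a selected directory."""
        prefixes = tuple(c.rstrip("/") + "/" for c in self.cones)
        return {p for p in self.index if p.startswith(prefixes)}

    def edit(self, path: str) -> None:
        """Record a change; only files in the sparse working set can be edited."""
        if path not in self.working_set():
            raise ValueError(f"{path} is outside the sparse working directory")
        self.modified.add(path)


view = TieredView(
    index={"base/a.c", "base/b.c", "shell/ui.c", "office/word.c"},
    cones=["base"],
)
view.edit("base/a.c")
```

Here the index knows about four files, the working set holds the two under base/, and the modified set holds one. Each tier is smaller than the last, which is why everyday operations can stay fast even when the index runs to millions of entries.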

Managing Git in the background

Much of Scalar’s work happens in the background, so that features like Git’s garbage collection don’t block commits when rewriting and updating files. Scalar does this by setting key Git configurations to avoid foreground operations. You still use Git as you normally do, but what could be both processor-intensive and network-intensive repository maintenance operations are handed off to the background Scalar process, where they can operate at a lower priority without affecting the work you’re doing.
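As a rough illustration of what “setting key Git configurations” can look like, the Python sketch below builds the git config commands for two documented Git options that switch off automatic housekeeping after ordinary commands: gc.auto set to 0 and maintenance.auto set to false. Treat the selection as an assumption, since the full list of settings Scalar applies isn’t covered here. The config_commands helper and the repository path are hypothetical.

```python
from typing import Dict, List

# Illustrative settings only; not a full or authoritative list of what Scalar sets.
# Both keys are standard Git configuration options:
#   gc.auto=0              disables automatic garbage collection
#   maintenance.auto=false stops commands from running maintenance when they finish
BACKGROUND_FRIENDLY: Dict[str, str] = {
    "gc.auto": "0",
    "maintenance.auto": "false",
}


def config_commands(settings: Dict[str, str], repo_path: str) -> List[List[str]]:
    """Build one `git -C <repo> config <key> <value>` argument list per setting."""
    return [
        ["git", "-C", repo_path, "config", key, value]
        for key, value in settings.items()
    ]


# Hypothetical repository path, used only to show the generated commands.
commands = config_commands(BACKGROUND_FRIENDLY, "/repos/big")
for argv in commands:
    print(" ".join(argv))
```

The sketch only builds the commands and doesn’t run them. With foreground housekeeping switched off, the same work still has to happen somewhere, which is the role of the background process described above.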

With a set of indexes managing your working directory, Scalar uses GVFS to clone repositories using only the root files, downloading additional files as needed. Files are stored inside a scalar directory, with the working directory in a src subdirectory. This file structure lets you manage builds and branches locally.

Microsoft’s work on Scalar has led to it shipping its own Git distribution with the Scalar CLI. You can find releases of Microsoft’s Git for Windows, macOS, and Linux (as a Debian package, with other distributions needing to compile from source). There’s also a portable Windows version. Microsoft is now calling its features “advanced Git features,” an approach that makes sense of the work it’s doing to prove how Git can work at massive scale.

If you want to try it out, you first need to set up your own Git server, ready to host your own repositories. You can use familiar Git tools to get running, storing code and artifacts, before switching to Scalar and GVFS. Although Scalar will work with other Git implementations, you should look for one that supports the partial clone option, which is the official alternative to GVFS.

The current version of Microsoft Git includes server-side enhancements to ensure that massive monorepos behave much like smaller repositories, without requiring additional tooling to construct builds from multiple sources.

Why Scalar?

You can think of Scalar as a proving ground for the direction Microsoft would like Git to go. Forking Git allows the company to try these features out before it offers them back to the wider Git community. It’s a reasonable approach that makes the code available to the community to evaluate before anyone makes a pull request.

With so many projects, communities, and companies relying on Git, it’s crucial that changes don’t break things for its millions of users and the billions of lines of code hosted in repositories all across the world. Not everyone needs the tools in Scalar and GVFS, but Microsoft certainly does, and other projects may well need similar features down the line.

Big open standards projects like JavaScript and HTML work by demonstrating that the major downstream platforms support the project’s planned new features before they’re committed to specifications, hiding them behind feature flags for testing. Microsoft’s approach to Git is similar.

It allows Microsoft to reap the benefits of these new features in its own fork, while the rest of us continue using our own Git installs or cloud-based Git services, without having to worry about Scalar and how it works until it’s part of the platform. Then the transition is as easy as running an update on a server.

Copyright © 2024 IDG Communications, Inc.