Mercurial > CVu-Mercurial

\documentclass[a4paper]{article}
\usepackage{pslatex}
\usepackage{url}

\usepackage[pdftex]{hyperref}
\hypersetup{
	pdfauthor={Jim Hague},
	pdftitle={Inside a distributed version control system},
	colorlinks}

\newcommand{\standout}[1]{
        {\begin{center} \large \textbf{#1} \end{center}}
}

\setlength{\parskip}{2mm}
\setlength{\parindent}{0mm}

\begin{document}
\title{Inside a distributed version control system}
\author{Jim Hague\\
        \texttt{jim.hague@acm.org}}
\date{May 2009}
\maketitle

\section{Preamble}
Grinton Lodge is a Youth Hostel that sits on an exposed hillside just
above the small hamlet of Grinton in Swaledale, in the Yorkshire Dales
National Park. A former Victorian shooting lodge, it now welcomes
walkers and other travellers from around the world.

Tonight, a Wednesday in mid-November, is not one of its busiest
nights. Kat, the duty staff member, tells me that there is a small
corporate team-building group in the annex. There's no sign of them at
present. Otherwise, that portion of the world that has beaten a path
to the door of this grand building today consists of just me. And Kat
goes home soon.

The November CVu, removed from its wrappers and read yesterday, lies
in my bag. Taunting me. Go on, it says, if you've ever going to put
finger to keyboard in the name of CVu, well, tonight you are out of
excuses.

Bugger.

\section{Let's look into Mercurial}
If you're at all interested in version control systems~--- and any
software developer not using one daily is a strange beast indeed~---
you'll at least have become vaguely aware in the last few years of the
growing maturity of the latest group of version control systems
offering funky new stuff. These are the distributed version control
systems (DVCS). There is more to them than just their headline
attributes, being able to check history and do checkins while
disconnected from a central server, but these are damn useful to start
with.

When I first heard about DVCS, it wasn't immediately obvious to me (to
put it mildly) how they would work. After years of using a centralised
version control system, I had rough mental model of what went on. But
how do you cope without the central server forcing ordering onto the
changes?

Since then I've started using Mercurial\footnote{
\url{http://www.selenic.com/mercurial}}.
Mercurial is a DVCS. It's one of
three DVCSs that have gained significant popularity in the last few
years, the other two being Git\footnote{\url{http://git-scm.com}} and
Bazaar\footnote{\url{http://bazaar-vcs.org/}}.
I switched a significant work project over
to Mercurial (from Subversion) in mid-2007, because a customer site
required on-site work but could not allow access back to the company
VPN. I chose Mercurial for a variety of reasons which I won't bore you
with here\footnote{
OK, if you must know:
\begin{itemize}
\item Implementability. I needed the system to work on Windows, Linux and
AIX. The latter was not one of the directly supported platforms for
any of the candidates. Git's implementation uses a horde of
tools. Bazaar requires only Python, but required Python 2.4 while IBM
stubbornly still supplies only Python 2.3. Mercurial requires Python
2.3 or greater, and uses some C for speed.
\item Simplicity. My users used Subversion daily, but did not generally
have much experience with other VCS. From the command line,
Mercurial's core operations will be familiar to a Subversion
user. This is also true of Bazaar, but was less true of Git. Git has
improved in this matter since then, but a Mr Winder of this parish
tells me that it's still possible to seriously embarrass
yourself. There was also a lack of Windows support for Git at the
time.
\item Speed. Mercurial is fast. In the same ballpark as Git. Bazaar
wasn't, and although it has improved significantly, has, in my
estimation, added user complexity in the process, and at the time
of writing is still off the pace for some operations.
\item Documentation. At the time, Bryan O'Sullivan's excellent Mercurial
book (\url{http://hgbook.red-bean.com}) was a clear winner for best
documentation.
\end{itemize}}.

What I want to do in this article is give you an insight into how a
DVCS works. OK, so specifically I'm going to be talking about
Mercurial, but Git and Bazaar attack the problem in a similar way. But
first I'd better give you some idea of how you use Mercurial.

\subsection{The 5 minute Mercurial overview}
\subsubsection{The basics}
I think it unlikely that someone possessing the taste and discernment
to be reading CVu would not be familiar with at least one version
control system. So, while I want to give you a flavour of what it's
like to use, I'm not going to hang about. If you'd like a proper
introduction, or you don't follow something, I thoroughly recommend
you consult the Mercurial book.

To start using Mercurial to keep track of a project.

\begin{verbatim}
$ hg init
$
\end{verbatim}

This creates the repository root in the current directory.

Like CVS\footnote{\url{http://www.nongnu.org/cvs/}}
with its \texttt{CVS} directory and
Subversion\footnote{\url{http://subversion.tigris.org/}}
with its \texttt{.svn}
directory, Mercurial keeps its private data in a directory. Mercifully there is
only one of these, in the top level of your project. And rather than
holding details of where the actual repository is to be found, the \texttt{.hg}
directory holds the entire repository.

Next you need to specify the files you want Mercurial to track.

\begin{verbatim}
$ echo "There was a gibbon one morning" > pome.txt
$ hg add pome.txt
$
\end{verbatim}

As you might expect, this marks the files as to be added. And as you
might also expect, you need to commit to record the added files in the
repository. The commit comment can be supplied on the command line; if
you don't supply a comment, you'll be dropped into an editor to
provide one.

There is a suggested format for these messages~--- a one line summary
followed by any more required detail on following lines. By default
Mercurial will only display the first line of commit messages when
listing changes. In these examples I'll stick to terse messages, and
I'll enter them from the command line.

\begin{verbatim}
$ hg commit -m "My Pome" -u "Jim Hague <jim.hague@acm.org>"
$
\end{verbatim}

Mercurial records the user making the change as part of the change
information. It is usual to give your name and email address as I've
done here. You can imagine, though, that constantly having to repeat
this is a bit tedious, so you can set a default user name in a
configuration file. Mercurial keeps global, user and repository
configurations, and it can go in any of those.

As with Subversion, after further edits you see how your working copy
differs from the repository.

\begin{verbatim}
$ hg status
M pome.txt
$ hg diff
diff -r 33596ef855c1 pome.txt
--- a/pome.txt  Wed Apr 23 22:36:33 2008 +0100
+++ b/pome.txt  Wed Apr 23 22:48:01 2008 +0100
@@ -1,1 +1,2 @@ There was a gibbon one morning
 There was a gibbon one morning
+said "I think I will fly to the moon".
$ hg commit -m "A great second line"
$
\end{verbatim}

And look through a log of changes.

\begin{verbatim}
$ hg log
changeset:   1:3d65e7a57890
tag:         tip
user:        Jim Hague <jim.hague@acm.org>
date:        Wed Apr 23 22:49:10 2008 +0100
summary:     A great second line

changeset:   0:33596ef855c1
user:        Jim Hague <jim.hague@acm.org>
date:        Wed Apr 23 22:36:33 2008 +0100
summary:     My Pome

$
\end{verbatim}

There are some items here that need an explanation.

The changeset identifier is in fact two identifiers separated by a
colon. The first is the sequence number of the changeset in the
repository, and is directly comparable to the change number in a
Subversion repository. The second is a globally unique identifier for
that change. As the change is copied from one repository to another
(this is a distributed system, remember, even if we haven't come to
that bit yet), its sequence number in any particular repository will
change, but the global identifier will always remain the same.

\texttt{tip} is a Mercurial term. It means simply the most recent change.

Want to rename a file?

\begin{verbatim}
$ hg mv pome.txt poem.txt
$ hg status
A poem.txt
R pome.txt
$ hg commit -m "Rename my file"
$
\end{verbatim}
(The command to rename a file is actually \texttt{hg rename},
but Mercurial saves Unix-trained fingers from
typing embarrassment.)

At this point you may be wondering about directories. \texttt{hg mkdir}
perhaps? Well, no. Mercurial only tracks files. To be sure, the
directory a file occupies is tracked, but effectively only as a
component of the file name.  This has the slightly unexpected result
that you can't record an empty directory in your repository.\footnote{
I tripped over this converting a work Subversion
repository. One possibility is to create a placeholder file in the
directory. In the event I created the directory (which receives build
products) as part of the build instead.}

Given this, and the status output above that suggests strongly that
Mercurial treats a rename as a copy followed by a delete, you may be
worried that Mercurial won't cope at all well with rearranging your
repository. Relax. Mercurial does store the details of the rename as
part of the changeset, and copes very well with rearrangements\footnote{
The Mercurial designers justify not dealing with
directories as first class objects by pointing out that provided you
can correctly move files about in the tree, the other reasons for
tracking directories are uncommon and do not in their opinion justify
the considerable added complexity. So far I've found no reason to
doubt that judgement.}.

Want to rewind the working copy to a previous revision?

\begin{verbatim}
$ hg update -r 1
1 files updated, 0 files merged, 1 files removed, 0 files unresolved
$
\end{verbatim}

\texttt{hg update} updates the working files. In this case I'm specifying
that I want to go back to local changeset 1. I could also have typed
\texttt{-r 3d65e7a57890}, or even \texttt{-r 3d};
when specifying the global change
identifier you only need to type enough digits to make it unique.

This is all very well, but it's not exactly distributed, is it?

\subsubsection{Going distributed}
A version control system goes Distributed by allowing multiple copies
of the repository to exist, and work to be done in all those
repositories in parallel. So when you start work on an existing
project, the first thing to do is to get your own copy of the
repository.

\begin{verbatim}
elsewhere$ hg clone ssh://jim.home.net/Poem Jim-Poem
updating working directory
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
\end{verbatim}

Mercurial lets you access other repositories via the file system, over http or
over ssh.

\begin{verbatim}
elsewhere$ cd Jim-Poem
elsewhere$  hg log
changeset:   3:a065eb26e6b9
tag:         tip
user:        Jim Hague <jim.hague@acm.org>
date:        Thu Apr 24 18:52:31 2008 +0100
summary:     Rename my file

changeset:   2:ff97668b7422
user:        Jim Hague <jim.hague@acm.org>
date:        Thu Apr 24 18:50:22 2008 +0100
summary:     Finished first verse

changeset:   1:3d65e7a57890
user:        Jim Hague <jim.hague@acm.org>
date:        Wed Apr 23 22:49:10 2008 +0100
summary:     A great second line

changeset:   0:33596ef855c1
user:        Jim Hague <jim.hague@acm.org>
date:        Wed Apr 23 22:36:33 2008 +0100
summary:     My Pome

$
\end{verbatim}

\texttt{hg clone} is aptly named. It creates a new repository that contains
exactly the same changes as the source repository. You can make a
clone just by copying your project directory, if you're confident
nothing else will access it during the copy. \texttt{hg clone} saves you this
worry, and sets the default push/pull location in the new repo to the
cloned repo.

From that point, you use \texttt{hg pull} to collect changes from other
places into your repo (though note it does not by default update your
working copy), and, as you might guess, \texttt{hg push} shoves your changes
into a foreign repository. By default these will act on the repository
you cloned from, but you can specify any other repository.

More on those in a moment. First, though, I want to show you something
you can't do in Subversion. Start with the repository with 4 changes
we just cloned. I want to focus on the first couple of lines, so I'll
wind the working copy back to the point where only those lines exist.

\begin{verbatim}
$ hg update -r 1
1 files updated, 0 files merged, 1 files removed, 0 files unresolved
$
\end{verbatim}

And make a change.

\begin{verbatim}
$ hg diff
diff -r 3d65e7a57890 pome.txt
--- a/pome.txt  Wed Apr 23 22:49:10 2008 +0100
+++ b/pome.txt  Thu Apr 24 19:13:14 2008 +0100
@@ -1,2 +1,2 @@ There was a gibbon one morning
-There was a gibbon one morning
-said "I think I will fly to the moon".
+There was a baboon who one afternoon
+said "I think I will fly to the sun".
$ hg commit -m "Better first two lines"
$
\end{verbatim}

The alert among you will have sat up at that. Well done! Yes, there's
something very worrying. How can I commit a change at an old point?
If you try this in Subversion, it will complain mightily about your
file being out of date. But Mercurial just went ahead and did
something.  The Bazaar experts among you will know that in Bazaar, if
you use \texttt{bzr revert -r} to bring the working copy to a past revision,
make a change and commit, then your latest version will be the past
revision plus your change. Perhaps that's what Mercurial did?

No. What Mercurial did is central to Mercurial's view of the
world. You took your working copy back to an old changeset, and then
committed a fresh change based at that changeset. Mercurial actually
did just what you asked it to do, no more and no less. Let's see the
initial evidence.

\begin{verbatim}
$ hg heads
changeset:   4:267d32f158b3
tag:         tip
parent:      1:3d65e7a57890
user:        Jim Hague <jim.hague@acm.org>
date:        Thu Apr 24 19:13:59 2008 +0100
summary:     Better first two lines

changeset:   3:a065eb26e6b9
user:        Jim Hague <jim.hague@acm.org>
date:        Thu Apr 24 18:52:31 2008 +0100
summary:     Rename my file

$
\end{verbatim}

Time for some more Mercurial terminology. You can think of a \texttt{head} in
Mercurial as the most recent change on a branch. In Mercurial, a
branch is simply what happens when you commit a change that has as its
parent a change that already has a child. Mercurial has a standard
extension \texttt{hg glog} which uses some ASCII art to show the current
state:

\begin{verbatim}
$ hg glog
@  changeset:   4:267d32f158b3
|  tag:         tip
|  parent:      1:3d65e7a57890
|  user:        Jim Hague <jim.hague@acm.org>
|  date:        Thu Apr 24 19:13:59 2008 +0100
|  summary:     Better first two lines
|
| o  changeset:   3:a065eb26e6b9
| |  user:        Jim Hague <jim.hague@acm.org>
| |  date:        Thu Apr 24 18:52:31 2008 +0100
| |  summary:     Rename my file
| |
| o  changeset:   2:ff97668b7422
|/   user:        Jim Hague <jim.hague@acm.org>
|    date:        Thu Apr 24 18:50:22 2008 +0100
|    summary:     Finished first verse
|
o  changeset:   1:3d65e7a57890
|  user:        Jim Hague <jim.hague@acm.org>
|  date:        Wed Apr 23 22:49:10 2008 +0100
|  summary:     A great second line
|
o  changeset:   0:33596ef855c1
   user:        Jim Hague <jim.hague@acm.org>
   date:        Wed Apr 23 22:36:33 2008 +0100
   summary:     My Pome

$
\end{verbatim}

\texttt{hg view} shows a nicer graphical view\footnote{Though, being
Tcl/Tk based, not that much nicer.}.

So the change is in there. It's the latest change, and is simply on a
different branch to the other changes.

Almost invariably, you will want to bring everything back together and
merge the branches. A merge is a change that combines two heads back
into one. It prepares an updated working directory with the merged
contents of the two heads for you to review and, if satisfactory,
commit.

\begin{verbatim}
$ hg merge
merging pome.txt and poem.txt
0 files updated, 1 files merged, 0 files removed, 0 files unresolved
(branch merge, don't forget to commit)
$ cat poem.txt
There was a baboon who one afternoon
said "I think I will fly to the sun".
So with two great palms strapped to his arms,
he started his takeoff run.
$ hg commit -m "Merge first line branch"
$
\end{verbatim}

(I'm no poet. The poem is, of
course, \textit{Silly Old Baboon} by the late, great, Spike
Milligan. From \textit{A Book of Milliganimals}, Puffin, 1971.)

Here's the ASCII art again showing what just happened.
Oh, and notice in the above that Mercurial has done the
right thing with regard to the rename.

\begin{verbatim}
$ hg glog
@    changeset:   5:792ab970fc80
|\   tag:         tip
| |  parent:      4:267d32f158b3
| |  parent:      3:a065eb26e6b9
| |  user:        Jim Hague <jim.hague@acm.org>
| |  date:        Thu Apr 24 19:29:53 2008 +0100
| |  summary:     Merge first line branch
| |
| o  changeset:   4:267d32f158b3
| |  parent:      1:3d65e7a57890
| |  user:        Jim Hague <jim.hague@acm.org>
| |  date:        Thu Apr 24 19:13:59 2008 +0100
| |  summary:     Better first two lines
| |
o |  changeset:   3:a065eb26e6b9
| |  user:        Jim Hague <jim.hague@acm.org>
| |  date:        Thu Apr 24 18:52:31 2008 +0100
| |  summary:     Rename my file
| |
o |  changeset:   2:ff97668b7422
|/   user:        Jim Hague <jim.hague@acm.org>
|    date:        Thu Apr 24 18:50:22 2008 +0100
|    summary:     Finished first verse
|
o  changeset:   1:3d65e7a57890
|  user:        Jim Hague <jim.hague@acm.org>
|  date:        Wed Apr 23 22:49:10 2008 +0100
|  summary:     A great second line
|
o  changeset:   0:33596ef855c1
   user:        Jim Hague <jim.hague@acm.org>
   date:        Wed Apr 23 22:36:33 2008 +0100
   summary:     My Pome

$
\end{verbatim}

So, our little branch change has now been merged back, and we have a
single line of development again. Notice that unlike the other
changesets, changeset 5 has two parent changesets, indicating it is a
merge changeset. You can only merge two branches in one operation; or
putting it another way, a changeset can have a maximum of two parents.

This behaviour is absolutely central to Mercurial's philosophy. If a
change is committed that takes as its starting point a change that
already has a child, then a branch gets created. Working with
Mercurial, branches get created frequently, and equally frequently
merged back. As befits any frequent operation, both are easy to do.

You're probably thinking at this point that this making a commit onto
an old version is a slightly strange thing to do, and you'd be right.
But that's exactly what's going to happen the moment you go
distributed. Two people working independently with their own
repositories are going to make commits based, typically, on the latest
changes they happen to have incorporated into their tree. To be
Distributed, a DVCS has to deal with this. Mercurial faces it head-on.
When you pull changes into your repo (or someone else pushes them), if
any of the changes overlap~--- are both based on the same base change~---
you get extra heads, and it's up to you to let these extra heads live
or merge, as you please.

In practice this is more manageable then you might think. Consider a
typical Mercurial usage, where the 'master' repo sits on a known
server, and everyone pulls changes from the master and pushes their
own efforts to the master. By default Mercurial won't let you push if
the receiving repo will gain an extra head as a result, so you
typically pull (and do any required merging) just before
pushing. Subversion users will recognise this pattern. Subversion
won't let you commit a change if your working copy is not at the very
latest revision, so the Subversion user will update, and merge if
necessary, just before committing.

What, then, about a branch in the conventional sense of '1.0
maintenance branch'? Typically in Mercurial you'd handle this by
keeping a separate cloned repository for those changes. Cloning is
fast, and if local uses hard links where possible on filesystems that
support them, so isn't necessarily extravagant on disc space. You can,
if you prefer, handle them all in a single repo with 'named
branches', but cloning is definitely simpler.

OK, so now you know the basics of using Mercurial. We can proceed to
looking at how this magic is achieved. In particular, where does this
magic globally unique identifier for a change come from?

\subsection{Inside the Mercurial repo}
The way Mercurial handles its repo is really quite simple.

That's simple, as in 'most things are simple once you know the
answer'.  I found the explanation helpful\footnote{For the curious,
Bryan O'Sullivan's excellent Mercurial book
has a chapter on the subject, and the Mercurial website has a fair amount
of detail too.}, so this section attempts
the 10,000ft (FL100 if you prefer) view of Mercurial.

First remember that any file or component can only have one or two
parents. You can't merge more than one other branch at once.

We start with the basic building block, which Mercurial calls a
revlog. A revlog is a thing that holds a file and all the changes in
the file history\footnote{For any non-trivial file, this will
actually be two files on the disc, a data file and an index.}. The
revlog stores the differences between successive versions
of the file, though it will periodically store a complete version of
the file instead of a difference, so that the content of any
particular file version can always be reconstructed without excessive
effort.

Under the secret-squirrel Mercurial \texttt{.hg} directory at the top of your
project is a store which holds a revlog for each file in your
project. So you have the complete history of the project locally. No
more round trips to the server.

Both the differences between successive versions and the periodic
complete versions of a file are compressed before storing. This is
surprisingly effective at minimising the storage requirements of the
entire history of your project. I have a small Java project handy,
comprising a little over 300 source modules. There are 5 branches plus
the mainline, and some 1920 commits in all. A Subversion checkout of
the current mainline takes 51Mb. Converting the project to Mercurial
yields a Mercurial repository that takes 60Mb, so a little
bigger. Remember, though, that the Mercurial repository includes not
just the working copy, but also the entire history of the project.

Any point in the evolution of a revlog can be uniquely identified with
a nodeid. This is simply the SHA1 hash of the current file contents
concatenated with the nodeids of one or both parents of the current
revision. Note that this way, two file states are identical if and
only if the file contents are the same *and* the file has the
same history.

Here's a dump of a revlog index:

\begin{verbatim}
$ hg debugindex .hg/store/data/pome.txt.i
   rev    offset  length   base linkrev nodeid       p1           p2
     0         0      32      0       0 6bbbd5d6cc53 000000000000 000000000000
     1        32      51      0       1 83d266583303 6bbbd5d6cc53 000000000000
     2        83      84      0       2 14a54ec34bb6 83d266583303 000000000000
     3       167      76      3       4 dc4df776b38b 83d266583303 000000000000
$
\end{verbatim}

Note here that a file state can have two parents. If both the parent
nodeids are non-null, the file state has two parents, and the state is
therefore the result of a merge.

Let's dump out a revlog at a particular revision:

\begin{verbatim}
$ hg debugdata .hg/store/data/pome.txt.i 2
There was a gibbon one morning
said "I think I will fly to the moon".
So with two great palms strapped to his arms,
he started his takeoff run.
$
\end{verbatim}

The next component is the manifest. This is simply a list of all the
files in the project, together with their current nodeids. The
manifest is a file, held in a revlog. The nodeid of the manifest,
therefore, identifies the project filesystem at a particular point.

\begin{verbatim}
$ hg debugdata .hg/store/00manifest.i 5
poem.txt5168b1a5e2f44aa4e0f164e170820845183f50c8
$
\end{verbatim}

Finally we have the changeset. This is the atomic collection of
changes to a repository that leads to a new revision. The changeset
info includes the nodeid of the corresponding manifest, the timestamp
and committer ID, a list of changed files and a comment. The changeset
also includes the nodeid of the parent changeset, or the two parents
if the change is a merge. The changeset description is held in a
revlog, the changelog.

\begin{verbatim}
$ hg debugdata .hg/store/00changelog.i 5
1ccc11b6f7308cc8fa1573c2f3811a4710c91e3e
Jim Hague <jim.hague@acm.org>
1209061793 -3600
poem.txt
pome.txt

Merge first line branch
$
\end{verbatim}

The nodeid of the changeset, therefore, gives us a globally unique
identifier for any particular change.  Changesets have a
Subversion-like incrementing change number, but it is peculiar to that
repository. The nodeid, however, is global.

One more detail remains to complete the picture. How do we get back
from a particular file change to find the responsible changeset? Each
revlog change has a linkrev entry that does just this.

So, now we have a repository with a history of the changes applied to
that repository. Each change has a unique identifier. If we find that
change in another repository, it means that at the point in the other
repository we have exactly the same state; the file contents and
history are identical.

At this point we can see how pulling changes from another repository
works. Mercurial has to determine which changesets in the source
repository are missing in the target repository. To do this, for each
head in the source repo it has to find the most recent change in that
head that it already present in the target repo, and get any remaining
changes after that point. These changes are then copied over and
applied.

The Mercurial revlog format has proved remarkably durable. Since the
first release of Mercurial in April 2005, there have been a total of 5
changes to the file format. However, of those, all but one have been
changes to the handling of file names. The most recent change, in
October 2008, and its predecessor in December 2006, were both
introduced purely to cope with Windows specific issues. The one change
that touched the data structures described above was in April 2006. The
format introduced, RevLogNG, changed only the details of index data
held, not the overall design. The chief Mercurial developer, Matt
Mackall, notes that the code in present-day Mercurial devoted to
reading the old format comprises 28 lines of Python. Compared with,
say, the early tribulations of Subversion and the switch from \texttt{bdfs} to
\texttt{fsfs}, this is an impressive record.

\section{Reflections on going distributed}
It's nearly traditional at this stage in an introduction to DVCS to
demonstrate several different workflow scenarios that you can build
with a DVCS. Which makes the important point that a DVCS can be
adapted to your workflow in a way that is at best unwieldy with a
CVCS. I intend, though, to break with tradition here.

By this stage, I hope you can see that distributing version control
works by introducing branches where development takes place in
parallel. Mercurial treats these branches as arising naturally from
the commits made and transferred between repositories. Both Git and
Bazaar take a slightly different viewpoint, and explicitly generate a
fresh branch for work in a particular repository. But in both cases
the underlying principle of identifying changes by a globally unique
identifier and resolving parallel development by merges between
overlapping changes is the same. And all three can be used in a truly
distributed manner, with full history and the ability to commit being
available locally.

So instead of chatter on about workflows, I want instead to reflect on
the consequences all this has for that all-important question of
whether a DVCS is a suitable vehicle for your data.

The first is a minor and rather obvious point. If you want to store
files that are very large and which change often in your DVCS, then
all the compression in the world is unlikely to stop the storage
requirements for the full project history from becoming uncomfortably
large, particularly if the files are not very compressible to start
with.

The second, and main, point is that there is an important question you
need to ask about your data. We've seen that a DVCS relies on
branching and merging to weave its magic. So take a close look at your
data, and ask:

\standout{Will It Merge?}

The subset of plain old text which comprises program source
code requires some human oversight, but will merge automatically
well enough for the process to be well within the bounds of the
possible.

Unfortunately when we move further afield mergeability becomes a rarer
commodity. I nearly began the previous paragraph by stating that
plain old text will merge well enough. Then Doubt set in~--- what about
XML? Or BASE64 encoded content?

Of course, merge doesn't necessarily have to be textual merge. I am
told that Word can be used to diff and merge two Word \texttt{.doc} files, a
data format notorious for its binary impenetrability. As long as some
suitable merge agent is available, and the DVCS can be configured to
use it for data of a particular type\footnote{Mercurial can have the
merge and diff tools specified with reference to the file extension on
which they operate~--- I assume Bazaar and Git are similar.}, then there
is no bar to successful DVCS use.

Before this reliance on mergeability causes you to dismiss DVCS out of
hand, reflect. A CVCS can only handle non-mergeable data by acting as
a versioned file store; in other words, having as the only available
merge option the use of one or other of the merge candidates in its
entirety. Useful though a versioned file store can be, it cannot be
considered a full-featured version control system. By treating the
offending unmergeable files as external to the DVCS, or with careful
workflow~--- disabling the distributed and mergeable potentials~--- a DVCS
can deal with these files, but only at a cost of its distributedness
or its version control system-ness. In this it differs little from a
CVCS.

So, for all data you want to version control, let your battle cry be:

\standout{Will It Merge?}

At this point, I have an urge to don lab coat and safety goggles and
be videoed attempting to mechanically merge data in a variety of
different formats. Frankly, this is unlikely to be as exciting at
blending iPhones\footnote{\url{http://www.willitblend.com}},
but from a system development point of view it's rather more
important. And, I think gives us a large clue as to one of the
reasons for the continuing
popularity of Plain Old Text as a source code representation mechanism.

\end{document}
author	Jim Hague <jim.hague@acm.org>
date	Wed, 26 Aug 2009 14:46:50 +0100
parents	03d0ebf7ce0b
children