
Inside a distributed version control system
===========================================

Grinton Lodge is a Youth Hostel that sits on an exposed hillside just
above the small hamlet of Grinton in Swaledale, in the Yorkshire Dales
National Park. A former Victorian shooting lodge, it now welcomes
walkers and other travellers from around the world.

Tonight, a Wednesday in mid-November, is not one of its busiest
nights. Kat, the duty staff member, tells me that there is a small
corporate team-building group in the annex. There's no sign of them at
present. Otherwise, that portion of the world that has beaten a path
to the door of this grand building today consists of just me. And Kat
goes home soon.

The November CVu, removed from its wrappers and read yesterday, lies
in my bag. Taunting me. Go on, it says, if you're ever going to put
finger to keyboard in the name of CVu, well, tonight you are out of
excuses.

Bugger.

Let's look into Mercurial
-------------------------

If you're at all interested in version control systems - and any
software developer not using one daily is a strange beast indeed -
you'll at least have become vaguely aware in the last few years of the
growing maturity of the latest group of version control systems
offering funky new stuff. These are the distributed version control
systems (DVCS). There is more to them than just their headline
attributes, being able to check history and do checkins while
disconnected from a central server, but these are damn useful to start
with.

When I first heard about DVCS, it wasn't immediately obvious to me (to
put it mildly) how they would work. After years of using a centralised
version control system, I had a rough mental model of what went on. But
how do you cope without the central server forcing ordering onto the
changes?

Since then I've started using Mercurial. Mercurial is a DVCS. It's one
of three DVCSs that have gained significant popularity in the last few
years, the other two being Git and Bazaar. I switched a significant
work project over to Mercurial (from Subversion) over a year ago,
because a customer site required on-site work but could not allow
access back to the company VPN. I chose Mercurial for a variety of
reasons which I won't bore you with here. If you must know, see the
box.

What I want to do in this article is give you an insight into how a
DVCS works. OK, so specifically I'm going to be talking about
Mercurial, but Git and Bazaar attack the problem in a similar way. But
first I'd better give you some idea of how you use Mercurial.

::::
Box: OK, if you must know:

o Implementability. I needed the system to work on Windows, Linux and
AIX. The latter was not one of the directly supported platforms for
any of the candidates. Git's implementation uses a horde of
tools. Bazaar requires only Python, but required Python 2.4 while IBM
stubbornly still supplies only Python 2.3. Mercurial requires Python
2.3 or greater, and uses some C for speed.

o Simplicity. My users used Subversion daily, but did not generally
have much experience with other VCS. From the command line,
Mercurial's core operations will be familiar to a Subversion
user. This is also true of Bazaar, but was less true of Git. Git has
improved in this matter since then, but a Mr Winder of this parish
tells me that it's still possible to seriously embarrass
yourself. There was also a lack of Windows support for Git at the
time.

o Speed. Mercurial is fast. In the same ballpark as Git. Bazaar
wasn't, and although it has improved significantly, has, in my
estimation, added user complexity in the process, and is still off the
pace for some operations.

o Documentation. At the time, Bryan O'Sullivan's excellent Mercurial
book (http://hgbook.red-bean.com) was a clear winner for best
documentation.
::::

The 5 minute Mercurial overview
-------------------------------

I think it unlikely that someone possessing the taste and discernment
to be reading CVu would not be familiar with at least one version
control system. So, while I want to give you a flavour of what it's
like to use, I'm not going to hang about. If you'd like a proper
introduction, or you don't follow something, I thoroughly recommend
you consult the Mercurial book.

To start using Mercurial to keep track of a project:

$ hg init
$

This creates the repository root in the current directory.

Like CVS with its CVS directory and Subversion with its .svn
directory, Mercurial keeps its private data in a directory. Mercifully
there is only one of these, in the top level of your project. And
rather than holding details of where the actual repository is to be
found, the .hg directory holds the entire repository.

Next you need to specify the files you want Mercurial to track.

$ echo "There was a gibbon one morning" > pome.txt
$ hg add pome.txt
$

As you might expect, this marks the files as to be added. And as you
might also expect, you need to commit to record the added files in the
repository. The commit comment can be supplied on the command line; if
you don't supply a comment, you'll be dropped into an editor to
provide one.

There is a suggested format for these messages - a one line summary
followed by any more required detail on following lines. By default
Mercurial will only display the first line of commit messages when
listing changes. In these examples I'll stick to terse messages, and
I'll enter them from the command line.

$ hg commit -m "My Pome" -u "Jim Hague <jim.hague@acm.org>"
$

Mercurial records the user making the change as part of the change
information. It is usual to give your name and email address as I've
done here. You can imagine, though, that constantly having to repeat
this is a bit tedious, so you can set a default user name in a
configuration file. Mercurial keeps global, user and repository
configurations, and it can go in any of those.

As with Subversion, after further edits you can see how your working
copy differs from the repository.

$ hg status
M pome.txt
$ hg diff
diff -r 33596ef855c1 pome.txt
--- a/pome.txt  Wed Apr 23 22:36:33 2008 +0100
+++ b/pome.txt  Wed Apr 23 22:48:01 2008 +0100
@@ -1,1 +1,2 @@ There was a gibbon one morning
 There was a gibbon one morning
+said "I think I will fly to the moon".
$ hg commit -m "A great second line"
$

And look through a log of changes.

$ hg log
changeset:   1:3d65e7a57890
tag:         tip
user:        Jim Hague <jim.hague@acm.org>
date:        Wed Apr 23 22:49:10 2008 +0100
summary:     A great second line

changeset:   0:33596ef855c1
user:        Jim Hague <jim.hague@acm.org>
date:        Wed Apr 23 22:36:33 2008 +0100
summary:     My Pome

$

There are some items here that need an explanation.

The changeset identifier is in fact two identifiers separated by a
colon. The first is the sequence number of the changeset in the
repository, and is directly comparable to the change number in a
Subversion repository. The second is a globally unique identifier for
that change. As the change is copied from one repository to another
(this is a distributed system, remember, even if we haven't come to
that bit yet), its sequence number in any particular repository will
change, but the global identifier will always remain the same.
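
If it helps to see that spelled out, here is a toy model in Python. It
is only a sketch: the Repo class is hypothetical and has nothing to do
with Mercurial's real code, and the short ids are simply borrowed from
the examples in this article. Each repository numbers changesets in the
order it happened to receive them, while the global identifier travels
with the change.

```python
# Toy model only: the Repo class is hypothetical, not Mercurial's API.
# The short ids are borrowed from the examples in this article.

class Repo:
    def __init__(self):
        self.order = []  # global ids, in the order this repo received them

    def add(self, global_id):
        """Record a changeset (if new) and return its local sequence number."""
        if global_id not in self.order:
            self.order.append(global_id)
        return self.order.index(global_id)

    def describe(self, global_id):
        """Format a changeset the way 'hg log' does: local:global."""
        return "%d:%s" % (self.order.index(global_id), global_id)


shared = ["33596ef855c1", "3d65e7a57890"]

mine = Repo()
for gid in shared + ["ff97668b7422", "a065eb26e6b9", "267d32f158b3"]:
    mine.add(gid)

# Someone else committed 267d32f158b3 first, then received the other two.
theirs = Repo()
for gid in shared + ["267d32f158b3", "ff97668b7422", "a065eb26e6b9"]:
    theirs.add(gid)

print(mine.describe("267d32f158b3"))    # 4:267d32f158b3
print(theirs.describe("267d32f158b3"))  # 2:267d32f158b3
```

Same change, same global identifier, different local number.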

'tip' is a Mercurial term. It means simply the most recent change.

Want to rename a file?

$ hg mv pome.txt poem.txt
$ hg status
A poem.txt
R pome.txt
$ hg commit -m "Rename my file"
$

(The command to rename a file is actually 'hg rename', but Mercurial
saves Unix-trained fingers from typing embarrassment.)

At this point you may be wondering about directories. 'hg mkdir'
perhaps? Well, no. Mercurial only tracks files. To be sure, the
directory a file occupies is tracked, but effectively only as a
component of the file name.  This has the slightly unexpected result
that you can't record an empty directory in your repository.

(Footnote: I tripped over this converting a work Subversion
repository. One possibility is to create a placeholder file in the
directory. In the event I created the directory (which receives build
products) as part of the build instead.)

Given this, and the status output above that suggests strongly that
Mercurial treats a rename as a copy followed by a delete, you may be
worried that Mercurial won't cope at all well with rearranging your
repository. Relax. Mercurial does store the details of the rename as
part of the changeset, and copes very well with rearrangements.

(Footnote: The Mercurial designers justify not dealing with
directories as first class objects by pointing out that provided you
can correctly move files about in the tree, the other reasons for
tracking directories are uncommon and do not in their opinion justify
the considerable added complexity. So far I've found no reason to
doubt that judgement.)

Want to rewind the working copy to a previous revision?

$ hg update -r 1
1 files updated, 0 files merged, 1 files removed, 0 files unresolved
$

'hg update' updates the working files. In this case I'm specifying
that I want to go back to local changeset 1. I could also have typed
'-r 3d65e7a57890', or even '-r 3d'; when specifying the global change
identifier you only need to type enough digits to make it unique.
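
The abbreviation rule can be sketched in a few lines of Python. This
is my own illustration, not Mercurial's lookup code; resolve_prefix is
a made-up name, and the sketch ignores details such as Mercurial
treating a string of digits as a local sequence number first.

```python
def resolve_prefix(prefix, known_ids):
    """Return the single id starting with prefix, or raise ValueError."""
    matches = [i for i in known_ids if i.startswith(prefix)]
    if not matches:
        raise ValueError("unknown identifier '%s'" % prefix)
    if len(matches) > 1:
        raise ValueError("ambiguous identifier '%s'" % prefix)
    return matches[0]


ids = ["33596ef855c1", "3d65e7a57890", "ff97668b7422", "a065eb26e6b9"]

print(resolve_prefix("3d", ids))  # 3d65e7a57890

try:
    resolve_prefix("3", ids)      # matches both 33596... and 3d65...
except ValueError as err:
    print(err)                    # ambiguous identifier '3'
```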

This is all very well, but it's not exactly distributed, is it?

Copy an existing repository:

elsewhere$ hg clone ssh://jim.home.net/Poem Jim-Poem
updating working directory
1 files updated, 0 files merged, 0 files removed, 0 files unresolved

(You can access other repositories via the file system, over http or
over ssh).

elsewhere$ cd Jim-Poem
elsewhere$ hg log
changeset:   3:a065eb26e6b9
tag:         tip
user:        Jim Hague <jim.hague@acm.org>
date:        Thu Apr 24 18:52:31 2008 +0100
summary:     Rename my file

changeset:   2:ff97668b7422
user:        Jim Hague <jim.hague@acm.org>
date:        Thu Apr 24 18:50:22 2008 +0100
summary:     Finished first verse

changeset:   1:3d65e7a57890
user:        Jim Hague <jim.hague@acm.org>
date:        Wed Apr 23 22:49:10 2008 +0100
summary:     A great second line

changeset:   0:33596ef855c1
user:        Jim Hague <jim.hague@acm.org>
date:        Wed Apr 23 22:36:33 2008 +0100
summary:     My Pome

'hg clone' is aptly named. It creates a new repository that contains
exactly the same changes as the source repository. You can make a
clone just by copying your project directory, if you're confident
nothing else will access it during the copy. 'hg clone' saves you this
worry, and sets the default push/pull location in the new repo to the
cloned repo.

From that point, you use 'hg pull' to collect changes from other
places into your repo (though note it does not by default update your
working copy), and, as you might guess, 'hg push' shoves your changes
into a foreign repository. By default these will act on the repository
you cloned from, but you can specify any other repository.

More on those in a moment. First, though, I want to show you something
you can't do in Subversion. Start with the repository with 4 changes
we just cloned. I want to focus on the first couple of lines, so I'll
wind the working copy back to the point where only those lines exist.

$ hg update -r 1
1 files updated, 0 files merged, 1 files removed, 0 files unresolved

And make a change.

$ hg diff
diff -r 3d65e7a57890 pome.txt
--- a/pome.txt  Wed Apr 23 22:49:10 2008 +0100
+++ b/pome.txt  Thu Apr 24 19:13:14 2008 +0100
@@ -1,2 +1,2 @@ There was a gibbon one morning
-There was a gibbon one morning
-said "I think I will fly to the moon".
+There was a baboon who one afternoon
+said "I think I will fly to the sun".
$ hg commit -m "Better first two lines"
$

The alert among you will have sat up at that. Well done! Yes, there's
something very worrying. How can I commit a change at an old point?
If you try this in Subversion, it will complain mightily about your
file being out of date. But Mercurial just went ahead and did
something.  The Bazaar experts among you will know that in Bazaar, if
you use 'bzr revert -r' to bring the working copy to a past revision,
make a change and commit, then your latest version will be the past
revision plus your change. Perhaps that's what Mercurial did?

No. What Mercurial did is central to Mercurial's view of the
world. You took your working copy back to an old changeset, and then
committed a fresh change based at that changeset. Mercurial actually
did just what you asked it to do, no more and no less. Let's see the
initial evidence.

$ hg heads
changeset:   4:267d32f158b3
tag:         tip
parent:      1:3d65e7a57890
user:        Jim Hague <jim.hague@acm.org>
date:        Thu Apr 24 19:13:59 2008 +0100
summary:     Better first two lines

changeset:   3:a065eb26e6b9
user:        Jim Hague <jim.hague@acm.org>
date:        Thu Apr 24 18:52:31 2008 +0100
summary:     Rename my file

$

Time for some more Mercurial terminology. You can think of a 'head' in
Mercurial as the most recent change on a branch. In Mercurial, a
branch is simply what happens when you commit a change that has as its
parent a change that already has a child. Mercurial has a standard
extension 'hg glog' which uses some ASCII art to show the current
state:

$ hg glog
@  changeset:   4:267d32f158b3
|  tag:         tip
|  parent:      1:3d65e7a57890
|  user:        Jim Hague <jim.hague@acm.org>
|  date:        Thu Apr 24 19:13:59 2008 +0100
|  summary:     Better first two lines
|
| o  changeset:   3:a065eb26e6b9
| |  user:        Jim Hague <jim.hague@acm.org>
| |  date:        Thu Apr 24 18:52:31 2008 +0100
| |  summary:     Rename my file
| |
| o  changeset:   2:ff97668b7422
|/   user:        Jim Hague <jim.hague@acm.org>
|    date:        Thu Apr 24 18:50:22 2008 +0100
|    summary:     Finished first verse
|
o  changeset:   1:3d65e7a57890
|  user:        Jim Hague <jim.hague@acm.org>
|  date:        Wed Apr 23 22:49:10 2008 +0100
|  summary:     A great second line
|
o  changeset:   0:33596ef855c1
   user:        Jim Hague <jim.hague@acm.org>
   date:        Wed Apr 23 22:36:33 2008 +0100
   summary:     My Pome

$

'hg view' shows a nicer graphical view. (Footnote: Though, being
Tcl/Tk based, not that much nicer.)

So the change is in there. It's the latest change, and is simply on a
different branch to the other changes.
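
Put another way, a head is any changeset that isn't the parent of
something else. Here's a minimal Python sketch of that definition,
using the local numbers and parents from the graph above. The heads
function is my own illustration, not how Mercurial computes them.

```python
# Parents of each changeset, keyed by local number, as in the graph above.
parents = {0: [], 1: [0], 2: [1], 3: [2], 4: [1]}


def heads(parents):
    """Changesets that are not the parent of any other changeset."""
    with_children = {p for ps in parents.values() for p in ps}
    return sorted(set(parents) - with_children, reverse=True)


print(heads(parents))  # [4, 3]
```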

Almost invariably, you will want to bring everything back together and
merge the branches. A merge is a change that combines two heads back
into one. It prepares an updated working directory with the merged
contents of the two heads for you to review and, if satisfactory, commit.

$ hg merge
merging pome.txt and poem.txt
0 files updated, 1 files merged, 0 files removed, 0 files unresolved
(branch merge, don't forget to commit)
$ cat poem.txt
There was a baboon who one afternoon
said "I think I will fly to the sun".
So with two great palms strapped to his arms,
he started his takeoff run.
$ hg commit -m "Merge first line branch"
$

(Footnote: I'm no poet. The poem is, of course, 'Silly Old Baboon' by
the late, great, Spike Milligan. From 'A Book of Milliganimals',
Puffin, 1971.)

Here's the ASCII art again showing what just happened. Oh, and notice
that Mercurial has done the right thing with regard to the rename.

$ hg glog
@    changeset:   5:792ab970fc80
|\   tag:         tip
| |  parent:      4:267d32f158b3
| |  parent:      3:a065eb26e6b9
| |  user:        Jim Hague <jim.hague@acm.org>
| |  date:        Thu Apr 24 19:29:53 2008 +0100
| |  summary:     Merge first line branch
| |
| o  changeset:   4:267d32f158b3
| |  parent:      1:3d65e7a57890
| |  user:        Jim Hague <jim.hague@acm.org>
| |  date:        Thu Apr 24 19:13:59 2008 +0100
| |  summary:     Better first two lines
| |
o |  changeset:   3:a065eb26e6b9
| |  user:        Jim Hague <jim.hague@acm.org>
| |  date:        Thu Apr 24 18:52:31 2008 +0100
| |  summary:     Rename my file
| |
o |  changeset:   2:ff97668b7422
|/   user:        Jim Hague <jim.hague@acm.org>
|    date:        Thu Apr 24 18:50:22 2008 +0100
|    summary:     Finished first verse
|
o  changeset:   1:3d65e7a57890
|  user:        Jim Hague <jim.hague@acm.org>
|  date:        Wed Apr 23 22:49:10 2008 +0100
|  summary:     A great second line
|
o  changeset:   0:33596ef855c1
   user:        Jim Hague <jim.hague@acm.org>
   date:        Wed Apr 23 22:36:33 2008 +0100
   summary:     My Pome

$

So, our little branch change has now been merged back, and we have a
single line of development again. Notice that unlike the other
changesets, changeset 5 has two parent changesets, indicating it is a
merge changeset. You can only merge two branches in one operation; or
putting it another way, a changeset can have a maximum of two parents.

This behaviour is absolutely central to Mercurial's philosophy. If a
change is committed that takes as its starting point a change that
already has a child, then a branch gets created. Working with
Mercurial, branches get created frequently, and equally frequently
merged back. As befits any frequent operation, both are easy to do.

You're probably thinking at this point that this making a commit onto
an old version is a slightly strange thing to do, and you'd be right.
But that's exactly what's going to happen the moment you go
distributed. Two people working independently with their own
repositories are going to make commits based, typically, on the latest
changes they happen to have incorporated into their tree. To be
Distributed, a DVCS has to deal with this. Mercurial faces it head-on.
When you pull changes into your repo (or someone else pushes them), if
any of the changes overlap - are both based on the same base change -
you get extra heads, and it's up to you to let these extra heads live
or merge, as you please.

In practice this is more manageable than you might think. Consider a
typical Mercurial usage, where the 'master' repo sits on a known
server, and everyone pulls changes from the master and pushes their
own efforts to the master. By default Mercurial won't let you push if
the receiving repo will gain an extra head as a result, so you
typically pull (and do any required merging) just before
pushing. Subversion users will recognise this pattern. Subversion
won't let you commit a change if your working copy is not at the very
latest revision, so the Subversion user will update, and merge if
necessary, just before committing.
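
The push rule can be modelled roughly as follows. This is a sketch
under a big assumption, namely that the check amounts to comparing the
number of heads before and after; push_allowed and the single-letter
changeset ids are hypothetical, and Mercurial's real check will differ
in its details.

```python
def heads(parents):
    """Ids that are not the parent of any other changeset."""
    with_children = {p for ps in parents.values() for p in ps}
    return set(parents) - with_children


def push_allowed(target, outgoing):
    """True if adding 'outgoing' leaves 'target' with no more heads than before.

    Both arguments map a changeset id to the list of its parent ids.
    """
    combined = dict(target)
    combined.update(outgoing)
    return len(heads(combined)) <= len(heads(target))


# The master already holds someone else's change 'b', based on 'a'.
master = {"a": [], "b": ["a"]}

# My change 'c' is also based on 'a', so pushing it alone adds a head.
print(push_allowed(master, {"c": ["a"]}))                   # False

# After pulling 'b' and merging, I push 'c' plus the merge 'm'.
print(push_allowed(master, {"c": ["a"], "m": ["c", "b"]}))  # True
```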

What, then, about a branch in the conventional sense of '1.0
maintenance branch'? Typically in Mercurial you'd handle this by
keeping a separate cloned repository for those changes. Cloning is
fast and, for a local clone, uses hard links where possible on
filesystems that support them, so isn't necessarily extravagant on
disc space. You can, if you prefer, handle them all in a single repo
with 'named branches', but cloning is definitely simpler.

OK, so now you know the basics of using Mercurial. We can proceed to
looking at how this magic is achieved. In particular, where does this
magic globally unique identifier for a change come from?

Inside the Mercurial repo
-------------------------

The way Mercurial handles its repo is really quite simple.

That's simple, as in 'most things are simple once you know the
answer'.  I found the explanation helpful, so this section attempts
the 10,000ft (FL100 if you prefer) view of Mercurial.

(Footnote: Bryan O'Sullivan's excellent Mercurial book has a chapter
on the subject, and the Mercurial website has a fair amount of detail
too. This is 'research', OK?)

First remember that any file or component can only have one or two
parents. You can't merge more than one other branch at once.

We start with the basic building block, which Mercurial calls a
revlog. A revlog is a thing that holds a file and all the changes in
the file history. (Footnote: For any non-trivial file, this will
actually be two files on the disc, a data file and an index). The
revlog stores the (compressed) differences between successive versions
of the file, though it will periodically store a complete version of
the file instead of a difference, so that the content of any
particular file version can always be reconstructed without excessive
effort.
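
To make the idea concrete, here's a toy revlog in Python. It is very
much a sketch: ToyRevlog, the line-based delta format and the fixed
snapshot interval are all my own assumptions for illustration. The
real revlog works on compressed binary deltas and has its own rule for
deciding when to store a full version.

```python
import difflib

SNAPSHOT_EVERY = 3  # assumption: the real rule is not a fixed interval


def make_delta(old, new):
    """Describe 'new' as (start, end, replacement) edits to the lines of 'old'."""
    matcher = difflib.SequenceMatcher(None, old, new)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_delta(old, delta):
    out = list(old)
    # Apply from the end so that earlier indices stay valid.
    for start, end, replacement in reversed(delta):
        out[start:end] = replacement
    return out


class ToyRevlog:
    def __init__(self):
        self.entries = []  # ("full", lines) or ("delta", edits)

    def add(self, lines):
        if len(self.entries) % SNAPSHOT_EVERY == 0:
            self.entries.append(("full", list(lines)))
        else:
            previous = self.read(len(self.entries) - 1)
            self.entries.append(("delta", make_delta(previous, lines)))

    def read(self, rev):
        # Walk back to the nearest full version, then replay the deltas.
        base = rev
        while self.entries[base][0] != "full":
            base -= 1
        lines = list(self.entries[base][1])
        for _, delta in self.entries[base + 1:rev + 1]:
            lines = apply_delta(lines, delta)
        return lines


v0 = ["There was a gibbon one morning"]
v1 = v0 + ['said "I think I will fly to the moon".']
v2 = v1 + ["So with two great palms strapped to his arms,",
           "he started his takeoff run."]
v3 = ["There was a baboon who one afternoon",
      'said "I think I will fly to the sun".']

log = ToyRevlog()
for version in (v0, v1, v2, v3):
    log.add(version)

print([kind for kind, _ in log.entries])  # ['full', 'delta', 'delta', 'full']
print(log.read(2) == v2)                  # True
```

Reading any revision never has to replay more than a bounded run of
deltas, which is the point of the periodic full versions.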

Under the secret-squirrel Mercurial .hg directory at the top of your
project is a store which holds a revlog for each file in your project.

Any point in the evolution of a revlog can be uniquely identified with
a nodeid. This is simply the SHA1 hash of the current file contents
concatenated with the nodeids of one or both parents of the current
revision. Note that this way, two file states are identical if and
only if the file contents are the same *and* the file has the
same history.
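
In Python the calculation looks something like this. Treat it as a
sketch: sorting the parents before hashing and using twenty zero bytes
for a missing parent are my understanding of the implementation rather
than something this article spells out, and node_id is a made-up name.

```python
import hashlib

NULL_ID = b"\0" * 20  # assumption: stands in for "no parent"


def node_id(text, p1=NULL_ID, p2=NULL_ID):
    """SHA-1 over both parent ids (sorted), followed by the file contents."""
    low, high = sorted([p1, p2])
    digest = hashlib.sha1(low)
    digest.update(high)
    digest.update(text)
    return digest.digest()


text = b"There was a gibbon one morning\n"

root = node_id(text)             # no parents: the first revision
child = node_id(text, p1=root)   # same contents, but with a history

print(len(root.hex()))           # 40
print(root == child)             # False
```

Identical contents with a different history give a different nodeid,
which is exactly the property described above.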

Here's a dump of a revlog index:

$ hg debugindex .hg/store/data/pome.txt.i
   rev    offset  length   base linkrev nodeid       p1           p2
     0         0      32      0       0 6bbbd5d6cc53 000000000000 000000000000
     1        32      51      0       1 83d266583303 6bbbd5d6cc53 000000000000
     2        83      84      0       2 14a54ec34bb6 83d266583303 000000000000
     3       167      76      3       4 dc4df776b38b 83d266583303 000000000000
$

Note here that a file state can have two parents. If both the parent
nodeids are non-null, the file state has two parents, and the state is
therefore the result of a merge.

Let's dump out a revlog at a particular revision:

$ hg debugdata .hg/store/data/pome.txt.i 2
There was a gibbon one morning
said "I think I will fly to the moon".
So with two great palms strapped to his arms,
he started his takeoff run.
$

The next component is the manifest. This is simply a list of all the
files in the project, together with their current nodeids. The
manifest is a file, held in a revlog. The nodeid of the manifest,
therefore, identifies the project filesystem at a particular point.

$ hg debugdata .hg/store/00manifest.i 5
poem.txt5168b1a5e2f44aa4e0f164e170820845183f50c8
$
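
The file name and nodeid look run together in that output. My
understanding is that they are separated by a NUL byte which simply
doesn't show on the terminal; take that as an assumption. On that
basis, a manifest could be read with a sketch like this (parse_manifest
is a hypothetical helper, not part of Mercurial):

```python
def parse_manifest(data):
    """Map file name -> 40-character hex nodeid.

    Assumes each line is: name, a NUL byte, then the hex nodeid.
    """
    files = {}
    for line in data.split(b"\n"):
        if not line:
            continue
        name, _, rest = line.partition(b"\0")
        files[name.decode()] = rest[:40].decode()
    return files


sample = b"poem.txt\x005168b1a5e2f44aa4e0f164e170820845183f50c8\n"
print(parse_manifest(sample))
# {'poem.txt': '5168b1a5e2f44aa4e0f164e170820845183f50c8'}
```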

Finally we have the changeset. This is the atomic collection of
changes to a repository that leads to a new revision. The changeset
info includes the nodeid of the corresponding manifest, the timestamp
and committer ID, a list of changed files and a comment. The changeset
also includes the nodeid of the parent changeset, or the two parents
if the change is a merge. The changeset description is held in a
revlog, the changelog.

$ hg debugdata .hg/store/00changelog.i 5
1ccc11b6f7308cc8fa1573c2f3811a4710c91e3e
Jim Hague <jim.hague@acm.org>
1209061793 -3600
poem.txt
pome.txt

Merge first line branch
$
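
That dump is easy enough to pick apart. The sketch below assumes the
layout is just what the output above suggests: manifest nodeid, user,
timestamp with timezone offset, the changed files, a blank line, then
the comment. The parents don't appear in the dump; I assume they live
in the revlog index, as with the file revlog shown earlier.
parse_changeset is a made-up helper.

```python
from datetime import datetime, timedelta, timezone


def parse_changeset(text):
    """Split a changelog entry into its parts (layout assumed from the dump)."""
    header, _, description = text.partition("\n\n")
    lines = header.split("\n")
    manifest, user, when = lines[0], lines[1], lines[2]
    seconds, offset = when.split()[:2]
    # The stored offset counts seconds west of UTC, hence the sign flip.
    zone = timezone(timedelta(seconds=-int(offset)))
    return {
        "manifest": manifest,
        "user": user,
        "date": datetime.fromtimestamp(int(seconds), zone),
        "files": lines[3:],
        "description": description.rstrip("\n"),
    }


entry = parse_changeset(
    "1ccc11b6f7308cc8fa1573c2f3811a4710c91e3e\n"
    "Jim Hague <jim.hague@acm.org>\n"
    "1209061793 -3600\n"
    "poem.txt\n"
    "pome.txt\n"
    "\n"
    "Merge first line branch\n"
)

print(entry["files"])             # ['poem.txt', 'pome.txt']
print(entry["date"].isoformat())  # 2008-04-24T19:29:53+01:00
```

The date agrees with the one 'hg glog' showed for changeset 5.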

The nodeid of the changeset, therefore, gives us a globally unique
identifier for any particular change.  Changesets have a
Subversion-like incrementing change number, but it is peculiar to that
repository. The nodeid, however, is global.

One more detail remains to complete the picture. How do we get back
from a particular file change to find the responsible changeset? Each
revlog change has a linkrev entry that does just this.

So, now we have a repository with a history of the changes applied to
that repository. Each change has a unique identifier. If we find that
change in another repository, it means that at the point in the other
repository we have exactly the same state; the file contents and
history are identical.

At this point we can see how pulling changes from another repository
works. Mercurial has to determine which changesets in the source
repository are missing in the target repository. To do this, for each
head in the source repo it has to find the most recent change in that
head that is already present in the target repo, and get any remaining
changes after that point. These changes are then copied over and
applied.
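
Here's that search as a Python sketch. It cheats by assuming we can
see the whole source graph at once, whereas the real thing has to
work this out across a network connection, and missing_changesets is
my own name for it. The ids and parents are those of the example
repository, with the target being the clone made before the branch
and merge.

```python
def missing_changesets(source_parents, source_heads, target_ids):
    """Walk back from each source head, stopping at changesets the target has."""
    missing = set()
    todo = list(source_heads)
    while todo:
        node = todo.pop()
        if node in target_ids or node in missing:
            continue  # the target has this one, and so all its ancestors
        missing.add(node)
        todo.extend(source_parents[node])
    return missing


source_parents = {
    "33596ef855c1": [],
    "3d65e7a57890": ["33596ef855c1"],
    "ff97668b7422": ["3d65e7a57890"],
    "a065eb26e6b9": ["ff97668b7422"],
    "267d32f158b3": ["3d65e7a57890"],
    "792ab970fc80": ["267d32f158b3", "a065eb26e6b9"],
}
target_ids = {"33596ef855c1", "3d65e7a57890", "ff97668b7422", "a065eb26e6b9"}

print(sorted(missing_changesets(source_parents, ["792ab970fc80"], target_ids)))
# ['267d32f158b3', '792ab970fc80']
```

The sketch only finds the missing changesets; applying them still has
to be done parents first.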

The Mercurial revlog format has proved remarkably durable. Over the
lifetime of Mercurial, there have been just two changes to the file
format. And one of those (a very recent change at the time of
writing, yet to appear in a release version) is a very small change to
filename storage required to deal with Windows-specific issues.
author	Jim Hague <jim.hague@icc-atcsolutions.com>
date	Sun, 21 Dec 2008 21:39:38 +0000