# HG changeset patch
# User Jim Hague
# Date 1228990527 0
# Node ID 48d338d29ce9d7e445f052b665822beb0b52c9cd
First comitted version.

diff -r 000000000000 -r 48d338d29ce9 Hg.txt
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/Hg.txt Thu Dec 11 10:15:27 2008 +0000
@@ -0,0 +1,568 @@
+Inside a distributed version control system
+===========================================
+
+Grinton Lodge is a Youth Hostel that sits on an exposed hillside just
+above the small hamlet of Grinton in Swaledale, in the Yorkshire Dales
+National Park. A former Victorian shooting lodge, it now welcomes
+walkers and other travellers from around the world.
+
+Tonight, a Wednesday in mid-November, is not one of its busiest
+nights. Kat, the duty staff member, tells me that there is a small
+corporate team-building group in the annex. There's no sign of them at
+present. Otherwise, that portion of the world that has beaten a path
+to the door of this grand building today consists of just me. And Kat
+goes home soon.
+
+The November CVu, removed from its wrappers and read yesterday, lies
+in my bag. Taunting me. Go on, it says, if you're ever going to put
+finger to keyboard in the name of CVu, well, tonight you are out of
+excuses.
+
+Bugger.
+
+Let's look into Mercurial
+-------------------------
+
+Mercurial is a Distributed Version Control System (DVCS). It's one of a
+number of DVCSs that have gained significant popularity in the
+last few years. I switched a significant work project over to Mercurial
+(from Subversion) over a year ago, because a customer site required
+on-site work but could not allow access back to the company VPN. I
+chose Mercurial for a variety of reasons which I won't bore you with
+here. If you must know, see the box.
+
+What I want to do in this article is give you an insight into how a
+DVCS works. OK, so specifically I'm going to be talking about
+Mercurial, but Git and Bazaar attack the problem in a similar way.
+But first I'd better give you some idea of how you use Mercurial.
+
+::::
+Box: OK, if you must know:
+
+o Implementability. I needed the system to work on Windows, Linux and
+AIX. The latter was not one of the directly supported platforms for
+any of the candidates. Git's implementation uses a horde of
+tools. Bazaar requires only Python, but required Python 2.4 while IBM
+stubbornly still supplies only Python 2.3. Mercurial requires Python
+2.3 or greater, and uses some C for speed.
+
+o Simplicity. From the command line, Mercurial's core operations will
+be familiar to a Subversion user. This is also true of Bazaar, but was
+less true of Git. Git has improved in this matter since then, but a Mr
+Winder of this parish tells me that it's still possible to seriously
+embarrass yourself. There was also a lack of Windows support for Git at
+the time.
+
+o Speed. Mercurial is fast. In the same ballpark as Git. Bazaar
+wasn't, and although it has improved significantly, has, in my
+estimation, added user complexity in the process, and is still off the
+pace for some operations.
+
+o Documentation. At the time, Bryan O'Sullivan's excellent Mercurial
+book (http://hgbook.red-bean.com) was a clear winner for best
+documentation.
+::::
+
+The 5 minute Mercurial overview
+-------------------------------
+
+I think it unlikely that someone possessing the taste and discernment
+to be reading CVu would not be familiar with at least one version
+control system. So, while I want to give you a flavour of what it's
+like to use, I'm not going to hang about. If you'd like a proper
+introduction, or you don't follow something, I thoroughly recommend
+you consult the Mercurial book.
+
+To start using Mercurial to keep track of a project:
+
+$ hg init
+$
+
+This creates the repository root in the current directory.
+
+Like CVS with its CVS directory and Subversion with its .svn
+directory, Mercurial keeps its private data in a directory.
+Mercifully there is only one of these, in the top level of your project. And
+rather than holding details of where the actual repository is to be
+found, the .hg directory holds the entire repository.
+
+Next you need to specify the files you want Mercurial to track.
+
+$ echo "There was a gibbon one morning" > pome.txt
+$ hg add pome.txt
+$
+
+As you might expect, this marks the files as to be added. And as you
+might also expect, you need to commit to record the added files in the
+repository. The commit comment can be supplied on the command line; if
+you don't supply a comment, you'll be dropped into an editor to
+provide one.
+
+There is a suggested format for these messages - a one line summary
+followed by any more required detail on following lines. By default
+Mercurial will only display the first line of commit messages when
+listing changes. In these examples I'll stick to terse messages, and
+I'll enter them from the command line.
+
+$ hg commit -m "My Pome" -u "Jim Hague "
+$
+
+Mercurial records the user making the change as part of the change
+information. It is usual to give your name and email address as I've
+done here. You can imagine, though, that constantly having to repeat
+this is a bit tedious, so you can set a default user name in a
+configuration file. Mercurial keeps global, user and repository
+configurations, and it can go in any of those.
+
+As with Subversion, after further edits you can see how your working
+copy differs from the repository.
+
+$ hg status
+M pome.txt
+$ hg diff
+diff -r 33596ef855c1 pome.txt
+--- a/pome.txt Wed Apr 23 22:36:33 2008 +0100
++++ b/pome.txt Wed Apr 23 22:48:01 2008 +0100
+@@ -1,1 +1,2 @@ There was a gibbon one morning
+ There was a gibbon one morning
++said "I think I will fly to the moon".
+$ hg commit -m "A great second line"
+$
+
+And look through a log of changes.
+
+$ hg log
+changeset:   1:3d65e7a57890
+tag:         tip
+user:        Jim Hague
+date:        Wed Apr 23 22:49:10 2008 +0100
+summary:     A great second line
+
+changeset:   0:33596ef855c1
+user:        Jim Hague
+date:        Wed Apr 23 22:36:33 2008 +0100
+summary:     My Pome
+
+$
+
+There are some items here that need an explanation.
+
+The changeset identifier is in fact two identifiers separated by a
+colon. The first is the sequence number of the changeset in the
+repository, and is directly comparable to the change number in a
+Subversion repository. The second is a globally unique identifier for
+that change. As the change is copied from one repository to another
+(this is a distributed system, remember, even if we haven't come to
+that bit yet), its sequence number in any particular repository will
+change, but the global identifier will always remain the same.
+
+'tip' is a Mercurial term. It means simply the most recent change.
+
+Want to rename a file?
+
+$ hg mv pome.txt poem.txt
+$ hg status
+A poem.txt
+R pome.txt
+$ hg commit -m "Rename my file"
+$
+
+(The command to rename a file is actually 'hg rename', but Mercurial
+saves Unix-trained fingers from typing embarrassment.)
+
+At this point you may be wondering about directories. 'hg mkdir'
+perhaps? Well, no. Mercurial only tracks files. To be sure, the
+directory a file occupies is tracked, but effectively only as a
+component of the file name. This has the slightly unexpected result
+that you can't record an empty directory in your repository.
+(Footnote: I tripped over this converting a work Subversion
+repository. One possibility is to create a placeholder file in the
+directory. In the event I created the directory (which receives build
+products) as part of the build instead.)
+
+Given this, and the status output above that suggests strongly that
+Mercurial treats a rename as a copy followed by a delete, you may be
+worried that Mercurial won't cope at all well with rearranging your
+repository. Relax.
+Mercurial does store the details of the rename as
+part of the changeset, and copes very well with rearrangements.
+
+(Footnote: The Mercurial designers justify not dealing with
+directories as first class objects by pointing out that provided you
+can correctly move files about in the tree, the other reasons for
+tracking directories are uncommon and do not in their opinion justify
+the considerable added complexity. So far I've found no reason to
+doubt that judgement.)
+
+Want to rewind the working copy to a previous revision?
+
+$ hg update -r 1
+1 files updated, 0 files merged, 1 files removed, 0 files unresolved
+$
+
+'hg update' updates the working files. In this case I'm specifying
+that I want to go back to local changeset 1. I could also have typed
+'-r 3d65e7a57890', or even '-r 3d'; when specifying the global change
+identifier you only need to type enough digits to make it unique.
+
+This is all very well, but it's not exactly distributed, is it?
+
+Copy an existing repository:
+
+elsewhere$ hg clone ssh://jim.home.net/Poem Jim-Poem
+updating working directory
+1 files updated, 0 files merged, 0 files removed, 0 files unresolved
+
+(You can access other repositories via the file system, over http or
+over ssh).
+
+elsewhere$ cd Jim-Poem
+elsewhere$ hg log
+changeset:   3:a065eb26e6b9
+tag:         tip
+user:        Jim Hague
+date:        Thu Apr 24 18:52:31 2008 +0100
+summary:     Rename my file
+
+changeset:   2:ff97668b7422
+user:        Jim Hague
+date:        Thu Apr 24 18:50:22 2008 +0100
+summary:     Finished first verse
+
+changeset:   1:3d65e7a57890
+user:        Jim Hague
+date:        Wed Apr 23 22:49:10 2008 +0100
+summary:     A great second line
+
+changeset:   0:33596ef855c1
+user:        Jim Hague
+date:        Wed Apr 23 22:36:33 2008 +0100
+summary:     My Pome
+
+'hg clone' is aptly named. It creates a new repository that contains
+exactly the same changes as the source repository. You can make a
+clone just by copying your project directory, if you're confident
+nothing else will access it during the copy.
+'hg clone' saves you this
+worry, and sets the default push/pull location in the new repo to the
+cloned repo.
+
+From that point, you use 'hg pull' to collect changes from other
+places into your repo (though note it does not by default update your
+working copy), and, as you might guess, 'hg push' shoves your changes
+into a foreign repository. By default these will act on the repository
+you cloned from, but you can specify any other repository.
+
+More on those in a moment. First, though, I want to show you something
+you can't do in Subversion. Start with the repository with 4 changes
+we just cloned. Let's focus on the first couple of lines.
+
+$ hg update -r 1
+1 files updated, 0 files merged, 1 files removed, 0 files unresolved
+
+And make a change.
+
+$ hg diff
+diff -r 3d65e7a57890 pome.txt
+--- a/pome.txt Wed Apr 23 22:49:10 2008 +0100
++++ b/pome.txt Thu Apr 24 19:13:14 2008 +0100
+@@ -1,2 +1,2 @@ There was a gibbon one morning
+-There was a gibbon one morning
+-said "I think I will fly to the moon".
++There was a baboon who one afternoon
++said "I think I will fly to the sun".
+$ hg commit -m "Better first two lines"
+$
+
+The alert among you will have sat up at that. Well done! Yes, there's
+something very worrying. How can I commit a change at an old point?
+If you try this in Subversion, it will complain mightily about your
+file being out of date. But Mercurial just went ahead and did
+something. The Bazaar experts among you will know that in Bazaar, if
+you use 'bzr revert -r' to bring the working copy to a past revision,
+make a change and commit, then your latest version will be the past
+revision plus your change. Perhaps that's what Mercurial did?
+
+No. What Mercurial did is central to Mercurial's view of the
+world. You took your working copy back to an old changeset, and then
+committed a fresh change based at that changeset. Mercurial actually
+did just what you asked it to do, no more and no less. Let's see the
+initial evidence.
+
+$ hg heads
+changeset:   4:267d32f158b3
+tag:         tip
+parent:      1:3d65e7a57890
+user:        Jim Hague
+date:        Thu Apr 24 19:13:59 2008 +0100
+summary:     Better first two lines
+
+changeset:   3:a065eb26e6b9
+user:        Jim Hague
+date:        Thu Apr 24 18:52:31 2008 +0100
+summary:     Rename my file
+
+$
+
+Time for some more Mercurial terminology. You can think of a 'head' in
+Mercurial as the most recent change on a branch. In Mercurial, a
+branch is simply what happens when you commit a change that has as its
+parent a change that already has a child. Mercurial has a standard
+extension 'hg glog' which uses some ASCII art to show the current
+state:
+
+$ hg glog
+@  changeset:   4:267d32f158b3
+|  tag:         tip
+|  parent:      1:3d65e7a57890
+|  user:        Jim Hague
+|  date:        Thu Apr 24 19:13:59 2008 +0100
+|  summary:     Better first two lines
+|
+| o  changeset:   3:a065eb26e6b9
+| |  user:        Jim Hague
+| |  date:        Thu Apr 24 18:52:31 2008 +0100
+| |  summary:     Rename my file
+| |
+| o  changeset:   2:ff97668b7422
+|/   user:        Jim Hague
+|    date:        Thu Apr 24 18:50:22 2008 +0100
+|    summary:     Finished first verse
+|
+o  changeset:   1:3d65e7a57890
+|  user:        Jim Hague
+|  date:        Wed Apr 23 22:49:10 2008 +0100
+|  summary:     A great second line
+|
+o  changeset:   0:33596ef855c1
+   user:        Jim Hague
+   date:        Wed Apr 23 22:36:33 2008 +0100
+   summary:     My Pome
+
+$
+
+'hg view' shows a nicer graphical view. (Footnote: Though, being
+Tcl/Tk based, not that much nicer.)
+
+So the change is in there. It's the latest change, and is simply on a
+different branch to the other changes.
+
+Almost invariably, you will want to bring everything back together and
+merge the branches. A merge is a change that combines two heads back
+into one. It prepares an updated working directory with the merged
+contents of the two heads for you to review and, if satisfactory, commit.
+
+$ hg merge
+merging pome.txt and poem.txt
+0 files updated, 1 files merged, 0 files removed, 0 files unresolved
+(branch merge, don't forget to commit)
+$ cat poem.txt
+There was a baboon who one afternoon
+said "I think I will fly to the sun".
+So with two great palms strapped to his arms,
+he started his takeoff run.
+$ hg commit -m "Merge first line branch"
+$
+
+(Footnote: I'm no poet. The poem is, of course, 'Silly Old Baboon' by
+the late, great, Spike Milligan.)
+
+Here's the ASCII art again showing what just happened. Oh, and notice
+that Mercurial has done the right thing with regard to the rename.
+
+$ hg glog
+@    changeset:   5:792ab970fc80
+|\   tag:         tip
+| |  parent:      4:267d32f158b3
+| |  parent:      3:a065eb26e6b9
+| |  user:        Jim Hague
+| |  date:        Thu Apr 24 19:29:53 2008 +0100
+| |  summary:     Merge first line branch
+| |
+| o  changeset:   4:267d32f158b3
+| |  parent:      1:3d65e7a57890
+| |  user:        Jim Hague
+| |  date:        Thu Apr 24 19:13:59 2008 +0100
+| |  summary:     Better first two lines
+| |
+o |  changeset:   3:a065eb26e6b9
+| |  user:        Jim Hague
+| |  date:        Thu Apr 24 18:52:31 2008 +0100
+| |  summary:     Rename my file
+| |
+o |  changeset:   2:ff97668b7422
+|/   user:        Jim Hague
+|    date:        Thu Apr 24 18:50:22 2008 +0100
+|    summary:     Finished first verse
+|
+o  changeset:   1:3d65e7a57890
+|  user:        Jim Hague
+|  date:        Wed Apr 23 22:49:10 2008 +0100
+|  summary:     A great second line
+|
+o  changeset:   0:33596ef855c1
+   user:        Jim Hague
+   date:        Wed Apr 23 22:36:33 2008 +0100
+   summary:     My Pome
+
+$
+
+So, our little branch change has now been merged back, and we have a
+single line of development again. Notice that unlike the other
+changesets, changeset 5 has two parent changesets, indicating it is a
+merge changeset. You can only merge two branches in one operation; or
+putting it another way, a changeset can have a maximum of two parents.
+
+This behaviour is absolutely central to Mercurial's philosophy.
+If a change is committed that takes as its starting point a change that
+already has a child, then a branch gets created. Working with
+Mercurial, branches get created frequently, and equally frequently
+merged back. As befits any frequent operation, both are easy to do.
+
+You're probably thinking at this point that making a commit onto
+an old version is a slightly strange thing to do, and you'd be right.
+But that's exactly what's going to happen the moment you go
+distributed. Two people working independently with their own
+repositories are going to make commits based, typically, on the latest
+changes they happen to have incorporated into their tree. To be
+Distributed, a DVCS has to deal with this. Mercurial faces it head-on.
+When you pull changes into your repo (or someone else pushes them), if
+any of the changes overlap - are both based on the same base change -
+you get extra heads, and it's up to you to let these extra heads live
+or merge, as you please.
+
+In practice this is more manageable than you might think. Consider a
+typical Mercurial usage, where the 'master' repo sits on a known
+server, and everyone pulls changes from the master and pushes their
+own efforts to the master. By default Mercurial won't let you push if
+the receiving repo will gain an extra head as a result, so you
+typically pull (and do any required merging) just before
+pushing. Subversion users will recognise this pattern. Subversion
+won't let you commit a change if your working copy is not at the very
+latest revision, so the Subversion user will update, and merge if
+necessary, just before committing.
+
+What, then, about a branch in the conventional sense of '1.0
+maintenance branch'? Typically in Mercurial you'd handle this by
+keeping a separate cloned repository for those changes. Cloning is
+fast, and if local uses hard links where possible on filesystems that
+support them, so isn't necessarily extravagant on disc space.
+You can, if you prefer, handle them all in a single repo with 'named
+branches', but cloning is definitely simpler.
+
+OK, so now you know the basics of using Mercurial. We can proceed to
+looking at how this magic is achieved. In particular, where does this
+magic globally unique identifier for a change come from?
+
+Inside the Mercurial repo
+-------------------------
+
+The way Mercurial handles its repo is really quite simple.
+
+That's simple, as in 'most things are simple once you know the
+answer'. I found the explanation helpful, so this section attempts
+the 10,000ft (FL100 if you prefer) view of Mercurial.
+
+(Footnote: Bryan O'Sullivan's excellent Mercurial book has a chapter
+on the subject, and the Mercurial website has a fair amount of detail
+too. This is 'research', OK?)
+
+First remember that any file or component can only have one or two
+parents. You can't merge more than one other branch at once.
+
+We start with the basic building block, which Mercurial calls a
+revlog. A revlog is a thing that holds a file and all the changes in
+the file history. (Footnote: For any non-trivial file, this will
+actually be two files on the disc, a data file and an index). The
+revlog stores the (compressed) differences between successive versions
+of the file, though it will periodically store a complete version of
+the file instead of a difference, so that the content of any
+particular file version can always be reconstructed without excessive
+effort.
+
+Under the secret-squirrel Mercurial .hg directory at the top of your
+project is a store which holds a revlog for each file in your project.
+
+Any point in the evolution of a revlog can be uniquely identified with
+a nodeid. This is simply the SHA1 hash of the current file contents
+concatenated with the nodeids of one or both parents of the current
+revision. Note that this way, two file states are identical if and
+only if the file contents are the same *and* the file has the
+same history.
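The nodeid calculation just described can be sketched in a few lines of
Python. This is an illustration, not Mercurial's own code. The byte layout
is an assumption on my part: the two parent nodeids are sorted and hashed
first, then the contents, with twenty zero bytes standing in for a missing
parent. The names node_id and NULL_ID are made up for the example.

```python
import hashlib

# Assumed stand-in for "no parent"; revlog dumps show it as 000000000000.
NULL_ID = b"\0" * 20


def node_id(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> bytes:
    """Hash the contents together with both parent ids (assumed layout)."""
    first, second = sorted((p1, p2))  # sorting makes parent order irrelevant
    digest = hashlib.sha1(first)
    digest.update(second)
    digest.update(text)
    return digest.digest()


line1 = b"There was a gibbon one morning\n"
root = node_id(line1)             # first revision: no parents
child = node_id(line1, p1=root)   # same contents, different history

# Print the abbreviated 12-digit form that the listings use.
print(root.hex()[:12])
print(child.hex()[:12])
```

The two printed identifiers differ even though the contents are the same,
which is the point of the paragraph above: the history is part of the
identity.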
+
+Here's a dump of a revlog index:
+
+$ hg debugindex .hg/store/data/pome.txt.i
+   rev    offset  length   base linkrev nodeid       p1           p2
+     0         0      32      0       0 6bbbd5d6cc53 000000000000 000000000000
+     1        32      51      0       1 83d266583303 6bbbd5d6cc53 000000000000
+     2        83      84      0       2 14a54ec34bb6 83d266583303 000000000000
+     3       167      76      3       4 dc4df776b38b 83d266583303 000000000000
+$
+
+Note here that a file state can have two parents. If both the parent
+nodeids are non-null, the file state has two parents, and the state is
+therefore the result of a merge.
+
+Let's dump out a revlog at a particular revision:
+
+$ hg debugdata .hg/store/data/pome.txt.i 2
+There was a gibbon one morning
+said "I think I will fly to the moon".
+So with two great palms strapped to his arms,
+he started his takeoff run.
+$
+
+The next component is the manifest. This is simply a list of all the
+files in the project, together with their current nodeids. The
+manifest is a file, held in a revlog. The nodeid of the manifest,
+therefore, identifies the project filesystem at a particular point.
+
+$ hg debugdata .hg/store/00manifest.i 5
+poem.txt5168b1a5e2f44aa4e0f164e170820845183f50c8
+$
+
+Finally we have the changeset. This is the atomic collection of
+changes to a repository that leads to a new revision. The changeset
+info includes the nodeid of the corresponding manifest, the timestamp
+and committer ID, a list of changed files and a comment. The changeset
+also includes the nodeid of the parent changeset, or the two parents
+if the change is a merge. The changeset description is held in a
+revlog, the changelog.
+
+$ hg debugdata .hg/store/00changelog.i 5
+1ccc11b6f7308cc8fa1573c2f3811a4710c91e3e
+Jim Hague
+1209061793 -3600
+poem.txt
+pome.txt
+
+Merge first line branch
+$
+
+The nodeid of the changeset, therefore, gives us a globally unique
+identifier for any particular change. Changesets have a
+Subversion-like incrementing change number, but it is peculiar to that
+repository. The nodeid, however, is global.
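To make the layout of that changelog entry concrete, here is a small
Python sketch that picks the dump apart. It assumes only what the listing
shows: manifest nodeid, committer, a timestamp line, the changed files, a
blank line, then the comment. Reading the second number on the timestamp
line as seconds west of UTC is my inference from the '+0100' dates in the
log output. Changeset and parse_changeset are names invented for the
example.

```python
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple


class Changeset(NamedTuple):
    manifest: str      # nodeid of the manifest for this change
    user: str          # committer ID
    when: datetime     # commit time in the committer's timezone
    files: List[str]   # files touched by the change
    description: str   # commit comment


def parse_changeset(raw: str) -> Changeset:
    """Split a dumped changelog entry into its parts (layout assumed)."""
    header, _, description = raw.partition("\n\n")
    manifest, user, stamp, *files = header.split("\n")
    seconds, offset = stamp.split()[:2]
    # Assumption: the offset counts seconds west of UTC, so -3600 is +0100.
    zone = timezone(timedelta(seconds=-int(offset)))
    when = datetime.fromtimestamp(int(seconds), zone)
    return Changeset(manifest, user, when, files, description.strip())


RAW = (
    "1ccc11b6f7308cc8fa1573c2f3811a4710c91e3e\n"
    "Jim Hague\n"
    "1209061793 -3600\n"
    "poem.txt\n"
    "pome.txt\n"
    "\n"
    "Merge first line branch\n"
)

entry = parse_changeset(RAW)
print(entry.when.isoformat())  # 2008-04-24T19:29:53+01:00
print(entry.files)             # ['poem.txt', 'pome.txt']
```

The decoded time agrees with the date 'hg glog' reports for changeset 5.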
+
+One more detail remains to complete the picture. How do we get back
+from a particular file change to find the responsible changeset? Each
+revlog change has a linkrev entry that does just this.
+
+So, now we have a repository with a history of the changes applied to
+that repository. Each change has a unique identifier. If we find that
+change in another repository, it means that at that point in the other
+repository we have exactly the same state; the file contents and
+history are identical.
+
+At this point we can see how pulling changes from another repository
+works. Mercurial has to determine which changesets in the source
+repository are missing in the target repository. To do this, for each
+head in the source repo it has to find the most recent change in that
+head that is already present in the target repo, and get any remaining
+changes after that point. These changes are then copied over and
+applied.
+
+The Mercurial revlog format has proved remarkably durable. Over the
+lifetime of Mercurial, there have been just two changes to the file
+format. And one of those (a very recent change at the time of
+writing, yet to appear in a release version) is a very small change to
+filename storage required to deal with Windows-specific issues.
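The pull procedure described a couple of paragraphs back can be modelled
as a walk over the changeset graph. The Python below is a toy, and makes
no claim to be how Mercurial actually negotiates over the wire. It starts
at each source head, walks back through the parents until it meets a
change the target already holds, and returns what is left with parents
ahead of children. It leans on the property that a repository holding a
change also holds all of that change's ancestors. The function name and
the dictionary layout are invented for the example; the nodeids are the
abbreviated ones from the listings.

```python
def missing_changesets(source_parents, source_heads, known):
    """Return source changesets absent from `known`, parents before children."""
    missing = []
    seen = set()
    stack = [(head, False) for head in source_heads]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            # All missing ancestors of this node have been emitted already.
            missing.append(node)
            continue
        if node in known or node in seen:
            continue  # already in the target, or already being handled
        seen.add(node)
        stack.append((node, True))
        for parent in source_parents[node]:
            stack.append((parent, False))
    return missing


# The source repo after the merge: each changeset mapped to its parents.
SOURCE = {
    "33596ef855c1": [],
    "3d65e7a57890": ["33596ef855c1"],
    "ff97668b7422": ["3d65e7a57890"],
    "a065eb26e6b9": ["ff97668b7422"],
    "267d32f158b3": ["3d65e7a57890"],
    "792ab970fc80": ["267d32f158b3", "a065eb26e6b9"],
}

# A target cloned when the repo held only the first four changes.
TARGET = {"33596ef855c1", "3d65e7a57890", "ff97668b7422", "a065eb26e6b9"}

to_pull = missing_changesets(SOURCE, ["792ab970fc80"], TARGET)
print(to_pull)  # ['267d32f158b3', '792ab970fc80']
```

Only the branch change and the merge need to be copied, and they arrive in
an order in which each can be applied on top of changes already present.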