First comitted version.
author | Jim Hague <jim.hague@acm.org> |
---|---|
date | Thu, 11 Dec 2008 10:15:27 +0000 |
Inside a distributed version control system
===========================================

Grinton Lodge is a Youth Hostel that sits on an exposed hillside just above the small hamlet of Grinton in Swaledale, in the Yorkshire Dales National Park. A former Victorian shooting lodge, it now welcomes walkers and other travellers from around the world.

Tonight, a Wednesday in mid-November, is not one of its busiest nights. Kat, the duty staff member, tells me that there is a small corporate team-building group in the annex. There's no sign of them at present. Otherwise, that portion of the world that has beaten a path to the door of this grand building today consists of just me. And Kat goes home soon.

The November CVu, removed from its wrappers and read yesterday, lies in my bag. Taunting me. Go on, it says, if you're ever going to put finger to keyboard in the name of CVu, well, tonight you are out of excuses.

Bugger.

Let's look into Mercurial
-------------------------

Mercurial is a Distributed Version Control System (DVCS). It's one of a number of DVCSs that have gained significant popularity in the last few years. I switched a significant work project over to Mercurial (from Subversion) over a year ago, because a customer site required on-site work but could not allow access back to the company VPN.

I chose Mercurial for a variety of reasons which I won't bore you with here. If you must know, see the box.

What I want to do in this article is give you an insight into how a DVCS works. OK, so specifically I'm going to be talking about Mercurial, but Git and Bazaar attack the problem in a similar way. But first I'd better give you some idea of how you use Mercurial.

:::: Box: OK, if you must know:

o Implementability. I needed the system to work on Windows, Linux and AIX. The latter was not one of the directly supported platforms for any of the candidates. Git's implementation uses a horde of tools.
  Bazaar requires only Python, but required Python 2.4 while IBM stubbornly still supplies only Python 2.3. Mercurial requires Python 2.3 or greater, and uses some C for speed.

o Simplicity. From the command line, Mercurial's core operations will be familiar to a Subversion user. This is also true of Bazaar, but was less true of Git. Git has improved in this matter since then, but a Mr Winder of this parish tells me that it's still possible to seriously embarrass yourself. There was also a lack of Windows support for Git at the time.

o Speed. Mercurial is fast. In the same ballpark as Git. Bazaar wasn't, and although it has improved significantly it has, in my estimation, added user complexity in the process, and it is still off the pace for some operations.

o Documentation. At the time, Bryan O'Sullivan's excellent Mercurial book (http://hgbook.red-bean.com) was a clear winner for best documentation.

::::

The 5 minute Mercurial overview
-------------------------------

I think it unlikely that someone possessing the taste and discernment to be reading CVu would not be familiar with at least one version control system. So, while I want to give you a flavour of what it's like to use, I'm not going to hang about. If you'd like a proper introduction, or you don't follow something, I thoroughly recommend you consult the Mercurial book.

To start using Mercurial to keep track of a project:

    $ hg init
    $

This creates the repository root in the current directory.

Like CVS with its CVS directory and Subversion with its .svn directory, Mercurial keeps its private data in a directory, .hg. Mercifully there is only one of these, in the top level of your project. And rather than holding details of where the actual repository is to be found, the .hg directory holds the entire repository.

Next you need to specify the files you want Mercurial to track.

    $ echo "There was a gibbon one morning" > pome.txt
    $ hg add pome.txt
    $

As you might expect, this marks the files as to be added.
And as you might also expect, you need to commit to record the added files in the repository. The commit comment can be supplied on the command line; if you don't supply a comment, you'll be dropped into an editor to provide one.

There is a suggested format for these messages - a one line summary followed by any more required detail on following lines. By default Mercurial will only display the first line of commit messages when listing changes. In these examples I'll stick to terse messages, and I'll enter them from the command line.

    $ hg commit -m "My Pome" -u "Jim Hague <jim.hague@acm.org>"
    $

Mercurial records the user making the change as part of the change information. It is usual to give your name and email address as I've done here. You can imagine, though, that constantly having to repeat this is a bit tedious, so you can set a default user name in a configuration file. Mercurial keeps global, user and repository configurations, and the setting can go in any of those.

As with Subversion, after further edits you can see how your working copy differs from the repository.

    $ hg status
    M pome.txt
    $ hg diff
    diff -r 33596ef855c1 pome.txt
    --- a/pome.txt Wed Apr 23 22:36:33 2008 +0100
    +++ b/pome.txt Wed Apr 23 22:48:01 2008 +0100
    @@ -1,1 +1,2 @@ There was a gibbon one morning
     There was a gibbon one morning
    +said "I think I will fly to the moon".
    $ hg commit -m "A great second line"
    $

And look through a log of changes.

    $ hg log
    changeset:   1:3d65e7a57890
    tag:         tip
    user:        Jim Hague <jim.hague@acm.org>
    date:        Wed Apr 23 22:49:10 2008 +0100
    summary:     A great second line

    changeset:   0:33596ef855c1
    user:        Jim Hague <jim.hague@acm.org>
    date:        Wed Apr 23 22:36:33 2008 +0100
    summary:     My Pome

    $

There are some items here that need an explanation.

The changeset identifier is in fact two identifiers separated by a colon. The first is the sequence number of the changeset in the repository, and is directly comparable to the change number in a Subversion repository.
The second is a globally unique identifier for that change. As the change is copied from one repository to another (this is a distributed system, remember, even if we haven't come to that bit yet), its sequence number in any particular repository will change, but the global identifier will always remain the same.

'tip' is a Mercurial term. It means simply the most recent change.

Want to rename a file?

    $ hg mv pome.txt poem.txt
    $ hg status
    A poem.txt
    R pome.txt
    $ hg commit -m "Rename my file"
    $

(The command to rename a file is actually 'hg rename', but Mercurial saves Unix-trained fingers from typing embarrassment.)

At this point you may be wondering about directories. 'hg mkdir' perhaps? Well, no. Mercurial only tracks files. To be sure, the directory a file occupies is tracked, but effectively only as a component of the file name. This has the slightly unexpected result that you can't record an empty directory in your repository.

(Footnote: I tripped over this converting a work Subversion repository. One possibility is to create a placeholder file in the directory. In the event I created the directory (which receives build products) as part of the build instead.)

Given this, and the status output above that suggests strongly that Mercurial treats a rename as a copy followed by a delete, you may be worried that Mercurial won't cope at all well with rearranging your repository. Relax. Mercurial does store the details of the rename as part of the changeset, and copes very well with rearrangements.

(Footnote: The Mercurial designers justify not dealing with directories as first class objects by pointing out that provided you can correctly move files about in the tree, the other reasons for tracking directories are uncommon and do not in their opinion justify the considerable added complexity. So far I've found no reason to doubt that judgement.)

Want to rewind the working copy to a previous revision?
    $ hg update -r 1
    1 files updated, 0 files merged, 1 files removed, 0 files unresolved
    $

'hg update' updates the working files. In this case I'm specifying that I want to go back to local changeset 1. I could also have typed '-r 3d65e7a57890', or even '-r 3d'; when specifying the global change identifier you only need to type enough digits to make it unique.

This is all very well, but it's not exactly distributed, is it?

Copy an existing repository:

    elsewhere$ hg clone ssh://jim.home.net/Poem Jim-Poem
    updating working directory
    1 files updated, 0 files merged, 0 files removed, 0 files unresolved

(You can access other repositories via the file system, over http or over ssh.)

    elsewhere$ cd Jim-Poem
    elsewhere$ hg log
    changeset:   3:a065eb26e6b9
    tag:         tip
    user:        Jim Hague <jim.hague@acm.org>
    date:        Thu Apr 24 18:52:31 2008 +0100
    summary:     Rename my file

    changeset:   2:ff97668b7422
    user:        Jim Hague <jim.hague@acm.org>
    date:        Thu Apr 24 18:50:22 2008 +0100
    summary:     Finished first verse

    changeset:   1:3d65e7a57890
    user:        Jim Hague <jim.hague@acm.org>
    date:        Wed Apr 23 22:49:10 2008 +0100
    summary:     A great second line

    changeset:   0:33596ef855c1
    user:        Jim Hague <jim.hague@acm.org>
    date:        Wed Apr 23 22:36:33 2008 +0100
    summary:     My Pome

'hg clone' is aptly named. It creates a new repository that contains exactly the same changes as the source repository. You can make a clone just by copying your project directory, if you're confident nothing else will access it during the copy. 'hg clone' saves you this worry, and sets the default push/pull location in the new repo to the cloned repo.

From that point, you use 'hg pull' to collect changes from other places into your repo (though note it does not by default update your working copy), and, as you might guess, 'hg push' shoves your changes into a foreign repository. By default these will act on the repository you cloned from, but you can specify any other repository.

More on those in a moment.
First, though, I want to show you something you can't do in Subversion. Start with the repository with 4 changes we just cloned. Let's focus on the first couple of lines.

    $ hg update -r 1
    1 files updated, 0 files merged, 1 files removed, 0 files unresolved

And make a change.

    $ hg diff
    diff -r 3d65e7a57890 pome.txt
    --- a/pome.txt Wed Apr 23 22:49:10 2008 +0100
    +++ b/pome.txt Thu Apr 24 19:13:14 2008 +0100
    @@ -1,2 +1,2 @@ There was a gibbon one morning
    -There was a gibbon one morning
    -said "I think I will fly to the moon".
    +There was a baboon who one afternoon
    +said "I think I will fly to the sun".
    $ hg commit -m "Better first two lines"
    $

The alert among you will have sat up at that. Well done! Yes, there's something very worrying. How can I commit a change at an old point? If you try this in Subversion, it will complain mightily about your file being out of date. But Mercurial just went ahead and did something.

The Bazaar experts among you will know that in Bazaar, if you use 'bzr revert -r' to bring the working copy to a past revision, make a change and commit, then your latest version will be the past revision plus your change. Perhaps that's what Mercurial did?

No. What Mercurial did is central to Mercurial's view of the world. You took your working copy back to an old changeset, and then committed a fresh change based at that changeset. Mercurial actually did just what you asked it to do, no more and no less. Let's see the initial evidence.

    $ hg heads
    changeset:   4:267d32f158b3
    tag:         tip
    parent:      1:3d65e7a57890
    user:        Jim Hague <jim.hague@acm.org>
    date:        Thu Apr 24 19:13:59 2008 +0100
    summary:     Better first two lines

    changeset:   3:a065eb26e6b9
    user:        Jim Hague <jim.hague@acm.org>
    date:        Thu Apr 24 18:52:31 2008 +0100
    summary:     Rename my file

    $

Time for some more Mercurial terminology. You can think of a 'head' in Mercurial as the most recent change on a branch.
In Mercurial, a branch is simply what happens when you commit a change that has as its parent a change that already has a child. Mercurial has a standard extension 'hg glog' which uses some ASCII art to show the current state:

    $ hg glog
    @  changeset:   4:267d32f158b3
    |  tag:         tip
    |  parent:      1:3d65e7a57890
    |  user:        Jim Hague <jim.hague@acm.org>
    |  date:        Thu Apr 24 19:13:59 2008 +0100
    |  summary:     Better first two lines
    |
    | o  changeset:   3:a065eb26e6b9
    | |  user:        Jim Hague <jim.hague@acm.org>
    | |  date:        Thu Apr 24 18:52:31 2008 +0100
    | |  summary:     Rename my file
    | |
    | o  changeset:   2:ff97668b7422
    |/   user:        Jim Hague <jim.hague@acm.org>
    |    date:        Thu Apr 24 18:50:22 2008 +0100
    |    summary:     Finished first verse
    |
    o  changeset:   1:3d65e7a57890
    |  user:        Jim Hague <jim.hague@acm.org>
    |  date:        Wed Apr 23 22:49:10 2008 +0100
    |  summary:     A great second line
    |
    o  changeset:   0:33596ef855c1
       user:        Jim Hague <jim.hague@acm.org>
       date:        Wed Apr 23 22:36:33 2008 +0100
       summary:     My Pome

    $

'hg view' shows a nicer graphical view. (Footnote: Though, being Tcl/Tk based, not that much nicer.)

So the change is in there. It's the latest change, and is simply on a different branch to the other changes.

Almost invariably, you will want to bring everything back together and merge the branches. A merge is a change that combines two heads back into one. It prepares an updated working directory with the merged contents of the two heads for you to review and, if satisfactory, commit.

    $ hg merge
    merging pome.txt and poem.txt
    0 files updated, 1 files merged, 0 files removed, 0 files unresolved
    (branch merge, don't forget to commit)
    $ cat poem.txt
    There was a baboon who one afternoon
    said "I think I will fly to the sun".
    So with two great palms strapped to his arms,
    he started his takeoff run.
    $ hg commit -m "Merge first line branch"
    $

(Footnote: I'm no poet. The poem is, of course, 'Silly Old Baboon' by the late, great, Spike Milligan.)

Here's the ASCII art again showing what just happened.
Oh, and notice that Mercurial has done the right thing with regard to the rename.

    $ hg glog
    @    changeset:   5:792ab970fc80
    |\   tag:         tip
    | |  parent:      4:267d32f158b3
    | |  parent:      3:a065eb26e6b9
    | |  user:        Jim Hague <jim.hague@acm.org>
    | |  date:        Thu Apr 24 19:29:53 2008 +0100
    | |  summary:     Merge first line branch
    | |
    | o  changeset:   4:267d32f158b3
    | |  parent:      1:3d65e7a57890
    | |  user:        Jim Hague <jim.hague@acm.org>
    | |  date:        Thu Apr 24 19:13:59 2008 +0100
    | |  summary:     Better first two lines
    | |
    o |  changeset:   3:a065eb26e6b9
    | |  user:        Jim Hague <jim.hague@acm.org>
    | |  date:        Thu Apr 24 18:52:31 2008 +0100
    | |  summary:     Rename my file
    | |
    o |  changeset:   2:ff97668b7422
    |/   user:        Jim Hague <jim.hague@acm.org>
    |    date:        Thu Apr 24 18:50:22 2008 +0100
    |    summary:     Finished first verse
    |
    o  changeset:   1:3d65e7a57890
    |  user:        Jim Hague <jim.hague@acm.org>
    |  date:        Wed Apr 23 22:49:10 2008 +0100
    |  summary:     A great second line
    |
    o  changeset:   0:33596ef855c1
       user:        Jim Hague <jim.hague@acm.org>
       date:        Wed Apr 23 22:36:33 2008 +0100
       summary:     My Pome

    $

So, our little branch change has now been merged back, and we have a single line of development again. Notice that unlike the other changesets, changeset 5 has two parent changesets, indicating it is a merge changeset. You can only merge two branches in one operation; or putting it another way, a changeset can have a maximum of two parents.

This behaviour is absolutely central to Mercurial's philosophy. If a change is committed that takes as its starting point a change that already has a child, then a branch gets created. Working with Mercurial, branches get created frequently, and equally frequently merged back. As befits any frequent operation, both are easy to do.

You're probably thinking at this point that making a commit onto an old version is a slightly strange thing to do, and you'd be right. But that's exactly what's going to happen the moment you go distributed.
Two people working independently with their own repositories are going to make commits based, typically, on the latest changes they happen to have incorporated into their tree. To be Distributed, a DVCS has to deal with this. Mercurial faces it head-on. When you pull changes into your repo (or someone else pushes them), if any of the changes overlap - are both based on the same base change - you get extra heads, and it's up to you to let these extra heads live or merge, as you please.

In practice this is more manageable than you might think. Consider a typical Mercurial usage, where the 'master' repo sits on a known server, and everyone pulls changes from the master and pushes their own efforts to the master. By default Mercurial won't let you push if the receiving repo will gain an extra head as a result, so you typically pull (and do any required merging) just before pushing. Subversion users will recognise this pattern. Subversion won't let you commit a change if your working copy is not at the very latest revision, so the Subversion user will update, and merge if necessary, just before committing.

What, then, about a branch in the conventional sense of '1.0 maintenance branch'? Typically in Mercurial you'd handle this by keeping a separate cloned repository for those changes. Cloning is fast and, if local, uses hard links where possible on filesystems that support them, so it isn't necessarily extravagant on disc space. You can, if you prefer, handle them all in a single repo with 'named branches', but cloning is definitely simpler.

OK, so now you know the basics of using Mercurial. We can proceed to looking at how this magic is achieved. In particular, where does this magic globally unique identifier for a change come from?

Inside the Mercurial repo
-------------------------

The way Mercurial handles its repo is really quite simple. That's simple, as in 'most things are simple once you know the answer'.
I found the explanation helpful, so this section attempts the 10,000ft (FL100 if you prefer) view of Mercurial.

(Footnote: Bryan O'Sullivan's excellent Mercurial book has a chapter on the subject, and the Mercurial website has a fair amount of detail too. This is 'research', OK?)

First remember that any file or component can only have one or two parents. You can't merge more than one other branch at once.

We start with the basic building block, which Mercurial calls a revlog. A revlog is a thing that holds a file and all the changes in the file history. (Footnote: For any non-trivial file, this will actually be two files on the disc, a data file and an index.) The revlog stores the (compressed) differences between successive versions of the file, though it will periodically store a complete version of the file instead of a difference, so that the content of any particular file version can always be reconstructed without excessive effort.

Under the secret-squirrel Mercurial .hg directory at the top of your project is a store which holds a revlog for each file in your project.

Any point in the evolution of a revlog can be uniquely identified with a nodeid. This is simply the SHA1 hash of the current file contents concatenated with the nodeids of one or both parents of the current revision. Note that this way, two file states are identical if and only if the file contents are the same *and* the file has the same history.

Here's a dump of a revlog index:

    $ hg debugindex .hg/store/data/pome.txt.i
       rev    offset  length   base linkrev nodeid       p1           p2
         0         0      32      0       0 6bbbd5d6cc53 000000000000 000000000000
         1        32      51      0       1 83d266583303 6bbbd5d6cc53 000000000000
         2        83      84      0       2 14a54ec34bb6 83d266583303 000000000000
         3       167      76      3       4 dc4df776b38b 83d266583303 000000000000
    $

Note here that a file state can have two parents. If both the parent nodeids are non-null, the file state has two parents, and the state is therefore the result of a merge.
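The nodeid scheme described above can be sketched in a few lines of Python. This is an illustration rather than Mercurial's own code: the names `node_id` and `NULL_ID` are made up for the example, and the byte layout (both parent ids, sorted, followed by the file contents) is my assumption about the ordering. I have not checked that it reproduces the ids in the index dump.

```python
import hashlib

# Hypothetical name: the all-zero id that stands for "no parent".
NULL_ID = b"\x00" * 20


def node_id(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> bytes:
    """Hash the parent nodeids together with the file contents.

    Assumption: the parents are sorted first, so the result does not
    depend on which parent happens to be listed first.
    """
    first, second = sorted((p1, p2))
    digest = hashlib.sha1(first + second)
    digest.update(text)
    return digest.digest()


line1 = b"There was a gibbon one morning\n"

# A first revision has no parents at all.
root = node_id(line1)

# The same contents recorded on top of `root` get a different id,
# because the history is part of what is hashed.
same_text_new_history = node_id(line1, root)

print(root.hex()[:12], same_text_new_history.hex()[:12])
```

The point to take away is the last pair: identical contents with a different parent produce a different nodeid, which is what makes the id a statement about history as well as content.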
Let's dump out a revlog at a particular revision:

    $ hg debugdata .hg/store/data/pome.txt.i 2
    There was a gibbon one morning
    said "I think I will fly to the moon".
    So with two great palms strapped to his arms,
    he started his takeoff run.
    $

The next component is the manifest. This is simply a list of all the files in the project, together with their current nodeids. The manifest is a file, held in a revlog. The nodeid of the manifest, therefore, identifies the project filesystem at a particular point.

    $ hg debugdata .hg/store/00manifest.i 5
    poem.txt5168b1a5e2f44aa4e0f164e170820845183f50c8
    $

Finally we have the changeset. This is the atomic collection of changes to a repository that leads to a new revision. The changeset info includes the nodeid of the corresponding manifest, the timestamp and committer ID, a list of changed files and a comment. The changeset also includes the nodeid of the parent changeset, or the two parents if the change is a merge. The changeset description is held in a revlog, the changelog.

    $ hg debugdata .hg/store/00changelog.i 5
    1ccc11b6f7308cc8fa1573c2f3811a4710c91e3e
    Jim Hague <jim.hague@acm.org>
    1209061793 -3600
    poem.txt
    pome.txt

    Merge first line branch
    $

The nodeid of the changeset, therefore, gives us a globally unique identifier for any particular change. Changesets have a Subversion-like incrementing change number, but it is peculiar to that repository. The nodeid, however, is global.

One more detail remains to complete the picture. How do we get back from a particular file change to find the responsible changeset? Each revlog change has a linkrev entry that does just this.

So, now we have a repository with a history of the changes applied to that repository. Each change has a unique identifier. If we find that change in another repository, it means that at the point in the other repository we have exactly the same state; the file contents and history are identical.
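To make the changeset text less mysterious, here is a small Python sketch that picks apart the changelog entry dumped above. The `Changeset` class and `parse_changeset` function are hypothetical names for this example. The layout it assumes (manifest nodeid, committer, time and timezone, one changed file per line, a blank line, then the comment) is my reading of that dump, as is the interpretation of the timezone as seconds west of UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class Changeset:
    manifest: str        # nodeid of the manifest for this change
    user: str            # committer ID
    timestamp: int       # seconds since the Unix epoch
    tz_offset: int       # assumed: seconds west of UTC
    files: List[str]     # files changed by this changeset
    description: str     # the commit comment

    def local_time(self) -> datetime:
        """Commit time in the committer's own timezone."""
        zone = timezone(timedelta(seconds=-self.tz_offset))
        return datetime.fromtimestamp(self.timestamp, zone)


def parse_changeset(raw: str) -> Changeset:
    """Split a changelog entry into its fields (assumed layout)."""
    header, _, description = raw.partition("\n\n")
    lines = header.split("\n")
    timestamp, tz_offset = lines[2].split(" ")[:2]
    return Changeset(
        manifest=lines[0],
        user=lines[1],
        timestamp=int(timestamp),
        tz_offset=int(tz_offset),
        files=lines[3:],
        description=description.rstrip("\n"),
    )


RAW = (
    "1ccc11b6f7308cc8fa1573c2f3811a4710c91e3e\n"
    "Jim Hague <jim.hague@acm.org>\n"
    "1209061793 -3600\n"
    "poem.txt\n"
    "pome.txt\n"
    "\n"
    "Merge first line branch"
)

cs = parse_changeset(RAW)
print(cs.local_time().isoformat(), cs.files, cs.description)
```

Under those assumptions the decoded time comes out as 19:29:53 on 24 April 2008 at +0100, which agrees with the date 'hg glog' reported for changeset 5.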
At this point we can see how pulling changes from another repository works. Mercurial has to determine which changesets in the source repository are missing in the target repository. To do this, for each head in the source repo it has to find the most recent change in that head that is already present in the target repo, and get any remaining changes after that point. These changes are then copied over and applied.

The Mercurial revlog format has proved remarkably durable. Over the lifetime of Mercurial, there have been just two changes to the file format. And one of those (a very recent change at the time of writing, yet to appear in a release version) is a very small change to filename storage required to deal with Windows-specific issues.
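The walk back from each source head described above can be sketched as follows. This is a toy model in Python, not Mercurial's actual discovery protocol, which has to do the same job over a network without shipping the whole graph. The functions `heads` and `missing_changesets` are hypothetical names, and the history is held as a plain dictionary from changeset id to parent ids. The sketch relies on the assumption that a repository holding a changeset also holds all of its ancestors, so the walk can stop at the first changeset the target already knows.

```python
from typing import Dict, List, Set, Tuple

Parents = Tuple[str, ...]


def heads(dag: Dict[str, Parents]) -> Set[str]:
    """Changesets that no other changeset names as a parent."""
    has_child = {p for parents in dag.values() for p in parents}
    return set(dag) - has_child


def missing_changesets(source: Dict[str, Parents], known: Set[str]) -> List[str]:
    """Changesets in `source` that the target lacks, parents first."""
    missing: List[str] = []
    seen: Set[str] = set()

    def visit(node: str) -> None:
        # Stop at anything the target already has: its ancestors
        # are assumed to be present too.
        if node in known or node in seen:
            return
        seen.add(node)
        for parent in source[node]:
            visit(parent)
        missing.append(node)  # appended only after its parents

    for head in sorted(heads(source)):
        visit(head)
    return missing


# The poem repository after the merge, using the ids from the logs above.
SOURCE: Dict[str, Parents] = {
    "33596ef855c1": (),
    "3d65e7a57890": ("33596ef855c1",),
    "ff97668b7422": ("3d65e7a57890",),
    "a065eb26e6b9": ("ff97668b7422",),
    "267d32f158b3": ("3d65e7a57890",),
    "792ab970fc80": ("267d32f158b3", "a065eb26e6b9"),
}

# A clone taken before the branch and merge has only the first four.
KNOWN = {"33596ef855c1", "3d65e7a57890", "ff97668b7422", "a065eb26e6b9"}

print(missing_changesets(SOURCE, KNOWN))
```

For this history the result is the branch change followed by the merge, in an order that lets each one be applied on top of parents that are already there. The recursion is fine for a toy; a real history thousands of changes deep would want an explicit stack.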