Mercurial > CVu-Mercurial
changeset 9:2155510c62f3
A version formatted with Latex. And spellchecked.
author | Jim Hague <jim.hague@acm.org> |
---|---|
date | Fri, 22 May 2009 10:23:40 +0100 |
parents | abca12aaa38d |
children | 2e4d690ffabb |
files | Hg.tex |
diffstat | 1 files changed, 752 insertions(+), 0 deletions(-) [+] |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/Hg.tex Fri May 22 10:23:40 2009 +0100 @@ -0,0 +1,752 @@ +\documentclass[a4paper]{article} +\usepackage{pslatex} +\usepackage{url} + +\newcommand{\standout}[1]{ + {\begin{center} \large \textbf{#1} \end{center}} +} + +\setlength{\parskip}{2mm} +\setlength{\parindent}{0mm} + +\begin{document} +\title{Inside a distributed version control system} +\author{Jim Hague\\ + \texttt{jim.hague@acm.org}} +\date{May 2009} +\maketitle + +\section{Preamble} +Grinton Lodge is a Youth Hostel that sits on an exposed hillside just +above the small hamlet of Grinton in Swaledale, in the Yorkshire Dales +National Park. A former Victorian shooting lodge, it now welcomes +walkers and other travellers from around the world. + +Tonight, a Wednesday in mid-November, is not one of its busiest +nights. Kat, the duty staff member, tells me that there is a small +corporate team-building group in the annex. There's no sign of them at +present. Otherwise, that portion of the world that has beaten a path +to the door of this grand building today consists of just me. And Kat +goes home soon. + +The November CVu, removed from its wrappers and read yesterday, lies +in my bag. Taunting me. Go on, it says, if you've ever going to put +finger to keyboard in the name of CVu, well, tonight you are out of +excuses. + +Bugger. + +\section{Let's look into Mercurial} +If you're at all interested in version control systems~--- and any +software developer not using one daily is a strange beast indeed~--- +you'll at least have become vaguely aware in the last few years of the +growing maturity of the latest group of version control systems +offering funky new stuff. These are the distributed version control +systems (DVCS). There is more to them than just their headline +attributes, being able to check history and do checkins while +disconnected from a central server, but these are damm useful to start +with. + +When I first heard about DVCS, it wasn't immediately obvious to me (to +put it mildly) how they would work. After years of using a centralised +version control system, I had rough mental model of what went on. But +how do you cope without the central server forcing ordering onto the +changes? + +Since then I've started using Mercurial\footnote{ +\url{http://www.selenic.com/mercurial}}. +Mercurial is a DVCS. It's one of +three DVCSs that have gained significant popularity in the last few +years, the other two being Git\footnote{\url{http://git-scm.com}} and +Bazaar\footnote{\url{http://bazaar-vcs.org/}}. +I switched a significant work project over +to Mercurial (from Subversion) in mid-2007, because a customer site +required on-site work but could not allow access back to the company +VPN. I chose Mercurial for a variety of reasons which I won't bore you +with here\footnote{ +OK, if you must know: +\begin{itemize} +\item Implementability. I needed the system to work on Windows, Linux and +AIX. The latter was not one of the directly supported platforms for +any of the candidates. Git's implementation uses a horde of +tools. Bazaar requires only Python, but required Python 2.4 while IBM +stubbornly still supplies only Python 2.3. Mercurial requires Python +2.3 or greater, and uses some C for speed. +\item Simplicity. My users used Subversion daily, but did not generally +have much experience with other VCS. From the command line, +Mercurial's core operations will be familiar to a Subversion +user. This is also true of Bazaar, but was less true of Git. Git has +improved in this matter since then, but a Mr Winder of this parish +tells me that it's still possible to seriously embarrass +yourself. There was also a lack of Windows support for Git at the +time. +\item Speed. Mercurial is fast. In the same ballpark as Git. Bazaar +wasn't, and although it has improved significantly, has, in my +estimation, added user complexity in the process, and at the time +of writing is still off the pace for some operations. +\item Documentation. At the time, Bryan O'Sullivan's excellent Mercurial +book (\url{http://hgbook.red-bean.com}) was a clear winner for best +documentation. +\end{itemize}}. + +What I want to do in this article is give you an insight into how a +DVCS works. OK, so specifically I'm going to be talking about +Mercurial, but Git and Bazaar attack the problem in a similar way. But +first I'd better give you some idea of how you use Mercurial. + +\subsection{The 5 minute Mercurial overview} +\subsubsection{The basics} +I think it unlikely that someone possessing the taste and discernment +to be reading CVu would not be familiar with at least one version +control system. So, while I want to give you a flavour of what it's +like to use, I'm not going to hang about. If you'd like a proper +introduction, or you don't follow something, I thoroughly recommend +you consult the Mercurial book. + +To start using Mercurial to keep track of a project. + +\begin{verbatim} +$ hg init +$ +\end{verbatim} + +This creates the repository root in the current directory. + +Like CVS\footnote{\url{http://www.nongnu.org/cvs/}} +with its \texttt{CVS} directory and +Subversion\footnote{\url{http://subversion.tigris.org/}} +with its \texttt{.svn} +directory, Mercurial keeps its private data in a directory. Mercifully there is +only one of these, in the top level of your project. And rather than +holding details of where the actual repository is to be found, the \texttt{.hg} +directory holds the entire repository. + +Next you need to specify the files you want Mercurial to track. + +\begin{verbatim} +$ echo "There was a gibbon one morning" > pome.txt +$ hg add pome.txt +$ +\end{verbatim} + +As you might expect, this marks the files as to be added. And as you +might also expect, you need to commit to record the added files in the +repository. The commit comment can be supplied on the command line; if +you don't supply a comment, you'll be dropped into an editor to +provide one. + +There is a suggested format for these messages~--- a one line summary +followed by any more required detail on following lines. By default +Mercurial will only display the first line of commit messages when +listing changes. In these examples I'll stick to terse messages, and +I'll enter them from the command line. + +\begin{verbatim} +$ hg commit -m "My Pome" -u "Jim Hague <jim.hague@acm.org>" +$ +\end{verbatim} + +Mercurial records the user making the change as part of the change +information. It is usual to give your name and email address as I've +done here. You can imagine, though, that constantly having to repeat +this is a bit tedious, so you can set a default user name in a +configuration file. Mercurial keeps global, user and repository +configurations, and it can go in any of those. + +As with Subversion, after further edits you see how your working copy +differs from the repository. + +\begin{verbatim} +$ hg status +M pome.txt +$ hg diff +diff -r 33596ef855c1 pome.txt +--- a/pome.txt Wed Apr 23 22:36:33 2008 +0100 ++++ b/pome.txt Wed Apr 23 22:48:01 2008 +0100 +@@ -1,1 +1,2 @@ There was a gibbon one morning + There was a gibbon one morning ++said "I think I will fly to the moon". +$ hg commit -m "A great second line" +$ +\end{verbatim} + +And look through a log of changes. + +\begin{verbatim} +$ hg log +changeset: 1:3d65e7a57890 +tag: tip +user: Jim Hague <jim.hague@acm.org> +date: Wed Apr 23 22:49:10 2008 +0100 +summary: A great second line + +changeset: 0:33596ef855c1 +user: Jim Hague <jim.hague@acm.org> +date: Wed Apr 23 22:36:33 2008 +0100 +summary: My Pome + +$ +\end{verbatim} + +There are some items here that need an explanation. + +The changeset identifier is in fact two identifiers separated by a +colon. The first is the sequence number of the changeset in the +repository, and is directly comparable to the change number in a +Subversion repository. The second is a globally unique identifier for +that change. As the change is copied from one repository to another +(this is a distributed system, remember, even if we haven't come to +that bit yet), its sequence number in any particular repository will +change, but the global identifier will always remain the same. + +\texttt{tip} is a Mercurial term. It means simply the most recent change. + +Want to rename a file? + +\begin{verbatim} +$ hg mv pome.txt poem.txt +$ hg status +A poem.txt +R pome.txt +$ hg commit -m "Rename my file" +$ +\end{verbatim} +(The command to rename a file is actually \texttt{hg rename}, +but Mercurial saves Unix-trained fingers from +typing embarrassment.) + +At this point you may be wondering about directories. \texttt{hg mkdir} +perhaps? Well, no. Mercurial only tracks files. To be sure, the +directory a file occupies is tracked, but effectively only as a +component of the file name. This has the slightly unexpected result +that you can't record an empty directory in your repository.\footnote{ +I tripped over this converting a work Subversion +repository. One possibility is to create a placeholder file in the +directory. In the event I created the directory (which receives build +products) as part of the build instead.} + +Given this, and the status output above that suggests strongly that +Mercurial treats a rename as a copy followed by a delete, you may be +worried that Mercurial won't cope at all well with rearranging your +repository. Relax. Mercurial does store the details of the rename as +part of the changeset, and copes very well with rearrangements\footnote{ +The Mercurial designers justify not dealing with +directories as first class objects by pointing out that provided you +can correctly move files about in the tree, the other reasons for +tracking directories are uncommon and do not in their opinion justify +the considerable added complexity. So far I've found no reason to +doubt that judgement.}. + +Want to rewind the working copy to a previous revision? + +\begin{verbatim} +$ hg update -r 1 +1 files updated, 0 files merged, 1 files removed, 0 files unresolved +$ +\end{verbatim} + +\texttt{hg update} updates the working files. In this case I'm specifying +that I want to go back to local changeset 1. I could also have typed +\texttt{-r 3d65e7a57890}, or even \texttt{-r 3d}; +when specifying the global change +identifier you only need to type enough digits to make it unique. + +This is all very well, but it's not exactly distributed, is it? + +\subsubsection{Going distributed} +A version control system goes Distributed by allowing multiple copies +of the repository to exist, and work to be done in all those +repositories in parallel. So when you start work on an existing +project, the first thing to do is to get your own copy of the +repository. + +\begin{verbatim} +elsewhere$ hg clone ssh://jim.home.net/Poem Jim-Poem +updating working directory +1 files updated, 0 files merged, 0 files removed, 0 files unresolved +\end{verbatim} + +Mercurial lets you access other repositories via the file system, over http or +over ssh. + +\begin{verbatim} +elsewhere$ cd Jim-Poem +elsewhere$ hg log +changeset: 3:a065eb26e6b9 +tag: tip +user: Jim Hague <jim.hague@acm.org> +date: Thu Apr 24 18:52:31 2008 +0100 +summary: Rename my file + +changeset: 2:ff97668b7422 +user: Jim Hague <jim.hague@acm.org> +date: Thu Apr 24 18:50:22 2008 +0100 +summary: Finished first verse + +changeset: 1:3d65e7a57890 +user: Jim Hague <jim.hague@acm.org> +date: Wed Apr 23 22:49:10 2008 +0100 +summary: A great second line + +changeset: 0:33596ef855c1 +user: Jim Hague <jim.hague@acm.org> +date: Wed Apr 23 22:36:33 2008 +0100 +summary: My Pome + +$ +\end{verbatim} + +\texttt{hg clone} is aptly named. It creates a new repository that contains +exactly the same changes as the source repository. You can make a +clone just by copying your project directory, if you're confident +nothing else will access it during the copy. \texttt{hg clone} saves you this +worry, and sets the default push/pull location in the new repo to the +cloned repo. + +From that point, you use \texttt{hg pull} to collect changes from other +places into your repo (though note it does not by default update your +working copy), and, as you might guess, \texttt{hg push} shoves your changes +into a foreign repository. By default these will act on the repository +you cloned from, but you can specify any other repository. + +More on those in a moment. First, though, I want to show you something +you can't do in Subversion. Start with the repository with 4 changes +we just cloned. I want to focus on the first couple of lines, so I'll +wind the working copy back to the point where only those lines exist. + +\begin{verbatim} +$ hg update -r 1 +1 files updated, 0 files merged, 1 files removed, 0 files unresolved +$ +\end{verbatim} + +And make a change. + +\begin{verbatim} +$ hg diff +diff -r 3d65e7a57890 pome.txt +--- a/pome.txt Wed Apr 23 22:49:10 2008 +0100 ++++ b/pome.txt Thu Apr 24 19:13:14 2008 +0100 +@@ -1,2 +1,2 @@ There was a gibbon one morning +-There was a gibbon one morning +-said "I think I will fly to the moon". ++There was a baboon who one afternoon ++said "I think I will fly to the sun". +$ hg commit -m "Better first two lines" +$ +\end{verbatim} + +The alert among you will have sat up at that. Well done! Yes, there's +something very worrying. How can I commit a change at an old point? +If you try this in Subversion, it will complain mightily about your +file being out of date. But Mercurial just went ahead and did +something. The Bazaar experts among you will know that in Bazaar, if +you use \texttt{bzr revert -r} to bring the working copy to a past revision, +make a change and commit, then your latest version will be the past +revision plus your change. Perhaps that's what Mercurial did? + +No. What Mercurial did is central to Mercurial's view of the +world. You took your working copy back to an old changeset, and then +committed a fresh change based at that changeset. Mercurial actually +did just what you asked it to do, no more and no less. Let's see the +initial evidence. + +\begin{verbatim} +$ hg heads +changeset: 4:267d32f158b3 +tag: tip +parent: 1:3d65e7a57890 +user: Jim Hague <jim.hague@acm.org> +date: Thu Apr 24 19:13:59 2008 +0100 +summary: Better first two lines + +changeset: 3:a065eb26e6b9 +user: Jim Hague <jim.hague@acm.org> +date: Thu Apr 24 18:52:31 2008 +0100 +summary: Rename my file + +$ +\end{verbatim} + +Time for some more Mercurial terminology. You can think of a \texttt{head} in +Mercurial as the most recent change on a branch. In Mercurial, a +branch is simply what happens when you commit a change that has as its +parent a change that already has a child. Mercurial has a standard +extension \texttt{hg glog} which uses some ASCII art to show the current +state: + +\begin{verbatim} +$ hg glog +@ changeset: 4:267d32f158b3 +| tag: tip +| parent: 1:3d65e7a57890 +| user: Jim Hague <jim.hague@acm.org> +| date: Thu Apr 24 19:13:59 2008 +0100 +| summary: Better first two lines +| +| o changeset: 3:a065eb26e6b9 +| | user: Jim Hague <jim.hague@acm.org> +| | date: Thu Apr 24 18:52:31 2008 +0100 +| | summary: Rename my file +| | +| o changeset: 2:ff97668b7422 +|/ user: Jim Hague <jim.hague@acm.org> +| date: Thu Apr 24 18:50:22 2008 +0100 +| summary: Finished first verse +| +o changeset: 1:3d65e7a57890 +| user: Jim Hague <jim.hague@acm.org> +| date: Wed Apr 23 22:49:10 2008 +0100 +| summary: A great second line +| +o changeset: 0:33596ef855c1 + user: Jim Hague <jim.hague@acm.org> + date: Wed Apr 23 22:36:33 2008 +0100 + summary: My Pome + +$ +\end{verbatim} + +\texttt{hg view} shows a nicer graphical view\footnote{Though, being +Tcl/Tk based, not that much nicer.}. + +So the change is in there. It's the latest change, and is simply on a +different branch to the other changes. + +Almost invariably, you will want to bring everything back together and +merge the branches. A merge is a change that combines two heads back +into one. It prepares an updated working directory with the merged +contents of the two heads for you to review and, if satisfactory, +commit. + +\begin{verbatim} +$ hg merge +merging pome.txt and poem.txt +0 files updated, 1 files merged, 0 files removed, 0 files unresolved +(branch merge, don't forget to commit) +$ cat poem.txt +There was a baboon who one afternoon +said "I think I will fly to the sun". +So with two great palms strapped to his arms, +he started his takeoff run. +$ hg commit -m "Merge first line branch" +$ +\end{verbatim} + +(I'm no poet. The poem is, of +course, \textit{Silly Old Baboon} by the late, great, Spike +Milligan. From \textit{A Book of Milliganimals}, Puffin, 1971.) + +Here's the ASCII art again showing what just happened. +Oh, and notice in the above that Mercurial has done the +right thing with regard to the rename. + +\begin{verbatim} +$ hg glog +@ changeset: 5:792ab970fc80 +|\ tag: tip +| | parent: 4:267d32f158b3 +| | parent: 3:a065eb26e6b9 +| | user: Jim Hague <jim.hague@acm.org> +| | date: Thu Apr 24 19:29:53 2008 +0100 +| | summary: Merge first line branch +| | +| o changeset: 4:267d32f158b3 +| | parent: 1:3d65e7a57890 +| | user: Jim Hague <jim.hague@acm.org> +| | date: Thu Apr 24 19:13:59 2008 +0100 +| | summary: Better first two lines +| | +o | changeset: 3:a065eb26e6b9 +| | user: Jim Hague <jim.hague@acm.org> +| | date: Thu Apr 24 18:52:31 2008 +0100 +| | summary: Rename my file +| | +o | changeset: 2:ff97668b7422 +|/ user: Jim Hague <jim.hague@acm.org> +| date: Thu Apr 24 18:50:22 2008 +0100 +| summary: Finished first verse +| +o changeset: 1:3d65e7a57890 +| user: Jim Hague <jim.hague@acm.org> +| date: Wed Apr 23 22:49:10 2008 +0100 +| summary: A great second line +| +o changeset: 0:33596ef855c1 + user: Jim Hague <jim.hague@acm.org> + date: Wed Apr 23 22:36:33 2008 +0100 + summary: My Pome + +$ +\end{verbatim} + +So, our little branch change has now been merged back, and we have a +single line of development again. Notice that unlike the other +changesets, changeset 5 has two parent changesets, indicating it is a +merge changeset. You can only merge two branches in one operation; or +putting it another way, a changeset can have a maximum of two parents. + +This behaviour is absolutely central to Mercurial's philosophy. If a +change is committed that takes as its starting point a change that +already has a child, then a branch gets created. Working with +Mercurial, branches get created frequently, and equally frequently +merged back. As befits any frequent operation, both are easy to do. + +You're probably thinking at this point that this making a commit onto +an old version is a slightly strange thing to do, and you'd be right. +But that's exactly what's going to happen the moment you go +distributed. Two people working independently with their own +repositories are going to make commits based, typically, on the latest +changes they happen to have incorporated into their tree. To be +Distributed, a DVCS has to deal with this. Mercurial faces it head-on. +When you pull changes into your repo (or someone else pushes them), if +any of the changes overlap~--- are both based on the same base change~--- +you get extra heads, and it's up to you to let these extra heads live +or merge, as you please. + +In practice this is more manageable then you might think. Consider a +typical Mercurial usage, where the 'master' repo sits on a known +server, and everyone pulls changes from the master and pushes their +own efforts to the master. But default Mercurial won't let you push if +the receiving repo will gain an extra head as a result, so you +typically pull (and do any required merging) just before +pushing. Subversion users will recognised this pattern. Subversion +won't let you commit a change if your working copy is not at the very +latest revision, so the Subversion user will update, and merge if +necessary, just before committing. + +What, then, about a branch in the conventional sense of '1.0 +maintenance branch'? Typically in Mercurial you'd handle this by +keeping a separate cloned repository for those changes. Cloning is +fast, and if local uses hard links where possible on filesystems that +support them, so isn't necessarily extravagant on disc space. You can, +if you prefer, handle them all in a single repo with 'named +branches', but cloning is definitely simpler. + +OK, so now you know the basics of using Mercurial. We can proceed to +looking at how this magic is achieved. In particular, where does this +magic globally unique identifier for a change come from? + +\subsection{Inside the Mercurial repo} +The way Mercurial handles its repo is really quite simple. + +That's simple, as in 'most things are simple once you know the +answer'. I found the explanation helpful\footnote{For the curious, +Bryan O'Sullivan's excellent Mercurial book +has a chapter on the subject, and the Mercurial website has a fair amount +of detail too.}, so this section attempts +the 10,000ft (FL100 if you prefer) view of Mercurial. + +First remember that any file or component can only have one or two +parents. You can't merge more than one other branch at once. + +We start with the basic building block, which Mercurial calls a +revlog. A revlog is a thing that holds a file and all the changes in +the file history\footnote{For any non-trivial file, this will +actually be two files on the disc, a data file and an index.}. The +revlog stores the differences between successive versions +of the file, though it will periodically store a complete version of +the file instead of a difference, so that the content of any +particular file version can always be reconstructed without excessive +effort. + +Under the secret-squirrel Mercurial \texttt{.hg} directory at the top of your +project is a store which holds a revlog for each file in your +project. So you have the complete history of the project locally. No +more round trips to the server. + +Both the differences between successive versions and the periodic +complete versions of a file are compressed before storing. This is +surprisingly effective at minimising the storage requirements this +entire history of your project. I have a small Java project handy, +comprising a little over 300 source modules. There are 5 branches plus +the mainline, and some 1920 commits in all. A Subversion checkout of +the current mainline takes 51Mb. Converting the project to Mercurial +yields a Mercurial repository that takes 60Mb, so a little +bigger. Remember, though, that the Mercurial repository includes not +just the working copy, but also the entire history of the project. + +Any point in the evolution of a revlog can be uniquely identified with +a nodeid. This is simply the SHA1 hash of the current file contents +concatenated with the nodeids of one or both parents of the current +revision. Note that this way, two file states are identical if and +only if the file contents are the same *and* the file has the +same history. + +Here's a dump of a revlog index: + +\begin{verbatim} +$ hg debugindex .hg/store/data/pome.txt.i + rev offset length base linkrev nodeid p1 p2 + 0 0 32 0 0 6bbbd5d6cc53 000000000000 000000000000 + 1 32 51 0 1 83d266583303 6bbbd5d6cc53 000000000000 + 2 83 84 0 2 14a54ec34bb6 83d266583303 000000000000 + 3 167 76 3 4 dc4df776b38b 83d266583303 000000000000 +$ +\end{verbatim} + +Note here that a file state can have two parents. If both the parent +nodeids are non-null, the file state has two parents, and the state is +therefore the result of a merge. + +Let's dump out a revlog at a particular revision: + +\begin{verbatim} +$ hg debugdata .hg/store/data/pome.txt.i 2 +There was a gibbon one morning +said "I think I will fly to the moon". +So with two great palms strapped to his arms, +he started his takeoff run. +$ +\end{verbatim} + +The next component is the manifest. This is simply a list of all the +files in the project, together with their current nodeids. The +manifest is a file, held in a revlog. The nodeid of the manifest, +therefore, identifies the project filesystem at a particular point. + +\begin{verbatim} +$ hg debugdata .hg/store/00manifest.i 5 +poem.txt5168b1a5e2f44aa4e0f164e170820845183f50c8 +$ +\end{verbatim} + +Finally we have the changeset. This is the atomic collection of +changes to a repository that leads to a new revision. The changeset +info includes the nodeid of the corresponding manifest, the timestamp +and committer ID, a list of changed files and a comment. The changeset +also includes the nodeid of the parent changeset, or the two parents +if the change is a merge. The changeset description is held in a +revlog, the changelog. + +\begin{verbatim} +$ hg debugdata .hg/store/00changelog.i 5 +1ccc11b6f7308cc8fa1573c2f3811a4710c91e3e +Jim Hague <jim.hague@acm.org> +1209061793 -3600 +poem.txt +pome.txt + +Merge first line branch +$ +\end{verbatim} + +The nodeid of the changeset, therefore, gives us a globally unique +identifier for any particular change. Changesets have a +Subversion-like incrementing change number, but it is peculiar to that +repository. The nodeid, however, is global. + +One more detail remains to complete the picture. How do we get back +from a particular file change to find the responsible changeset? Each +revlog change has a linkrev entry that does just this. + +So, now we have a repository with a history of the changes applied to +that repository. Each change has a unique identifier. If we find that +change in another repository, it means that at the point in the other +repository we have exactly the same state; the file contents and +history are identical. + +At this point we can see how pulling changes from another repository +works. Mercurial has to determine which changesets in the source +repository are missing in the target repository. To do this, for each +head in the source repo it has to find the most recent change in that +head that it already present in the target repo, and get any remaining +changes after that point. These changes are then copied over and +applied. + +The Mercurial revlog format has proved remarkably durable. Since the +first release of Mercurial in April 2005, these have been a total of 5 +changes to the file format. However, of those, all but one have been +changes to the handling of file names. The most recent change, in +October 2008, and its predecessor in December 2006, were both +introduced purely to cope with Windows specific issues. The one change +that touched the data structures described above was in April 2006. The +format introduced, RevLogNG, changed only the details of index data +held, not the overall design. The chief Mercurial developer, Matt +Mackall, notes that the code in present-day Mercurial devoted to +reading the old format comprises 28 lines of Python. Compared with, +say, the early tribulations of Subversion and the switch from \texttt{bdfs} to +\texttt{fsfs}, this is an impressive record. + +\section{Reflections on going distributed} +It's nearly traditional at this stage in an introduction to DVCS to +demonstrate several different workflow scenarios that you can build +with a DVCS. Which makes the important point that a DVCS can be +adapted to your workflow in a way that is at best unwieldy with a +CVCS. I intend, though, to break with tradition here. + +By this stage, I hope you can see that distributing version control +works by introducing branches where development takes place in +parallel. Mercurial treats these branches as arising naturally from +the commits made and transferred between repositories. Both Git and +Bazaar take a slightly different viewpoint, and explicitly generate a +fresh branch for work in a particular repositories. But in both cases +the underlying principle of identifying changes by a globally unique +identifier and resolving parallel development by merges between +overlapping changes is the same. And all three can be used in a truly +distributed manner, with full history and the ability to commit being +available locally. + +So instead of chatter on about workflows, I want instead to reflect on +the consequences all this has for that all-important question of +whether a DVCS is a suitable vehicle for your data. + +The first is a minor and rather obvious point. If you want to store +files that are very large and which change often in your DVCS, then +all the compression in the world is unlikely to stop the storage +requirements for the full project history from becoming uncomfortably +large, particularly if the files are not very compressible to start +with. + +The second, and main, point is that there is an important question you +need to ask about your data. We've seen that a DVCS relies on +branching and merging to weave its magic. So take a close look at your +data, and ask: + +\standout{Will It Merge?} + +The subset of plain old text which comprises program source +code requires some human oversight, but will merge automatically +well enough for the process to be well within the bounds of the +possible. + +Unfortunately when we move further afield mergeability becomes a rarer +commodity. I nearly began the previous paragraph by stating that +plain old text will merge well enough. Then Doubt set in~--- what about +XML? Or BASE64 encoded content? + +Of course, merge doesn't necessarily have to be textual merge. I am +told that Word can be used to diff and merge two Word \texttt{.doc} files, a +data format notorious for its binary impenetrability. As long as some +suitable merge agent is available, and the DVCS can be configured to +use it for data of a particular type\footnote{Mercurial can have the +merge and diff tools specified with reference to the file extension on +which they operate~--- I assume Bazaar and Git are similar.}, then there +is no bar to successful DVCS use. + +Before this reliance on mergeability causes you to dismiss DVCS out of +hand, reflect. A CVCS can only handle non-mergeable data by acting as +a versioned file store; in other words, having as the only available +merge option the use of one or other of the merge candidates in its +entirety. Useful though a versioned file store can be, it cannot be +considered a full-featured version control system. By treating the +offending unmergeable files as external to the DVCS, or with careful +workflow~--- disabling the distributed and mergeable potentials~--- a DVCS +can deal with these files, but only at a cost of its distributedness +or its version control system-ness. In this it differs little from a +CVCS. + +So, for all data you want to version control, let your battle cry be: + +\standout{Will It Merge?} + +At this point, I have an urge to don lab coat and safety goggles and +be videoed attempting to mechanically merge data in a variety of +different formats. Frankly, this is unlikely to be as exciting at +blending iPhones\footnote{\url{http://www.willitblend.com}}, +but from a system development point of view it's rather more +important. And, I think gives us a large clue as to one of the +reasons for the continuing +popularity of Plain Old Text as a source code representation mechanism. + +\end{document}