comparison Hg.tex @ 9:2155510c62f3

A version formatted with Latex. And spellchecked.
author Jim Hague <jim.hague@acm.org>
date Fri, 22 May 2009 10:23:40 +0100
parents
children 2e4d690ffabb
1 \documentclass[a4paper]{article}
2 \usepackage{pslatex}
3 \usepackage{url}
4
5 \newcommand{\standout}[1]{
6 {\begin{center} \large \textbf{#1} \end{center}}
7 }
8
9 \setlength{\parskip}{2mm}
10 \setlength{\parindent}{0mm}
11
12 \begin{document}
13 \title{Inside a distributed version control system}
14 \author{Jim Hague\\
15 \texttt{jim.hague@acm.org}}
16 \date{May 2009}
17 \maketitle
18
19 \section{Preamble}
20 Grinton Lodge is a Youth Hostel that sits on an exposed hillside just
21 above the small hamlet of Grinton in Swaledale, in the Yorkshire Dales
22 National Park. A former Victorian shooting lodge, it now welcomes
23 walkers and other travellers from around the world.
24
25 Tonight, a Wednesday in mid-November, is not one of its busiest
26 nights. Kat, the duty staff member, tells me that there is a small
27 corporate team-building group in the annex. There's no sign of them at
28 present. Otherwise, that portion of the world that has beaten a path
29 to the door of this grand building today consists of just me. And Kat
30 goes home soon.
31
32 The November CVu, removed from its wrappers and read yesterday, lies
33 in my bag. Taunting me. Go on, it says, if you're ever going to put
34 finger to keyboard in the name of CVu, well, tonight you are out of
35 excuses.
36
37 Bugger.
38
39 \section{Let's look into Mercurial}
40 If you're at all interested in version control systems~--- and any
41 software developer not using one daily is a strange beast indeed~---
42 you'll at least have become vaguely aware in the last few years of the
43 growing maturity of the latest group of version control systems
44 offering funky new stuff. These are the distributed version control
45 systems (DVCS). There is more to them than just their headline
46 attributes, being able to check history and do checkins while
47 disconnected from a central server, but these are damn useful to start
48 with.
49
50 When I first heard about DVCS, it wasn't immediately obvious to me (to
51 put it mildly) how they would work. After years of using a centralised
52 version control system, I had a rough mental model of what went on. But
53 how do you cope without the central server forcing ordering onto the
54 changes?
55
56 Since then I've started using Mercurial\footnote{
57 \url{http://www.selenic.com/mercurial}}.
58 Mercurial is a DVCS. It's one of
59 three DVCSs that have gained significant popularity in the last few
60 years, the other two being Git\footnote{\url{http://git-scm.com}} and
61 Bazaar\footnote{\url{http://bazaar-vcs.org/}}.
62 I switched a significant work project over
63 to Mercurial (from Subversion) in mid-2007, because a customer site
64 required on-site work but could not allow access back to the company
65 VPN. I chose Mercurial for a variety of reasons which I won't bore you
66 with here\footnote{
67 OK, if you must know:
68 \begin{itemize}
69 \item Implementability. I needed the system to work on Windows, Linux and
70 AIX. The latter was not one of the directly supported platforms for
71 any of the candidates. Git's implementation uses a horde of
72 tools. Bazaar requires only Python, but required Python 2.4 while IBM
73 stubbornly still supplies only Python 2.3. Mercurial requires Python
74 2.3 or greater, and uses some C for speed.
75 \item Simplicity. My users used Subversion daily, but did not generally
76 have much experience with other VCS. From the command line,
77 Mercurial's core operations will be familiar to a Subversion
78 user. This is also true of Bazaar, but was less true of Git. Git has
79 improved in this matter since then, but a Mr Winder of this parish
80 tells me that it's still possible to seriously embarrass
81 yourself. There was also a lack of Windows support for Git at the
82 time.
83 \item Speed. Mercurial is fast. In the same ballpark as Git. Bazaar
84 wasn't, and although it has improved significantly, has, in my
85 estimation, added user complexity in the process, and at the time
86 of writing is still off the pace for some operations.
87 \item Documentation. At the time, Bryan O'Sullivan's excellent Mercurial
88 book (\url{http://hgbook.red-bean.com}) was a clear winner for best
89 documentation.
90 \end{itemize}}.
91
92 What I want to do in this article is give you an insight into how a
93 DVCS works. OK, so specifically I'm going to be talking about
94 Mercurial, but Git and Bazaar attack the problem in a similar way. But
95 first I'd better give you some idea of how you use Mercurial.
96
97 \subsection{The 5 minute Mercurial overview}
98 \subsubsection{The basics}
99 I think it unlikely that someone possessing the taste and discernment
100 to be reading CVu would not be familiar with at least one version
101 control system. So, while I want to give you a flavour of what it's
102 like to use, I'm not going to hang about. If you'd like a proper
103 introduction, or you don't follow something, I thoroughly recommend
104 you consult the Mercurial book.
105
106 To start using Mercurial to keep track of a project:
107
108 \begin{verbatim}
109 $ hg init
110 $
111 \end{verbatim}
112
113 This creates the repository root in the current directory.
114
115 Like CVS\footnote{\url{http://www.nongnu.org/cvs/}}
116 with its \texttt{CVS} directory and
117 Subversion\footnote{\url{http://subversion.tigris.org/}}
118 with its \texttt{.svn}
119 directory, Mercurial keeps its private data in a directory. Mercifully there is
120 only one of these, in the top level of your project. And rather than
121 holding details of where the actual repository is to be found, the \texttt{.hg}
122 directory holds the entire repository.
123
124 Next you need to specify the files you want Mercurial to track.
125
126 \begin{verbatim}
127 $ echo "There was a gibbon one morning" > pome.txt
128 $ hg add pome.txt
129 $
130 \end{verbatim}
131
132 As you might expect, this marks the files as to be added. And as you
133 might also expect, you need to commit to record the added files in the
134 repository. The commit comment can be supplied on the command line; if
135 you don't supply a comment, you'll be dropped into an editor to
136 provide one.
137
138 There is a suggested format for these messages~--- a one line summary
139 followed by any more required detail on following lines. By default
140 Mercurial will only display the first line of commit messages when
141 listing changes. In these examples I'll stick to terse messages, and
142 I'll enter them from the command line.
143
144 \begin{verbatim}
145 $ hg commit -m "My Pome" -u "Jim Hague <jim.hague@acm.org>"
146 $
147 \end{verbatim}
148
149 Mercurial records the user making the change as part of the change
150 information. It is usual to give your name and email address as I've
151 done here. You can imagine, though, that constantly having to repeat
152 this is a bit tedious, so you can set a default user name in a
153 configuration file. Mercurial keeps global, user and repository
154 configurations, and it can go in any of those.
155
156 As with Subversion, after further edits you can see how your working copy
157 differs from the repository.
158
159 \begin{verbatim}
160 $ hg status
161 M pome.txt
162 $ hg diff
163 diff -r 33596ef855c1 pome.txt
164 --- a/pome.txt Wed Apr 23 22:36:33 2008 +0100
165 +++ b/pome.txt Wed Apr 23 22:48:01 2008 +0100
166 @@ -1,1 +1,2 @@ There was a gibbon one morning
167 There was a gibbon one morning
168 +said "I think I will fly to the moon".
169 $ hg commit -m "A great second line"
170 $
171 \end{verbatim}
172
173 And look through a log of changes.
174
175 \begin{verbatim}
176 $ hg log
177 changeset: 1:3d65e7a57890
178 tag: tip
179 user: Jim Hague <jim.hague@acm.org>
180 date: Wed Apr 23 22:49:10 2008 +0100
181 summary: A great second line
182
183 changeset: 0:33596ef855c1
184 user: Jim Hague <jim.hague@acm.org>
185 date: Wed Apr 23 22:36:33 2008 +0100
186 summary: My Pome
187
188 $
189 \end{verbatim}
190
191 There are some items here that need an explanation.
192
193 The changeset identifier is in fact two identifiers separated by a
194 colon. The first is the sequence number of the changeset in the
195 repository, and is directly comparable to the change number in a
196 Subversion repository. The second is a globally unique identifier for
197 that change. As the change is copied from one repository to another
198 (this is a distributed system, remember, even if we haven't come to
199 that bit yet), its sequence number in any particular repository will
200 change, but the global identifier will always remain the same.
201
202 \texttt{tip} is a Mercurial term. It means simply the most recent change.
203
204 Want to rename a file?
205
206 \begin{verbatim}
207 $ hg mv pome.txt poem.txt
208 $ hg status
209 A poem.txt
210 R pome.txt
211 $ hg commit -m "Rename my file"
212 $
213 \end{verbatim}
214 (The command to rename a file is actually \texttt{hg rename},
215 but Mercurial saves Unix-trained fingers from
216 typing embarrassment.)
217
218 At this point you may be wondering about directories. \texttt{hg mkdir}
219 perhaps? Well, no. Mercurial only tracks files. To be sure, the
220 directory a file occupies is tracked, but effectively only as a
221 component of the file name. This has the slightly unexpected result
222 that you can't record an empty directory in your repository.\footnote{
223 I tripped over this converting a work Subversion
224 repository. One possibility is to create a placeholder file in the
225 directory. In the event I created the directory (which receives build
226 products) as part of the build instead.}
227
228 Given this, and the status output above that suggests strongly that
229 Mercurial treats a rename as a copy followed by a delete, you may be
230 worried that Mercurial won't cope at all well with rearranging your
231 repository. Relax. Mercurial does store the details of the rename as
232 part of the changeset, and copes very well with rearrangements\footnote{
233 The Mercurial designers justify not dealing with
234 directories as first class objects by pointing out that provided you
235 can correctly move files about in the tree, the other reasons for
236 tracking directories are uncommon and do not in their opinion justify
237 the considerable added complexity. So far I've found no reason to
238 doubt that judgement.}.
239
240 Want to rewind the working copy to a previous revision?
241
242 \begin{verbatim}
243 $ hg update -r 1
244 1 files updated, 0 files merged, 1 files removed, 0 files unresolved
245 $
246 \end{verbatim}
247
248 \texttt{hg update} updates the working files. In this case I'm specifying
249 that I want to go back to local changeset 1. I could also have typed
250 \texttt{-r 3d65e7a57890}, or even \texttt{-r 3d};
251 when specifying the global change
252 identifier you only need to type enough digits to make it unique.
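
The lookup amounts to a unique-prefix match. Here is a rough Python
sketch of the idea; the function name \texttt{resolve\_prefix} is made up for
illustration, and this is not Mercurial's own code.

```python
def resolve_prefix(prefix, nodeids):
    """Return the single nodeid that starts with `prefix`.

    Raises LookupError if nothing matches or if the prefix is ambiguous.
    """
    matches = [n for n in nodeids if n.startswith(prefix.lower())]
    if not matches:
        raise LookupError("unknown revision '%s'" % prefix)
    if len(matches) > 1:
        raise LookupError("ambiguous identifier '%s'" % prefix)
    return matches[0]


# The two short identifiers from the log output earlier.
known = ["33596ef855c1", "3d65e7a57890"]
print(resolve_prefix("3d", known))  # 3d65e7a57890
```

With only these two changesets, \texttt{3d} is enough, but a bare \texttt{3}
would match both and be rejected as ambiguous.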
253
254 This is all very well, but it's not exactly distributed, is it?
255
256 \subsubsection{Going distributed}
257 A version control system goes Distributed by allowing multiple copies
258 of the repository to exist, and work to be done in all those
259 repositories in parallel. So when you start work on an existing
260 project, the first thing to do is to get your own copy of the
261 repository.
262
263 \begin{verbatim}
264 elsewhere$ hg clone ssh://jim.home.net/Poem Jim-Poem
265 updating working directory
266 1 files updated, 0 files merged, 0 files removed, 0 files unresolved
267 \end{verbatim}
268
269 Mercurial lets you access other repositories via the file system, over http or
270 over ssh.
271
272 \begin{verbatim}
273 elsewhere$ cd Jim-Poem
274 elsewhere$ hg log
275 changeset: 3:a065eb26e6b9
276 tag: tip
277 user: Jim Hague <jim.hague@acm.org>
278 date: Thu Apr 24 18:52:31 2008 +0100
279 summary: Rename my file
280
281 changeset: 2:ff97668b7422
282 user: Jim Hague <jim.hague@acm.org>
283 date: Thu Apr 24 18:50:22 2008 +0100
284 summary: Finished first verse
285
286 changeset: 1:3d65e7a57890
287 user: Jim Hague <jim.hague@acm.org>
288 date: Wed Apr 23 22:49:10 2008 +0100
289 summary: A great second line
290
291 changeset: 0:33596ef855c1
292 user: Jim Hague <jim.hague@acm.org>
293 date: Wed Apr 23 22:36:33 2008 +0100
294 summary: My Pome
295
296 $
297 \end{verbatim}
298
299 \texttt{hg clone} is aptly named. It creates a new repository that contains
300 exactly the same changes as the source repository. You can make a
301 clone just by copying your project directory, if you're confident
302 nothing else will access it during the copy. \texttt{hg clone} saves you this
303 worry, and sets the default push/pull location in the new repo to the
304 cloned repo.
305
306 From that point, you use \texttt{hg pull} to collect changes from other
307 places into your repo (though note it does not by default update your
308 working copy), and, as you might guess, \texttt{hg push} shoves your changes
309 into a foreign repository. By default these will act on the repository
310 you cloned from, but you can specify any other repository.
311
312 More on those in a moment. First, though, I want to show you something
313 you can't do in Subversion. Start with the repository with 4 changes
314 we just cloned. I want to focus on the first couple of lines, so I'll
315 wind the working copy back to the point where only those lines exist.
316
317 \begin{verbatim}
318 $ hg update -r 1
319 1 files updated, 0 files merged, 1 files removed, 0 files unresolved
320 $
321 \end{verbatim}
322
323 And make a change.
324
325 \begin{verbatim}
326 $ hg diff
327 diff -r 3d65e7a57890 pome.txt
328 --- a/pome.txt Wed Apr 23 22:49:10 2008 +0100
329 +++ b/pome.txt Thu Apr 24 19:13:14 2008 +0100
330 @@ -1,2 +1,2 @@ There was a gibbon one morning
331 -There was a gibbon one morning
332 -said "I think I will fly to the moon".
333 +There was a baboon who one afternoon
334 +said "I think I will fly to the sun".
335 $ hg commit -m "Better first two lines"
336 $
337 \end{verbatim}
338
339 The alert among you will have sat up at that. Well done! Yes, there's
340 something very worrying. How can I commit a change at an old point?
341 If you try this in Subversion, it will complain mightily about your
342 file being out of date. But Mercurial just went ahead and did
343 something. The Bazaar experts among you will know that in Bazaar, if
344 you use \texttt{bzr revert -r} to bring the working copy to a past revision,
345 make a change and commit, then your latest version will be the past
346 revision plus your change. Perhaps that's what Mercurial did?
347
348 No. What Mercurial did is central to Mercurial's view of the
349 world. You took your working copy back to an old changeset, and then
350 committed a fresh change based at that changeset. Mercurial actually
351 did just what you asked it to do, no more and no less. Let's see the
352 initial evidence.
353
354 \begin{verbatim}
355 $ hg heads
356 changeset: 4:267d32f158b3
357 tag: tip
358 parent: 1:3d65e7a57890
359 user: Jim Hague <jim.hague@acm.org>
360 date: Thu Apr 24 19:13:59 2008 +0100
361 summary: Better first two lines
362
363 changeset: 3:a065eb26e6b9
364 user: Jim Hague <jim.hague@acm.org>
365 date: Thu Apr 24 18:52:31 2008 +0100
366 summary: Rename my file
367
368 $
369 \end{verbatim}
370
371 Time for some more Mercurial terminology. You can think of a \texttt{head} in
372 Mercurial as the most recent change on a branch. In Mercurial, a
373 branch is simply what happens when you commit a change that has as its
374 parent a change that already has a child. Mercurial has a standard
375 extension \texttt{hg glog} which uses some ASCII art to show the current
376 state:
377
378 \begin{verbatim}
379 $ hg glog
380 @ changeset: 4:267d32f158b3
381 | tag: tip
382 | parent: 1:3d65e7a57890
383 | user: Jim Hague <jim.hague@acm.org>
384 | date: Thu Apr 24 19:13:59 2008 +0100
385 | summary: Better first two lines
386 |
387 | o changeset: 3:a065eb26e6b9
388 | | user: Jim Hague <jim.hague@acm.org>
389 | | date: Thu Apr 24 18:52:31 2008 +0100
390 | | summary: Rename my file
391 | |
392 | o changeset: 2:ff97668b7422
393 |/ user: Jim Hague <jim.hague@acm.org>
394 | date: Thu Apr 24 18:50:22 2008 +0100
395 | summary: Finished first verse
396 |
397 o changeset: 1:3d65e7a57890
398 | user: Jim Hague <jim.hague@acm.org>
399 | date: Wed Apr 23 22:49:10 2008 +0100
400 | summary: A great second line
401 |
402 o changeset: 0:33596ef855c1
403 user: Jim Hague <jim.hague@acm.org>
404 date: Wed Apr 23 22:36:33 2008 +0100
405 summary: My Pome
406
407 $
408 \end{verbatim}
409
410 \texttt{hg view} shows a nicer graphical view\footnote{Though, being
411 Tcl/Tk based, not that much nicer.}.
412
413 So the change is in there. It's the latest change, and is simply on a
414 different branch to the other changes.
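
If it helps to see the definition in code, a head is just a changeset
that no other changeset names as a parent. A minimal Python sketch,
assuming the history is held as a plain dictionary of parent links
(my own illustrative structure, not how Mercurial stores it):

```python
def heads(parents):
    """Return the changesets that have no children, newest first.

    `parents` maps each changeset number to a tuple of its parent numbers.
    """
    has_child = {p for ps in parents.values() for p in ps}
    return sorted((rev for rev in parents if rev not in has_child), reverse=True)


# Parent links for the five changesets in the graph shown by hg glog.
history = {0: (), 1: (0,), 2: (1,), 3: (2,), 4: (1,)}
print(heads(history))  # [4, 3]
```

Changeset 1 has two children here, which is exactly the condition that
creates a branch and hence a second head.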
415
416 Almost invariably, you will want to bring everything back together and
417 merge the branches. A merge is a change that combines two heads back
418 into one. It prepares an updated working directory with the merged
419 contents of the two heads for you to review and, if satisfactory,
420 commit.
421
422 \begin{verbatim}
423 $ hg merge
424 merging pome.txt and poem.txt
425 0 files updated, 1 files merged, 0 files removed, 0 files unresolved
426 (branch merge, don't forget to commit)
427 $ cat poem.txt
428 There was a baboon who one afternoon
429 said "I think I will fly to the sun".
430 So with two great palms strapped to his arms,
431 he started his takeoff run.
432 $ hg commit -m "Merge first line branch"
433 $
434 \end{verbatim}
435
436 (I'm no poet. The poem is, of
437 course, \textit{Silly Old Baboon} by the late, great, Spike
438 Milligan. From \textit{A Book of Milliganimals}, Puffin, 1971.)
439
440 Here's the ASCII art again showing what just happened.
441 Oh, and notice in the above that Mercurial has done the
442 right thing with regard to the rename.
443
444 \begin{verbatim}
445 $ hg glog
446 @ changeset: 5:792ab970fc80
447 |\ tag: tip
448 | | parent: 4:267d32f158b3
449 | | parent: 3:a065eb26e6b9
450 | | user: Jim Hague <jim.hague@acm.org>
451 | | date: Thu Apr 24 19:29:53 2008 +0100
452 | | summary: Merge first line branch
453 | |
454 | o changeset: 4:267d32f158b3
455 | | parent: 1:3d65e7a57890
456 | | user: Jim Hague <jim.hague@acm.org>
457 | | date: Thu Apr 24 19:13:59 2008 +0100
458 | | summary: Better first two lines
459 | |
460 o | changeset: 3:a065eb26e6b9
461 | | user: Jim Hague <jim.hague@acm.org>
462 | | date: Thu Apr 24 18:52:31 2008 +0100
463 | | summary: Rename my file
464 | |
465 o | changeset: 2:ff97668b7422
466 |/ user: Jim Hague <jim.hague@acm.org>
467 | date: Thu Apr 24 18:50:22 2008 +0100
468 | summary: Finished first verse
469 |
470 o changeset: 1:3d65e7a57890
471 | user: Jim Hague <jim.hague@acm.org>
472 | date: Wed Apr 23 22:49:10 2008 +0100
473 | summary: A great second line
474 |
475 o changeset: 0:33596ef855c1
476 user: Jim Hague <jim.hague@acm.org>
477 date: Wed Apr 23 22:36:33 2008 +0100
478 summary: My Pome
479
480 $
481 \end{verbatim}
482
483 So, our little branch change has now been merged back, and we have a
484 single line of development again. Notice that unlike the other
485 changesets, changeset 5 has two parent changesets, indicating it is a
486 merge changeset. You can only merge two branches in one operation; or
487 putting it another way, a changeset can have a maximum of two parents.
488
489 This behaviour is absolutely central to Mercurial's philosophy. If a
490 change is committed that takes as its starting point a change that
491 already has a child, then a branch gets created. Working with
492 Mercurial, branches get created frequently, and equally frequently
493 merged back. As befits any frequent operation, both are easy to do.
494
495 You're probably thinking at this point that this making a commit onto
496 an old version is a slightly strange thing to do, and you'd be right.
497 But that's exactly what's going to happen the moment you go
498 distributed. Two people working independently with their own
499 repositories are going to make commits based, typically, on the latest
500 changes they happen to have incorporated into their tree. To be
501 Distributed, a DVCS has to deal with this. Mercurial faces it head-on.
502 When you pull changes into your repo (or someone else pushes them), if
503 any of the changes overlap~--- are both based on the same base change~---
504 you get extra heads, and it's up to you to let these extra heads live
505 or merge, as you please.
506
507 In practice this is more manageable than you might think. Consider a
508 typical Mercurial usage, where the 'master' repo sits on a known
509 server, and everyone pulls changes from the master and pushes their
510 own efforts to the master. By default Mercurial won't let you push if
511 the receiving repo will gain an extra head as a result, so you
512 typically pull (and do any required merging) just before
513 pushing. Subversion users will recognise this pattern. Subversion
514 won't let you commit a change if your working copy is not at the very
515 latest revision, so the Subversion user will update, and merge if
516 necessary, just before committing.
517
518 What, then, about a branch in the conventional sense of '1.0
519 maintenance branch'? Typically in Mercurial you'd handle this by
520 keeping a separate cloned repository for those changes. Cloning is
521 fast and, if local, uses hard links where possible on filesystems that
522 support them, so isn't necessarily extravagant on disc space. You can,
523 if you prefer, handle them all in a single repo with 'named
524 branches', but cloning is definitely simpler.
525
526 OK, so now you know the basics of using Mercurial. We can proceed to
527 looking at how this magic is achieved. In particular, where does this
528 magic globally unique identifier for a change come from?
529
530 \subsection{Inside the Mercurial repo}
531 The way Mercurial handles its repo is really quite simple.
532
533 That's simple, as in 'most things are simple once you know the
534 answer'. I found the explanation helpful\footnote{For the curious,
535 Bryan O'Sullivan's excellent Mercurial book
536 has a chapter on the subject, and the Mercurial website has a fair amount
537 of detail too.}, so this section attempts
538 the 10,000ft (FL100 if you prefer) view of Mercurial.
539
540 First remember that any file or component can only have one or two
541 parents. You can't merge more than one other branch at once.
542
543 We start with the basic building block, which Mercurial calls a
544 revlog. A revlog is a thing that holds a file and all the changes in
545 the file history\footnote{For any non-trivial file, this will
546 actually be two files on the disc, a data file and an index.}. The
547 revlog stores the differences between successive versions
548 of the file, though it will periodically store a complete version of
549 the file instead of a difference, so that the content of any
550 particular file version can always be reconstructed without excessive
551 effort.
552
553 Under the secret-squirrel Mercurial \texttt{.hg} directory at the top of your
554 project is a store which holds a revlog for each file in your
555 project. So you have the complete history of the project locally. No
556 more round trips to the server.
557
558 Both the differences between successive versions and the periodic
559 complete versions of a file are compressed before storing. This is
560 surprisingly effective at minimising the storage requirements of this
561 entire history of your project. I have a small Java project handy,
562 comprising a little over 300 source modules. There are 5 branches plus
563 the mainline, and some 1920 commits in all. A Subversion checkout of
564 the current mainline takes 51MB. Converting the project to Mercurial
565 yields a Mercurial repository that takes 60MB, so a little
566 bigger. Remember, though, that the Mercurial repository includes not
567 just the working copy, but also the entire history of the project.
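
To make the storage scheme concrete, here is a toy revlog in Python. It
is a sketch under loud assumptions: the class name, the line-based JSON
deltas and the fixed snapshot interval are all mine, chosen for brevity,
and Mercurial's real delta encoding and snapshot policy differ. What it
does show is the shape of the idea: compressed differences, an occasional
compressed full version, and reconstruction by replaying differences
from the nearest full version.

```python
import difflib
import json
import zlib


class ToyRevlog:
    """Keep every version of one file as compressed deltas plus snapshots."""

    def __init__(self, snapshot_every=4):
        self.snapshot_every = snapshot_every  # assumed policy, illustration only
        self.entries = []  # one (base_rev, compressed_payload) per revision

    def add(self, text):
        rev = len(self.entries)
        if rev % self.snapshot_every == 0:
            # Store a complete version; this revision is its own base.
            payload, base = text, rev
        else:
            # Store only the line ranges that differ from the previous version.
            old = self.get(rev - 1).splitlines(keepends=True)
            new = text.splitlines(keepends=True)
            ops = difflib.SequenceMatcher(None, old, new, autojunk=False).get_opcodes()
            delta = [[i1, i2, new[j1:j2]] for tag, i1, i2, j1, j2 in ops if tag != "equal"]
            payload, base = json.dumps(delta), self.entries[rev - 1][0]
        self.entries.append((base, zlib.compress(payload.encode())))
        return rev

    def get(self, rev):
        base = self.entries[rev][0]
        lines = self._load(base).splitlines(keepends=True)
        for r in range(base + 1, rev + 1):
            # Apply edits back to front so earlier indices stay valid.
            for start, end, replacement in reversed(json.loads(self._load(r))):
                lines[start:end] = replacement
        return "".join(lines)

    def _load(self, rev):
        return zlib.decompress(self.entries[rev][1]).decode()


v0 = "There was a gibbon one morning\n"
v1 = v0 + 'said "I think I will fly to the moon".\n'
v2 = v1 + "So with two great palms strapped to his arms,\nhe started his takeoff run.\n"

log = ToyRevlog(snapshot_every=3)
for version in (v0, v1, v2):
    log.add(version)
print(log.get(1) == v1)  # True
```

The cost of reading any version is bounded by the distance back to the
last full version, which is the point of storing one every so often.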
568
569 Any point in the evolution of a revlog can be uniquely identified with
570 a nodeid. This is simply the SHA1 hash of the current file contents
571 concatenated with the nodeids of one or both parents of the current
572 revision. Note that this way, two file states are identical if and
573 only if the file contents are the same \emph{and} the file has the
574 same history.
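
In Python the calculation is only a few lines. Treat the details as
assumptions: the ordering used here (the two parent nodeids sorted,
followed by the contents) is my understanding of what Mercurial does,
and the function name is my own.

```python
import hashlib

NULL_ID = b"\0" * 20  # the all-zero nodeid standing in for an absent parent


def nodeid(text, p1=NULL_ID, p2=NULL_ID):
    """Hash the parents' nodeids together with the file contents."""
    first, second = sorted((p1, p2))  # sorting makes the parent order irrelevant
    return hashlib.sha1(first + second + text).digest()


contents = b"There was a gibbon one morning\n"
root = nodeid(contents)            # no parents: the first version of a file
child = nodeid(contents, p1=root)  # same contents, but with some history
print(root.hex() != child.hex())   # True
```

The same bytes with a different parent give a different nodeid, which is
the "same contents \emph{and} same history" property in action.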
575
576 Here's a dump of a revlog index:
577
578 \begin{verbatim}
579 $ hg debugindex .hg/store/data/pome.txt.i
580 rev offset length base linkrev nodeid p1 p2
581 0 0 32 0 0 6bbbd5d6cc53 000000000000 000000000000
582 1 32 51 0 1 83d266583303 6bbbd5d6cc53 000000000000
583 2 83 84 0 2 14a54ec34bb6 83d266583303 000000000000
584 3 167 76 3 4 dc4df776b38b 83d266583303 000000000000
585 $
586 \end{verbatim}
587
588 Note here that a file state can have two parents. If both the parent
589 nodeids are non-null, the file state has two parents, and the state is
590 therefore the result of a merge.
591
592 Let's dump out a revlog at a particular revision:
593
594 \begin{verbatim}
595 $ hg debugdata .hg/store/data/pome.txt.i 2
596 There was a gibbon one morning
597 said "I think I will fly to the moon".
598 So with two great palms strapped to his arms,
599 he started his takeoff run.
600 $
601 \end{verbatim}
602
603 The next component is the manifest. This is simply a list of all the
604 files in the project, together with their current nodeids. The
605 manifest is a file, held in a revlog. The nodeid of the manifest,
606 therefore, identifies the project filesystem at a particular point.
607
608 \begin{verbatim}
609 $ hg debugdata .hg/store/00manifest.i 5
610 poem.txt5168b1a5e2f44aa4e0f164e170820845183f50c8
611 $
612 \end{verbatim}
613
614 Finally we have the changeset. This is the atomic collection of
615 changes to a repository that leads to a new revision. The changeset
616 info includes the nodeid of the corresponding manifest, the timestamp
617 and committer ID, a list of changed files and a comment. The changeset
618 also includes the nodeid of the parent changeset, or the two parents
619 if the change is a merge. The changeset description is held in a
620 revlog, the changelog.
621
622 \begin{verbatim}
623 $ hg debugdata .hg/store/00changelog.i 5
624 1ccc11b6f7308cc8fa1573c2f3811a4710c91e3e
625 Jim Hague <jim.hague@acm.org>
626 1209061793 -3600
627 poem.txt
628 pome.txt
629
630 Merge first line branch
631 $
632 \end{verbatim}
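
The layout is regular enough to pick apart by hand. This Python sketch
parses the entry shown above; the field names in the returned dictionary
are my own labels, and I am assuming the layout is always as in this one
example (manifest nodeid, user, time and offset, file list, blank line,
comment).

```python
def parse_changeset(raw):
    """Split the text of a changelog entry into its fields."""
    header, _, description = raw.partition("\n\n")
    lines = header.split("\n")
    timestamp, offset = lines[2].split(" ")[:2]
    return {
        "manifest": lines[0],
        "user": lines[1],
        "time": int(timestamp),    # seconds since the Unix epoch
        "tz_offset": int(offset),  # seconds; -3600 goes with +0100 in hg log
        "files": lines[3:],
        "description": description.rstrip("\n"),
    }


entry = (
    "1ccc11b6f7308cc8fa1573c2f3811a4710c91e3e\n"
    "Jim Hague <jim.hague@acm.org>\n"
    "1209061793 -3600\n"
    "poem.txt\n"
    "pome.txt\n"
    "\n"
    "Merge first line branch\n"
)
info = parse_changeset(entry)
print(info["files"])  # ['poem.txt', 'pome.txt']
```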
633
634 The nodeid of the changeset, therefore, gives us a globally unique
635 identifier for any particular change. Changesets have a
636 Subversion-like incrementing change number, but it is peculiar to that
637 repository. The nodeid, however, is global.
638
639 One more detail remains to complete the picture. How do we get back
640 from a particular file change to find the responsible changeset? Each
641 revlog change has a linkrev entry that does just this.
642
643 So, now we have a repository with a history of the changes applied to
644 that repository. Each change has a unique identifier. If we find that
645 change in another repository, it means that at the point in the other
646 repository we have exactly the same state; the file contents and
647 history are identical.
648
649 At this point we can see how pulling changes from another repository
650 works. Mercurial has to determine which changesets in the source
651 repository are missing in the target repository. To do this, for each
652 head in the source repo it has to find the most recent change in that
653 head that is already present in the target repo, and get any remaining
654 changes after that point. These changes are then copied over and
655 applied.
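
As a sketch of that search in Python: walk back from each source head
and stop as soon as you reach a changeset the target already holds. The
big simplifying assumption is that the whole source graph is to hand as a
dictionary; the real thing has to negotiate over the wire, and none of
these names come from Mercurial.

```python
def missing_changesets(source_parents, source_heads, target_known):
    """List nodeids the target lacks, parents always before children.

    `source_parents` maps nodeid -> tuple of parent nodeids in the source.
    `target_known` is the set of nodeids already present in the target.
    """
    missing = []
    seen = set()

    def visit(node):
        # Anything the target has brings all of its ancestors with it.
        if node in target_known or node in seen:
            return
        seen.add(node)
        for parent in source_parents[node]:
            visit(parent)
        missing.append(node)  # added only once its parents are dealt with

    for head in source_heads:
        visit(head)
    return missing


# The poem repository; the target was cloned before the branch was made.
source = {
    "33596ef855c1": (),
    "3d65e7a57890": ("33596ef855c1",),
    "ff97668b7422": ("3d65e7a57890",),
    "a065eb26e6b9": ("ff97668b7422",),
    "267d32f158b3": ("3d65e7a57890",),
    "792ab970fc80": ("267d32f158b3", "a065eb26e6b9"),
}
target = {"33596ef855c1", "3d65e7a57890", "ff97668b7422", "a065eb26e6b9"}
print(missing_changesets(source, ["792ab970fc80"], target))
# ['267d32f158b3', '792ab970fc80']
```

Being recursive, this would need reworking for a long history, but the
ordering it produces is the important part: a changeset is only applied
after its parents exist in the target.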
656
657 The Mercurial revlog format has proved remarkably durable. Since the
658 first release of Mercurial in April 2005, there have been a total of 5
659 changes to the file format. However, of those, all but one have been
660 changes to the handling of file names. The most recent change, in
661 October 2008, and its predecessor in December 2006, were both
662 introduced purely to cope with Windows specific issues. The one change
663 that touched the data structures described above was in April 2006. The
664 format introduced, RevLogNG, changed only the details of index data
665 held, not the overall design. The chief Mercurial developer, Matt
666 Mackall, notes that the code in present-day Mercurial devoted to
667 reading the old format comprises 28 lines of Python. Compared with,
668 say, the early tribulations of Subversion and the switch from \texttt{bdb} to
669 \texttt{fsfs}, this is an impressive record.
670
671 \section{Reflections on going distributed}
672 It's nearly traditional at this stage in an introduction to DVCS to
673 demonstrate several different workflow scenarios that you can build
674 with a DVCS. Which makes the important point that a DVCS can be
675 adapted to your workflow in a way that is at best unwieldy with a
676 CVCS. I intend, though, to break with tradition here.
677
678 By this stage, I hope you can see that distributing version control
679 works by introducing branches where development takes place in
680 parallel. Mercurial treats these branches as arising naturally from
681 the commits made and transferred between repositories. Both Git and
682 Bazaar take a slightly different viewpoint, and explicitly generate a
683 fresh branch for work in a particular repository. But in both cases
684 the underlying principle of identifying changes by a globally unique
685 identifier and resolving parallel development by merges between
686 overlapping changes is the same. And all three can be used in a truly
687 distributed manner, with full history and the ability to commit being
688 available locally.
689
690 So rather than chatter on about workflows, I want to reflect on
691 the consequences all this has for that all-important question of
692 whether a DVCS is a suitable vehicle for your data.
693
694 The first is a minor and rather obvious point. If you want to store
695 files that are very large and which change often in your DVCS, then
696 all the compression in the world is unlikely to stop the storage
697 requirements for the full project history from becoming uncomfortably
698 large, particularly if the files are not very compressible to start
699 with.
700
701 The second, and main, point is that there is an important question you
702 need to ask about your data. We've seen that a DVCS relies on
703 branching and merging to weave its magic. So take a close look at your
704 data, and ask:
705
706 \standout{Will It Merge?}
707
708 The subset of plain old text which comprises program source
709 code requires some human oversight, but will merge automatically
710 well enough for the process to be well within the bounds of the
711 possible.
712
713 Unfortunately when we move further afield mergeability becomes a rarer
714 commodity. I nearly began the previous paragraph by stating that
715 plain old text will merge well enough. Then Doubt set in~--- what about
716 XML? Or BASE64 encoded content?
717
718 Of course, merge doesn't necessarily have to be textual merge. I am
719 told that Word can be used to diff and merge two Word \texttt{.doc} files, a
720 data format notorious for its binary impenetrability. As long as some
721 suitable merge agent is available, and the DVCS can be configured to
722 use it for data of a particular type\footnote{Mercurial can have the
723 merge and diff tools specified with reference to the file extension on
724 which they operate~--- I assume Bazaar and Git are similar.}, then there
725 is no bar to successful DVCS use.
726
727 Before this reliance on mergeability causes you to dismiss DVCS out of
728 hand, reflect. A CVCS can only handle non-mergeable data by acting as
729 a versioned file store; in other words, having as the only available
730 merge option the use of one or other of the merge candidates in its
731 entirety. Useful though a versioned file store can be, it cannot be
732 considered a full-featured version control system. By treating the
733 offending unmergeable files as external to the DVCS, or with careful
734 workflow~--- disabling the distributed and mergeable potentials~--- a DVCS
735 can deal with these files, but only at a cost of its distributedness
736 or its version control system-ness. In this it differs little from a
737 CVCS.
738
739 So, for all data you want to version control, let your battle cry be:
740
741 \standout{Will It Merge?}
742
743 At this point, I have an urge to don lab coat and safety goggles and
744 be videoed attempting to mechanically merge data in a variety of
745 different formats. Frankly, this is unlikely to be as exciting as
746 blending iPhones\footnote{\url{http://www.willitblend.com}},
747 but from a system development point of view it's rather more
748 important. And, I think, gives us a large clue as to one of the
749 reasons for the continuing
750 popularity of Plain Old Text as a source code representation mechanism.
751
752 \end{document}