comparison Hg.tex @ 9:2155510c62f3
A version formatted with Latex. And spellchecked.
author | Jim Hague <jim.hague@acm.org> |
---|---|
date | Fri, 22 May 2009 10:23:40 +0100 |
parents | |
children | 2e4d690ffabb |
8:abca12aaa38d | 9:2155510c62f3 |
---|---|
1 \documentclass[a4paper]{article} | |
2 \usepackage{pslatex} | |
3 \usepackage{url} | |
4 | |
5 \newcommand{\standout}[1]{ | |
6 {\begin{center} \large \textbf{#1} \end{center}} | |
7 } | |
8 | |
9 \setlength{\parskip}{2mm} | |
10 \setlength{\parindent}{0mm} | |
11 | |
12 \begin{document} | |
13 \title{Inside a distributed version control system} | |
14 \author{Jim Hague\\ | |
15 \texttt{jim.hague@acm.org}} | |
16 \date{May 2009} | |
17 \maketitle | |
18 | |
19 \section{Preamble} | |
20 Grinton Lodge is a Youth Hostel that sits on an exposed hillside just | |
21 above the small hamlet of Grinton in Swaledale, in the Yorkshire Dales | |
22 National Park. A former Victorian shooting lodge, it now welcomes | |
23 walkers and other travellers from around the world. | |
24 | |
25 Tonight, a Wednesday in mid-November, is not one of its busiest | |
26 nights. Kat, the duty staff member, tells me that there is a small | |
27 corporate team-building group in the annex. There's no sign of them at | |
28 present. Otherwise, that portion of the world that has beaten a path | |
29 to the door of this grand building today consists of just me. And Kat | |
30 goes home soon. | |
31 | |
32 The November CVu, removed from its wrappers and read yesterday, lies | |
33 in my bag. Taunting me. Go on, it says, if you're ever going to put | |
34 finger to keyboard in the name of CVu, well, tonight you are out of | |
35 excuses. | |
36 | |
37 Bugger. | |
38 | |
39 \section{Let's look into Mercurial} | |
40 If you're at all interested in version control systems~--- and any | |
41 software developer not using one daily is a strange beast indeed~--- | |
42 you'll at least have become vaguely aware in the last few years of the | |
43 growing maturity of the latest group of version control systems | |
44 offering funky new stuff. These are the distributed version control | |
45 systems (DVCS). There is more to them than just their headline | |
46 attributes, being able to check history and do checkins while | |
47 disconnected from a central server, but these are damn useful to start | |
48 with. | |
49 | |
50 When I first heard about DVCS, it wasn't immediately obvious to me (to | |
51 put it mildly) how they would work. After years of using a centralised | |
52 version control system, I had a rough mental model of what went on. But | |
53 how do you cope without the central server forcing ordering onto the | |
54 changes? | |
55 | |
56 Since then I've started using Mercurial\footnote{ | |
57 \url{http://www.selenic.com/mercurial}}. | |
58 Mercurial is a DVCS. It's one of | |
59 three DVCSs that have gained significant popularity in the last few | |
60 years, the other two being Git\footnote{\url{http://git-scm.com}} and | |
61 Bazaar\footnote{\url{http://bazaar-vcs.org/}}. | |
62 I switched a significant work project over | |
63 to Mercurial (from Subversion) in mid-2007, because a customer site | |
64 required on-site work but could not allow access back to the company | |
65 VPN. I chose Mercurial for a variety of reasons which I won't bore you | |
66 with here\footnote{ | |
67 OK, if you must know: | |
68 \begin{itemize} | |
69 \item Implementability. I needed the system to work on Windows, Linux and | |
70 AIX. The latter was not one of the directly supported platforms for | |
71 any of the candidates. Git's implementation uses a horde of | |
72 tools. Bazaar requires only Python, but required Python 2.4 while IBM | |
73 stubbornly still supplies only Python 2.3. Mercurial requires Python | |
74 2.3 or greater, and uses some C for speed. | |
75 \item Simplicity. My users used Subversion daily, but did not generally | |
76 have much experience with other VCS. From the command line, | |
77 Mercurial's core operations will be familiar to a Subversion | |
78 user. This is also true of Bazaar, but was less true of Git. Git has | |
79 improved in this matter since then, but a Mr Winder of this parish | |
80 tells me that it's still possible to seriously embarrass | |
81 yourself. There was also a lack of Windows support for Git at the | |
82 time. | |
83 \item Speed. Mercurial is fast. In the same ballpark as Git. Bazaar | |
84 wasn't, and although it has improved significantly, has, in my | |
85 estimation, added user complexity in the process, and at the time | |
86 of writing is still off the pace for some operations. | |
87 \item Documentation. At the time, Bryan O'Sullivan's excellent Mercurial | |
88 book (\url{http://hgbook.red-bean.com}) was a clear winner for best | |
89 documentation. | |
90 \end{itemize}}. | |
91 | |
92 What I want to do in this article is give you an insight into how a | |
93 DVCS works. OK, so specifically I'm going to be talking about | |
94 Mercurial, but Git and Bazaar attack the problem in a similar way. But | |
95 first I'd better give you some idea of how you use Mercurial. | |
96 | |
97 \subsection{The 5 minute Mercurial overview} | |
98 \subsubsection{The basics} | |
99 I think it unlikely that someone possessing the taste and discernment | |
100 to be reading CVu would not be familiar with at least one version | |
101 control system. So, while I want to give you a flavour of what it's | |
102 like to use, I'm not going to hang about. If you'd like a proper | |
103 introduction, or you don't follow something, I thoroughly recommend | |
104 you consult the Mercurial book. | |
105 | |
106 To start using Mercurial to keep track of a project: | |
107 | |
108 \begin{verbatim} | |
109 $ hg init | |
110 $ | |
111 \end{verbatim} | |
112 | |
113 This creates the repository root in the current directory. | |
114 | |
115 Like CVS\footnote{\url{http://www.nongnu.org/cvs/}} | |
116 with its \texttt{CVS} directory and | |
117 Subversion\footnote{\url{http://subversion.tigris.org/}} | |
118 with its \texttt{.svn} | |
119 directory, Mercurial keeps its private data in a directory. Mercifully there is | |
120 only one of these, in the top level of your project. And rather than | |
121 holding details of where the actual repository is to be found, the \texttt{.hg} | |
122 directory holds the entire repository. | |
123 | |
124 Next you need to specify the files you want Mercurial to track. | |
125 | |
126 \begin{verbatim} | |
127 $ echo "There was a gibbon one morning" > pome.txt | |
128 $ hg add pome.txt | |
129 $ | |
130 \end{verbatim} | |
131 | |
132 As you might expect, this marks the files as to be added. And as you | |
133 might also expect, you need to commit to record the added files in the | |
134 repository. The commit comment can be supplied on the command line; if | |
135 you don't supply a comment, you'll be dropped into an editor to | |
136 provide one. | |
137 | |
138 There is a suggested format for these messages~--- a one line summary | |
139 followed by any more required detail on following lines. By default | |
140 Mercurial will only display the first line of commit messages when | |
141 listing changes. In these examples I'll stick to terse messages, and | |
142 I'll enter them from the command line. | |
143 | |
144 \begin{verbatim} | |
145 $ hg commit -m "My Pome" -u "Jim Hague <jim.hague@acm.org>" | |
146 $ | |
147 \end{verbatim} | |
148 | |
149 Mercurial records the user making the change as part of the change | |
150 information. It is usual to give your name and email address as I've | |
151 done here. You can imagine, though, that constantly having to repeat | |
152 this is a bit tedious, so you can set a default user name in a | |
153 configuration file. Mercurial keeps global, user and repository | |
154 configurations, and it can go in any of those. | |
155 | |
156 As with Subversion, after further edits you can see how your working copy | |
157 differs from the repository. | |
158 | |
159 \begin{verbatim} | |
160 $ hg status | |
161 M pome.txt | |
162 $ hg diff | |
163 diff -r 33596ef855c1 pome.txt | |
164 --- a/pome.txt Wed Apr 23 22:36:33 2008 +0100 | |
165 +++ b/pome.txt Wed Apr 23 22:48:01 2008 +0100 | |
166 @@ -1,1 +1,2 @@ There was a gibbon one morning | |
167 There was a gibbon one morning | |
168 +said "I think I will fly to the moon". | |
169 $ hg commit -m "A great second line" | |
170 $ | |
171 \end{verbatim} | |
172 | |
173 And look through a log of changes. | |
174 | |
175 \begin{verbatim} | |
176 $ hg log | |
177 changeset: 1:3d65e7a57890 | |
178 tag: tip | |
179 user: Jim Hague <jim.hague@acm.org> | |
180 date: Wed Apr 23 22:49:10 2008 +0100 | |
181 summary: A great second line | |
182 | |
183 changeset: 0:33596ef855c1 | |
184 user: Jim Hague <jim.hague@acm.org> | |
185 date: Wed Apr 23 22:36:33 2008 +0100 | |
186 summary: My Pome | |
187 | |
188 $ | |
189 \end{verbatim} | |
190 | |
191 There are some items here that need an explanation. | |
192 | |
193 The changeset identifier is in fact two identifiers separated by a | |
194 colon. The first is the sequence number of the changeset in the | |
195 repository, and is directly comparable to the change number in a | |
196 Subversion repository. The second is a globally unique identifier for | |
197 that change. As the change is copied from one repository to another | |
198 (this is a distributed system, remember, even if we haven't come to | |
199 that bit yet), its sequence number in any particular repository will | |
200 change, but the global identifier will always remain the same. | |
201 | |
202 \texttt{tip} is a Mercurial term. It means simply the most recent change. | |
203 | |
204 Want to rename a file? | |
205 | |
206 \begin{verbatim} | |
207 $ hg mv pome.txt poem.txt | |
208 $ hg status | |
209 A poem.txt | |
210 R pome.txt | |
211 $ hg commit -m "Rename my file" | |
212 $ | |
213 \end{verbatim} | |
214 (The command to rename a file is actually \texttt{hg rename}, | |
215 but Mercurial saves Unix-trained fingers from | |
216 typing embarrassment.) | |
217 | |
218 At this point you may be wondering about directories. \texttt{hg mkdir} | |
219 perhaps? Well, no. Mercurial only tracks files. To be sure, the | |
220 directory a file occupies is tracked, but effectively only as a | |
221 component of the file name. This has the slightly unexpected result | |
222 that you can't record an empty directory in your repository.\footnote{ | |
223 I tripped over this converting a work Subversion | |
224 repository. One possibility is to create a placeholder file in the | |
225 directory. In the event I created the directory (which receives build | |
226 products) as part of the build instead.} | |
227 | |
228 Given this, and the status output above that suggests strongly that | |
229 Mercurial treats a rename as a copy followed by a delete, you may be | |
230 worried that Mercurial won't cope at all well with rearranging your | |
231 repository. Relax. Mercurial does store the details of the rename as | |
232 part of the changeset, and copes very well with rearrangements\footnote{ | |
233 The Mercurial designers justify not dealing with | |
234 directories as first class objects by pointing out that provided you | |
235 can correctly move files about in the tree, the other reasons for | |
236 tracking directories are uncommon and do not in their opinion justify | |
237 the considerable added complexity. So far I've found no reason to | |
238 doubt that judgement.}. | |
239 | |
240 Want to rewind the working copy to a previous revision? | |
241 | |
242 \begin{verbatim} | |
243 $ hg update -r 1 | |
244 1 files updated, 0 files merged, 1 files removed, 0 files unresolved | |
245 $ | |
246 \end{verbatim} | |
247 | |
248 \texttt{hg update} updates the working files. In this case I'm specifying | |
249 that I want to go back to local changeset 1. I could also have typed | |
250 \texttt{-r 3d65e7a57890}, or even \texttt{-r 3d}; | |
251 when specifying the global change | |
252 identifier you only need to type enough digits to make it unique. | |
253 | |
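To make the prefix rule concrete, here is a minimal Python sketch of the lookup. It is an illustration only, not Mercurial's code: the function name \texttt{resolve\_prefix} is made up, and the real thing also has to decide whether what you typed is a local sequence number, a tag or a hash prefix.

```python
def resolve_prefix(prefix, nodeids):
    """Return the one identifier that starts with prefix, else raise."""
    matches = [n for n in nodeids if n.startswith(prefix)]
    if not matches:
        raise LookupError("unknown revision %r" % prefix)
    if len(matches) > 1:
        raise LookupError("ambiguous identifier %r" % prefix)
    return matches[0]


# The two short identifiers shown by hg log earlier in this article.
known = ["33596ef855c1", "3d65e7a57890"]

print(resolve_prefix("3d", known))   # 3d65e7a57890
# resolve_prefix("3", known) would raise: both identifiers start with 3.
```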
254 This is all very well, but it's not exactly distributed, is it? | |
255 | |
256 \subsubsection{Going distributed} | |
257 A version control system goes Distributed by allowing multiple copies | |
258 of the repository to exist, and work to be done in all those | |
259 repositories in parallel. So when you start work on an existing | |
260 project, the first thing to do is to get your own copy of the | |
261 repository. | |
262 | |
263 \begin{verbatim} | |
264 elsewhere$ hg clone ssh://jim.home.net/Poem Jim-Poem | |
265 updating working directory | |
266 1 files updated, 0 files merged, 0 files removed, 0 files unresolved | |
267 \end{verbatim} | |
268 | |
269 Mercurial lets you access other repositories via the file system, over http or | |
270 over ssh. | |
271 | |
272 \begin{verbatim} | |
273 elsewhere$ cd Jim-Poem | |
274 elsewhere$ hg log | |
275 changeset: 3:a065eb26e6b9 | |
276 tag: tip | |
277 user: Jim Hague <jim.hague@acm.org> | |
278 date: Thu Apr 24 18:52:31 2008 +0100 | |
279 summary: Rename my file | |
280 | |
281 changeset: 2:ff97668b7422 | |
282 user: Jim Hague <jim.hague@acm.org> | |
283 date: Thu Apr 24 18:50:22 2008 +0100 | |
284 summary: Finished first verse | |
285 | |
286 changeset: 1:3d65e7a57890 | |
287 user: Jim Hague <jim.hague@acm.org> | |
288 date: Wed Apr 23 22:49:10 2008 +0100 | |
289 summary: A great second line | |
290 | |
291 changeset: 0:33596ef855c1 | |
292 user: Jim Hague <jim.hague@acm.org> | |
293 date: Wed Apr 23 22:36:33 2008 +0100 | |
294 summary: My Pome | |
295 | |
296 $ | |
297 \end{verbatim} | |
298 | |
299 \texttt{hg clone} is aptly named. It creates a new repository that contains | |
300 exactly the same changes as the source repository. You can make a | |
301 clone just by copying your project directory, if you're confident | |
302 nothing else will access it during the copy. \texttt{hg clone} saves you this | |
303 worry, and sets the default push/pull location in the new repo to the | |
304 cloned repo. | |
305 | |
306 From that point, you use \texttt{hg pull} to collect changes from other | |
307 places into your repo (though note it does not by default update your | |
308 working copy), and, as you might guess, \texttt{hg push} shoves your changes | |
309 into a foreign repository. By default these will act on the repository | |
310 you cloned from, but you can specify any other repository. | |
311 | |
312 More on those in a moment. First, though, I want to show you something | |
313 you can't do in Subversion. Start with the repository with 4 changes | |
314 we just cloned. I want to focus on the first couple of lines, so I'll | |
315 wind the working copy back to the point where only those lines exist. | |
316 | |
317 \begin{verbatim} | |
318 $ hg update -r 1 | |
319 1 files updated, 0 files merged, 1 files removed, 0 files unresolved | |
320 $ | |
321 \end{verbatim} | |
322 | |
323 And make a change. | |
324 | |
325 \begin{verbatim} | |
326 $ hg diff | |
327 diff -r 3d65e7a57890 pome.txt | |
328 --- a/pome.txt Wed Apr 23 22:49:10 2008 +0100 | |
329 +++ b/pome.txt Thu Apr 24 19:13:14 2008 +0100 | |
330 @@ -1,2 +1,2 @@ There was a gibbon one morning | |
331 -There was a gibbon one morning | |
332 -said "I think I will fly to the moon". | |
333 +There was a baboon who one afternoon | |
334 +said "I think I will fly to the sun". | |
335 $ hg commit -m "Better first two lines" | |
336 $ | |
337 \end{verbatim} | |
338 | |
339 The alert among you will have sat up at that. Well done! Yes, there's | |
340 something very worrying. How can I commit a change at an old point? | |
341 If you try this in Subversion, it will complain mightily about your | |
342 file being out of date. But Mercurial just went ahead and did | |
343 something. The Bazaar experts among you will know that in Bazaar, if | |
344 you use \texttt{bzr revert -r} to bring the working copy to a past revision, | |
345 make a change and commit, then your latest version will be the past | |
346 revision plus your change. Perhaps that's what Mercurial did? | |
347 | |
348 No. What Mercurial did is central to Mercurial's view of the | |
349 world. You took your working copy back to an old changeset, and then | |
350 committed a fresh change based at that changeset. Mercurial actually | |
351 did just what you asked it to do, no more and no less. Let's see the | |
352 initial evidence. | |
353 | |
354 \begin{verbatim} | |
355 $ hg heads | |
356 changeset: 4:267d32f158b3 | |
357 tag: tip | |
358 parent: 1:3d65e7a57890 | |
359 user: Jim Hague <jim.hague@acm.org> | |
360 date: Thu Apr 24 19:13:59 2008 +0100 | |
361 summary: Better first two lines | |
362 | |
363 changeset: 3:a065eb26e6b9 | |
364 user: Jim Hague <jim.hague@acm.org> | |
365 date: Thu Apr 24 18:52:31 2008 +0100 | |
366 summary: Rename my file | |
367 | |
368 $ | |
369 \end{verbatim} | |
370 | |
371 Time for some more Mercurial terminology. You can think of a \texttt{head} in | |
372 Mercurial as the most recent change on a branch. In Mercurial, a | |
373 branch is simply what happens when you commit a change that has as its | |
374 parent a change that already has a child. Mercurial has a standard | |
375 extension \texttt{hg glog} which uses some ASCII art to show the current | |
376 state: | |
377 | |
378 \begin{verbatim} | |
379 $ hg glog | |
380 @ changeset: 4:267d32f158b3 | |
381 | tag: tip | |
382 | parent: 1:3d65e7a57890 | |
383 | user: Jim Hague <jim.hague@acm.org> | |
384 | date: Thu Apr 24 19:13:59 2008 +0100 | |
385 | summary: Better first two lines | |
386 | | |
387 | o changeset: 3:a065eb26e6b9 | |
388 | | user: Jim Hague <jim.hague@acm.org> | |
389 | | date: Thu Apr 24 18:52:31 2008 +0100 | |
390 | | summary: Rename my file | |
391 | | | |
392 | o changeset: 2:ff97668b7422 | |
393 |/ user: Jim Hague <jim.hague@acm.org> | |
394 | date: Thu Apr 24 18:50:22 2008 +0100 | |
395 | summary: Finished first verse | |
396 | | |
397 o changeset: 1:3d65e7a57890 | |
398 | user: Jim Hague <jim.hague@acm.org> | |
399 | date: Wed Apr 23 22:49:10 2008 +0100 | |
400 | summary: A great second line | |
401 | | |
402 o changeset: 0:33596ef855c1 | |
403 user: Jim Hague <jim.hague@acm.org> | |
404 date: Wed Apr 23 22:36:33 2008 +0100 | |
405 summary: My Pome | |
406 | |
407 $ | |
408 \end{verbatim} | |
409 | |
410 \texttt{hg view} shows a nicer graphical view\footnote{Though, being | |
411 Tcl/Tk based, not that much nicer.}. | |
412 | |
413 So the change is in there. It's the latest change, and is simply on a | |
414 different branch to the other changes. | |
415 | |
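If you prefer code to ASCII art, a head is simply a changeset that is nobody's parent. The Python sketch below works that out for the five changesets in the example, keyed by local sequence number. The dictionary layout and the function name are my own invention for illustration, not anything Mercurial exposes.

```python
# Parent links for changesets 0 to 4 in the example repository.
# Changeset 4 was committed on top of changeset 1, not changeset 3.
parents = {0: [], 1: [0], 2: [1], 3: [2], 4: [1]}


def heads(parents):
    """Return the changesets that have no children, newest first."""
    has_child = {p for ps in parents.values() for p in ps}
    return sorted((c for c in parents if c not in has_child), reverse=True)


print(heads(parents))   # [4, 3]
```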
416 Almost invariably, you will want to bring everything back together and | |
417 merge the branches. A merge is a change that combines two heads back | |
418 into one. It prepares an updated working directory with the merged | |
419 contents of the two heads for you to review and, if satisfactory, | |
420 commit. | |
421 | |
422 \begin{verbatim} | |
423 $ hg merge | |
424 merging pome.txt and poem.txt | |
425 0 files updated, 1 files merged, 0 files removed, 0 files unresolved | |
426 (branch merge, don't forget to commit) | |
427 $ cat poem.txt | |
428 There was a baboon who one afternoon | |
429 said "I think I will fly to the sun". | |
430 So with two great palms strapped to his arms, | |
431 he started his takeoff run. | |
432 $ hg commit -m "Merge first line branch" | |
433 $ | |
434 \end{verbatim} | |
435 | |
436 (I'm no poet. The poem is, of | |
437 course, \textit{Silly Old Baboon} by the late, great, Spike | |
438 Milligan. From \textit{A Book of Milliganimals}, Puffin, 1971.) | |
439 | |
440 Here's the ASCII art again showing what just happened. | |
441 Oh, and notice in the above that Mercurial has done the | |
442 right thing with regard to the rename. | |
443 | |
444 \begin{verbatim} | |
445 $ hg glog | |
446 @ changeset: 5:792ab970fc80 | |
447 |\ tag: tip | |
448 | | parent: 4:267d32f158b3 | |
449 | | parent: 3:a065eb26e6b9 | |
450 | | user: Jim Hague <jim.hague@acm.org> | |
451 | | date: Thu Apr 24 19:29:53 2008 +0100 | |
452 | | summary: Merge first line branch | |
453 | | | |
454 | o changeset: 4:267d32f158b3 | |
455 | | parent: 1:3d65e7a57890 | |
456 | | user: Jim Hague <jim.hague@acm.org> | |
457 | | date: Thu Apr 24 19:13:59 2008 +0100 | |
458 | | summary: Better first two lines | |
459 | | | |
460 o | changeset: 3:a065eb26e6b9 | |
461 | | user: Jim Hague <jim.hague@acm.org> | |
462 | | date: Thu Apr 24 18:52:31 2008 +0100 | |
463 | | summary: Rename my file | |
464 | | | |
465 o | changeset: 2:ff97668b7422 | |
466 |/ user: Jim Hague <jim.hague@acm.org> | |
467 | date: Thu Apr 24 18:50:22 2008 +0100 | |
468 | summary: Finished first verse | |
469 | | |
470 o changeset: 1:3d65e7a57890 | |
471 | user: Jim Hague <jim.hague@acm.org> | |
472 | date: Wed Apr 23 22:49:10 2008 +0100 | |
473 | summary: A great second line | |
474 | | |
475 o changeset: 0:33596ef855c1 | |
476 user: Jim Hague <jim.hague@acm.org> | |
477 date: Wed Apr 23 22:36:33 2008 +0100 | |
478 summary: My Pome | |
479 | |
480 $ | |
481 \end{verbatim} | |
482 | |
483 So, our little branch change has now been merged back, and we have a | |
484 single line of development again. Notice that unlike the other | |
485 changesets, changeset 5 has two parent changesets, indicating it is a | |
486 merge changeset. You can only merge two branches in one operation; or | |
487 putting it another way, a changeset can have a maximum of two parents. | |
488 | |
489 This behaviour is absolutely central to Mercurial's philosophy. If a | |
490 change is committed that takes as its starting point a change that | |
491 already has a child, then a branch gets created. Working with | |
492 Mercurial, branches get created frequently, and equally frequently | |
493 merged back. As befits any frequent operation, both are easy to do. | |
494 | |
495 You're probably thinking at this point that making a commit onto | |
496 an old version is a slightly strange thing to do, and you'd be right. | |
497 But that's exactly what's going to happen the moment you go | |
498 distributed. Two people working independently with their own | |
499 repositories are going to make commits based, typically, on the latest | |
500 changes they happen to have incorporated into their tree. To be | |
501 Distributed, a DVCS has to deal with this. Mercurial faces it head-on. | |
502 When you pull changes into your repo (or someone else pushes them), if | |
503 any of the changes overlap~--- are both based on the same base change~--- | |
504 you get extra heads, and it's up to you to let these extra heads live | |
505 or merge, as you please. | |
506 | |
507 In practice this is more manageable than you might think. Consider a | |
508 typical Mercurial usage, where the 'master' repo sits on a known | |
509 server, and everyone pulls changes from the master and pushes their | |
510 own efforts to the master. By default Mercurial won't let you push if | |
511 the receiving repo will gain an extra head as a result, so you | |
512 typically pull (and do any required merging) just before | |
513 pushing. Subversion users will recognise this pattern. Subversion | |
514 won't let you commit a change if your working copy is not at the very | |
515 latest revision, so the Subversion user will update, and merge if | |
516 necessary, just before committing. | |
517 | |
518 What, then, about a branch in the conventional sense of '1.0 | |
519 maintenance branch'? Typically in Mercurial you'd handle this by | |
520 keeping a separate cloned repository for those changes. Cloning is | |
521 fast and, if local, uses hard links where possible on filesystems that | |
522 support them, so isn't necessarily extravagant on disc space. You can, | |
523 if you prefer, handle them all in a single repo with 'named | |
524 branches', but cloning is definitely simpler. | |
525 | |
526 OK, so now you know the basics of using Mercurial. We can proceed to | |
527 looking at how this magic is achieved. In particular, where does this | |
528 magic globally unique identifier for a change come from? | |
529 | |
530 \subsection{Inside the Mercurial repo} | |
531 The way Mercurial handles its repo is really quite simple. | |
532 | |
533 That's simple, as in 'most things are simple once you know the | |
534 answer'. I found the explanation helpful\footnote{For the curious, | |
535 Bryan O'Sullivan's excellent Mercurial book | |
536 has a chapter on the subject, and the Mercurial website has a fair amount | |
537 of detail too.}, so this section attempts | |
538 the 10,000ft (FL100 if you prefer) view of Mercurial. | |
539 | |
540 First remember that any file or component can only have one or two | |
541 parents. You can't merge more than one other branch at once. | |
542 | |
543 We start with the basic building block, which Mercurial calls a | |
544 revlog. A revlog is a thing that holds a file and all the changes in | |
545 the file history\footnote{For any non-trivial file, this will | |
546 actually be two files on the disc, a data file and an index.}. The | |
547 revlog stores the differences between successive versions | |
548 of the file, though it will periodically store a complete version of | |
549 the file instead of a difference, so that the content of any | |
550 particular file version can always be reconstructed without excessive | |
551 effort. | |
552 | |
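The idea of differences plus an occasional complete version can be sketched in a few lines of Python. This is a toy under loud assumptions: the class and function names are made up, the difference is a crude "replace this one span" record, and a complete version is stored at a fixed interval. Mercurial's real delta encoding and its rule for when to store a complete version are different.

```python
import zlib


def make_delta(old, new):
    """Describe new as: replace old[start:end] with some bytes."""
    limit = min(len(old), len(new))
    start = 0
    while start < limit and old[start] == new[start]:
        start += 1
    tail = 0
    while tail < limit - start and old[-1 - tail] == new[-1 - tail]:
        tail += 1
    end = len(old) - tail
    return (start, end, zlib.compress(new[start:len(new) - tail]))


class ToyRevlog:
    """Stores each version as a delta against the previous one, with a
    compressed complete version every snapshot_every revisions."""

    def __init__(self, snapshot_every=4):
        self.snapshot_every = snapshot_every
        self.entries = []   # ("full", data) or ("delta", (start, end, data))

    def add(self, text):
        rev = len(self.entries)
        if rev % self.snapshot_every == 0:
            self.entries.append(("full", zlib.compress(text)))
        else:
            old = self.read(rev - 1)
            self.entries.append(("delta", make_delta(old, text)))
        return rev

    def read(self, rev):
        # Walk back to the nearest complete version, then replay deltas.
        base = rev
        while self.entries[base][0] != "full":
            base -= 1
        text = zlib.decompress(self.entries[base][1])
        for r in range(base + 1, rev + 1):
            start, end, data = self.entries[r][1]
            text = text[:start] + zlib.decompress(data) + text[end:]
        return text


log = ToyRevlog(snapshot_every=3)
v0 = b"There was a gibbon one morning\n"
v1 = v0 + b'said "I think I will fly to the moon".\n'
log.add(v0)
log.add(v1)
print(log.read(1).decode(), end="")
```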
553 Under the secret-squirrel Mercurial \texttt{.hg} directory at the top of your | |
554 project is a store which holds a revlog for each file in your | |
555 project. So you have the complete history of the project locally. No | |
556 more round trips to the server. | |
557 | |
558 Both the differences between successive versions and the periodic | |
559 complete versions of a file are compressed before storing. This is | |
560 surprisingly effective at minimising the storage requirements of this | |
561 entire history of your project. I have a small Java project handy, | |
562 comprising a little over 300 source modules. There are 5 branches plus | |
563 the mainline, and some 1920 commits in all. A Subversion checkout of | |
564 the current mainline takes 51MB. Converting the project to Mercurial | |
565 yields a Mercurial repository that takes 60MB, so a little | |
566 bigger. Remember, though, that the Mercurial repository includes not | |
567 just the working copy, but also the entire history of the project. | |
568 | |
569 Any point in the evolution of a revlog can be uniquely identified with | |
570 a nodeid. This is simply the SHA1 hash of the current file contents | |
571 concatenated with the nodeids of one or both parents of the current | |
572 revision. Note that this way, two file states are identical if and | |
573 only if the file contents are the same \emph{and} the file has the | |
574 same history. | |
575 | |
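Here is a minimal Python sketch of that calculation. Two details are assumptions on my part rather than anything stated above: that the two parent nodeids are sorted and hashed before the file contents, and that a missing parent is represented by 20 zero bytes (the all-zero parents in a revlog index suggest the latter). The function name is made up, and I make no claim that its output matches the identifiers printed in this article.

```python
import hashlib

NULL_ID = b"\0" * 20   # assumed stand-in for "no parent"


def nodeid(text, p1=NULL_ID, p2=NULL_ID):
    """SHA-1 over both parent nodeids followed by the file contents."""
    first, second = sorted([p1, p2])
    h = hashlib.sha1(first)
    h.update(second)
    h.update(text)
    return h.digest()


root = nodeid(b"There was a gibbon one morning\n")
with_history = nodeid(b"same contents\n", root)
without_history = nodeid(b"same contents\n")

# Same contents, different history, so different nodeids.
print(with_history != without_history)   # True
```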
576 Here's a dump of a revlog index: | |
577 | |
578 \begin{verbatim} | |
579 $ hg debugindex .hg/store/data/pome.txt.i | |
580 rev offset length base linkrev nodeid p1 p2 | |
581 0 0 32 0 0 6bbbd5d6cc53 000000000000 000000000000 | |
582 1 32 51 0 1 83d266583303 6bbbd5d6cc53 000000000000 | |
583 2 83 84 0 2 14a54ec34bb6 83d266583303 000000000000 | |
584 3 167 76 3 4 dc4df776b38b 83d266583303 000000000000 | |
585 $ | |
586 \end{verbatim} | |
587 | |
588 Note here that a file state can have two parents. If both the parent | |
589 nodeids are non-null, the file state has two parents, and the state is | |
590 therefore the result of a merge. | |
591 | |
592 Let's dump out a revlog at a particular revision: | |
593 | |
594 \begin{verbatim} | |
595 $ hg debugdata .hg/store/data/pome.txt.i 2 | |
596 There was a gibbon one morning | |
597 said "I think I will fly to the moon". | |
598 So with two great palms strapped to his arms, | |
599 he started his takeoff run. | |
600 $ | |
601 \end{verbatim} | |
602 | |
603 The next component is the manifest. This is simply a list of all the | |
604 files in the project, together with their current nodeids. The | |
605 manifest is a file, held in a revlog. The nodeid of the manifest, | |
606 therefore, identifies the project filesystem at a particular point. | |
607 | |
608 \begin{verbatim} | |
609 $ hg debugdata .hg/store/00manifest.i 5 | |
610 poem.txt5168b1a5e2f44aa4e0f164e170820845183f50c8 | |
611 $ | |
612 \end{verbatim} | |
613 | |
614 Finally we have the changeset. This is the atomic collection of | |
615 changes to a repository that leads to a new revision. The changeset | |
616 info includes the nodeid of the corresponding manifest, the timestamp | |
617 and committer ID, a list of changed files and a comment. The changeset | |
618 also includes the nodeid of the parent changeset, or the two parents | |
619 if the change is a merge. The changeset description is held in a | |
620 revlog, the changelog. | |
621 | |
622 \begin{verbatim} | |
623 $ hg debugdata .hg/store/00changelog.i 5 | |
624 1ccc11b6f7308cc8fa1573c2f3811a4710c91e3e | |
625 Jim Hague <jim.hague@acm.org> | |
626 1209061793 -3600 | |
627 poem.txt | |
628 pome.txt | |
629 | |
630 Merge first line branch | |
631 $ | |
632 \end{verbatim} | |
633 | |
634 The nodeid of the changeset, therefore, gives us a globally unique | |
635 identifier for any particular change. Changesets have a | |
636 Subversion-like incrementing change number, but it is peculiar to that | |
637 repository. The nodeid, however, is global. | |
638 | |
639 One more detail remains to complete the picture. How do we get back | |
640 from a particular file change to find the responsible changeset? Each | |
641 revlog change has a linkrev entry that does just this. | |
642 | |
643 So, now we have a repository with a history of the changes applied to | |
644 that repository. Each change has a unique identifier. If we find that | |
645 change in another repository, it means that at the point in the other | |
646 repository we have exactly the same state; the file contents and | |
647 history are identical. | |
648 | |
649 At this point we can see how pulling changes from another repository | |
650 works. Mercurial has to determine which changesets in the source | |
651 repository are missing in the target repository. To do this, for each | |
652 head in the source repo it has to find the most recent change in that | |
653 head that is already present in the target repo, and get any remaining | |
654 changes after that point. These changes are then copied over and | |
655 applied. | |
656 | |
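The search for missing changesets can be sketched in Python as follows. The sketch cheats by assuming both histories are sitting in memory as plain dictionaries and sets, whereas Mercurial has to work this out by talking to the other repository. The function name and data layout are hypothetical. The identifiers are the ones from the example repository, with the target holding only the first four changesets.

```python
def missing_changesets(source, target_ids):
    """source maps nodeid -> list of parent nodeids; target_ids is the
    set of nodeids already in the target. Returns the nodeids to copy,
    parents before children."""
    has_child = {p for ps in source.values() for p in ps}
    heads = [n for n in source if n not in has_child]
    missing = []
    seen = set()

    def visit(node):
        # Stop as soon as we reach history the target already has.
        if node in target_ids or node in seen:
            return
        seen.add(node)
        for parent in source[node]:
            visit(parent)
        missing.append(node)

    for head in heads:
        visit(head)
    return missing


source = {
    "33596ef855c1": [],
    "3d65e7a57890": ["33596ef855c1"],
    "ff97668b7422": ["3d65e7a57890"],
    "a065eb26e6b9": ["ff97668b7422"],
    "267d32f158b3": ["3d65e7a57890"],
    "792ab970fc80": ["267d32f158b3", "a065eb26e6b9"],
}
target = {"33596ef855c1", "3d65e7a57890", "ff97668b7422", "a065eb26e6b9"}

print(missing_changesets(source, target))
# ['267d32f158b3', '792ab970fc80']
```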
657 The Mercurial revlog format has proved remarkably durable. Since the | |
658 first release of Mercurial in April 2005, there have been a total of 5 | |
659 changes to the file format. However, of those, all but one have been | |
660 changes to the handling of file names. The most recent change, in | |
661 October 2008, and its predecessor in December 2006, were both | |
662 introduced purely to cope with Windows specific issues. The one change | |
663 that touched the data structures described above was in April 2006. The | |
664 format introduced, RevLogNG, changed only the details of index data | |
665 held, not the overall design. The chief Mercurial developer, Matt | |
666 Mackall, notes that the code in present-day Mercurial devoted to | |
667 reading the old format comprises 28 lines of Python. Compared with, | |
668 say, the early tribulations of Subversion and the switch from \texttt{bdb} to | |
669 \texttt{fsfs}, this is an impressive record. | |
670 | |
671 \section{Reflections on going distributed} | |
672 It's nearly traditional at this stage in an introduction to DVCS to | |
673 demonstrate several different workflow scenarios that you can build | |
674 with a DVCS. Which makes the important point that a DVCS can be | |
675 adapted to your workflow in a way that is at best unwieldy with a | |
676 CVCS. I intend, though, to break with tradition here. | |
677 | |
678 By this stage, I hope you can see that distributing version control | |
679 works by introducing branches where development takes place in | |
680 parallel. Mercurial treats these branches as arising naturally from | |
681 the commits made and transferred between repositories. Both Git and | |
682 Bazaar take a slightly different viewpoint, and explicitly generate a | |
683 fresh branch for work in a particular repository. But in both cases | |
684 the underlying principle of identifying changes by a globally unique | |
685 identifier and resolving parallel development by merges between | |
686 overlapping changes is the same. And all three can be used in a truly | |
687 distributed manner, with full history and the ability to commit being | |
688 available locally. | |
689 | |
690 So rather than chatter on about workflows, I want to reflect on | |
691 the consequences all this has for that all-important question of | |
692 whether a DVCS is a suitable vehicle for your data. | |
693 | |
694 The first is a minor and rather obvious point. If you want to store | |
695 files that are very large and which change often in your DVCS, then | |
696 all the compression in the world is unlikely to stop the storage | |
697 requirements for the full project history from becoming uncomfortably | |
698 large, particularly if the files are not very compressible to start | |
699 with. | |
700 | |
701 The second, and main, point is that there is an important question you | |
702 need to ask about your data. We've seen that a DVCS relies on | |
703 branching and merging to weave its magic. So take a close look at your | |
704 data, and ask: | |
705 | |
706 \standout{Will It Merge?} | |
707 | |
708 The subset of plain old text which comprises program source | |
709 code requires some human oversight, but will merge automatically | |
710 well enough for the process to be well within the bounds of the | |
711 possible. | |
712 | |
713 Unfortunately when we move further afield mergeability becomes a rarer | |
714 commodity. I nearly began the previous paragraph by stating that | |
715 plain old text will merge well enough. Then Doubt set in~--- what about | |
716 XML? Or BASE64 encoded content? | |
717 | |
718 Of course, merge doesn't necessarily have to be textual merge. I am | |
719 told that Word can be used to diff and merge two Word \texttt{.doc} files, a | |
720 data format notorious for its binary impenetrability. As long as some | |
721 suitable merge agent is available, and the DVCS can be configured to | |
722 use it for data of a particular type\footnote{Mercurial can have the | |
723 merge and diff tools specified with reference to the file extension on | |
724 which they operate~--- I assume Bazaar and Git are similar.}, then there | |
725 is no bar to successful DVCS use. | |
726 | |
727 Before this reliance on mergeability causes you to dismiss DVCS out of | |
728 hand, reflect. A CVCS can only handle non-mergeable data by acting as | |
729 a versioned file store; in other words, having as the only available | |
730 merge option the use of one or other of the merge candidates in its | |
731 entirety. Useful though a versioned file store can be, it cannot be | |
732 considered a full-featured version control system. By treating the | |
733 offending unmergeable files as external to the DVCS, or with careful | |
734 workflow~--- disabling the distributed and mergeable potentials~--- a DVCS | |
735 can deal with these files, but only at a cost of its distributedness | |
736 or its version control system-ness. In this it differs little from a | |
737 CVCS. | |
738 | |
739 So, for all data you want to version control, let your battle cry be: | |
740 | |
741 \standout{Will It Merge?} | |
742 | |
743 At this point, I have an urge to don lab coat and safety goggles and | |
744 be videoed attempting to mechanically merge data in a variety of | |
745 different formats. Frankly, this is unlikely to be as exciting as | |
746 blending iPhones\footnote{\url{http://www.willitblend.com}}, | |
747 but from a system development point of view it's rather more | |
748 important. And, I think, it gives us a large clue as to one of the | |
749 reasons for the continuing | |
750 popularity of Plain Old Text as a source code representation mechanism. | |
751 | |
752 \end{document} |