comparison Hg.txt @ 5:2ec53c0ed5d8

Musings on Merging and Mergeability.
author Jim Hague <jim.hague@icc-atcsolutions.com>
date Fri, 06 Mar 2009 14:07:34 +0000
parents 561edf852797
children a942bf7bc2ab
4:561edf852797 5:2ec53c0ed5d8
493 493
494 We start with the basic building block, which Mercurial calls a 494 We start with the basic building block, which Mercurial calls a
495 revlog. A revlog is a thing that holds a file and all the changes in 495 revlog. A revlog is a thing that holds a file and all the changes in
496 the file history. (Footnote: For any non-trivial file, this will 496 the file history. (Footnote: For any non-trivial file, this will
497 actually be two files on the disc, a data file and an index). The 497 actually be two files on the disc, a data file and an index). The
498 revlog stores the (compressed) differences between successive versions 498 revlog stores the differences between successive versions
499 of the file, though it will periodically store a complete version of 499 of the file, though it will periodically store a complete version of
500 the file instead of a difference, so that the content of any 500 the file instead of a difference, so that the content of any
501 particular file version can always be reconstructed without excessive 501 particular file version can always be reconstructed without excessive
502 effort. 502 effort.
503 503
504 Under the secret-squirrel Mercurial .hg directory at the top of your 504 Under the secret-squirrel Mercurial .hg directory at the top of your
505 project is a store which holds a revlog for each file in your project. 505 project is a store which holds a revlog for each file in your
506 project. So you have the complete history of the project locally. No
507 more round trips to the server.
508
509 Both the differences between successive versions and the periodic
510 complete versions of a file are compressed before storing. This is
511 surprisingly effective at minimising the storage requirements of this
512 entire history of your project. <!!!Comparison of .svn space
513 requirements for Waldo>.
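
To make the idea concrete, here is a minimal sketch of a revlog-like store. It is an illustration only, not Mercurial's code or its on-disc format: the names ToyRevlog, make_delta, apply_delta and SNAPSHOT_EVERY are invented for the purpose, the fixed snapshot interval is an assumption, and the line-based deltas come from Python's difflib rather than whatever Mercurial does internally. It does show the principle described above: compressed differences, an occasional compressed full copy, and reconstruction by replaying differences from the nearest full copy.

```python
import difflib
import json
import zlib

SNAPSHOT_EVERY = 4  # assumed interval for this sketch, not Mercurial's real policy


def make_delta(old, new):
    """Describe `new` as a list of (start, end, replacement_lines) edits to `old`."""
    old_lines, new_lines = old.splitlines(True), new.splitlines(True)
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    return [
        (i1, i2, new_lines[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_delta(old, delta):
    """Rebuild the newer text from the older text and a delta."""
    lines = old.splitlines(True)
    # Apply edits back to front so earlier line indices stay valid.
    for start, end, replacement in reversed(delta):
        lines[start:end] = replacement
    return "".join(lines)


class ToyRevlog:
    """History of one file: compressed deltas plus periodic compressed full copies."""

    def __init__(self):
        self.entries = []  # one (kind, compressed_bytes) pair per revision

    def add(self, text):
        rev = len(self.entries)
        if rev % SNAPSHOT_EVERY == 0:
            kind, payload = "full", text
        else:
            kind, payload = "delta", make_delta(self.get(rev - 1), text)
        # Both full copies and deltas are compressed before storing.
        blob = zlib.compress(json.dumps(payload).encode("utf-8"))
        self.entries.append((kind, blob))
        return rev

    def _load(self, rev):
        return json.loads(zlib.decompress(self.entries[rev][1]).decode("utf-8"))

    def get(self, rev):
        # Walk back to the nearest full copy, then replay deltas forward.
        base = rev
        while self.entries[base][0] != "full":
            base -= 1
        text = self._load(base)
        for r in range(base + 1, rev + 1):
            text = apply_delta(text, self._load(r))
        return text


log = ToyRevlog()
for text in ["one\ntwo\n", "one\n2\n", "one\n2\nthree\n"]:
    log.add(text)
assert log.get(1) == "one\n2\n"
```

Because a full copy is never more than SNAPSHOT_EVERY - 1 deltas away, the cost of reconstructing any version stays bounded, which is the "without excessive effort" property mentioned earlier.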
506 514
507 Any point in the evolution of a revlog can be uniquely identified with 515 Any point in the evolution of a revlog can be uniquely identified with
508 a nodeid. This is simply the SHA1 hash of the current file contents 516 a nodeid. This is simply the SHA1 hash of the current file contents
509 concatenated with the nodeids of one or both parents of the current 517 concatenated with the nodeids of one or both parents of the current
510 revision. Note that this way, two file states are identical if and 518 revision. Note that this way, two file states are identical if and
595 held, not the overall design. The chief Mercurial developer, Matt 603 held, not the overall design. The chief Mercurial developer, Matt
596 Mackall, notes that the code in present-day Mercurial devoted to 604 Mackall, notes that the code in present-day Mercurial devoted to
597 reading the old format comprises 28 lines of Python. Compared with, 605 reading the old format comprises 28 lines of Python. Compared with,
598 say, the early tribulations of Subversion and the switch from bdb to 606 say, the early tribulations of Subversion and the switch from bdb to
599 fsfs, this is an impressive record. 607 fsfs, this is an impressive record.
608
609 Reflections on going distributed
610 --------------------------------
611
612 It's nearly traditional at this stage in an introduction to DVCS to
613 demonstrate several different workflow scenarios that you can build
614 with a DVCS. Which makes the important point that a DVCS can be
615 adapted to your workflow in a way that is at best unwieldy with a
616 CVCS. I intend, though, to break with tradition here.
617
618 By this stage, I hope you can see that distributing version control
619 works by introducing branches where development takes place in
620 parallel. Mercurial treats these branches as arising naturally from
621 the commits made and transferred between repositories. Both Git and
622 Bazaar take a slightly different viewpoint, and explicitly generate a
623 fresh branch for work in a particular repository. But in both cases
624 the underlying principle of identifying changes by a globally unique
625 identifier and resolving parallel development by merges between
626 overlapping changes is the same. And all three can be used in a truly
627 distributed manner, with full history and the ability to commit being
628 available locally.
629
630 I want now to reflect on the consequences all this has for that
631 all-important question of whether a DVCS is a suitable vehicle for
632 your data.
633
634 The first is a minor and rather obvious point. If you want to store
635 files that are both very large and which change often in your DVCS,
636 then all the compression in the world is unlikely to stop the storage
637 requirements for the full project history from becoming
638 uncomfortably large.
639
640 The second, and main, point is that there is an important question you
641 need to ask about your data. We've seen that a DVCS relies on
642 branching and merging to weave its magic. So take a close look at your
643 data, and ask:
644
645 Will It Merge?
646
647 The subset of plain old text which comprises program source
648 code requires some human oversight, but will merge automatically
649 well enough for the process to be well within the bounds of the
650 possible.
651
652 Unfortunately when we move further afield mergeability becomes a rarer
653 commodity. I nearly began the previous paragraph by stating that
654 plain old text will merge well enough. Then Doubt set in - what about
655 XML? Or BASE64 encoded content?
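
The Doubt can be illustrated with a rough sketch. The check below is deliberately crude and is not how Mercurial, or any real merge tool, decides anything: it simply asks whether two sets of changes touch disjoint lines of a common ancestor. The names changed_lines, will_it_merge and as_base64_lines, and the sample data, are all invented for the illustration.

```python
import base64
import difflib


def changed_lines(base, other):
    """Return the indices of lines in `base` that `other` alters or inserts before."""
    touched = set()
    matcher = difflib.SequenceMatcher(None, base, other, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag != "equal":
            # A pure insertion has i1 == i2; count it against the line it precedes.
            touched.update(range(i1, max(i2, i1 + 1)))
    return touched


def will_it_merge(base, ours, theirs):
    """Crude test: a clean merge is plausible if the two sides touch disjoint lines."""
    return changed_lines(base, ours).isdisjoint(changed_lines(base, theirs))


def as_base64_lines(lines):
    """Encode the whole text as BASE64, wrapped into lines as base64.encodebytes does."""
    encoded = base64.encodebytes("".join(lines).encode("utf-8"))
    return encoded.decode("ascii").splitlines(True)


# A small common ancestor and two independent edits, one at each end of the file.
base = [f"def step_{n}(x): return x + {n}\n" for n in range(8)]
ours = ["def step_0(x): return x + 100\n"] + base[1:]
theirs = base[:-1] + ["def step_7(x): return x - 7\n"]

plain_ok = will_it_merge(base, ours, theirs)
encoded_ok = will_it_merge(
    as_base64_lines(base), as_base64_lines(ours), as_base64_lines(theirs)
)
print("plain text:", plain_ok)
print("BASE64:    ", encoded_ok)
```

On this particular data the plain text passes and the BASE64 form does not: the edit at the top changes the length of the content, which shifts every encoded line after it, so it collides with the edit at the bottom. The same two changes, harmless as text, become a conflict once the line structure no longer follows the structure of the content.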
656
657 Of course, merge doesn't necessarily have to be textual merge. I am
658 told that Word can be used to diff and merge two Word .doc files, a
659 data format notorious for its binary impenetrability. As long as some
660 suitable merge agent is available, and the DVCS can be configured to
661 use it for data of a particular type (Footnote: Mercurial can have the
662 merge and diff tools specified with reference to the file extension on
663 which they operate - I assume Bazaar and Git are similar.), then there
664 is no bar to successful DVCS use.
665
666 Before this reliance on mergeability causes you to dismiss DVCS out of
667 hand, reflect. A CVCS can only handle non-mergeable data by acting as
668 a versioned file store; in other words, having as the only available
669 merge option the use of one or other of the merge candidates in its
670 entirety. Useful though a versioned file store can be, it cannot be
671 considered a full-featured version control system. By treating the
672 offending unmergeable files as external to the DVCS, or with careful
673 workflow - disabling the distributed and mergeable potentials - a DVCS
674 can deal with these files, but only at a cost of its distributedness
675 or its version control system-ness. In this it differs little from a
676 CVCS.
677
678 So, for all data you want to version control, let your battle cry be
679
680 Will It Merge?
681
682 At this point, I have an urge to don lab coat and safety goggles and
683 be videoed attempting to mechanically merge data in a variety of
684 different formats. Frankly, this is unlikely to be as exciting as
685 blending iPhones (Ref: www.willitblend.com), but from a system
686 development point of view it's rather more important. And, I think,
687 it gives us a large clue as to one of the reasons for the continuing
688 popularity of Plain Old Text as a source code representation mechanism.