* This is work in progress *

[[PageOutline]]

= Rationale for git =

Comparison with bzr / hg, problems of svn. We started a page about git workflows [[there GitWorkflow]].

== Speed ==

 - Git: extremely fast. Almost every command is instantaneous, including merges, branch creation and cloning, which are painfully slow on svn (< 1s vs minutes for svn). Getting the diff or log between two branches is also instantaneous. Blame is comparatively slower, but still much faster than svn. Network-wise, the git protocol is OK; the http one (necessary behind most proxies) is slow.
 - Hg: operations are fast enough, but every command pays the Python startup time. Cloning, branching, log and merging are fast.
 - Bzr: status and diff against the last commit are fast enough, even for big working trees. History-related commands are slow. Network operations are very inefficient (up to bzr 1.12 at least).

== Branches ==

One big difference between git, bzr and hg is the visibility of a branch. Bzr and hg focus on one branch per repository (although hg has named branches to keep multiple branches in one repository), whereas git focuses on many branches in one repository. This has significant consequences on the workflow.

Pros of one branch per repository:
 - it is easier to deal with external tools (meld, diff, etc.), as every branch has its own working tree.
 - it is easier to know which branch you are working on.

It has significant drawbacks, though:
 - comparing branches is difficult, as they are in different repositories. For example, if one branch is the mainline (trunk) and another is a release branch, comparing or diffing them from the tool is awkward at best. git log/diff branch1..branch2 is extremely useful to follow what's happening in a branch, for review and so on.
 - switching between branches in different repositories is inconvenient.

Evaluation:
 - Git's branches can be easily used as topic branches, and also as permanent ones.
   Switching between branches is very fast within the same working copy.
 - Git's collaboration features for tracking what's going on in remote branches appear to be the best of the three.
 - Bzr's repository branches are clumsier than Git's for use as topic branches: they require cd'ing to a different working copy. Long URLs are difficult to remember, and there is no way to list available branches from the command line.
 - Hg's repository branches: similar to Bzr.
 - Hg's named branches: use as throwaway topic branches is questionable, since they cannot be deleted.
 - Hg's plugins that provide git-like branches: "non-standard" solutions. At least the localbranch one seemed to suffer from the 80%-20% problem. Also, it was not suitable for collaboration, as IIRC there was no way to publish the local branches separately.

Another consequence of those differences is that Bzr uses "simple" numbers for versioning (HG also uses simple numbers for convenience in the UI, alongside SHA hashes similar to git's). This breaks horribly when using branches: there is no simple ordering in a DAG anymore, and something like a topological order can only be computed for a known structure, so the number -> internal id mapping changes (the same number can refer to different things on different branches). This is extremely confusing in advanced scenarios. Note that this happens in svn too: when you have several branches, svn revisions 'jump' for each branch, and they can't be used in any meaningful way. Git simply gives up the usual "simple revision" concept, and uses its internal revision identifier by default.
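The numbering problem can be made concrete with a small Python sketch. This is a toy model under stated assumptions, not how bzr, hg or git are implemented: `commit_id` and `local_numbers` are hypothetical helpers, the id is just a hash of the message and parent ids, and "simple" numbers are assigned in the order in which a given clone received the commits.

```python
import hashlib


def commit_id(message, parents):
    """Toy content-derived id: sha1 of the parent ids and the message."""
    payload = "\n".join(list(parents) + [message]).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def local_numbers(commits_in_arrival_order):
    """Assign 'simple' revision numbers in the order commits arrived locally."""
    return {cid: n for n, cid in enumerate(commits_in_arrival_order)}


# A small DAG: two commits on top of the same root, then a merge of both.
root = commit_id("initial import", [])
a = commit_id("fix on trunk", [root])
b = commit_id("feature on branch", [root])
merge = commit_id("merge branch into trunk", [a, b])

# Two clones hold the same commits, but received them in a different order.
alice = local_numbers([root, a, b, merge])
bob = local_numbers([root, b, a, merge])

# The same number names a different commit in each clone...
alice_rev1 = [cid for cid, n in alice.items() if n == 1][0]
bob_rev1 = [cid for cid, n in bob.items() if n == 1][0]
print(alice_rev1 == bob_rev1)  # False

# ...while the hash-based ids are the same everywhere.
print(set(alice) == set(bob))  # True
```

In this model, "revision 1" is only meaningful relative to one clone, whereas a hash can be quoted in a mail or a ticket and resolves to the same commit for everybody.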
In practice, the lack of simple revision numbers is not much of a problem, because git can refer to a specific change in many different ways: the internal identifier (sha1 checksum), parent ordering, time, etc.:
{{{
# Diff between the working tree and the last commit on the current branch
git diff HEAD
# Difference between the second-to-last commit and the last commit
git diff HEAD^..HEAD
# Last 4 commits in the current branch (HEAD~4..HEAD is equivalent)
git log HEAD^^^^..HEAD
# Commits from the last 7 days
git log --since="7 days ago"
}}}
To talk about a given change, one can simply tag it with a simple name, use branches for review, etc. In my (David) experience, the weird internal revision number feeling in git disappears quickly. However, in my experience (Pauli) one ends up pasting the SHA hashes, especially when rebasing interactively against older revisions, and the "convenience" numbers on the revisions could be useful.

Strangely enough, HG does not appear to have a syntax similar to HEAD^^ for referring quickly to changesets preceding the current head. XXX: is this true?

== Merging ==

Merging in Git/Hg:
 - Fetch the remote branch to local storage
 - Merge the new branch/head into the local branch

Merging in Bzr:
 - Merge directly against the remote.

In both HG and Git, the changesets to be merged are locally accessible in the normal way to commands such as "diff" and "revert", and it is easy to examine the changesets to be merged. In Bzr, you apparently need to use the slightly more awkward branch:URL syntax to refer to the remote branch, which also requires over-the-network operations.

XXX: Is there some other way in Bzr to refer to the changesets to be merged? It appears that they *are* temporarily stored locally, but I couldn't figure out the proper syntax.

== Publishing branches ==

Git, Bzr and HG support "dumb" read-only repositories served over plain HTTP. (Git requires running git update-server-info after every push, though.) Also, for all three, there are free services (github, gitorious, launchpad, bitbucket, freehg) for publishing branches conveniently.
== History rewriting ==

Attitude to history rewriting is perhaps the most controversial difference between Git and Bzr/HG. Git has the "rebase" command built in, and it is fairly easy to use. In Bzr and HG, similar features are available as extensions, though.

XXX: I (Pauli) haven't tested the HG and Bzr rebase functionality carefully. Does it work well?

== Explicit meta-data vs guessing ==

Another strong internal difference between git and bzr is how they deal with meta-data: file moves, tracking directories, etc. Git does not track any of this (in particular, git cannot track empty directories). This has consequences on rename/merge interaction (e.g. assuming one file foo.c is renamed to foobar.c in trunk, but to barfoo.c in branch1, what happens when branch1 is merged into trunk?).

Advantages of git's approach:
 - if you independently create two repositories with exactly the same content, you end up with exactly the same repositories (you can merge, etc.). For example, when you convert a repository with several branches from svn to git, git automatically detects the common commits between branches.
 - making a change from a patch or from git itself results in exactly the same commit (assuming the changes are the same, of course)

The bzr/hg advantages:
 - renames sometimes behave more like people expect during merges.
 - git's automatic detection of renames is based on heuristics, which sometimes fail.

The Mercurial repository format also has a disadvantage when it comes to moving big files: when you move a file, the data currently in it is duplicated in the repository. So if you have an X MB file in the repository, after renaming it, your repository size has increased by X MB. (Bzr and Git do not have this problem.)

Git's use of heuristics is a bit controversial: some people say that git behaves better for renames because of this, others say the contrary.

One potentially very useful feature is the automatic detection of code moves within and between files.
For example, here is the output of git blame for one file of the talkbox git repo:
{{{
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000  1) descr = """Talkbox, to make your numpy environment speech aware !
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000  2)
052ed182 setup.py  (David Cournapeau 2008-09-15 04:46:02 +0000  3) Talkbox is set of python modules for speech/signal processing. The goal of this
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000  4) toolbox is to be a sandbox for features which may end up in scipy at some
052ed182 setup.py  (David Cournapeau 2008-09-15 04:46:02 +0000  5) point. The following features are planned before a 1.0 release:
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000  6)
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000  7)     * Spectrum estimation related functions: both parametic (lpc, high
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000  8)       resolution methods like music and co), and non-parametric (Welch,
052ed182 setup.py  (David Cournapeau 2008-09-15 04:46:02 +0000  9)       periodogram)
052ed182 setup.py  (David Cournapeau 2008-09-15 04:46:02 +0000 10)     * Fourier-like transforms (DCT, DST, MDCT, etc...)
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 11)     * Basic signal processing tasks such as resampling
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 12)     * Speech related functionalities: mfcc, mel spectrum, etc..
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 13)     * More as it comes
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 14)
052ed182 setup.py  (David Cournapeau 2008-09-15 04:46:02 +0000 15) I want talkbox to be useful for both research and educational purpose. As such,
052ed182 setup.py  (David Cournapeau 2008-09-15 04:46:02 +0000 16) a requirement is to have a pure python implementation for everything - for
da6cc0da setup.py  (David Cournapeau 2009-03-26 17:59:20 +0900 17) educational purpose and reproducibility - and optional C for speed."""
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 18)
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 19) DISTNAME = 'scikits.talkbox'
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 20) DESCRIPTION = 'Talkbox, a set of python modules for speech/signal processing'
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 21) LONG_DESCRIPTION = descr
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 22) MAINTAINER = 'David Cournapeau',
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 23) MAINTAINER_EMAIL = 'david@ar.media.kyoto-u.ac.jp',
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 24) URL = 'http://projects.scipy.org/scipy/scikits'
9ebbba27 setup.py  (David Cournapeau 2008-09-07 18:24:14 +0000 25) LICENSE = 'MIT'
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 26) DOWNLOAD_URL = URL
60f32e8a setup.py  (David Cournapeau 2009-03-26 18:00:33 +0900 27)
60f32e8a setup.py  (David Cournapeau 2009-03-26 18:00:33 +0900 28) MAJOR = 0
60f32e8a setup.py  (David Cournapeau 2009-03-26 18:00:33 +0900 29) MINOR = 2
60f32e8a setup.py  (David Cournapeau 2009-03-26 18:00:33 +0900 30) MICRO = 2
7910ac4c common.py (David Cournapeau 2009-03-26 18:24:29 +0900 31) DEV = False
60f32e8a setup.py  (David Cournapeau 2009-03-26 18:00:33 +0900 32)
60f32e8a setup.py  (David Cournapeau 2009-03-26 18:00:33 +0900 33) CLASSIFIERS = [ 'Development Status :: 1 - Planning',
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 34)                 'Environment :: Console',
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 35)                 'Intended Audience :: Developers',
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 36)                 'Intended Audience :: Science/Research',
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 37)                 'License :: OSI Approved :: BSD License',
60f32e8a setup.py  (David Cournapeau 2009-03-26 18:00:33 +0900 38)                 'Topic :: Scientific/Engineering']
^6f8dca2 setup.py  (David Cournapeau 2008-09-03 10:33:42 +0000 39)
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 40) def build_verstring():
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 41)     return '%d.%d.%d' % (MAJOR, MINOR, MICRO)
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 42)
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 43) def build_fverstring():
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 44)     if DEV:
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 45)         return build_verstring() + '.dev'
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 46)     else:
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 47)         return build_verstring()
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 48)
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 49) def write_version(fname):
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 50)     f = open(fname, "w")
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 51)     f.write("""
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 52) short_version='%s'
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 53) version=short_version
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 54) dev=%s
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 55) if dev:
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 56)     version += '.dev'
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 57) """ % (build_verstring(), DEV))
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 58)     f.close()
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 59)
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 60) VERSION = build_fverstring()
dc91bfef common.py (David Cournapeau 2009-03-26 18:02:24 +0900 61) INSTALL_REQUIRE = 'numpy'
}}}
As you can see, blame not only outputs the commit for each line, but also the original file. Note that the committer never told git about the code move: git automatically saw that this file shared some portions of code with other files at a previous revision.

= Some scenarios where git helps =

== Release scenario ==

In a release context, a few things are useful:
 1. comparing the dev branch and the release branches
 2. being able to tag some points in the history
 3. cherry-picking some bug fixes from the dev branch into release branches

Every point above is painful in svn, and trivial in git. Assuming 'master' is the dev branch, and 'release' is the release branch, point 1 can be done by:
 - git log master..release -> gives the log specific to the release branch
 - git diff master..release -> gives the changes
 - git diff --stat master..release -> gives a summary of changes (number of changed lines per file)

In svn, any of these operations is so slow and awkward that nobody does it. Points 2 and 3 are equally easy (tagging is one command, and cherry-picking is done through git cherry-pick).

== Keeping a "patch" up-to-date ==

XXX: the rebase feature allows one to keep a patch up-to-date easily, and also polish it. (Though it has the danger that you lose history...)

== Significant changes ==

XXX

== Review ==

XXX

XXX: Linux-like workflow, use of Signed-off-by

= How to do the migration =

The migration from the svn repository to a git repository should keep as much information from svn as possible: history, tags and branches.

== Tool for the migration ==

The general tool for migration is fast-import, which defines a format to import/export data from/to other source code control systems.
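To give an idea of what such a stream looks like, here is a minimal Python sketch that builds a fast-import stream with one file and one commit. The helper names (`data_block`, `blob`, `commit`), the committer identity and the timestamp are made up for illustration. Real exporters emit much more (authors, parent and merge links, deletions, many commits), so this only shows the basic shape of the format.

```python
def data_block(payload):
    """A 'data' command: byte count, raw payload, then an optional newline."""
    return b"data %d\n" % len(payload) + payload + b"\n"


def blob(mark, content):
    """Declare file content and give it a mark so commits can refer to it."""
    return b"blob\nmark :%d\n" % mark + data_block(content)


def commit(ref, mark, committer, when, message, files):
    """One commit on `ref`; `files` is a list of (path, blob mark) pairs."""
    parts = [
        b"commit %s\n" % ref.encode("utf-8"),
        b"mark :%d\n" % mark,
        # Timestamp is seconds since the epoch followed by a timezone offset.
        b"committer %s %d +0000\n" % (committer.encode("utf-8"), when),
        data_block(message.encode("utf-8")),
    ]
    for path, blob_mark in files:
        # 'M' adds or modifies a file; 100644 is a regular, non-executable file.
        parts.append(b"M 100644 :%d %s\n" % (blob_mark, path.encode("utf-8")))
    parts.append(b"\n")
    return b"".join(parts)


# Placeholder identity and date, purely for illustration.
stream = blob(1, b"hello\n") + commit(
    "refs/heads/master",
    2,
    "A U Thor <author@example.com>",
    1230768000,
    "initial import",
    [("README", 1)],
)
print(stream.decode("utf-8"))
```

Such a stream is meant to be fed to git fast-import on its standard input, from inside a freshly initialised repository. An svn exporter essentially produces the same kind of stream from the svn history.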
For svn, at least two exporters exist:
 * svn-fast-export
 * svn-all-fast-export: see http://repo.or.cz/w/svn-all-fast-export.git

svn-all-fast-export is an exporter written by KDE people to handle the KDE migration; thus, it can certainly handle numpy and scipy size-wise. It can skip some branches, or paths outside the usual trunk/branches/tags (f2py-research, for example), and export svn "tags" as real tags. The latter may be dangerous, as svn does not have a tag concept: tags are exactly like branches (new commits can be made to them), so this should be investigated carefully.

=== Usage ===

svn-all-fast-export is a C++ program which is very fast (it can convert the whole numpy repository in a few minutes), and is driven by a configuration file. For numpy, the following seems to work: it ignores branches outside the /branches namespace ('bad' branches, f2py), and converts the tags.
{{{
create repository myproject
end repository

match /trunk/
  repository myproject
  branch master
end match

# Ignore extra 'repositories' which are not numpy code, but were in numpy
# repository.
match /f2py-research/
end match

match /vendor/
end match

match /numpy.sunperf/
end match

match /cleaned_math_config/
end match

match /numpy-docs/
end match

# Take usual svn branches
match /branches/([^/]+)/
  repository myproject
  branch \1
end match

# This rule will create tags that don't exist in any of the
# branches. It's not what you want.
# See the merged-branches-tags.rules file
match /tags/([^/]+)/
  repository myproject
  branch refs/tags/\1
end match
}}}
After this initial conversion, git filter-branch can be useful (to filter svn-all-fast-export messages, canonicalize emails/committers, etc.).
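One way to read the rules above is as an ordered list of path patterns, each mapping an svn path either to a (repository, branch) pair or to nothing. The Python sketch below mimics that reading. It is not the tool's implementation: the `route` function and the `RULES` table are hypothetical, only some of the ignore rules are repeated, the sample paths are invented, and it assumes that the first matching rule wins.

```python
import re

# (pattern, repository, branch template); None means "ignore this path".
RULES = [
    (r"/trunk/", "myproject", "master"),
    (r"/f2py-research/", None, None),
    (r"/vendor/", None, None),
    (r"/branches/([^/]+)/", "myproject", r"\1"),
    (r"/tags/([^/]+)/", "myproject", r"refs/tags/\1"),
]


def route(svn_path):
    """Return (repository, branch, path inside the branch), or None if skipped."""
    for pattern, repository, branch in RULES:
        m = re.match(pattern, svn_path)
        if m is None:
            continue
        if repository is None:
            # An empty match block: the path is dropped from the conversion.
            return None
        # Substitute \1 in the branch template with the captured directory name.
        return repository, m.expand(branch), svn_path[m.end():]
    return None


for path in [
    "/trunk/numpy/core/setup.py",
    "/branches/1.3.x/setup.py",
    "/tags/1.2.0/setup.py",
    "/vendor/foo.c",
]:
    print(path, "->", route(path))
```

Under these assumptions, everything under /branches/NAME/ lands in a git branch called NAME, and everything under /tags/NAME/ lands in refs/tags/NAME, which is why svn tags that received extra commits need a careful look.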