Sample chado repo git conversion ready for review

Karl O. Pinc
Hi,

I've completed a sample svn to git conversion
of the chado repo.  Results are available here,
with instructions for repeating the process below:

https://github.com/kpinc/chado


As you folks predicted, the svn2git program
failed to work, for unknown reasons.  Instead
I based the conversion off of:

http://john.albin.net/git/convert-subversion-to-git

I more or less followed the above as a recipe,
not being steeped enough in git to fully
understand all the dark corners.  So this is
going to need some review.

I wound up using git 1.9.0, which I had installed
because svn2git wouldn't work with 1.7.1,
the git that comes with RHEL 6.4.  I believe
that chado2git will work fine with git 1.7.1,
but see below.


The converted chado repo was created using chado2git.  
The chado2git program is appended to this email
and there's a github repo:

  https://github.com/kpinc/chado2git


chado2git expects a file called
authors-transform.txt
in the pwd.  The latest authors-transform.txt
can be found here (although it probably
does not have Hilmar's latest revisions,
it works):

  https://gist.github.com/kpinc/9139445
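
For reference, git svn expects each line of the authors file to read
"svnuser = Full Name <email>".  A minimal sketch of building a starting
template from "svn log -q" output, in the spirit of the recipe linked
above.  The function name authors_template and the placeholder
"user <user>" form are assumptions for illustration, not part of
chado2git:

```shell
#!/bin/sh
# Hypothetical helper: read `svn log -q` output on stdin and print one
# "user = user <user>" template line per distinct svn author.
authors_template() {
  awk -F '|' '/^r[0-9]+ \|/ {
      gsub(/^ +| +$/, "", $2)          # trim the spaces around the author
      print $2 " = " $2 " <" $2 ">"
  }' | sort -u
}
```

Something like the following would then produce a file to edit by hand:

  svn log -q https://svn.code.sf.net/p/gmod/svn/schema \
    | authors_template > authors-transform.txt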

The output of chado2git is 2 new directories
in your pwd, chado/ and chado.git/.  
chado.git/ is the desired result,
a bare git repo.  However chado/ is useful,
see below.

Note that chado2git omits from the repo
various .bz2 db backups in:
modules/sequence/apollo-bridge/sample_db/.*\.bz2$
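
To preview which paths a pattern like this would drop, candidate paths
can be run through a regex filter.  A rough sketch, assuming grep -E is
a close enough stand-in for the Perl regex that git svn applies (it
should be for this simple pattern, but not in general).  The name
would_ignore is made up:

```shell
#!/bin/sh
# Hypothetical helper: print the stdin paths that the chado2git
# --ignore-paths pattern would match.
IGNORE_RE='modules/sequence/apollo-bridge/sample_db/.*\.bz2$'
would_ignore() {
  grep -E "$IGNORE_RE" || true   # grep exits 1 on no match; not an error here
}
```

Feeding it a list of paths, one per line, prints only those that
would be left out of the converted repo.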


The next step is to put the bare chado.git
repo on github.  I did this with:

  cd chado.git
  git push --mirror https://kpinc@.../kpinc/chado.git

But there's an issue here.  Git 1.7.1 won't
do the push.  It may be that this is because
of the @ in some of the branch names and tags,
but that's an uninformed guess.
It pushes fine with git 1.9.0.  I have no idea
what this means for cloning and using the repo.
I can tell you that 1.7.1 will clone the github
repo I created and that I can do a
  git branch --track foo remotes/origin/foo
but I haven't otherwise tried to break it.
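
Since the @ in ref names is only a guess at the cause, it may help to
see exactly which refs carry one.  A small sketch; refs_with_at is a
made-up name, and it only lists refs, it does not fix anything:

```shell
#!/bin/sh
# Hypothetical helper: list every ref in the repo at $1 whose name
# contains an "@".
refs_with_at() {
  (cd "$1" && git for-each-ref --format='%(refname)') | grep '@' || true
}
```

Run against chado.git it should print the affected branches and
tags, one per line, or nothing if there are none.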


I am sure there's a lot more work that could be done
to clean cruft out of the git repo.  Now would
be the time.  To that end I've another program,
bigfiles, appended and in a github repo:

  https://github.com/kpinc/bigfiles

What bigfiles does is pull large files out of
all the various revisions of the svn repo.
Their paths could then be added to the --ignore-paths
pattern in chado2git.  Invoke with:

# Syntax: bigfiles dir size
#
# Input:
#   dir   Directory imported directly by 'git svn clone', not the bare dir.
#   size  Find files at least this size (a find(1) -size argument)

Like so:

  cp -a chado chado-copy
  ./bigfiles chado-copy 256k

WARNING: The given directory content is destroyed.

Do not run bigfiles on the chado.git/ directory, run it on
a copy of the chado/ directory -- the raw output of
"git svn clone", without branches put in place.

The result of bigfiles is a directory in the pwd
called "sizes".  It contains a sub-directory for
each git commit, numbered starting at 0 with the
most recent commit and increasing until the first
commit is reached.  Each numbered sub-directory
contains pathnames of files that are at least
as large as the given size.  256k in the above
example.  
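
Given that layout, a quick way to see how many large files each commit
carries is to count the entries per numbered sub-directory.  A sketch
that assumes only the sizes/N/... layout described above;
count_per_commit is a made-up name:

```shell
#!/bin/sh
# Hypothetical helper: for the "sizes" directory at $1, print
# "<commit-number> <file-count>" per numbered sub-directory, in order.
count_per_commit() {
  for d in "$1"/*/; do
    [ -d "$d" ] || continue            # no sub-directories at all
    n=$(basename "$d")
    printf '%s %s\n' "$n" "$(( $(find "$d" -type f | wc -l) ))"
  done | sort -n
}
```

For example: count_per_commit sizes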

The files in the numbered directories
are hard linked to avoid
sucking up large amounts of space.
Copies of all the files in all the
revisions >=256k (the above example)
take about 11G.  (But this does not
include the db dumps because they
were already excluded.)

To find all files > 5M:

find sizes -size +5M | cut -d / -f 3- | sort -u

(You gotta wonder whether chado/soi/t/data/AE003790.soi.xml
should be in the repo.  I don't feel
qualified to make these choices.)

To see (sorta) how much space is taken
by "large files" per commit:

ls sizes \
  | sort -n \
  | xargs -I file du -s sizes/file \
  | sort -nrk 1,1


An uneducated glance suggests
that there's obvious cruft.  It can't be that
hard to find some more "big offenders" and
add them to the chado2git ignore pattern.
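
Turning a list of offenders into an ignore pattern is mostly a matter
of escaping and joining.  A sketch, assuming the list is one path per
line and that a single alternation anchored at the end, like the
existing pattern, is acceptable to --ignore-paths.  The name
paths_to_ignore_re is made up, and the result should be eyeballed
before use:

```shell
#!/bin/sh
# Hypothetical helper: read paths on stdin, escape regex metacharacters,
# and join them into one "(a|b|c)$" alternation.
paths_to_ignore_re() {
  sed 's/[][\.^$*+?(){}|]/\\&/g' | paste -s -d '|' - | sed 's/^/(/; s/$/)$/'
}
```

It could be fed from the bigfiles output, for instance:

  find sizes -size +5M | cut -d / -f 3- | sort -u | paths_to_ignore_re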


Please let me know what's next, besides
the obvious step of getting the authors right.

If somebody wants to put an archive of the
mailing list on google drive or somewhere
where I can get to it I'd be willing to take
a quick stab at further work on the author list.


Regards,

Karl <[hidden email]>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein

-------------<snip>----------------
#!/bin/sh
# This program is chado2git
# Copyright (C) 2014 The Meme Factory, Inc.  http://www.meme.com/
#
#    chado2git is free software; you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation; either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
# Karl O. Pinc <[hidden email]>

git svn clone \
  https://svn.code.sf.net/p/gmod/svn/schema \
  --no-metadata \
  --authors-file=authors-transform.txt \
  --stdlayout \
  --prefix='' \
  --ignore-paths='modules/sequence/apollo-bridge/sample_db/.*\.bz2$' \
  chado

# Near as I can tell there are no svn:ignore properties.

git init --bare chado.git
cd chado.git
git symbolic-ref HEAD refs/heads/trunk
cd ..

# push the temp repository to the new bare repository.
cd chado
git remote add bare ../chado.git
git config remote.bare.push 'refs/remotes/*:refs/heads/*'
git push bare
cd ..

# Rename “trunk” branch to “master”

cd chado.git
git branch -m trunk master

# Clean up branches and tags

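# Turn each refs/heads/tags/NAME branch into a real tag NAME.
# "-f 4-" keeps tag names that themselves contain a slash intact.
git for-each-ref --format='%(refname)' refs/heads/tags |
cut -d / -f 4- |
while read -r ref
do
  git tag "$ref" "refs/heads/tags/$ref"
  git branch -D "tags/$ref"
done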
-------------<snip>----------------



-------------<snip>----------------
#!/bin/sh
# This program is bigfiles
# Copyright (C) 2014 The Meme Factory, Inc.  http://www.meme.com/
#
#    bigfiles is free software; you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation; either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
# Karl O. Pinc <[hidden email]>
#
#Script to search for big files.
#
# Syntax: bigfiles dir size
#
# Input:
#   dir   Directory imported directly by 'git svn clone', not the bare dir.
#   size  Find files at least this size (a find(1) -size argument)
#
# Run it on the chado directory, the one created
# directly by 'git svn clone', not the chado.git directory.

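SIZELIMIT=$2

# Stop if the directory is missing; the loop below is destructive.
cd "$1" || exit 1

c=0
r=0
# Plain [ ] rather than [[ ]], since the script runs under /bin/sh.
while [ "$r" = 0 ] ; do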
  find . -path ./.git \
         -prune \
         -o \
         -size +$SIZELIMIT \
         -exec sh -c 'd=$(dirname "$1" | sed "s%^\./%%")
                      archive_d="../sizes/$2/$d"
                      mkdir -p "$archive_d"
                      export l_d="../sizes/$(($2 - 1))/$d"
                      if [ -d "$l_d" ] ; then
                        ldarg="--link-dest=$(pwd)/$l_d"
                      else
                        ldarg=""
                      fi
                      rsync $ldarg \
                            "$1" \
                            "$archive_d"' \
                  archivebig \
                  {} \
                  $c \
               \;
  c=$((c + 1))
  git reset --hard HEAD^
  r=$?
done

-------------<snip>----------------
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Re: Sample chado repo git conversion ready for review

Karl O. Pinc
Hi,

On 02/25/2014 12:45:32 AM, Karl O. Pinc wrote:

> I've completed a sample svn to git conversion
> of the chado repo.

> Note that chado2git omits from the repo
> various .bz2 db backups in:
> modules/sequence/apollo-bridge/sample_db/.*\.bz2$

> I am sure there's a lot more work that could be done
> to clean cruft out of the git repo.

I don't know when the Malaysian conference is happening
but perhaps it would make sense to use the face time
for a discussion of what belongs in the chado git repo.

Regards,

Karl <[hidden email]>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein


Re: Sample chado repo git conversion ready for review

Scott Cain
Hi Karl,

Thanks so much for doing this.  The event in Malaysia is a course, not a conference, so it isn't really the right venue for discussing git repos :-)  I probably won't really get a chance to look at this until I get home, but a cursory glance shows that the history did get migrated.  As for cruft, I'm generally inclined to keep everything if I can (it's the DBA in me :-)

Thanks,
Scott

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


Re: Sample chado repo git conversion ready for review

Karl O. Pinc
On 02/25/2014 03:18:20 PM, Scott Cain wrote:

> Thanks so much for doing this.

You're welcome.  Thank the Babase folks for letting me
do it.

> I probably won't really get a chance to look at this until I get home,
> but a cursory glance shows that the history did get migrated.

It looked ok to me too.  Except that a few of the branches
and tags end in "@NNNN" where NNNN is a svn revision.  I don't
know if this is supposed to happen or not.

> As for cruft, I'm generally inclined to keep everything if I can
> (it's the DBA in me :-)

But what if I want to hack on Chado on my cellphone!?  :-)

I understand the sentiment.  There's also something to be
said for "make maintainer-clean" and related tidiness.
(And at the same time it's something of a relief to think
that the git migration might be close to happening.)

Regards,

Karl <[hidden email]>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein


Re: Sample chado repo git conversion ready for review

Siddhartha Basu
Hi Karl (and Babase),
Thanks a ton for doing this; I really appreciate it.

On Tue, 25 Feb 2014, Karl O. Pinc wrote:

> Hi,
>
> I've completed a sample svn to git conversion
> of the chado repo.  Results are available here,
> with instructions for repeating the process below:
>
> https://github.com/kpinc/chado
> Please let me know what's next, besides
> the obvious step of getting the authors right.
>
> If somebody wants to put an archive of the
> mailing list on google drive or somewhere
> where I can get to it I'd be willing to take
> a quick stab at further work on the author list.

One of the mailing list admins should be able to pull an mbox archive
of the gmod-schema list.
http://sourceforge.net/apps/trac/sourceforge/wiki/Mailing%20list%20archive%20backups%20(mbox%20files)

thanks,
-siddhartha