Chromosome coordinate systems: 0-based, 1-based

I had a hard time figuring out that different websites and file formats use different systems to represent genome coordinates.

Basically, the bases can be numbered in two ways: starting at 0 or starting at 1. These are the 0-based and 1-based coordinate systems.

0-based:

ACTGACTG
01234567

1-based:

ACTGACTG
12345678

A system is then called inclusive if the end index is part of the sequence, or exclusive if it is not.

For instance, to represent the sequence TGAC:

0-based inclusive: 2-5
1-based inclusive: 3-6
1-based exclusive: 3-7
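
As a rough sketch of how these three notations map onto the same bases, the helper functions below (hypothetical names, not taken from any existing tool) translate each convention into the 1-based inclusive positions that cut -c expects. All three calls print TGAC.

```shell
# cut -c works with 1-based, inclusive character positions, so each
# convention is converted to that form before extracting.
# Usage for all helpers: NAME SEQUENCE START END

zero_based_inclusive() {
  # shift both ends by one to go from 0-based to 1-based
  printf '%s\n' "$1" | cut -c "$(( $2 + 1 ))-$(( $3 + 1 ))"
}

one_based_inclusive() {
  # already in the form cut expects
  printf '%s\n' "$1" | cut -c "$2-$3"
}

one_based_exclusive() {
  # the end index is not part of the sequence, so stop one base earlier
  printf '%s\n' "$1" | cut -c "$2-$(( $3 - 1 ))"
}

zero_based_inclusive ACTGACTG 2 5
one_based_inclusive  ACTGACTG 3 6
one_based_exclusive  ACTGACTG 3 7
```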

I’ve tried to figure out which websites and applications use each coordinate system. The results can be found below. For each source, I provide the URL of the reference website where I found the information, and an excerpt in which the system is described.

I found most of those links on Biostar (https://www.biostars.org/p/6373/) and on the blog of Casey M. Bergman (http://bergmanlab.smith.man.ac.uk/?p=36), who also wrote an article on this topic: https://www.landesbioscience.com/journals/mge/article/19479/.

Question:
“I am confused about the start coordinates for items in the refGene table. It looks like you need to add “1” to the starting point in order to get the same start coordinate as is shown by the Genome Browser. Why is this the case?”

Response:
Our internal database representations of coordinates always have a zero-based start and a one-based end. We add 1 to the start before displaying coordinates in the Genome Browser. Therefore, they appear as one-based start, one-based end in the graphical display. The refGene.txt file is a database file, and consequently is based on the internal representation.

We use this particular internal representation because it simplifies coordinate arithmetic, i.e. it eliminates the need to add or subtract 1 at every step. Unfortunately, it does create some confusion when the internal representation is exposed or when we forget to add 1 before displaying a start coordinate. However, it saves us from much trickier bugs. If you use a database dump file but would prefer to see the one-based start coordinates, you will always need to add 1 to each start coordinate.

If you submit data to the browser in position format (chr#:##-##), the browser assumes this information is 1-based. If you submit data in any other format (BED (chr# ## ##) or otherwise), the browser will assume it is 0-based. You can see this both in our liftOver utility and in our search bar, by entering the same numbers in position or BED format and observing the results. Similarly, any data returned by the browser in position format is 1-based, while data returned in BED format is 0-based.

 

BED format uses zero-based, half-open coordinates, so the first 25 bases of a sequence are in the range 0-25 (those bases being numbered 0 to 24)

The first three required BED fields are:

chrom – The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
chromStart – The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
chromEnd – The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
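
A minimal sketch of the conversion between BED coordinates (0-based start, end not included) and position format (1-based, inclusive), using awk. The function names are hypothetical, and the position parser assumes that chromosome names contain neither ':' nor '-'.

```shell
# BED line (chrom, chromStart, chromEnd) -> chrom:start-end
# Only the start changes: add 1 to make it 1-based. The end stays the
# same because an excluded 0-based end equals an included 1-based end.
bed_to_position() {
  awk '{ printf "%s:%d-%d\n", $1, $2 + 1, $3 }'
}

# chrom:start-end -> tab-separated BED line (subtract 1 from the start)
position_to_bed() {
  awk -F '[:-]' '{ printf "%s\t%d\t%d\n", $1, $2 - 1, $3 }'
}

# Feature length in BED needs no +1 correction: chromEnd - chromStart
bed_length() {
  awk '{ print $3 - $2 }'
}

printf 'chr3\t0\t100\n' | bed_to_position   # chr3:1-100
printf 'chr3:1-100\n'   | position_to_bed   # chr3, 0, 100 (tab-separated)
printf 'chr3\t0\t100\n' | bed_length        # 100
```

The length calculation shows why the 0-based start with an excluded end simplifies arithmetic: with 1-based inclusive coordinates the same length would be end - start + 1.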
 
start: Lowest numeric position of the reported variant on the genomic reference sequence. Mutation start coordinate (1-based coordinate system).
end: Highest numeric genomic position of the reported variant on the genomic reference sequence. Mutation end coordinate (inclusive, 1-based coordinate system).

Glassfish setup and managing

Okay, my web apps run on Glassfish, and each time I reinstall it I have to search my notes to find all the options I usually modify to get it working well.

So I finally decided to put them all in a public post to share with myself, my colleagues, and possibly anyone else facing the same problems. Most of the tips and configuration here relate to version 3 of Glassfish. Note that this post may (should) be regularly updated.

 

Summary:

To get straight to the point, here are the main configuration tricks I’m using; many of them were found here. See below for more details and links about problems and solutions.

  • Increase the PermSize (used by the class loader each time an application is loaded): -XX:MaxPermSize=512m and -XX:PermSize=512m
  • Use the -server option
  • Increase the memory, e.g. -Xmx4096m and -Xms4096m (Xmx = maximum heap size, Xms = heap size allocated at startup)
  • Increase the max and min size of http-thread-pool to 32 and 16 (see below)
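
The settings above can also be applied from the command line instead of the admin console. The following is only a sketch: it assumes asadmin is on the PATH, that the domain is running, and that the dotted names for the thread pool match the ones I recall from Glassfish 3 (check them with asadmin get before relying on them). The function name tune_glassfish is hypothetical.

```shell
# Apply the JVM options and thread pool sizes listed in the summary.
# Colons inside a JVM option are escaped with a backslash because
# asadmin uses ":" to separate several options in one operand.
tune_glassfish() {
  asadmin create-jvm-options '-XX\:MaxPermSize=512m' || return 1
  asadmin create-jvm-options '-XX\:PermSize=512m'    || return 1
  asadmin create-jvm-options '-server'               || return 1
  asadmin create-jvm-options '-Xmx4096m'             || return 1
  asadmin create-jvm-options '-Xms4096m'             || return 1
  # Assumed dotted names for the http-thread-pool of server-config
  asadmin set 'configs.config.server-config.thread-pools.thread-pool.http-thread-pool.max-thread-pool-size=32' || return 1
  asadmin set 'configs.config.server-config.thread-pools.thread-pool.http-thread-pool.min-thread-pool-size=16' || return 1
}
```

Calling tune_glassfish runs the commands in order and stops at the first failure. Conflicting entries already present in the domain configuration (for instance a smaller -Xmx value) may have to be removed first with asadmin delete-jvm-options, and the domain needs a restart before JVM options take effect.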

 

Problem: Server stops responding:

I had set up a web service to run a local application on a remote machine and get the result back through a Java API. After a (rather short) time, the application stopped responding… I eventually understood that the Glassfish server was missing

I first found I should add this option:

-Dcom.sun.enterprise.server.ss.ASQuickStartup=false

Yes, I don’t know what it means yet; I’ll have to dig into it before adding it to a production server (source: https://www.java.net/node/677689)

What helped me in this case was to increase the max and min size of the thread pool (under server-config, thread pools, http-thread-pool), to 32 and 16 respectively.

 

Problem: Deploy CommandException Error

In some cases it was simply impossible to redeploy some of my applications. The error message was “There is no installed container capable of handling this application”. Unexpectedly, it turned out that the application had not been correctly removed from the Glassfish application directory (source: http://stackoverflow.com/questions/5206712/glassfish-deploy-commandexception-error/), so I just had to remove it manually.

Problem:  Loader_<xxx> directories in generated/jsp application folders are not deleted after server

Yep, my Glassfish installation directory was becoming HUGE, particularly the generated directory. This directory was not cleaned automatically in some Glassfish versions. Adding the following lines to the startup script (or manually removing the directory from time to time) solved the problem:

# Remove the generated directory each time the domain is started
if [ "$1" = "start-domain" ]; then
  echo "Removing generated folder..."
  rm -rf ../domains/domain1/generated
  sleep 2
fi

source: https://java.net/jira/browse/GLASSFISH-19162

 

Several journals are trying to innovate in the way scientists publish their results. I was recently contacted by F1000 to review an article. Articles sent to F1000 are published almost immediately, as are the comments from the referees. I quite like the idea of public comments: the author knows who I am, and I feel it favours discussion over pure judgement of the submitted work. It also saves time for the referee, who doesn’t have to use proxies and other subterfuges to hide his/her identity.

After writing the review, I dug a bit more into the F1000 websites for other features and ended up on this blog page, which describes what I think should be a must-have for any journal publishing articles that include software: http://blog.f1000research.com/2013/10/11/open-access-software-our-recent-software-repository-collaborations/.

F1000 allows authors to publish the source code (and metadata) in a dedicated GitHub repository. This means that

  1. the source code is made available on a public repository (and won’t disappear if the submitter’s web site is closed)
  2. it will always be possible to go back to the version of the software which has been used to generate the published data, even if updates are made later.

I wonder whether other publishers offer similar features.

MINT, Intact, MINTACT, and other protein interactions databases

Protein-protein interactions are stored in public databases. Many of them exist, but only a few capture enough experimental information to allow the data quality to be evaluated, or even to tell what kind of interaction has been described.

In a recent review (Towards a detailed atlas of protein–protein interactions) I commented on the level of curation and on the databases that make it possible, for instance, to distinguish binary interactions. MINT, Intact and DIP are three of those.

MINT has always been my favourite: I worked on it for many years. I believe it has one of the most user-friendly interfaces (check this OpenHelix post about how it is “so fun to use”). It was also appreciated by wet laboratories for the data it contains. Indeed, one particularity of MINT is that it is run by a wet-lab group (the group of Gianni Cesareni at the University of Rome), and the data is often curated according to the needs of the laboratory. In the past, this made it possible to extend the coverage of domain-mediated interactions and of protein interactions in viruses.

Recently, under the name of the “MINTACT” project, MINT and Intact announced that the interactions curated by the MINT staff will be inserted directly into Intact. While it first sounded like the death of MINT, it isn’t, at least not yet. Although the future of the web interface is uncertain (the MINT group has recently developed a new interface called mentha to aggregate interactions from different databases), the curation staff is still working. If you prefer to submit your interaction data to MINT, you can still do so, and MINT curators will process your data as they used to. The big advantage on the user side is that the data will automatically be incorporated into Intact. Downloading data from Intact (or using the really useful PSICQUIC REST server from Intact) will provide access to both MINT and Intact data. Updates to the data will also be handled by the Intact team (because of the group’s limited human resources, MINT data was not updated as often as Intact’s).

Other databases were already sending their data directly to Intact. As a consequence, all IMEx interactions, except the ones from DIP, are stored and available with all the detailed information in Intact.

I recommend reading the MINTACT paper for more information.