Linux virtual machine in Windows for genetic analysis and genealogy (PLINK, R, writing .fam, .ind files)

Last modified: 1st November 2017

These instructions are written for Ubuntu (more specifically, its stable Desktop release 16.04) because it is a popular operating system: more precompiled packages are available for it, and you are more likely to find an answer to any problem you might encounter. I normally use OpenSUSE, but for this kind of work I use Ubuntu in a virtual machine.

Second, I am interested, for the time being, JUST in ancient populations. Modern populations from the 1000 Genomes Project, etc. are interesting insofar as they are related to ancient populations. That being said, if you want help with your personal files I might be able to help, but please take into account that my time is very limited at the moment.

Virtual Machine

This is a guide on how to create a virtual machine with Linux in VMware to work with human ancestry software.

I decided to assign approximately half the capacity of my computer to the virtual machine, to speed up all processes (of course, the best way to speed up analyses is to dedicate a whole computer to them). You can change these parameters later, if you prefer.

VMware and Slurm work fine with any equivalent combination of processors and cores (for VMware, the distinction seems to be a question of Windows licensing). That is, if you want to assign 2 cores, you can set 2 processors with 1 core each, or 1 processor with 2 cores.

I wanted to work with multiple cores, and I found that the option -j in ADMIXTURE looks for processors, so I thought it best to set 2 processors with 1 core each when configuring VMware (and slurm.conf), but that does not mean that it won’t work with the other configuration.
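To see how many processors the Ubuntu guest actually ends up with, a quick check from the terminal can help. This is a minimal sketch: the function name count_cpus is my own, and the ADMIXTURE call in the comment uses placeholder values (mydata.bed, K=3), so adapt it to your files.

```shell
# Count the logical CPUs that the guest system sees.
count_cpus() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        # Fallback: count the "processor" entries in /proc/cpuinfo
        grep -c '^processor' /proc/cpuinfo
    fi
}

CPUS=$(count_cpus)
echo "Logical CPUs visible to the guest: $CPUS"

# Hypothetical ADMIXTURE run using every visible CPU
# (mydata.bed and K=3 are placeholders):
# admixture mydata.bed 3 -j"$CPUS"
```

If the number printed does not match what you set in the VMware settings, shut down the virtual machine and review the processor configuration before running long analyses.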

Be careful to select an appropriate disk size. I began with 20 GB, but later wanted to increase the size of the main partition. I could not easily change the BIOS settings or boot from a live CD or USB (like GParted Live), and ended up modifying partitions while they were in use; the filesystem was corrupted, and I had to reinstall everything. That is not a problem if you keep copies of important files, but, you know, precious time lost… I use 50 GB now and haven’t had an issue, but you can of course work with a lot less if you are careful not to store too much data in the Linux machine.
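To avoid running out of space by surprise, you can check the free space from time to time. The following is only a sketch; the helper name free_gb is hypothetical, and it simply wraps the standard df command.

```shell
# Print the free space of a filesystem (default: /) in whole gigabytes.
free_gb() {
    # -P: POSIX output (one line per filesystem); -k: sizes in 1024-byte blocks
    # Column 4 of the second line is the available space.
    df -Pk "${1:-/}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

echo "Free space on /: $(free_gb /) GB"
```

Genotype datasets and their converted copies add up quickly, so it is worth looking at this number before starting a conversion that duplicates a large file.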

Install Ubuntu 16.04 (i.e. the latest stable version at the time of this writing) on VMware. You can select a newer version, but most packages that can be installed automatically with APT are made for the stable version, so you could end up having to install packages from source. Not that there is anything wrong with that… But if you need this tutorial, you probably want to avoid that for now.

In My Computer, select your machine -> Settings and, in the Options tab, enable Shared Folders. I named mine “ubuntu” in the host (i.e. in the Windows computer). Also, in my experience, it helps to go to Settings, tab Hardware, General, and set the enhanced keyboard option to “Required” for better compatibility.

You probably need to install VMware Tools first: go to the VM menu, select (Re)Install VMware Tools, then open the virtual CD that appears in your Ubuntu machine and extract the contents of VMwareTools-x.x.x-xxxx.tar.gz. Then enter the extracted folder:

cd vmware-tools-distrib

and install VMware Tools:

sudo ./vmware-install.pl -d

Now you should be able to find your shared folder in Ubuntu under /mnt/hgfs/.
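A quick way to confirm that the share is really there is to test for the directory. This is a minimal sketch, assuming the share is called “ubuntu” as in my setup; check_share is a hypothetical helper name.

```shell
# Report whether a VMware shared folder is visible in the guest.
# Default path assumes the share is named "ubuntu", as in this guide.
check_share() {
    share_dir="${1:-/mnt/hgfs/ubuntu}"
    if [ -d "$share_dir" ]; then
        echo "found: $share_dir"
    else
        echo "missing: $share_dir"
        return 1
    fi
}

check_share /mnt/hgfs/ubuntu ||
    echo "Check that Shared Folders is enabled and that VMware Tools is installed."
```

If the folder is reported as missing, review the Shared Folders setting in the host and the VMware Tools installation before looking for other causes.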

I prefer to have certain bookmarked folders in the file manager (Nautilus, in the case of Ubuntu) for quick access. I have one for the folder ubuntu; I recommend that this be one of yours too.

To update the repositories in Ubuntu:

sudo apt-get update

Then upgrade the system:

sudo apt-get upgrade

PLINK

You can find documentation and detailed information on PLINK, its commands, and its different versions at the official documentation website.

Install PLINK version 1.9, which has a supported package in Ubuntu:

sudo apt-get install plink1.9

and answer yes (Y) when asked to install all dependencies.

You can also install the old PLINK version 1.07:

sudo apt-get install plink
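After installing, it is worth checking which PLINK command is actually available, because the name differs between packages. The sketch below is based on assumptions: as far as I know, the plink1.9 package provides a command called plink1.9, and the old Debian/Ubuntu package renames its binary to p-link to avoid a clash with PuTTY's plink tool, but verify this on your own system. The helper first_available and the dataset prefix mydata are hypothetical.

```shell
# Print the first of the given commands that is found on PATH.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            return 0
        fi
    done
    return 1
}

# Assumed binary names: plink1.9 (version 1.9), p-link (old 1.07 package).
if PLINK=$(first_available plink1.9 p-link); then
    echo "Using PLINK binary: $PLINK"
    # Hypothetical call; "mydata" stands for the mydata.bed/.bim/.fam prefix:
    # "$PLINK" --bfile mydata --freq --out mydata_freq
else
    echo "No PLINK binary found on PATH"
fi
```

I deliberately left the plain name plink out of the list, since on some systems that command belongs to PuTTY and has nothing to do with genetics.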

If you prefer to install the old version manually on Linux, you will need to follow the instructions at:

http://zzz.bwh.harvard.edu/plink/download.shtml#nix

For Windows, you have to place plink.exe (or plink2.exe) in a certain folder, from where you will call the commands.

Version 1.9 (and, in the future, version 2) is supposed to replace the old version: it is much faster, has lower memory requirements, and lets you get rid of the annoying --noweb parameter. Believe it or not, with certain medium-sized datasets and PLINK version 1.07, a computer with four cores (i7) and 32 GB will give you an “Out of memory” message, and you will need to split the data… However, in Windows, I found that errors keep popping up if you use free datasets with PLINK2.

I prefer to use PLINK in Linux, because I am used to editing text files and working on the command line there.

There is a package called gdplink if you prefer to use a (Java-based) GUI for PLINK.

For errors using PLINK, please visit https://www.cog-genomics.org/plink/1.9/errors first. Then try searching for your errors in your preferred search engine and, if that fails, try specialized forums and wait for an answer. Contacting me is your worst option.

EIGENSOFT

You can find information on the Eigensoft software at https://github.com/argriffing/eigensoft. You can install it from source following its instructions, or you can install the package (available at least in Debian and Ubuntu):

sudo apt-get install eigensoft
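Once EIGENSOFT is installed, a typical first step is converting PLINK binary files to EIGENSTRAT format with convertf, which reads its settings from a parameter file. The following is a minimal sketch of how such a file could be generated; write_convertf_par and the prefixes mydata and mydata_eig are hypothetical names, and you should check the parameter keys against the EIGENSOFT documentation for your version.

```shell
# Write a convertf parameter file for PLINK binary -> EIGENSTRAT conversion.
# $1 = input prefix (expects $1.bed, $1.bim, $1.fam), $2 = output prefix.
write_convertf_par() {
    in="$1"
    out="$2"
    cat > "$out.par" <<EOF
genotypename:    $in.bed
snpname:         $in.bim
indivname:       $in.fam
outputformat:    EIGENSTRAT
genotypeoutname: $out.geno
snpoutname:      $out.snp
indivoutname:    $out.ind
EOF
}

# Hypothetical usage (prefixes are placeholders):
# write_convertf_par mydata mydata_eig
# convertf -p mydata_eig.par
```

Keeping the parameter file next to the output makes it easy to see later how a given .geno/.snp/.ind set was produced.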

R

R is probably the most interesting tool to learn and to work with, not only for human ancestry, but for any scientific discipline you might want to work in in the future.

You can download the appropriate software package for your system here.

For Ubuntu, the package installation will take care of dependencies:

sudo apt-get install r-base

See https://cran.r-project.org/bin/linux/ubuntu/README.html

I use R in Windows, because that is my preferred working environment. The default GUI is not that good, so I prefer to use R Server from Microsoft Visual Studio 2017 (which is the heir of the R Productivity Environment by Revolution Computing).

You can also run it in Excel (http://rcom.univie.ac.at/), as a web application (http://biostat.mc.vanderbilt.edu/rapache/), or as a server (http://www.rforge.net/Rserve/index.html), and install any other package you might need (visit the R-project website).

If you are really into working with R with big data, read Setting up RStudio Server quickly on Amazon EC2. RStudio Server supplies an interactive user interface to your remote R environment that is nearly indistinguishable from a local RStudio console. For a few dollars you can work interactively on R tasks requiring hundreds of GB of memory and tens of CPUs and GPUs.

Regarding the R language, if you have previous knowledge of any programming language, or even just pseudocode, it is not difficult at all. You just have to dedicate some time to the basics, and then you will only need to search for usable code to adapt.

The manual is a clear and thorough introduction to R; a well-known book is, for example, R in a Nutshell.

Other tools

You will want to be able to write different kinds of documents. Especially in Windows (in Linux you will not have these problems), you might want to have at least:

– A good text editor, like Notepad++, to open text files of unknown extensions, without corrupting them when modifying and saving them. There are many other free software editors, maybe better than Notepad++, but for me it does the job.

– A spreadsheet editor, like Microsoft Excel, or any free software out there with similar functionality. You will certainly want to be able to write over files with hundreds of samples without having to edit them one by one…
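In Linux, this kind of bulk editing can also be done from the command line. As an illustration, here is a minimal sketch that turns a PLINK .fam file (family ID, individual ID, father, mother, sex, phenotype) into an EIGENSOFT .ind file (sample ID, sex, population label). It assumes that the family ID column holds the population label, which is a common convention but not a rule, so check your own data; fam_to_ind and the sample names are made up for the example.

```shell
# Convert a PLINK .fam file into an EIGENSOFT .ind file.
# .fam columns: family ID, individual ID, father, mother, sex (1/2/0), phenotype
# .ind columns: sample ID, sex (M/F/U), population label
# Assumption: the family ID is used as the population label.
fam_to_ind() {
    awk '{
        sex = "U"
        if ($5 == 1) sex = "M"
        else if ($5 == 2) sex = "F"
        print $2, sex, $1
    }' "$1"
}

# Demonstration with three made-up samples written to a temporary file.
demo=$(mktemp)
printf '%s\n' \
    'PopX SampleA 0 0 1 1' \
    'PopX SampleB 0 0 2 1' \
    'PopY SampleC 0 0 0 1' > "$demo"
fam_to_ind "$demo"
rm -f "$demo"
```

With the three demo lines, this prints “SampleA M PopX”, “SampleB F PopX” and “SampleC U PopY”. For a real file you would redirect the output, for example fam_to_ind mydata.fam > mydata.ind, where mydata is a placeholder.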
