Last modified: 1st November 2017
This section was created to facilitate access to methods of Bioinformatics and Human Ancestry for everyone. I am a neophyte myself, so don’t expect too much complexity here. Use these instructions at your own risk.
If you are interested in Indo-European studies, or any other comparative historical study that could use genetics (say, Afroasiatic studies, Sino-Tibetan studies, etc.), and have little or no background in Bioinformatics and Genetics, I recommend that you read this guide to human ancestry carefully, and practice with it. You might be pleasantly surprised to learn that, with a little effort, you can discover new details that have no doubt escaped those who tried to analyse the data without your knowledge.
Most texts are working drafts; I really don’t have time to improve them at the moment. You can send questions or corrections to firstname.lastname@example.org, or you can leave a comment. Right now I only have time to post what I had written, not to finish it properly, in the hope that it might be useful to someone.
These are the main sections:
- Linux virtual machine in Windows for genetic analysis and genealogy (PLINK, R, writing .fam, .ind files)
- Use of Slurm Workload Management for job scheduling
- Merge, remove, convert datasets in BED, PED-FAM, or GENO-SNP formats using PLINK and Eigensoft
- Principal Component Analysis (PCA) with Eigensoft and R
- Related page: PCA and Admixture of Eurasian populations
- ADMIXTURE – Ancestry components and R – PLINK, convertf, BED and PED files
- Related post: qpAdm best practices and common pitfalls
- Related post: RISE1.SG, R1b from Poland CWC, a likely mislabelled Balto-Slav
- Related post: AdmixTools: qpgraph, qp3Pop (f_3 test), qpBound, qpDstat, qpF4ratio, rolloff
- Related post: Survival of hunter-gatherer ancestry in West-Central European Neolithic
- Related post: Proto-Corded Ware Late Trypillian
- Related post: “Steppe ancestry” step by step (2019): Mesolithic to Early Bronze Age Eurasia
- Related post: Bell Beakers and Mycenaeans from Yamnaya; Corded Ware from the forest steppe
Useful general resources (more on each individual tool):
- Here is an interesting review of available methods, feel free to try any of them: http://www.nature.com/nrg/journal/v16/n12/fig_tab/nrg4005_T1.html
- The GAWorkshop website is a great introduction to some of these methods of Genetic Analysis.
- The Reich Lab is always a great place to look for information.
- The Michigan Center for Statistical Genetics
- The ISOGG Wiki, for general information and resources
My thoughts on human ancestry
Human ancestry, a subdiscipline of Human Evolutionary Biology, is gradually developing into one of the main components of Anthropology. Unlike the more scientific biological (mostly medical) applications of Human Genetics, this subfield needs a great amount of knowledge of anthropological disciplines such as Archaeology and (for more recent times) Linguistics, and its methods need to include anthropological investigation.
Unlike other subfields of Genetics, where collaboration among like-minded scientists is easy, this discipline requires that researchers have knowledge and expertise in quite different and unrelated subjects. Even among academics who write about archaeology and language, it is rare to find someone with similar (or even sufficient) knowledge of a field in which they have not been formally trained. That is why linguists only rarely include even shallow archaeological questions in their research, and why archaeologists only tentatively give a simple shape to the potential language of prehistoric cultures.
We are seeing today the emergence of an a priori unrelated field, human ancestry research, led by scientists dedicated to obtaining and processing samples from prehistoric individuals. Their results can later be analysed using certain algorithms, either previously available or expressly designed – or modified – for the task. That means that a scientist trained in Biology or Biochemistry needs specific training in Genetics to extract and process the samples, and (if they want to analyse data) in Statistics and Bioinformatics to be able to obtain meaningful results. In fact, most researchers publishing right now are apparently specialised in Bioinformatics.
For the inexperienced and fully unaware (usually amateur) researcher, entering any one of these fields might look like a simple task. Huge mistake. It takes years to specialise in just one tiny subfield. It may take decades to be able to write coherently about more than one. Even if the ability to learn different disciplines improves with each new one learned (experience improves our learning curve), it may be impossible to master more than one discipline, let alone all those involved in human ancestry.
There is an expectation in science that the people who gather a dataset should be the ones to analyse it. This expectation has persisted from times when fieldwork and data gathering were quite limited to our modern times, when high specialisation is required for each task and huge numbers of datasets are gathered. Consequently, those specialised in obtaining and gathering data often have confused ideas of what they can do with the data. Even if some of those involved in data acquisition are able to analyse it, in this field more than any other this is no qualification at all to derive meaningful conclusions. Human ancestry requires context. And prehistoric samples require a thorough understanding of Archaeology and Historical Linguistics, if one is to draw the conclusions most journals and readers expect from modern articles on the subject. In spite of these obvious limitations, journals eager to publish citebait articles are obviously encouraging researchers to publish striking conclusions, and researchers are required by their employers to publish in journals with a high impact factor, which creates a conflict of interest.
As you can imagine, we scientists in general like to show off our titles, and we like to state that we know more than we actually know. To publicly admit ignorance of certain subjects is to open wide the gate to ad hominem attacks. In my experience, professional geneticists involved in human ancestry research (i.e. usually involving complex statistical methods) – who are no doubt extraordinary pioneers in this amazing new field – like to make general statements to present themselves, stating for example that they are geneticists, or specialists in Bioinformatics, but that they also like or know Anthropology, or Prehistory. They give lectures or courses about anthropological genetics, they speak to broad audiences about human evolution and the migrations that happened in this or that culture, and about the potential language these peoples and cultures might have spoken…
This description gives a positive image of the researcher as a sort of Renaissance humanist, but it is misleading, because – under the umbrella of general knowledge – they are using imperfect data to answer complex questions; and I am not referring to damaged DNA, but to ignorance of the anthropological subjects involved. Their self-described career choice seems to imply that they are somehow ipso facto entitled to draw wide anthropological conclusions based on their personal interests. In my opinion, while distant prehistory might be more suited to this generalist approach to Anthropology, recent prehistory and proto-history lend themselves poorly to generalised interpretations, as the ‘Yamna component’ concept has shown.
I am new to the use of algorithms to study human ancestry. I have been reading about them and their application only since the hype in the news (and in Indo-European studies) about the results of Haak et al. (2015), Mathieson et al. (2015), Allentoft et al. (2015). I have been experimenting with their datasets for just a few months, coinciding with my interest in publishing the Indo-European demic diffusion model. I will not pretend to know many details of their methods; memories of the practices for genetic analysis in the Biochemistry and Genetics lab are long buried under new, more practical knowledge, and the only reason I kept up with the basic knowledge of human genetics was genetic diseases, which frequently affect children who come for a consultation in my service of Pediatric Orthopaedics. Nevertheless, I think those with greater knowledge of precise cultures and languages – and I have been working in Indo-European studies for 15 years already – may paradoxically be in a better position to derive meaningful conclusions from human ancestry results than some geneticists who are currently publishing in the field.
- Correlation does not mean causation: the damage of the ‘Yamnaya ancestral component’, and the ‘Future America’ hypothesis
- New Ukraine Eneolithic sample from late Sredni Stog, near homeland of the Corded Ware culture
- Something is very wrong with models based on the so-called ‘steppe admixture’ – and archaeologists are catching up
- Germanic–Balto-Slavic and Satem (‘Indo-Slavonic’) dialect revisionism by amateur geneticists, or why R1a lineages *must* have spoken Proto-Indo-European
- Heyd, Mallory, and Prescott were right about Bell Beakers