SIBMED (SIBpair Mutation and Error Detection) Version 1.0 April 18, 2000 Julie A. Douglas and Michael Boehnke University of Michigan Department of Biostatistics School of Public Health 1420 Washington Heights Ann Arbor, Michigan 48109-2029 Phone: 734-647-9466 and 734-936-1001 FAX: 734-763-2215 Email: jddoug@umich.edu and boehnke@umich.edu TABLE OF CONTENTS Introduction Assumptions Input Results and Output Advice on Running SIBMED Default Array Dimensions Compiling and Running SIBMED Error Conditions and User Support Acknowledgements References Introduction Sib-pair designs are routinely employed in linkage studies of complex diseases and quantitative traits. When parents or additional sibs are available, genotyping errors and mutations can often be detected as Mendelian incompatibilities or apparent double recombinants. However, given only sib-pair data, genotyping errors and mutations cannot be detected as Mendelian incompatibilities. Still, they can be inferred probabilistically given a sufficiently dense map of markers. Genotyping errors occur when the observed genotype does not correspond to the true underlying genotype, owing to laboratory error or incorrect data interpretation or entry. Marker mutations, which are relatively frequent for short tandem repeat polymorphisms (Weber and Wong 1993), can mimic genotyping errors. The net effect of undetected genotyping errors and mutations for affected sib-pair designs is to diminish evidence for linkage. SIBMED is a FORTRAN 77 program that identifies likely genotyping errors and mutations for a sib pair in the context of multipoint mapping. Specifically, using a hidden Markov model, the program calculates the posterior probability of genotyping error or mutation for each sib-pair-marker combination, given all the available marker data, an assumed genotype-error rate, and a known genetic map. The subset of combinations for which this posterior error probability is high may be considered for exclusion, review, or retyping. Because the empirical distributions of the posterior error probabilities are a function of map density, prior error rate, marker position, and marker allele frequency, we use Monte Carlo simulation to determine marker-specific cutoff constants in order to identify unusually high posterior error probabilities. Monte Carlo simulation has the advantage of being completely specific to the data at hand, thereby providing a mechanism for controlling the false positive rate. Subject to small false positive rates (0.001), SIBMED does not generate false evidence for linkage by removing incorrectly identified errors. We recommend the use of larger false positive rates ONLY if re-scoring or retyping will be carried out. Optionally, SIBMED allows users to forgo Monte Carlo simulation, and instead produce a list of high posterior error probabilities with the associated sib-pair-marker combination information. Genotypes for which the posterior error probabilities are extremely high, for example, greater than 0.90, can be confidently excluded. The accuracy of error detection using SIBMED is directly related to map density; it is most advantageous for dense maps (1-3 cM marker spacing). The method and additional information are described in Douglas et al. (2000). Assumptions 1. Only full sibs are considered; MZ-twin pairings are excluded. 2. All genetic markers are autosomal and codominant, and marker allele frequencies are known. 3. Intermarker distances are known, and there are no sex differences in recombination fractions. 4. The assumed relationship (i.e. full sibs) is correct. 5. Hardy-Weinberg and linkage equilibrium. 6. No genetic interference (for purposes of calculating multipoint genotype probabilities). 7. Kosambi's mapping function is used to convert map distances to recombination fractions. 8. Prior genotype-error rate e (0e1). 9. No marker has more than 99 alleles, and no more than 9999 markers are to be analyzed. Chromosomes are of length no more than 50 Morgans. Input SIBMED requires three input files: (1) a control file that describes the basic analysis parameters and seeds for Monte Carlo simulation (optional), (2) a locus file that describes the marker loci to be analyzed, and (3) a pedigree file that describes the structure and genotype data for the families to be analyzed. The locus and pedigree files are similar to those required by MENDEL (Lange et al. 1988) and identical to those required by RELPAIR (Boehnke and Duren, 1997). (1) Control File The control file contains information describing the analyses to be carried out. It includes the following 11 (or 14) records: (i) locus file name, left justified in columns 1-72 (ii) pedigree file name, left justified in columns 1-72 (iii) result file name, left justified in columns 1-72 (iv) output file name, left justified in columns 1-72 (v) echo marker map indicator, in column 1 ("y" if the map information should be echoed to the result file, "n" if not; note--do not enter the quotation marks) (vi) echo pedigree data indicator, in column 1 ("y" if the pedigree data should be echoed to the result file, "n" if not; note--do not enter the quotation marks) (vii) female symbol, left justified in columns 1-8 (for example, "F") (viii) male symbol, left justified in columns 1-8 (for example, "M") (ix) prior error rate ((.000001), left justified in columns 1-8 (x) simulation indicator, in column 1 ("y" if the simulation should be performed, "n" if not; note--do not enter quotation marks) (xi) posterior error probability cutoff constant ((.000001), left justified in columns 1-8 or (xi) prescribed false positive rate ((.000001), left justified in columns 1-8 (xii) first random number seed (32,767), right justified in columns 1-5 (xiii) second random number seed (32,767), right justified in columns 1-5 (xiv) third random number seed (32,767), right justified in columns 1-5 For example, the control file below specifies a prior error rate of 1% and a posterior error probability cutoff of 90%: sample.loc sample.ped sample.res sample.out y y F M 0.01 n 0.90 End of file symbol In contrast, the control file below specifies a prior error rate of 1% and a prescribed false positive rate of 0.1%: sample.loc sample.ped sample.res sample.out y y F M 0.01 y 0.001 18438 7390 20094 End of file symbol Note that the control file must end with some kind of end of file symbol. On some computers, PCs for example, this is automatically done, and the symbol is invisible. On other computers, there is a visible or partially visible symbol. All FORTRAN 77 compilers have an ENDFILE command if it is necessary to produce the end of file symbol. (2) Locus File The locus file contains information describing the marker loci and map distances. Marker loci are provided one locus at a time with the following records required for each locus: (i) locus identifier record in (2A8,2I2,I4,F8.5) format specifying (a) locus name, in columns 1-8 (b) AUTOSOME or X-LINKED, depending on whether the locus is autosomal or X- linked, in columns 9-16 (SIBMED does not currently handle the X-linked case, so this option will be ignored) (c) number of alleles (<=99), right justified in columns 17-18 (d) number of phenotypes, right justified in columns 19-20 (SIBMED only handles codominant marker loci, so this value must be 0 by convention) (e) chromosome on which the locus is located, right justified in columns 21-24 (1-22) (f) map position of the locus (in Morgans) relative to one end of the chromosome, in columns 25-32 Note: Kosambi's mapping function is used to convert map distances to recombination fractions. Changing the program FUNCTION MAPFUN can easily modify this choice. (ii) for each allele, a record in (A8,F8.5) format specifying (a) allele name, in columns 1-8 (b) allele population frequency, in columns 9-16 For example, consider the locus file below: MARKER1 AUTOSOME 3 0 1 0.00 1 .18 2 .55 3 .27 MARKER2 AUTOSOME 4 0 1 0.02 1 .01 2 .07 3 .86 4 .06 MARKER3 AUTOSOME 3 0 1 0.05 1 .32 2 .61 3 .07 MARKER4 AUTOSOME 5 0 1 0.07 1 .14 2 .02 3 .79 4 .01 5 .04 End of file symbol This particular file specifies alleles and their frequencies for 4 autosomal marker loci with intermarker distances of 2, 3, and 2 cM, respectively. The locus file and the pedigree file must be coordinated so that the genotype fields for individuals match exactly the order of the loci in the locus file. Thus, a pedigree appropriate to the above locus file must have MARKER1 and MARKER2 as the sixth and seventh items in each individual record (see below). Finally, the locus file must end with some kind of end of file symbol (see Control File). (3) Pedigree File The pedigree file contains records for pedigrees and records for individuals within each pedigree. A pedigree record specifies the number of people in a pedigree and an ID for the pedigree. An ID can have a maximum of eight characters. Individual records for each person in the pedigree follow the pedigree record. Each individual record provides an individual ID, the IDs of the individual's parents if present in the pedigree (blank otherwise), his/her sex, identical twin status, and all his/her marker genotypes. The order of these items is fixed. Also, note that individual IDs must be unique within a pedigree. The term pedigree has certain technical connotations. To properly reconstruct relationships between individuals, individuals must often be included who are dead or otherwise unavailable for study. As a general rule of thumb, either both parents or neither parent of an individual must be listed in the pedigree. Preceding the pedigree and individual records are two FORTRAN format records for reading in the data. The first format record specifies the format for reading the pedigree records. It consists of an integer format (I) for reading the number of individuals in a pedigree and a character format (A) for reading the pedigree ID. (I2,1X,A8) gives a typical example. Note that since the number of pedigree members is read in I format, this number must be right justified in its field, in this example, in columns 1 and 2. The second format directs the reading of the individual records. It is very important that the format records match the corresponding pedigree data exactly. Note that a common source of error is mismatch between the number of markers specified in the format record and the actual number of genotypes present for each individual in the pedigree file. For example, the pedigree file below contains genotype data for 3 pedigrees (a total of 5 sib pairs) and 4 marker loci: (I2,1X,A8) (3A8,2A1,A3,3(1X,A3)) 6 FAMILY1 FATHER M MOTHER F MA F CHILD1 FATHER MOTHER M 1/2 1/1 1/3 3/5 CHILD2 FATHER MA M 3/3 1/3 2/2 1/5 CHILD3 FATHER MOTHER M 3/3 1/3 2/2 1/5 4 FAMILY2 DAD M MOM F KID1 DAD MOM M 1/1 2/2 1/1 4/4 KID2 DAD MOM F 2/3 1/2 2/3 1/5 4 FAMILY3 PA M MA F BROTHER PA MA M 2/2 3/4 2/3 SISTER PA MA F 1/2 2/4 2/2 1/4 End of File Symbol In this example, the individual ID and parental IDs are read in A8 format. Although the mother is listed as the second parent in the pedigree, there is no requirement that the first parent be the father and the second parent the mother. Sex and MZ twin status are read in A1 format while MARKERS 1-4 are each in A3 format. The 1X format item allows one space between reading two fields. All items or fields in an individual record should be read in character format (A), and with the exception of the marker genotypes, should consist of eight characters or fewer. Marker genotypes are represented by two allele labels separated by a slash (/), where the allele labels are identical those in the locus file. Recall (from Locus File), each allele should consist of eight characters or fewer. Missing values for any field are represented by blanks. For a genotype, either zero or two alleles are present, never one. In addition, we have used an 'F' for female and 'M' for male in the above example. Because we currently are not allowing for X-linkage, this variable is ignored in the analysis. Different female and male symbols may be specified, if desired (see Control File). Moreover, the MZ twin field is left blank in this example because there are no identical twins. If there are identical twins (triplets, etc.) in a pedigree, then each pair (trio, etc.) of them should be assigned the same unique-to-the- twinship non-blank identifier. Finally, the pedigree file must end with some kind of end of file symbol (see Control File). Results and Output The SIBMED result file begins with a short header, followed by the information used in the analysis and simulations, including the names of the locus and pedigree files, the prior error rate, and the prescribed false positive rate or posterior error probability cutoff constant. Optionally (see Control File), the marker map and/or pedigree data are then printed. The result file concludes with a summary of the results. The first section provides a detailed listing of the sib pairs identified with likely genotyping errors or mutations, including the pedigree ID, the individual sib IDs, the marker name and position (in cM), and the associated posterior error probability. The second section concludes with the total number of sib-pair-marker combinations identified with likely genotyping errors or mutations and the total number of sib pairs and sib-pair-marker combinations analyzed. An example result file is given below: ***************************************************** *************** SIBMED.F VERSION 1.0 **************** ************** WRITTEN BY J.A. DOUGLAS ************** ****************** APRIL 17, 2000 ******************* ***************************************************** LOCUS FILE: sample.loc PEDIGREE FILE: sample.ped ____________________________________________________________________________ PRIOR ERROR RATE = 0.0100 PRESCRIBED FALSE POSITIVE RATE = 0.0010 ____________________________________________________________________________ ***SUMMARY OF RESULTS*** ____________________________________________________________________________ PAIRS IDENTIFIED WITH LIKELY MARKER GENOTYPING ERRORS OR MUTATIONS: ____________________________________________________________________________ FAMILY SIB1 SIB2 MARKER POS(CM) PR.ERROR ____________________________________________________________________________ FAMILY1 CHILD1 CHILD3 MARKER3 5.0 0.282287 ____________________________________________________________________________ SIBMED IDENTIFIED 1 SIB-PAIR-MARKER COMBINATIONS ____________________________________________________________________________ 4 SIB PAIRS ANALYZED 16 SIB-PAIR-MARKER COMBINATIONS CONSIDERED ____________________________________________________________________________ The SIBMED output file provides the flagged sib-pair-marker combination information in column format, which can then be easily manipulated by other software packages. As in the result file, the first column corresponds to the pedigree ID, the second and third columns correspond to the sib IDs, the fourth and fifth columns list the marker name and position (in cM), respectively, and the last column gives the posterior error probability. For example, the output file corresponding to the previous result file is given below: FAMILY1 CHILD1 CHILD3 MARKER3 5.0 0.282287 Advice on Running SIBMED If the number of sib-pair-marker combinations identified with likely genotyping errors or mutations is uncomfortably large, we recommend using a smaller prescribed false positive rate. Again, we do not recommend using false positive rates greater than 0.001 unless re-scoring or retyping will be carried out. Even then, the advantage of more complete error detection must be balanced against the additional effort required to re-score and possibly retype a much larger subset of potential errors and mutations. Removing likely errors under less restrictive false positive rates can have the unfortunate consequence of introducing false linkage information. Inspecting results family by family may be helpful in sorting out who among several flagged sib-pair-marker combinations has a likely genotyping error or mutation. For example, in a sibship of size three, if two of the three possible pairings of sibs yield relatively large posterior error probabilities for the same marker, then the sib in common to both of these pairings may be inferred to have a likely genotyping error or mutation. Similarly, examining results marker by marker may be helpful in identifying problem markers. The output file can be used for these post-processing purposes. Default Array Dimensions The maximum array dimensions for SIBMED are initially set by the values of the following variables: Variable Name Initial Value Variable Description MAXALL 40 Maximum number of alleles for a genetic marker MAXFAM 2,000 Maximum number of families in the sample MAXLOC 500 Maximum number of genetic markers to analyze MAXPEO 30 Maximum number of people per family MXPTOT 8,000 Maximum total number of people in the sample MAXSIB 10,000 Maximum number of sib pairs in the sample NSIM 1,000,000 Maximum number of sib pairs in the simulation If you wish to choose your own array dimensions, modify the PARAMETER statement on the first non-comment lines of SIBMED (currently lines 58-59). This may be accomplished by using a file editor. Then recompile the program as described below. If the data being analyzed are too large for the current array dimensions, the program will issue an error message and will terminate. The error message will give the specific parameter whose value needs to be changed. Compiling and Running SIBMED SIBMED has been developed on microcomputers running in MS-DOS* mode (Pentium) and on workstations running UNIX (System V Release 4.0). Source code as well as executables for both these platforms is available on our web site. If you are using one of these platforms and if the default array dimensions with the shipped versions (see above) are sufficient for your problem, you can skip compiling and just do the following: (1) Copy the appropriate executable file to your hard disk. (2) Create the required input files as described above. (3) Run the program by typing SIBMED, where represents a carriage return. SIBMED will print a header to the screen and ask you to specify the control file name. Once this is done, SIBMED will attempt to carry out the analyses requested. The DOS executable files were compiled and linked using Leahy Fortran 90 Release 1.00i with the commands: lf90 SIBMED.FOR -bind -fix -tp -o3 -exe SIBMED These commands may be found in the file SIBMED.BAT. The UNIX executable files were compiled and linked using f77 with the commands: f77 SIBMED.F -s -dalign -xlibmopt -native -O4 -o SIBMED These commands may be found in the file SIBMED.MAK. Since SIBMED is written in standard FORTRAN 77, we anticipate that the program should be portable with little or no modification to other platforms as well. Please let us know if you have any difficulties with this. Also, we would appreciate learning of the changes required for such transfers. *Note that due to the way Windows 95 allocates memory, you will not likely be able to run SIBMED from the MS-DOS prompt window. Instead, you should re- boot to MS-DOS mode. Error Conditions and User Support When SIBMED stops without completing the desired analyses, error messages may be found either on the screen, in the result file, or both. ALWAYS CHECK THE RESULT FILE FOR ERROR MESSAGES IF THE PROGRAM STOPS UNEXPECTEDLY! After correcting the error(s) noted, the program can be re-run. While we have tried to carefully document SIBMED and to make it reasonably easy to use, problems will undoubtedly arise. Before calling us, please read the documentation carefully. It is our experience with other programs that many of the questions we are asked are answered in the documentation. If after reading the documentation you are still having problems, help on an as-available basis can be obtained by phone, letter, fax, or e-mail from either Julie A. Douglas or Michael Boehnke. If you think you have found a bug in our program, or an error in the documentation, we also would like very much to know about it. Since this is the first distribution version of SIBMED, errors may be found. We will do our best to correct any errors discovered as quickly as possible and inform all users immediately. Although SIBMED is distributed free of charge, please do not pass on a copy of the programs to others. Instead, please ask anyone wishing to use the programs to obtain them directly from our website (http://www.sph.umich.edu/csg/software). This procedure allows us to keep accurate records of software users and to make users aware of improvements as they are made and errors as they are corrected. Acknowledgements We thank Bill Duren for making subroutines from his computer program available to us. This research was supported by NIH grants T32 HG00040 (JAD), R01 HG00376 (MB), and a University of Michigan Rackham Predoctoral Fellowship (JAD). References Douglas JA, Boehnke M, Lange K (2000) A multipoint method for detecting genotyping errors and mutations in sibling-pair linkage data. Am J Hum Genet 66: 1287-1297 Boehnke M, Duren WL (1997) RELPAIR: Software for assessing pairwise relationships using marker data, Version 0.90 Kosambi DD (1944) The estimation of map distances from recombination values. Ann Eugen 12:172-175 Lange K, Weeks D, Boehnke M (1988) Programs for pedigree analysis: MENDEL, FISHER, and dGENE. Genet Epidemiol 5:471-472 Weber JL, Wong C (1993) Mutation of human short tandem repeats. Hum Mol Genet 2: 1123-1128