% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/vcf2diem.r
\name{vcf2diem}
\alias{vcf2diem}
\title{Convert vcf files to diem format}
\usage{
vcf2diem(
  SNP,
  filename,
  chunk = 1L,
  requireHomozygous = TRUE,
  ChosenInds = "all",
  bed = FALSE
)
}
\arguments{
\item{SNP}{A character vector with a path to the '.vcf' or '.vcf.gz' file, or a \code{vcfR}
object. Diploid data are currently supported.}

\item{filename}{A character vector with the path where the converted genotypes will be saved.}

\item{chunk}{Numeric indicating how to split the result into separate files: values
below 100 give the number of files, values of 100 or more give the number of markers
per file. See Details.}

\item{requireHomozygous}{A logical or numeric vector indicating whether the marker is
required to have one or more homozygous individuals for each allele.}

\item{ChosenInds}{A numeric or logical vector of indices of individuals to be included
in the analysis. The default \code{"all"} includes all individuals.}

\item{bed}{Logical. If \code{TRUE}, export \code{includedSites} and \code{omittedSites}
in 3-column BED format.}
}
\value{
No value returned, called for side effects.
}
\description{
Reads vcf files and writes genotypes of the most frequent alleles based on
chromosome positions to diem format.
}
\details{
Importing vcf files larger than 1 GB, or those containing multiallelic
genotypes, into R as a \code{vcfR} object is not recommended. Instead, provide the path to the
vcf file in \code{SNP}. \code{vcf2diem} then reads the file line by line, which is
the preferred approach for data conversion, especially for
very large and complex genomic datasets.

The number of files \code{vcf2diem} creates depends on the \code{chunk} argument
and class of the \code{SNP} object.
\itemize{
\item Values of \code{chunk < 100} are interpreted as the number of files into which to
split the data in \code{SNP}. For a \code{SNP} object of class \code{vcfR}, the number
of markers per file is calculated from the dimensions of \code{SNP}. When the class
of \code{SNP} is \code{character}, the number of markers per file is approximated
from a model, with an accompanying message. If this number of markers per file is
inappropriate for the expected output, provide the intended number of markers per
file as a value of \code{chunk} of at least 100 (values greater than 10000 are
recommended for genomic data).
\code{vcf2diem} will scan the whole input specified in \code{SNP}, creating
additional output files until the last line in \code{SNP} is reached.
\item Values of \code{chunk >= 100} mean that each output file
in diem format will contain \code{chunk} lines of data from \code{SNP}.
}
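
As a minimal sketch of the rule described in the list above, and not the package's
implementation, the interpretation of \code{chunk} could be written in base R. The helper
\code{interpret_chunk} is hypothetical and exists only for this illustration:
\preformatted{
# hypothetical helper, for illustration only; not part of diemr
interpret_chunk <- function(chunk) {
  if (chunk < 100) "number of files" else "markers per file"
}
interpret_chunk(3)      # returns "number of files"
interpret_chunk(10000)  # returns "markers per file"
}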

When the vcf file contains markers not informative for genome polarisation,
those are removed and listed in a file ending with \emph{omittedSites.txt} in the
directory specified in the \code{filename} argument or in the working directory.
The omitted loci are identified by their values in the CHROM and POS columns,
and include the QUAL column data. The last column is an integer specifying
the reason why the respective marker was omitted. The reasons markers are
not informative for genome polarisation using \code{diem} are:
\enumerate{
\item Marker has fewer than 2 alleles representing substitutions.
\item Required homozygous individuals for the two most frequent alleles are not present
(optional, controlled by the \code{requireHomozygous} argument).
\item The second most frequent allele is found only in one heterozygous individual.
\item Dataset is invariant for the most frequent allele.
\item Dataset is invariant for the allele listed as the first ALT in the vcf input.
}

The CHROM, POS, and QUAL information for loci included in the converted files is
listed in the file ending with \emph{includedSites.txt}. An additional column shows which
allele is encoded as 0 in its homozygous state and which is encoded as 2.

When \code{bed = TRUE}, both \emph{includedSites.txt} and \emph{omittedSites.txt} contain
simplified 0-based site coordinates in the standard 3-column BED format: chromosome,
start (POS - 1), and end (POS). All other columns described above are omitted in
this case.
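
For illustration, the coordinate conversion can be sketched in base R. The positions
and the chromosome name below are made-up values, not output of \code{vcf2diem}:
\preformatted{
# hypothetical 1-based POS values, as they would appear in a vcf file
pos <- c(1L, 150L, 2300L)
# 0-based start (POS - 1) and end (POS), as in 3-column BED
bed <- data.frame(chrom = "chr1", start = pos - 1L, end = pos)
bed$end - bed$start  # each site spans exactly one base
}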
}
\examples{
\dontrun{
# vcf2diem will write files to the working directory or a specified folder
# make sure the working directory or the folder is at a location with write permission
myofile <- system.file("extdata", "myotis.vcf", package = "diemr")

vcf2diem(SNP = myofile, filename = "test1")
vcf2diem(SNP = myofile, filename = "test2", chunk = 3)
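
# The calls below are a sketch that uses only the arguments documented above;
# the output names "test3" to "test5" are arbitrary choices for illustration.

# chunk >= 100 is interpreted as the number of markers per output file
vcf2diem(SNP = myofile, filename = "test3", chunk = 1000)

# write includedSites and omittedSites in 3-column BED format
vcf2diem(SNP = myofile, filename = "test4", bed = TRUE)

# do not require homozygous individuals for each allele
vcf2diem(SNP = myofile, filename = "test5", requireHomozygous = FALSE)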
}
}
\author{
Natalia Martinkova

Filip Jagos \href{mailto:521160@mail.muni.cz}{521160@mail.muni.cz}

Jachym Postulka \href{mailto:506194@mail.muni.cz}{506194@mail.muni.cz}
}
