Style Sampler

Layout Style

Patterns for Boxed Mode

Backgrounds for Boxed Mode

Search News Posts

  • General Inquiries 1-888-555-5555

  • Support 1-888-555-5555

anaplatform Data Consultancy
Genome Sequencing

Transforming healthcare through the power of data analytics and genetics

Genome Sequencing

Case Study: Genome Sequencing

In this case study, we provide an example problem with genome sequencing data in FASTA format, which is a common file format used for genome sequencing data.

Background:

A leading insurance company has been struggling to accurately predict the income of their potential customers. They have been relying on traditional methods such as credit scores and occupation, but have noticed that these metrics are not always reliable indicators of income. The company has a large dataset of customer information, including personal and professional details. They have approached a data science consulting firm to help them build a machine learning model that can predict income more accurately.

Problem Statement:

A team of researchers was interested in studying the genomes of a specific type of bacteria that is known to cause food poisoning in humans. They collected genome sequencing data from several strains of the bacteria and wanted to analyze the data to identify any genetic factors that may contribute to the bacteria's virulence.

The first step in their analysis was to assemble the genome sequences. Genome assembly involves piecing together the small fragments of DNA that are generated during sequencing into a complete genome. The researchers used specialized software to perform the assembly and generated high-quality genomes for each of the bacterial strains.

Next, the researchers used a variety of data analytics techniques to compare the genomes of the different strains. They identified genetic variations that were unique to the virulent strains of the bacteria and used statistical analysis to determine if these variations were significant. They also looked for patterns in the data that could help explain why some strains of the bacteria are more virulent than others.

One interesting finding was that the virulent strains of the bacteria had a unique set of genes that were not present in the non-virulent strains. These genes were related to the production of toxins that are known to cause food poisoning in humans. The researchers hypothesized that these genes may be responsible for the increased virulence of the bacteria. To further test their hypothesis, the researchers used machine learning algorithms to predict the virulence of new strains of the bacteria based on their genome sequences. They trained the algorithms on the genome sequences of the previously studied strains and used the resulting models to predict the virulence of new strains that had not yet been tested in the lab. The models were able to accurately predict the virulence of these new strains based solely on their genome sequences.

Data:

The data obtained from Genome research laboratory which includes real DNA sequencing reads derived from a human. Data includes record in a FASTA file is defined as a single-line header, followed by lines of sequence data.

First we will find answer to these questions
  • How many records are in the file?
  • What are the lengths of the sequences in the file?
  • What is the longest sequence and what is the shortest sequence?
  • Is there more than one longest or shortest sequence?
  • What are their identifiers?
  • A record in a FASTA file is defined as a single-line header, followed by lines of sequence data. The header line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is an optional description of the entry. There should be no space between the ">" and the first letter of the identifier.

    How many records are in the file?

    An example sequence in FASTA format is:

    AB000263 |acc=AB000263|descr=Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.|len=368 ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG TTTAATTACAGACCTGAA

    Numerical Variables:
    • Age
    • Income
    • Education level
    • Credit score
    • Number of dependents
    • Home value
    • Net worth
    • Debt-to-income ratio
    Methodology:

    reads genome sequencing data from a FASTA file, prints the number of sequences in the file, and writes a new FASTA file containing only sequences longer than 1000 base pairs.

    DNA sequencing algorithms

    We use several DNA sequencing algorithms including:

  • Generic Naive Approximate Matching allows up to n mismatches
  • Boyer-Moore Pre-processing with approximate matching
  • Pigeon Hole with Boyer-Moore and approximate matching
  • Pigeon Hole with K-Mer Index and approximate matching
    Results:

    The Boyer-Moore model was able to produce genome sequencing with a MAE of 0.0008458 and RMSE of 14,810. These values were significantly lower than the baseline model, which had a MAE of 0.0008458 and RMSE of 14,810. The firm also used feature importance techniques to identify the most important variables in predicting income, which were education level, occupation, credit score, and age.

    Model

    MAE

    RMSE

    Generic Naive Approximate Matching allows up to n mismatches

    0.0008458

    14,810

    Boyer-Moore Pre-processing with approximate matching

    0.0008458

    11,215

    Pigeon Hole with Boyer-Moore and approximate matching

    0.0008458

    14,810

    Pigeon Hole with K-Mer Index and approximate matching

    0.0007258

    11,215

    Conclusion:

    The XGBoost model developed by the data science consulting firm was able to predict income more accurately than the traditional methods used by the insurance company. The model can be used to target marketing campaigns towards customers who are likely to have a higher income, increasing the chances of selling premium products. The insurance company can also use the insights gained from the feature importance analysis to make more informed decisions in their business strategy.

    Have a question ?

    Are you looking to create a lasting impact with your data analytics? Contact us to create them in hours.