BI296: Linux Programming For Bioinformatics

2018 Spring

[Exercise 0]

  1. Follow the instructions to install CentOS on your laptop, together with Windows 7/8/10.
  2. If you use EFI mainboard instead of BIOS/MBR, refer to this tutorial.
  3. Or instead you can use virtual machine to install CentOS.
  4. If you want to install CentOS under MacOS, use Docker instead.

[Exercise 1](Due: Mar 23, 2018)

  1. LinuxFun 8.8, 9.9, 10.7, 11.10, 12.9
  2. Exercise 1

[Exercise 2: Regular Expression](Due: April 8, 2018)

[Exercise 3: BASH](Due: April 28, 2017.)

  1. Convert chipseq.txt into processed.txt.
  2. Write a bash script to calculate the sequence length for each sequence in regions.fas. You should output the sequence name as well as its corresponding length.
  3. Summarize the usage of (), [], {} in bash.
  4. Use some examples to describe the usage of the bash command bc.
  5. Write a bash program to generate random DNA sequence of length 100, and with nucleotide compositions: A(0.30), T(0.30), C(0.20), G(0.20) or whatever you want.
  6. Generate the random sequences of length 200, 500, 1000, and 2000, with the same composition as above.
  7. Use histogram or barplot to test whether the random sequences are consistent with the preset composition.
  • Project 1: Sorting an Array in Different Algorithms
  • [Tutorials][Source Code Templates]

    You are required to pack the sorting algorithms into a Python module, so that you can call all the algorithms by importing the module. Besides sorting the numerical array, the module should be able to sort a string array as well.

  • Project 2: Image Compression Algorithms
  • Image Compression algorithms can be compared under categories of lossy and lossless techniques. The best image compression algorithm known for lossy compression is 'WebP'. This algorithm became popular in May, 2014 when Facebook started using it. Google's this new image compression algorithm provides 25-35% more compression as compared to JPEG. Though the algorithm is lossy in nature but the loss of information is far beyond the scope of human visual perception.

    However, when we talk about lossless compression algorithm that produces the smallest output regardless of speed, the 'best' algorithm is probably one of the PAQ-based context mining algorithms, which use a large number of independent context models to predict the next pixel in an image from neighboring pixels, followed by weighted averaging o the predictions and arithmetic coding. Some also use color transformations like (R, G, B)->(G, G-R, G-B) to decrease the correlation between colors. Here are three of the benchmarks:

    1. SqueezeChart
    2. Silesia Compression Benchmark
    3. Bitmap file-type compression test

  • Project 3: Sparse Matrices Representation and Computation
  • A sparse matrix is a matrix which contains only a very small proportion of non-zero elements, such as a 100-by-100 matrix with only 500 non-zero elements. Can you figure out a way to store the matrix in order to reduce the storage demand? And furthermore, define some functions for sparse matrix computation, e.g., computing the sum, difference, product of two sparse matrices, or transpose the matrix, computing the determinant, eigenvalues and eigenvectors.

  • Project 4: Encryption algorithms - Comparison
  • Project 5: Biological Webcrawler System
  • Project 6: Email Spam Detection Algorithms
  • Project 7: Deep Learning Framework
  • Project 8: Stochastic Sampling Methods
  • There are two principal types of sampling methods: permutation and bootstrapping. That is, sampling without replacement and with replacement, respectively. Here we need you to conduct these two sampling method on a vector/matrix

  • Project 9: Four Arithmetic Operations
  • Write a program to simulate the four arithmetic operations on two integers of arbitrary length. Here you may treat the very long-integers as a list of characters, and save them in the stacks or queues you learn in the lecture. Then conduct the adding, subtracting, and multiplication following the procedures you learn when you were a pupil. Division is not required.

  • Project 10: MPI Parallel Computing