Saturday, November 12, 2011

Calicut University result summary using Shell & Python scripts

I wanted to create a summary of my class's university exam results, with details like a rank list, the number of students within different SGPA ranges, and the subject-wise performance of students. I wrote a shell script for the first two and a Python script for the first and last. Both scripts work by crawling through the individual results of the students. Here is the link

wget and pdftotext are the most important tools I used in the shell script. wget is a tool for downloading web pages. The Calicut University website blocks requests from non-browser user agents such as wget, so spoofing is required to access the result pages: we make the website believe that a browser, rather than wget, is accessing it. This kind of spoofing is easily done with the command: wget -U browsername URL. That is how we download the individual result pages.
Tip: when used with the -r switch, wget downloads web pages recursively.
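
The same user-agent spoofing can be sketched in Python. This is only a rough equivalent of wget -U, not the script I actually used: it relies on Python 3's urllib.request, and the "Mozilla/5.0" string and the function names are assumptions chosen for illustration.

```python
import urllib.request

# Assumed browser-like identifier; any common browser string should work.
BROWSER_UA = "Mozilla/5.0"


def build_request(url, user_agent=BROWSER_UA):
    """Create a request that presents itself as a browser, not a script."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, dest_path, user_agent=BROWSER_UA):
    """Fetch url with the spoofed User-Agent and save the raw bytes."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        data = response.read()
    with open(dest_path, "wb") as out:
        out.write(data)
    return len(data)
```

Calling download(result_url, "result.pdf") for each student's result URL would then save the pages one by one.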

Calicut University, for its part, publishes the individual results in PDF format. Since creating the summary requires text processing, the PDF files must first be converted to text. That is where pdftotext comes in handy. As the name suggests, it converts PDF files to text files. After downloading the individual results and converting them to text, the summary is just a matter of some text processing.
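
As a rough illustration of the conversion step, here is a small Python wrapper around pdftotext. It assumes pdftotext is installed and on the PATH; the helper names and the choice of the -layout option (which tries to keep the original page layout) are my own assumptions, not part of the original script.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path):
    """Build the pdftotext command that writes a .txt file next to the PDF."""
    pdf = Path(pdf_path)
    txt = pdf.with_suffix(".txt")
    return ["pdftotext", "-layout", str(pdf), str(txt)]


def convert_all(directory):
    """Convert every PDF in directory and return the text file paths."""
    converted = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        cmd = pdftotext_command(pdf)
        # check=True raises CalledProcessError if pdftotext fails.
        subprocess.run(cmd, check=True)
        converted.append(Path(cmd[-1]))
    return converted
```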

To crawl through the individual results, I took the help of Python's urllib module. The challenge in this script, too, was converting the PDF files to text files. Here the Python pdfminer module helped me. Along with pdfminer, a command-line program gets installed, and this program can be used to convert a PDF file into a text file. As stated earlier, the remaining tasks are all just text processing.
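
To give an idea of what that text processing looks like, here is a minimal sketch that builds a rank list and counts students per SGPA band. The "SGPA : 8.25" line format is purely hypothetical (the real result sheets may label it differently), and the function names are made up for this example.

```python
import re
from collections import Counter

# Hypothetical pattern: assumes the converted text contains "SGPA" followed
# by a number, e.g. "SGPA : 8.25". Adjust it to the actual result layout.
SGPA_PATTERN = re.compile(r"SGPA\s*[:=]?\s*(\d+(?:\.\d+)?)")


def extract_sgpa(text):
    """Return the first SGPA found in the text, or None if there is none."""
    match = SGPA_PATTERN.search(text)
    return float(match.group(1)) if match else None


def rank_list(sgpa_by_student):
    """Sort (student, sgpa) pairs from the highest SGPA to the lowest."""
    return sorted(sgpa_by_student.items(), key=lambda item: item[1], reverse=True)


def sgpa_bands(sgpa_by_student, width=1.0):
    """Count students per band, keyed by the band's lower limit."""
    counts = Counter()
    for sgpa in sgpa_by_student.values():
        lower = int(sgpa // width) * width
        counts[lower] += 1
    return dict(sorted(counts.items()))
```

Feeding each converted text file through extract_sgpa gives a dictionary of student to SGPA, which the other two functions then summarize.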