SHELL SCRIPT
wget and pdftotext are the most important tools I used in this shell script. wget is a tool for downloading web pages. The Calicut University website blocks requests from non-browser user agents such as wget, so accessing the result pages requires spoofing the user agent: we make the website believe that a browser, rather than wget, is accessing it. This is easily done with the command: wget -U browsername URL. That is how we download the individual result pages.
Tip: when used with the -r switch, wget downloads web pages recursively.
Calicut University, for its part, publishes the individual results in PDF format. Since creating a summary requires text processing, the PDF files must first be converted to text. That is where pdftotext comes in handy. As the name suggests, it converts PDF files to plain-text files. After downloading the individual results and converting them to text, producing the summary is just a matter of text processing.
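The two steps described above (download with a spoofed user agent, then convert the PDF to text) can be sketched roughly as follows. This is only an illustration, written in Python for consistency with the later section rather than as the actual shell script. The URL, the browser string, and the function names are all hypothetical placeholders, and it assumes wget and pdftotext are installed and on the PATH.

```python
import subprocess

# Hypothetical placeholders: the post gives neither the result URL
# nor the browser string that was actually used.
RESULT_URL = "http://example.com/result.pdf"
USER_AGENT = "Mozilla/5.0"


def wget_command(url, user_agent, pdf_path):
    """Build a wget call that sends a browser-like User-Agent.

    -U sets the User-Agent header, -O names the output file.
    """
    return ["wget", "-U", user_agent, "-O", pdf_path, url]


def pdftotext_command(pdf_path, txt_path):
    """Build a pdftotext call that converts pdf_path into txt_path."""
    return ["pdftotext", pdf_path, txt_path]


def fetch_and_convert(url, user_agent, pdf_path, txt_path):
    """Download one result PDF, then convert it to plain text."""
    subprocess.run(wget_command(url, user_agent, pdf_path), check=True)
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
```

Calling fetch_and_convert(RESULT_URL, USER_AGENT, "result.pdf", "result.txt") would leave result.txt ready for the text-processing step. Keeping the command construction in separate functions makes it easy to inspect the exact command line before anything is run.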
PYTHON SCRIPT
To crawl through the individual results, I used Python's urllib module. The challenge in this script, too, was converting the PDF files to text. Here the Python pdfminer module helped me. Installing pdfminer also installs a program called pdf2txt.py, which can be used to convert a PDF file into a text file. As stated earlier, the remaining task is just text processing.
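A minimal sketch of the Python approach might look like the following. It is not the original script: it assumes Python 3 (where the relevant module is urllib.request), assumes the site's user-agent restriction also has to be worked around here, and uses a hypothetical browser string and hypothetical function names. The conversion step simply calls the pdf2txt.py program that ships with pdfminer, using its -o option to name the output file.

```python
import subprocess
import urllib.request

# Hypothetical browser string; the post does not say which one was used.
USER_AGENT = "Mozilla/5.0"


def build_request(url, user_agent=USER_AGENT):
    """Create a request that identifies itself as a browser."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, pdf_path, user_agent=USER_AGENT):
    """Fetch url and save the response body to pdf_path."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        data = response.read()
    with open(pdf_path, "wb") as pdf_file:
        pdf_file.write(data)


def pdf2txt_command(pdf_path, txt_path):
    """Build the pdf2txt.py call that writes the extracted text to txt_path."""
    return ["pdf2txt.py", "-o", txt_path, pdf_path]


def convert(pdf_path, txt_path):
    """Run pdf2txt.py (installed with pdfminer) on a downloaded PDF."""
    subprocess.run(pdf2txt_command(pdf_path, txt_path), check=True)
```

Looping download and convert over the list of result URLs would then produce one text file per result for the summary step.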