Winter (2016) Project Report

Prof. Maya Ramanath


Introduction

This project aims to find (or guess) the definitions of technical terms, using e-books as the source. The e-books are in PDF format. The implementation language is Java, and the PDFs are parsed with the PDFBox library.

Progress

30th November

Assumptions:

The PDFs to be parsed contain bookmarks that point to pages such as the Preface, the Contents, the beginning of each chapter, the Index, etc.


Tasks completed:


Result-

Command Line Output


Some Challenges:


1st December

Tasks completed:

Result-

Command Line Output

What to do next

Remove headers and footers from pages.

Identify the paragraph containing the term.
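
The two steps above could be approached roughly as follows. This is only a minimal sketch and not the project's code: it assumes each page has already been extracted to a plain string, that a header or footer is the first or last line of a page and recurs (apart from page numbers) on at least half of the pages, and that paragraphs are separated by blank lines, which extracted PDF text does not always guarantee. The class name PageCleaner and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the project code.
class PageCleaner {

    // Mask digits so that "Page 12" and "Page 13" count as the same line.
    static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }

    // Assumed threshold: the line occurs on at least two pages
    // and on at least half of all pages.
    static boolean isRepeated(Map<String, Integer> counts, String line, int pageCount) {
        int n = counts.getOrDefault(normalize(line), 0);
        return n >= 2 && n * 2 >= pageCount;
    }

    // Removes the first and/or last line of each page when that line repeats across pages.
    static List<String> stripHeadersAndFooters(List<String> pages) {
        // First pass: count how often each first/last line occurs.
        Map<String, Integer> counts = new HashMap<>();
        for (String page : pages) {
            String[] lines = page.split("\n");
            if (lines.length == 0) {
                continue; // page consisted only of line breaks
            }
            counts.merge(normalize(lines[0]), 1, Integer::sum);
            if (lines.length > 1) {
                counts.merge(normalize(lines[lines.length - 1]), 1, Integer::sum);
            }
        }
        // Second pass: drop the repeated first/last lines.
        List<String> cleaned = new ArrayList<>();
        for (String page : pages) {
            List<String> lines = new ArrayList<>(Arrays.asList(page.split("\n")));
            if (!lines.isEmpty() && isRepeated(counts, lines.get(0), pages.size())) {
                lines.remove(0);
            }
            if (!lines.isEmpty() && isRepeated(counts, lines.get(lines.size() - 1), pages.size())) {
                lines.remove(lines.size() - 1);
            }
            cleaned.add(String.join("\n", lines));
        }
        return cleaned;
    }

    // Returns every blank-line-separated paragraph that mentions the term (case-insensitive).
    static List<String> paragraphsContaining(String text, String term) {
        List<String> hits = new ArrayList<>();
        String needle = term.toLowerCase();
        for (String para : text.split("\\n\\s*\\n")) {
            String flat = para.replaceAll("\\s+", " ").trim();
            if (flat.toLowerCase().contains(needle)) {
                hits.add(flat);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
            "Chapter 1 Graphs\nA graph is a set of vertices and edges.\n\nTrees are graphs.\n12",
            "Chapter 1 Graphs\nA tree is a connected acyclic graph.\n13");
        for (String page : stripHeadersAndFooters(pages)) {
            System.out.println(paragraphsContaining(page, "tree"));
        }
    }
}
```

Masking digits before counting lets a footer that holds only the page number be recognised as repeating. The substring match in paragraphsContaining is deliberately loose (searching for "tree" also matches "Trees"); a stricter word-boundary match may be preferable for short terms.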


2nd December


Task Completed

Result-

Command Line Output


3rd December


Task Completed

Result-

Command Line Output



4th December


Task Completed

Result-

Command Line Output


7th December


Task Completed

What to do next

Find more patterns in which definitions can be found.

Try to generalize the code so that it works with multiple books.

Explore the possibility of using a POS (part-of-speech) parser.
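
To make the idea of definition patterns concrete, the sketch below checks each sentence against a few common English definition phrasings ("X is a ...", "X is defined as ...", "X refers to ...", "... is called X"). These four patterns are illustrative assumptions and not necessarily the ones the project uses; the class name DefinitionPatterns and the method findDefinitions are hypothetical, and the sentence splitting is naive (it breaks after '.', '!' or '?', so abbreviations such as "e.g." will split a sentence wrongly).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the project code.
class DefinitionPatterns {

    // Returns the sentences of the text that match one of the assumed definition patterns.
    static List<String> findDefinitions(String text, String term) {
        String t = Pattern.quote(term); // treat the term literally inside the regex
        int flags = Pattern.CASE_INSENSITIVE;
        List<Pattern> patterns = new ArrayList<>();
        // "<term> is a ...", "<term> are the ..."
        patterns.add(Pattern.compile("\\b" + t + "\\s+(?:is|are)\\s+(?:an?|the)\\b", flags));
        // "<term> is defined as ..."
        patterns.add(Pattern.compile("\\b" + t + "\\s+(?:is|are)\\s+defined\\s+as\\b", flags));
        // "<term> refers to ..."
        patterns.add(Pattern.compile("\\b" + t + "\\s+refers\\s+to\\b", flags));
        // "... is called a <term>"
        patterns.add(Pattern.compile(
            "\\b(?:is|are)\\s+called\\s+(?:an?\\s+|the\\s+)?" + t + "\\b", flags));

        List<String> definitions = new ArrayList<>();
        // Naive sentence split: break on whitespace that follows '.', '!' or '?'.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            String s = sentence.trim();
            for (Pattern p : patterns) {
                if (p.matcher(s).find()) {
                    definitions.add(s);
                    break; // one matching pattern is enough for this sentence
                }
            }
        }
        return definitions;
    }

    public static void main(String[] args) {
        String text = "A tree is a connected acyclic graph. Trees are everywhere. "
            + "A connected graph without cycles is called a tree.";
        for (String d : findDefinitions(text, "tree")) {
            System.out.println(d);
        }
    }
}
```

Keeping the patterns in a list makes it easy to add new phrasings as they are found in other books. A part-of-speech parser could later replace or complement the regular expressions for sentences that these surface patterns miss.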


Result-

Command Line Output