Getting Started with Apache Tika

Reference: https://tika.apache.org/

Apache Tika detects and extracts content and metadata from many file types (PDF, DOC, TXT, SPSS, PPT, etc.). This example shows how to extract the content and metadata of a PDF file using Apache Tika.

Prerequisites: JDK 1.5+, IntelliJ or Eclipse. Download Tika from https://tika.apache.org/download.html, then run: java -jar tika-app-1.14.jar

The above command opens a GUI window where we can add the PDF file whose content needs to be extracted. If the PDF file is large, write a Java program using the Tika API instead, like the sample code below.

Steps:

– Open the IntelliJ/Eclipse IDE and create a new Java project named Tikasample
– Create a new class “TikaPdfExtractor” as below

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TikaPdfExtractor {

    public static void main(final String[] args) throws IOException, TikaException {
        if (args.length < 1) {
            System.err.println("Usage: java -jar TikaPdfExtractor.jar <path of the pdf file>");
            return;
        }

        // -1 disables the write limit (the default is 100,000 characters),
        // so large PDFs can be extracted in full
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        ParseContext pctx = new ParseContext();

        FileInputStream inputstream = new FileInputStream(new File(args[0]));
        try {
            // parsing the document
            PDFParser pdfparser = new PDFParser();
            pdfparser.parse(inputstream, handler, metadata, pctx);
            String plaintext = handler.toString();
            System.out.println("Contents of the PDF:" + plaintext);

            // metadata of the document
            System.out.println("Metadata of the PDF:");
            String[] metadataNames = metadata.names();

            for (String name : metadataNames) {
                System.out.println(name + " : " + metadata.get(name));
            }
        } catch (SAXException e) {
            e.printStackTrace();
        } finally {
            // always release the file handle, even if parsing fails
            inputstream.close();
        }
    }
}
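
For comparison, here is a minimal sketch of the same extraction using Tika's facade class, org.apache.tika.Tika, which detects the file type itself and so is not tied to PDFs. The class name TikaFacadeSketch and the helper methods extractText and detectType are hypothetical names chosen for illustration, and the sketch assumes the tika-app jar is on the classpath.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical class name; not part of the original tutorial.
public class TikaFacadeSketch {

    // Extracts plain text from any file type Tika recognizes.
    static String extractText(File file) throws IOException, TikaException {
        Tika tika = new Tika();
        // -1 lifts the facade's default limit of 100,000 characters
        tika.setMaxStringLength(-1);
        return tika.parseToString(file);
    }

    // Returns the detected media type, for example "application/pdf".
    static String detectType(File file) throws IOException {
        return new Tika().detect(file);
    }

    public static void main(String[] args) throws IOException, TikaException {
        File file = new File(args[0]);
        System.out.println("Detected type: " + detectType(file));
        System.out.println(extractText(file));
    }
}
```

The facade trades control for brevity: it returns only the text, so the explicit PDFParser approach is still the one to use when the metadata is needed as well.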

– Include tika-app-x.jar in the project libraries, then click Build to create the jar file
– The “TikaPdfExtractor.jar” file will be created under the out>>artifacts>>TikaPdfExtractor_jar folder
– Execute “TikaPdfExtractor.jar” using the command: java -jar TikaPdfExtractor.jar <path of the pdf file>
