# grobid-client-java **Repository Path**: sulifer/grobid-client-java ## Basic Information - **Project Name**: grobid-client-java - **Description**: No description available - **Primary Language**: Java - **License**: Not specified - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2023-10-16 - **Last Updated**: 2024-06-21 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Simple Java client for GROBID REST services This Java project can be used to process in an efficient concurrent manner a set of PDF in a given directory by the [GROBID](https://github.com/kermitt2/grobid) service. Results are written in a given output directory and include the resulting XML TEI representation of the PDF. ## Build and run You need first to install and start the *grobid* service, latest stable version, see the [documentation](http://grobid.readthedocs.io/). It is assumed that the server will run on the address `http://localhost:8070`. You can change the server address by editing the file `grobid-client.properties`. Build the client: > mvn clean install ## Usage Parameters: * `-in`: path to the directory containing the PDF files to process * `-out`: path to the directory where to put the results * `-exe`: GROBID service to be used, one of `[processFulltextDocument, processHeaderDocument, processReferences]` (default `processFulltextDocument`) * `-n`: concurrency for service usage (default 10) For example: > java -jar target/org.grobid.client-0.5.4-SNAPSHOT.one-jar.jar -in ~/tmp/in2 -out ~/tmp/out -exe processFulltextDocument This command will process all the PDF files present in the input directory (files with extension `.pdf` only) with the `processFulltextDocument` service of GROBID, and write the resulting XML TEI files under the output directory, reusing the file name with a different file extension (`.tei.xml`), using the default `10` concurrent workers. > java -jar target/org.grobid.client-0.5.4-SNAPSHOT.one-jar.jar -in ~/tmp/in2 -out ~/tmp/out -n 20 -exe processHeaderDocument This command will process all the PDF files present in the input directory (files with extension `.pdf` only) with the `processHeaderDocument` service of GROBID, and write the resulting XML TEI files under the output directory, reusing the file name with a different file extension (`.tei.xml`), using `20` concurrent workers. ## Benchmarking Full text processing of __136 PDF__ (total 3443 pages, in average 25 pages per PDF) on Intel Core i7-4790K CPU 4.00GHz, 4 cores (8 threads), 16GB memory, n being the concurrency parameter: | n | runtime (s)| s/PDF | PDF/s | |----|------------|-------|-------| | 1 | 210.9 | 1.55 | 0.64 | | 2 | 117.2 | 0.86 | 1.16 | | 3 | 85.5 | 0.63 | 1.59 | | 5 | 69.8 | 0.51 | 1.95 | | 8 | 60.9 | 0.45 | 2.23 | | 10 | 60.2 | 0.44 | 2.26 | ![Runtime Plot](resources/20180929081805.png) As complementary info, GROBID processing of header of the 136 PDF and with `n=10` takes 3.82 s (more than 15 times faster than the complete full text processing because only the two first pages of the PDF are considered), 35.60 PDF/s. In similar conditions, extraction and structuring of bibliographical references takes 27.9 s, 4.87 PDF/s. ## Todo Benchmarking with more files (e.g. million ISTEX PDF). Also implement existing GROBID services for text input (date, name, affiliation/address, raw bibliographical references, etc.). Better support for parameters (including elements where to put coordinates). ## License Distributed under [Apache 2.0 license](http://www.apache.org/licenses/LICENSE-2.0). Main author and contact: Patrice Lopez ()