# table_ocr_java
**Repository Path**: vieri111/table_ocr_java
## Basic Information
- **Project Name**: table_ocr_java
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-29
- **Last Updated**: 2025-10-29
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TABLE DETECTION IN IMAGES AND OCR TO CSV WITH JAVA
Yan Shi
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Guide
1. [Overview](#Overview)
2. [Requirements](#Requirements)
3. [Demo](#Demo)
4. [Modules](#Modules)
5. [Contact](#Contact)
6. [Reference](#Reference)
# Overview
This java package contains modules to help with finding and extracting tabular
data from a PDF or image into a CSV format.
Given an image that contains a table…
Extract the the text into a CSV format…
节次 星期,周—,周二,周三,周四,周五
一,语文,英语,英语,自然,数学
二,语文,英语,英语,语文,数学
三,数学,语文,数学,语文,英语
四,数学,语文,数学,体育,英语
五,体育,思想品德,语文,数学,手工
六,美术,音乐,语文,数学,写字
# Requirements
See maven dependency jar package.
- `pdfbox` 2.0.26
- `javacv` 1.5.7
- `djl` 0.17.0
- ...
# Demo
There is a demo module that will try to extract tables from the image and process the cells into a CSV. You can try it out with one of the images included in this repo.
1.table/demo/MainDemo
That will run against the following image:
The following should be saved to your directory after running the class table/demo/MainDemo.java.
Extracted the following tables from the image:
[('/img_test/simple.png', ['/img_test/simple/table-0.png'])]
Extracted cells from /img_test/simple/table-0.png
Cells:
/img_test/simple/table-0/0-0.png
/img_test/simple/table-0/0-1.png
/img_test/simple/table-0/0-2.png
...
Here is the entire CSV output:
Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4
# Modules
The package is split into modules with narrow focuses.
- `pdf_to_images` uses pdfbox to extract images from a PDF.
- `extract_tables` finds and extracts table-looking things from an image.
- `extract_tables_dnn` finds and extracts table-looking things from an image by deep learning model.
- `extract_cells` extracts and orders cells from a table.
- `ocr_image` uses djl to OCR the text from an image of a cell.
- `ocr_to_csv` converts into a CSV the directory structure that `ocr_image` outputs.
The outputs of a previous module can be used by a subsequent module so that they
can be chained together to create the entire workflow.
# Contact
1、github:https://github.com/jiangnanboy
2、QQ:2229029156
3、Email:2229029156@qq.com
# Reference
https://github.com/jiangnanboy/doc_ai
https://github.com/deepjavalibrary/djl
https://github.com/jiangnanboy/java-springboot-paddleocr
https://github.com/jiangnanboy/layout_analysis4j