Building a Web Page for a tesseract-ocr Text Recognition System [PHP]

Date: 2025-04-02 18:34:04


Here is the website I built:

OCR Text Recognition System


After installing tesseract-ocr, I found a way to call the OCR interface from PHP. There is a project on GitHub for exactly this:

/thiagoalessio/tesseract-ocr-for-php

It is a PHP wrapper around the command-line OCR interface.
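
To make "wrapping the command line" concrete, here is a minimal sketch of the idea. It is not the library's actual code: the function name buildTesseractCommand and the sample path are hypothetical, and it assumes the tesseract binary is on the $PATH. It relies on two tesseract CLI conventions: using stdout as the output base prints the text, and -l takes languages joined with "+".

```php
<?php
// Hypothetical helper for illustration only; the real library's internals may differ.
function buildTesseractCommand(string $imagePath, array $langs = []): string
{
    // "stdout" as the output base makes tesseract print the text instead of writing a file.
    $command = 'tesseract ' . escapeshellarg($imagePath) . ' stdout';
    if ($langs !== []) {
        // The tesseract CLI expects several languages joined with "+", e.g. eng+deu.
        $command .= ' -l ' . escapeshellarg(implode('+', $langs));
    }
    return $command;
}

// './upload/sample.png' is an assumed path used only for this example.
echo buildTesseractCommand('./upload/sample.png', ['eng', 'deu']), PHP_EOL;
```

Passing the resulting string to shell_exec() would return the recognized text; the wrapper library hides this step behind its run() method.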

I downloaded the source with Composer and then wrote a simple OCR website.

Use the following command to download it:

composer require thiagoalessio/tesseract_ocr


The usage of each function is shown below, taken from the project's README:

Tesseract OCR for PHP

A wrapper to work with Tesseract OCR inside PHP.

   

Installation

First of all, make sure you have Tesseract OCR installed. (v3.03 or greater)

As a composer dependency

{
    "require": {
        "thiagoalessio/tesseract_ocr": "1.0.0-RC"
    }
}

Usage

Basic usage

Given the following image ():

And the following code:

<?php
echo (new TesseractOCR(''))
    ->run();

The output would be:

The quick brown fox
jumps over the lazy
dog.

Other languages

Given the following image ():

And the following code:

<?php
echo (new TesseractOCR(''))
    ->run();

The output would be:

griifien

Which is not good, but defining a language:

<?php
echo (new TesseractOCR(''))
    ->lang('deu')
    ->run();

Will produce:

grüßen

Multiple languages

Given the following image ():

And the following code ....

<?php
echo (new TesseractOCR(''))
    ->lang('eng', 'jpn', 'por')
    ->run();

The output would be:

I eat 寿司 de maçã

Inducing recognition

Given the following image ():

And the following code ....

<?php
echo (new TesseractOCR(''))
    ->whitelist(range('A', 'Z'))
    ->run();

The output would be:

BOSS

API

->executable('/path/to/tesseract')

Define a custom location for the tesseract executable if, for any reason, it is not present in the $PATH.

->tessdataDir('/path')

Specify a custom location for the tessdata directory.

->userWords('/path/to/')

Specify the location of user words file.

This is a plain text file containing a list of words that you want tesseract to treat as normal dictionary words.

Useful when dealing with contents that contain technical terminology, jargon, etc.

Example of a user words file:

$ cat /path/to/
foo
bar
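
If the word list comes from your application, the file can be generated from PHP. The following is a small sketch, not part of the library: the helper name writeUserWords and the temp-file location are assumptions made for this example.

```php
<?php
// Hypothetical helper: writes one word per line, the format shown above.
function writeUserWords(array $words, string $path): bool
{
    // Trim each entry and drop empty ones so the file has no blank lines.
    $lines = array_filter(array_map('trim', $words), 'strlen');
    return file_put_contents($path, implode("\n", $lines) . "\n") !== false;
}

// Assumed location; any path readable by tesseract would do.
$path = sys_get_temp_dir() . '/user-words.txt';
writeUserWords(['foo', 'bar'], $path);
```

The resulting path can then be handed to ->userWords($path).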

->userPatterns('/path/to/')

Specify the location of user patterns file.

If the contents you are dealing with have known patterns, this option can greatly improve tesseract's recognition accuracy.

Example of a user patterns file:

$ cat /path/to/
1-\d\d\d-GOOG-441
www.\n\\\*.com

->lang('lang1', 'lang2', 'lang3')

Define one or more languages to be used during the recognition. A complete list of available languages can be found at /tesseract-ocr/tesseract/blob/master/doc/tesseract.#languages

Tip from @daijiale: Use the combination ->lang('chi_sim', 'chi_tra') for proper recognition of Chinese.

->psm(6)

Specify the Page Segmentation Mode, which instructs tesseract how to interpret the given image.

Possible psm values are:

 0 = Orientation and script detection (OSD) only.
 1 = Automatic page segmentation with OSD.
 2 = Automatic page segmentation, but no OSD, or OCR.
 3 = Fully automatic page segmentation, but no OSD. (Default)
 4 = Assume a single column of text of variable sizes.
 5 = Assume a single uniform block of vertically aligned text.
 6 = Assume a single uniform block of text.
 7 = Treat the image as a single text line.
 8 = Treat the image as a single word.
 9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
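
Bare numbers are easy to mix up, so one option is to give the modes you use readable names. This is only a sketch: the constant names and the checkedPsm helper are hypothetical, and the numbers are taken from the list above.

```php
<?php
// Hypothetical constant names; the values come from the psm list above.
const PSM_SINGLE_COLUMN = 4;
const PSM_SINGLE_BLOCK  = 6;
const PSM_SINGLE_LINE   = 7;
const PSM_SINGLE_WORD   = 8;
const PSM_SINGLE_CHAR   = 10;

// Rejects values outside the 0-10 range listed above.
function checkedPsm(int $psm): int
{
    if ($psm < 0 || $psm > 10) {
        throw new InvalidArgumentException("Unknown page segmentation mode: $psm");
    }
    return $psm;
}

echo checkedPsm(PSM_SINGLE_LINE), PHP_EOL; // prints 7
```

A call would then read ->psm(checkedPsm(PSM_SINGLE_LINE)) instead of ->psm(7).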

->config('configvar', 'value')

Tesseract offers incredible control to the user through its 660 configuration vars.

You can see the complete list by running the following command:

$ tesseract --print-parameters
Tesseract parameters:
... long list with all parameters ...

->whitelist(range('a', 'z'), range(0, 9), '-_@')

This is a shortcut for ->config('tessedit_char_whitelist', 'abcdef....').
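
To see what string such a call expands to, the arguments can be flattened by hand. The helper below is a hypothetical illustration of that expansion, not the library's own implementation.

```php
<?php
// Hypothetical helper, not part of the library: flattens mixed
// array/string arguments into one string of allowed characters.
function flattenWhitelist(...$args): string
{
    $chars = '';
    foreach ($args as $arg) {
        // range() returns an array; plain strings are appended as-is.
        $chars .= is_array($arg) ? implode('', $arg) : (string) $arg;
    }
    return $chars;
}

// Prints: abcdefghijklmnopqrstuvwxyz0123456789-_@
echo flattenWhitelist(range('a', 'z'), range(0, 9), '-_@'), PHP_EOL;
```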

Where to get help

  • #tesseract-ocr-for-php on freenode IRC

License

Apache License 2.0.


The next step is to write the PHP web page:


I used Bootstrap for the layout and a form for the file upload.

Here is the code:

<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<title>OCR Text Recognition System</title>
<!-- Bootstrap core CSS file -->
<link rel="stylesheet" href="/bootstrap/3.3.0/css/">
<!-- Optional Bootstrap theme file (usually not needed) -->
<link rel="stylesheet" href="/bootstrap/3.3.0/css/">
<!-- jQuery file. Be sure to include it before Bootstrap's JavaScript -->
<script src="/jquery/1.11.1/"></script>

<!-- Latest Bootstrap core JavaScript file -->
<script src="/bootstrap/3.3.0/js/"></script>


<style type="text/css">
  .form{
    position:absolute;
    left:600px;
    top:100px 

  }
  .image{
    position:absolute;
    left:10px;
    top:60px 
  }
  .retext{
    position:absolute;
    top:370px 
  }

  .body{
    background-image: url("./img/");
  }
  .text{
    text-align: center;       
  }
</style>
</head>

<body class = "body">
<?php // $file_path = './img/'; ?>
<nav class="navbar navbar-inverse" role="navigation">
  <div class="container-fluid">
    <!-- Brand and toggle get grouped for better mobile display -->
    <div class="navbar-header">
      <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
        <span class="sr-only">Toggle navigation</span>
        <span class="icon-bar"></span>
        <span class="icon-bar"></span>
        <span class="icon-bar"></span>
      </button>
      <a class="navbar-brand" href="">OCR Text Recognition System</a>
    </div>

    <!-- Collect the nav links, forms, and other content for toggling -->
    <div class="collapse navbar-collapse" >
      <!--<ul class="nav navbar-nav">
        <li class="active"><a href="#">系统介绍</a></li>
        <li class="dropdown">
          <a href="#" class="dropdown-toggle" data-toggle="dropdown">Dropdown <span class="caret"></span></a>
          <ul class="dropdown-menu" role="menu">
            <li><a href="#">Action</a></li>
            <li><a href="#">Another action</a></li>
            <li><a href="#">Something else here</a></li>
            <li class="divider"></li>
            <li><a href="#">Separated link</a></li>
            <li class="divider"></li>
            <li><a href="#">One more separated link</a></li>
          </ul>
        </li>
      </ul>
-->
      <!--<form class="navbar-form navbar-left" role="search">
        <div class="form-group">
          <input type="text" class="form-control" placeholder="Search">
        </div>
        <button type="submit" class="btn btn-default">Submit</button>
      </form>
      -->
      <ul class="nav navbar-nav navbar-right">
        <li><a href="#">About the System</a></li>
        <li><a href="#">About the Lab</a></li>
        <li class="dropdown">
          <a href="#" class="dropdown-toggle" data-toggle="dropdown">References <span class="caret"></span></a>
          <ul class="dropdown-menu" role="menu">
            <li><a href="/thiagoalessio/tesseract-ocr-for-php">tesseract-ocr-for-php</a></li>
            <li><a href="/">Bootstrap</a></li>
            <li><a href="http:///">W3School</a></li>
            <li class="divider"></li>
            <li><a href="/tesseract-ocr/tesseract">tesseract-ocr</a></li>
          </ul>
        </li>
      </ul>
    </div><!-- /.navbar-collapse -->
  </div><!-- /.container-fluid -->
</nav>





<div class = "form">
  <form  action="" method="post" enctype="multipart/form-data">
    <label for="file">Upload image:</label>
      <input type="file" name="file" id="file" /> 
      <br />
      <p>Select the language(s) (multiple choices allowed):</p> 
      <div class="checkbox">
        <label>
         <input type="checkbox" value="English" name ="mrbook[]" >
         English
        </label>
      </div>
      <div class="checkbox">
       <label>
        <input type="checkbox" value="中文" name ="mrbook[]">
        中文
       </label>
     </div>

     <div class="checkbox">
     <label>
       <input type="checkbox" value="Deutsch" name ="mrbook[]">
       Deutsch
     </label>
     </div>
     <div class="checkbox">
       <label>
       <input type="checkbox" value="한국의" name ="mrbook[]">
         한국의
       </label>
     </div>
      <input type="submit" name="submit" class="btn btn-success" value="Submit" />
</form>
</div>


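<?php

    // Map the checkbox values from the form to Tesseract language codes.
    // Each code needs its traineddata file installed on the server.
    $lang_map = array(
        'English' => 'eng',
        '中文'    => 'chi_sim',
        'Deutsch' => 'deu',
        '한국의'  => 'kor',
    );
    $file_path = '';
    $string = '';

    if(!empty($_FILES['file']) && $_FILES['file']['error'] === UPLOAD_ERR_OK)
    {
        // basename() drops any directory part of the client-supplied name.
        $file_path = sprintf("./upload/%s", basename($_FILES['file']['name']));
        if(!move_uploaded_file($_FILES['file']['tmp_name'], $file_path))
        {
            echo 'Failed to save the uploaded file.';
            $file_path = '';
        }
    }

    // Run OCR only when a file was actually uploaded and saved.
    if($file_path !== '')
    {
        $lan_types = array();
        $selected = isset($_POST['mrbook']) ? (array)$_POST['mrbook'] : array();
        foreach($selected as $name)
            if(isset($lang_map[$name]))
                $lan_types[] = $lang_map[$name];

        require '../vendor/';
        $ocr = new \TesseractOCR($file_path);
        // lang() takes one argument per language, so unpack the array.
        if(!empty($lan_types))
            $ocr = $ocr->lang(...$lan_types);
        $string = $ocr->run();
    }

?>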
<div class = "image">
   <p>
    <img src="<?php echo htmlspecialchars(!empty($file_path) ? $file_path : './img/'); ?>" width="440" height="300" />
   </p>
</div>

<div class = "retext" style ="width :100%">
<p>Recognition result:</p>
<textarea class="form-control" rows="10"><?php echo htmlspecialchars($string); ?></textarea>
</div>



<!--
<p>
<img src="./img/" width="128" height="128" />
</p>
-->
<?php
/*require '../vendor/';
$ocr = new \TesseractOCR('./img/');
$string = $ocr->run();
echo $string;
 */
?>


</body>
</html>
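
The upload handling in the page above trusts the client's file name. One way to tighten it is to accept only image extensions and store the file under a generated name. The following is a rough sketch under stated assumptions: the helper name safeUploadPath, the extension whitelist, and the ./upload directory default are all choices made for this example.

```php
<?php
// Hypothetical helper: returns a safe destination path,
// or null if the file type is not allowed.
function safeUploadPath(string $originalName, string $dir = './upload'): ?string
{
    // Assumed whitelist of image extensions; adjust it to what your tesseract build reads.
    $allowed = ['png', 'jpg', 'jpeg', 'tif', 'tiff', 'bmp'];
    $ext = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));
    if (!in_array($ext, $allowed, true)) {
        return null;
    }
    // A random name avoids collisions and ignores whatever path the client sent.
    return rtrim($dir, '/') . '/' . bin2hex(random_bytes(8)) . '.' . $ext;
}

var_dump(safeUploadPath('shell.php'));      // NULL: extension not allowed
echo safeUploadPath('photo.PNG'), PHP_EOL;  // e.g. ./upload/<16 hex chars>.png
```

The returned path could be used as the second argument of move_uploaded_file() in place of the name built from $_FILES['file']['name'].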


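I will keep improving it later. The code is simple, so I won't go into more detail.
It is not written very well, so please point out any shortcomings ^.^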