@kun432
Last active August 23, 2024 09:08
Zipstack/unstract issue #595

Environment Info

os

$ docker version
Client:
 Version:           27.1.1
 API version:       1.46
 Go version:        go1.21.12
 Git commit:        6312585
 Built:             Tue Jul 23 19:54:12 2024
 OS/Arch:           darwin/arm64
 Context:           desktop-linux

Server: Docker Desktop 4.33.0 (160616)
 Engine:
  Version:          27.1.1
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.12
  Git commit:       cc13f95
  Built:            Tue Jul 23 19:57:14 2024
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.7.19
  GitCommit:        2bf793ef6dc9a18e00cb12efb64355c2c9d5eb41
 runc:
  Version:          1.1.13
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
$ docker ps -a
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

$ docker images
REPOSITORY   TAG       IMAGE ID   CREATED   SIZE

$ docker volume ls
DRIVER    VOLUME NAME

$ 

Install

$ git clone https://github.com/Zipstack/unstract unstract-test && cd unstract-test

I don't think this matters when using Docker, but run-platform.sh calls Python internally, so it may be relevant.

$ which python
/Users/kun432/.pyenv/shims/python

$ cat .python-version
3.9.6

$ python --version
Python 3.9.6
$ ./run-platform.sh
Once the services are up, visit http://frontend.unstract.localhost in your browser.

See logs with:
    docker compose -f docker/docker-compose.yaml logs -f
Configure services by updating corresponding <service>/.env files.
Make sure to restart the services with:
    docker compose -f docker/docker-compose.yaml up -d

###################### BACKUP ENCRYPTION KEY ######################
Copy the value of ENCRYPTION_KEY in any of the following env files
to a secure location:

- backend/.env
- platform-service/.env

Aapter credentials are encrypted by the platform using this key.
Its loss or change will make all existing adapters inaccessible!
###################################################################
$ docker ps
CONTAINER ID   IMAGE                              COMMAND                   CREATED          STATUS          PORTS                                            NAMES
68f14c14c6d7   unstract/frontend:latest           "/docker-entrypoint.…"   46 minutes ago   Up 46 minutes   80/tcp, 0.0.0.0:3000->3000/tcp                   unstract-frontend
16c373bdb4e4   unstract/backend:latest            "./entrypoint.sh mig…"   46 minutes ago   Up 46 minutes   0.0.0.0:8000->8000/tcp                           unstract-backend
988bf2eb71d7   unstract/worker:latest             "./entrypoint.sh"         46 minutes ago   Up 46 minutes   0.0.0.0:5002->5002/tcp                           unstract-worker
499be651d9c4   unstract/prompt-service:latest     "./entrypoint.sh"         46 minutes ago   Up 46 minutes   0.0.0.0:3003->3003/tcp                           unstract-prompt-service
271229a6ee7a   unstract/platform-service:latest   ".venv/bin/gunicorn …"   46 minutes ago   Up 46 minutes   0.0.0.0:3001->3001/tcp                           unstract-platform-service
b840e1f24ae1   unstract/backend:latest            ".venv/bin/celery -A…"   46 minutes ago   Up 46 minutes   8000/tcp                                         unstract-execution-consumer
56d8d66d308a   unstract/backend:latest            ".venv/bin/celery -A…"   46 minutes ago   Up 44 minutes   8000/tcp                                         unstract-celery-beat
175c9b041457   unstract/x2text-service:latest     ".venv/bin/gunicorn …"   46 minutes ago   Up 46 minutes   0.0.0.0:3004->3004/tcp                           unstract-x2text-service
47cb9c1efc57   redis:7.2.3                        "docker-entrypoint.s…"   46 minutes ago   Up 46 minutes   0.0.0.0:6379->6379/tcp                           unstract-redis
5a0635356bfc   pgvector/pgvector:pg15             "docker-entrypoint.s…"   46 minutes ago   Up 46 minutes   0.0.0.0:5432->5432/tcp                           unstract-db
e624a063571b   minio/minio:latest                 "/usr/bin/docker-ent…"   46 minutes ago   Up 46 minutes   0.0.0.0:9000-9001->9000-9001/tcp                 unstract-minio
506d5437fcdb   qdrant/qdrant:v1.8.3               "./entrypoint.sh"         46 minutes ago   Up 46 minutes   0.0.0.0:6333->6333/tcp, 6334/tcp                 unstract-vector-db
304f2756a3ac   flipt/flipt:v1.34.0                "./flipt"                 46 minutes ago   Up 46 minutes   0.0.0.0:8082->8080/tcp, 0.0.0.0:9005->9000/tcp   unstract-flipt
adfd91dcdea9   traefik:v2.10                      "/entrypoint.sh --ap…"   46 minutes ago   Up 46 minutes   0.0.0.0:80->80/tcp, 0.0.0.0:8080->8080/tcp       unstract-proxy
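
To confirm that all 14 containers from the listing above are up, the names reported by `docker ps --format '{{.Names}}'` can be compared against the expected ones. A minimal Python sketch: the helper names are hypothetical, and the expected list is simply copied from the output above.

```python
import subprocess

# Container names copied from the `docker ps` listing above.
EXPECTED_CONTAINERS = [
    "unstract-frontend", "unstract-backend", "unstract-worker",
    "unstract-prompt-service", "unstract-platform-service",
    "unstract-execution-consumer", "unstract-celery-beat",
    "unstract-x2text-service", "unstract-redis", "unstract-db",
    "unstract-minio", "unstract-vector-db", "unstract-flipt",
    "unstract-proxy",
]


def running_container_names():
    """Ask the Docker CLI for the names of running containers, one per line."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def missing_containers(names_output, expected=EXPECTED_CONTAINERS):
    """Return the expected names that do not appear in the CLI output."""
    running = {line.strip() for line in names_output.splitlines() if line.strip()}
    return [name for name in expected if name not in running]
```

Calling `missing_containers(running_container_names())` should return an empty list when every service is running.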

Initial Setup After 1st Login

LLM

Choose OpenAI with the following settings:

Params Value
Name openai gpt-4o-mini
API Key ********
Model gpt-4o-mini

NOTE: All other settings are left at their defaults.

VECTOR DATABASE

Choose Qdrant with the following settings:

Params Value
Name qdrant cloud
URL https://XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX.europe-west3-0.gcp.cloud.qdrant.io:6333
API Key ********

EMBEDDING MODEL

Choose OpenAI with the following settings:

Params Value
Name openai embedding
API Key ********

NOTE: All other settings are left at their defaults.

TEXT EXTRACTOR

Choose LlamaParse with the following settings:

Params Value
Name llama parse
API Key ********

NOTE: All other settings are left at their defaults.

Set REMOVE_CONTAINER_ON_EXIT=False

$ docker compose -f docker/docker-compose.yaml down
$ vi worker/.env
REMOVE_CONTAINER_ON_EXIT=False
$ ./run-platform.sh
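
The `vi` step above can also be done non-interactively. A minimal Python sketch, assuming `worker/.env` uses plain `KEY=value` lines; `set_env_value` and `update_env_file` are hypothetical helpers, not part of unstract.

```python
from pathlib import Path


def set_env_value(text, key, value):
    """Return .env text with `key` set to `value`, replacing or appending it."""
    lines = text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        # Only touch active `KEY=...` assignments; commented lines are kept.
        if "=" in line and line.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


def update_env_file(path, key, value):
    """Rewrite the file at `path` in place with the updated assignment."""
    env_path = Path(path)
    env_path.write_text(set_env_value(env_path.read_text(), key, value))
```

For this setup the call would be `update_env_file("worker/.env", "REMOVE_CONTAINER_ON_EXIT", "False")`, run from the repository root between the `down` and `./run-platform.sh` steps.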

Setup Prompt Studio

Create a New Project

Params Value Notes
Tool Name 請求書パース means "invoice parser"
Author/Org Name kun432
Description 請求書をパースするツール means "a tool for parsing invoice"
Icon 📄

Project Settings

LLM Profiles

Params Value
Name 請求書パース プロファイル1
LLM openai gpt-4o-mini
Vector Database qdrant cloud
Embedding Model openai embedding
Text Extractor llama parse
Chunk Size 0
Overlap 0

NOTE: All other settings are left at their defaults.

Preamble

Your ability to extract and summarize this information accurately is essential for effective Japanese invoice analysis. Pay close attention to the invoice's language, structure, and any cross-references to ensure a comprehensive and precise extraction of information. Do not use prior knowledge or information from outside the context to answer the questions. Only use the information provided in the context to answer the questions.

Postemble

Do not include any explanation in the reply. Only include the extracted information in the reply.

NOTE: Same as the default.

Manage Documents

Upload the following 2 PDFs.

Prompts

Field Prompts Type
invoice_issuer_name この請求書を発行した発行者または会社の名称は何ですか? Text
invoice_customer_name この請求書に記載されているお客様名または会社の名前はなんですか?敬称は不要です。 Text
invoice_customer_address 提供された文脈には複数の住所が記載されている可能性があるため、まずすべての住所を収集してください。次に、この請求書が誰宛てに送られているのか、つまり請求先お客様の名前を理解するようにしてください。そして、その名前が合致する住所を見つけてください。お客様の住所を常に返すようにしてください。他の住所を返さないでください。

お客様の住所については、以下のフィールドを持つシンプルなJSONオブジェクトを作成してください。
- full_address: お客様の完全な住所である必要がある
- prefecture: 住所から取得した都道府県名のみである必要がある
- city: 住所から取得した市区町村名のみである必要がある
- zip: 郵便番号のみである必要がある
json
invoice_payment_info 請求金額は請求書において重要な部分であり、支払い方法・小計(税抜)・消費税額・合計請求額(税込)で構成される。

以下のフィールドを含むJSONオブジェクトを返してください。
- payment_method: 支払い方法。以下の3つから選択。"請求書"、"小切手"、"クレジットカード"
- total_wo_tax: 税抜の小計金額
- tax:消費税額
- total_w_tax:税込の合計請求額
json
invoice_line_items この請求書には請求内容の内訳が記載されており、内訳に記載された各請求項目は与えられたコンテキスト全体にわたって分割することができる。常に全体的なコンテキストを確認し、すべての請求項目の詳細を回答してください。

各請求項目について、以下のフィールドを含むシンプルなJSONオブジェクトを作成してください。
- item_name: 請求項目の項目名
- item_num: 請求項目の個数
- price_per_unit: 請求項目のユニットあたりの単価
- price_per_item: 請求項目ごとの金額

これらの項目を含むオブジェクトをJSON配列に格納し、それを返してください。
json

※English translation (for reference only):

Field Prompts Type
invoice_issuer_name What is the name of the issuer or company that issued this invoice? Text
invoice_customer_name What is the name of the customer or company on this invoice? Honorific titles are not required. Text
invoice_customer_address First collect all addresses, as the context provided may contain more than one address. Next, try to understand to whom this invoice is being sent, i.e., the name of the billing customer. Then find the address that matches that name. Always return the customer's address. Do not return other addresses.

For customers' address, create a simple JSON object with the following fields
- full_address: Must be the complete address of the customer
- prefecture: Must be only the name of the prefecture taken from the address
- city: Must be only the name of the municipality obtained from the address
- zip: Must be zip code only
json
invoice_payment_info The invoice amount is an important part of the invoice and consists of the payment method, subtotal (excluding tax), sales tax amount, and total invoice amount (including tax).

Return a JSON object containing the following fields
- payment_method: Payment method. Choose one of the following three options: "Bill", "Check", "Credit Card"
- total_wo_tax: Subtotal amount excluding tax
- tax: Amount of consumption tax
- total_w_tax: Total billing amount including tax
json
invoice_line_items The invoice contains a breakdown of the billing details, and each billing item listed in the breakdown can be broken down over the entire given context. Always check the overall context and respond with details for all billing items.

For each billing item, create a simple JSON object containing the following fields
- item_name: Item name of the billing item
- item_num: Number of billing items
- price_per_unit: Unit price per unit for billed items
- price_per_item: Amount per billing item

Store the object containing these items in a JSON array and return it.
json
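
The JSON shape requested by the invoice_payment_info prompt can be checked mechanically when reviewing outputs. A minimal Python sketch: the field names and the three allowed payment methods come from the prompt above, while `payment_info_problems` is a hypothetical helper and not something unstract provides.

```python
# Allowed values and field names as written in the Japanese prompt.
ALLOWED_PAYMENT_METHODS = {"請求書", "小切手", "クレジットカード"}
PAYMENT_FIELDS = {"payment_method", "total_wo_tax", "tax", "total_w_tax"}


def payment_info_problems(info):
    """List the ways `info` departs from the shape the prompt asks for."""
    problems = []
    missing = PAYMENT_FIELDS - set(info)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    method = info.get("payment_method")
    if method is not None and method not in ALLOWED_PAYMENT_METHODS:
        problems.append(f"unexpected payment_method: {method!r}")
    return problems
```

An empty list means the object has all four fields and a recognized payment method; it does not check that the amounts are correct.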

"Run All" for each prompt, then check the Output Analyzer.

[Screenshot: SS_ 2024-08-22 15 14 34]

Also check the Raw View to confirm that the text extractor works. (IMO, this shows that LlamaParse works for Japanese documents.)

[Screenshot: SS_ 2024-08-22 15 16 24]

Then export the Prompt Studio project as a tool.

Setup a Workflow

Create a new workflow.

Params Value
Workflow Name 請求書パースAPIワークフロー
Description 請求書パースAPIワークフロー

Workflow Settings:

Params Value Notes
Input Setting API
Workflow Chain 請求書パース the tool exported in the previous section
Output Setting API

Then choose "Deploy as API".

Params Value Notes
Display Name 請求書パースAPI
Description 請求書パースAPI
API Name parse_japanese_invoice

Testing API

The deployed API is as follows:

Params Value
API Name 請求書パースAPI
API Endpoint http://frontend.unstract.localhost/deployment/api/mock_org/parse_japanese_invoice/
API Key XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

Test the API using Postman:

Params Value
URL http://frontend.unstract.localhost/deployment/api/mock_org/parse_japanese_invoice/
Method POST

Authorization

Params Value
Auth Type Bearer Token
Token XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

Body(form-data)

Key Type Value
files File 請求書サンプル3.pdf
timeout Text 300

NOTES: the file used above is https://drive.google.com/file/d/1St9it9cj3SY0GnamZkBjDO3tyVMkA5GZ/view
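
For reproducing without Postman, the same request can be built with the Python standard library. A minimal sketch, assuming the form fields `files` and `timeout` and the Bearer token shown above; `build_request` is a hypothetical helper, and sending the file name as raw UTF-8 in the Content-Disposition header is an assumption about what the server accepts.

```python
import urllib.request
import uuid


def build_request(endpoint, api_key, filename, file_bytes, timeout=300):
    """Build (but do not send) a multipart/form-data POST for the deployed API."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="timeout"\r\n\r\n'
        f"{timeout}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=head + file_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send it, read the PDF with `pathlib.Path("請求書サンプル3.pdf").read_bytes()`, pass the bytes to `build_request`, and hand the result to `urllib.request.urlopen`.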

Result:

{
    "message": {
        "execution_status": "ERROR",
        "status_api": "/deployment/api/mock_org/parse_japanese_invoice/?execution_id=3c37e236-a1f0-49c8-8fb8-09960fffb72b",
        "error": null,
        "result": [
            {
                "file": "請求書サンプル3.pdf",
                "status": "Failed",
                "error": "Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'"
            }
        ]
    }
}
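
The response above can be inspected programmatically to list the failed files and to build the status URL for the execution. A minimal Python sketch: `failed_files` and `status_url` are hypothetical helpers, and resolving `status_api` against the same host as the API endpoint is an assumption.

```python
import json
from urllib.parse import urljoin

# Response body copied from the result above.
RESPONSE = json.loads("""
{
    "message": {
        "execution_status": "ERROR",
        "status_api": "/deployment/api/mock_org/parse_japanese_invoice/?execution_id=3c37e236-a1f0-49c8-8fb8-09960fffb72b",
        "error": null,
        "result": [
            {
                "file": "請求書サンプル3.pdf",
                "status": "Failed",
                "error": "Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'"
            }
        ]
    }
}
""")


def failed_files(response):
    """Return (file, error) pairs for every result entry marked "Failed"."""
    message = response.get("message", {})
    return [
        (item.get("file"), item.get("error"))
        for item in message.get("result") or []
        if item.get("status") == "Failed"
    ]


def status_url(endpoint, response):
    """Resolve the relative `status_api` path against the API endpoint's host."""
    return urljoin(endpoint, response["message"]["status_api"])
```

Here `failed_files(RESPONSE)` yields a single entry for 請求書サンプル3.pdf with the `/data/INFILE.pdf` error.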