@kun432
Last active August 23, 2024 09:10
Zipstack/unstract issue #595-02 using LLMWhisperer

This follows the procedure in the gist below:

https://gist.github.com/kun432/a8d7238c9c1fd738aed5f7d7771ba4a5

with one exception:

  • using LLMWhisperer as text extractor instead of Llama Parse

Checking existing Unstract

$ docker ps
CONTAINER ID   IMAGE                              COMMAND                   CREATED        STATUS        PORTS                                            NAMES
9aa84bd0c233   unstract/frontend:latest           "/docker-entrypoint.…"   25 hours ago   Up 25 hours   80/tcp, 0.0.0.0:3000->3000/tcp                   unstract-frontend
d91cbc6302f3   unstract/backend:latest            "./entrypoint.sh mig…"   25 hours ago   Up 25 hours   0.0.0.0:8000->8000/tcp                           unstract-backend
c1a7ba34c88f   unstract/x2text-service:latest     ".venv/bin/gunicorn …"   25 hours ago   Up 25 hours   0.0.0.0:3004->3004/tcp                           unstract-x2text-service
32b85581e254   unstract/platform-service:latest   ".venv/bin/gunicorn …"   25 hours ago   Up 25 hours   0.0.0.0:3001->3001/tcp                           unstract-platform-service
c28d054c6869   unstract/worker:latest             "./entrypoint.sh"         25 hours ago   Up 25 hours   0.0.0.0:5002->5002/tcp                           unstract-worker
a269e2548fae   unstract/backend:latest            ".venv/bin/celery -A…"   25 hours ago   Up 25 hours   8000/tcp                                         unstract-celery-beat
9fe29b9c09ab   unstract/prompt-service:latest     "./entrypoint.sh"         25 hours ago   Up 25 hours   0.0.0.0:3003->3003/tcp                           unstract-prompt-service
f7efea0227e0   unstract/backend:latest            ".venv/bin/celery -A…"   25 hours ago   Up 25 hours   8000/tcp                                         unstract-execution-consumer
d3debd931b0a   redis:7.2.3                        "docker-entrypoint.s…"   25 hours ago   Up 25 hours   0.0.0.0:6379->6379/tcp                           unstract-redis
20d7ef9b2b64   pgvector/pgvector:pg15             "docker-entrypoint.s…"   25 hours ago   Up 25 hours   0.0.0.0:5432->5432/tcp                           unstract-db
b4a52603a022   minio/minio:latest                 "/usr/bin/docker-ent…"   25 hours ago   Up 25 hours   0.0.0.0:9000-9001->9000-9001/tcp                 unstract-minio
200d6d7c951c   qdrant/qdrant:v1.8.3               "./entrypoint.sh"         25 hours ago   Up 25 hours   0.0.0.0:6333->6333/tcp, 6334/tcp                 unstract-vector-db
d59b6e0f04de   flipt/flipt:v1.34.0                "./flipt"                 25 hours ago   Up 25 hours   0.0.0.0:8082->8080/tcp, 0.0.0.0:9005->9000/tcp   unstract-flipt
ccd35f916287   traefik:v2.10                      "/entrypoint.sh --ap…"   25 hours ago   Up 25 hours   0.0.0.0:80->80/tcp, 0.0.0.0:8080->8080/tcp       unstract-proxy
$ docker images
REPOSITORY                  TAG       IMAGE ID       CREATED        SIZE
unstract/backend            latest    086b0b10b430   28 hours ago   3.01GB
unstract/frontend           latest    940e8980cc7c   28 hours ago   305MB
unstract/prompt-service     latest    0c54fad5890d   28 hours ago   2.69GB
unstract/worker             latest    4138fa33df6d   28 hours ago   1.04GB
unstract/platform-service   latest    507e52fa0589   28 hours ago   481MB
unstract/x2text-service     latest    d765e5bbb1f6   28 hours ago   413MB
unstract/tool-structure     0.0.39    f063fc2ce9e8   47 hours ago   2.97GB
minio/minio                 latest    6f23072e3e22   4 days ago     205MB
pgvector/pgvector           pg15      6688455f2364   2 weeks ago    627MB
qdrant/qdrant               v1.8.3    15bd3cee31b3   5 months ago   251MB
traefik                     v2.10     6341b98aec5e   6 months ago   193MB
flipt/flipt                 v1.34.0   369cf32903cb   7 months ago   89MB
redis                       7.2.3     a7cee7c8178f   8 months ago   223MB
$ docker volume ls
DRIVER    VOLUME NAME
local     docker_minio_data
local     docker_postgres_data
local     docker_prompt_studio_data
local     docker_qdrant_data
local     docker_redis_data

Removing existing Unstract components

$ docker compose -f docker/docker-compose.yaml down
$ docker ps -a | grep "unstract/" | awk '{ print $1 }' | xargs docker rm
$ docker images | awk '{ print $3 }' | grep -v "IMAGE" | xargs docker rmi
$ docker volume ls | awk '{ print $2 }' | grep -v "VOLUME" | xargs docker volume rm

Then removed the cloned repo.

Also deleted the Qdrant Cloud cluster that had been in use and created a new one.
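Note that the `docker images` and `docker volume ls` pipelines above select every image and volume on the host, not only the Unstract ones. As a minimal sketch of a narrower selection, the Python below parses the tabular text output and keeps only rows whose chosen column starts with a given prefix. The function name `select_field` is hypothetical, the sample rows are copied from the `docker images` listing above, and the code only processes text; it does not call Docker itself.

```python
def select_field(table_text: str, match_col: int, prefix: str, out_col: int) -> list[str]:
    """Return column `out_col` of every data row whose column `match_col`
    starts with `prefix`. The first line is treated as the header and skipped."""
    selected = []
    for line in table_text.strip().splitlines()[1:]:
        fields = line.split()
        # Ignore short or blank rows instead of raising IndexError.
        if len(fields) > max(match_col, out_col) and fields[match_col].startswith(prefix):
            selected.append(fields[out_col])
    return selected


# Three rows copied from the `docker images` output above.
IMAGES = """\
REPOSITORY                  TAG       IMAGE ID       CREATED        SIZE
unstract/backend            latest    086b0b10b430   28 hours ago   3.01GB
unstract/frontend           latest    940e8980cc7c   28 hours ago   305MB
redis                       7.2.3     a7cee7c8178f   8 months ago   223MB
"""

# Column 0 is REPOSITORY, column 2 is the image ID.
unstract_image_ids = select_field(IMAGES, 0, "unstract/", 2)
print(unstract_image_ids)  # ['086b0b10b430', '940e8980cc7c']
```

The same helper could be pointed at `docker volume ls` output with `match_col=1`, `out_col=1` and the prefix `docker_` to pick out only the volumes listed earlier.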

Install

Same as in the linked procedure.

Initial Setup After 1st Login

Same as in the linked procedure, except:

TEXT EXTRACTOR

Choose LLMWhisperer with the following settings:

| Params | Value |
| --- | --- |
| Name | llmwhisperer |
| Unstract Key | ******** |
| Processing Mode | ocr |

NOTE: All settings other than those listed above are left at their default values.

set REMOVE_CONTAINER_ON_EXIT=False

Same as in the linked procedure.

Setup Prompt Studio

Same as in the linked procedure, except:

Project Settings

LLMProfiles

| Params | Value |
| --- | --- |
| Name | 請求書パース プロファイル1 |
| LLM | openai gpt-4o-mini |
| Vector Database | qdrant cloud |
| Embedding Model | openai embedding |
| Text Extractor | llmwhisperer |
| Chunk Size | 0 |
| Overlap | 0 |

NOTE: All settings other than those listed above are left at their default values.

Setup a Workflow

Same as in the linked procedure.

Testing API

Same as in the linked procedure.

result:

{
    "message": {
        "execution_status": "COMPLETED",
        "status_api": "/deployment/api/mock_org/parse_japanese_invoice/?execution_id=bf11826a-f129-4e84-911a-f21abd3dbde7",
        "error": null,
        "result": [
            {
                "file": "請求書サンプル3.pdf",
                "status": "Success",
                "result": {
                    "output": {
                        "invoice_customer_address": {
                            "city": "千代田区",
                            "full_address": "〒 100-0001 東 京 都 千 代 田 区 見 本 町 1-1",
                            "prefecture": "東京都",
                            "zip": "100-0001"
                        },
                        "invoice_customer_name": "範 例 工 業 株 式 会 社",
                        "invoice_issuer_name": "模範商事株式会社",
                        "invoice_line_items": [
                            {
                                "item_name": "特選和紙 (A4サイズ)",
                                "item_num": 1000,
                                "price_per_item": 50000,
                                "price_per_unit": 50
                            },
                            {
                                "item_name": "高級墨 (松煙)",
                                "item_num": 20,
                                "price_per_item": 40000,
                                "price_per_unit": 2000
                            },
                            {
                                "item_name": "筆セット (各種)",
                                "item_num": 50,
                                "price_per_item": 50000,
                                "price_per_unit": 1000
                            }
                        ],
                        "invoice_payment_info": {
                            "payment_method": "請求書",
                            "tax": 14000,
                            "total_w_tax": 154000,
                            "total_wo_tax": 140000
                        }
                    }
                },
                "metadata": {
                    "source_name": "請求書サンプル3.pdf",
                    "source_hash": "0a362e7b1825f8c507b2306d88451ef83e5a7390065a770184865824cce55e7b",
                    "organization_id": "mock_org",
                    "workflow_id": "1b9fa3f9-bbf2-4565-a121-4fd51c9695e0",
                    "execution_id": "bf11826a-f129-4e84-911a-f21abd3dbde7",
                    "total_elapsed_time": 56.497058,
                    "tool_metadata": [
                        {
                            "tool_name": "structure_tool",
                            "elapsed_time": 56.497032,
                            "output_type": "JSON"
                        }
                    ]
                }
            }
        ]
    }
}
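The extracted values in the result above can be sanity-checked mechanically. The sketch below is one way to do that in Python, under two assumptions that are not stated in the original: that `price_per_item` is the line amount (quantity times unit price), and that the spaces between Japanese characters in `invoice_customer_name` and `full_address` are extraction artifacts that are safe to collapse. The function names `check_invoice` and `collapse_cjk_spaces` are hypothetical; the numbers and strings are copied from the result.

```python
import re

# Line items and totals copied from the API result above.
line_items = [
    {"item_num": 1000, "price_per_unit": 50, "price_per_item": 50000},
    {"item_num": 20, "price_per_unit": 2000, "price_per_item": 40000},
    {"item_num": 50, "price_per_unit": 1000, "price_per_item": 50000},
]
payment = {"tax": 14000, "total_wo_tax": 140000, "total_w_tax": 154000}


def check_invoice(items, payment_info):
    """Return a list of inconsistencies found; an empty list means all checks pass."""
    problems = []
    for i, item in enumerate(items):
        # Assumption: price_per_item is quantity x unit price for the line.
        if item["item_num"] * item["price_per_unit"] != item["price_per_item"]:
            problems.append(f"line {i}: quantity x unit price != line amount")
    if sum(item["price_per_item"] for item in items) != payment_info["total_wo_tax"]:
        problems.append("line amounts do not sum to total_wo_tax")
    if payment_info["total_wo_tax"] + payment_info["tax"] != payment_info["total_w_tax"]:
        problems.append("total_wo_tax + tax != total_w_tax")
    return problems


# Hiragana, katakana, and common kanji; deliberately narrow for this sketch.
_CJK = r"[\u3040-\u30ff\u4e00-\u9fff]"


def collapse_cjk_spaces(text):
    """Remove runs of spaces that sit directly between two CJK characters."""
    return re.sub(rf"(?<={_CJK}) +(?={_CJK})", "", text)


print(check_invoice(line_items, payment))              # []
print(collapse_cjk_spaces("範 例 工 業 株 式 会 社"))  # 範例工業株式会社
```

Spaces next to digits or symbols (for example after `〒` or before `1-1` in the address) are left untouched, since only spaces with a CJK character on both sides are removed.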

kun432 commented Aug 23, 2024

The contents of the workflow_data directory changed as follows:

before API call

docker/workflow_data
└── tool_registry_config

during API call (just before the call finished)

docker/workflow_data
├── api
│   └── mock_org
│       └── 1b9fa3f9-bbf2-4565-a121-4fd51c9695e0
│           └── bf11826a-f129-4e84-911a-f21abd3dbde7
│               └── 請求書サンプル3.pdf
├── execution
│   └── mock_org
│       └── 1b9fa3f9-bbf2-4565-a121-4fd51c9695e0
│           └── bf11826a-f129-4e84-911a-f21abd3dbde7
│               ├── COPY_TO_FOLDER
│               │   └── 請求書サンプル3.json
│               ├── EXTRACT
│               ├── INFILE
│               ├── METADATA.json
│               ├── SOURCE
│               └── metadata
│                   └── EXTRACT.json
└── tool_registry_config

after API call

docker/workflow_data
├── api
│   └── mock_org
│       └── 1b9fa3f9-bbf2-4565-a121-4fd51c9695e0
├── execution
│   └── mock_org
│       └── 1b9fa3f9-bbf2-4565-a121-4fd51c9695e0
└── tool_registry_config
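The trees above show the per-execution directories and their files disappearing once the call finishes, while the per-workflow directories remain. A comparison like this can be automated by snapshotting the directory before and after the call and diffing the two sets of paths. The following is a minimal sketch in Python; `snapshot` and `diff_snapshots` are hypothetical helper names, and the demo runs against a throwaway temporary directory rather than the real `docker/workflow_data`.

```python
import tempfile
from pathlib import Path


def snapshot(root):
    """Return the set of all paths under `root`, relative to it, as POSIX strings."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*")}


def diff_snapshots(before, after):
    """Return (added, removed) as sorted lists of relative paths."""
    return sorted(after - before), sorted(before - after)


# Demo on a temporary directory that mimics part of the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tool_registry_config").mkdir()
    before = snapshot(root)

    # Stand-in for what appears once an API call has been made.
    (root / "api" / "mock_org").mkdir(parents=True)
    after = snapshot(root)

    added, removed = diff_snapshots(before, after)
    print(added)    # ['api', 'api/mock_org']
    print(removed)  # []
```

To apply it to the real setup, one would call `snapshot("docker/workflow_data")` before the request, again while it is running, and once more afterwards, then diff each pair.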
