Deployment as server?
Hi, congrats for the release of the model :)
Just out of curiosity, I tried deploying the model via lmdeploy
The idea is to expose the access as an openai-compatible API. An easy deployment will allow many enterprises to adopt this model, making it even more popular.
pip install lmdeploy timm
pip install flash-attn --no-build-isolation
lmdeploy serve api_server 5CD-AI/Vintern-1B-v3_5 --backend turbomind --server-port 23333
Then access it using openai
API:
from openai import OpenAI
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://127.0.0.1:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[{
'role':
'user',
'content': [{
'type': 'text',
'text': 'mô tả hình ảnh này',
}, {
'type': 'image_url',
'image_url': {
'url':
'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
},
}],
}],
temperature=0.8,
top_p=0.8,
max_tokens=64)
print(response)
But run into an error:
...
async for res in result_generator:
File "/home/user/miniconda/lib/python3.9/site-packages/lmdeploy/serve/async_engine.py", line 528, in generate
prompt_input = await self._get_prompt_input(prompt,
File "/home/user/miniconda/lib/python3.9/site-packages/lmdeploy/serve/vl_async_engine.py", line 100, in _get_prompt_input
results = await self.vl_encoder.wrap_for_turbomind(
File "/home/user/miniconda/lib/python3.9/site-packages/lmdeploy/vl/engine.py", line 110, in wrap_for_turbomind
result = self.model.to_turbomind(messages, chat_template, tokenizer,
File "/home/user/miniconda/lib/python3.9/site-packages/lmdeploy/vl/model/internvl.py", line 289, in to_turbomind
return self.to_turbomind_aux(messages, prompt, IMAGE_TOKEN, tokenizer,
File "/home/user/miniconda/lib/python3.9/site-packages/lmdeploy/vl/model/base.py", line 219, in to_turbomind_aux
segs = prompt.split(IMAGE_TOKEN)
AttributeError: 'NoneType' object has no attribute 'split'
Not sure how to debug this (unfortunately, outside of my knowledge), but let me know if I can help!
Oh, good idea Son ! I will fix it ASAP !
I tried deploying the model with lmdeploy but faced similar issues, even when using Docker and the offline interface. This suggests there might be a deeper issue with the deployment interfaces for these tools. It's interesting to hear that switching to vLLM worked seamlessly for both Docker and offline modes—this could indeed point to a problem specific to lmdeploy's handling of certain configurations.
Now I still try some ways to deploy with lmdeploy
from openai import OpenAI
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://172.18.249.58:8000/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[{
'role':
'user',
'content': [{
'type': 'text',
'text': 'Trích thông tin từ ảnh trả về dạng Json',
}, {
'type': 'image_url',
'image_url': {
'url':
'https://huggingface.co/erax-ai/EraX-VL-7B-V1.5/resolve/main/images/trinhquangduy_front.jpg',
},
}],
}],
temperature=0.8,
top_p=0.8,
max_tokens=64)
print(response)
(Aphrodite) gitlab@AIMACHINE:~/whisperx_test$ python test2.py
ChatCompletion(id='chatcmpl-1cf4eb1850d649eda72c49cc3289c581', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='```json\n{\n "Họ và tên": "TRỊNH QUANG DUY",\n "Ngày sinh": "04/09/1994",\n "Giới tính": "Nam",\n "Quốc tịch": "Việt Nam",\n "Nơi thường trú', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[]), stop_reason=None)], created=1736909146, model='5CD-AI/Vintern-1B-v3_5', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=1327, total_tokens=1391, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
Good new ! v3.5 model now can work better in MRZ task as I promised @thanhtan2136
User: Liệt kê toàn bộ văn bản có trong ảnh.
Assistant: Đặc điểm nhân dạng / Personal identification:
Sẹo chấm C: 2cm dưới trước
đuôi mắt phải
Ngày, tháng, năm / Date, month, year: 04/08/2022
CỤC TRƯỞNG CỤC CẢNH SÁT
QUẢN LÝ HÀNH CHÍNH VỀ TRẤT TỰ XÃ HỘI
DIRECTOR GENERAL OF THE POLICE DEPARTMENT FOR ADMINISTRATIVE MANAGEMENT和社会 ORDER
Tô Văn Huệ
Ngón trỏ trái / Left index finger
Ngón trỏ phải / Right index finger
IDVNM1890129363001189012936<<4
8911207F2911205VNM<<<<<<<<<<<< NGUYENKIM<ANH<<<<<<<<<<<<
Thank you for your positivity! I'm glad to have your support. Your model has been incredibly helpful in my work and experiments. If I encounter any issues or come up with ideas for improvement, I’ll be sure to share them. Your contributions to the community are greatly appreciated. Wishing you continued success with your projects!
Hi
@khang119966
@ngxson
,
I've been testing this model. I downloaded it to my local computer and renamed it to InternVL2_5-1B. While running it with Lmdeploy, it runs normally, but the results are no good. Therefore, I believe the issue lies with Lmdeploy rather than the 5CD-AI/Vintern-1B-v3_5 model itself.
Log:
(deloyllm) gitlab@ai-server:~/models$ lmdeploy serve api_server ./InternVL2_5-1B --server-port 23333
/home/gitlab/miniconda3/envs/deloyllm/lib/python3.11/site-packages/timm/models/layers/init.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {name} is deprecated, please import via timm.layers", FutureWarning)
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2ForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the with torch.autocast(device_type='torch_device'):
decorator, or load the model with the torch_dtype
argument. Example: model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)
[WARNING] gemm_config.in is not found; using default GEMM algo
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO: Started server process [16361]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
INFO: 192.168.1.136:37562 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 192.168.1.136:37562 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 192.168.1.136:33436 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 192.168.1.136:33436 - "POST /v1/chat/completions HTTP/1.1" 200 OK
ChatCompletion(id='1', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content="EE\n{'họEEEE': 'TRỊNH QUANGEE', 'ngàyBS': '04EE91994', 'ngàyquýEE': '04EEEE12351EE', 'ngàyquyếE': '04091994', 'ngàyngBSEEEE': '04091994EE', 'ngàyngBSEEEE': '04091994E', 'ngàyngBSEE': '04EE91994EEE', 'ngàyngBSEEEEEEEE': '040919EE4', 'ngàyngBS': '040919EE4', 'ngàyngBSEE': '04091994E', 'ngàyngBSEEEEEEEE': '04091994E', 'ngàyngBSEE': '040919EE4', 'ngàyngBSEEEEEEEEEEEEEEE': '04091994E', 'ngàyngBSEEEEEE': '040919EE4', 'ngàyngBSEEEEEEEEEEEEE': '040919EE4EEEE', 'ngàyngBSEEEEEEE': '04091994E', 'ngàyngBSEEEEEEEEEEEEEEEEEE': '040919EE4', 'ngàyngBSEEEEE': '040919EEEEEEEE4', 'ngàyngBSEEEEEEEEEEE': '04091994EE4', 'ngàyngBSEEEEEEE': '040919EE4', 'ngàyngBSEEEEEEEEeeEEEEEEE': '040919EE4E', 'ngàyngBSEEEEE': '04091994EEEEEE4', 'ngàyngBSEEEEEEEEEEEEEEEEE': '040919EE4EE4', 'ngàyngBSEEEEEEE': '040919EEEE4EEEEEE', 'ngàyngBSEEEEEEEEeeEeeEeeEeeEee': '04091994E4', 'ngàyngBSEEEEEeeEeeEeeEeeEeeEEE': '04091994E4E', 'ngàyngBSEEEEEEEEE': '040919EE4EE4EE4', 'ngàyngBSEEEEEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEEEEEEE': '04091994E4E4EEE4', 'ngàyngBSEEEEEEEEEE': '04091994E4E4E4', 'ngàyngBSEEEEE': '04EE91994EEEEEE4', 'ngàyngBSEEEEEEEEEEEEEEEEEEEEEEEE': '040919EE4EEE4E4', 'ngàyngBSEEEEEEEEeeEeeEeeEEE': '040919EE4EEE4EE4', 'ngàyngBSEEEEEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEEE', 'ngàyngBSEEEEEEE': '04EE91994E4EEE4EE4EE4', 'ngàyngBSEEEEEEEEEEEEEEEE': '04EEEEEEEE4E4E4E4EE4', 'ngàyngBSEEEEEeeEeeEEE': '04091994E4EEE4EE4EE4E', 'ngàyngBSEEEEEeeEeeEeeEeeEeeEeeEEE': '040919EE", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1736916605, model='./InternVL2_5-1B', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=1025, prompt_tokens=1338, total_tokens=2363, completion_tokens_details=None, prompt_tokens_details=None))
@thanhtan2136 Wow thanks for the hint about vllm, I completely forgot to check if vllm has support for intern-vl.
I tried with vllm and it works magically 😮
Here I'm using HF space (notebook) with L4 GPU
pip install vllm
vllm serve 5CD-AI/Vintern-1B-v3_5
Then send the request to 127.0.0.1:8000
ChatCompletion(id='chatcmpl-b4e6424360454ba7b8dad973112660b0', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Bức ảnh chụp một con hổ đứng trên một thảm cỏ xanh mướt. Hổ có màu sắc cam và đen, với những đường vân đen trắng xen kẽ. Hổ đang nhìn thẳng vào ống kính, tạo cảm giác vui vẻ và tự tin. Bức ảnh', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[]), stop_reason=None)], created=1736932228, model='5CD-AI/Vintern-1B-v3_5', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=811, total_tokens=875, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
Nice!
@khang119966 It would be nice to mention this somewhere in the README. I believe that many vietnamese enterprises are looking for a solution like this!
Btw with vllm working I think we can ignore lmdeploy for now haha. I tried lmdeploy because that was the only thing I could find online
yeah I can deploy this with vllm but I don't known how to control this.... My testing(with Vllm) not have a good result so hope @khang119966 will have update a solution
@thanhtan2136 @ngxson yeah, I will test the vllm and tune the best config. All of my servers are busy today :) . the next few days.