Ollama: A Generative AI Service You Can Install on Your Own Computer

AI|2024. 1. 28. 12:33

The tool I'm introducing today is Ollama. Since LLaMA was released, a number of ways to install an LLM on your local computer and build your own ChatGPT have appeared. With Ollama, anyone with a computer can install and use one easily. Installation is a single command:

curl https://ollama.ai/install.sh | sh

Now let's pull the LLaMA 2 language model.

ollama pull llama2

Running it is just as easy.

$ ollama run llama2
>>> hello
Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?

>>> Tell me about Steve Jobs
Steve Jobs (1955-2011) was a visionary entrepreneur, inventor, and designer who co-founded Apple Inc. and Pixar Animation Studios. He is widely 
recognized as one of the most innovative and successful business leaders of the last century. Here are some key facts about Steve Jobs:

1. Early Life: Jobs was born in San Francisco to a Syrian-American father and a Swiss-American mother. He grew up in Mountain View, California, and 
showed an early interest in electronics and design.
2. Co-founding Apple: In 1976, Jobs co-founded Apple Computer with Steve Wozniak. They launched the Apple I, one of the first personal computers, and 
later introduced the Macintosh computer, which was the first commercially successful personal computer to use a graphical user interface (GUI).
3. Pixar Animation Studios: In 1986, Jobs acquired Pixar Animation Studios from Lucasfilm and served as its CEO until it was acquired by Disney in 
2006. Under Jobs' leadership, Pixar created some of the most successful animated films ever made, including Toy Story (1995), Finding Nemo (2003), and 
Wall-E (2008).
4. Design Visionary: Jobs was known for his attention to detail and design sensibilities. He was instrumental in creating the sleek and minimalist 
aesthetic that became synonymous with Apple products, from the Macintosh computer to the iPod, iPhone, and iPad.
5. Innovator: Jobs was a master of innovation, constantly pushing the boundaries of what was possible with technology. He introduced the world to the 
iPod, which revolutionized the way people listened to music, and the iPhone, which transformed the smartphone industry.
6. Entrepreneurial Spirit: Jobs was a true entrepreneur, with a passion for creating new products and businesses. He was known for his ability to 
identify emerging trends and capitalize on them, often disrupting entire industries in the process.
7. Leadership Style: Jobs was known for his strong leadership style, which emphasized creativity, innovation, and attention to detail. He was also 
notorious for his demanding nature and ability to push his teams to achieve their best work.
8. Personal Life: Jobs was married to Laurene Powell Jobs and had four children. He was a Buddhist and a vegetarian, and he enjoyed playing pranks on 
his colleagues and friends.
9. Illness and Death: Jobs battled pancreatic cancer for several years before passing away in 2011 at the age of 56. His death was met with an 
outpouring of tributes and memories from around the world, recognizing his impact on technology, design, and entrepreneurship.

Steve Jobs' legacy continues to inspire and influence generations of innovators, entrepreneurs, and designers, and his work at Apple and Pixar 
Animation Studios remains a testament to his visionary spirit and creative genius.

If you have a GPU, inference is accelerated automatically; without one, generation falls back to the CPU and takes noticeably longer.

Besides LLaMA 2, Ollama supports a variety of other language models you can compare against each other, and in addition to the 7B model you can load the 13B and 70B models into CPU memory and try them. Give it a test yourself.
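Ollama also runs a local REST server (on http://localhost:11434 by default), so you can call the model from a script instead of the interactive prompt. Below is a minimal sketch that assumes the default server is running and the llama2 model has been pulled as above; it posts to Ollama's /api/generate endpoint and prints the streamed response.

import json
import urllib.request

# Minimal sketch: query the local Ollama server's /api/generate endpoint.
# Assumes `ollama serve` is running on the default port and llama2 is pulled.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({"model": "llama2", "prompt": "Tell me about Steve Jobs"}).encode(),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    for line in resp:  # the reply is streamed as one JSON object per line
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)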


Running Code Llama with llama.cpp

AI|2024. 1. 24. 15:41

With llama.cpp you can run Llama 2 on an ordinary computer using only the CPU, no GPU required. In this post we'll walk through the process.

Building

Tested on Ubuntu 22.04.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

Downloading the Model

The site below explains how to download the LLaMA 2 models.

https://github.com/facebookresearch/llama

Register on the Meta website and you will receive a download URL by email. The URL is only valid for 24 hours, so I recommend downloading everything you need up front.

https://ai.meta.com/resources/models-and-libraries/llama-downloads/

git clone https://github.com/facebookresearch/codellama
./download.sh  "URL received by email"

...

Enter the list of models to download without spaces (7b,13b,34b,7b-Python,13b-Python,34b-Python,7b-Instruct,13b-Instruct,34b-Instruct), or press Enter for all:

 

Converting the Model

To run the model on a local PC, it first has to be converted.

I'll use the CodeLlama-7b model as the example.

First, convert the downloaded model to GGUF, llama.cpp's ggml-based format.

python3 convert.py models/CodeLlama-7b/

I ran inference right away, and it generated a multiply function in Java.

llama.cpp$ ./main -m ./models/CodeLlama-7b/ggml-model-f16.gguf -p "int multiply("
Log start
main: build = 1939 (57744932)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1706076174
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from ./models/CodeLlama-7b/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
..
..

	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 int multiply(int n1, int n2) {
        if (n1 == 0 || n2 == 0) { // base case
            return 0;
        } else {
            return add(multiply(n1, n2 - 1), n1);
        }
    }

    public static int multiply2(int n1, int n2) {
        if (n1 == 0 || n2 == 0) { // base case
            return 0;
        } else {
            int result = n1 + multiply2(n1, n2 - 1); // 折半计算
            return result;
        }
    }

    public static void main(String[] args) {
        System.out.println("0 * 1 = " + multiply(0, 1));
        System.out.println("1 * 2 = " + multiply(1, 2));
        System.out.println("1 * 3 = " + multiply(1, 3));
        System.out.println("1 * 4 = " + multiply(1, 4));
        System.out.println("2 * 3 = " + multiply(2, 3));
        System.out.println("2 * 5 = " + multiply(2, 5));
        System.out.println("3 * 10 = " + multiply(3, 10));
        System.out.println("3 * 4 = " + multiply(3, 4));
        System.out.println("4 * 8 = " + multiply(4, 8));
        System.out.println("5 * 6 = " + multiply(5, 6));
        System.out. [end of text]

llama_print_timings:        load time =   59120.92 ms
llama_print_timings:      sample time =      49.16 ms /   398 runs   (    0.12 ms per token,  8095.19 tokens per second)
llama_print_timings: prompt eval time =     427.50 ms /     4 tokens (  106.88 ms per token,     9.36 tokens per second)
llama_print_timings:        eval time =  116470.88 ms /   397 runs   (  293.38 ms per token,     3.41 tokens per second)
llama_print_timings:       total time =  117027.28 ms /   401 tokens
Log end

That took well over a minute, though. Now let's convert to a 4-bit model to speed things up.

Converting to a 4-bit Model (Quantization)

./quantize ./models/CodeLlama-7b/ggml-model-f16.gguf ./models/CodeLlama-7b/ggml-model-q4_0.gguf q4_0

Running Inference

You can limit the number of generated tokens with -n.

llama.cpp$ ./main -m ./models/CodeLlama-7b/ggml-model-q4_0.gguf -n 256 -p "int multiply("
Log start
main: build = 1939 (57744932)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1706076680
llama_model_loader: loaded meta data with 17 key-value pairs and 291 tensors from ./models/CodeLlama-7b/ggml-model-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 259/32016 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'

sampling: 
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0


 int multiply(int a, int b) {
	return a * b;
}

int divide(int a, int b) {
	return a / b;
}
```

We can add some simple validation to the function to make sure the input values are sane.  We'll just ensure that `a` and `b` are greater than zero:

```csharp
[DllImport("MyMath")]
public static extern int multiply(int a, int b);

[DllImport("MyMath")]
public static extern int divide(int a, int b);

// Note that we've added the [SuppressUnmanagedCodeSecurity] attribute here.  
[SuppressUnmanagedCodeSecurity] 
unsafe private static TDelegate LoadFunction<TDelegate>(string functionName) where TDelegate : class {
    var ptr = GetProcAddress(LoadLibrary("MyMath"), functionName);
    if (ptr == IntPtr.Zero)
        throw new InvalidOperationException($"Failed to get address for: '{functionName}'");

	var fp = Marshal.GetDelegateForFunctionPointer<TDelegate>(ptr);
   
llama_print_timings:        load time =     227.93 ms
llama_print_timings:      sample time =      33.48 ms /   256 runs   (    0.13 ms per token,  7647.27 tokens per second)
llama_print_timings: prompt eval time =     144.80 ms /     4 tokens (   36.20 ms per token,    27.62 tokens per second)
llama_print_timings:        eval time =   24956.50 ms /   255 runs   (   97.87 ms per token,    10.22 tokens per second)
llama_print_timings:       total time =   25187.38 ms /   259 tokens
Log end

This time it produced C# code. Since this is the base model with no fine-tuning, the generated code wanders wherever it likes. With 4-bit quantization, generation got roughly three times faster (3.4 → 10.2 tokens per second in the timings above).

13478367200 Jan 23 19:47 ggml-model-f16.gguf
 3825898016 Jan 23 19:52 ggml-model-q4_0.gguf

The model file shrank by more than 3x. Since we went from 16-bit floats down to roughly 4 bits per weight, that's about what you'd expect.
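As a quick back-of-the-envelope check (my own calculation, not part of the original timings), the file sizes above are consistent with the bits-per-weight figure reported in the log:

# Rough sanity check of bits per weight, using the file sizes above
# and the 6.74B parameter count printed in the llama.cpp log.
params = 6.74e9
f16_bytes = 13_478_367_200
q4_bytes = 3_825_898_016

print(f"f16  : {f16_bytes * 8 / params:.2f} bits/weight")  # ~16.0
print(f"q4_0 : {q4_bytes * 8 / params:.2f} bits/weight")   # ~4.54, matching the log's 4.54 BPW
print(f"ratio: {f16_bytes / q4_bytes:.1f}x smaller")       # ~3.5x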

References

* https://blog.gopenai.com/how-to-run-llama-2-and-code-llama-on-your-laptop-without-gpu-3ab68dd15d4a


Using LLaMA 2 with CUDA

AI|2024. 1. 18. 16:54

This blog has previously covered fine-tuning with alpaca-lora. alpaca-lora added LoRA support on top of LLaMA, which made it possible to run a LLaMA model even on an NVIDIA RTX 4090. Meta released Llama 2 last July, along with open-source projects that make it easy to try out, so alpaca-lora is no longer strictly necessary. Quantization is now supported directly, so Llama 2 can be tested on a personal GPU. Even more remarkable, Meta also released purpose-trained models such as Code Llama and Llama Chat, which is great news for companies building generative AI services. In this post, I've summarized the results of installing Llama 2 on Ubuntu 22.04 and running some simple tests.

Installing the NVIDIA Driver

First, install the NVIDIA driver on Ubuntu 22.04.
https://www.nvidia.com/download/index.aspx

“You appear to be running an X server; please exit X before installing”
If you get the error message above, you are still in an X session; log out, log back in under Weston, and run the installer again.

If the installation succeeded, running nvidia-smi shows the NVIDIA GPU status as below.

llama2$ nvidia-smi
Wed Jan 17 17:26:57 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0 Off |                  Off |
|  0%   45C    P8              13W / 450W |   1243MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     15496      G   /usr/bin/gnome-shell                          6MiB |
|    0   N/A  N/A    667982    C+G   ...seed-version                            1201MiB |
+---------------------------------------------------------------------------------------+

Installing Python & PyTorch

Now install Python.

$ sudo apt install python3-pip
$ sudo apt install python3-venv
$ python3 -m venv env
$ source env/bin/activate

This sets up a Python environment with venv; venv lets you maintain multiple isolated Python environments side by side.

 

Cloning the llama-recipes Repository

https://github.com/facebookresearch/llama-recipes

Install it as follows.

git clone git@github.com:facebookresearch/llama-recipes.git
cd llama-recipes
pip install -U pip setuptools
pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e .[tests,auditnlg,vllm]

Run the code below to verify that the GPU driver, CUDA, and PyTorch are installed correctly.

import torch

# Check if CUDA is available
if torch.cuda.is_available():
	print("CUDA is available.")
	# Print the CUDA device count
	print(f"Number of CUDA devices: {torch.cuda.device_count()}")
	# Print the name of the current CUDA device
	print(f"Current CUDA device name: {torch.cuda.get_device_name(torch.cuda.current_device())}")
else:
	print("CUDA is not available.")

print("pytorch version")
print(torch.__version__)
$ python3 cuda.py
CUDA is available.
Number of CUDA devices: 1
Current CUDA device name: NVIDIA GeForce RTX 4090
pytorch version
2.1.2+cu118

 

Downloading the Model

https://huggingface.co/meta-llama

First, download https://huggingface.co/meta-llama/Llama-2-7b-hf. The 7B model is what an RTX 4090 can handle; anything bigger won't load. :-(

Alternatively, you can download it from Meta's site: https://ai.meta.com/resources/models-and-libraries/llama-downloads/
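Before running the llama-recipes examples, you can quickly confirm that the downloaded weights load onto the GPU with the Hugging Face transformers API. This is just a sketch; the local path is an assumption (adjust it to wherever you placed the model).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Quick loading check; the path assumes the model sits next to llama-recipes.
model_path = "../Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # fp16 so the 7B model fits in 24GB of VRAM
    device_map="auto",          # place the layers on the GPU automatically
)

inputs = tokenizer("Tell me about Steve Jobs", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))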

Running Inference with the Downloaded Model

Here is the summarization demo, using Llama-2-7b-hf.

llama-recipes$ cat examples/samsum_prompt.txt |  python3  examples/inference.py --model_name ../Llama-2-7b-hf/
User prompt deemed safe.
User prompt:
Summarize this dialog:

A: Hi Tom, are you busy tomorrow’s afternoon?

B: I’m pretty sure I am. What’s up?

A: Can you go with me to the animal shelter?.

B: What do you want to do?

A: I want to get a puppy for my son.

B: That will make him so happy.

A: Yeah, we’ve discussed it many times. I think he’s ready now.

B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 

A: I'll get him one of those little dogs.

B: One that won't grow up too big;-)

A: And eat too much;-))

B: Do you know which one he would like?

A: Oh, yes, I took him there last Monday. He showed me one that he really liked.

B: I bet you had to drag him away.

A: He wanted to take it home right away ;-).

B: I wonder what he'll name it.

A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))

---

Summary:
Loading checkpoint shards:   0%|                                                                         | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.76s/it]
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
the inference time is 56547.49874421395 ms
User input and model output deemed safe.
Model output:
Summarize this dialog:

A: Hi Tom, are you busy tomorrow’s afternoon?

B: I’m pretty sure I am. What’s up?

A: Can you go with me to the animal shelter?.

B: What do you want to do?

A: I want to get a puppy for my son.

B: That will make him so happy.

A: Yeah, we’ve discussed it many times. I think he’s ready now.

B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 

A: I'll get him one of those little dogs.

B: One that won't grow up too big;-)

A: And eat too much;-))

B: Do you know which one he would like?

A: Oh, yes, I took him there last Monday. He showed me one that he really liked.

B: I bet you had to drag him away.

A: He wanted to take it home right away ;-).

B: I wonder what he'll name it.

A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))

---

Summary: There are a few other ways to make this dialog more interesting. For example:

A: Let me know if you can’t go;-))

B: It’s ok. I'm not sure of this yet.

A: Ok, so I’ll take your place. Who’s going?

B: No, don't forget: you already have dinner with your family ;-))

A: Oh, sure, I do

 

The Summary section shows the model's attempt at summarizing the whole dialog.

Next is the chat completion demo. This one uses the Llama-2-7b-chat-hf model.

llama-recipes$ python examples/chat_completion/chat_completion.py --model_name ../Llama-2-7b-chat-hf/ --prompt_file examples/chat_completion/chats.json  --quantization
User dialogs:
[[{'role': 'user', 'content': 'what is the recipe of mayonnaise?'}], [{'role': 'user', 'content': 'I am going to Paris, what should I see?'}, {'role': 'assistant', 'content': "Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city. 2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa. 3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world."}, {'role': 'user', 'content': 'What is so great about #1?'}], [{'role': 'system', 'content': 'Always answer with Haiku'}, {'role': 'user', 'content': 'I am going to Paris, what should I see?'}], [{'role': 'system', 'content': 'Always answer with emojis'}, {'role': 'user', 'content': 'How to go from Beijing to NY?'}], [{'role': 'system', 'content': "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."}, {'role': 'user', 'content': 'Write a brief birthday message to John'}]]

==================================

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.17s/it]
User prompt deemed safe.
User prompt:
 what is the recipe of mayonnaise?

==================================

User input and model output deemed safe.
Model output:
[INST] what is the recipe of mayonnaise? [/INST]  Mayonnaise is a rich and creamy condiment made from a mixture of egg yolks, oil, and vinegar or lemon juice. Here is a basic recipe for homemade mayonnaise:

Ingredients:

* 2 large egg yolks
* 1/2 cup (120 ml) neutral-tasting oil, such as canola or grapeseed
* 1-2 tablespoons (15-30 ml) vinegar or lemon juice
* Salt and pepper to taste

Instructions:

1. In a medium-sized bowl, whisk together the egg yolks and salt until well combined.
2. Slowly pour in the oil while continuously whisking the mixture. Start with a slow trickle and gradually increase the flow of oil as the mixture thickens.
3. Once about 1/4 cup (60 ml) of oil has been added, begin to add the vinegar or lemon juice, again whisking constantly.
4. Continue whisking until the mixture has thickened and

- remainder omitted -

In the next post we'll look at how to do fine-tuning.


ALOHA Robot Demo Video Using Generative AI, Released by DeepMind

AI|2024. 1. 14. 12:50

 

 

A new kind of relatively inexpensive home/industrial robot seems to be on the way. Robots used to be rule-based, but now robots have appeared that use generative AI to learn and imitate human behavior. These robots perform tasks from cleaning to cooking just as a person demonstrated them. What stands out in particular is that this training data can be shared between robots.

More details and videos are available on the project page below.

https://mobile-aloha.github.io/

 

Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
by Zipeng Fu*, Tony Z. Zhao*, and Chelsea Finn at Stanford (mobile-aloha.github.io)

Related article

https://readwrite.com/robot-chef-learns-how-to-sautee-and-serve-up-dinner/

 

Watch the new robot cook from Google DeepMind researchers: Google DeepMind researchers unveil Mobile ALOHA, the robotic cook that learns from humans to sauté, serve and clean up the table (readwrite.com)


Building My Own ChatGPT with LLaMA (Alpaca-LoRA)

AI|2023. 9. 4. 15:08

Here I'd like to write up my experience fine-tuning the LLaMA 1 model that Meta released. Llama 2 is already out, but Alpaca-LoRA doesn't support Llama 2 yet, so I've written up what I did earlier.

Since the OpenAI API costs money, I figured that for a narrow use case it might be cheaper to generate a dataset with ChatGPT and fine-tune LLaMA locally, even if the quality is a bit lower, so I gave it a try.

As you know, Meta released the LLaMA models (7B, 13B, 33B, 65B). They can be used freely for non-commercial purposes, which set off a flood of news, and Meta has since released LLaMA 2 as well.

See the article below for background.
https://arstechnica.com/information-technology/2023/03/you-can-now-run-a-gpt-3-level-ai-model-on-your-laptop-phone-and-raspberry-pi/

Among these, Alpaca, built by a team at Stanford, provided a way to fine-tune on top of LLaMA.
https://crfm.stanford.edu/2023/03/13/alpaca.html

Alpaca provides a 52K-example dataset, reportedly generated automatically with GPT's text-davinci-003 model (a sample record is sketched after the link below). Fine-tuning it, however, required something like an NVIDIA A100. Then a way to fine-tune on a gaming GPU appeared, by applying LoRA.
https://github.com/tatsu-lab/stanford_alpaca
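For reference, each record in the Alpaca dataset is a JSON object with instruction, input, and output fields. The snippet below only illustrates the shape of the data; the example text is made up, not taken from the real dataset.

import json

# Illustrative Alpaca-style training record (the content here is hypothetical).
example = {
    "instruction": "Summarize the following sentence in five words.",
    "input": "Steve Jobs co-founded Apple and led the development of the iPhone.",
    "output": "Jobs co-founded Apple, created iPhone.",
}

# alpaca_data.json (and its GPT-4 variant) is simply a list of such records.
with open("alpaca_data_sample.json", "w") as f:
    json.dump([example], f, indent=2, ensure_ascii=False)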

That is Alpaca-LoRA, and it's the one I managed to fine-tune successfully.

Related: https://devocean.sk.com/blog/techBoardDetail.do?page=&boardType=undefined&query=&ID=164659&searchData=&subIndex=

Fine-tuning

Fine-tuning in machine learning means adapting a pre-trained model to a specific task or dataset. When working with a large language model (e.g. GPT), you often need to make it better suited to a particular task or domain, and that's where fine-tuning comes in.

GPT → ChatGPT (Fine-tuning)

LoRA

I asked Bard about LoRA, and here is what it said:

---------------

LoRA (Low-Rank Adaptation) is an unsupervised adaptation algorithm used to adapt deep learning models to new datasets. LoRA compresses the model's weights into a low-rank representation and then reconstitutes them to fit the new data. It reduces the model's complexity and can improve generalization to the new dataset.

LoRA is implemented in the following steps:

  1. Compress the model's weights into a low-rank representation.
  2. Compute the loss function on the new dataset.
  3. Update the weights in the direction that minimizes the loss.

LoRA has been shown to work effectively across a wide range of datasets; for example, it has improved performance in areas such as image recognition, natural language processing, and speech recognition.

LoRA is a useful tool for adapting deep learning models to new datasets: it reduces model complexity and can improve generalization performance.

----------------

A more detailed explanation of LoRA: https://devocean.sk.com/blog/techBoardDetail.do?ID=164779&boardType=techBlog
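In code, alpaca-lora applies LoRA through the Hugging Face peft library. The sketch below shows the general shape of that setup; the hyperparameters are only illustrative (roughly the defaults alpaca-lora uses), and this is not a copy of its finetune.py.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative LoRA setup with peft; values are typical, not prescriptive.
base = AutoModelForCausalLM.from_pretrained(
    "yahma/llama-13b-hf",
    load_in_8bit=True,   # requires bitsandbytes; keeps the 13B base model within 24GB VRAM
    device_map="auto",
)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the LoRA update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable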

Thanks to LoRA, fine-tuning became possible on an NVIDIA RTX 4090. Now let's fine-tune with Alpaca-LoRA on an RTX 4090 under Ubuntu 22.04.

Installing the NVIDIA Driver

First, install the NVIDIA driver on Ubuntu 22.04.

https://www.nvidia.com/download/index.aspx

The driver seems to be updated monthly, so update it regularly.

“You appear to be running an X server; please exit X before installing”

If you get the error message above, you are still in an X session; log out, log back in under Weston, and run the installer again.

Installing Python & PyTorch

Now install Python.

$ sudo apt install python3-pip
$ sudo apt install python3-venv
$ python -m venv env
$ source env/bin/activate

Install PyTorch. Note: I installed 2.0.1, which matches CUDA 11.7.

https://pytorch.org/get-started/locally/

Run the code below to verify that the GPU driver, CUDA, and PyTorch are installed correctly.

import torch

# Check if CUDA is available
if torch.cuda.is_available():
	print("CUDA is available.")
	# Print the CUDA device count
	print(f"Number of CUDA devices: {torch.cuda.device_count()}")
	# Print the name of the current CUDA device
	print(f"Current CUDA device name: {torch.cuda.get_device_name(torch.cuda.current_device())}")
else:
	print("CUDA is not available.")

print("pytorch version")
print(torch.__version__)
$ python cuda.py
CUDA is available.
Number of CUDA devices: 1
Current CUDA device name: NVIDIA GeForce RTX 4090
pytorch version
2.0.1+cu117

Installing alpaca-lora

$ git clone https://github.com/tloen/alpaca-lora.git
$ cd alpaca-lora
$ pip install -r requirements.txt

Fine-tuning

 

Once installation is complete, let's fine-tune the 13B model with a dataset newly generated with GPT-4.

$ python3 finetune.py --base_model 'yahma/llama-13b-hf' --data_path alpaca_data_gpt4.json --output_dir './lora-alpaca-13b'

For the base model I used yahma/llama-13b-hf instead of decapoda-research/llama-7b-hf, for the reason described here:

https://github.com/tloen/alpaca-lora/issues/279#issuecomment-1500906791

Inference

Pass the output_dir you used for fine-tuning as --lora_weights.

$ python3 generate.py --load_8bit --base_model 'yahma/llama-13b-hf' --lora_weights ./lora-alpaca-13b

Running on local URL:  http://0.0.0.0:7860

Open the URL above to try it out in the browser.
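generate.py wraps everything in a Gradio UI, but if you want to use the fine-tuned adapter from your own script, the core pattern is loading the base model and then attaching the LoRA weights with peft. Here is a minimal sketch: the paths match the commands above, the prompt template is simplified from the one alpaca-lora actually uses, and the generation settings are illustrative.

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

# Load the 8-bit base model, then attach the LoRA adapter produced by finetune.py.
base_model = "yahma/llama-13b-hf"
tokenizer = LlamaTokenizer.from_pretrained(base_model)
model = LlamaForCausalLM.from_pretrained(base_model, load_in_8bit=True, device_map="auto")
model = PeftModel.from_pretrained(model, "./lora-alpaca-13b")
model.eval()

prompt = "### Instruction:\ntell me about Bill Gates\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))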

Test

Instruction: tell me about Bill Gates

 

Max tokens: 128

Bill Gates is an American business magnate, investor, author, philanthropist, and humanitarian. Gates is the former CEO and current chairman of Microsoft, the software company he founded with Paul Allen. He is consistently ranked among the world's wealthiest people and was the wealthiest overall from 1995 to 2009, excluding 2008, when he was ranked third. During his career at Microsoft, Gates held the positions of CEO and chief software architect, and he remains the largest individual shareholder with more than 6 percent of

Max tokens: 300

Bill Gates is an American business magnate, investor, author, philanthropist, and humanitarian. Gates is the former CEO and current chairman of Microsoft, the software company he founded with Paul Allen. He is consistently ranked among the world's wealthiest people and was the wealthiest overall from 1995 to 2007, excluding 2008, when he was ranked third. During his career at Microsoft, Gates held the positions of CEO and chief software architect, and he remains the largest individual shareholder, with more than 6 percent of the common stock. He has also authored or co-authored several books.
Gates is one of the best-known entrepreneurs of the personal computer revolution. He has been criticized for his business tactics, which have been considered anti-competitive, an opinion that has in some cases been upheld by numerous court rulings. Later in his career Gates pursued a number of philanthropic endeavors, donating large amounts of money to various charitable organizations and scientific research programs through the Bill & Melinda Gates Fou128ndation, established in 2000.
Gates stepped down as chief executive officer of Microsoft in January 2000. He remained as chairman and created the position of chief software architect for himself. In June 2006, G

The output length varies with Max tokens. Because of GPU memory limits, you can't keep increasing Max tokens indefinitely.

I ran a few simple tests.

Instruction: 1+1
output:
2
\end{code}</s>

Instruction: Write a Python program that prints the first 10 Fibonacci numbers.
output:

\begin{code}
def fibonacci(n):
	if n == 0:
    	return 0
	elif n == 1:
    	return 1
	else:
    	return fibonacci(n-1) + fibonacci(n-2)

for i in range(10):
	print(fibonacci(i))
\end{code}

Answer: \begin{code}
def fibonacci(n):
	if n == 0:
    	return 0
	elif n == 1

The code itself is written well, but there's a repetition problem, probably because the Max tokens value is large. It works, but results can take 10 seconds or more to come out. ChatGPT and Bard respond quickly, so I'm curious how they manage to run this as a commercial service.

That's all for now~

 
