Large Language Model
A large language model (LLM) is a computational model notable for its ability to achieve general-purpose language generation and other NLP tasks such as classification. Transformer, the state-of-the-art LLM architecture, is based on the multi-head attention mechanism. Text is converted to numerical representations called tokens, and each token is converted into a vector via looking up from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head softmax-based attention mechanism allowing the signal for key tokens to be amplified and less important tokens to be diminished. GPTs (Generative pre-trained transformers) are based on the decoder-only transformer architecture. Each generation of GPT models is significantly more capable than the previous, due to increased model size (number of trainable parameters) and larger training data.
Llama 3
Smile provides an inference implementation of Llama 3 from Meta AI, the latest version of Llama that is accessible to individuals, organizations, and businesses of all sizes.
To build a model instance, call Llama.build()
with the directory
path of checkpoint files, the path of tokenizer model file, maximum batch size, maximum
sequence length, and optionally CUDA device id:
import smile.llm.llama.*;
var model = Llama.build("model/Llama3-8B-Instruct",
"model/Llama3-8B-Instruct/tokenizer.model",
4, // maximum batch size
8192, // maximum sequence length for input text
0 // CUDA device id
);
For a pretrained model, one should call generate()
method with a batch
of prompots to generate the text. For fine-tuned chat models, we should instead
call chat()
method to generate conversation responses.
// List of conversational dialogs, where each dialog is a list of messages.
Message[][] dialogs = {
{
new Message(Role.user, "what is the recipe of mayonnaise?"),
},
{
new Message(Role.user, "I am going to Paris, what should I see?"),
new Message(Role.assistant, """
Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:
1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.
These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world."""),
new Message(Role.user, "What is so great about #1?"),
},
{
new Message(Role.system, "Always answer with Haiku"),
new Message(Role.user, "I am going to Paris, what should I see?"),
},
{
new Message(Role.system, "Always answer with emojis"),
new Message(Role.user, "How to go from Beijing to NY?"),
},
};
model.chat(dialogs,
2048, // maximum length of the generated text sequence
0.6, // temperature for controlling randomness in sampling
0.9, // top_p probability threshold for nucleus sampling
false, // true to compute token log probabilities
null, // the optional random number generation seed
null // streaming publisher
);
Serving
Smile also provides an LLM inference server for quick start. The script bin/serve.sh
builds and starts the inference server. You should have Node.js and npm installed to build the
frontend.
$ bin/serve.sh --help
SmileServe - Large Language Model (LLM) Inference Server
Usage: smile-serve [options]
--model <value> The model checkpoint directory path
--tokenizer <value> The tokenizer model file path
--max-seq-len <value> The maximum sequence length
--max-batch-size <value> The maximum batch size
--device <value> The CUDA device ID
--help Display the usage information
By default, SmileServe binds at localhost:3801. If you prefer a different port and/or
want to expose the server to other hosts, you may set the binding interface and port
with -J-Dakka.http.server.interface=0.0.0.0
and
-J-Dakka.http.server.port=8000
, for example. The service exposes the API
/v1/chat/completions
that is compatible with OpenAI API.