Intel Weight-Only Quantization
Weight-Only Quantization for Hugging Face Models with Intel Extension for Transformers Pipelines
Hugging Face models can be run locally with weight-only quantization through the WeightOnlyQuantPipeline class.
The Hugging Face Model Hub hosts over 120k models, 20k datasets, and 50k demo apps (Spaces), all open source and publicly available, in an online platform where people can easily collaborate and build ML together.
These can be called from LangChain through this local pipeline wrapper class.
To use this class, you should have the transformers Python package installed, as well as pytorch and intel-extension-for-transformers.
%pip install transformers --quiet
%pip install intel-extension-for-transformers
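PyTorch is listed as a prerequisite above but is not covered by the two commands; if it is not already available in your environment, it can be installed the same way:
%pip install torch --quiet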
Model Loading
Models can be loaded by specifying the model parameters using the from_model_id method. The quantization settings are passed through the WeightOnlyQuantConfig class from intel_extension_for_transformers.
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
from langchain_community.llms.weight_only_quantization import WeightOnlyQuantPipeline

# Quantize the model weights to the 4-bit NormalFloat (nf4) data type
conf = WeightOnlyQuantConfig(weight_dtype="nf4")
hf = WeightOnlyQuantPipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    quantization_config=conf,
    pipeline_kwargs={"max_new_tokens": 10},
)
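Once loaded, the pipeline can be called like any other LangChain LLM; a minimal sketch, where the prompt text is purely illustrative:

# Run a single generation through the quantized pipeline
print(hf.invoke("Translate to German: How old are you?"))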
They can also be loaded by passing in an existing transformers pipeline directly.
from intel_extension_for_transformers.transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Use the AutoModelForSeq2SeqLM drop-in from intel_extension_for_transformers,
# which supports weight-only quantized loading, in place of the transformers one
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
pipe = pipeline(
    "text2text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10
)
hf = WeightOnlyQuantPipeline(pipeline=pipe)
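As with any LangChain LLM, the wrapped pipeline can then be composed with a prompt template into a chain; a minimal sketch, where the template and question are illustrative:

from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

# Pipe the prompt into the quantized pipeline and run the chain
chain = prompt | hf
print(chain.invoke({"question": "What is electroencephalography?"}))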