vllm.transformers_utils.configs.hyperclovax

HyperCLOVA X model configuration.

HyperCLOVAXConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [`HyperCLOVAXModel`]. It is used to instantiate a HyperCLOVAX model according to the specified arguments, defining the model architecture. Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the documentation from [`PretrainedConfig`] for more information.
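
Below is a minimal usage sketch, assuming the module path shown at the top of this page; the keyword arguments simply restate the documented defaults, and the printed values follow from the constructor logic shown in the source further down:

```python
from vllm.transformers_utils.configs.hyperclovax import HyperCLOVAXConfig

# Build a configuration; the explicit keywords below are the documented
# defaults, spelled out for illustration.
config = HyperCLOVAXConfig(
    vocab_size=32000,
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    max_position_embeddings=2048,
)

print(config.model_type)           # "hyperclovax"
print(config.head_dim)             # 4096 // 32 == 128 (derived when head_dim is None)
print(config.num_key_value_heads)  # 32 (falls back to num_attention_heads)
```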

Parameters:

- `vocab_size` (`int`, *optional*, defaults to 32000): Vocabulary size of the HyperCLOVAX model. Defines the number of different tokens that can be represented by the `input_ids` passed when calling [`HyperCLOVAXModel`].
- `hidden_size` (`int`, *optional*, defaults to 4096): Dimension of the hidden representations.
- `intermediate_size` (`int`, *optional*, defaults to 11008): Dimension of the MLP representations.
- `num_hidden_layers` (`int`, *optional*, defaults to 32): Number of hidden layers in the Transformer decoder.
- `num_attention_heads` (`int`, *optional*, defaults to 32): Number of attention heads for each attention layer in the Transformer decoder.
- `num_key_value_heads` (`int`, *optional*): Number of key/value heads used to implement Grouped Query Attention (GQA). If `num_key_value_heads=num_attention_heads`, the model uses Multi-Head Attention (MHA); if `num_key_value_heads=1`, it uses Multi-Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out [this paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, it defaults to `num_attention_heads`. A short configuration sketch illustrating these variants follows this parameter list.
- `hidden_act` (`str` or `function`, *optional*, defaults to `"silu"`): The non-linear activation function (function or string) used in the decoder.
- `max_position_embeddings` (`int`, *optional*, defaults to 2048): The maximum sequence length that this model might ever be used with.
- `initializer_range` (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated normal initializer for initializing all weight matrices.
- `rms_norm_eps` (`float`, *optional*, defaults to 1e-06): The epsilon used by the RMS normalization layers.
- `use_cache` (`bool`, *optional*, defaults to `True`): Whether or not the model should return the last key/value attentions (not used by all models). Only relevant if `config.is_decoder=True`.
- `pad_token_id` (`int`, *optional*): Padding token id.
- `bos_token_id` (`int`, *optional*, defaults to 1): Beginning-of-stream token id.
- `eos_token_id` (`int`, *optional*, defaults to 2): End-of-stream token id.
- `pretraining_tp` (`int`, *optional*, defaults to 1): Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please also refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).
- `tie_word_embeddings` (`bool`, *optional*, defaults to `False`): Whether to tie the word embeddings.
- `rope_theta` (`float`, *optional*, defaults to 10000.0): The base period of the RoPE embeddings.
- `rope_scaling` (`Dict`, *optional*): Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply a new rope type and expect the model to work on a longer `max_position_embeddings`, we recommend updating this value accordingly. Expected contents:
    - `rope_type` (`str`): The sub-variant of RoPE to use. Can be one of `['default', 'linear', 'dynamic', 'yarn', 'longrope', 'llama3']`, with `'default'` being the original RoPE implementation.
    - `factor` (`float`, *optional*): Used with all rope types except `'default'`. The scaling factor to apply to the RoPE embeddings. In most scaling types, a `factor` of x enables the model to handle sequences of length x * the original maximum pre-trained length.
    - `original_max_position_embeddings` (`int`, *optional*): Used with `'dynamic'`, `'longrope'` and `'llama3'`. The original max position embeddings used during pretraining.
    - `attention_factor` (`float`, *optional*): Used with `'yarn'` and `'longrope'`. The scaling factor applied to the attention computation. If unspecified, it defaults to the value recommended by the implementation, using the `factor` field to infer the suggested value.
    - `beta_fast` (`float`, *optional*): Only used with `'yarn'`. Parameter to set the boundary for extrapolation (only) in the linear ramp function. If unspecified, it defaults to 32.
    - `beta_slow` (`float`, *optional*): Only used with `'yarn'`. Parameter to set the boundary for interpolation (only) in the linear ramp function. If unspecified, it defaults to 1.
    - `short_factor` (`List[float]`, *optional*): Only used with `'longrope'`. The scaling factor applied to short contexts (shorter than `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.
    - `long_factor` (`List[float]`, *optional*): Only used with `'longrope'`. The scaling factor applied to long contexts (longer than `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden size divided by the number of attention heads divided by 2.
    - `low_freq_factor` (`float`, *optional*): Only used with `'llama3'`. Scaling factor applied to the low-frequency components of the RoPE.
    - `high_freq_factor` (`float`, *optional*): Only used with `'llama3'`. Scaling factor applied to the high-frequency components of the RoPE.
- `attention_bias` (`bool`, *optional*, defaults to `False`): Whether to use a bias in the query, key, value and output projection layers during self-attention.
- `attention_dropout` (`float`, *optional*, defaults to 0.0): The dropout ratio for the attention probabilities.
- `mlp_bias` (`bool`, *optional*, defaults to `False`): Whether to use a bias in the `up_proj`, `down_proj` and `gate_proj` layers of the MLP.
- `head_dim` (`int`, *optional*): The attention head dimension. If `None`, it defaults to `hidden_size // num_attention_heads`.
- `embedding_multiplier` (`float`, *optional*, defaults to `None`): Multiplier applied to the embedding weights. If `None`, it is equivalent to `1.0`.
- `logits_scaling` (`float`, *optional*, defaults to `None`): Scaling factor for the logits. If `None`, it is equivalent to `1.0`.
- `attention_multiplier` (`float`, *optional*, defaults to `None`): Multiplier applied to the attention weights. If `None`, it is equivalent to `self.head_dim ** -0.5`.
- `residual_multiplier` (`float`, *optional*, defaults to `None`): Scaling factor for residual connections. If `None`, it is equivalent to `1.0`.
- `use_post_norm` (`bool`, *optional*, defaults to `True`): Whether to apply Peri-Layer Normalization (Peri-LN). Set to `False` to disable this feature.
- `rope_parameters` (`dict`, *optional*): Dictionary containing the RoPE parameters used by vLLM's `get_rope`. When provided, it takes precedence over `rope_theta` and `rope_scaling`. If `None`, it is derived from `rope_theta` and `rope_scaling` automatically.
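
As noted in the `num_key_value_heads` entry above, the attention variant is determined entirely by the ratio of query heads to key/value heads. A minimal sketch, with head counts chosen purely for illustration:

```python
from vllm.transformers_utils.configs.hyperclovax import HyperCLOVAXConfig

# Multi-Head Attention (MHA): num_key_value_heads defaults to num_attention_heads.
mha = HyperCLOVAXConfig(num_attention_heads=32)
assert mha.num_key_value_heads == 32

# Grouped Query Attention (GQA): fewer key/value heads than query heads,
# so 4 query heads share each key/value head here.
gqa = HyperCLOVAXConfig(num_attention_heads=32, num_key_value_heads=8)
assert gqa.num_key_value_heads == 8

# Multi-Query Attention (MQA): a single key/value head shared by all query heads.
mqa = HyperCLOVAXConfig(num_attention_heads=32, num_key_value_heads=1)
assert mqa.num_key_value_heads == 1
```
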
Source code in vllm/transformers_utils/configs/hyperclovax.py
```python
class HyperCLOVAXConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a
    [`HyperCLOVAXModel`]. It is used to instantiate a HyperCLOVAX model
    according to the specified arguments, defining the model architecture.
    Configuration objects inherit from [`PretrainedConfig`] and can be used
    to control the model outputs. Read the documentation from
    [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the HyperCLOVAX model. Defines the number of
            different tokens that can be represented by the `input_ids`
            passed when calling [`HyperCLOVAXModel`]
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 11008):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the
            Transformer decoder.
        num_key_value_heads (`int`, *optional*):
            This is the number of key_value heads that should be used to
            implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use
            Multi Head Attention (MHA), if `num_key_value_heads=1` the model
            will use Multi Query Attention (MQA) otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each
            group key and value head should be constructed by meanpooling all
            the original heads within that group. For more details, check out
            [this paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not
            specified, it will default to `num_attention_heads`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the
            decoder.
        max_position_embeddings (`int`, *optional*, defaults to 2048):
            The maximum sequence length that this model might ever be used
            with.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for
            initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values
            attentions (not used by all models). Only relevant if
            `config.is_decoder=True`.
        pad_token_id (`int`, *optional*):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 1):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 2):
            End of stream token id.
        pretraining_tp (`int`, *optional*, defaults to 1):
            Experimental feature. Tensor parallelism rank used during
            pretraining. Please refer to [this document](https://huggingface.
            co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism)
            to understand more about it. This value is necessary to ensure
            exact reproducibility of the pretraining results. Please refer to
            [this issue](https://github.com/pytorch/pytorch/issues/76232).
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie the word embeddings.
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        rope_scaling (`Dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE
            embeddings. NOTE: if you apply a new rope type and expect the
            model to work on a longer `max_position_embeddings`, we recommend
            updating this value accordingly.
            Expected contents:
                `rope_type` (`str`):
                    The sub-variant of RoPE to use. Can be one of ['default',
                    'linear', 'dynamic', 'yarn', 'longrope', 'llama3'], with
                    'default' being the original RoPE implementation.
                `factor` (`float`, *optional*):
                    Used with all rope types except 'default'. The scaling
                    factor to apply to the RoPE embeddings. In most scaling
                    types, a `factor` of x will enable the model to handle
                    sequences of length x * original maximum pre-trained
                    length.
                `original_max_position_embeddings` (`int`, *optional*):
                    Used with 'dynamic', 'longrope' and 'llama3'. The
                    original max position embeddings used during pretraining.
                `attention_factor` (`float`, *optional*):
                    Used with 'yarn' and 'longrope'. The scaling factor to be
                    applied on the attention computation. If unspecified, it
                    defaults to value recommended by the implementation, using
                    the `factor` field to infer the suggested value.
                `beta_fast` (`float`, *optional*):
                    Only used with 'yarn'. Parameter to set the boundary for
                    extrapolation (only) in the linear ramp function. If
                    unspecified, it defaults to 32.
                `beta_slow` (`float`, *optional*):
                    Only used with 'yarn'. Parameter to set the boundary for
                    interpolation (only) in the linear ramp function. If
                    unspecified, it defaults to 1.
                `short_factor` (`List[float]`, *optional*):
                    Only used with 'longrope'. The scaling factor to be
                    applied to short contexts (<
                    `original_max_position_embeddings`). Must be a list of
                    numbers with the same length as the hidden size divided
                    by the number of attention heads divided by 2
                `long_factor` (`List[float]`, *optional*):
                    Only used with 'longrope'. The scaling factor to be
                    applied to long contexts (>
                    `original_max_position_embeddings`). Must be a list of
                    numbers with the same length as the hidden size divided
                    by the number of attention heads divided by 2
                `low_freq_factor` (`float`, *optional*):
                    Only used with 'llama3'. Scaling factor applied to low
                    frequency components of the RoPE
                `high_freq_factor` (`float`, *optional*):
                    Only used with 'llama3'. Scaling factor applied to high
                    frequency components of the RoPE
        attention_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in the query, key, value and output
            projection layers during self-attention.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        mlp_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in up_proj, down_proj and gate_proj layers
            in the MLP layers.
        head_dim (`int`, *optional*):
            The attention head dimension. If `None`, it will default to
            `hidden_size // num_attention_heads`.
        embedding_multiplier (`float`, *optional*, defaults to `None`):
            Multiplier applied to the embedding weights. If `None`, it is
            equivalent to `1.0`.
        logits_scaling (`float`, *optional*, defaults to `None`):
            Scaling factor for logits. If `None`, it is equivalent to `1.0`.
        attention_multiplier (`float`, *optional*, defaults to `None`):
            Multiplier applied to the attention weights. If `None`, it is
            equivalent to `self.head_dim ** -0.5`.
        residual_multiplier (`float`, *optional*, defaults to `None`):
            Scaling factor for residual connections. If `None`, it is
            equivalent to `1.0`.
        use_post_norm (`bool`, *optional*, defaults to `True`):
            Determines whether to apply Peri-Layer Normalization. Set to
            False to disable this feature.
        rope_parameters (`dict`, *optional*):
            Dictionary containing the RoPE parameters used by vLLM's
            `get_rope`. When provided, takes precedence over `rope_theta`
            and `rope_scaling`. If `None`, it is derived from `rope_theta`
            and `rope_scaling` automatically.
    """

    model_type = "hyperclovax"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=32000,
        hidden_size=4096,
        intermediate_size=11008,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=None,
        hidden_act="silu",
        max_position_embeddings=2048,
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=1,
        eos_token_id=2,
        pretraining_tp=1,
        tie_word_embeddings=False,
        rope_theta=10000.0,
        rope_scaling=None,
        attention_bias=False,
        attention_dropout=0.0,
        mlp_bias=False,
        head_dim=None,
        embedding_multiplier=None,  # mup
        logits_scaling=None,  # mup
        attention_multiplier=None,  # mup
        residual_multiplier=None,  # mup
        use_post_norm=True,  # post-norm(peri-LN)
        rope_parameters=None,
        auto_map=None,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.pretraining_tp = pretraining_tp
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout
        self.mlp_bias = mlp_bias
        self.head_dim = (
            head_dim
            if head_dim is not None
            else self.hidden_size // self.num_attention_heads
        )
        # Derive rope_parameters for vLLM's get_rope() from rope_theta /
        # rope_scaling, unless the caller already provided rope_parameters.
        if rope_parameters is None:
            if rope_scaling is not None:
                # Shallow-copy to avoid mutating the caller's dict.
                rope_parameters = dict(rope_scaling)
                # BC: 'type' field -> 'rope_type', remove stale key.
                if "type" in rope_parameters:
                    rope_parameters.setdefault("rope_type", rope_parameters.pop("type"))
            else:
                rope_parameters = {"rope_type": "default"}
            if "rope_theta" not in rope_parameters:
                rope_parameters["rope_theta"] = rope_theta
        self.rope_parameters = rope_parameters
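        # Illustrative example (hypothetical values): rope_theta=10000.0 with
        # rope_scaling={"type": "linear", "factor": 2.0} yields
        # rope_parameters == {"factor": 2.0, "rope_type": "linear",
        # "rope_theta": 10000.0}: the legacy "type" key is renamed to
        # "rope_type" and rope_theta is folded into the dict.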

        # BC: keep self.rope_scaling consistent for HF serialization.
        if self.rope_scaling is not None and "type" in self.rope_scaling:
            self.rope_scaling["rope_type"] = self.rope_scaling["type"]

        # mup
        self.embedding_multiplier = (
            embedding_multiplier if embedding_multiplier is not None else 1.0
        )
        self.logits_scaling = logits_scaling if logits_scaling is not None else 1.0
        self.attention_multiplier = (
            attention_multiplier
            if attention_multiplier is not None
            else self.head_dim**-0.5
        )
        self.residual_multiplier = (
            residual_multiplier if residual_multiplier is not None else 1.0
        )

        # post-norm (Peri-LN)
        self.use_post_norm = use_post_norm

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            auto_map=auto_map,
            **kwargs,
        )
```
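
As a quick check of the derived attributes, here is a minimal sketch using only default arguments; the expected values in the comments follow directly from the constructor logic above:

```python
from vllm.transformers_utils.configs.hyperclovax import HyperCLOVAXConfig

cfg = HyperCLOVAXConfig()  # all defaults

# rope_parameters is derived when neither rope_scaling nor rope_parameters
# is supplied.
print(cfg.rope_parameters)        # {'rope_type': 'default', 'rope_theta': 10000.0}

# muP-style multipliers fall back to neutral values when left as None.
print(cfg.embedding_multiplier)   # 1.0
print(cfg.logits_scaling)         # 1.0
print(cfg.residual_multiplier)    # 1.0
print(cfg.attention_multiplier)   # 128 ** -0.5, i.e. head_dim ** -0.5

# Peri-LN (post-norm) is enabled by default.
print(cfg.use_post_norm)          # True
```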