Class: Clacky::Media::DashScope

Inherits: Base

Inheritance chain: Object, Base, Clacky::Media::DashScope

Defined in: lib/clacky/media/dashscope.rb

Overview

Alibaba DashScope (Qwen-Image / CosyVoice / HappyHorse) media generation provider.

DashScope is NOT an OpenAI-compatible API. It has its own endpoint, request envelope and response schema for image, speech (TTS), and video generation.

Routing: Generator sends any base_url under *.aliyuncs.com here. We derive the real generation endpoint from the host so users can paste the compatible-mode base_url (…/compatible-mode/v1) they already use for Qwen text models and still get working media generation.
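
The host-based derivation described above can be sketched as follows. This is a minimal illustration, not the class's actual code: `sketch_endpoint_base` is a hypothetical helper name, and the assumption that only the scheme and host are kept (with the path discarded) is inferred from the description rather than taken from the source.

```ruby
require "uri"

GENERATION_PATH = "/api/v1/services/aigc/multimodal-generation/generation"

# Hypothetical helper (not part of Clacky): reduce any *.aliyuncs.com
# base_url to scheme + host, dropping path segments such as
# /compatible-mode/v1. Returns nil for anything else.
def sketch_endpoint_base(base_url)
  uri = URI.parse(base_url.to_s.strip)
  return nil unless uri.is_a?(URI::HTTP) && uri.host

  host = uri.host.downcase
  return nil unless host.end_with?(".aliyuncs.com")

  "#{uri.scheme}://#{host}"
rescue URI::InvalidURIError
  nil
end

base = sketch_endpoint_base("https://dashscope.aliyuncs.com/compatible-mode/v1")
puts "#{base}#{GENERATION_PATH}"
# prints https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
```

Under these assumptions a per-workspace MaaS base_url would pass through the same helper unchanged apart from its path, which matches the statement that the new domain already works today.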

Endpoint migration TODO (2026-06): Aliyun is gradually deprecating the shared dashscope.aliyuncs.com host in favor of the per-workspace MaaS domain https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com (international: {WorkspaceId}.dashscope-intl.aliyuncs.com). The docs have already moved to the new domain; the old host still works for most models but is expected to be sunset eventually.

Current stance: keep accepting the old shared host as the default (zero-config for users + compatibility with third-party aggregators that don't use aliyuncs.com at all). The new MaaS domain already works today via endpoint_base derivation. Non-real-time TTS (qwen3-tts) does NOT work on the shared host and already emits a hint pointing users at the MaaS domain — see the "url error" branch in generate_speech.

Action when Aliyun announces the sunset of compatible-mode:

1. Flip the default expectation to the WorkspaceId MaaS domain.
2. Add a setup flow / docs explaining how to find WorkspaceId.
3. Keep accepting aggregator base_urls unchanged.

Do NOT pre-emptively migrate before an official sunset notice — it would break zero-config UX and aggregator users for no current gain.

Constant Summary

GENERATION_PATH =

"/api/v1/services/aigc/multimodal-generation/generation"

SPEECH_PATH_COSY =

"/api/v1/services/audio/tts/SpeechSynthesizer"

VIDEO_PATH =

"/api/v1/services/aigc/video-generation/video-synthesis"

TASK_PATH =

"/api/v1/tasks/"

DEFAULT_SPEECH_VOICE_COSY =

Default voice per TTS model family. CosyVoice defaults to longanyang; Qwen3-TTS defaults to Cherry (the most common Chinese female voice).

"longanyang"

DEFAULT_SPEECH_VOICE_QWEN =

"Cherry"

ASPECT_TO_SIZE_V2 =

Maps aspect_ratio to a "width*height" size string (DashScope uses '*', not 'x', as the separator). qwen-image-2.0 / -plus / -max share these recommended resolutions; the 2.0 series accepts arbitrary sizes within 512*512..2048*2048, while the max/plus series accept only a fixed set, so we stick to values that are valid for every family.

{
  "landscape" => "2688*1536", # 16:9
  "square"    => "2048*2048", # 1:1
  "portrait"  => "1536*2688"  # 9:16
}.freeze

ASPECT_TO_SIZE_MAX_PLUS =

{
  "landscape" => "1664*928",  # 16:9
  "square"    => "1328*1328", # 1:1
  "portrait"  => "928*1664"   # 9:16
}.freeze

DEFAULT_ASPECT =

"landscape"

PROVIDER_ID =

"qwen"
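
The aspect-ratio lookup with its fallback to DEFAULT_ASPECT can be sketched as below. The private `size_table` method is not shown on this page, so the rule used here to pick a table (model names containing "2.0" use ASPECT_TO_SIZE_V2, everything else uses ASPECT_TO_SIZE_MAX_PLUS) is purely an assumption for illustration, and both `sketch_` helpers are hypothetical names.

```ruby
ASPECT_TO_SIZE_V2 = {
  "landscape" => "2688*1536", # 16:9
  "square"    => "2048*2048", # 1:1
  "portrait"  => "1536*2688"  # 9:16
}.freeze

ASPECT_TO_SIZE_MAX_PLUS = {
  "landscape" => "1664*928",  # 16:9
  "square"    => "1328*1328", # 1:1
  "portrait"  => "928*1664"   # 9:16
}.freeze

DEFAULT_ASPECT = "landscape"

# Hypothetical selector; the real size_table may use a different rule.
# Assumption: "2.0" in the model name marks the 2.0 series.
def sketch_size_table(model)
  model.to_s.include?("2.0") ? ASPECT_TO_SIZE_V2 : ASPECT_TO_SIZE_MAX_PLUS
end

# Mirrors the first two lines of generate_image: unknown aspect ratios
# fall back to DEFAULT_ASPECT instead of raising.
def sketch_size_for(model, aspect_ratio)
  table  = sketch_size_table(model)
  aspect = table.key?(aspect_ratio) ? aspect_ratio : DEFAULT_ASPECT
  [aspect, table[aspect]]
end

p sketch_size_for("qwen-image-2.0", "portrait")    # ["portrait", "1536*2688"]
p sketch_size_for("qwen-image-plus", "widescreen") # ["landscape", "1664*928"]
```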

Instance Method Summary

#generate_image(prompt:, aspect_ratio: DEFAULT_ASPECT, output_dir: nil, n: 1, **_kwargs) ⇒ Object
#generate_speech(input:, voice: nil, output_dir: nil, language_type: nil, **_kwargs) ⇒ Hash
Synthesizes speech (TTS) using Alibaba CosyVoice (e.g. cosyvoice-v3-flash) or Qwen3-TTS models.
#generate_video(prompt:, aspect_ratio: "landscape", duration_seconds: nil, output_dir: nil, **_kwargs) ⇒ Hash
Generates a video using Alibaba HappyHorse or Wanx models.

Methods inherited from Base

#generate_transcription, #initialize, #understand_video

Constructor Details

This class inherits a constructor from Clacky::Media::Base

Instance Method Details

#generate_image(prompt:, aspect_ratio: DEFAULT_ASPECT, output_dir: nil, n: 1, **_kwargs) ⇒ `Object`

# File 'lib/clacky/media/dashscope.rb', line 73

def generate_image(prompt:, aspect_ratio: DEFAULT_ASPECT, output_dir: nil, n: 1, **_kwargs)
  aspect = size_table.key?(aspect_ratio) ? aspect_ratio : DEFAULT_ASPECT
  size   = size_table[aspect]

  if prompt.to_s.strip.empty?
    return error_response(
      error: "Prompt is required and must be a non-empty string",
      error_type: "invalid_argument",
      provider: PROVIDER_ID,
      aspect_ratio: aspect
    )
  end

  if @api_key.to_s.empty?
    return error_response(
      error: "api_key not configured for image model '#{@model}'",
      error_type: "auth_required",
      provider: PROVIDER_ID,
      prompt: prompt,
      aspect_ratio: aspect
    )
  end

  payload = {
    model: @model,
    input: {
      messages: [
        { role: "user", content: [{ text: prompt }] }
      ]
    },
    parameters: {
      size: size,
      n: n,
      prompt_extend: true,
      watermark: false
    }
  }

  begin
    response = connection.post(GENERATION_PATH) do |req|
      req.headers["Content-Type"]  = "application/json"
      req.headers["Authorization"] = "Bearer #{@api_key}"
      req.body = JSON.generate(payload)
    end
  rescue Faraday::Error => e
    return error_response(
      error: "HTTP request failed: #{e.message}",
      error_type: "network_error",
      provider: PROVIDER_ID,
      prompt: prompt,
      aspect_ratio: aspect
    )
  end

  body = parse_json(response.body)
  unless body.is_a?(Hash)
    return error_response(
      error: "Invalid JSON response from upstream",
      error_type: "invalid_response",
      provider: PROVIDER_ID,
      prompt: prompt,
      aspect_ratio: aspect
    )
  end

  # DashScope reports business failures via top-level code/message,
  # sometimes alongside a non-2xx status, sometimes 200.
  if body["code"] && !body["code"].to_s.empty?
    return error_response(
      error: "Upstream error #{body["code"]}: #{body["message"]}",
      error_type: "api_error",
      provider: PROVIDER_ID,
      prompt: prompt,
      aspect_ratio: aspect
    )
  end

  unless response.success?
    return error_response(
      error: "Upstream #{response.status}: #{truncate(response.body, 500)}",
      error_type: "api_error",
      provider: PROVIDER_ID,
      prompt: prompt,
      aspect_ratio: aspect
    )
  end

  image_url = extract_image_url(body)
  if image_url.nil?
    return error_response(
      error: "Upstream returned no image data",
      error_type: "empty_response",
      provider: PROVIDER_ID,
      prompt: prompt,
      aspect_ratio: aspect
    )
  end

  local_path = save_image_from_url(image_url, output_dir: output_dir || Dir.pwd, prefix: "img")
  if local_path.nil?
    return error_response(
      error: "Failed to download generated image from #{image_url}",
      error_type: "download_failed",
      provider: PROVIDER_ID,
      prompt: prompt,
      aspect_ratio: aspect
    )
  end

  usage = body["usage"]
  success_response(
    image: local_path,
    prompt: prompt,
    aspect_ratio: aspect,
    provider: PROVIDER_ID,
    extra: {
      "size"       => size,
      "usage"      => usage,
      "request_id" => body["request_id"]
    }.compact
  )
end
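
The order of the post-request checks in generate_image matters: the JSON body is parsed first, a top-level business `code` is reported next (even on HTTP 200), and only then is the HTTP status consulted. The stand-alone sketch below reproduces that ordering for illustration. `sketch_classify` is a hypothetical function, not part of the class, and it takes a plain status integer and body string instead of a Faraday response.

```ruby
require "json"

# Hypothetical stand-alone version of the checks generate_image runs
# after the HTTP call. Returns nil when the response looks usable,
# otherwise an [error_type, message] pair.
def sketch_classify(status, raw_body)
  body = begin
    JSON.parse(raw_body.to_s)
  rescue JSON::ParserError
    nil
  end
  return ["invalid_response", "Invalid JSON response from upstream"] unless body.is_a?(Hash)

  # Business failures arrive as a top-level code/message pair,
  # sometimes with a 200 status, so this check comes before the status.
  code = body["code"].to_s
  return ["api_error", "Upstream error #{code}: #{body["message"]}"] unless code.empty?

  return ["api_error", "Upstream #{status}"] unless (200..299).cover?(status)

  nil
end

p sketch_classify(200, '{"code":"InvalidParameter","message":"bad size"}')
# ["api_error", "Upstream error InvalidParameter: bad size"]
```

Checking the business code before the status gives the caller the more specific upstream message whenever one is available.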

#generate_speech(input:, voice: nil, output_dir: nil, language_type: nil, **_kwargs) ⇒ `Hash`

Synthesizes speech (TTS) using Alibaba CosyVoice (e.g. cosyvoice-v3-flash) or Qwen3-TTS models. The endpoint, payload shape, and default voice are chosen by model family. This is a synchronous call.

Parameters:

input (String) —
the text to synthesize
voice (String, nil) (defaults to: nil) —
the voice name; defaults to "longanyang" for CosyVoice or "Cherry" for Qwen3-TTS
output_dir (String, nil) (defaults to: nil) —
the directory to save the output audio
language_type (String, nil) (defaults to: nil) —
language hint for Qwen3-TTS (default "Chinese"); ignored by CosyVoice

Returns:

(Hash) —
audio_success_response or audio_error_response

# File 'lib/clacky/media/dashscope.rb', line 204

def generate_speech(input:, voice: nil, output_dir: nil, language_type: nil, **_kwargs)
  if input.to_s.strip.empty?
    return audio_error_response(
      error: "Input text is required and must be a non-empty string",
      error_type: "invalid_argument",
      provider: PROVIDER_ID,
      voice: voice.to_s
    )
  end

  if @api_key.to_s.empty?
    return audio_error_response(
      error: "api_key not configured for audio model '#{@model}'",
      error_type: "auth_required",
      provider: PROVIDER_ID,
      input: input,
      voice: voice.to_s
    )
  end

  # Pick endpoint and payload shape based on model family. CosyVoice
  # uses the dedicated TTS endpoint and accepts format/sample_rate;
  # Qwen3-TTS is a multimodal-generation model and expects
  # language_type instead.
  endpoint     = speech_endpoint
  chosen_voice = voice || default_speech_voice
  payload      = speech_payload(input: input, voice: chosen_voice, language_type: language_type)

  begin
    response = connection.post(endpoint) do |req|
      req.headers["Content-Type"]  = "application/json"
      req.headers["Authorization"] = "Bearer #{@api_key}"
      req.body = JSON.generate(payload)
    end
  rescue Faraday::Error => e
    return audio_error_response(
      error: "HTTP request failed: #{e.message}",
      error_type: "network_error",
      provider: PROVIDER_ID,
      input: input,
      voice: voice.to_s
    )
  end

  body = parse_json(response.body)
  unless body.is_a?(Hash)
    return audio_error_response(
      error: "Invalid JSON response from upstream",
      error_type: "invalid_response",
      provider: PROVIDER_ID,
      input: input,
      voice: voice.to_s
    )
  end

  # Inspect any business level errors from DashScope
  if body["code"] && !body["code"].to_s.empty?
    err_msg = body["message"].to_s
    if err_msg.include?("url error") && @base_url.to_s.include?("dashscope.aliyuncs.com")
      err_msg += " (Note: Alibaba Model Studio non-real-time TTS does not support the public shared endpoint. " \
                 "Set the model's Base URL to your dedicated MaaS domain, e.g. " \
                 "https://{WorkspaceId}.cn-beijing.maas.aliyuncs.com)"
    end
    return audio_error_response(
      error: "Upstream error #{body["code"]}: #{err_msg}",
      error_type: "api_error",
      provider: PROVIDER_ID,
      input: input,
      voice: voice.to_s
    )
  end

  unless response.success?
    return audio_error_response(
      error: "Upstream #{response.status}: #{truncate(response.body, 500)}",
      error_type: "api_error",
      provider: PROVIDER_ID,
      input: input,
      voice: voice.to_s
    )
  end

  audio_url = body.dig("output", "audio", "url")
  if audio_url.nil? || audio_url.empty?
    return audio_error_response(
      error: "Upstream returned no audio data",
      error_type: "empty_response",
      provider: PROVIDER_ID,
      input: input,
      voice: voice.to_s
    )
  end

  # Download the audio file from OSS and save it locally in the target output directory
  local_path = save_image_from_url(audio_url, output_dir: output_dir || Dir.pwd, prefix: "tts", extension: "wav")
  if local_path.nil?
    return audio_error_response(
      error: "Failed to download generated audio from #{audio_url}",
      error_type: "download_failed",
      provider: PROVIDER_ID,
      input: input,
      voice: voice.to_s
    )
  end

  audio_success_response(
    audio: local_path,
    input: input,
    voice: chosen_voice,
    provider: PROVIDER_ID,
    extra: {
      "request_id" => body["request_id"]
    }.compact
  )
end
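
The per-family routing that generate_speech delegates to `speech_endpoint` and `default_speech_voice` might look roughly like the sketch below. Those private helpers are not shown on this page, so this is a guess at their shape: the prefix test on the model name and the use of GENERATION_PATH for Qwen3-TTS (inferred from the comment that it is a multimodal-generation model) are assumptions, and the `sketch_` names are hypothetical.

```ruby
GENERATION_PATH  = "/api/v1/services/aigc/multimodal-generation/generation"
SPEECH_PATH_COSY = "/api/v1/services/audio/tts/SpeechSynthesizer"

DEFAULT_SPEECH_VOICE_COSY = "longanyang"
DEFAULT_SPEECH_VOICE_QWEN = "Cherry"

# Assumption: a model whose name starts with "cosyvoice" belongs to the
# CosyVoice family; anything else is treated as Qwen3-TTS.
def sketch_cosyvoice?(model)
  model.to_s.downcase.start_with?("cosyvoice")
end

# Returns the endpoint path and the voice to use. An explicit voice
# always wins over the family default.
def sketch_speech_route(model, voice: nil)
  if sketch_cosyvoice?(model)
    { endpoint: SPEECH_PATH_COSY, voice: voice || DEFAULT_SPEECH_VOICE_COSY }
  else
    { endpoint: GENERATION_PATH, voice: voice || DEFAULT_SPEECH_VOICE_QWEN }
  end
end

p sketch_speech_route("cosyvoice-v3-flash")
p sketch_speech_route("qwen3-tts", voice: "Ethan") # "Ethan" is just an illustrative voice name
```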

#generate_video(prompt:, aspect_ratio: "landscape", duration_seconds: nil, output_dir: nil, **_kwargs) ⇒ `Hash`

Generates a video using Alibaba HappyHorse or Wanx models. The video API is asynchronous only: the method submits a task, then polls the task status until it succeeds, fails, or times out.
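
A submit-then-poll loop of this kind can be sketched as follows. The method body is not shown on this page, so everything here is illustrative: `sketch_poll_task` is a hypothetical function, `fetch_status` is a stand-in callable for the HTTP GET against TASK_PATH plus the task id, the interval and timeout defaults are arbitrary, and the status strings (SUCCEEDED, FAILED, CANCELED, UNKNOWN) are assumed rather than confirmed by this source.

```ruby
TASK_PATH = "/api/v1/tasks/"

# Polls until the task reaches a terminal state or the deadline passes.
# fetch_status: callable taking the task path and returning a status string.
# sleeper:      injectable so the loop can be exercised without waiting.
def sketch_poll_task(task_id, fetch_status:, interval: 5, timeout: 600,
                     sleeper: ->(seconds) { sleep(seconds) })
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  loop do
    status = fetch_status.call("#{TASK_PATH}#{task_id}")
    return :succeeded if status == "SUCCEEDED"
    return :failed if %w[FAILED CANCELED UNKNOWN].include?(status)
    return :timeout if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleeper.call(interval)
  end
end

# Usage with a canned sequence of statuses instead of real HTTP calls.
statuses = %w[PENDING RUNNING SUCCEEDED]
result = sketch_poll_task("task-123",
                          fetch_status: ->(_path) { statuses.shift },
                          interval: 0,
                          sleeper: ->(_seconds) {})
p result # :succeeded
```

A monotonic clock is used for the deadline so that wall-clock adjustments during a long-running task cannot shorten or extend the timeout.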