
Follow My Reading

Overview

Follow My Reading helps people who struggle with pronunciation while reading aloud. It provides an API service through which users upload an image of a text and an audio recording of their reading session; the service then checks the audio for pronunciation mistakes.

Here is how it works: users take a photo of the text, read it aloud, and record the audio on their device. The platform compares the audio against the text in the image and reports the areas that need improvement. Several deep neural network models are used to detect mispronounced words.

Moreover, the "Follow My Reading" project has been designed to be highly customizable and easily configurable to meet the needs of different users and applications. The system administrator has the flexibility to add or remove models for audio and image processing as needed, making it a very versatile system.

Models can be added or removed quickly thanks to the plugin system, which lets an administrator create a custom audio or image processing plugin. Through plugins, an administrator can add custom models, extend the functionality of existing models, integrate third-party models, or even train their own processing models, making "Follow My Reading" even more powerful.

Documentation Overview

  • Overview

    This section provides an introduction to the product and its features, along with a summary of the content and structure of the documentation.

  • Installation

    This section covers the steps required to install the product, including prerequisites, system requirements, installation options, and troubleshooting tips. This section may also include information on how to update or uninstall the product.

  • Deployment

    This section provides guidance on deploying the product in different environments or scenarios, such as on-premise, cloud, or hybrid deployments. It may cover topics such as scaling, fault-tolerance, security, and monitoring.

  • Plugins

    This section explains how to create and manage plugins, which provide additional functionality to the product. It includes information on how to create or customize plugins, as well as best practices for using plugins effectively.

  • Algorithms

    This section explains the algorithms that the product relies on. It includes a description of each algorithm, what it accepts, and what it returns.

  • API

    This section documents the product's API and provides guidance on how to use it. This may include information on supported protocols, authentication, rate limiting, and error handling. Sample code snippets and use cases may also be provided.

  • Advanced

    This section covers more advanced topics, such as performance optimization, customization, integration with other systems, and troubleshooting complex issues. This section also includes an explanation of the task system functionality.

Full list of Features

  • ✅ Image and audio upload
  • ✅ Audio Processing
  • ✅ Splitting audio by words or by phrases
  • ✅ Image Processing
  • ✅ Reporting text coordinates on the image
  • ✅ Comparing audio and image
  • ✅ Comparing audio and text
  • ✅ Extracting audio by given phrases
  • ✅ Plugin Support
  • ✅ Distributed computing using Task System
  • ✅ Authentication

Installation

Before using Follow My Reading locally, install the following prerequisites.

Prerequisites

1. Install Python 3.10

Python is required for the installation of Follow My Reading. If Python is not already installed on your device, download and install it from the official Python website.

2. Install pip

Pip is a package manager for Python packages. It allows you to install and manage additional packages that are not included with Python by default. To install Pip, follow the instructions below:

  • For Windows:
py -m ensurepip --upgrade
  • For Linux/MacOS:
python -m ensurepip --upgrade

3. Install Poetry

Poetry is a Python packaging and dependency management tool. You can install it by running the following command:

pip install poetry

4. Get the source code:

Once you have access to the repository, clone it with the following command:

git clone https://gitlab.pg.innopolis.university/a.kudryavtsev/follow-my-reading.git

5. Install project dependencies

To install all project dependencies, use the following command:

poetry install

These steps will ensure that you have everything required to be able to install and use Follow My Reading.

6. [Optional] Model Dependencies

Several models require additional steps to set up.

Tesseract

Deployment

Follow My Reading can be deployed in several ways depending on your requirements. Below are instructions for deploying Follow My Reading in different ways.

Stand-alone

If you want to run Follow My Reading as a stand-alone Docker container, you can run the following command:

make standalone

This will build and run the Follow My Reading Docker container.

Launch

If you want to run Follow My Reading locally with Redis and Huey, you need to run the following commands:

  • Run the Redis server:
redis-server
  • Run the Huey consumer:
huey_consumer.py core.task_system.scheduler -n -k thread
  • Run the server:
uvicorn main:app

Scalability

Follow My Reading can be scaled horizontally by running multiple Huey consumers with the following command:

huey_consumer.py core.task_system.scheduler -n -k thread -w NUMBER

Where NUMBER is the number of workers you want to run. You can run this command on multiple machines to run a worker on each of them, as long as they are connected to Redis.

NOTE: Executing tasks on multiple machines is currently unstable.

Plugins

Quick Start

First Plugin

Plugins in our system are described as Python files in the /plugins directory. There are several requirements for the format of these plugins. To implement a new plugin, create a file with a name that ends in _plugin.py. In this file, you should include the following imports:

For Image processing models:

from core.plugins import (
    ImageProcessingResult,
    ImageTextBox,
    Point,
    Rectangle,
    register_plugin,
)

For Audio processing models:

from core.plugins import AudioChunk, AudioProcessingResult, register_plugin

The register_plugin function is a decorator that you should use to register your custom plugin. This function takes a single parameter which is the class of your plugin.

Image Processing Example

Here is an example of how to create and register a custom plugin for image processing:

import easyocr

from core.plugins import (
    ImageProcessingResult,
    ImageTextBox,
    Point,
    Rectangle,
    register_plugin,
)


@register_plugin
class EnArEasyOCRPlugin:
    name = "en_ar_easyocr"
    description = (
        # Trailing space keeps the two concatenated literals from running together.
        "An open source library for certain languages and alphabets, "
        "mainly used for working with text on an image"
    )

    # List of supported languages can be found here: https://www.jaided.ai/easyocr/
    languages = ["en", "ar"]

    reader = easyocr.Reader(languages, gpu=False)

    @staticmethod
    def process_image(filename: str) -> ImageProcessingResult:
        model_response = EnArEasyOCRPlugin.reader.readtext(filename)
        boxes = []
        for coordinates, text, _ in model_response:
            lt, rt, rb, lb = coordinates
            boxes.append(
                ImageTextBox(
                    text=text,
                    coordinates=Rectangle(
                        left_top=Point(x=lt[0], y=lt[1]),
                        right_top=Point(x=rt[0], y=rt[1]),
                        right_bottom=Point(x=rb[0], y=rb[1]),
                        left_bottom=Point(x=lb[0], y=lb[1]),
                    ),
                )
            )
        result_text = " ".join(map(lambda x: x[1], model_response))

        return ImageProcessingResult(text=result_text, boxes=boxes)

Audio Processing Example

And here is an example of how to create and register a custom plugin for audio processing:

import whisper

from core.plugins import AudioChunk, AudioProcessingResult, register_plugin


@register_plugin
class WhisperPlugin:
    name = "whisper"
    languages = ["en", "ru", "ar"]
    description = "Robust Speech Recognition via Large-Scale Weak Supervision By OpenAI"

    model = whisper.load_model("base")  # large-v2

    @staticmethod
    def process_audio(filename: str) -> AudioProcessingResult:
        model_response = WhisperPlugin.model.transcribe(filename)
        chunks = [
            AudioChunk(start=seg["start"], end=seg["end"], text=seg["text"])
            for seg in model_response["segments"]
        ]

        return AudioProcessingResult(text=model_response["text"], segments=chunks)

Requirements

In our system, each plugin file must be named in the format *_plugin.py and located in the /plugins directory. Each plugin must also contain the following static variables:

  • name: A string that specifies the name of the plugin.
  • languages: A list of strings specifying the natural languages that the plugin can process.
  • description: A string that provides a description of the plugin.

Additionally, each plugin must implement one of the following static methods:

  • process_audio(filename: str): Must accept an argument of type string and return an object of type AudioProcessingResult.
  • process_image(filename: str): Must accept an argument of type string and return an object of type ImageProcessingResult.
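The requirements above can be checked mechanically before a plugin is dropped into /plugins. The following is a minimal sketch, not part of the project: the helper check_plugin and the two example classes are hypothetical, and the real register_plugin decorator may perform different or additional checks.

```python
# Hypothetical helper: verifies that a class satisfies the documented plugin requirements.
REQUIRED_ATTRIBUTES = ("name", "languages", "description")
PROCESS_METHODS = ("process_audio", "process_image")


def check_plugin(cls: type) -> list[str]:
    """Return a list of problems; an empty list means the class looks valid."""
    problems = []
    for attr in REQUIRED_ATTRIBUTES:
        if not hasattr(cls, attr):
            problems.append(f"missing static variable: {attr}")
    if not isinstance(getattr(cls, "name", ""), str):
        problems.append("name must be a string")
    languages = getattr(cls, "languages", [])
    if not (isinstance(languages, list) and all(isinstance(x, str) for x in languages)):
        problems.append("languages must be a list of strings")
    if not isinstance(getattr(cls, "description", ""), str):
        problems.append("description must be a string")
    # At least one of the two processing methods has to be present.
    if not any(callable(getattr(cls, m, None)) for m in PROCESS_METHODS):
        problems.append("missing process_audio or process_image")
    return problems


class DummyAudioPlugin:
    name = "dummy"
    languages = ["en"]
    description = "Placeholder plugin used only to exercise the check"

    @staticmethod
    def process_audio(filename: str):
        return None  # a real plugin would return an AudioProcessingResult


class BrokenPlugin:
    name = "broken"  # no languages, no description, no processing method
```

Calling check_plugin(DummyAudioPlugin) yields an empty list, while check_plugin(BrokenPlugin) lists each missing piece.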

Algorithms

Audio Algorithms

dbfs_to_fraction

The dbfs_to_fraction function accepts a value in decibels relative to full scale (dBFS) and returns the corresponding fraction of the maximum volume as a float.

fraction_to_dbfs

The fraction_to_dbfs function accepts a fraction of the maximum volume and returns the corresponding value in decibels relative to full scale (dBFS) as a float.
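As an illustration of the two conversions, here is a minimal sketch that assumes the usual amplitude convention (20 · log10 of the ratio). The project's own implementation may use a different convention or handle edge cases differently; only the function names are taken from the text above.

```python
import math


def dbfs_to_fraction(dbfs: float) -> float:
    """Convert dBFS to a fraction of full scale (assumes amplitude ratio)."""
    return 10 ** (dbfs / 20)


def fraction_to_dbfs(fraction: float) -> float:
    """Convert a fraction of full scale to dBFS (assumes amplitude ratio)."""
    if fraction <= 0:
        raise ValueError("fraction must be positive")
    return 20 * math.log10(fraction)
```

Under this convention 0 dBFS maps to 1.0 (full volume) and -20 dBFS maps to 0.1.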

split_audio

The split_audio function accepts the path to an audio file or a pydub AudioSegment object and a list of tuples representing the timestamps for the beginning and end of each desired segment (in seconds). The function returns the UUIDs of the cut-up files in the order they appeared in the intervals.

split_silence

The split_silence function accepts the path to an audio file, the maximum length of a desired segment (in seconds), and the percentage of the maximum volume at which a segment is considered "silent". The function cuts the file only by silence, not by words, and adds a 50 ms buffer around each segment. The function returns a list of the UUIDs of all the cut-up segments and the intervals at which they were cut.

Text Algorithms

match_words

The match_words function accepts two texts and returns the list of changes that must be made to the first text to obtain the second one. The comparison is done on whole words. Each change is a tuple of the form (index in the first text where the difference was found, segment of the first text to be removed, segment of the second text to be substituted in).
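A word-level diff of this shape can be sketched with the standard library's difflib. This is an illustration only: the name match_words_sketch is hypothetical, the project may use a different algorithm, and the index here is a word position, which is an assumption (the text does not say whether the real index counts words or characters).

```python
from difflib import SequenceMatcher


def match_words_sketch(first: str, second: str) -> list[tuple[int, str, str]]:
    """Return (word index in first, removed segment, substituted segment) tuples."""
    a, b = first.split(), second.split()
    changes = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical runs of words produce no change
        changes.append((i1, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes
```

For "the quick brown fox" and "the quick red fox" the sketch reports a single change at word index 2, replacing "brown" with "red".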

match_phrases

match_phrases is a function that takes in two arguments, phrases and text. phrases is a list of phrases or string fragments to be checked against text, which is the correct text. It returns a list of error tuples for each phrase in the phrases list, indicating the index at which the error occurred, the incorrect phrase, and the correct phrase.

The function first prepares the input texts by ignoring capital letters and non-letter symbols. It then uses the Levenshtein distance to calculate the full answer between the phrases and text. Finally, it cross-references the indices in the full answer to distribute the errors by phrases.
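The first two steps, text preparation and Levenshtein distance, can be sketched as follows. Both helpers are hypothetical stand-ins written for illustration; the project's own preparation rules and distance routine may differ, and the step that distributes errors across phrases is not shown.

```python
def prep_text(text: str) -> str:
    """Lowercase the text and replace every non-letter symbol with a space."""
    letters_only = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return " ".join(letters_only.split())  # collapse repeated spaces


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]
```

After preparation, "The cat." and "the cat" compare as equal, so punctuation and capitalization do not count as reading errors.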

find_phrases

find_phrases is a function that takes three arguments: phrases, to_find, and margin (default 1.05). phrases is a list of phrases; to_find is the piece of text to be found within the phrases. It returns a list of indices of the phrases in which the text appears.

The function first prepares the input to_find and phrases to ignore multiple spaces and non-letter symbols by calling on the helper function __prep_text. It then computes the size of a window to compare to the text and finds the window that best fits the string via the __match_symbols helper function.

The function then trims the window to exclude unnecessary symbols (trims using full words) and transforms the indices from the prepared text to initial text. Lastly, it iterates through the phrases to compute the final answer.

API

FastAPI v0.1.0


Authentication

Scope Scope Description

audio

The endpoint /upload allows clients to upload audio files and returns a unique file ID.

Code samples

POST /v1/audio/upload

The endpoint validates the file against the MIME type specification and converts the audio file to .mp3 format.

Parameters:

  • upload_file: The audio file to upload

List of the most important allowed extensions:

  • .aac
  • .mp3
  • .m4a
  • .oga, .ogv
  • .ogg
  • .opus
  • .wav

Body parameter

upload_file: string

Parameters

Name In Type Required Description
body body Body_upload_audio_file_v1_audio_upload_post true none
» upload_file body string(binary) true none

Example responses

200 Response

{
  "file_id": "8a0cfb4f-ddc9-436d-91bb-75133c583767"
}

422 Response

{
  "detail": "Only audio files uploads are allowed"
}

Responses

Status Meaning Description Schema
200 OK The file is uploaded successfully UploadFileResponse
422 Unprocessable Entity The file was not sent or the file has unallowed extension None

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

The endpoint /download allows clients to download an audio file by its UUID.

Code samples

GET /v1/audio/download

The endpoint /download takes a file UUID as input, checks whether the file exists in the audio directory, and returns the file as bytes (.mp3 format). If the file does not exist, it returns a 404 HTTP response code.

Responses:

  • 200, file bytes (.mp3 format)

Parameters

Name In Type Required Description
file query string(uuid) true none

Example responses

404 Response

{
  "detail": "File not found"
}

422 Response

{
  "detail": [
    {
      "loc": [
        "string"
      ],
      "msg": "string",
      "type": "string"
    }
  ]
}

Responses

Status Meaning Description Schema
200 OK Successful Response None
404 Not Found The specified file was not found. None
422 Unprocessable Entity Validation Error HTTPValidationError

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

The endpoint /models returns available (loaded) audio models.

Code samples

GET /v1/audio/models

Returns the list of models that are loaded into the worker and available for use.

Example responses

200 Response

{
  "models": [
    {
      "name": "string",
      "languages": [
        "string"
      ],
      "description": "string"
    }
  ]
}

Responses

Status Meaning Description Schema
200 OK List of available models ModelsDataReponse
To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

The endpoint /process/task creates an audio processing task based on the given request parameters.

Code samples

POST /v1/audio/process/task

Parameters:

  • audio_file: the UUID of the file to process
  • audio_model: the name of an audio processing model (check '/models' for available models)

Responses:

  • 404, No such audio file available
  • 404, No such audio model available

Body parameter

{
  "audio_file": "732b10bd-0006-4780-8f48-4319d2791290",
  "audio_model": "string"
}

Parameters

Name In Type Required Description
body body AudioProcessingRequest true none
» audio_file body string(uuid) true none
» audio_model body string true none

Example responses

200 Response

{
  "task_id": "736fde4d-9029-4915-8189-01353d6982cb"
}

404 Response

{
  "detail": "No such audio file available"
}

Responses

Status Meaning Description Schema
200 OK Task was successfully created and scheduled TaskCreateResponse
404 Not Found The specified file or model was not found. None
422 Unprocessable Entity Validation Error HTTPValidationError

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer
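A minimal sketch of building this request with the Python standard library. The base URL http://localhost:8000 (uvicorn's default bind address), the helper name build_audio_task_request, and the token value are assumptions for illustration; the Bearer header follows from the OAuth2PasswordBearer scheme named above.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: server started with `uvicorn main:app`


def build_audio_task_request(
    token: str, audio_file: str, audio_model: str, base_url: str = BASE_URL
) -> Request:
    """Build (but do not send) a POST /v1/audio/process/task request."""
    payload = json.dumps({"audio_file": audio_file, "audio_model": audio_model})
    return Request(
        url=f"{base_url}/v1/audio/process/task",
        data=payload.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to urllib.request.urlopen sends it; a successful response carries the task_id shown in the 200 example.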

The endpoint /process/result retrieves the result of an audio processing task from the task system and returns it.

Code samples

GET /v1/audio/process/result

Responses:

  • 200, returns a processing result in the format:
{
    "text": "string", // total extracted text
    "segments": [ // list of audio segments
        {
            "start": 0.0, // absolute timecode (in seconds) of the beginning of the segment
            "end": 10.0,  // absolute timecode (in seconds) of the end of the segment
            "text": "string", // text, which was extracted from the segment
            "file": "3fa85f64-5717-4562-b3fc-2c963f66afa6" // file uuid of the audio segment (for downloading)
        }
    ]
}
  • 406, it is impossible to get the task result (the task does not exist or has not finished yet)
  • 422, the task was not created as an audio processing task

Parameters

Name In Type Required Description
task_id query string(uuid) true none

Example responses

200 Response

{
  "text": "string",
  "segments": [
    {
      "start": 0,
      "end": 0,
      "text": "string",
      "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
    }
  ]
}

406 Response

{
  "detail": "The job is non-existent or not done"
}

422 Response

{
  "detail": "There is no such audio processing task"
}

Responses

Status Meaning Description Schema
200 OK Successful Response AudioProcessingResponse
406 Not Acceptable It is impossible to get task result (task does not exist or it has not finished yet). None
422 Unprocessable Entity The specified task is not audio processing task. None

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer
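Because 406 is returned both while a task is still running and when it does not exist, a client that polls for the result needs an upper bound on retries. The sketch below shows one way to do that; wait_for_result and its fetch callable are hypothetical and stand in for whatever HTTP client actually performs GET /v1/audio/process/result.

```python
import time
from typing import Callable


def wait_for_result(
    fetch: Callable[[], tuple[int, dict]], attempts: int = 10, delay: float = 1.0
) -> dict:
    """Call fetch() until it reports status 200, retrying while it reports 406."""
    for _ in range(attempts):
        status, body = fetch()
        if status == 200:
            return body
        if status != 406:
            # 422 and other codes will not change on retry, so fail immediately.
            raise RuntimeError(f"unexpected status {status}: {body}")
        time.sleep(delay)
    raise TimeoutError("task did not finish within the allowed attempts")
```

The fetch callable returns a (status code, parsed JSON body) pair, which keeps the retry logic independent of the HTTP library in use.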

The endpoint /extract/task creates a task that extracts the specified phrases from the given audio file using the specified audio model.

Code samples

POST /v1/audio/extract/task

Parameters:

  • audio_file: the UUID of the file to process
  • audio_model: the name of an audio processing model (check '/models' for available models)
  • phrases: a list of phrases to extract from the audio

Responses:

  • 404, No such audio file available
  • 404, No such audio model available

Body parameter

{
  "audio_file": "732b10bd-0006-4780-8f48-4319d2791290",
  "audio_model": "string",
  "phrases": [
    "string"
  ]
}

Parameters

Name In Type Required Description
body body AudioExtractPhrasesRequest true none
» audio_file body string(uuid) true none
» audio_model body string true none
» phrases body [string] true none

Example responses

200 Response

{
  "task_id": "736fde4d-9029-4915-8189-01353d6982cb"
}

404 Response

{
  "detail": "No such audio file available"
}

Responses

Status Meaning Description Schema
200 OK Task was successfully created and scheduled TaskCreateResponse
404 Not Found The specified file or model was not found. None
422 Unprocessable Entity Validation Error HTTPValidationError

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

The endpoint /extract/result retrieves the result of an audio extraction task from the task system and returns it.

Code samples

GET /v1/audio/extract/result

Parameters

Name In Type Required Description
task_id query string(uuid) true none

Example responses

200 Response

{
  "data": [
    {
      "audio_segment": {
        "start": 0,
        "end": 0,
        "text": "string",
        "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
      },
      "found": true,
      "phrase": "string"
    }
  ]
}

406 Response

{
  "detail": "The job is non-existent or not done"
}

422 Response

{
  "detail": "There is no such audio extraction task"
}

Responses

Status Meaning Description Schema
200 OK Successful Response AudioExtractPhrasesResponse
406 Not Acceptable It is impossible to get task result (task does not exist or it has not finished yet). None
422 Unprocessable Entity The specified task is not audio extraction task. None

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

image

The endpoint /upload allows clients to upload image files and returns a unique file ID.

Code samples

POST /v1/image/upload

The endpoint validates the file against the MIME type specification and converts the image file to .png format.

Parameters:

  • upload_file: The file to upload

Allowed extensions:

  • .avif
  • .bmp
  • .gif
  • .ico
  • .jpeg, .jpg
  • .png
  • .svg
  • .tif, .tiff
  • .webp

Body parameter

upload_file: string

Parameters

Name In Type Required Description
body body Body_upload_image_v1_image_upload_post true none
» upload_file body string(binary) true none

Example responses

200 Response

{
  "file_id": "8a0cfb4f-ddc9-436d-91bb-75133c583767"
}

422 Response

{
  "detail": "Only image files uploads are allowed"
}

Responses

Status Meaning Description Schema
200 OK The file is uploaded successfully UploadFileResponse
422 Unprocessable Entity The file was not sent or the file has unallowed extension None

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

The endpoint /download allows clients to download an image file by its UUID.

Code samples

GET /v1/image/download

The endpoint /download takes a file UUID as input, checks whether the file exists in the image directory, and returns the file as bytes. If the file does not exist, it returns a 404 HTTP response code.

Responses:

  • 200, file bytes

Parameters

Name In Type Required Description
file query string(uuid) true none

Example responses

404 Response

{
  "detail": "File not found"
}

422 Response

{
  "detail": [
    {
      "loc": [
        "string"
      ],
      "msg": "string",
      "type": "string"
    }
  ]
}

Responses

Status Meaning Description Schema
200 OK Successful Response None
404 Not Found The specified file was not found. None
422 Unprocessable Entity Validation Error HTTPValidationError

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

The endpoint /models returns available (loaded) image models.

Code samples

GET /v1/image/models

Returns the list of models that are loaded into the worker and available for use.

Example responses

200 Response

{
  "models": [
    {
      "name": "string",
      "languages": [
        "string"
      ],
      "description": "string"
    }
  ]
}

Responses

Status Meaning Description Schema
200 OK List of available models ModelsDataReponse
To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

The endpoint /process/task creates an image processing task based on the given request parameters.

Code samples

POST /v1/image/process/task

Parameters:

  • image_file: the UUID of the file to process
  • image_model: the name of an image processing model (check '/models' for available models)

Responses:

  • 404, No such image file available
  • 404, No such image model available

Body parameter

{
  "image_file": "89f23c23-fe12-4935-b746-3bbc447c7a72",
  "image_model": "string"
}

Parameters

Name In Type Required Description
body body ImageProcessingRequest true none
» image_file body string(uuid) true none
» image_model body string true none

Example responses

200 Response

{
  "task_id": "736fde4d-9029-4915-8189-01353d6982cb"
}

404 Response

{
  "detail": "No such image file available"
}

Responses

Status Meaning Description Schema
200 OK Task was successfully created and scheduled TaskCreateResponse
404 Not Found The specified file or model was not found. None
422 Unprocessable Entity Validation Error HTTPValidationError

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

The endpoint /process/result retrieves the result of an image processing task from the task system and returns it.

Code samples

GET /v1/image/process/result

Responses:

  • 200, returns a processing result in the format:
{
    "text": "string", // total extracted text
    "boxes": [ // list of boxes with text
        {
        "text": "string", // text, which was extracted from the box
        "coordinates": { // coordinates of the box on image
            "left_top": { // four points defining the rectangle
            "x": 0,
            "y": 0
            },
            "right_top": {
            "x": 0,
            "y": 0
            },
            "left_bottom": {
            "x": 0,
            "y": 0
            },
            "right_bottom": {
            "x": 0,
            "y": 0
            }
        }
        }
    ]
}
  • 406, it is impossible to get the task result (the task does not exist or has not finished yet)
  • 422, the task was not created as an image processing task

Parameters

Name In Type Required Description
task_id query string(uuid) true none

Example responses

200 Response

{
  "text": "string",
  "boxes": [
    {
      "text": "string",
      "coordinates": {
        "left_top": {
          "x": 0,
          "y": 0
        },
        "right_top": {
          "x": 0,
          "y": 0
        },
        "left_bottom": {
          "x": 0,
          "y": 0
        },
        "right_bottom": {
          "x": 0,
          "y": 0
        }
      }
    }
  ]
}

406 Response

{
  "detail": "The job is non-existent or not done"
}

422 Response

{
  "detail": "There is no such image processing task"
}

Responses

Status Meaning Description Schema
200 OK Successful Response ImageProcessingResponse
406 Not Acceptable It is impossible to get task result (task does not exist or it has not finished yet). None
422 Unprocessable Entity The specified task is not image processing task. None

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer
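Clients that highlight recognized text often need an axis-aligned bounding box rather than four separate corner points. The helper below is a hypothetical client-side sketch that works on one "coordinates" object from the response above; it is not part of the service.

```python
def bounding_box(coordinates: dict) -> tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) for one box's four corner points."""
    corners = [
        coordinates[key]
        for key in ("left_top", "right_top", "right_bottom", "left_bottom")
    ]
    xs = [point["x"] for point in corners]
    ys = [point["y"] for point in corners]
    return min(xs), min(ys), max(xs), max(ys)
```

Taking the minimum and maximum over all four corners keeps the result correct even when the detected box is slightly rotated.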

auth

The endpoint /register registers a new user by storing their username, password, email, and full name in a Redis database.

Code samples

PUT /v1/auth/register

Parameters:

  • username: a string with the username of the user being registered
  • password: a string with the user's password
  • email: an optional string with the user's email address
  • full_name: an optional string with the user's full name

Parameters

Name In Type Required Description
username query string true none
password query string true none
email query string false none
full_name query string false none

Example responses

200 Response

{
  "text": "string"
}

422 Response

{
  "detail": "Username is already taken"
}

Responses

Status Meaning Description Schema
200 OK Successful Response RegisterResponse
422 Unprocessable Entity The specified username is already taken None

Response Schema

This operation does not require authentication

The endpoint /token handles the login process and returns an access token for the authenticated user.

Code samples

POST /v1/auth/token

Parameters:

  • username - unique username, which the client has provided while registering
  • password - client's password

Responses:

  • 401, incorrect username or password
  • 200, token

Body parameter

grant_type: string
username: string
password: string
scope: ""
client_id: string
client_secret: string

Parameters

Name In Type Required Description
body body Body_login_for_access_token_v1_auth_token_post true none
» grant_type body string false none
» username body string true none
» password body string true none
» scope body string false none
» client_id body string false none
» client_secret body string false none

Example responses

200 Response

{
  "access_token": "string",
  "token_type": "string"
}

401 Response

{
  "detail": "Incorrect username or password"
}

Responses

Status Meaning Description Schema
200 OK Successful Response Token
401 Unauthorized Incorrect username or password. None
422 Unprocessable Entity Validation Error HTTPValidationError

Response Schema

This operation does not require authentication
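A minimal sketch of a login request built with the standard library. It assumes the body is sent form-encoded, as is usual for an OAuth2 password flow; the base URL and the helper name build_token_request are also assumptions, and the optional fields (grant_type, scope, client_id, client_secret) are left out.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(
    username: str, password: str, base_url: str = "http://localhost:8000"
) -> Request:
    """Build (but do not send) a POST /v1/auth/token request."""
    form = urlencode({"username": username, "password": password})
    return Request(
        url=f"{base_url}/v1/auth/token",
        data=form.encode("ascii"),  # urlencode output is percent-encoded ASCII
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

The access_token from the 200 response is then sent as "Authorization: Bearer <token>" on the endpoints that require OAuth2PasswordBearer.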

The endpoint /users/me returns the current user.

Code samples

GET /v1/auth/users/me

Example responses

200 Response

{
  "username": "string",
  "email": "string",
  "full_name": "string",
  "disabled": true
}

400 Response

{
  "detail": "Inactive user"
}

401 Response

{
  "detail": "Could not validate credentials"
}

Responses

Status Meaning Description Schema
200 OK Successful Response User
400 Bad Request User is inactive None
401 Unauthorized Could not validate credentials None

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

comparison

The endpoint /audio/image/task creates a task to compare an audio file against an image file using the specified models and returns the task ID.

Code samples

POST /v1/comparison/audio/image/task

Parameters:

  • audio_file: the UUID of the audio file to process
  • audio_model: the name of an audio processing model (check '/audio/models' for available models)
  • image_file: the UUID of the image file to process
  • image_model: the name of an image processing model (check '/image/models' for available models)

Responses:

  • 200, Task created
  • 404, No such audio file available
  • 404, No such audio model available
  • 404, No such image file available
  • 404, No such image model available

Body parameter

{
  "audio_file": "732b10bd-0006-4780-8f48-4319d2791290",
  "image_file": "89f23c23-fe12-4935-b746-3bbc447c7a72",
  "audio_model": "string",
  "image_model": "string"
}

Parameters

Name In Type Required Description
body body AudioToImageComparisonRequest true none
» audio_file body string(uuid) true none
» image_file body string(uuid) true none
» audio_model body string true none
» image_model body string true none

Example responses

200 Response

{
  "task_id": "736fde4d-9029-4915-8189-01353d6982cb"
}

404 Response

{
  "detail": "No such image model available"
}

Responses

Status Meaning Description Schema
200 OK Task was successfully created and scheduled TaskCreateResponse
404 Not Found The specified file or model was not found. None
422 Unprocessable Entity Validation Error HTTPValidationError

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

The endpoint /audio/image/result retrieves the results of the task with the given task ID and returns them.

Code samples

GET /v1/comparison/audio/image/result

Parameters:

  • task_id: the UUID of the task whose results should be fetched

Responses:

  • 200, job results in the format
{
"image": { // image processing result
    "text": "string", // total extracted text
    "boxes": [ // list of boxes with text
    {
        "text": "string", // text extracted from the box
        "coordinates": { // coordinates of the box on the image
        "left_top": { // four points defining a rectangle
            "x": 0,
            "y": 0
        },
        "right_top": {
            "x": 0,
            "y": 0
        },
        "left_bottom": {
            "x": 0,
            "y": 0
        },
        "right_bottom": {
            "x": 0,
            "y": 0
        }
        }
    }
    ]
},
"audio": { // audio processing results
    "text": "string", // total extracted text
    "segments": [ // audio segments, that were processed
    {
        "start": 0, // absolute time code of the beginning of the segment
        "end": 0, // absolute time code of the ending of the segment
        "text": "string", // text extracted from the segment
        "file": "3fa85f64-5717-4562-b3fc-2c963f66afa6" // audio segment
    }
    ]
},
"errors": [ // results of comparing
    {
    "audio_segment": { // audio segment where error was made
        "start": 0,
        "end": 0,
        "text": "string",
        "file": "3fa85f64-5717-4562-b3fc-2c963f66afa6"
    },
    "at_char": 0, // character at which the error starts
    "found": "string", // found word (based on audio)
    "expected": "string" // expected word (suggestion for improvement based on image)
    }
]
}
  • 406, Results are not ready yet or no task with such id exists
  • 422, There is no such audio processing task

Parameters

Name In Type Required Description
task_id query string(uuid) true none

Example responses

200 Response

{
  "image": {
    "text": "string",
    "boxes": [
      {
        "text": "string",
        "coordinates": {
          "left_top": {
            "x": 0,
            "y": 0
          },
          "right_top": {
            "x": 0,
            "y": 0
          },
          "left_bottom": {
            "x": 0,
            "y": 0
          },
          "right_bottom": {
            "x": 0,
            "y": 0
          }
        }
      }
    ]
  },
  "audio": {
    "text": "string",
    "segments": [
      {
        "start": 0,
        "end": 0,
        "text": "string",
        "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
      }
    ]
  },
  "errors": [
    {
      "audio_segment": {
        "start": 0,
        "end": 0,
        "text": "string",
        "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
      },
      "at_char": 0,
      "found": "string",
      "expected": "string"
    }
  ]
}

406 Response

{
  "detail": "Results are not ready yet or no task with such id exist"
}

422 Response

{
  "detail": "There is no such task consists of the both image and audio"
}

Responses

Status Meaning Description Schema
200 OK Successful Response AudioImageComparisonResultsResponse
406 Not Acceptable It is impossible to get task result (task does not exist or it has not finished yet). None
422 Unprocessable Entity No task consisting of both image and audio was found. None

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer
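As a rough illustration of consuming this response, the sketch below turns each entry of errors into a readable line. The function name summarize_errors and the sample values are made up for the example; only the field names come from the response format above.

```python
def summarize_errors(result):
    """Return one human-readable line per entry in result["errors"]."""
    lines = []
    for err in result["errors"]:
        seg = err["audio_segment"]
        lines.append(
            f'{seg["start"]}-{seg["end"]}: heard "{err["found"]}", '
            f'expected "{err["expected"]}" (char {err["at_char"]})'
        )
    return lines


# Hypothetical response body, shaped like AudioImageComparisonResultsResponse.
sample = {
    "image": {"text": "the quick fox", "boxes": []},
    "audio": {"text": "the quack fox", "segments": []},
    "errors": [
        {
            "audio_segment": {
                "start": 0.5,
                "end": 1.2,
                "text": "quack",
                "file": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
            },
            "at_char": 4,
            "found": "quack",
            "expected": "quick",
        }
    ],
}

for line in summarize_errors(sample):
    print(line)
```

The same helper should work for the audio/text result, since its errors list uses the same entry format.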

The endpoint '/audio/text/task' creates a task that compares audio against user-supplied text using the specified audio model, and returns the task ID.

Code samples

POST /v1/comparison/audio/text/task

Parameters:

  • audio_file: the UUID of the audio file to process
  • audio_model: an audio processing model name (check '/audio/models' for available models)
  • text: a list of strings to compare audio against

Responses:

  • 200, Task created
  • 404, No such audio file available
  • 404, No such audio model available

Body parameter

{
  "audio_file": "732b10bd-0006-4780-8f48-4319d2791290",
  "text": [
    "string"
  ],
  "audio_model": "string"
}

Parameters

Name In Type Required Description
body body AudioToTextComparisonRequest true none
» audio_file body string(uuid) true none
» text body [string] true none
» audio_model body string true none

Example responses

200 Response

{
  "task_id": "736fde4d-9029-4915-8189-01353d6982cb"
}

404 Response

{
  "detail": "No such audio model available"
}

Responses

Status Meaning Description Schema
200 OK Task was successfully created and scheduled TaskCreateResponse
404 Not Found The specified file or model was not found. None
422 Unprocessable Entity Validation Error HTTPValidationError

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer
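A minimal sketch of how a client might build this request with the Python standard library. The host, token, and model name are placeholders, and sending the token as a Bearer header is an assumption based on the OAuth2PasswordBearer scheme named above. The request is only built here, not sent.

```python
import json
import urllib.request


def build_audio_text_task_request(base_url, token, audio_file, audio_model, text):
    """Build (without sending) the POST request that creates an audio/text task."""
    payload = {"audio_file": audio_file, "text": text, "audio_model": audio_model}
    return urllib.request.Request(
        url=f"{base_url}/v1/comparison/audio/text/task",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the OAuth2PasswordBearer token travels as a Bearer header.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_audio_text_task_request(
    base_url="http://localhost:8000",  # hypothetical host
    token="example-token",             # hypothetical token
    audio_file="732b10bd-0006-4780-8f48-4319d2791290",
    audio_model="example_model",       # hypothetical; check '/audio/models'
    text=["The quick brown fox"],
)
print(request.get_method(), request.full_url)
```

A real client would pass the request to urllib.request.urlopen() and read task_id from the JSON body of a 200 response.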

The endpoint /audio/text/result retrieves and returns the results of the task with the given task ID.

Code samples

GET /v1/comparison/audio/text/result

Parameters:

  • task_id: the UUID of the task whose results to fetch

Responses:

  • 200, job results in the format
{
"audio": { // audio processing results
    "text": "string", // total extracted text
    "segments": [ // audio segments, that were processed
    {
        "start": 0, // absolute time code of the beginning of the segment
        "end": 0, // absolute time code of the ending of the segment
        "text": "string", // text extracted from the segment
        "file": "3fa85f64-5717-4562-b3fc-2c963f66afa6" // audio segment
    }
    ]
},
"errors": [ // results of comparing
    {
    "audio_segment": { // audio segment where error was made
        "start": 0,
        "end": 0,
        "text": "string",
        "file": "3fa85f64-5717-4562-b3fc-2c963f66afa6"
    },
    "at_char": 0, // character at which the error starts
    "found": "string", // found word (based on audio)
    "expected": "string" // expected word (suggestion for improvement based on text)
    }
]
}
  • 406, Results are not ready yet or no task with such id exists
  • 422, There is no such audio processing task

Parameters

Name In Type Required Description
task_id query string(uuid) true none

Example responses

200 Response

{
  "audio": {
    "text": "string",
    "segments": [
      {
        "start": 0,
        "end": 0,
        "text": "string",
        "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
      }
    ]
  },
  "errors": [
    {
      "audio_segment": {
        "start": 0,
        "end": 0,
        "text": "string",
        "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
      },
      "at_char": 0,
      "found": "string",
      "expected": "string"
    }
  ]
}

406 Response

{
  "detail": "Results are not ready yet or no task with such id exist"
}

422 Response

{
  "detail": "There is no such task consists of the both audio and text"
}

Responses

Status Meaning Description Schema
200 OK Successful Response AudioTextComparisonResultsResponse
406 Not Acceptable It is impossible to get task result (task does not exist or it has not finished yet). None
422 Unprocessable Entity No task consisting of both audio and text was found. None

Response Schema

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

task

The endpoint /status returns the status of the task identified by its task_id.

Code samples

GET /v1/task/status

Parameters:

  • task_id: the UUID of the task whose status to fetch

Responses:

  • 200, Job status

Parameters

Name In Type Required Description
task_id query string(uuid) true none

Example responses

200 Response

{
  "task_id": "736fde4d-9029-4915-8189-01353d6982cb",
  "status": "string",
  "ready": true
}

Responses

Status Meaning Description Schema
200 OK Successful Response TaskStatusResponse
422 Unprocessable Entity Validation Error HTTPValidationError

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

The endpoint /result retrieves and returns the results of the task with the given task ID.

Code samples

GET /v1/task/result

Parameters:

  • task_id: the UUID of the task whose results to fetch

Responses:

  • 200, job results
  • 406, Results are not ready yet or no task with such id exists

Parameters

Name In Type Required Description
task_id query string(uuid) true none

Example responses

200 Response

{}

406 Response

{
  "detail": "Results are not ready yet or no task with such id exist"
}

Responses

Status Meaning Description Schema
200 OK Successful Response Inline
406 Not Acceptable It is impossible to get task result (task does not exist or it has not finished yet). None
422 Unprocessable Entity Validation Error HTTPValidationError

Response Schema

Status Code 200

Response Get Job Result V1 Task Result Get

Name Type Required Restrictions Description

To perform this operation, you must be authenticated by means of one of the following methods: OAuth2PasswordBearer

Schemas

AudioChunk

{
  "start": 0,
  "end": 0,
  "text": "string",
  "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
}

AudioChunk

Properties

Name Type Required Restrictions Description
start number true none none
end number true none none
text string true none none
file string(uuid) true none none

AudioExtractPhrasesRequest

{
  "audio_file": "732b10bd-0006-4780-8f48-4319d2791290",
  "audio_model": "string",
  "phrases": [
    "string"
  ]
}

AudioExtractPhrasesRequest

Properties

Name Type Required Restrictions Description
audio_file string(uuid) true none none
audio_model string true none none
phrases [string] true none none

AudioExtractPhrasesResponse

{
  "data": [
    {
      "audio_segment": {
        "start": 0,
        "end": 0,
        "text": "string",
        "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
      },
      "found": true,
      "phrase": "string"
    }
  ]
}

AudioExtractPhrasesResponse

Properties

Name Type Required Restrictions Description
data [AudioPhrase] true none none

AudioImageComparisonResultsResponse

{
  "image": {
    "text": "string",
    "boxes": [
      {
        "text": "string",
        "coordinates": {
          "left_top": {
            "x": 0,
            "y": 0
          },
          "right_top": {
            "x": 0,
            "y": 0
          },
          "left_bottom": {
            "x": 0,
            "y": 0
          },
          "right_bottom": {
            "x": 0,
            "y": 0
          }
        }
      }
    ]
  },
  "audio": {
    "text": "string",
    "segments": [
      {
        "start": 0,
        "end": 0,
        "text": "string",
        "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
      }
    ]
  },
  "errors": [
    {
      "audio_segment": {
        "start": 0,
        "end": 0,
        "text": "string",
        "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
      },
      "at_char": 0,
      "found": "string",
      "expected": "string"
    }
  ]
}

AudioImageComparisonResultsResponse

Properties

Name Type Required Restrictions Description
image ImageProcessingResponse true none none
audio AudioProcessingResponse true none none
errors [TextDiff] true none none

AudioPhrase

{
  "audio_segment": {
    "start": 0,
    "end": 0,
    "text": "string",
    "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
  },
  "found": true,
  "phrase": "string"
}

AudioPhrase

Properties

Name Type Required Restrictions Description
audio_segment AudioChunk false none none
found boolean true none none
phrase string true none none

AudioProcessingRequest

{
  "audio_file": "732b10bd-0006-4780-8f48-4319d2791290",
  "audio_model": "string"
}

AudioProcessingRequest

Properties

Name Type Required Restrictions Description
audio_file string(uuid) true none none
audio_model string true none none

AudioProcessingResponse

{
  "text": "string",
  "segments": [
    {
      "start": 0,
      "end": 0,
      "text": "string",
      "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
    }
  ]
}

AudioProcessingResponse

Properties

Name Type Required Restrictions Description
text string true none none
segments [AudioChunk] true none none

AudioTextComparisonResultsResponse

{
  "audio": {
    "text": "string",
    "segments": [
      {
        "start": 0,
        "end": 0,
        "text": "string",
        "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
      }
    ]
  },
  "errors": [
    {
      "audio_segment": {
        "start": 0,
        "end": 0,
        "text": "string",
        "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
      },
      "at_char": 0,
      "found": "string",
      "expected": "string"
    }
  ]
}

AudioTextComparisonResultsResponse

Properties

Name Type Required Restrictions Description
audio AudioProcessingResponse true none none
errors [TextDiff] true none none

AudioToImageComparisonRequest

{
  "audio_file": "732b10bd-0006-4780-8f48-4319d2791290",
  "image_file": "89f23c23-fe12-4935-b746-3bbc447c7a72",
  "audio_model": "string",
  "image_model": "string"
}

AudioToImageComparisonRequest

Properties

Name Type Required Restrictions Description
audio_file string(uuid) true none none
image_file string(uuid) true none none
audio_model string true none none
image_model string true none none

AudioToTextComparisonRequest

{
  "audio_file": "732b10bd-0006-4780-8f48-4319d2791290",
  "text": [
    "string"
  ],
  "audio_model": "string"
}

AudioToTextComparisonRequest

Properties

Name Type Required Restrictions Description
audio_file string(uuid) true none none
text [string] true none none
audio_model string true none none

Body_login_for_access_token_v1_auth_token_post

{
  "grant_type": "string",
  "username": "string",
  "password": "string",
  "scope": "",
  "client_id": "string",
  "client_secret": "string"
}

Body_login_for_access_token_v1_auth_token_post

Properties

Name Type Required Restrictions Description
grant_type string false none none
username string true none none
password string true none none
scope string false none none
client_id string false none none
client_secret string false none none
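For illustration only, the required fields of this body can be encoded as form data with the standard library. That the login endpoint expects an application/x-www-form-urlencoded body is an assumption based on the usual OAuth2 password flow. The helper name and credentials below are invented.

```python
from urllib.parse import urlencode


def build_login_form(username, password):
    """Encode the two required login fields as a form-urlencoded string."""
    return urlencode({"username": username, "password": password})


# Hypothetical credentials; special characters are percent-encoded.
form = build_login_form("alice", "s3cret & more")
print(form)
```

The optional fields (grant_type, scope, client_id, client_secret) could be added to the same dictionary if a deployment needs them.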

Body_upload_audio_file_v1_audio_upload_post

{
  "upload_file": "string"
}

Body_upload_audio_file_v1_audio_upload_post

Properties

Name Type Required Restrictions Description
upload_file string(binary) true none none

Body_upload_image_v1_image_upload_post

{
  "upload_file": "string"
}

Body_upload_image_v1_image_upload_post

Properties

Name Type Required Restrictions Description
upload_file string(binary) true none none

HTTPValidationError

{
  "detail": [
    {
      "loc": [
        "string"
      ],
      "msg": "string",
      "type": "string"
    }
  ]
}

HTTPValidationError

Properties

Name Type Required Restrictions Description
detail [ValidationError] false none none

IPRPoint

{
  "x": 0,
  "y": 0
}

IPRPoint

Properties

Name Type Required Restrictions Description
x integer true none none
y integer true none none

IPRRectangle

{
  "left_top": {
    "x": 0,
    "y": 0
  },
  "right_top": {
    "x": 0,
    "y": 0
  },
  "left_bottom": {
    "x": 0,
    "y": 0
  },
  "right_bottom": {
    "x": 0,
    "y": 0
  }
}

IPRRectangle

Properties

Name Type Required Restrictions Description
left_top IPRPoint true none none
right_top IPRPoint true none none
left_bottom IPRPoint true none none
right_bottom IPRPoint true none none
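Since the four corner points need not form an axis-aligned rectangle (for example, when the photographed text is slightly rotated), a client that wants to crop or highlight a box may first reduce it to an axis-aligned bounding box. The helper below is a sketch with an invented name and invented sample coordinates.

```python
def bounding_box(rect):
    """Return (min_x, min_y, max_x, max_y) enclosing the four corner points."""
    corners = ("left_top", "right_top", "left_bottom", "right_bottom")
    xs = [rect[name]["x"] for name in corners]
    ys = [rect[name]["y"] for name in corners]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical, slightly skewed rectangle in the IPRRectangle format.
rect = {
    "left_top": {"x": 10, "y": 20},
    "right_top": {"x": 110, "y": 22},
    "left_bottom": {"x": 8, "y": 60},
    "right_bottom": {"x": 108, "y": 62},
}
print(bounding_box(rect))
```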

IPRTextBox

{
  "text": "string",
  "coordinates": {
    "left_top": {
      "x": 0,
      "y": 0
    },
    "right_top": {
      "x": 0,
      "y": 0
    },
    "left_bottom": {
      "x": 0,
      "y": 0
    },
    "right_bottom": {
      "x": 0,
      "y": 0
    }
  }
}

IPRTextBox

Properties

Name Type Required Restrictions Description
text string true none none
coordinates IPRRectangle true none none

ImageProcessingRequest

{
  "image_file": "89f23c23-fe12-4935-b746-3bbc447c7a72",
  "image_model": "string"
}

ImageProcessingRequest

Properties

Name Type Required Restrictions Description
image_file string(uuid) true none none
image_model string true none none

ImageProcessingResponse

{
  "text": "string",
  "boxes": [
    {
      "text": "string",
      "coordinates": {
        "left_top": {
          "x": 0,
          "y": 0
        },
        "right_top": {
          "x": 0,
          "y": 0
        },
        "left_bottom": {
          "x": 0,
          "y": 0
        },
        "right_bottom": {
          "x": 0,
          "y": 0
        }
      }
    }
  ]
}

ImageProcessingResponse

Properties

Name Type Required Restrictions Description
text string true none none
boxes [IPRTextBox] true none none

ModelData

{
  "name": "string",
  "languages": [
    "string"
  ],
  "description": "string"
}

ModelData

Properties

Name Type Required Restrictions Description
name string true none none
languages [string] true none none
description string true none none

ModelsDataReponse

{
  "models": [
    {
      "name": "string",
      "languages": [
        "string"
      ],
      "description": "string"
    }
  ]
}

ModelsDataReponse

Properties

Name Type Required Restrictions Description
models [ModelData] true none none

RegisterResponse

{
  "text": "string"
}

RegisterResponse

Properties

Name Type Required Restrictions Description
text string true none none

TaskCreateResponse

{
  "task_id": "736fde4d-9029-4915-8189-01353d6982cb"
}

TaskCreateResponse

Properties

Name Type Required Restrictions Description
task_id string(uuid) true none none

TaskStatusResponse

{
  "task_id": "736fde4d-9029-4915-8189-01353d6982cb",
  "status": "string",
  "ready": true
}

TaskStatusResponse

Properties

Name Type Required Restrictions Description
task_id string(uuid) true none none
status string true none none
ready boolean true none none

TextDiff

{
  "audio_segment": {
    "start": 0,
    "end": 0,
    "text": "string",
    "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"
  },
  "at_char": 0,
  "found": "string",
  "expected": "string"
}

TextDiff

Properties

Name Type Required Restrictions Description
audio_segment AudioChunk true none none
at_char integer true none none
found string true none none
expected string true none none

Token

{
  "access_token": "string",
  "token_type": "string"
}

Token

Properties

Name Type Required Restrictions Description
access_token string true none none
token_type string true none none

UploadFileResponse

{
  "file_id": "8a0cfb4f-ddc9-436d-91bb-75133c583767"
}

UploadFileResponse

Properties

Name Type Required Restrictions Description
file_id string(uuid) true none none

User

{
  "username": "string",
  "email": "string",
  "full_name": "string",
  "disabled": true
}

User

Properties

Name Type Required Restrictions Description
username string true none none
email string false none none
full_name string false none none
disabled boolean false none none

ValidationError

{
  "loc": [
    "string"
  ],
  "msg": "string",
  "type": "string"
}

ValidationError

Properties

Name Type Required Restrictions Description
loc [anyOf] true none none

anyOf

Name Type Required Restrictions Description
» anonymous string false none none

or

Name Type Required Restrictions Description
» anonymous integer false none none

continued

Name Type Required Restrictions Description
msg string true none none
type string true none none
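A 422 response with this schema can be flattened into short messages for display. The sketch below joins each loc entry (strings or integers, as the schema allows) into a dotted path. The function name and the sample body are illustrative, not taken from the service.

```python
def format_validation_errors(body):
    """Turn an HTTPValidationError body into 'path: message' strings."""
    return [
        f'{".".join(str(part) for part in err["loc"])}: {err["msg"]}'
        for err in body.get("detail", [])
    ]


# Hypothetical 422 body following the HTTPValidationError schema.
body = {
    "detail": [
        {"loc": ["body", "audio_file"], "msg": "field required", "type": "example"},
        {"loc": ["body", "text", 0], "msg": "must be a string", "type": "example"},
    ]
}
for message in format_validation_errors(body):
    print(message)
```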

Advanced

Audio

Audio Conversion

Our system uses the pydub Python package to work with audio files. pydub is a high-level audio library that simplifies audio file manipulation. It relies on FFmpeg, a multimedia framework that handles a wide range of audio and video file formats.

Together, pydub and FFmpeg support many audio file formats, including MP3, WAV, FLAC, and M4A. However, uploads to our system are restricted to the most common audio formats, as identified by their MIME types. This keeps uploads convenient and prevents errors when processing the uploaded files.

List of the most important allowed extensions:

  • .aac
  • .mp3
  • .m4a
  • .oga, .ogv
  • .ogg
  • .opus
  • .wav

For a comprehensive list of the supported formats, please refer to: Full list of FFmpeg supported formats
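A client can check a file name before uploading to avoid a rejected request. The sketch below uses a subset of the extensions listed above. The set and the function name are illustrative, and the authoritative check remains the one performed by the server.

```python
from pathlib import Path

# A subset of the extensions listed above; the server-side list may differ.
ALLOWED_AUDIO_EXTENSIONS = {".mp3", ".m4a", ".oga", ".ogg", ".opus", ".wav"}


def is_allowed_audio(filename):
    """Return True if the file name ends with one of the known audio extensions."""
    return Path(filename).suffix.lower() in ALLOWED_AUDIO_EXTENSIONS


print(is_allowed_audio("reading.MP3"))
print(is_allowed_audio("notes.txt"))
```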

Audio Models

Our system fetches audio models from a worker that loads plugins. This process is carried out by sending a request to the worker, which then returns the loaded plugins. The worker is responsible for loading audio processing plugins, which include machine learning models for audio analysis and other related functionalities.

To initiate this process, our system sends a request to the worker to retrieve the list of loaded plugins that are ready to use. This helps ensure that the audio models used in the system are up-to-date.
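On the client side, the returned model list (the ModelsDataReponse schema) can be filtered to find a suitable model. The sketch below assumes languages are given as short codes such as "en". That format, the model names, and the helper name are assumptions for the example.

```python
def models_for_language(models_response, language):
    """Return the names of models whose 'languages' list contains `language`."""
    return [
        model["name"]
        for model in models_response["models"]
        if language in model["languages"]
    ]


# Hypothetical response shaped like ModelsDataReponse.
available = {
    "models": [
        {"name": "model_a", "languages": ["en", "ru"], "description": "example"},
        {"name": "model_b", "languages": ["ru"], "description": "example"},
    ]
}
print(models_for_language(available, "en"))
```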

Image

Image Models

Our system fetches image models from a worker that loads plugins. This process is carried out by sending a request to the worker, which then returns the loaded plugins. The worker is responsible for loading image processing plugins, which include machine learning models for image analysis and other related functionalities.

To initiate this process, our system sends a request to the worker to retrieve the list of loaded plugins that are ready to use. This helps ensure that the image models used in the system are up-to-date.

Task System

This is a set of functions and methods used in the Follow My Reading task system.

_plugin_class_method_call

_plugin_class_method_call() is a helper function that searches each plugin for an object named class_name. If the object is found, it loads the requested function from it, calls that function with the filepath argument, and returns the result.
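The lookup described here can be sketched with getattr. This is not the project's implementation: the plugin registry is modeled as a plain dictionary of objects, and raising LookupError when nothing matches is an assumption, since the behavior for a missing class is not documented here.

```python
from types import SimpleNamespace


def plugin_class_method_call(plugins, class_name, function, filepath):
    """Find `class_name` in any plugin and call its `function` with `filepath`."""
    for plugin in plugins.values():
        target = getattr(plugin, class_name, None)
        if target is not None:
            return getattr(target, function)(filepath)
    # Assumption: the real helper may handle a missing class differently.
    raise LookupError(f"no loaded plugin provides {class_name!r}")


class EchoModel:
    """Hypothetical plugin class used only for this example."""

    @staticmethod
    def process(filepath):
        return f"processed {filepath}"


# Hypothetical registry: plugin name -> loaded plugin object.
plugins = {
    "empty_plugin": SimpleNamespace(),
    "echo_plugin": SimpleNamespace(EchoModel=EchoModel),
}
print(plugin_class_method_call(plugins, "EchoModel", "process", "reading.wav"))
```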

dynamic_plugin_call

dynamic_plugin_call() is a scheduled job that accepts class_name, function, and filepath as parameters. It calls _plugin_class_method_call() with these parameters.

load_plugins_into_memories

load_plugins_into_memories() is a startup function that loads plugins.

audio_processing_call

audio_processing_call() is a scheduled job that accepts audio_class, audio_function, and audio_path as parameters. It calls _audio_process() with these parameters.

image_processing_call

image_processing_call() is a scheduled job that accepts image_class, image_function, and image_path as parameters. It calls _image_process() with these parameters.

compare_audio_image

compare_audio_image() is a scheduled job that accepts audio_class, audio_function, audio_path, image_class, image_function, and image_path as parameters. It calls _audio_process() and _image_process() with these parameters, matches the resulting texts, and returns the difference.

compare_audio_text

compare_audio_text() is a scheduled job that accepts audio_class, audio_function, audio_path, and text as parameters. It calls _audio_process() with the audio parameters, matches the resulting text against the given text, and returns the difference.
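The matching algorithm itself is not described here. As one possible approach, the sketch below compares the two texts word by word using difflib from the Python standard library and reports each mismatch as an expected/found pair. It is a stand-in technique, not the project's actual comparison code, and it does not compute at_char.

```python
from difflib import SequenceMatcher


def word_differences(expected_text, found_text):
    """Return a list of {'expected', 'found'} dicts for mismatching word runs."""
    expected = expected_text.split()
    found = found_text.split()
    matcher = SequenceMatcher(a=expected, b=found, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete' or 'insert'
            differences.append({
                "expected": " ".join(expected[i1:i2]),
                "found": " ".join(found[j1:j2]),
            })
    return differences


print(word_differences("the quick brown fox", "the quack brown fox"))
```

A skipped word shows up with an empty found value, and an extra spoken word with an empty expected value.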

_get_audio_plugins

_get_audio_plugins() is a scheduled job that returns information about loaded audio plugins.

_get_image_plugins

_get_image_plugins() is a scheduled job that returns information about loaded image plugins.

_extact_phrases_from_audio

_extact_phrases_from_audio() is a helper function that extracts text from audio and searches for each phrase. It splits the audio by non-none intervals and assigns the resulting split files. It returns the result of audio phrases extraction.
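The search step can be illustrated on already-transcribed segments. The sketch below looks for each phrase with a case-insensitive substring match and returns data in the format of AudioExtractPhrasesResponse. The matching rule and the function name are assumptions, and the audio splitting itself is left out.

```python
def find_phrases(segments, phrases):
    """Search each phrase in the segment texts; mirror AudioExtractPhrasesResponse."""
    data = []
    for phrase in phrases:
        match = next(
            (seg for seg in segments if phrase.lower() in seg["text"].lower()),
            None,
        )
        item = {"found": match is not None, "phrase": phrase}
        if match is not None:
            item["audio_segment"] = match  # optional field in AudioPhrase
        data.append(item)
    return {"data": data}


# Hypothetical transcribed segments in the AudioChunk format.
segments = [
    {"start": 0, "end": 2, "text": "Hello world",
     "file": "00bd29cf-1ab3-4825-b15f-d80a4a0e1cbb"},
    {"start": 2, "end": 4, "text": "Good morning",
     "file": "3fa85f64-5717-4562-b3fc-2c963f66afa6"},
]
result = find_phrases(segments, ["good morning", "good night"])
print([item["found"] for item in result["data"]])
```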

extact_phrases_from_audio

extact_phrases_from_audio() is a scheduled job that accepts audio_class, audio_path, and phrases as parameters