Atlos uses a subset of JSON Schema to extract structured data from unstructured data.

Basics

A JSON schema is made up of fields.

A field has the following components:

  • name: The name of the field
  • type: The type of the field
  • description: A description of the field

⚠️ Important: Only these three components are supported. A schema whose fields include additional components will fail.
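
As an illustration, a field carrying all three components might look like the sketch below. It assumes the field name is the property key and that description sits next to type, as in standard JSON Schema. The field name invoice_total and its description text are made up for this example.

```json
{
  "type": "object",
  "properties": {
    "invoice_total": {
      "type": "number",
      "description": "Total amount due on the invoice, including tax"
    }
  },
  "required": ["invoice_total"]
}
```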

Field Types

  • string: Text data like names, addresses, and emails
  • integer: Whole numbers like page count or household size
  • number: Decimal numbers like prices or measurements
  • boolean: A true or false value
  • enum: One value from a list of predefined values, like colors or sizes
  • array: List of items of the same type
  • object: Groups related data together
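
To show how the types listed above fit together, here is a sketch of a single schema that uses each of them once. All field names are hypothetical, and the enum field follows the "type": "enum" form used elsewhere in this guide.

```json
{
  "type": "object",
  "properties": {
    "customer_name": {
      "type": "string"
    },
    "household_size": {
      "type": "integer"
    },
    "unit_price": {
      "type": "number"
    },
    "is_paid": {
      "type": "boolean"
    },
    "shirt_size": {
      "type": "enum",
      "enum": ["small", "medium", "large"]
    },
    "tags": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "billing_address": {
      "type": "object",
      "properties": {
        "city": {
          "type": "string"
        }
      }
    }
  }
}
```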

Best Practices

Naming

  • Provide specific field names, e.g., prefer full_name over name.
  • Use a consistent naming convention, such as snake_case.
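
A short sketch of these naming guidelines in practice: each field name states exactly what it holds, and all names use snake_case. The field names are illustrative only.

```json
{
  "type": "object",
  "properties": {
    "full_name": {
      "type": "string"
    },
    "date_of_birth": {
      "type": "string"
    },
    "phone_number": {
      "type": "string"
    }
  }
}
```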

Utilize Enums

Enums should always be used when there is a known set of possible values.

For example, prefer:

{
  "type": "object",
  "properties": {
    "shipping_status": {
      "type": "enum",
      "enum": ["pending", "shipped", "delivered", "cancelled"]
    }
  },
  "required": ["shipping_status"]
}

Over just using a string type:

{
  "type": "object",
  "properties": {
    "shipping_status": {
      "type": "string"
    }
  }
}

Specify Required Fields

  • Use the required property to list the fields that must be present, so that data is extracted correctly.
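
A minimal sketch of mixing required and optional fields. It assumes that a field left out of required is treated as optional, as in standard JSON Schema; this guide does not confirm that behavior for Atlos. The field names are hypothetical.

```json
{
  "type": "object",
  "properties": {
    "order_id": {
      "type": "string"
    },
    "gift_message": {
      "type": "string"
    }
  },
  "required": ["order_id"]
}
```

Here order_id is listed in required, while gift_message is not, since not every order would carry one.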

Example Schemas

Simple

A simple schema which extracts the required first and last name values from a document.

{
  "type": "object",
  "properties": {
    "first_name": {
      "type": "string"
    },
    "last_name": {
      "type": "string"
    }
  },
  "required": ["first_name", "last_name"]
}

Enums

A schema which extracts the shipping status of an order.

{
  "type": "object",
  "properties": {
    "shipping_status": {
      "type": "enum",
      "enum": ["pending", "shipped", "delivered", "cancelled"]
    }
  },
  "required": ["shipping_status"]
}

Objects & Arrays

An object schema which extracts data about a Person. Take note of the nested address object.

{
  "type": "object",
  "properties": {
    "first_name": {
      "type": "string"
    },
    "last_name": {
      "type": "string"
    },
    "age": {
      "type": "integer"
    },
    "hobbies": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "address": {
      "type": "object",
      "properties": {
        "street": {
          "type": "string"
        },
        "city": {
          "type": "string"
        },
        "state": {
          "type": "string"
        },
        "zip": {
          "type": "string"
        }
      },
      "required": ["street", "city", "state", "zip"]
    }
  },
  "required": ["first_name", "last_name", "age", "address"]
}

Schemas can be generated within our Playground, or programmatically using our API.

Something missing?

If you need help with something that is not covered in the documentation, please let us know by sending a message to alex@atlos.dev