DOC-730 | Data Science Suite #696

Draft
wants to merge 15 commits into
base: main
71 changes: 71 additions & 0 deletions site/content/3.13/components/platform.md
@@ -0,0 +1,71 @@
---
title: The ArangoDB Platform
menuTitle: Platform
weight: 169
description: >-
The ArangoDB Platform brings everything ArangoDB offers together in a single
solution that you can deploy on-prem or use as a managed service
---
The ArangoDB Platform is the technical infrastructure that hosts the entire
ArangoDB product offering. The Platform makes it easy
to deploy and operate the core ArangoDB database system along with any additional
ArangoDB products for machine learning, data exploration, and more. You can
run it yourself on-premises or in the cloud on top of Kubernetes, or use
ArangoDB's managed service, the [ArangoGraph Insights Platform](../arangograph/_index.md),
to access all of the platform features.

## Requirements for self-hosting

- **Kubernetes**: Orchestrates the selected services that comprise the
ArangoDB Platform, running them in containers for safety and scalability.
- **Licenses**: If you want to use any paid features, you need to purchase the
respective packages.

## Products available in the ArangoDB Platform

- **Core database system**: The ArangoDB graph database system for storing
interconnected data. You can use the free Community Edition or the commercial
Enterprise Edition.
- **Graph visualizer**: A web-based tool for exploring your graph data with an
intuitive interface and sophisticated querying capabilities.
- **Data-science suite**: A set of paid machine learning services, APIs, and
user interfaces that are available as a package as well as individual products.
- **Vector embeddings**: You can train machine learning models for later use
in vector search in conjunction with the core database system's `vector`
index type. Vector search lets you find similar items in your dataset. <!-- TODO: GraphRAG importer/retriever -->
- **GraphRAG solutions**: Leverage ArangoDB's Graph, Document, Key-Value,
Full-Text Search, and Vector Search features to streamline knowledge
extraction and retrieval.
- **Txt2AQL**: Unlock natural language querying with a service that converts
user input into ArangoDB Query Language (AQL), powered by fine-tuned
private or public LLMs. <!-- TODO: GenAI -->
- **GraphRAG Importer**: Extract entities and relationships from large
text-based files, converting unstructured data into a knowledge graph
stored in ArangoDB. <!-- TODO: Change to RagLoader? -->
- **GraphRAG Retriever**: Perform semantic similarity searches or aggregate
insights from graph communities with global and local queries.
- **GraphML**: A turnkey solution for graph machine learning for prediction
use cases such as fraud detection, supply chain, healthcare, retail, and
cyber security.
- **Graph Analytics**: A suite of graph algorithms including PageRank,
community detection, and centrality measures with support for GPU
acceleration thanks to Nvidia cuGraph.
- **Jupyter notebooks**: Run a Jupyter kernel in the platform for hosting
interactive notebooks for experimentation and development of applications
that use ArangoDB as their backend.

<!-- TODO: Which product requires what license, free trial -->

## Get started with the ArangoDB Platform

### Use the ArangoDB Platform as a managed service

<!-- TODO: Sign up at https://dashboard.arangodb.cloud -->

### Self-host the ArangoDB Platform

<!-- TODO: Adam's installer -->

## Interfaces

<!-- TODO: UIs, APIs (with links to generated docs) -->
150 changes: 83 additions & 67 deletions site/content/3.13/data-science/_index.md
@@ -1,34 +1,91 @@
---
title: Data Science and GenAI
menuTitle: Data Science & GenAI
title: Generative Artificial Intelligence (GenAI) and Data Science
menuTitle: GenAI & Data Science
weight: 115
description: >-
ArangoDB lets you apply analytics and machine learning to graph data at scale
ArangoDB's set of tools and technologies enables analytics, machine learning,
and GenAI applications powered by graph data
aliases:
- data-science/overview
---
ArangoDB provides a wide range of functionality that can be utilized for
data science applications. The core database system includes multi-model storage
of information with scalable graph and information retrieval capabilities that
you can directly use for your research and product development.

ArangoDB also offers a dedicated GenAI Suite, using the database core
as the foundation for higher-level features. Whether you want to turbocharge
generative AI applications with a GraphRAG solution or apply analytics and
machine learning to graph data at scale, ArangoDB covers these needs.

<!--
ArangoDB's Graph Analytics and GraphML capabilities provide various solutions
in data science and data analytics. Multiple data science personas within the
engineering space can make use of ArangoDB's set of tools and technologies that
enable analytics and machine learning on graph data.
-->

## GenAI Suite

The GenAI Suite consists of two major components:

- [**GraphRAG**](#graphrag): A complete solution for extracting entities
from text files to create a knowledge graph that you can then query with a
natural language interface.
- [**GraphML**](#graphml): Apply machine learning to graphs for link prediction,
classification, and similar tasks.

Each component has an intuitive graphical user interface integrated into the
ArangoDB Platform web interface, guiding you through the process.
<!-- TODO: Not Graph Analytics? -->

Alongside these components, you also get the following additional features:

- **Graph visualizer**: A web-based tool for exploring your graph data with an
intuitive interface and sophisticated querying capabilities.
- **Jupyter notebooks**: Run a Jupyter kernel in the platform for hosting
interactive notebooks for experimentation and development of applications
that use ArangoDB as their backend.
- **MLflow integration**: Built-in support for the popular management tool for
the machine learning lifecycle.
- **Adapters**: Use ArangoDB together with cuGraph, NetworkX, and other tools.
- **Application Programming Interfaces**: Use the underlying APIs of the
GenAI Suite services and build your own integrations.

## Other tools and features

<!-- TODO: Should this and the above section somehow be combined? -->

The ArangoDB Platform includes the following features independent of the
GenAI Suite:

- [**Graph Analytics**](#graph-analytics): Run graph algorithms such as PageRank
on dedicated compute resources.
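
To give a feel for what a graph algorithm such as PageRank computes, here is a minimal, self-contained sketch in plain Python. It is not how the Graph Analytics service is implemented or invoked; the function name `pagerank`, its parameters, and the toy edge list are all made up for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges.

    Hypothetical helper for illustration only, not an ArangoDB API.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start with rank spread evenly across all nodes.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the "random jump" share in each round.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node without outgoing edges spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Toy graph: a -> b -> c -> a forms a cycle, and d also links to c.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

In this toy graph, `c` ends up with the highest score because it is the only node with two incoming links, and `d` has the lowest because nothing links to it.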

## From graph to AI

This section classifies the complexity of the queries you can answer with
ArangoDB and gives you an overview of the respective features.

It starts with a simple query that finds the path from
one node to another, continues with more complex tasks like graph classification,
link prediction, and node classification, and ends with generative AI solutions
powered by graph relationships and vector embeddings.
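
The simplest case, finding a path from one node to another, can be sketched with a breadth-first search. This is a plain-Python illustration under assumed names (`shortest_path`, the `flights` adjacency map); in ArangoDB you would normally express such a lookup as an AQL graph query rather than in application code.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal as a list, or None.

    Hypothetical helper for illustration only, not an ArangoDB API.
    """
    previous = {start: None}  # maps each visited node to its predecessor
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor chain back to the start node.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Toy data: airports as nodes, direct flights as directed edges.
flights = {
    "BER": ["FRA", "AMS"],
    "FRA": ["JFK"],
    "AMS": ["LHR"],
    "LHR": ["JFK"],
    "JFK": [],
}
route = shortest_path(flights, "BER", "JFK")
print(route)
```

Breadth-first search visits nodes in order of hop count, so the first time it reaches the goal it has found a path with the fewest edges.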

ArangoDB, as the foundation for GraphML, comes with the following key features:
### Foundational features

- **Scalable**: designed to support true scalability with high performance for
ArangoDB comes with the following key features:

- **Scalable**: Designed to support true scalability with high performance for
enterprise use cases.
- **Simple Ingestion**: easy integration in existing data infrastructure with
- **Simple Ingestion**: Easy integration in existing data infrastructure with
connectors to all leading data processing and data ecosystems.
- **Source-Available**: extensibility and community.
- **NLP Support**: built-in text processing, search, and similarity ranking.

![ArangoDB Machine Learning Architecture](../../images/machine-learning-architecture.png)
- **Source-Available**: Extensibility and community.
- **NLP Support**: Built-in text processing, search, and similarity ranking.

## Graph Analytics vs. GraphML
<!-- TODO: This is actually GraphML specific... -->

This section classifies the complexity of the queries we can answer -
like running a simple query that shows what is the path that goes from one node
to another, or more complex tasks like node classification,
link prediction, and graph classification.
![ArangoDB Machine Learning Architecture](../../images/machine-learning-architecture.png)

### Graph Queries

@@ -71,63 +128,22 @@ GraphML can answer questions like:
For ArangoDB's enterprise-ready, graph-powered machine learning offering,
see [ArangoGraphML](graphml/_index.md).

## Use Cases

This section contains an overview of different use cases where Graph Analytics
and GraphML can be applied.

### GraphML

GraphML capabilities of using more data outperform conventional deep learning
methods and **solve high-computational complexity graph problems**, such as:
- Drug discovery, repurposing, and predicting adverse effects.
- Personalized product/service recommendation.
- Supply chain and logistics.

With GraphML, you can also **predict relationships and structures**, such as:
- Predict molecules for treating diseases (precision medicine).
- Predict fraudulent behavior, credit risk, purchase of product or services.
- Predict relationships among customers, accounts.

ArangoDB uses well-known GraphML frameworks like
[Deep Graph Library](https://www.dgl.ai)
and [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/)
and connects to these external machine learning libraries. When coupled to
ArangoDB, you are essentially integrating them with your graph dataset.

## Example: ArangoFlix

ArangoFlix is a complete movie recommendation application that predicts missing
links between a user and the movies they have not watched yet.

This [interactive tutorial](https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/Integrate_ArangoDB_with_PyG.ipynb)
demonstrates how to integrate ArangoDB with PyTorch Geometric to
build recommendation systems using Graph Neural Networks (GNNs).

The full ArangoFlix demo website is accessible from the ArangoGraph Insights Platform,
the managed cloud for ArangoDB. You can open the demo website that connects to
your running database from the **Examples** tab of your deployment.

{{< tip >}}
You can try out the ArangoGraph Insights Platform free of charge for 14 days.
Sign up at [dashboard.arangodb.cloud](https://dashboard.arangodb.cloud/home?utm_source=docs&utm_medium=cluster_pages&utm_campaign=docs_traffic).
{{< /tip >}}
### GraphRAG

The ArangoFlix demo uses five different recommendation methods:
- Content-Based using AQL
- Collaborative Filtering using AQL
- Content-Based using ML
- Matrix Factorization
- Graph Neural Networks
GraphRAG is ArangoDB's turnkey solution for turning your organization's data into
a knowledge graph and letting everyone use that knowledge by asking questions in
natural language.

![ArangoFlix demo](../../images/data-science-arangoflix.png)
GraphRAG combines vector search for retrieving related text snippets
with graph-based retrieval-augmented generation for context expansion
and relationship discovery. This lets a large language model (LLM) generate
answers that are accurate, context-aware, and chronologically structured.
This approach combats the common problem of hallucination.
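
The combination of vector search and graph-based context expansion can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `retrieve_context` function, the two-dimensional embeddings, and the in-memory `chunks` and `graph` structures do not reflect how the GraphRAG services store or query data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_context(query_vec, chunks, graph, top_k=1):
    """Pick the top_k most similar chunks, then add their graph neighbors.

    chunks maps a chunk ID to (embedding, text); graph maps a chunk ID to
    the IDs of related chunks. Both are hypothetical toy structures.
    """
    ranked = sorted(
        chunks, key=lambda cid: cosine(query_vec, chunks[cid][0]), reverse=True
    )
    selected = ranked[:top_k]  # vector search step
    for seed in list(selected):
        # Context expansion step: follow relationships in the graph.
        for neighbor in graph.get(seed, []):
            if neighbor not in selected:
                selected.append(neighbor)
    return [chunks[cid][1] for cid in selected]


chunks = {
    "c1": ([1.0, 0.0], "ACME acquired Initech in 2021."),
    "c2": ([0.0, 1.0], "Initech builds payroll software."),
    "c3": ([-1.0, 0.0], "The weather was mild."),
}
graph = {"c1": ["c2"]}  # c1 mentions an entity that c2 describes

context = retrieve_context([0.9, 0.1], chunks, graph)
prompt = "Answer using only this context:\n" + "\n".join(context)
print(prompt)
```

Vector similarity alone would return only the first snippet. Following the graph edge pulls in the related snippet as well, so the LLM receives facts that are connected to the match even though they are not textually similar to the question.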

The ArangoFlix website not only offers an example of how the user recommendations might
look like in real life, but it also provides information on a recommendation method,
an AQL query, a custom graph visualization for each movie, and more.
To learn more, see the [GraphRAG](graphrag/_index.md) documentation.

## Sample datasets

If you want to try out ArangoDB's data science features, you may use the
[`arango_datasets` Python package](../components/tools/arango-datasets.md)
[`arango-datasets` Python package](../components/tools/arango-datasets.md)
to load sample datasets into a deployment.
2 changes: 1 addition & 1 deletion site/content/3.13/data-science/arangograph-notebooks.md
@@ -1,7 +1,7 @@
---
title: ArangoGraph Notebooks
menuTitle: ArangoGraph Notebooks
weight: 130
weight: 40
description: >-
Colocated Jupyter Notebooks within the ArangoGraph Insights Platform
---
2 changes: 1 addition & 1 deletion site/content/3.13/data-science/graphml/_index.md
@@ -1,7 +1,7 @@
---
title: ArangoDB GraphML
menuTitle: GraphML
weight: 125
weight: 15
description: >-
Boost your machine learning models with graph data using ArangoDB's advanced GraphML capabilities
aliases:
2 changes: 1 addition & 1 deletion site/content/3.13/data-science/graphml/ui.md
@@ -241,4 +241,4 @@ the Web Interface or export them for downstream use.

- **Edge Attributes**: The current version of GraphML does not support the use of edge attributes as features.
- **Dangling Edges**: Edges that point to non-existent vertices ("dangling edges") are not caught during the featurization analysis. They may cause errors later, during the Training phase.
- **Memory Usage**: Both featurization and training can be memory-intensive. Out-of-memory errors can occur on large graphs with insufficient system resources.
- **Memory Usage**: Both featurization and training can be memory-intensive. Out-of-memory errors can occur on large graphs with insufficient system resources.