Vector database: Difference between revisions

Revision as of 10:08, 11 May 2025

Vector-capable databases like Neo4j have been important for quite some time, and they matter even more now that artificial intelligence (AI) is mainstream. A vector database is a collection of data stored as mathematical representations (vectors). Storing data this way makes it possible for computer programs to draw comparisons, identify relationships, and understand context. It also enables semantic search, which is search based on meaning rather than exact text matching. Semantic search has been around for decades, but the older approaches built on tagging and ontologies have given way to approaches built on large language models (LLMs). Vector databases in turn support the creation of advanced AI programs like LLMs.
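To make "search based on meaning" concrete, here is a minimal sketch of the comparison a vector database performs: ranking stored vectors by cosine similarity to a query vector. The three-dimensional vectors and their labels are made-up illustrations, not output from any real embedding model, and the function names are hypothetical. Real systems use vectors with hundreds of dimensions and approximate indexes rather than this brute-force scan.

```python
import numpy as np

def cosine_similarity(query, vectors):
    """Cosine similarity between one query vector and each row of `vectors`."""
    query = np.asarray(query, dtype=float)
    vectors = np.asarray(vectors, dtype=float)
    dots = vectors @ query
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    return dots / norms

def top_k(query, vectors, labels, k=2):
    """Return the k (label, score) pairs most similar to the query."""
    scores = cosine_similarity(query, vectors)
    order = np.argsort(scores)[::-1][:k]  # indices of the highest scores first
    return [(labels[i], float(scores[i])) for i in order]

# Hypothetical toy "embeddings" (assumption: invented for illustration only).
labels = ["cat", "kitten", "car"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
])
query = np.array([0.85, 0.15, 0.05])

results = top_k(query, vectors, labels, k=2)
print(results)
```

The two animal vectors rank above "car" because they point in nearly the same direction as the query, even though no text was compared.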

There are many open source vector databases^[1], such as Apache Cassandra, Elasticsearch, Meilisearch and MongoDB. (Even the traditional relational databases MariaDB^[2] and Postgres^[3] now offer vector capability.)

Commercial

Pinecone is a commercial vector database product. Their page What is a Vector Database & How Does it Work? Use Cases + Examples describes in some depth what a vector database is, how it compares with simple vector indexes and traditional scalar databases, and how it works. They even get into algorithms and discuss things like random projection (the Wikipedia article needs more work, but it's the tip of an iceberg of statistics, data science and artificial intelligence).

Random projection is a topic unto itself. Professor Michael Pyrcz at the University of Texas at Austin posts his coursework and code online for all to learn from. His lecture PGE 383 - Feature Projection starts off with random projection. (Note too that he puts all materials on his GitHub account; e.g. https://github.com/GeostatsGuy/PythonNumericalDemos/blob/master/SubsurfaceDataAnalytics_Multidimensional_Scaling.ipynb) It's one of many algorithms in "dimensionality reduction", a field with applications in neuroscience, artificial intelligence and recommender systems.
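As a rough illustration of the idea, the sketch below applies a Gaussian random projection with NumPy: high-dimensional points are multiplied by a random matrix to get fewer dimensions, and pairwise distances come out approximately preserved. This is a generic textbook-style sketch, not code from Pinecone or from Professor Pyrcz's materials; the sizes (20 points, 1000 dimensions reduced to 300) and the function names are assumptions chosen for the demonstration.

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project rows of X from d dimensions down to k using a Gaussian random matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Entries drawn from N(0, 1/k) so squared lengths are preserved on average.
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))
    return X @ R

def pairwise_distances(X):
    """Euclidean distance between every pair of rows in X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Assumed demo data: 20 random points in 1000 dimensions.
rng = np.random.default_rng(42)
X = rng.normal(size=(20, 1000))
Y = random_projection(X, k=300)

# Compare distances before and after projection for each distinct pair.
iu = np.triu_indices(len(X), k=1)
ratios = pairwise_distances(Y)[iu] / pairwise_distances(X)[iu]
print(Y.shape, ratios.min(), ratios.max())
```

The printed ratios should sit close to 1, which is why a projection like this can shrink vectors before indexing without badly distorting nearest-neighbour comparisons.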

Open Source

One interesting open source vector database is Memgraph. Memgraph is like Neo4j without the cost, and it uses the same Cypher query language. However, it is written in C++ and integrates better with Python than Neo4j, which relies on Java for building applications. An interesting case study is how NASA is building a People Knowledge Graph with LLMs and Memgraph^[4]. In the NASA case study, they use Ollama, a locally deployed AI model runner that can be thought of as a counterpart to Docker Desktop: it runs AI models locally much as Docker Desktop runs Docker images.

References

↑ wp:Vector database
↑ It was Hugo Wen of the Amazon RDS team who contributed heavily to this feature of MariaDB. Why add vectors to a relational database? There are already databases designed specifically for vectors, but they will lead to additional layers in a database architecture and increase overall costs for the service. Supporting vectors inside the relational database will reduce complexity and simplify maintenance, thus making it more cost efficient. With MariaDB Server, users can have data and vectors in two tables of the same database and access them both with one single query. Postgres offers similar (free) open source capabilities. MySQL's implementation is reserved for their enterprise version customers. https://mariadb.org/amazon-mariadb-vector/
↑ https://github.com/pgvector/pgvector
↑ https://www.theregister.com/2025/05/07/nasa_people_memgraph/

@@ Line 8: / Line 8: @@
 https://mariadb.org/amazon-mariadb-vector/</ref> and [[Postgres]]<ref>https://github.com/pgvector/pgvector</ref> offer vector capability now.)
+== Commercial ==
+'''Pinecone''' is a commercial vector database product. They have a page [https://www.pinecone.io/learn/vector-database/ What is a Vector Database & How Does it Work? Use Cases + Examples] that goes quite a bit into describing what a vector database is, compared with simple vector indexes or traditional scalar databases, and how a vector database works. They even get into algorithms and discuss things like [[wp:random_projection|random projection]] (the Wikipedia article needs more work but it's the tip of an iceberg of statistics, data science and artificial intelligence).
+Random Projection is a topic all unto itself. Professor '''Michael Pyrcz''' at the University of Texas, Austin posts his coursework and code online for all to learn. Here's [https://www.youtube.com/watch?v=bfS7JAjiOMI PGE 383 - Feature Projection] which he starts off with Random Projection. (Note too that he puts all materials onto his GitHub account; e.g. https://github.com/GeostatsGuy/PythonNumericalDemos/blob/master/SubsurfaceDataAnalytics_Multidimensional_Scaling.ipynb)   It's one of many algorithms in "[[wp:Dimensionality_reduction|dimensionality reduction]]" that have applications in neuroscience, artificial intelligence and [[wp:Recommender_system|recommender systems]].
+== Open Source ==
 One interesting open source vector database is '''Memgraph'''. Memgraph is like Neo4j without the cost. Memgraph uses the same Cypher query language as Neo4j. However, it is written in C++ and integrates better with Python than Neo4j, which uses Java to build applications. An interesting case study is how NASA is building a People Knowledge Graph with LLMs and Memgraph<ref>https://www.theregister.com/2025/05/07/nasa_people_memgraph/</ref>. In the NASA case study, they use [[Ollama]] which is a locally deployed AI model runner which can be thought of as like [[Docker Desktop]] for running [[Docker]] images.