Code Graph

Originally published on: https://www.linkedin.com/pulse/code-graph-roi-lipman-fnc0f/ 

Nowadays everyone is trying to incorporate LLMs into their technology stack, be it an AI-powered chat bot or a domain-expert assistant; the range of applications is vast. But in most RAG (Retrieval-Augmented Generation) solutions we’ve seen thus far, the common setup is a combination of an LLM and a vector database. Early feedback indicates that this setup is simply not good enough: applications suffer from hallucination and users are presented with half-truths. This makes the leap toward AI-enhanced applications a real challenge.

It is our belief that the situation can be dramatically improved by simply replacing the vector DB with a knowledge graph, which allows for much broader and richer contextual retrieval.

To demonstrate our point we’ve created this demo application (source code available here), which processes a Python code base and allows the user to query the source code using natural language.

Here’s a short list of questions presented to the system, along with the answers it produced.

Question: Which class contains the function with most arguments? name the class, function and a short list of its arguments.

Answer: The class “FalkorDB” contains a function with the most arguments. The function is “__init__”, and some of its arguments include “host”, “port”, “password”, “socket_timeout”.

Request: Show me a calling path between the ‘drop_edge_range_index’ function and ‘_query’, only return function(s) names

Result: The calling path between the ‘drop_edge_range_index’ function and ‘_query’ is: drop_edge_range_index -> dropindex -> query -> _query

Question: Which test functions exercise the ‘query’ function ?

Answer: The test functions that exercise the query function are: test_connect_via_url, test_explain, test_graph_creation, test_array_functions and test_path. These test functions are the ones that utilize the query function in their respective test cases.

Code Graph flow
 

Whenever a user sends a question, we embed that question in a broader prompt to the LLM, which in turn produces a graph query to extract relevant information from our knowledge graph.

Provided with the query result, we reintroduce the user’s question, enriched with the retrieved context, to produce our final answer.
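The two-step flow described above can be sketched in Python. Here `llm` and `run_cypher` are hypothetical stand-ins for the OpenAI completion call and the knowledge-graph query execution; the demo's actual orchestration may differ.

```python
def answer_question(question, llm, run_cypher, schema_prompt):
    # Step 1: ask the LLM to translate the question into a Cypher query,
    # using the schema prompt described later in this article.
    cypher = llm(f"{schema_prompt}\n\nQuestion: {question}\nCypher query:")

    # Step 2: run the generated query against the knowledge graph.
    context = run_cypher(cypher)

    # Step 3: reintroduce the question, enriched with the retrieved context,
    # and let the LLM formulate the final answer.
    return llm(
        f"This is the user's question: {question}\n"
        f"And this is the data we've got from our knowledge graph: {context}\n"
        "Please formulate an answer based only on this data."
    )
```

The key design point is that the LLM never answers from its own memory of the code base; every answer is grounded in data retrieved from the graph.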

Interacting with OpenAI

Given that the user can query the code base using natural language, we needed a way to transform their question into a graph query for context retrieval. This transformation is done by the OpenAI completion API with the following prompt:

You’re a Cypher expert, with access to the following graph:

The knowledge graph schema is as follows:

The graph contains the following node labels:

Module

Class

Function

the Module label has 57 nodes and is associated with the following attribute(s):

‘name’ which is of type String

the Class label has 109 nodes and is associated with the following attribute(s):

‘name’ which is of type String

‘file_name’ which is of type String

‘src_start’ which is of type Integer

‘src_end’ which is of type Integer

The graph contains the following relationship types:

CONTAINS

CALLS

the CONTAINS relationship has 1388 edges and has no attributes

the CONTAINS relationship connects the following labels:

Module is connected via CONTAINS to Class

Module is connected via CONTAINS to Function

Class is connected via CONTAINS to Function

the CALLS relationship has 2622 edges and is associated with the following attribute(s):

‘file_name’ which is of type String

This is the end of the knowledge graph schema description.

The knowledge graph contains the following indexes:

NODE of type Function have a RANGE index indexing its name attribute

NODE of type Function have a VECTOR index indexing its src_embeddings attribute

NODE of type Module have a RANGE index indexing its name attribute

NODE of type Class have a RANGE index indexing its name attribute

This is the end of our indexes list

To use a Vector index use the following procedure:

CALL db.idx.vector.queryNodes(<LABEL>, <PROPERTY>, <N>, <VALUE>) YIELD node

The procedure returns up to N nodes that have a <PROPERTY> value which is semantically close to <VALUE>.

Here are a few question / answer examples of using the vector index:

Question: Find 3 functions which have nested loops, return a list of callers

Cypher query: CALL db.idx.vector.queryNodes('Function', 'source', 3, 'nested loops') YIELD node MATCH (f)-[:CALLS]->(node) RETURN f.name

Question: List 5 Classes which contain a function that raise an exception

Cypher query: CALL db.idx.vector.queryNodes('Function', 'source', 5, 'raise exception') YIELD node MATCH (c:Class)-[:CONTAINS]->(node) RETURN c.name, node.name

The graph represents a code base.
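For illustration, the first interaction could be assembled along these lines; the truncated SCHEMA_PROMPT stands in for the full prompt above, and the system/user message split is an assumption about how the demo passes the prompt to the chat completions API.

```python
# Illustrative sketch: the schema prompt above becomes the system message,
# the user's question becomes the user message.
SCHEMA_PROMPT = (
    "You're a Cypher expert, with access to the following graph:\n"
    "... (schema, indexes and vector-index examples as shown above) ..."
)

def build_messages(question):
    # Ending the user message with "Cypher query:" nudges the model to
    # respond with just the query, mirroring the examples in the prompt.
    return [
        {"role": "system", "content": SCHEMA_PROMPT},
        {"role": "user", "content": f"Question: {question}\nCypher query:"},
    ]
```

The resulting messages list would then be sent via the OpenAI API, and the model's reply run against the graph.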

This prompt achieves a number of goals: it establishes the LLM as a Cypher expert, describes the graph schema, lists the available indexes and demonstrates how to query the vector index.

Presented with the above prompt and a user question, the LLM generates a graph query, which we then run against our knowledge graph.

Knowledge graph

The demo provides a number of Python code bases to explore and query; one of them is the FalkorDB Python client code base. The demo is written in such a way that it allows for the construction of knowledge graphs from any Python project. Taking this even further, we believe that with minimal effort the demo can be extended to support code bases in other programming languages, thanks to the flexible and well-documented TreeSitter parser, which can generate ASTs for a variety of programming languages.

Dependency graph
 
The knowledge graph construction starts by building an AST from a Python code base; given the AST, we extract the following entities and the relations between them:

Entities: Module, Class and Function.

Relations: CONTAINS and CALLS.

Both nodes and edges are enriched with additional metadata e.g. class_name, function_args and function_src_code.
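The demo uses TreeSitter for parsing, but the extraction idea can be sketched with Python's built-in `ast` module: walk the tree, record Module/Class/Function entities, and emit CONTAINS and CALLS relations. The function below is a simplified illustration, not the demo's code.

```python
import ast

def extract(source, module_name):
    """Extract (label, name) entities and (src, rel, dst) relations
    from one Python module's source code."""
    tree = ast.parse(source)
    nodes, edges = [("Module", module_name)], []

    # Collect Class and Function entities, plus CALLS edges for
    # direct calls made inside each function body.
    for item in ast.walk(tree):
        if isinstance(item, ast.ClassDef):
            nodes.append(("Class", item.name))
        elif isinstance(item, ast.FunctionDef):
            nodes.append(("Function", item.name))
            for call in ast.walk(item):
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                    edges.append((item.name, "CALLS", call.func.id))

    # CONTAINS edges: Module -> Class/Function, Class -> Function (methods).
    for item in ast.iter_child_nodes(tree):
        if isinstance(item, (ast.ClassDef, ast.FunctionDef)):
            edges.append((module_name, "CONTAINS", item.name))
            for sub in ast.iter_child_nodes(item):
                if isinstance(sub, ast.FunctionDef):
                    edges.append((item.name, "CONTAINS", sub.name))

    return nodes, edges
```

A real extractor would also resolve method calls (`obj.method()`), cross-module references and decorators, which is where a full parser such as TreeSitter earns its keep.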

Lastly, to support questions that inquire about the actual logic of a function, e.g. “Which function performs an IO operation?” or “Which function branches?”, we create embeddings for each function’s source code and index these embedding vectors using our graph vector index.
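As an illustration (not the demo's exact code), the embedding step can be sketched as follows. The `embed` function is a stub standing in for a real embedding model, and the `vecf32()` vector constructor is an assumption about FalkorDB's Cypher dialect; the `queryNodes` procedure mirrors the prompt shown earlier.

```python
def embed(source_code):
    # Stub: the demo would call a real embedding model here; this
    # placeholder just returns a fixed-size numeric vector.
    return [float(len(source_code)), float(source_code.count("\n")), 0.0]

def attach_embedding_query(name, source_code):
    # Parameterized Cypher that stores the vector on the Function node,
    # assuming FalkorDB's vecf32() vector constructor.
    query = ("MATCH (f:Function {name: $name}) "
             "SET f.src_embeddings = vecf32($vec)")
    return query, {"name": name, "vec": embed(source_code)}

def semantic_search_query(n):
    # Mirrors the vector-index procedure described in the prompt above.
    return (f"CALL db.idx.vector.queryNodes('Function', 'src_embeddings', {n}, "
            "vecf32($vec)) YIELD node RETURN node.name")
```

These query strings would be executed through the FalkorDB client against a graph with the vector index described earlier.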

I should mention that due to the structural nature of the data we’re working with (source code), it was an obvious choice to use a parser and an AST to allow easy entity extraction. Still, it could be an interesting exercise to hand off entity and relation extraction to the LLM; after all, in most situations the knowledge itself is unstructured, e.g. PDF documents written in natural language.

 

Human Answer

Although the knowledge-graph query results may suffice to answer the user’s question, they are structured in a way which isn’t well suited to a question-answering chat-bot type of application. This is why we perform a second interaction with the LLM: this time we present the user’s question together with the data retrieved from the knowledge graph, instructing the LLM with the following prompt:

This is the user’s question: <ORIGINAL_QUESTION>

And this is the data we’ve got from our knowledge graph: <QUERY_RESULTS>

Please formulate an answer to the user question based on the data we’ve got from the knowledge graph

This prompt restricts the LLM to produce an answer based on the context provided, nothing more. It’s this restriction which prevents the model from hallucinating.

It is at this point that we’re left with a human-readable answer, which we present to the user.
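Assembling this second prompt is straightforward; a minimal sketch (names are illustrative, mirroring the prompt text above):

```python
def build_answer_prompt(question, query_results):
    # The second prompt: the original question plus the graph query results,
    # with an instruction to answer only from that data.
    return (
        f"This is the user's question: {question}\n"
        f"And this is the data we've got from our knowledge graph: "
        f"{query_results}\n"
        "Please formulate an answer to the user question based on the data "
        "we've got from the knowledge graph"
    )
```

The returned string is sent to the LLM as-is, and its reply is the answer shown to the user.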

Future work

In the short term we see value in enriching our knowledge graph with additional information, such as introducing new types of entities and extending entity metadata to include doc-strings and some runtime information.

Moreover, we would like to add support for additional programming languages, e.g. Go, C and Java.

In the long run we would like to offer a more general solution which handles data from other domains.