Unlike web search, code search is something most people don’t use, know, or care about. In some ways, however, code search is harder, and perhaps more important, than web search. At least, this is what Quinn Slack, co-founder and CEO of Sourcegraph, thinks. We caught up with him to discuss what makes Sourcegraph special, and what its new funding will bring about.
Sourcegraph secured $23 million in Series B funding led by Craft Ventures with participation from earlier investors Redpoint Ventures, Goldcrest Capital and others. Sourcegraph touts its Universal Code Search as enabling developers to explore and better understand all code, everywhere, faster, and has clients like Uber, Lyft, Yelp and Plaid.
Code search – what is it good for?
“With more code, languages, and systems to integrate than ever before, the job of the developer is exponentially more difficult than 5-10-15 years ago”. This statement by Slack should give even non-developers a glimpse into why a developer’s work is hard. The amount of code in the world today may not be as much as the amount of documents on the web, but it’s definitely on the rise.
Some developers, and projects, are more organized than others. No matter how organized you are, however, in the end the sheer volume of it all becomes overwhelming. This is why code search is needed. Slack said Sourcegraph wants to be the Google of code. Even though code search is used by fewer people than web search, Slack believes it’s actually harder.
Let’s start with what web search and code search share: indexing. Search functionality largely relies on indexing, and code is no exception. But code is a different kind of data than web documents. Not just in terms of structure, but also in terms of interdependence and velocity. Information on the web is in constant flux, but people have come to accept that web search won’t usually get them results generated within the last few hours.
For code, this is absolutely essential. As distributed teams are becoming commonplace, especially in large organizations, changes in code should be immediately visible. Changes may introduce bugs or vulnerabilities, or they may solve major issues. Staying up to date with the latest code changes, however, cannot be achieved through indexing alone.
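To make the indexing idea concrete, here is a minimal Python sketch of an inverted index that can re-index a single changed file, so that a search reflects the edit right away. It is purely illustrative: the `CodeIndex` class, its methods and the file names are our own assumptions, and nothing here describes how Sourcegraph actually builds its index.

```python
from collections import defaultdict
import re

# Identifier-like tokens; a real code index would be far more language-aware.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


class CodeIndex:
    """Toy inverted index: token -> set of file paths containing it (hypothetical)."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> paths
        self.tokens_by_path = {}          # path -> tokens currently indexed

    def update(self, path, text):
        """Index (or re-index) one file, touching only the tokens that changed."""
        new_tokens = set(TOKEN.findall(text))
        old_tokens = self.tokens_by_path.get(path, set())
        for token in old_tokens - new_tokens:
            self.postings[token].discard(path)
        for token in new_tokens - old_tokens:
            self.postings[token].add(path)
        self.tokens_by_path[path] = new_tokens

    def search(self, token):
        """Return the sorted list of files that currently contain the token."""
        return sorted(self.postings.get(token, set()))


index = CodeIndex()
index.update("a.py", "def parse_config(path): return load(path)")
index.update("b.py", "cfg = parse_config('x.toml')")
before = index.search("parse_config")

# A developer edits b.py; only that file is re-indexed.
index.update("b.py", "cfg = read_settings('x.toml')")
after = index.search("parse_config")
```

In this toy setup, `before` lists both files while `after` lists only `a.py`. Keeping that kind of freshness across many repositories and contributors is the hard part the article refers to.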
Slack said that in order to make it work, they had to build code search from the ground up. In addition to specialized indexing, Sourcegraph’s code search analyzes and utilizes interdependency in code. Code modules are highly interconnected – they include each other, call each other, pass data to each other, and so on. This interdependency is best modeled as a graph, and this is where the “Graph” in Sourcegraph comes from.
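As a rough illustration of why a graph is a natural fit, the Python sketch below records which symbols call which, then answers two typical questions: who references a symbol, and what could be affected if it changes. The `CodeGraph` class and the symbol names are hypothetical, chosen for the example; Sourcegraph’s own data model was not described in that level of detail.

```python
from collections import defaultdict, deque


class CodeGraph:
    """Toy call graph (hypothetical): edges point from caller to callee."""

    def __init__(self):
        self.calls = defaultdict(set)      # caller -> callees
        self.called_by = defaultdict(set)  # callee -> callers

    def add_call(self, caller, callee):
        self.calls[caller].add(callee)
        self.called_by[callee].add(caller)

    def references(self, symbol):
        """Direct callers of a symbol, like 'find all references'."""
        return sorted(self.called_by.get(symbol, set()))

    def impacted_by(self, symbol):
        """Everything that directly or transitively depends on a symbol."""
        seen, queue = set(), deque([symbol])
        while queue:
            current = queue.popleft()
            for caller in self.called_by.get(current, ()):
                if caller not in seen:
                    seen.add(caller)
                    queue.append(caller)
        return sorted(seen)


graph = CodeGraph()
graph.add_call("api.handler", "auth.check")
graph.add_call("auth.check", "db.query")
graph.add_call("jobs.sync", "db.query")

direct = graph.references("db.query")     # callers one hop away
affected = graph.impacted_by("db.query")  # callers at any distance
```

Here `direct` contains `auth.check` and `jobs.sync`, while `affected` also picks up `api.handler`, which reaches `db.query` only indirectly. A plain text index cannot answer that second question on its own.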
Graph all the code
Slack described this as “a graph of all the code you are using”. This rich structure is leveraged to improve code search results, much like links do for web search. It also powers things such as code change management and code review. Naturally, that was a discussion point with Slack, who was clear about the fact that Sourcegraph is not using any off-the-shelf graph database or component.
This, Slack said, comes down to code being different from other data, too. There are two parts to understanding code, he went on to add. The first one is from the compiler point of view, and Sourcegraph is leveraging the open-source Language Server Protocol (LSP) for this.
LSP, originally introduced by Microsoft, defines the protocol used between an editor or IDE and a language server that provides language features such as autocomplete, go to definition, and find all references. The goal of the related Language Server Index Format (LSIF) is to support rich code navigation in development tools or a Web user interface without needing a local copy of the source code.
Sourcegraph leverages this, as Slack said they were already using something similar internally, so when LSP/LSIF came out, they were happy to switch to it. But that’s only one part of the equation – the other one is putting that LSIF data about code somewhere where it can be queried in real time. Some projects have hundreds of developers working on them, and when something changes, that needs to be visible immediately, and for all code versions.
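For readers curious about what the LSP exchange mentioned above looks like on the wire, here is a small Python sketch that builds a “go to definition” request. LSP messages are JSON-RPC 2.0 payloads preceded by a `Content-Length` header, and `textDocument/definition` is the method defined by the protocol for this feature. The helper name, the file URI and the cursor position are made up for illustration, and the sketch says nothing about how Sourcegraph itself issues or stores such requests.

```python
import json


def definition_request(request_id, uri, line, character):
    """Build a framed LSP 'textDocument/definition' request (illustrative helper)."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "textDocument/definition",
        "params": {
            "textDocument": {"uri": uri},
            # LSP positions are zero-based line and character offsets.
            "position": {"line": line, "character": character},
        },
    }
    body = json.dumps(payload).encode("utf-8")
    # The header carries the byte length of the JSON body.
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


# Hypothetical file and cursor position.
message = definition_request(1, "file:///src/main.go", 41, 7)
```

A language server would answer with the location, or locations, where the symbol under the cursor is defined. LSIF, by contrast, aims to capture that kind of answer ahead of time so it can be served without a running language server or a local checkout.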
The need to index historical versions of code adds an extra layer to code search, because it means multiplying the volume of code by a large number. Just imagine web search being able to search throughout the entire history of the web, and you’ll get the picture. This, Slack said, meant no off-the-shelf graph database worked for Sourcegraph.
And to make things harder, Sourcegraph ships its solution to customers. This means there is no single point of indexing: it has to work on the client’s infrastructure. Code can be a sensitive and valuable asset, which clients want to keep to themselves.
Fair use, source code availability, and Fair License
What all of that means is that Sourcegraph had to develop some quite advanced technology to deal with those challenges. Knowing that Sourcegraph is open core, i.e. it comes in many flavors, based on an open-source codebase, we wondered which part of the offering the advanced solutions Slack outlined belonged to.
While not all of Sourcegraph’s code is open source, it is all public. The open-source part is licensed under the permissive Apache License. Even the non-open-source part, Slack said, needs to be available (source-available). The reason is connected to code being a sensitive asset: clients feel secure by being able to audit what Sourcegraph does.
Slack went on to add that some of the advanced capabilities are also present in the open-source version, but things such as the high scalability features are part of the paid version. There is a free version, which developers can use to get started. As Slack said, if many developers in an organization are using Sourcegraph, they’re probably getting lots of value out of it, and so it makes sense for them, and it would be fair, to move to the paid version.
Interestingly, this touches upon the evolution of open-source, on which we wrote previously. To be precise, the above idea seems to match exactly the premise behind the Fair Source license, drafted by prominent open-source lawyer Heather Meeker, and associated with Sourcegraph. Since Fair Source adoption seems stalled, we asked Slack what the story is there.
Slack referred to Fair Source as an experiment, which in hindsight may have been more appropriately attributed to him, rather than Sourcegraph. Sourcegraph is not licensed under it. Sourcegraph, Slack went on to add, is a developer-facing tool, used directly by end users. This means it gets more mindshare, stickiness and feedback than most other developer tools.
It also means that the risk of being redistributed by cloud providers is lower: “You see AWS doing that with tools that are not user facing, like databases. It’s much harder for them to do that with a user-facing product. I find this conversation super interesting personally, but it does not really affect us as Sourcegraph”, said Slack. But Fair Source is not abandoned, either.
Slack said that last time he checked, there were some 600 – 700 projects using the Fair Source license, although he has not been in touch with them. Slack was sympathetic to the woes of other open-source vendors, but he was also clear about the fact that since Sourcegraph does not really feel their pain, he can’t advise on a way forward.
Universal Code Search
Code search, and related source code facilities, used to be a privilege of the few. Facebook and Google have had this for a while, and have invested heavily in it, too. Slack said Sourcegraph hired one of the key people who implemented this in Google, as the first part in Sourcegraph’s master plan is to make basic code intelligence ubiquitous.
Sourcegraph is not short of ambition, as the next steps in its master plan are making code review continuous and intelligent, and increasing the amount and quality of open-source code. It also seems to have a very transparent and data-driven culture, with its strategy, values and goals being publicly available.
That, however, does not mean it’s the only solution for code search out there. GitHub, for example, recently announced CodeQL, a semantic code analysis engine brought to GitHub via Semmle. Other solutions like ShiftLeft, which also utilizes graph infrastructure, or good old IDE search, also exist.
Slack acknowledged the competition, but said that what Sourcegraph brings to the table is the fact that its search is universal; it works across repositories, IDEs, and programming languages:
“Universal Code Search is the only solution that lets developers stay on top of this ever-growing complexity by giving them the ability to search, understand and fix problems across the entire codebase. We are going to continue to invest in our product, making it faster, adding new features, supporting more programming languages, integrating deeper with tools. Every dollar is going to be spent in a way that ends up working for developers”.
Phase 2 in Sourcegraph’s master plan is enabling everyone to code. If you think that’s crazy, reads the plan, ask yourself: now that billions of people have access to the Internet, is coding more like reading and writing (which virtually everyone does) or publishing books (which 0.1% of the population does)?
We’d argue reading and writing may be done by a large part of the population, but with varying levels of quality. Code is probably similar. For the time being, let’s stick to raising the bar on quality.