Compilers and Repositories: A Complicated Love Story

More often than not, compilers are used as black boxes, limited to their (admittedly useful) primary purpose, namely, to compile source code into binary machine code, assembler, another common programming language or bytecode for the virtual machine of the day.

Even if understandable, such a narrow view of a compiler’s use is wasteful: in the process of actually compiling source code, a compiler performs tons of analysis based on sophisticated data structures, both of which could very well be of use for other purposes. When a compiler does not allow for the reuse of such components, anyone in need of source code analysis must reimplement a significant part of it from scratch.

Beyond wasteful, really.

With the advent of reflection (at least for its static incarnation, the one that works on source code artefacts, not the runtime inspection of a live system during execution), there are now ways one can use a compiler’s components as a resource to implement bespoke analysis tasks.

Java’s Jackpot project was an early and comprehensive effort at allowing one to perform various analyses and/or transformations on Java programs, but it never gained serious momentum.

It is now almost defunct: you know for a fact that anything related to Java is close to disappearing when a Google search returns references to coffee-related products on Starbucks’ web site instead of software products.

Microsoft’s Roslyn, coupled with LINQ’s query capabilities, is faring better nowadays. See, for instance, Sudipta Mukherjee’s book “Source Code Analytics With Roslyn and JavaScript Data Visualization” for a comprehensive catalog of meta-programming techniques applied to real world problems.
But the barriers to entry are still serious. The learning curve is steep, even for the simplest analysis tasks. And when dealing with modular languages and portfolios too large to be held entirely in memory, parsing the source and running queries on the resulting data structures just won’t scale, even for simplistic analysis tasks, if they are global in nature.

Relational databases are your best friend here.

I know, relational technology fails to adequately represent the typically dynamic and polymorphic data structures we use to represent source code and derived artefacts. I would not dare advocate their use as the sole persistence mechanism, and reduce any source code processing task (compilation or otherwise) to a walkthrough in a relational database. That would be insane.

The idea is not to persist everything, but merely to pre-munch data for future use, possibly by less specialized users. Modules, functions, sizing information, call graphs, etc. can all be represented conveniently and (almost) intuitively using relational technology.
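As a minimal sketch of what that could look like, using SQLite (more on that choice below) from Python, three tables already cover modules, functions with their sizing information, and the call graph. The table and column names are purely illustrative, not the schema of any actual product:

import sqlite3

# Purely illustrative schema: modules, functions with sizing
# information, and the call graph stored as a plain edge table.
SCHEMA = """
CREATE TABLE module (
    module_id   INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    source_path TEXT NOT NULL,
    line_count  INTEGER NOT NULL
);

CREATE TABLE function (
    function_id INTEGER PRIMARY KEY,
    module_id   INTEGER NOT NULL REFERENCES module(module_id),
    name        TEXT NOT NULL,
    line_count  INTEGER NOT NULL,
    complexity  INTEGER  -- e.g. cyclomatic complexity
);

CREATE TABLE call_edge (
    caller_id   INTEGER NOT NULL REFERENCES function(function_id),
    callee_id   INTEGER NOT NULL REFERENCES function(function_id),
    PRIMARY KEY (caller_id, callee_id)
);
"""

conn = sqlite3.connect("repository.db")  # a single ordinary file, no server
conn.executescript(SCHEMA)
conn.commit()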

Taking advantage of relational repositories when processing source code is a form of divide-and-conquer strategy: we separate the task of going through elaborate data structures, deriving and persisting synthetic information, from the task of using this synthetic information through ad hoc queries. The two sides can be tackled by different people, using different skill sets and different tools.
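On the extraction side, once the compiler has built its own data structures, persisting the synthetic facts amounts to very little. The sketch below runs against the illustrative schema above; persist_module is a hypothetical helper, not an API from any actual product:

import sqlite3

def persist_module(conn, name, source_path, line_count, functions):
    """Persist the synthetic facts derived for one module.
    'functions' is assumed to be a list of (name, line_count, complexity)
    tuples produced by whatever walks the compiler's data structures."""
    cur = conn.execute(
        "INSERT INTO module (name, source_path, line_count) VALUES (?, ?, ?)",
        (name, source_path, line_count))
    module_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO function (module_id, name, line_count, complexity) "
        "VALUES (?, ?, ?, ?)",
        [(module_id, fn, lines, cx) for fn, lines, cx in functions])
    conn.commit()

The query side never needs to know how these rows were produced, and the extraction side never needs to anticipate which queries will be run.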

In source code transformation projects (conversion from one language to another, database remediation, etc.), maintaining such a repository database offers invaluable benefits. It allows one to build dashboards to measure progress, query sizing information, measure complexity, navigate call graphs, etc. It is the Swiss Army knife that allows one to survive the multiple hiccups of any non-trivial transformation project, by delivering reliable facts and driving rational decision processes.
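As an example of the kind of ad hoc query such dashboards can be built on (still against the illustrative schema sketched above), listing modules by size and aggregate complexity takes a few lines of SQL:

import sqlite3

conn = sqlite3.connect("repository.db")

# Size and aggregate complexity per module, largest first.
query = """
SELECT m.name,
       COUNT(f.function_id)            AS functions,
       COALESCE(SUM(f.line_count), 0)  AS total_lines,
       COALESCE(SUM(f.complexity), 0)  AS total_complexity
FROM module m
LEFT JOIN function f ON f.module_id = m.module_id
GROUP BY m.module_id
ORDER BY total_lines DESC
"""
for name, functions, total_lines, total_complexity in conn.execute(query):
    print(name, functions, total_lines, total_complexity)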

Oddly enough, very few compilers available on the market today provide such a repository capability. As far as I know, it was a first when we extended the Raincode compilers for COBOL and PL/I to populate a relational database with facts extracted from the source code as part of the compilation process. It was a simple addition to otherwise very sophisticated products, but simple as it was, it proved invaluable to migration projects based on these compilers. It allowed for slicing large systems into more manageable sub-systems, performing impact analysis, accessing call graphs and more.
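To give a feel for what impact analysis looks like once the facts sit in a relational repository (again using the illustrative schema above, not the actual schema of the Raincode products), a recursive query collects everything that directly or transitively calls a given function:

import sqlite3

conn = sqlite3.connect("repository.db")

# All functions that directly or transitively call the function named
# 'TARGET' (an arbitrary example name), target included. The UNION
# deduplicates rows, so the recursion terminates even on cyclic call graphs.
callers = conn.execute("""
    WITH RECURSIVE impacted(function_id) AS (
        SELECT function_id FROM function WHERE name = ?
        UNION
        SELECT c.caller_id
        FROM call_edge c
        JOIN impacted i ON c.callee_id = i.function_id
    )
    SELECT f.name
    FROM impacted i
    JOIN function f ON f.function_id = i.function_id
""", ("TARGET",)).fetchall()

print(callers)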

Most modern databases are hefty beasts that can be annoying to install, with server processes, security policies and more. Even though we sometimes go for big, server-based database engines when multiple users need to share the repository, our first choice for these databases is SQLite.

It is rock solid, blazingly fast and reasonably full-featured. Even more importantly, SQLite is easy to deploy: we embed it in our tools, it requires no server process, and it comes with a wealth of third-party tools to query and operate the databases through a convenient user interface.

A no-brainer if I ever saw one.
