Academics

 

Introduction

Raincode S.P.R.L. is a leading company in compiler design and, more generally, language processing infrastructure. We are headquartered in Brussels, Belgium, and serve customers in all major regions of the globe. We have a sales and support office serving the North American market and a development facility in Bangalore, India.

 

Our organization is divided into two commercial activities:

  • Raincode (raincodelabs.com) is the product arm, providing a full set of compilers, together with emulators and additional tools, to migrate legacy workloads from the IBM mainframe environment to Microsoft’s .NET and Azure platforms. These products cover a wide range of technologies, including COBOL, PL/I, ASM370, JCL, CICS and more.
  • Raincode Labs (raincodelabs.com) is the engineering arm, providing compiler-related consulting and development services. By design, Raincode Labs puts no restrictions on the environments and technologies our clients want to use. Raincode Labs’ scope is not limited to compilation per se. It includes IDEs, migration, translation, interpretation, analysis and whatever else is necessary to get the job done. Our customers range from large organisations with bespoke compiler needs to vendors for whom Raincode provides the technical backbone for their compiler-related products.

Internship

Brussels

Our main engineering centre is located in downtown Brussels, Belgium, close to the historical city centre and within walking distance of Brussels' main train station (Gare du Midi/Zuidstation), with convenient connections to cities such as Paris, Amsterdam, Frankfurt and London.

Intellectually Rich Environment

There is more to internships than just the actual work. Significant value comes from being immersed in an exciting and challenging environment. To ensure maximum synergy between our team and each intern, full-time presence in Brussels for the duration of the internship is strongly advised, but more flexible schemes can be discussed on a case-by-case basis.

Mentoring

Interns work under the direct supervision of a mentor, who is responsible for both the scientific and the logistical aspects of the internship. Interns are integrated into the team as extensively as possible, mingling with the entire staff (breakfast, lunch, various social occasions). For the duration of their internship, they are an integral part of our staff.

Consequently, both the freedom and obligations that come with regular staff positions apply to interns:

  • Flexible working hours, including remote work, when and as agreed upon with their mentor
  • Casual dress code
  • Professionalism, courtesy, trust and solidarity

Implementation and theory

 Internships at Raincode are never about theory alone, nor are they only about the finished product. They always combine a strong theoretical component with an equally strong focus on a working implementation.

Part of the value of a Raincode internship lies in our focus on a production-level implementation, high quality standards and systematic test infrastructures. This stands in stark contrast with purely academic implementation efforts, which often aim only at demonstrating some level of feasibility, without caring for the quality or the practicality of the resulting implementation.

Production and research: a delicate balance

Most Raincode internship topics have a strong relationship with industry and focus on solving a real-world problem. However, our attention to industrial applicability does not mean that interns are treated as cheap labour for paid customer projects.

More specifically, our guarantee to our customers is that everyone working on a paid project is a fully qualified professional, while our guarantee to interns is that they will not be exploited for commercial purposes. The topics always show some level of industrial interest and applicability, but we maintain a Chinese wall between academic investigations and industrial projects. 

Selected publications

  • Parsing in a Hostile World, Darius Blasband, WCRE 2001: 291-300.
  • Hard Facts vs. Soft Facts, Darius Blasband, WCRE 2008: 301-304.
  • Compilation of Legacy Languages in the 21st Century, Darius Blasband, GTTSE 2011: 1-54.
  • Software Language Identification with Natural Language Classifiers, Juriaan Kennedy van Dam, Vadim Zaytsev, SANER 2016: 624-628.
  • Raincode assembler compiler (tool demo), Volodymyr Blagodarov, Yves Jaradin, Vadim Zaytsev, SLE 2016: 221-225.
  • The Rise and Fall of Software Recipes, Darius Blasband, Reality Bites Publishing 2016, ISBN 978-9-490-78342-6, pp. 1-368.
  • Language Design with Intent, Vadim Zaytsev, MoDELS 2017: 45-52.
  • Towards a Taxonomy of Grammar Smells, Mats Stijlaart, Vadim Zaytsev, SLE 2017: 43-54.
  • Parser Generation by Example for Legacy Pattern Languages, Vadim Zaytsev, GPCE 2017: 212-218.
  • Live Robot Programming: The language, its implementation, and robot API independence. Miguel Campusano, Johan Fabry, Elsevier SCP 2017, v.133: 1-19.
  • Open Challenges in Incremental Coverage of Legacy Software Languages, Vadim Zaytsev, PX/17.2: 1-6.

Contact

 Raincode S.P.R.L.
45 rue de la Caserne
1000 Brussels
Belgium

+32 (0)2 522 06 63

academia@raincodelabs.com

Master thesis topics

 

This section lists topics for theses to be covered by internships. They are merely starting points: more often than not, the actual work performed ends up diverging significantly from the original plan.

 

 Stress-testing compilers by automated innocuous changes

The goal of this thesis is to design and implement behaviour-preserving transformations, to be applied to test programs, resulting in modified programs that should behave exactly like the originals. Examples of such transformations include (but are not limited to):

  • adding expressions that are guaranteed to be zero to integer expressions,
  • similarly, multiplying them by 1 (even though the compiler cannot statically take advantage of this fact),
  • adding extraneous parentheses or other bracing constructs,
  • adding control flow branches that are never executed, given the expected input values of the programs,
  • replacing structured constructs by more elementary ones using labels and GO TO statements, etc.

This work would be developed using Raincode’s proprietary meta-programming technology and would be applied to the extensive regression testing infrastructure for our COBOL and PL/I compilers. It will result in extra correctness checks for the compilers, and possibly in the discovery of compiler bugs, as has been done previously for GCC and LLVM by Zhendong Su et al. (see the ICSME 2017 keynote, https://icsme2017.github.io/program/keynotes.html). Furthermore, it will show that this approach is feasible for languages that are syntactically more complex than those covered by current work.
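
To make the idea concrete, here is a minimal sketch of such an innocuous transformation, written in Python and using the standard ast module purely as a stand-in for Raincode’s proprietary meta-programming technology (the real work would target COBOL/PL/I programs, not Python): every integer literal n is rewritten into (n + 0) * 1, and the original and mutated programs are executed and compared.

    # A sketch of an "innocuous change": every integer literal n becomes
    # (n + 0) * 1. Python's ast module stands in for the real meta-programming
    # infrastructure; the transformation idea carries over to COBOL/PL/I.
    import ast

    class AddZeroTimesOne(ast.NodeTransformer):
        def visit_Constant(self, node):
            # Leave booleans and non-integers alone; wrap integers only.
            if isinstance(node.value, int) and not isinstance(node.value, bool):
                return ast.BinOp(
                    left=ast.BinOp(left=node, op=ast.Add(), right=ast.Constant(0)),
                    op=ast.Mult(),
                    right=ast.Constant(1))
            return node

    def behaves_identically(source: str) -> bool:
        # Run both versions and compare the value bound to 'result'.
        original = ast.parse(source)
        mutated = ast.fix_missing_locations(AddZeroTimesOne().visit(ast.parse(source)))
        env_orig, env_mut = {}, {}
        exec(compile(original, "<orig>", "exec"), env_orig)
        exec(compile(mutated, "<mut>", "exec"), env_mut)
        return env_orig["result"] == env_mut["result"]

    print(behaves_identically("result = sum(i * 3 for i in range(10))"))  # True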

 

Code Slicing by Program Transformation

Program slicing is a well-researched technique to check for dependencies: given a variable, its forward slice contains all executable statements that access it or anything dependent on it. The original idea was for slices to be executable, but that is rarely a hard requirement in practice: it is often enough for them to be compilable.

Usually slicing is done by creating an abstract syntax tree, analysing it to produce all kinds of useful but computationally expensive artefacts such as dependence graphs, and then performing slice actions as trivial reachability operations on those graphs. However, slicing can also be done much more cheaply and quickly by transforming programs, borrowing ideas from partial evaluation and supercompilation: simply put, by assuming that all variables besides the interesting ones are constants.

The goal of this project is to implement such a slicer in an industrial metaprogramming language, test it on actual codebases, and research its advantages and limitations.
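
As an illustration of the propagation idea (not of the actual Raincode implementation), the following Python sketch computes a forward slice over straight-line code modelled as (target, read-variables) pairs, keeping only the statements that touch the variable of interest or anything derived from it; everything else is treated as constant and dropped.

    # Forward slicing over straight-line code, modelled as (target, reads) pairs.
    # Everything not derived from the variable of interest is treated as a
    # constant and dropped; real slicers would work on full ASTs.
    from typing import NamedTuple

    class Assign(NamedTuple):
        target: str
        reads: frozenset

    def forward_slice(program, variable):
        tainted = {variable}          # the variable of interest and its derivatives
        kept = []
        for stmt in program:
            if stmt.reads & tainted:  # statement depends on "interesting" data
                kept.append(stmt)
                tainted.add(stmt.target)
            # otherwise the statement only touches "constants" and is dropped
        return kept

    program = [
        Assign("a", frozenset({"x"})),
        Assign("b", frozenset({"y"})),
        Assign("c", frozenset({"a", "b"})),
        Assign("d", frozenset({"b"})),
    ]
    print(forward_slice(program, "x"))    # keeps the assignments to a and c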

 

Automatic generation of syntactically correct programs using samples and a working parser as oracle

This topic is about the automatic generation of code in legacy languages, which can then be used for testing compilers and code transformation tools. Such work has already been done academically, most notably via tools like CSmith [1], and has been used, e.g., for finding bugs in C compilers [2] or missed optimisations in C compilers [3]. As far as we know, however, there is no such tool for legacy languages, and no work has been done on verifying the correctness of source code translation for code in such legacy languages.

The work consists of building a tool that takes a number of valid inputs and, in a largely language-independent way, automatically builds other syntactically valid (but probably semantically meaningless) programs, using a working parser to validate the results. This tool will then be used for testing our compilers and code transformation infrastructure.

A variety of techniques can be used and combined (see the sketch after this list):

  • Random construction by picking lexemes
  • Genetic mash to combine valid programs
  • Altering existing valid programs by random insertion or deletion
  • Using the grammar to build valid input (or parts thereof) constructively
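
The sketch below illustrates the third technique under deliberately simplified assumptions: programs are flat token strings, and parses_ok is a hypothetical placeholder for the working parser used as oracle (here it merely checks parenthesis balance).

    # Mutation-based generation of syntactically valid programs, with a parser
    # as oracle. parses_ok is a hypothetical placeholder: here it only checks
    # parenthesis balance, whereas the real oracle would be a working parser.
    import random

    def parses_ok(text: str) -> bool:
        depth = 0
        for ch in text:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    return False
        return depth == 0

    def mutate(program: str, rng: random.Random) -> str:
        # Randomly delete, duplicate or swap one token of an existing program.
        tokens = program.split()
        if len(tokens) < 2:
            return program
        i = rng.randrange(len(tokens))
        op = rng.choice(["delete", "duplicate", "swap"])
        if op == "delete":
            del tokens[i]
        elif op == "duplicate":
            tokens.insert(i, tokens[i])
        else:
            j = rng.randrange(len(tokens))
            tokens[i], tokens[j] = tokens[j], tokens[i]
        return " ".join(tokens)

    def generate(seeds, n, rng=None, max_attempts=10000):
        # Keep only mutants that the oracle accepts as syntactically valid.
        rng = rng or random.Random(0)
        valid, attempts = [], 0
        while len(valid) < n and attempts < max_attempts:
            attempts += 1
            candidate = mutate(rng.choice(seeds), rng)
            if parses_ok(candidate) and candidate not in seeds:
                valid.append(candidate)
        return valid

    print(generate(["MOVE ( A + B ) TO C", "ADD A TO B GIVING C"], 3))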

References

[1] CSmith: https://embed.cs.utah.edu/csmith/

[2] X. Yang, Y. Chen, E. Eide, J. Regehr. Finding and Understanding Bugs in C Compilers. PLDI 2011, Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation.

[3] G. Barany. Finding Missed Compiler Optimizations by Differential Testing. CC 2018, 27th International Conference on Compiler Construction.

 

 Advanced Code Diffing

There are many techniques for comparing code, both within a system (for clone detection) and across codebases (for version diffing). Most of them, however, are aimed at situations where the two systems being compared are relatively similar, such as snapshots of two adjacent commits in the same repository. There is an occasional industrial need to compare systems that are much farther apart, sometimes separated by years of development or by the consequences of copy-paste programming. There is no ready-made solution for this, but one could be designed after careful consideration of related work, and implemented in a sufficiently powerful metaprogramming framework (e.g., Rascal for Java, Roslyn for C#).

The project involves nominal matching to track classes moved across packages, structural matching to find renamed entities, lexical matching to determine how many lines of the old system are present in the new one, metrics calculation to support quantitative statements about both systems, and so on. It is highly nontrivial, industrially relevant and academically publishable.
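
As a taste of the lexical-matching layer only, the following Python sketch estimates how much of one codebase is textually present in another using token k-gram shingles and Jaccard similarity; the nominal and structural layers described above would need considerably more machinery.

    # Lexical matching between two codebases: token k-gram "shingles" plus
    # Jaccard similarity and a one-sided coverage measure. The nominal and
    # structural layers of the comparison are out of scope for this sketch.
    import re

    def shingles(source: str, k: int = 5) -> set:
        tokens = re.findall(r"\w+|[^\w\s]", source)
        return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b) if a | b else 1.0

    def coverage(old: str, new: str) -> float:
        # Fraction of the old system's shingles still present in the new one.
        old_s, new_s = shingles(old), shingles(new)
        return len(old_s & new_s) / len(old_s) if old_s else 1.0

    old_code = "int add(int a, int b) { return a + b; }"
    new_code = "int add(int a, int b) { return a + b; } int sub(int a, int b) { return a - b; }"
    print(jaccard(shingles(old_code), shingles(new_code)))   # similarity of the two
    print(coverage(old_code, new_code))                      # 1.0: the old code survives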

 

 Grammar Smells, Continued

The concept of code smells has been around for quite a while, and is now being expanded to cover smells in design, architecture, requirements, documentation, spreadsheets and more. Last year, we extended it to cover grammars of software languages. In that project, all kinds of already published smells in software artefacts were analysed and ported into the domain of grammars and grammarware. The result was a loosely classified taxonomy of 56 smells, each with a definition, a short description, sometimes a few examples and occasionally also a detector written in Rascal. The hard part is now done; the fun part remains. There are two ways to extend the project:

  • go broad and make improvements across the entire taxonomy: refine the classification, make sure each smell has a detector, run all the detectors across the entire Grammar Zoo, etc.; this extension is both exploratory and concrete, and is guaranteed to produce insights and observations that are both defensible as an MSc project and publishable as an empirical survey;
  • go deep and choose one smell or a small number of related smells, investigate them in as much detail as possible, devise ways of automatically detecting but also removing them, and measure the impact of such adjustments; this extension starts with its feet firmly on the ground but will require some imagination later, when interpreting the results.
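
For a flavour of what a detector looks like, here is a toy Python version of one classic smell, unreachable nonterminals, over a dictionary-based grammar representation; the taxonomy's actual detectors are written in Rascal over full grammar models.

    # A toy detector for one classic grammar smell: nonterminals unreachable
    # from the start symbol. A grammar is a dict from nonterminal to a list of
    # productions, each production being a list of symbols.
    def unreachable_nonterminals(grammar: dict, start: str) -> set:
        reachable, todo = set(), [start]
        while todo:
            nt = todo.pop()
            if nt in reachable:
                continue
            reachable.add(nt)
            for production in grammar.get(nt, []):
                for symbol in production:
                    if symbol in grammar and symbol not in reachable:
                        todo.append(symbol)
        return set(grammar) - reachable

    toy = {
        "Program": [["Stmt"], ["Stmt", "Program"]],
        "Stmt":    [["id", "=", "Expr", ";"]],
        "Expr":    [["id"], ["Expr", "+", "Expr"]],
        "Legacy":  [["goto", "id"]],   # never referenced anywhere: a smell
    }
    print(unreachable_nonterminals(toy, "Program"))   # {'Legacy'}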

Retargeting the Assembler Compiler

One of Raincode’s compilers has an unprecedentedly modular structure and relies on principles of model-driven engineering: the instruction set of the language is modelled at a level of abstraction high enough to generate several interrelated artefacts needed to parse the language and support its runtime execution.

One of the artefacts generated that way is essentially microcode: the semantics of each instruction is expressed as a sequence of almost-atomic steps, which are then used in another round of code generation to produce code in C# (for the generated emulator) and in CIL (for performance-targeted inlining).

The goal of the project is to retarget this system to generate code in C, so that an emulator can be generated and compiled with GCC on Linux. Some degree of success is already guaranteed by the straightforwardness of the first steps (we know it is possible to generate C; that is why this microcode language was developed in the first place), but the project contains enough hidden challenges of both a technical and a scientific nature.
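
To illustrate the shape of the task (with a deliberately toy instruction model, not the actual Raincode microcode), the following Python sketch emits one C function per modelled instruction, ready to be compiled with GCC:

    # A toy model-driven generator: each instruction's semantics is a list of
    # micro-steps over an abstract register file, and one C function is emitted
    # per instruction. The real microcode model is far richer (condition codes,
    # addressing modes, traps, ...), but the generation principle is the same.
    INSTRUCTIONS = {
        "AR": ["r[dst] = r[dst] + r[src];"],   # add register
        "SR": ["r[dst] = r[dst] - r[src];"],   # subtract register
        "LR": ["r[dst] = r[src];"],            # load register
    }

    def emit_c(instructions: dict) -> str:
        lines = ["#include <stdint.h>", ""]
        for name, steps in instructions.items():
            lines.append(f"static void op_{name}(int32_t r[16], int dst, int src) {{")
            lines.extend(f"    {step}" for step in steps)
            lines.append("}")
            lines.append("")
        return "\n".join(lines)

    # The output can be written to a .c file and compiled with GCC on Linux.
    print(emit_c(INSTRUCTIONS))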

Suggested reading:

  1. V. Blagodarov, Y. Jaradin, V. Zaytsev. Tool Demo: Raincode Assembler Compiler. SLE 2016, pp. 221-225. DOI: 10.1145/2997364.2997387

Static comparison of relational database optimization schemes applied to large portfolios 

The Raincode compilers include the capability to convert SQL statements found in COBOL and PL/I programs from DB2’s SQL dialect to the (vastly different) SQL Server dialect. While this transformation is designed to guarantee functional equivalence, the performance of the transformed SQL statements may be significantly different. This is mainly due to the differences in optimization strategies implemented by the various database engines, and to the entropic nature of the transformations, which may prevent the database optimizer from using indexes adequately.

An example of typical performance degradation is when the target database fails to use an index while the original database did.

The purpose of this thesis is to design a system that will statically predict which converted SQL statements will perform significantly slower than the original ones. To achieve such predictions, it will (see the sketch after this list):

  • take existing SQL statements (possibly in the tens of thousands), each with its translated counterpart,
  • query both databases for their respective query plan,
  • and report those statements that may suffer a significant performance penalty due to the translation.
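
A minimal Python sketch of the reporting step is given below; it assumes that both engines' query plans have already been exported into a simplified record (an estimated cost plus an index-usage flag), and all field names are illustrative rather than the actual DB2 EXPLAIN or SQL Server SHOWPLAN schema.

    # Reporting step only: flag translated statements whose plan looks much
    # worse than the original. PlanSummary and its fields are illustrative,
    # not the actual DB2 EXPLAIN or SQL Server SHOWPLAN schema.
    from dataclasses import dataclass

    @dataclass
    class PlanSummary:
        statement_id: str
        estimated_cost: float
        uses_index: bool

    def suspicious_statements(db2_plans, mssql_plans, cost_ratio=5.0):
        original = {p.statement_id: p for p in db2_plans}
        report = []
        for translated in mssql_plans:
            source = original.get(translated.statement_id)
            if source is None:
                continue
            lost_index = source.uses_index and not translated.uses_index
            blow_up = translated.estimated_cost > cost_ratio * source.estimated_cost
            if lost_index or blow_up:
                report.append((translated.statement_id, lost_index, blow_up))
        return report

    db2 = [PlanSummary("Q1", 10.0, True), PlanSummary("Q2", 3.0, False)]
    mssql = [PlanSummary("Q1", 250.0, False), PlanSummary("Q2", 4.0, False)]
    print(suspicious_statements(db2, mssql))   # [('Q1', True, True)]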

The final ambition of this thesis is to pave the way for a tool that would run on large-scale portfolios in migration projects, so that possible performance hotspots can be detected and remedied without having to go through lengthy testing, profiling and debugging sessions.

Grammars in IDEs

Most grammars are domain-specific, in the sense that they are created for a narrow purpose: to parse programs that are to be executed, to analyse specific parts of programs, to document the structure of a language, to guide a pretty-printer, etc. One commonly overlooked purpose is IDE support, which is a relatively obscure topic in research but a very important part of plugin development, and commonly a desired component of Raincode compilers. In the DSL movement, IDE-aided debugging of a program written in a DSL is considered a luxury. In practical language and compiler development, IDE integration is an important part of DSL deployment and is often crucial to gain clients’ acceptance.

Typical IDE-supported features include: syntax highlighting of otherwise monotone text, word selection for scope visualisation, code folding for hierarchical program blocks, visual editing of naturally graphic elements, debugging executable programs, discovering and running tests, performing dependency analysis, suggesting refactorings, displaying violations of coding conventions, providing code navigation to allow programmers to quickly jump between definitions and uses or to follow a call trace, configuring a build, displaying tooltips with documentation, and many others.

Some of them can be implemented based on a parse tree or an AST, but many fall into one or more of the following problematic categories:

  • need to work on partially incorrect programs (e.g., code completion);
  • must work significantly faster than a complete parser (e.g., syntax highlighting);
  • have no sufficiently advanced parser available or require noticeably more information than the parser provides (e.g., detecting missing dependencies).

Most research on this topic is limited to getting basic IDE support such as syntax highlighting, either by tweaking a grammar with ad hoc manually written code (e.g., to assign colours and implement name suggestion strategies) or by enhancing the grammar with annotations that carry enough information for the underlying universal algorithms to work (e.g., to recover from errors).

To reach a sufficiently detailed parse result faster, it is common in industry (not just at Raincode, but also in SublimeText, TextMate, Cloud9, MakePad, CodeMirror, …) to use ad hoc combinations of simplistic parsing algorithms (e.g., recursive descent or parsing expression grammars) and regular expressions to perform some form of approximate/island/robust parsing, collectively known as semiparsing. For some families of languages, even their representation as a “grammar” is still in its infancy, which is the case for at least spreadsheet-based and pattern languages.
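
The following Python sketch shows the semiparsing style of syntax highlighting in its simplest form: a handful of regular expressions applied per line, tolerant of incomplete or incorrect programs, with an illustrative (not real) COBOL-ish token set.

    # Regex-based highlighting in the semiparsing style: it tolerates incomplete
    # or incorrect programs and never needs a full parse. The token classes and
    # keyword set below are illustrative, not a real COBOL lexer.
    import re

    TOKEN_CLASSES = [
        ("comment", re.compile(r"\*>.*")),
        ("string",  re.compile(r"'[^']*'?")),   # tolerate an unterminated string
        ("keyword", re.compile(r"\b(MOVE|TO|PERFORM|IF|END-IF|DISPLAY)\b")),
        ("number",  re.compile(r"\b\d+\b")),
    ]

    def highlight(line: str):
        # Earlier classes win on overlap; unmatched text simply stays unstyled.
        spans = []
        for cls, pattern in TOKEN_CLASSES:
            for m in pattern.finditer(line):
                if not any(s <= m.start() < e for _, s, e in spans):
                    spans.append((cls, m.start(), m.end()))
        return sorted(spans, key=lambda span: span[1])

    print(highlight("IF WS-TOTAL > 100 DISPLAY 'over budget"))   # note the unclosed string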

Intelligent Modernisation Assistance for Legacy Software

The overall objective of a recently started academic-industrial collaboration project, sponsored by the Innoviris foundation in Brussels, is to explore novel pattern mining algorithms and incorporate them in a prototype of an intelligent modernisation assistant for legacy software systems. The assistant is to proactively recommend to software engineers source code modernisation actions [1] by comparing their current development efforts with insights gained by treating software systems as data, in particular their source code and development history. The assistant will draw its intelligence from continuously mining for previously-unknown patterns both in the current state of the system’s source code and structure (e.g., so-called programming idioms [2], coding conventions [3], library usage protocols [4], …) and in past and ongoing changes made to this source code (e.g., so-called systematic edits [5,6], repetitive changes, …). The modernisation recommendations made by the assistant will appear increasingly informed as it refines or uncovers more previously-unknown patterns in the source code and version repositories it mines. The success of the modernisation assistant — its apparent learning ability — hinges on the quality of the pattern mining algorithms it incorporates.

One of the key aims of Raincode Labs has always been achieving a high degree of automation. When modernising a codebase of 100+ million lines, even 99.99% automation still leaves tens of thousands of lines to handle by hand, which can make the project fail. One of the most crucial ingredients for achieving high automation is a catalogue of modernisation patterns. To illustrate, several years spent finding 144 patterns of code improvement for code produced from PACBASE made Raincode Labs the world leader in PACBASE migration. Yet these patterns are ineffective for SEER*HPS, a similar language that requires different patterns. Discovering and understanding such patterns and their corresponding code modernisation actions thus requires a major effort. Raincode Labs has gained the expertise to build traditional software analytics tools, but those can only find precise matches for known patterns. What is really needed instead is the ability to find and act upon unknown patterns, including patterns that are adhered to only partially.

There are many possible use cases for this technology, including discovering syntactic coding idioms and conventions in legacy systems (where the uncovered syntactic patterns occur, where the code is similar but not identical to expectations, and what modernisation actions are proposed) or discovering API usage patterns and their violations (API, library and framework usage patterns, suspected violations, and on-demand recommended actions).
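
As a very rough illustration of the simplest flavour of such mining (frequent token n-grams as candidate idioms; real miners for systematic edits work on trees and change histories), consider the following Python sketch:

    # Frequent token n-grams across a set of source files, surfaced as candidate
    # idioms. Real pattern miners (e.g. for systematic edits) operate on trees
    # and change histories rather than flat token streams.
    import re
    from collections import Counter

    def token_ngrams(source: str, n: int = 4):
        tokens = re.findall(r"\w+|[^\w\s]", source)
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def mine_idioms(sources, n: int = 4, min_support: int = 2):
        # Keep n-grams that occur in at least min_support different files.
        support = Counter()
        for source in sources:
            support.update(set(token_ngrams(source, n)))
        return {gram: count for gram, count in support.items() if count >= min_support}

    files = [
        "EXEC CICS READ FILE('ACCOUNTS') INTO(WS-REC) END-EXEC",
        "EXEC CICS READ FILE('ORDERS') INTO(WS-ORD) END-EXEC",
        "MOVE ZERO TO WS-TOTAL",
    ]
    print(mine_idioms(files))   # the shared EXEC CICS READ FILE(' prefix stands out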

References

[1] A. F. Iosif-Lazar, A. S. Al-Sibahi, A. S. Dimovski, J. E. Savolainen, K. Sierszecki, A. Wasowski. Experiences from designing and validating a software modernization transformation (E). ASE 2015.

[2] M. Allamanis, C. Sutton. Mining idioms from source code. FSE 2014.

[3] B. Goncharenko, V. Zaytsev. Language Design and Implementation for the Domain of Coding Conventions. SLE 2016, ACM, pp. 90–104.

[4] G. Uddin, B. Dagenais, M. P. Robillard. Temporal Analysis of API usage concepts. ICSE 2012, pp. 804–814, IEEE Press, 2012.

[5] M. Kim, D. Notkin. Discovering and representing systematic code changes. ICSE 2009.

[6] N. Meng, M. Kim, K. S. McKinley. LASE: Locating and applying systematic edits by learning from examples. ICSE 2013.