
Using visualization tools to understand large polyglot code bases

03 September, 2020 | 41 min 17 sec
Podcast Hosts Rebecca Parsons and Ashok Subramanian | Podcast Guests Erik Dörnenburg and Korny Sietsma

Brief Summary

Code visualization tools can be a great way to understand the intricacies of large code bases but they can be problematic when dealing with very old or very new code sets. Our co-hosts Rebecca Parsons and Ashok Subramanian are joined by Erik Dörnenburg and Korny Sietsma to look at the benefits and challenges of code visualization, especially when dealing with multiple programming languages.

Podcast transcript

Rebecca Parsons:

Hello, everyone. I'd like to welcome you to another edition of the ThoughtWorks Technology Podcast. My name is Rebecca Parsons. I'm one of your recurring co-hosts and the Chief Technology Officer for ThoughtWorks. We're here today to talk about code visualization tools and how they work across different programming languages. I am joined by my colleague, Ashok, who's another one of our co-hosts. So let's start with Ashok, please.


Ashok Subramanian:

Hello. Hello everyone. Thanks Rebecca. I'm Ashok Subramanian. I am the Head of Technology for ThoughtWorks in the UK, and one of the hosts of the ThoughtWorks Technology Podcast. I think we've got two experts on the topic with us today, Korny and Erik. Korny, do you want to introduce yourself first, and then Erik can go next?


Korny Sietsma:

Hi, I'm Korny Sietsma. I'm a tech principal at ThoughtWorks in the UK, originally from the Australian office, but I came over here a few years ago. I'm quite keen on tinkering with visualization; I'm very much a visual thinker. So visualizing code has always been appealing to me.


Erik Dörnenburg:

Hi, my name is Erik Dörnenburg. I'm the Head of Technology at ThoughtWorks Germany. I had a very strong interest in visualization when we were dealing with really large code bases. Over the last few years I haven't spent that much time on it, but I'm still super curious about the topic. And when I heard that Korny was getting into this topic quite deeply, I was very keen to have this discussion.


Ashok Subramanian:

Excellent. To get us all started, I thought maybe you could describe the whole subject or area of visualization. And Erik, I know there are some articles you've written which, as someone internally pointed out to our listeners, are almost old enough to be teenagers now. I know you have been tinkering with this for a long time. So, could you describe your view on visualization?


Erik Dörnenburg:

Yeah, I think the article that was referred to was called Get the Thousand-Foot View, and the comparison I made at the time, when we were all flying a bit more, was to flying over a landscape. If you're at ground level, you're only seeing the trees and it's really, really hard to make out any structure. If on the other hand you look at the typical architecture diagrams, especially the ones that are done up front, that is the 30,000-foot view from a plane. You can barely make out whether something is used for agriculture or not, and you can't see any detail.


Erik Dörnenburg:

And what I said was sorely lacking, especially at the time when we were almost exclusively dealing with these large monoliths, the million-plus-line code bases, was an understanding of that. We needed that mid-level view to actually better understand the software. This was made worse, in a funny way, by test-driven development, because we practiced it so much that it was very, very easy for development teams to have the courage to make changes.


Erik Dörnenburg:

You had the courage because you knew you weren't breaking any tests; your code change was fine. I totally believe that test-driven development is the right approach, but I would say before, people were more cautious. They were afraid to break something, so they spent more time looking around, and they wrote more documentation. They understood the structure in a different way. But I didn't want to go back to those days because I had seen the inefficiencies of that model. So I said, "Let's move forward, and let's get that view that is missing."


Erik Dörnenburg:

And of course, if you're talking about a view at that level, it needs to be visual, it needs to be automatically created, and it needs to make every pixel count. You shouldn't have boxes with arrowheads where you can only fit three boxes on a page. The information density is then not high enough for a meaningful view at that 1,000-foot level. That's sufficient for a 30,000-foot architecture diagram but not for what we had in mind at the time. And I certainly wasn't the only one with that idea. I'd seen academic research, and I spent a fair bit of time with some people from academic circles looking into this.


Ashok Subramanian:

And Korny, I know you've been tinkering with this lately, as you mentioned. What got you interested in software visualization?


Korny Sietsma:

I think I've always been a bit of a visual thinker. It's ironic that I'm in a career that's all about writing lines of code, but if I want to picture something I grab a whiteboard and scribble on it; I draw lines and arrows and boxes. But like Erik says, with the architecture view... I like a simple architecture diagram, but often you can't see the wood for the trees. They're showing somebody's mental model of something, not reality, not what's really there. So, I've always been keen on getting some graphs and some other tools going.


Korny Sietsma:

But also, I think the legacy code thing is very big. Legacy is probably an unfair word; it's the large systems of code I've had to deal with. I've gone into clients and they've said, "Oh, we've got this system in the corner. This team of people have been working on it for 20 years. We don't really know what's there." Getting a quick grasp of what that code is about is essential, and it's almost impossible to do. You can load up an editor, but it doesn't really give you a picture of what's really there.


Korny Sietsma:

Even just how big things are and how old they are, that sort of data... you want a quick 30,000-foot view. So, when I started finding tools that could do this, I got very keen on playing with them, and eventually I started writing my own tools because I got quite frustrated as well.


Ashok Subramanian:

You mentioned frustration. What got you into actually trying to write your own tool?


Korny Sietsma:

A couple of problems. One is that a lot of these tools degrade over time. A lot of them were great when the author built them: somebody gets really keen, they build a tool, they make it work and they do everything. But a lot of them seem to just suffer bit rot. You go and pull one up and it's five years old, nobody's changed it, interest has gone somewhere else, and you can't get it to run. A lot of my frustration was that I went to look at some of the tools that were on Erik's blog and I couldn't get them to run.


Korny Sietsma:

Or I could get them to run, but I had to load one person's tool and then another person's tool and then a third tool. And then it didn't work for the version of C# that I was operating on. The second, connected problem is the whole point about code bases that aren't your own: they're not always clean and neat and in a nice language. A lot of these tools start from the idea that you have a parser that can understand the code base, something that is built around a grammar, building a syntax tree and reading that.


Korny Sietsma:

And that's fine if you're using a particular known version of a popular language. It fails at both extremes of old and new code. We've got a potential piece of work with a client that has some Adabas Natural code from the 1970s; none of these tools are going to even know what that is. On the other end of the scale, we've got languages like Rust and Clojure and other modern languages that people haven't gotten around to writing cyclomatic complexity tools for, or if they have, they haven't integrated them into these things. So, there are a lot of places where the traditional tools wouldn't work. And I started reading a few things about what you can gather without really needing to understand the code at that level, and so I started tinkering with those ideas.


Erik Dörnenburg:

That is something I also struggled with: the variety of languages and problems you have to solve. I think that made it really, really hard for any company to provide comprehensive and flexible tooling in that space, because the market is not very big but the problem space is, which means you're not selling that many copies of your software. And of course, selling developer tools has been a difficult proposition for the last 10, 15 years. We can count ourselves lucky if people are willing to pay for their IDE, which they use for six or eight hours every day. Getting people to pay for these additional tools was difficult.


Erik Dörnenburg:

There were some brave tool vendors; some are still around, some of them didn't make it. But it was really difficult for them to earn enough money selling their tools to reinvest enough to keep these tools up to date. And sometimes, as Korny said, the technology was moving quite fast. Around the time I stopped being so active in that space was when code shifted more into the web browser, and we immediately saw all these new JavaScript frameworks. Every six months there was a... okay, that's a slight exaggeration, but every year or two there was a completely new JavaScript framework. Never mind a major new version: completely different frameworks. The tooling couldn't keep up.


Erik Dörnenburg:

So, a bit like Korny, I also settled on using some basic building blocks, if you will, and assembling them into different tool chains. If you talk about static analysis of code, and I think we will probably go beyond that later in the conversation, the idea is to have one class of tools that provides analysis of the code base and outputs a textual format. That is something you don't rewrite all the time. You have another class of tools that takes some textual format and generates graphics and diagrams from it. And that again is something that is reusable.


Erik Dörnenburg:

And that's not only for code analysis. You can use tools like D3, for example, which I've come to love over the last five years, to do the visualization. So you can use existing tools on both ends of the process, and the only thing you really have to write is the bit in the middle that transforms one textual format, the output of the analysis step, into the input of the visualization step.
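As a concrete illustration of that middle step (a sketch only: the CSV-like rows, file paths, and `to_hierarchy` helper are invented for this example, though the per-file report of a line-counting tool such as cloc can be massaged into this shape), a few lines of Python can fold flat per-file line counts into the nested JSON that D3's hierarchy layouts expect:

```python
import json

def to_hierarchy(rows):
    """Fold flat (language, path, lines) rows into the nested
    {"name": ..., "children": [...]} shape that D3 treemaps expect."""
    root = {"name": "root", "children": []}
    for language, path, lines in rows:
        node = root
        parts = path.split("/")
        for part in parts[:-1]:  # walk or create the directory nodes
            child = next((c for c in node["children"] if c["name"] == part), None)
            if child is None:
                child = {"name": part, "children": []}
                node["children"].append(child)
            node = child
        # leaf node: the file itself, sized by its line count
        node["children"].append(
            {"name": parts[-1], "language": language, "value": int(lines)}
        )
    return root

# Hypothetical per-file counts, as a line-counting tool might report them.
sample = [
    ("Java", "src/main/App.java", "120"),
    ("Java", "src/main/util/Io.java", "80"),
    ("SQL", "db/schema.sql", "300"),
]
print(json.dumps(to_hierarchy(sample), indent=2))
```

The resulting JSON can be fed directly to `d3.hierarchy` and a treemap or circle-packing layout, which is exactly the reusable visualization end of the pipeline.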


Erik Dörnenburg:

That has been a tool chain that, at least for me, has worked quite well. But it is incredibly hard to create a repeatable tool that you can simply hand to somebody else. It's really very custom-made, and you basically have to teach people the basics of how these building blocks work together. And that again becomes a lot of effort.


Rebecca Parsons:

And we've been focusing so far primarily on what I would call the metrics side of this: what are the things that you might collect? I will admit, I consider myself the world's least visual person. But one of the things I was struck by: we did an experiment with one of our clients with some of the code visualization, and we basically ran the metrics and the visualization across a suite. Some of it was very good code, some of it was very bad code. And we gave the results to a development manager who didn't know anything about what the code was.


Rebecca Parsons:

And the visualization was so good that that person, who knew nothing about what the graphs were telling him, could segregate the good code from the bad code completely accurately. So, do you want to talk a little bit about what it is about the visualization that you think exposes these aspects of code quality and architectural structure? And how do you think about designing that visualization? How do you tell D3 what you want it to do?


Korny Sietsma:

A lot of what I was most interested in was patterns, which is a combination of the files you're looking at and the size of them. There's a lot you can tell just from a high-level view. If you can see all of the code that is in the system, you can say, "Okay, a small file is a small bit of the screen, a large file is a large bit of the screen. A bit of C is this color, a bit of JavaScript is that color, a bit of header file is that color." And you can very quickly pick up patterns that you might not see in another view.


Korny Sietsma:

We had one code base we looked at which had a lot of cut-and-paste code. It wasn't obvious from looking at it at all. But when we looked at the big picture, we saw the same pattern, the same shape, the same combination of colors over and over and over again, hundreds of times. It was a very regular display. Your brain can immediately say, "Oh look, there's a pattern there. I wonder what that is," and then dig deeper. It also highlighted hidden things: suddenly you've got 10,000 lines of SQL you didn't know about in a directory. Hidden things jump out at you where they might not jump out in any other way.


Erik Dörnenburg:

Yeah. I would add to this that it, as ever, depends on the audience and the question you have. There's one thing about understanding a code base that you don't know, something that we as consultants are often confronted with. When we work with a new client, we don't know their code base, and it's good for us to understand it, and there's no physicality to it. Like Korny says, a file can be 50,000 lines of code, but you don't see that in a listing of directories. You really need to open the file, and if you have a large code base, you don't.


Erik Dörnenburg:

Or you have many repositories today and you don't know how the code is distributed. If you have a good mapping from the source code, from the physical artifacts, to a visualization, it allows you to see patterns very quickly, and outliers. One is patterns, but the other one is outliers. Depending on how you create the visualization, you can use the space an item takes up and you can use the color. This is the idea of polymetric views: you can show multiple metrics in one view. And as humans we're often quite good at seeing the correlation.


Erik Dörnenburg:

So, we learn from looking at the diagram: when it is blue, it should be a small box; when it's red, it should be a large box. But then suddenly we see a large blue box and we immediately think that is probably worth looking at. So, we spot the outliers much more quickly than we could ever do by sifting through directories with hundreds of files in them.


Korny Sietsma:

I just wanted to comment on something Rebecca said, which is that you also have to be very careful. A lot of this needs thought and a bit of investigation to see what's really going on. Talking about showing it to a manager, one of the real concerns is that it's very easy to use a visualization to tell half-truths. It's very easy to say, "Hey, look, red is good, blue is bad. The old code is red, the new code is blue," or vice versa. And if you don't understand what's going on, it's quite easy to jump to conclusions. So, I guess we have to be a bit cautious: when you're summarizing, you might be summarizing in ways that are totally fictional, especially when comparing code bases.


Erik Dörnenburg:

We need to understand our biases. This is something that many years ago a couple of colleagues and I did, which is now known as the toxicity metric. It was born exactly out of that necessity to come up with an idea of saying when code is... and we chose the word toxic. When is code such that you really wouldn't want to touch it? And we came up with some really loose definitions where we hoped most people would agree that that is not a good thing, like a 30-line-long method or something like that. In Java, mind you; different programming languages obviously differ.
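The real toxicity metric used a specific set of checks and weights; purely as an illustration of the shape of the idea (the thresholds, metric names, and file data here are all made up), a scorer can add up how far each measured value exceeds an agreed limit:

```python
# Hypothetical thresholds, loosely in the spirit of the discussion;
# the actual toxicity metric defined its own checks and weights.
THRESHOLDS = {
    "method_length": 30,   # longest method, in lines (Java-ish)
    "file_length": 500,    # total lines in the file
    "parameter_count": 6,  # longest parameter list
}

def toxicity_score(metrics):
    """Score one file: each metric contributes proportionally to how far
    it exceeds its threshold, and nothing if it stays under it."""
    score = 0.0
    for name, value in metrics.items():
        limit = THRESHOLDS[name]
        if value > limit:
            score += value / limit
    return score

# Invented example files to show the scoring.
files = {
    "OrderService.java": {"method_length": 75, "file_length": 1200, "parameter_count": 3},
    "Money.java": {"method_length": 12, "file_length": 90, "parameter_count": 2},
}
for path, m in files.items():
    print(path, round(toxicity_score(m), 2))
```

One score per file is exactly what feeds a bar chart, which is the audience-friendly visualization Erik goes on to describe.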


Erik Dörnenburg:

What we learned in that case, though, was that to communicate with management from a non-development background (again, the audience is important), we needed to choose a visualization that they were familiar with. We chose bar charts for that specific purpose because we knew they would immediately understand what they were. If we had gone with tree maps or some of the other more complicated visualizations, we first would have had to explain them, and then we might as well have shown them some source code. So the idea in that case was to think about the audience and choose a visualization that was familiar.


Erik Dörnenburg:

That way the visualization was understood, and they could focus on the content. If you are a practitioner, a developer who is interested in the architecture of a system, in understanding the system, then more complex visualizations can make sense because now you have a different goal.


Ashok Subramanian:

And I think the discussion so far has been around going into unfamiliar territory, like a code base you've never seen before, and this being a great way to get, as Erik described, the 1,000-foot view: to at least understand at a high level how you might go about navigating it. I think another interesting aspect I've seen applied is the changing nature or the evolution of a code base. Are there any things you have observed around the evolution of a code base and the interactions of people, and what more can you derive from that with visualizations?


Korny Sietsma:

A long time ago, before ThoughtWorks, at least 10 years ago, I remember sitting around a desk at the office when somebody pulled out this tool they'd found called Gource. It's still around. It was visualizing your Git history; actually, I think it was Subversion history at that stage. It did this amazing visual thing where colors and patterns flew all around, and we had about 10 people clustering around the machine saying, "Look, that's when that bug happened, and we all swarmed on it."


Korny Sietsma:

It was really a great example of people seeing the behaviors and the change. It was a memory trigger: these were people who worked with this code base, but they were saying, "Oh, I didn't know we'd done that thing," as it tracked through the history of the project as time went by. That was a good example, I think, of taking code we knew but had forgotten, and it helped remind people of how the systems all interacted.


Erik Dörnenburg:

Yeah, I think looking at the history of code can still aid the understanding of the architecture. One thing that I have seen is that if you look at co-commits, files that often get committed together, you can visualize that, and then you can sometimes see more truthfully what the real dependencies are than in an architecture diagram. If you see that these three files are often committed together, you know there's something going on between the modules, functions, or classes that are in those files.
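A sketch of how such co-commit pairs can be counted, assuming input in the shape produced by `git log --name-only --pretty=format:%H` (a 40-character hash line followed by the files that commit touched); the file names below are invented:

```python
from collections import Counter
from itertools import combinations

def co_commit_counts(log_text):
    """Count, for every pair of files, how many commits touched both."""
    pairs = Counter()
    files = []
    for line in log_text.splitlines():
        line = line.strip()
        # a 40-char hex line marks the start of a new commit
        if len(line) == 40 and all(c in "0123456789abcdef" for c in line):
            for pair in combinations(sorted(set(files)), 2):
                pairs[pair] += 1
            files = []
        elif line:
            files.append(line)
    for pair in combinations(sorted(set(files)), 2):  # flush the last commit
        pairs[pair] += 1
    return pairs

log = "a" * 40 + "\nbilling.py\ninvoice.py\n" \
    + "b" * 40 + "\nbilling.py\ninvoice.py\nREADME.md\n"
counts = co_commit_counts(log)
print(counts.most_common(1))  # the pair most often committed together
```

The resulting pair counts are exactly the edge weights you would feed into a dependency-style graph visualization.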


Erik Dörnenburg:

At the same time, if you look at code over time, you can also see not only architectural patterns but also collaboration patterns and team patterns. There's a lot of actual literature at this stage about looking at who is committing to which files. Are these files committed to by one person? And again, you can use all the techniques that Korny described, coloring them to find the outliers.


Erik Dörnenburg:

But you can also look for trends over time. If you're a product team working on a piece of software, sometimes it can be very simple; it can just be a line graph, to be honest with you, or an area chart, to see how a certain metric or a certain correlation develops.


Erik Dörnenburg:

Most of the time you have an opinion on which direction it should go: should it go up or down? And if it's going in the wrong direction, you can start asking questions.


Korny Sietsma:

Yeah. We used this quite a bit when I was at the UK Government Digital Service for tracking the progress of our whole delivery mechanism over time: how long did it take to ship software? With very simple graphs that just said, it used to take us on average this long to go from a code commit to a release into production, now it's taking us that long, what's changed? Watching those little peaks and troughs over time was really insightful. And again, fairly simple to produce.


Erik Dörnenburg:

I have one case that I even wrote about, for a specific incident, but I have seen it in other places as well. We just took the ratio of lines of production code versus lines of unit tests, and you divide them, which gives you the ratio. What we've seen sometimes, when working with teams who are not as experienced in test-driven development, is that the ratio is higher. Assuming they are motivated to write tests, they are a little bit afraid, or they're copy-pasting tests and re-testing something, so they write more tests than an experienced team who write TDD almost with surgical precision.


Erik Dörnenburg:

And I've seen it on one project: in a chart plotting the test-to-code ratio over several months, when changes were made to the composition of the team and more junior people were added, that metric would show you what happened. So, you can get some deep insights from some trivial metrics. We didn't even try to count lines of code properly; we just counted the lines in the files and divided them. That was a Unix script that took 15, 20 minutes to write.


Ashok Subramanian:

Yeah, I've seen this with teams: you look at the batch sizes of your commits to try and understand whether people actually are practicing TDD and working in short, fast feedback loops. And sometimes you can see in a code base, if it's a two-week iteration, that near the end of the two weeks you'll get those giant commits all coming together, hitting your version control system. Which, again, can tell you a little bit about the mindset of the team as they were working on it.


Korny Sietsma:

I find this fascinating. I think there's a whole lot of stuff here; we haven't even tapped the edges of this yet. There's quite a bit of research I've been reading coming out of Microsoft and other places where they're looking at the revision histories, mining them and asking, "Where do the bugs come in? And how do they associate with seniority of people in the team, new hires versus old hires, and rotations?" There's a lot of data to be tracked.


Korny Sietsma:

The only catch is it's extremely specific to an organization and a team. It's very hard to do in a general way. You can't point those metrics at one team in one company and then a different team in a different company and easily compare them, because there's a whole lot of other human factors that might be radically different.


Erik Dörnenburg:

Yeah, and sometimes... that's the speculative side, right? You're saying, let's visualize something and then see whether we find a trend or some outliers or patterns in there. But sometimes you have a specific question you want answered. There was a project that Rebecca worked on in the UK, a large monolithic application, and the build time was long. What people did was a simple visualization of which stages of the build took how long, and then it was much, much easier to understand how to optimize the build time and to see which step it was.


Erik Dörnenburg:

And I remember, based on what I saw there, I did the same on a different project and realized very quickly, through the help of the visualization, that somebody had changed a switch in one of the build files. Was it Ant or was it Gradle? I can't remember, but it was forking the VM for each unit test. Just seeing that change, you could revert it, and in that case you almost halved the build time. Again, it was shown through the visualization.
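For that kind of question, even an ASCII rendering can make the slow stage hard to miss. A sketch (the stage names and timings are invented, and how you collect them depends entirely on your build tool):

```python
def stage_chart(stages, width=40):
    """Render (name, seconds) pairs as a horizontal ASCII bar chart,
    scaled so the longest stage fills the full width."""
    longest = max(seconds for _, seconds in stages)
    lines = []
    for name, seconds in stages:
        bar = "#" * max(1, round(width * seconds / longest))
        lines.append(f"{name:<12}{bar} {seconds}s")
    return "\n".join(lines)

# Hypothetical stage durations pulled from a build log.
build = [("compile", 95), ("unit tests", 480), ("packaging", 30)]
print(stage_chart(build))
```

Comparing two such charts, before and after a build change, is often all the visualization the question needs.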


Ashok Subramanian:

So a lot of what we have been discussing is about code bases where you can see almost the entire code base and reason about it, using visualization as a technique for that. But nowadays I'm finding, and I'm sure you are as well, that teams are actually breaking this down: you might have an individual code base per service. Almost every organization I come across nowadays describes doing microservices in some shape or form. What's your take on applying these techniques to these highly distributed systems? I am curious whether you've looked at similar techniques in such code bases.


Erik Dörnenburg:

So, I personally haven't. For me, as I said, the motivation was the large code bases. And it coincides a little bit: I did less visualization when we switched more to microservices, because I felt that if you are in a microservice that is 10,000, 20,000 lines of code, which most people would probably consider even on the larger end, you can get away with not having a fantastic architecture. I'm not saying it's the right thing, but you can get away with it.


Erik Dörnenburg:

Because you have these other boundaries that stop you. But clearly, and I hope Korny can chime in on this one, the next step would be to take some of these ideas and apply them at the level of the microservices: to analyze how they are communicating, if they are communicating, or how access patterns across the services correlate with each other. And are they structured the same? Can you tell the different handwriting of a team across five microservices? Things like that. But I certainly have not looked into this.


Korny Sietsma:

I've started to look at some of these areas. It's quite interesting. At a simplistic level you can just display the microservices next to each other. One of the views in the tool I'm writing just has a circle for each service. You add all the data, and especially if you're not caring too much about the programming language, there's an element of just putting them side by side and saying, "I've got 20 circles for 20 services and I can look at them."


Korny Sietsma:

It won't work so well if you've got 2,000; there are issues at the high end of scale for any project. Tracking changes across those services is an area I'm still exploring, which I think has great potential. Adam Tornhill's books talk about temporal coupling, which is the idea that if you can say this file changes at the same time as that file, you can imply that maybe there's coupling. If 99% of the time when you change file A, file B changes as well, then you've probably got a coupling relationship there. There are some challenging aspects to it; it's quite computationally expensive to look at a large code base and work out what things change at the same time.


Korny Sietsma:

You can't just use a commit, because the services are in multiple repositories. So, this is an area I'm looking at. I think there's a lot of scope to say these services change together. I've worked on microservices projects, though unfortunately I don't have the code any more because it's client-confidential, where definitely, if you changed service A, service B had to change, and you could pretty well pick that up from revision patterns.
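One way that computation can be sketched, assuming the change sets have already been grouped (per commit, or across repositories by author and time window; that grouping is the hard part and is not shown, and the file names below are invented):

```python
from collections import defaultdict

def coupling(change_sets, threshold=0.8, min_changes=2):
    """Temporal coupling: for each ordered pair (a, b), of all change
    sets touching a, what fraction also touched b? Pairs under the
    threshold, or with too few changes to mean anything, are dropped."""
    touched = defaultdict(int)
    together = defaultdict(int)
    for files in change_sets:
        for a in files:
            touched[a] += 1
            for b in files:
                if a != b:
                    together[(a, b)] += 1
    return {
        (a, b): together[(a, b)] / touched[a]
        for (a, b) in together
        if touched[a] >= min_changes
        and together[(a, b)] / touched[a] >= threshold
    }

# Invented history: service A's API and service B's client move in step.
history = [
    {"svc_a/api.py", "svc_b/client.py"},
    {"svc_a/api.py", "svc_b/client.py"},
    {"svc_a/api.py"},
    {"svc_b/client.py", "README.md"},
]
print(coupling(history, threshold=0.6))
```

The `min_changes` guard matters: a file touched only once is trivially 100% coupled to everything in that one change set, which is noise rather than signal.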


Korny Sietsma:

I think there's definitely scope for that. Tracking duplication is another area that's interesting and, again, computationally challenging. But if you can say these areas have very similar patterns of files, that's another thing that would be quite viable to look at across services: okay, these 10 services are all effectively the same, why are they 10 services?


Ashok Subramanian:

And I find a similar sort of thing hard to visualize even on the infrastructure side of code bases now. Because a lot of it is no longer... especially if you're trying to do something in CloudFormation, you're writing stuff in YAML and so on, which makes it even harder to see, as things grow, what the interactions between different parts of the system are potentially going to end up doing.


Korny Sietsma:

But even with YAML, one of the really interesting things I found early on when I was looking into code-agnostic techniques is that indentation is quite a good proxy for complexity. You can have something that doesn't care about the file format at all; it just looks at how many spaces or tabs there are on the left-hand side of every line. And then you start doing things like computing the standard deviation of those numbers of spaces. It turns out that files that are constant, that don't vary much, are generally less complex. Things that are deeply indented tend to have nested if statements or case statements, or deeply nested YAML or XML.


Korny Sietsma:

That works for almost any file format. Not any, because some languages don't really use indentation at all, but in a lot of cases it's a good way to quickly spot complexity, even in a non-programming language.
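A sketch of that idea, using nothing but leading whitespace (the tab width and the choice of mean plus standard deviation are arbitrary; real tools may weight things differently):

```python
import statistics

def indent_stats(text, tab_size=4):
    """Return (mean, standard deviation) of indentation depth,
    measured in spaces, ignoring blank lines."""
    depths = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no indentation signal
        expanded = line.expandtabs(tab_size)
        depths.append(len(expanded) - len(expanded.lstrip(" ")))
    if len(depths) < 2:
        return (0.0, 0.0)
    return (statistics.mean(depths), statistics.stdev(depths))

flat = "a = 1\nb = 2\nc = 3\n"
nested = "if x:\n    if y:\n        if z:\n            do()\n"
print(indent_stats(flat))    # low mean and deviation: likely simple
print(indent_stats(nested))  # deeper and more varied: worth a look
```

Because it never parses anything, the same function works unchanged on YAML, XML, SQL, or 1970s Adabas Natural.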


Erik Dörnenburg:

That's a very good point. In programming languages that have curly braces, a very simple visualization is to strip everything but the curly braces: start a new line at each opening brace and continue until you find the matching closing one. Looked at sideways, it almost gives you a column chart of the complexity of the methods. So there are a lot of things you can do very simply that help you. And as I said, I think we can apply these techniques.
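A rough sketch of that brace trick (naive on purpose: braces inside strings or comments are counted too, and real code would need to handle those):

```python
def brace_silhouette(source):
    """Print one bar per line of code, sized by { } nesting depth,
    so the structure reads like a sideways column chart."""
    depth = 0
    rows = []
    for line in source.splitlines():
        rows.append("#" * depth)
        depth += line.count("{") - line.count("}")
    return "\n".join(rows)

java = """\
class A {
    void f() {
        if (x) {
            g();
        }
    }
}"""
print(brace_silhouette(java))
```

Deeply nested methods show up as tall spikes in the silhouette without the tool understanding a single keyword of the language.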


Erik Dörnenburg:

And one thing in the infrastructure space that you mentioned, Ashok: we are seeing more tools there again. We have OpenTracing, we have all these tracing tools, and there are a couple of commercial ones. I don't want to mention one and not the others, but we know who we're talking about here. We now have a space again where we have commercial interest and tooling. So again, maybe we can apply the same patterns: use the tools to extract some data, write some transformation ourselves, and use some of the visualizations. And I know Korny has also used some pretty standard tools for the beautiful visualizations.


Ashok Subramanian:

So Korny, I know you had mentioned the tool that you're developing yourself, and I know you're writing it in Rust. I'm curious about your experience, your journey of developing such a tool. And if one of our listeners is keen on doing something like that, should they choose Rust as a ...


Korny Sietsma:

That's a good question. I think this also relates back to the whole polyglot code discussion; there's a matter of choosing the right language for the job. I'm very much not a "there is one true language for everything" person. The initial tools I started writing were in Clojure. This was four or five years ago. We had a very rapid need to develop some visualizations assessing a large legacy code base, and the team I was on needed something quickly. They found Adam Tornhill's tools, which I mentioned before, for analyzing metrics around code; they looked very useful, and I'd been doing a lot of Clojure.


Korny Sietsma:

And so I took his tools, pulled the data out in Clojure, and then used some very simple JavaScript and D3 to visualize it. That worked quite well for a while, but a couple of years ago, when I needed it again on a larger code base, I found I was hitting significant performance problems. The Clojure code was great, but there were two problems. One was that startup time was just really annoying. It's not a huge problem, but when you're trying to run a whole lot of command-line tools and you're piping the output of one to the input of another, waiting two or three seconds for your code to start every time was quite annoying.


Korny Sietsma:

But also, memory use and raw speed really became significant. If you're looking at 10 million lines of code and you're just parsing text, it's not a good fit. Clojure is a fabulous language for complicated logical processing, but it's not a great language for raw string manipulation. Under the covers it's Java, the JVM, and even that's not the best. And I'd been learning Rust, so I said, "Oh, well. I'm going to attempt to rework some of this in Rust."


Korny Sietsma:

This also relates a bit to something Erik said earlier. I had originally followed very much the Unix model, where you have a series of little programs and the output of one gets piped to the input of the other. And I like that model, but the reality is, for a single person building something in my own time, I was adding a whole lot of complexity I didn't need. I was building a perfect architecture when really I just wanted something quick that would work. And Rust gave me the chance to just say, I'm going to start from scratch. I've got all the code, I've got the tests, I know how the thing I wrote before worked. I built it again in Rust.


Korny Sietsma:

And it was quite a steep learning curve, but I really enjoyed it. I started coding in C and C++, and I gave up on them because of all the warts and ugliness of those languages. Coming back to Rust was a bit nostalgic. It has some of that same simplicity, where I can have a mental model and say, "Oh, this object is stored in memory in this way," which I missed from those languages a bit. But it also has good memory safety, and really excellent developer tools and documentation, all the surrounding stuff you need to make a language easy to use.


Korny Sietsma:

It has a lot of painful learning curve stuff as well. I wouldn't say dive into Rust lightly; there's a lot about memory management you need to learn. But I found it a very effective tool, and I could reuse some of the things I'd learned from Clojure and Scala and other functional languages in a much lower-level, closer-to-the-metal language. But I should say only half of what I've built is in Rust. So, this is the multiple tools thing. The core piece of code that reads git logs, reads source code, and counts lines of code, that's all Rust.


Korny Sietsma:

It's a combination of stuff I've written and other libraries I've used. For example, I found a library that did lines-of-code calculations, so I didn't have to write that myself. It has a really, really simple parser that just looks for comments, so it doesn't need to do anything complicated. But then the Rust program churns through all the data, and it can take quite a few hours to run.
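
The lines-of-code library Korny used isn't named here, but a toy version of that kind of "just looks for comments" counter might look something like this sketch in Python; the single-prefix `line_comment` handling is a deliberate simplification:

```python
def count_lines(source, line_comment="//"):
    """Toy lines-of-code counter: classify each line as blank, comment, or code.

    Only handles single-line comments, in the spirit of the simplest real
    counters, which look for comment markers rather than fully parsing code.
    """
    counts = {"blank": 0, "comment": 0, "code": 0}
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith(line_comment):
            counts["comment"] += 1
        else:
            counts["code"] += 1
    return counts

print(count_lines("// header\n\nint x = 1;\nint y = 2;"))
# → {'blank': 1, 'comment': 1, 'code': 2}
```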


Korny Sietsma:

There is a lot of work on these large code bases. And it spits out a giant JSON file. Then there's a D3 and React front end, which is probably as much code in the end, that takes that JSON file and asks, "How do I display this? How do I interact with it?" Because a lot of visualization is very interactive. You need to explore, you need to say, "What if I try this thing? What does this file over here mean?" That's the other half of the code. I haven't tried making this a commercial thing.


Korny Sietsma:

Erik was talking about commercial tools. I think one thing about all these tools is, I'm a big believer in open source. If you're writing something in your spare time, it's useful for the community, and I'm using other people's open source libraries. It might well all stay open source, and somebody with more time may pick it up and fork it and do amazing things with it. So, yeah. It's a lot of fun as well.


Erik Dörnenburg:

That has been a bit of a sore spot for me. There's another constituency, if you will, other than consultants and developers like us and the commercial tools: academia, the universities. There is a group that I think originated at the University of Lugano in Italy, a number of people who spent almost a decade writing fantastic tools, but they chose Smalltalk, which was a good implementation language for them. The barrier to entry for practitioners in the field was really, really high, though, and it didn't adapt to the languages we were using at the time.


Erik Dörnenburg:

I have my hopes, high hopes even. I spent some time with the people at the University of Leipzig last year, and they have a tool now in which they store the model in a graph database, Neo4j. That then gives you the abilities that Korny talked about: the moment you can process a lot of data quickly, you can come up with more complex questions and more exciting visualizations than the simple ones. As I said, I'm a big fan of simple visualizations, but there is definitely also room for the large ones. Then you really have to start thinking about how you implement them, though, because it immediately gets really complicated.


Korny Sietsma:

And I think the academic code is an interesting one as well. One of the hiccups I had is, I'm using a visualization, a Voronoi treemap, which is a marvelous piece of maths and complication, but almost all the implementations are written by academic programmers. And I shouldn't criticize them too much. I'm using it, and it works, but they are not people who are experienced in writing robust, sustainable, long-lived code. And it's definitely one of the challenges.


Korny Sietsma:

It's interesting that some of these people are eating their own dog food: there are a few papers comparing academic code bases to industry code bases, and discussing some of the issues they found in that area. So yeah, I've had to take academic code and do things like run it in a loop up to 100 times until it succeeds, catching all exceptions and retrying with different random parameters. So, it's a bit of an ugly area.
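
That retry-in-a-loop workaround can be sketched like this; `flaky_layout` here is a hypothetical stand-in for a fragile academic routine, not any real library:

```python
import random

def run_with_retries(fn, attempts=100, seed=0):
    """Keep re-running a flaky computation with a fresh random parameter
    each time until it succeeds, swallowing exceptions along the way."""
    rng = random.Random(seed)
    for _ in range(attempts):
        try:
            return fn(rng.random())  # new random parameter on every try
        except Exception:
            continue  # catch everything and retry, as described above
    raise RuntimeError(f"still failing after {attempts} attempts")

# Hypothetical flaky routine standing in for an academic layout algorithm
# that only converges for some inputs.
def flaky_layout(param):
    if param < 0.7:
        raise ValueError("degenerate geometry")
    return round(param, 2)

print(run_with_retries(flaky_layout))
```

It is ugly, exactly as Korny says, but wrapping fragile code this way is often quicker than fixing it.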


Erik Dörnenburg:

And there's one other important aspect here: oftentimes the universities don't have access to a lot of commercial, proprietary code bases. So a lot of the analysis that we see from universities is done on code bases that are open source software, but those often show very different patterns, especially the larger open source pieces, different patterns than the proprietary code that we often find in organizations and product teams. So that research is often based on a different style of coding, and even a different style of architecture.


Ashok Subramanian:

And obviously there are the commercial pressures and things that we have to deal with on a daily basis writing commercial software; there's definitely a different set of compromises that people have to make. I suppose it's also reflective of teams that are used to working together for a long time, versus open source code bases, where the way the open source contribution model works is quite... Those patterns are quite distinct. Sorry, Korny, you were going to mention something.


Korny Sietsma:

Even in the commercial world, I think it's interesting, I have been reading a lot of these papers lately, and even in the commercial world you get completely different results from people looking at, say, the Windows code base with binary tools, versus... It's interesting, the metrics they derive. You read a paper that says, "Hey, we did this amazing proof that this is a thing we can use." And then somebody else tries it on their code base and it totally doesn't work, because that research came out of Microsoft and is largely based on operating system code.


Korny Sietsma:

Some research is actually quite old, from 20 years ago, and it's based on the giant systems of that era. Some research covers microservices, but not a lot. So I think there's a lot of work in finding how these tools adapt to the particular need you have. Even for us, we work in new code bases; there are places where we want to look at the code that we're writing now, but most places also have some level of brownfield systems that we want to understand. And there, the older techniques actually might be quite useful. So it's an interesting area. I'd love to try these things on more client code.


Ashok Subramanian:

There was one thing I wanted to ask, maybe as your closing thoughts from the two of you. Where do you think the future of visualization is going? Erik, clearly you've been looking at this for 10-plus years now. And Korny, you said you've been looking at a lot of the recent research on this topic. For teams or people who are interested in visualization and looking at how to apply it, any suggestions you would give, or your view on the direction this is going to go in?


Erik Dörnenburg:

So, if I think about what I would recommend for somebody who wants to get into this and adapt it to future needs, it is the toolbox idea. I like the Unix-style approach, as Korny called it. And in my opinion, the really good news is that with D3 we now have a fantastic visualization toolkit. This was something we really struggled with; I remember writing Java code to render pixel bitmaps and so on. That is all behind us, and I think D3 really is what you need, and it's easy to learn. On the other hand, we already discussed that the sources of data are either easier to get, because you have commercial tooling, or you use some very simple approaches like the whitespace approach that Korny also talked about.


Korny Sietsma:

Yeah. I think there's a lot of value in doing stuff on your own with simple scripts. I've been building this complicated beast, but you don't need that. You can just build some D3, it's really well documented, and they've done some major architecture improvements in the last few years. And do things like: what can I get with a bit of shell scripting, spitting out some numbers, putting them into a graph and seeing what happens?
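
As a sketch of that "spitting out some numbers" idea, assuming Python rather than shell: a tiny script that counts lines per file extension and dumps JSON that a D3 chart could consume. The demo file names are purely illustrative.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

def lines_by_extension(root):
    """Count lines of text per file extension under a directory,
    producing a dict that's easy to dump as JSON for a D3 chart."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            text = path.read_text(errors="ignore")
            counts[path.suffix or "(none)"] += len(text.splitlines())
    return dict(counts)

# Demo: write two small files into a temp directory and summarize them.
with tempfile.TemporaryDirectory() as d:
    Path(d, "a.py").write_text("x = 1\ny = 2\n")
    Path(d, "b.md").write_text("# title\n")
    summary = lines_by_extension(d)

print(json.dumps(summary, sort_keys=True))  # → {".md": 1, ".py": 2}
```

Point the function at a real code base instead of the temp directory and you have the kind of quick, throwaway data source both guests describe.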


Korny Sietsma:

Some of the best timing data we got out of GDS was where somebody had the bright idea of saying, "Hey, let's just log how long things take, store it in a file forever, and one day we'll have time to look at it." And a year later, when we needed to find out why things were going slow, that data was magic, and we just put the logging into graphs. It wasn't anything very sophisticated, but it was immediately useful. So, gather the data and then just learn the simple tools. I agree on D3. There are always competing tools out there, but D3's got a nice balance of power, and a lot of great examples of how to do things.
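
A minimal version of that "just log how long things take" idea, sketched in Python; the label, log format, and in-memory list standing in for a log file are all made up for illustration:

```python
import time

timing_log = []  # in real life: append one line per entry to a file forever

def timed(label, fn, *args):
    """Run fn, record how long it took under the given label, and return
    the result, so timings accumulate as a side effect of normal work."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    timing_log.append(f"{label}\t{elapsed:.6f}")
    return result

total = timed("sum-demo", sum, range(1000))
print(total)           # the wrapped call behaves exactly as before
print(timing_log[0])   # e.g. "sum-demo\t0.000012" — ready to graph later
```

The point of the anecdote holds: the logging costs almost nothing up front, and the graphs come later, when you actually need them.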


Ashok Subramanian:

Thank you both for your insights. It's been very interesting for me to delve in and actually see what might go on behind the visualizations that I've seen teams end up using. So, thank you for that. And thank you to everyone: Rebecca, my co-host, and Korny and Erik for taking part. We look forward to seeing all the listeners on the next edition of the ThoughtWorks Technology Podcast. Thank you.


Alexey Boas:

And on the next episode Rebecca Parsons and I will talk to Aravind, Brandon, and Zabel about different models of open source projects. Hope you will join us for that conversation.
