Software Problems I’ve Dealt With

This is a list of some of the problems I’ve encountered during my programming career, roughly ordered by how memorable they are, based on a combination of when they happened, how well I remember them, their difficulty, and what I learned from them.

They are color-coded in the following fashion:

Success! And a good learning experience.

OK, but I feel I could have done better.

You win or you learn, and hopefully I learned something.

If you’re looking for the application of specific skills (note: a skill may not be mentioned in the problem description itself, but it was used there):

Programming languages:
Bash: 3, 11
C++: 1, 4, 5, 7, 8, 9
Java: 2, 7, 11
Javascript: 3
Python: 3, 10, 11

Tools:
Docker: 11
Eclipse: 2, 11
git: 10, 11
gitlab: 11
gVim: 1, 3, 4, 5, 7, 8, 9, 10
Microsoft Word: 5, 9
Netbeans: 7
svn: 1, 2, 3, 4, 7, 8, 9

Skills:
Debugging: 1, 2, 3, 4, 7, 8, 10
Enhancements: 2, 5, 7
Hardware interfacing: 1, 4, 5, 7, 8, 9
Regression testing: 9
Technical writing: 5, 6, 9, 10

Performance:
Success: 1, 2, 3, 4, 5, 8, 9, 10
OK: 6, 11
Learning experience: 7

Approximate year:
2008: 1
2009: 4, 8, 9
2010: 5
2012: 7
2016: 11
2017: 2, 3, 6
2019: 10

Organization:
McMurdo: 1, 4, 5, 7, 8, 9
Polycom: 2, 3, 6, 11
Upwork: 10

So here they are, with the approximate year in parentheses:

1. McMurdo – 10 Second Lag Between Joystick And Camera (2008)


This was my first big bug out of college, and probably the first big bug I fixed, period. Shortly after joining McMurdo, I was assigned to a new C++ application service called TCAMD, which got the hardware components of a maritime surveillance system (the joystick, camera, video server, etc.) to communicate. Each of these devices, represented by a corresponding object-oriented class, spoke a different protocol, so they communicated via a central hub class using an Esperanto-style universal struct that each device class translated to and from; that way, no device needed to understand every other device’s protocol.
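
The hub pattern can be sketched roughly like this (a hypothetical Python gloss on the design; TCAMD itself was C++, and these class and message names are made up):

```python
# Hypothetical sketch of the hub pattern, not TCAMD's actual classes:
# every device translates its native protocol to and from one universal
# message, so no device needs to understand any other device's protocol.

from dataclasses import dataclass, field

@dataclass
class UniversalMsg:
    """The Esperanto-style struct all devices translate to and from."""
    command: str
    args: dict = field(default_factory=dict)

class JoystickDevice:
    def to_universal(self, raw):
        # e.g. a raw joystick packet becomes a generic move command
        return UniversalMsg("move_speed", {"az": raw[0], "el": raw[1]})

class CameraDevice:
    def from_universal(self, msg):
        # e.g. the generic move command becomes a camera-protocol string
        return f"MOVE {msg.args['az']} {msg.args['el']}"

class Hub:
    """Central hub: routes messages between devices via UniversalMsg."""
    def __init__(self, devices):
        self.devices = devices

    def route(self, src, raw, dst):
        return self.devices[dst].from_universal(self.devices[src].to_universal(raw))
```

With N devices, the hub needs one translator pair per device instead of a translator for every pair of devices.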

The overall layout of the applications:

[Diagram: TCAMD at the center, connected to the system hardware and to the TMAC and TMS View GUIs]

TMAC and TMS View were remote and local GUI applications, respectively. In addition to local commands from the joystick, TCAMD had to handle remote commands from the TMAC GUI. The commands TCAMD initially had to handle were:

Both:
  • Switch video (day, IR)
  • IR zoom (switch lens)
  • Image capture
  • IR autofocus
  • Fault clear

Joystick only:
  • Move speed
  • Day camera zoom in/out speed
  • Day camera focus in/out speed
  • IR focus in/out speed

TMAC only:
  • Move to azimuth/elevation
  • Day camera zoom to position
  • Day camera focus to position

Shortly into taking over this code, a problem was discovered by users. When the camera was turned off for a period of time, and then turned back on, the camera would seemingly fall 10 seconds behind the joystick’s commands, and the commands themselves would be erratic and the camera wouldn’t move properly.

I initially assumed that the problem was the joystick. In hindsight it wasn’t a well-founded assumption, but I was just out of college, and the joystick looked like something out of the early 1980s, so it was already suspect, and I looked there first. I put debug statements in and found that the joystick was sending the commands just fine. I then put debug statements into the camera’s Esperanto-style receive code and found that the commands were arriving fine there too.

The camera required a keep-alive message, which the camera code sent continuously; if a command came in from the joystick, it was added to this message. After turning the camera off and back on, and looking at the actual socket connection write code, I found something weird. The camera code was supposed to send these messages at 10Hz, but after the camera had been off for a long period and was turned back on, the messages would speed up to around 100000Hz. Looking at the socket’s read code, and finding that the camera stopped returning messages at the same time it acted up, I determined that the camera quit functioning properly under the barrage of messages, until it was restarted.

But why did this happen after the camera was turned off? For a while, I couldn’t figure that out. Using gdb, I traced it back to a company-written library.

TCAMD was a single-threaded application and used the C function “poll” to manage its events: “poll” checked for socket read events and also handled timer events. The company-written library was attached to “poll”, and added timer events via a signal from a timer class. In TCAMD, the 10Hz camera keep-alive message was managed by this timer class, which was set at a 100ms interval to send these messages. After sending a message, the timer was incremented by 100ms.

Unfortunately, when the camera was turned off, TCAMD made a blocking socket connection attempt to the camera, which took 3 seconds and prevented any other code from executing in the meantime. The keep-alive timer was also used as the socket connection timer, since it handled the same socket, so it would increment 100ms every 3 seconds. And it incremented relative to its last execution time, not the current time. So this timer would fall behind, and by the time the camera was turned back on, the timer was several minutes behind.

“poll” executes timer operations based on the time they report to it. So when the socket connection was made, the timer was reporting a time minutes in the past. To “poll”, that meant: execute the command and send the message now. Then the timer would increment 100ms, which should have made TCAMD wait 100ms before the next keep-alive message. Unfortunately, as long as the timer class reported a time in the past, “poll” saw that as “send the command now.” So rather than sending messages every 100ms, TCAMD sent them as fast as it possibly could until the timer class caught back up with the current time, which, if the timer was far enough behind, would eventually overload and corrupt the camera.

I thought that the initial solution would be to simply rewrite the timer code to set the next execution time based on the current time rather than its last execution time. But I was warned by a manager that other services used this timer and seemed to have no issues, so any change to this timer class could impact them. It was an early, important lesson: be very careful when you change library code that others are using. Rather than rewriting the timer code, I created a timer subclass which incremented based on the current time, used it in TCAMD in place of its parent class, and the issue was finally fixed.
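
The difference between the two timer behaviors can be sketched like this (a hypothetical Python model, not the company library’s actual code):

```python
# Hypothetical model of the bug and the fix: count how many keep-alive
# "fires" happen in a short catch-up window after the event loop has been
# stalled, comparing a timer that schedules its next firing from its last
# scheduled time (the bug) with one that schedules from the current time
# (the fix). Times are in milliseconds.

def count_fires(stall_ms, window_ms, interval_ms, rebase_on_now):
    now = stall_ms        # the loop wakes up after stall_ms of blocking
    next_fire = 0         # the timer's reported due time fell far behind
    end = now + window_ms
    fires = 0
    while now < end:
        if now >= next_fire:
            # poll() sees a due time in the past: fire immediately
            fires += 1
            base = now if rebase_on_now else next_fire
            next_fire = base + interval_ms
        else:
            now = next_fire   # idle until the timer is next due
    return fires
```

After a 60-second stall, the last-scheduled-time timer floods hundreds of messages while it catches up, while the current-time timer simply resumes its 10Hz cadence.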

2. Polycom – The Boot and the Phantom Call (2017)


As complicated as McMurdo’s systems could get sometimes, they had nothing on Polycom’s DMA. It was, and probably still is, a two-million-line program responsible for coordinating SIP, H.323, and other audio and video conferencing protocol messages. It had to get devices such as audio-only phones, smartphones, video conferencing devices, and computer video conferencing software, as well as internet software such as WebRTC, to communicate together, and to initiate, join, and close calls. It did this with a massive amount of coordination between devices, to determine what they could and couldn’t do (e.g. audio-only phones don’t have video, so don’t bother sending video signals to them). In addition, it recorded call records for configuration testing and debugging purposes.

Also, the DMA was built for redundancy, meaning it could be clustered for safety and failover purposes, so one server going down wouldn’t bring down a call or call records. This was great for customers. For us, it meant more code to ensure that each clustered DMA had the same data in real time.

DMA bugs were usually very difficult to solve for two reasons. One was the fact that Polycom had a large, senior, and good programming staff, so simple bugs were often eliminated quickly. There were some minor bugs I found, but these were the exception, not the norm. The other reason was basically what I already covered. The DMA was such a massive multi-threaded program that it was very difficult to trace through the threads and determine what was going on. Even with Eclipse’s Java debugger and its breakpoints, it was often convoluted to find and fix a bug. Despite the code generally being very well-written, I also had to be very careful that a fix didn’t introduce another bug. Some of the bugs I fixed were from others doing that, and I wouldn’t be surprised or offended if others had to fix a bug introduced by one of my fixes.

It’s difficult for me to remember many of the DMA bugs accurately because they were so complex, but I’ll do the best I can to describe them. This one was probably the one I can remember best, and was pretty typical.

In a certain DMA configuration, when a user attempted to join a conference call, roughly one out of 100 times, they would be booted out of the conference call within seconds, and would be unable to return to the call for 90 seconds. In addition, a phantom call, or a call not initiated by a user, would appear in the log after that. Phantom calls were common in the DMA call log when there were bugs.

As is typical with many DMA bugs, I spent days trying to find out what was happening. Either I or someone else figured out that when a breakpoint was placed in the right spot, this failure was guaranteed to happen. But as with my first McMurdo bug, where you see a bug isn’t necessarily where it lives.

I eventually found that when the call was just starting, the DMA would send a call initiation message to the RMX, the device which handled the media streams themselves. In between this message being sent and being received, however, the call was marked as inactive, since it hadn’t officially started yet. And the problem was that calls that were marked as inactive could be terminated.

If this was the case everywhere in the DMA, then calls wouldn’t be able to start at all, since they were inactive until the call began. So in 99 out of 100 places in the DMA, the code was written properly to not terminate starting calls. This was that one out of 100. While waiting for the reply in this specific location, if a cleanup thread executed before it got back (around one time out of 100, unless I added a breakpoint), then it would terminate the call and delete the Java object representing it. Then, the reply would get back. The DMA would be confused since it couldn’t find a Java object representing the “terminated” call, so it would just add a record of a new one, which led to the phantom call being displayed in the log.

Fixing this bug involved cleaning up multiple threads and rewriting the cleanup code to properly check if the call object was waiting for this specific reply, and to not delete the call if so, unless it had been waiting for it for too long. As I said before, I had to be careful not to introduce bugs while cleaning up old ones, and I probably wasn’t perfect at it.
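
The shape of the guard can be sketched like this (a hypothetical Python gloss; the DMA itself was Java, and the grace period here is made up):

```python
# Hypothetical sketch of the cleanup guard, not Polycom's actual code:
# an inactive call that is still waiting on an RMX reply is spared by the
# cleanup pass, unless the reply is overdue.

class Call:
    def __init__(self, active=False, awaiting_reply_since=None):
        self.active = active
        # timestamp of the pending RMX request, or None if nothing pending
        self.awaiting_reply_since = awaiting_reply_since

def should_terminate(call, now, grace_s=30.0):
    # grace_s is illustrative; I don't recall the real timeout
    if call.active:
        return False                      # never clean up an active call
    if call.awaiting_reply_since is None:
        return True                       # inactive, nothing pending: stale
    # inactive but mid-handshake: only delete if the reply is overdue
    return now - call.awaiting_reply_since > grace_s
```

Without the middle check, the cleanup thread races the RMX reply, deletes the call object, and the late reply then creates the phantom call record.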

3. Polycom – DMA Cluster Upgrade Failure (2017)


To upgrade the DMA, a user could receive an upgrade package from Polycom, load it, and it would update automatically, preserving the data. Very easy for the user.

Well, that was what was supposed to happen. In 2017, I was assigned a bug where the update of a clustered DMA system wouldn’t complete. Rather, it would be stuck displaying a Javascript update window. This specific user would reboot the system, but it wouldn’t work properly after that: its network configuration had been corrupted.

When I initially looked through the upgrade log, I couldn’t find a problem. Everything looked OK. Once I looked through the bash and Python upgrade scripts, and looked at the statements in the log and where they were in the scripts, I noticed that a good portion of the upgrade scripts looked like they had just been skipped. No reason had been given in the log either.

I found that not only had this specific configuration resulted in a script error when the user had attempted to upgrade, but that the bash script wasn’t written to handle and report this error, so it just skipped the block where the error happened and continued as if nothing were wrong. This block itself called an entire Python script, so a significant part of the upgrade process was skipped over. And this part of the script set up the DMA’s network configuration, so after the upgrade, the DMA’s networking was broken.
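
The underlying failure mode was an unchecked sub-step. A minimal sketch of the fix’s idea, in Python rather than the original bash (the step names here are made up):

```python
# Hypothetical sketch, not the DMA's actual upgrade scripts: run each
# upgrade step as a subprocess, log the outcome, and stop loudly on a
# nonzero exit status instead of silently skipping the rest of the step.

import subprocess
import sys

def run_step(name, cmd, log):
    """Run one upgrade step; log success, or log the failure and raise."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        log.append(f"{name}: FAILED (exit {result.returncode}): "
                   f"{result.stderr.strip()}")
        raise RuntimeError(f"upgrade step {name!r} failed")
    log.append(f"{name}: ok")

# The equivalent bash-side fix is to check the exit status of each block
# (or run under `set -e`) rather than letting a failed block fall through.
```

This also addresses the logging gap: the log records which step failed and why, instead of showing nothing where a whole script was skipped.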

I eventually fixed the bug and got the system to upgrade properly. I also added more debug statements to the bash and Python scripts to hopefully give any future debuggers a better idea of what was going on, after noticing the logging wasn’t that great in other areas as well.

There was one final issue: the Javascript upgrade page. This page told the user that the upgrade was in progress; when the upgrade was complete, the main upgrade script would reboot the DMA, and once the page could reach the DMA again, it would be replaced by an indication of completion. When this bug happened, however, the progress page would eventually display a flashing warning page, since it couldn’t reach the DMA with its broken network configuration. To make things more difficult, the main upgrade script wasn’t pulled from the upgrade package but from the pre-upgrade DMA itself, so it couldn’t easily be replaced pre-install (post-install, it would be replaced). Because of something in this script that I can’t remember, with my bug fix, the DMA could not reboot properly on its own. As a result, despite a successful upgrade due to the fix, the Javascript page would still display the warning page, potentially confusing the user.

However, with the fix, once the user rebooted the system manually, everything worked. So in addition to fixing the main upgrade script to properly reboot the system in the case of this configuration and my bug fix, I changed the Javascript page to tell the user to attempt to reboot manually after a certain amount of time, rather than giving the flashing warning. This was considered a temporary measure for this specific upgrade until the old main upgrade scripts were gradually replaced by the ones with my update.

4. McMurdo – Camera “move to azimuth/elevation” stall bug (2009)


As shown above and in the table, TCAMD didn’t just receive commands from the joystick; it could also receive commands from the TMAC GUI. One of these commands was the ability to point the camera to a specific azimuth and elevation. In TMAC, an AIS or GPS ship representation on a local map could be selected, the corresponding azimuth and elevation of that ship relative to the camera determined, and the camera pointed to that location. This was a fairly new command when I started. And immediately, there was a problem.

When moving from pointing towards one vessel to pointing towards another, the camera actuator would very often just stop in the middle of moving and report a fault. For a while, I couldn’t figure out why. I had seen this behavior occasionally when running “move to” commands, but it happened much more often when pointing the camera at ships (or, in the office, at virtual ship objects generated from past data). I tried sending multiple “move to” commands, but couldn’t reproduce the behavior.

I checked the manual for the camera actuator, and finally got specific information about the fault by logging its fault data bits in debug statements. The fault reported was a vertical stall. At first, this didn’t make any sense to me: when pointing towards the horizon, as the camera did when pointing at ships, vertical distances between ships were very small, often well under one degree. Then I realized that this was exactly the problem.

The Quickset camera actuator had three settings to control movement: acceleration rate, speed, and deceleration rate. In order to move tiny distances, the camera had to be able to accelerate and decelerate very quickly. If any of these rates was set too low, the camera would either not be able to start moving at all, or would overshoot the target. Both of these would cause the camera actuator to freeze and report a stall.

The solution was to preconfigure the camera actuator to allow a higher acceleration and deceleration rate, using its included software. Once this was done, the problem no longer occurred.
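
A back-of-the-envelope way to see why, under simple trapezoidal-motion assumptions (my own gloss, not Quickset’s documented control model):

```python
# Hypothetical kinematics sketch, not the actuator's actual control logic:
# with cruise speed v (deg/s), acceleration a, and deceleration d
# (deg/s^2), ramping 0 -> v consumes v^2/(2a) degrees and v -> 0 consumes
# v^2/(2d) degrees. A target move shorter than the sum can't be completed
# cleanly at that speed: the actuator either can't get going or overshoots.

def min_clean_move_deg(speed, accel, decel):
    """Degrees consumed accelerating to `speed` and decelerating back to 0."""
    return speed * speed / (2 * accel) + speed * speed / (2 * decel)

def risks_stall(target_deg, speed, accel, decel):
    """True if the move is shorter than the ramp-up/ramp-down distance."""
    return target_deg < min_clean_move_deg(speed, accel, decel)
```

Raising the acceleration and deceleration rates shrinks this minimum, which is why reconfiguring the actuator fixed the sub-degree moves between ships.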

5. McMurdo – TCAMD Configuration Instructions (2010)


By 2010, I had already added significant improvements to the TCAMD code. They included:

  • Manual day camera iris control that the user could adjust based on lighting conditions
  • Day camera digital zoom
  • Code handling the replacement of the video switch with a 4-port video server
  • Other bug fixes and enhancements described on this page, which required configuration

In addition to the new changes that required configuration, the coastal sites that ran TCAMD all had different equipment, and there were different ways to configure TCAMD based on that equipment.

I thought about writing code to do the configuration automatically, but I decided that that would take too long. So I created a list of configuration instructions which allowed a field engineer to properly configure TCAMD depending on the overall system type, the hardware being used, and the allowed camera zoom values. I also created an interactive program which allowed users to run certain TCAMD commands from the command line, so that they could easily see certain values such as zoom and focus position. Using these values, they could properly set and configure the ranges in a configuration file.

Despite being pretty young, I knew that I was a software engineer very familiar with what I was working on, and that I needed to make sure that others less familiar with TCAMD could run the code. So I had coworkers attempt to use my configuration instructions, and sure enough, they were confused in a lot of places. With their help, I cleaned up the instructions to make them more readable to field engineers.

6. Polycom – Windows server configuration instructions wiki cleanup (2017)


With the proper configuration, it was possible to get Polycom’s DMA to work with Microsoft’s Skype for Business. The proper, and very complicated, configuration. When I first started at Polycom, I was warned to stay away from this until I had better knowledge of the DMA. Later, I figured out why.

Configuration took around an hour if you knew what you were doing and knew how to troubleshoot if you made a mistake. If not, it would be hours of obnoxiousness.

There was one major thing that didn’t help: Polycom’s software wiki instructions for this configuration were scattered all over the place, across various pages, and there were many missing troubleshooting instructions, leaving the developer to look around the wiki to see if anybody had run into a similar issue. Very often, they had, but it was on a completely different page which only showed up when searching the entire wiki.

In an effort to stop annoying myself, and to hopefully help others in the future, I consolidated this info into one wiki page. If I couldn’t put the info there, I added links to it in this configuration page. I enjoy trying to make things like this easier for myself and others with clear instructions, so it wasn’t too difficult.

I then realized that a system diagram, with the configuration steps mapped to the corresponding parts of the system, could make things even clearer.

Unfortunately, I never got around to making this diagram. Maybe I was afraid I was wrong about the high-level setup. I also couldn’t find good diagramming software at the time, but in hindsight, a hand-drawn diagram would have been fine, and I could have had the humility to ask somebody whether my diagram was right. It certainly wouldn’t have hurt to try.

So this was a partial success. I succeeded a lot when it came to cleaning up and ordering the instructions, but I look back and realize that I could have made things more clear. And I shouldn’t have been as afraid of being wrong.

7. McMurdo – Oil Rig Deterrence System (2012)


This project was a disaster. The major positive from it was that I learned a lot, not just about what not to do software-wise, but how not to run a project, and especially, how software can’t solve every problem.

McMurdo’s Trident project, which included TCAMD, was beginning to wane and new business wasn’t coming in, so we looked to other possible uses of our maritime surveillance system to stay afloat. One of these was an oil rig deterrence system in Qatar. Ships often fished close to oil rigs since lots of fish hung out around them. Unfortunately, they weren’t supposed to do this since they ran the risk of crashing into the rig, so fishing too close to the rigs was made illegal. But this was difficult to enforce on offshore rigs kilometers from shore, and fishermen looking for good fish and a good profit would do it anyway.

Our “solution” was a deterrence system with a targeted audio speaker (LRAD) which could send an alarm at up to 160dB. But even here there was a problem: we heard that this had already been tried by other companies, and fishermen simply used ear covers; without naval support, these systems hadn’t worked when tried. But we figured that we could at least get an ID on the fishermen with a camera that was part of the system. Or we were just looking to get paid, and giving in to wishful thinking.

The camera stunk. We afforded the contract by using less-than-ideal equipment, including a very low resolution camera. In other words, a camera that, even at close distances, wouldn’t allow you to ID anybody, especially at night when IR was used. And while the oil company was OK with the camera initially, the camera’s poor quality was one of the major issues in the end.

Then there was our software and our organization. People often say that during a company’s difficult times, middle management is the first to go, and I learned there’s a good reason they say that, because it didn’t happen at McMurdo during this time, and I saw the consequences. Many software developers were laid off, and many managers were kept on. The explanation later given to me was that these managers were the face of our company, and we needed them to promote and represent the company at a difficult time. While this may have been true, and I’m sure I didn’t know much about what was going on given my position, it may also have been better to just be honest with clients about what was going on. This attempt to hold up our image would also affect us later in the project, when an employee of the oil company told me it wasn’t much of a secret that we were struggling, despite our trying to place the blame on whatever else we could.

Many of the managers accepted deferred salaries and worked less for less pay, while no change was made to the developers’ schedules, and we, given other developers being laid off, were expected to cover their work, leading to disproportionate responsibility on the part of developers. This led to a snowball effect where some remaining software developers got disgruntled and/or wanted to move to new things, and quit. Too few developers were left, and it became impossible for us to fix all of the bugs in the system.

And then there was me. Being a developer, I felt that there was just too much to do in too little time to get this system to even a satisfactory level. No specific deliverables or deadlines for those deliverables were provided, and requirements were very vague (e.g., “Make it work!”). When we tried to bring this up, we weren’t really listened to and were greeted with optimism cliches (“I know you can do it!”). In hindsight, I was part of this project too, and could have, and should have, made more of a stand. But I was overly fearful about losing the job, and in hindsight, this fear was irrational. At the time, I struggled with bad anxiety, and didn’t even know it. I had been led to believe that if I resigned without a new job lined up or was fired for being critical, I would seriously struggle to find a new job, possibly never finding one. I ignored the fact that I was a software engineer with years of experience and some money saved up at that point, and the difficulty of finding a new job in those circumstances, at least in my field, had been greatly exaggerated. So while we were partially ignored, I was overly anxious and definitely could have been more assertive, and I was part of the problem too.

Despite the issues with the company structure, I “completed” the following programming tasks with the time I had:

  • The LRAD had four components: the audio speaker with volume control, a light which had three settings (off, on, blinking at a specified Hz), an actuator, and the audio streaming component. I wrote device drivers for each of them.
  • I wrote the Java GUI components for each of these elements
  • I wrote the code for camera movement, and code to sync the camera and LRAD, so the LRAD would point where the camera was pointing

I couldn’t get the LRAD to work properly with real-time audio, so I resorted to loading and playing pre-recorded files with it. In hindsight, this was my biggest programming mistake on this project, and I should have persevered so that real-time audio (e.g. a microphone) could have been used.

Even with all the issues, we managed to deploy the system. While it passed the initial acceptance test, bugs started appearing everywhere in the system. I did what I could to fix them, but there were so many bugs and so little time that I couldn’t even fix all of what I saw as the critical ones. I, and several of the other remaining developers at the company, made several trips back in an attempt to get the system working. Each time, I came back with a massive list of bugs, and I could only attend to the most important ones before I was sent out again.

The biggest issue: it didn’t work. The LRADs occasionally stopped working until they were restarted, but the system we had didn’t allow for remote reboot, so they could only be rebooted when an employee went out to the rig. Even when the LRADs were working, fishermen weren’t scared away and even waved at the camera. One did something to the system I won’t repeat here.

If not for a move by my immediate manager, the project would have been even more of a disaster. He put together a list of the explicit requirements specified in the contract, we worked on those until they were completed, and he got the oil company to sign off on each of them. We got paid, but the oil company wasn’t happy with the system.

So I learned a lot of lessons:

  • When times are tough, don’t just jump into any contract. Make sure it’s something you can complete successfully.
  • If a similar product by another company has failed in the past, figure out why, and if those issues can be resolved.
  • If you can’t get the right equipment for the job, don’t do it.
  • Make sure you have enough people and time to do the work required. Consult employees and/or coworkers and use an estimation system if necessary. Listen to the people who do the work.
  • In tough times, higher standards should go to leaders, and while standards need to be high for anyone working on the project, they shouldn’t disproportionately fall on lower-level workers.
  • Sometimes it’s better to persevere, debug, and program an interface well than to do the easy thing that’s less effective in the long run.
  • For remote systems, add the ability to power cycle remotely whenever possible.
  • Have explicit requirements and terms. They saved us from a bigger disaster.
  • If you feel a project is doomed to failure, and you don’t feel like you’re being heard, either speak up and make a stand or leave. Don’t allow yourself to become passive-aggressive. You’re not helping yourself or the company.
  • Most importantly, and always a good lesson for software developers: software won’t help if the system, even with perfect software, won’t do what it’s supposed to.

8. McMurdo – Camera “Zoom To/Focus To” Bug (2009)


A user could send a remote command from the TMAC GUI to TCAMD to have the day camera zoom to a specific value, represented by a number between 0 and 255. However, in some cases the camera would continuously attempt to zoom in despite reaching the maximum zoom in value possible. This continuous attempt to zoom would block any other command from being sent to the camera, and TCAMD would have to be rebooted.

I quickly found that the problem was that the “zoom to” function would attempt to reach a value beyond the range of the camera’s actual zoom capability. While the protocol allowed zoom position values between 0 and 255, the range the camera hardware could actually reach was narrower. For example, a camera’s valid zoom range might be 50 to 200, and if a “zoom to” value fell outside that range (e.g. 30 or 230), TCAMD would perpetually zoom in or out in an attempt to reach a position the camera could never report.

My initial solution was to have the zoom function time out if it didn’t succeed after a certain amount of time, so other commands could be sent after that and TCAMD wouldn’t need to be rebooted. Later, though, I added a configuration program which allowed the user to obtain the minimum and maximum zoom values allowed for the specific camera. They often varied slightly. I then placed these values into a configuration file, and didn’t allow any zoom position values outside of this range.
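
The final guard amounts to a clamp into the configured range (a sketch; the real code was C++, and the range values below are just examples like those above):

```python
# Hypothetical sketch of the guard, not TCAMD's actual C++ code: the
# per-camera min/max came from the configuration file, and any requested
# "zoom to" position is clamped into that range before being sent to the
# camera, so the seek loop can always terminate.

def clamp_zoom(requested, hw_min, hw_max):
    """Clamp a 0-255 protocol zoom position into the range this camera
    can actually reach."""
    return max(hw_min, min(hw_max, requested))
```

The same clamp applies unchanged to the focus position range.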

I would later do the same thing for the camera’s focus command, as I found that the same problem was occurring there.

9. McMurdo – TCAMD Regression Testing (2009)


“Regression testing? What’s that?”

That was how I replied to my manager after TCAMD was failing to start. I had put in a new parameter to prevent TCAMD from using unnecessary classes when TCAMD was used on a ship, rather than at a coastal station. On a ship, TCAMD’s sole responsibility was image capture from TMAC.

Later, we were getting ready for acceptance tests, but TCAMD wouldn’t start. And the employees there couldn’t figure out what this missing parameter that TCAMD was complaining about in the log was. They pulled me over to look at it.

This was a classic entry-level developer mistake.

I programmed this as if I would always be the one deploying it. I hadn’t thought about or tested how TCAMD would react without the parameter specified in the configuration file, deployed by people who didn’t know the parameter existed. Once I made TCAMD work without a specification, so field engineers wouldn’t be confused, the problem disappeared. And I was told about regression testing: if you make a change to the code, make sure it still works with older systems, and if it doesn’t, tell field engineers what they need to do to upgrade.

Later the same thing happened, but with the zoom command. Shortly after one fix, I found a specific “zoom position” command in the camera protocol manual. Before that, “zoom position” was really just zooming in or out until the reported position matched what we wanted. I wondered how my company had missed this command before, since it would make things so much easier. I replaced the old “zoom position” commands with this new one I found in the manual.

But then “zoom position” commands stopped working on some cameras. I couldn’t figure it out for a while, to the point where I had to revert all the commands. Then I finally figured out that the protocol manual I got was for a new camera firmware version, and that the commands wouldn’t work with the older version.

So I created a command to pull the camera’s firmware version, and choose the “zoom position” command based on that.
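
In sketch form (the real code was C++, and the version cutoff here is invented for illustration):

```python
# Hypothetical sketch, not TCAMD's actual code: query the camera's
# firmware version once at startup, then dispatch to the direct protocol
# command on new firmware and the old seek-by-polling loop otherwise.

DIRECT_ZOOM_MIN_FW = (2, 0)   # illustrative cutoff; the real one came from the manual

def pick_zoom_strategy(firmware_version):
    """Return which 'zoom position' implementation to use for this camera."""
    if firmware_version >= DIRECT_ZOOM_MIN_FW:
        return "direct"       # single protocol command on newer firmware
    return "seek_loop"        # zoom in/out until reported position matches
```

Dispatching on a capability queried from the device, rather than assuming all units in the field behave like the one on your desk, is the general regression-testing lesson here.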

After a few events like this and a few more hardware options added to the surveillance system, I created a regression testing chart, and made sure that TCAMD operated correctly in each configuration.

As configuration got more complex, I also began to create a document to make the configuration easier for field engineers.

10. Upwork – Python Program on Mac (2019)


This was my first, and as of February 18, 2019, last Upwork problem.

A client came to me about a Python program that was not running properly on his Mac. I downloaded the program from GitHub onto my own Mac and attempted to run it. The script required outdated dependencies, and I spent a good deal of time trying to find them, since I remembered times when I had given up on that, taken the easy way out, and ended up unable to get a project running because of dependency mismatches.

I first tried a Python virtual environment. However, the dependencies specified in the README file were old to the point where I couldn’t find them in Homebrew, and I had to install some of them manually. I attempted the Linux instructions first, but this became very difficult, so I tried a different approach: I switched to the newest versions of the dependencies and got the program running, but it hung in the multiprocessing logic. So I took out the multiprocessing logic and ran it sequentially, getting weird results. But my client indicated that he just wanted the script to run.

I have a Linux AWS machine, so I installed the right dependencies and ran the program there. It ran fine, suggesting that the software was built to run on Linux. Its results were identical to what I saw on my Mac with the multiprocessing logic taken out.

Back on my Mac, I used Anaconda to finally find packages for all of the dependencies specified in the README and got the script running with those old versions. I ran into the same multiprocessing freeze, all but confirming that this software was not written for a Mac.

With some online research, I was able to get the multiprocessing code running correctly on the Mac, again with the same results. Since the results were also the same under the newer dependencies, I forked the program (BSD license), rewrote it to use the newer dependencies and work on a Mac, and helped the client get it working on his machine. Once that was done, and after updating the README instructions, the contract was complete.
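For what it’s worth, a common cause of Python multiprocessing hangs on macOS is work launched outside the `if __name__ == "__main__":` guard, since the `spawn` start method re-imports the main module in every worker. A minimal sketch of that standard fix, not the client’s actual code:

```python
# Minimal sketch: run a worker pool safely on macOS. Not the client's
# script; just the general pattern that resolves this class of hang.
import multiprocessing as mp

def square(x):
    return x * x

def run_pool(values):
    # Use the "spawn" start method explicitly; it re-imports this module
    # in each worker, so the pool must be created from the guard below.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    # Without this guard, spawn-based workers would re-execute the
    # module-level launch, and the program could hang or error out.
    print(run_pool([1, 2, 3]))
```

Linux defaults to `fork`, which tolerates unguarded module-level launches, which is one reason a script can run fine on Linux yet freeze on a Mac.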

The updated code is available here.

11. Polycom Integration Test Framework (2016)


Around 2016, Polycom began work on a Microsoft AQUA cloud project, which we looked to deploy with a DevOps pipeline. I volunteered for, and was assigned, the responsibility of handling the integration testing.

The Polycom setup was made up of several Docker images which would run in the cloud. Unfortunately, this was before Kubernetes became popular, so when I looked online for ways to integration-test Docker images, I was lost. I couldn’t believe that nobody had run into this problem before. It turns out they had, hence Kubernetes, but at the time I was unable to find anything. So I attempted to create my own framework.

The framework was launched by a bash script. I then created an object-oriented Python testing suite, which started multiple Docker containers, some of them dummy test doubles, and had them send data to each other. I chose Python for the ease of its command-line operations, given that this was running in GitLab. However, this may have been a mistake.
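The harness had roughly the following shape: a Python wrapper that starts containers on a shared Docker network and tears them down after the test. The image and network names here are invented, and the real framework did much more.

```python
# Rough sketch of a Docker integration-test harness. Image and network
# names are hypothetical; the actual Polycom framework differed.
import subprocess

class Container:
    def __init__(self, image, name, network="itest-net"):
        self.image = image
        self.name = name
        self.network = network

    def run_command(self):
        # Build the `docker run` invocation without executing it, which
        # also lets the harness log or dry-run what it is about to do.
        return ["docker", "run", "-d", "--rm",
                "--name", self.name,
                "--network", self.network,
                self.image]

    def start(self):
        subprocess.run(self.run_command(), check=True)

    def stop(self):
        subprocess.run(["docker", "stop", self.name], check=True)
```

Putting every container on one user-defined network lets them reach each other by container name, which is what makes the "send data between images" part of such a test possible.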

Polycom’s dominant language was Java, by far, and it was difficult for my coworkers to run the system. Many of them didn’t know much bash or Python, so I often had to help them run it. Many also had Macs, while I had a Linux box, so the bash script behaved slightly differently on their machines, and I had to look up the differences between Linux and Mac bash scripting to fix the problems.

Then there was a bug where the connection between the containers dropped, or so I initially thought. I tried to fix it under the assumption that the network connection was changing or dropping, but didn’t succeed. Only shortly before being transferred to a new project did I realize that the containers were being shut down prematurely, and I never got the chance to fix it.

While I’m proud of my initial creativity, I definitely could have done better. I could have asked my coworkers for more help with the Docker issue. I was too proud, and by then I was beginning to realize that a 9-to-5, strictly software (i.e., no hardware interfacing) job wasn’t my thing and was starting to burn out, but that was not a good excuse. Again, as with McMurdo, I was struggling to figure out the right time to leave if I wasn’t feeling it.

Also, I definitely should have reconsidered using what I knew (Python) and instead used what everybody else knew (Java). When I began at McMurdo, a coworker used an obscure build tool called SCons, and the fact that people at the company didn’t know much about it made the build difficult to debug, especially after that coworker left. Again, it was a learning experience.

12. McMurdo – Morocco Deployment (2014)

By 2014, my experience on one project, as well as McMurdo’s continued top-heaviness and organizational issues, was taking a mental toll. I was trying to stick it out, but I didn’t have the passion I’d had when I first started. And as I said, if I wasn’t able to contribute well, then for both McMurdo’s sake and my own, I should have resigned or started looking for new work. But compared to now, I wasn’t great at listening to myself, and I continued to push, hoping I would be rewarded well in the end.

Still, I was able to perform well in Morocco. I enjoyed the travel that came with working for McMurdo, even when I struggled with installations, such as the one in Qatar. The lack of this travel, and the comparative lack of variety, would be an issue at Polycom later.

In Morocco, things were different. While some new software had to be written, it was significantly less than in Qatar. And while the software still had plenty of bugs, we weren’t plagued by as many critical ones. By this time, I also had a broad understanding of the entire system.

The install consisted of several Linux servers (including backups), several Windows computers, an approximately 30TB backup hard drive, Postgres servers, radar units, and AIS units, divided among a command center and several coastal stations. We also had to transfer data from the customer’s old servers to the new ones, so the possibility of screwing up and losing months of data was unnerving.

While an independent contractor told me that using VMware would probably have been a better idea, and virtual images were becoming more popular at the time, I was able to do my part in getting the whole system functioning more or less well. Early on, I made a big mistake common among system administrators, accidentally locking myself out of sudo, and feared it would set the tone for the trip, but I recovered. I also kept very good track of what was going on, at one point using a dry-erase board; primitive, but it worked. The rollover and installation of the new equipment went through without any major issues.

The work and hours were difficult. I was struggling with sleep deprivation, and, being an introvert who didn’t understand that well at the time, pushing myself to spend too much time around people with very little relaxation, I blew up at a coworker at a restaurant. Again, I had struggled to listen to myself. That anger could have derailed the project, but our install group as a whole did a good job of settling things down quickly, and we were able to get things done.

It was a wild, difficult experience, but it was probably my biggest successful group project management experience to date.

13. L-3 Communications – Metrics Analysis (2007)

In the second half of my internship with L-3, I was finally assigned something. Something that would, unfortunately, make me familiar with the phrase “analysis paralysis”.

My manager asked me to do a metrics analysis. Specifically, the goal was to get a number: lines of code per day. To do that, I had to gather the following factors:

  • Difficulty of the project from the programmers
  • Lines of code that were written rather than generated
  • Approximate skill of the programmers (from their managers)
  • Difficulty of the programming languages
  • Whether or not the programmers were working on other projects at the time

… and so on.

So I came up with two tasks:

  1. Build a line counter
  2. Interview the programmers

For the first, I had Understand for Ada, but that only handled Ada, which was at least close to VHDL, the language most of the programs were written in. So I wrote my own line counter in Perl, built to exclude generated files by looking for specific strings that indicated generation.
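The counter’s core logic looked roughly like this; the original was Perl, so this is an equivalent Python sketch, and the generation markers are examples rather than the actual strings I matched on:

```python
# Equivalent Python sketch of the Perl counter: skip files whose first
# lines carry a "this file was generated" banner, then count non-blank
# lines. The marker strings here are examples, not the actual ones.
GENERATED_MARKERS = ("AUTOGENERATED", "DO NOT EDIT")

def count_written_lines(path):
    """Count non-blank lines, returning 0 for files that look generated."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        lines = f.readlines()
    head = "".join(lines[:10])  # generation banners usually sit at the top
    if any(marker in head for marker in GENERATED_MARKERS):
        return 0
    return sum(1 for line in lines if line.strip())
```

Only scanning the first few lines for a banner keeps the counter fast on large trees while still catching typical code-generator output.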

A comical moment happened when my command-line script came across a binary file full of system bell characters. “BEEP! BEEP! BEEP!”, over and over, annoying much of my office until I terminated the script. So I switched to a GUI interface that wouldn’t print output like that.

I did a decent job on the first task, but my manager’s coworkers were skeptical: it was a tool written by an intern, not an official one. Still, I had struggled to find official tools for all the languages I encountered, so I felt that writing my own program was the best option. And it gave me programming experience.

Where I bombed on this project was the second task. I was so worried about annoying the engineers, forgetting questions, and having to come back to them again that I tried to put together a perfect list of questions. I kept telling myself, as I worked on my line-counting program, that I’d eventually get to the interviews. I never did. Not a single one. Again, I was overly afraid of annoying them, of stuttering, of anything that would make me look like an idiot.

In the end, I had a decent line counter, but my coworkers weren’t too impressed: I had no analysis complete, not even a partial one. L-3 was forgiving, given that I was an intern, but I realized I could have done a lot better. The lesson was pretty simple: don’t be so afraid of looking imperfect that you do nothing and look far worse in the end.

14. McMurdo – A New Configuration Parameter (2009)

TCAMD could run at a coastal station or on a ship. On a ship, however, the camera and joystick were independent, so all TCAMD did was take and execute photo commands from the TMAC GUI. When I took over TCAMD, what did the joystick/video switch/camera code do when it ran on a ship? Complain. Complain about missing devices in the log, and flood it. It wasn’t a huge issue, so at first McMurdo had just ignored it. But I immediately saw potential for improving the code, so I decided to make this my first enhancement.

I succeeded, but I later got some criticism for not telling anybody about my fix.

15. McMurdo – CPU Temperature/Percentage Monitor, Java GUI

In Morocco, some Linux servers were overheating due to an overactive radar-capture program. The program could be slowed to a lower frame rate, but people at the stations often couldn’t tell when a server was overheating. I wrote a Linux program that pulled CPU utilization and temperature data from each of these servers, then wrote a Java GUI program that took that data and displayed it on Morocco’s TMAC GUI. I thought about adding an alert to the GUI to improve the user’s awareness, but that idea was scrapped for something more urgent that I can’t remember. This enhancement at least allowed me to leave McMurdo on a fairly good note.
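On Linux, the collection side can be sketched from /proc/stat: CPU usage is the non-idle share of the delta between two samples (temperature typically comes from /sys/class/thermal). The monitor’s actual code is long gone; this just shows the idea.

```python
# Sketch of the collection side on Linux, assuming the standard
# /proc/stat layout (user nice system idle iowait irq softirq ...).
# Not the original monitor's code.

def parse_cpu_times(stat_text):
    """Return (idle, total) jiffies from the aggregate 'cpu' line."""
    values = [int(v) for v in stat_text.splitlines()[0].split()[1:]]
    idle = values[3] + values[4]  # idle + iowait
    return idle, sum(values)

def cpu_percent(sample_a, sample_b):
    """CPU usage between two /proc/stat samples, as a percentage."""
    idle_a, total_a = parse_cpu_times(sample_a)
    idle_b, total_b = parse_cpu_times(sample_b)
    delta_total = total_b - total_a
    busy = delta_total - (idle_b - idle_a)
    return 100.0 * busy / delta_total
```

Two samples are needed because /proc/stat counters are cumulative since boot; a single read says nothing about current load.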

This was the last enhancement I can remember making at McMurdo. Weeks before the end, I remember telling my dad at a restaurant that I was beating myself up over my inability to leave or find a new job, and over not having the energy to search for one. My mental health was taking a hit. The conventional wisdom was, and is, to leave a job only when you have a new one lined up. So I tried to stick to conventional wisdom. Way too long.

When I finally, completely, ran out of the psychological energy to continue, and told my manager that I had been looking for a new job, he told me that the company had been aware of my reduced performance and had been considering letting me go, but that they remembered how good my performance was years ago and had hoped I’d return to that level. I was just too dejected to do so.

That very day, I half resigned and was half let go. Again, rather than making the best decision for myself and the company, I had tried to push myself too hard through the difficult times. And despite my issues with the company, I had been a part of it too, and therefore part of the problem. Especially since I was preventing myself from finding something I enjoyed, and preventing someone who would be more engaged at McMurdo from stepping in.

I did not repeat this mistake at Polycom or while doing solar installations. The “two year” rule can help pad my resume, and waiting until I have a new job before quitting one can be comfortable, but if I stick around at a job I’m not engaged in, I’m just putting myself and my company through difficulty for the sake of looking good and feeling secure, rather than being happy.