MTA Bus Visualization
Creative Visualization of Bus Trips in NYC
I visualized the movement of buses running in NYC on the morning of February 12, 2019. The bus records were gathered from the MTA website. Each line shows the movement of a bus between one record and the next. The line color is based on the distance between the two records: red and orange indicate long-distance travel, and blue indicates a short trip. This simple, beautiful visualization gives us a way to understand the pattern of bus trips in NYC.
I analyzed the data using Python and made the visualization using openFrameworks. The source code is available on my GitHub page.
METHOD
Data Analysis / Data Visualization
TOOL
Python / openFrameworks / Photoshop
ROLE
Data Visualization Designer
PERIOD
September 2019
Motivation
The Metropolitan Transportation Authority (MTA) publishes various data on the buses running in NYC: position, time, destination, and so on. I was interested in what patterns would appear when I visualized the bus data. At the same time, I wanted to compare the result of the visualization with my instincts. I often feel that buses run really slowly, and that it is hard to estimate how long it will take to get to my destination.
Approach
In the visualization, I focused on the position of the buses running in NYC during rush hours: 7:30 a.m. to 9:30 a.m. I accessed the MTA bus data through their API and gathered it every two minutes. At first, I was interested in the bus speed, but it is difficult to estimate the actual velocity because the records are discontinuous. When a bus runs at sixty miles per hour, it can travel two miles in two minutes. Moreover, it does not always move in a straight line. It can stop or turn at an intersection. Therefore, if I calculate a velocity from the positions of the bus at time A and time B, it can differ from the actual speed. It is not appropriate to refer to this value as the bus speed. Instead, I simply connected two consecutive GPS records of a bus with a line. The color is based on the distance between the two points, to visualize the pattern of bus movements. Red and orange mean the points are far from each other and the bus moved a long distance. Blue and purple indicate the points are close to each other.
This approach has a limitation. Ideally, the dataset should contain a record for every bus every two minutes. However, records are sometimes missing. For example, the record following the one at 7:30 a.m. can be from 7:34 or 7:36 instead of 7:32. If I connect the points of these records, the line might be two or three times longer than it should be. This may cause a misunderstanding of the patterns of bus travel. So, I removed the records whose time gap is more than four minutes. I then changed the transparency of the line based on the time gap between the two records. The opacity of a line between two records with a four-minute gap is 95% lower than that of a line between records with a two-minute gap. This method makes the data with more uncertainty less prominent than the rest.
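The distance-to-color idea can be sketched in Python as follows. This is only an illustration, not the code used for the project: the haversine formula, the function names, the `max_km` saturation value, and the purple-to-red hue range are all assumptions chosen to match the description above.

```python
import colorsys
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two GPS points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_rgb(distance_km, max_km=2.0):
    """Map a distance to a color: purple/blue for short, orange/red for long.

    max_km is a hypothetical saturation point; longer distances stay red.
    """
    t = min(max(distance_km / max_km, 0.0), 1.0)
    hue = (1.0 - t) * 0.75  # hue 0.75 is purple, hue 0.0 is red
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return (round(r * 255), round(g * 255), round(b * 255))
```

Each pair of consecutive records for one bus would be passed through `haversine_km`, and the resulting distance through `distance_to_rgb`, before the line is drawn.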
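The time-gap rule can be sketched like this. The two-minute and four-minute thresholds and the 95% reduction come from the description above. The linear fade between them, the full opacity for gaps under two minutes, and the function and constant names are assumptions made for illustration.

```python
from datetime import datetime

BASE_GAP_S = 120    # expected sampling interval: two minutes
MAX_GAP_S = 240     # pairs further apart than four minutes are dropped
MIN_OPACITY = 0.05  # opacity at a four-minute gap (95% lower than at two minutes)


def line_opacity(t_prev, t_next):
    """Return an opacity in [0.05, 1.0], or None if the pair should be dropped."""
    gap = (t_next - t_prev).total_seconds()
    if gap <= 0 or gap > MAX_GAP_S:
        return None
    if gap <= BASE_GAP_S:
        return 1.0
    # Assumed: fade linearly between the two-minute and four-minute gaps.
    frac = (gap - BASE_GAP_S) / (MAX_GAP_S - BASE_GAP_S)
    return 1.0 - frac * (1.0 - MIN_OPACITY)
```

For example, a pair recorded at 7:30 and 7:34 would be drawn at 5% opacity, while a pair recorded at 7:30 and 7:36 would not be drawn at all.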
Limitation
In some cases, a single XML file contains two records for the same bus. In this case, the time gap between the two records is less than two minutes, and the line between them is shorter than the others. There are about 5,200 such records, or about 2.8%. It was difficult to choose one of the two records, so I used both. A better way of handling this time-gap problem still needs to be found.
Other Error Handling
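One possible way to handle the duplicate records, which I did not apply here, would be to keep only the latest record for each bus within one file. The sketch below assumes each record is a dictionary with hypothetical `bus_id` and `time` keys; it is not part of the published cleaning code.

```python
from datetime import datetime


def keep_latest_per_bus(records):
    """Keep one record per bus from a single snapshot: the most recent one."""
    latest = {}
    for rec in records:
        bus_id = rec["bus_id"]
        if bus_id not in latest or rec["time"] > latest[bus_id]["time"]:
            latest[bus_id] = rec
    return list(latest.values())


# Hypothetical snapshot in which bus "A" appears twice.
snapshot = [
    {"bus_id": "A", "time": datetime(2019, 2, 12, 7, 30, 5)},
    {"bus_id": "A", "time": datetime(2019, 2, 12, 7, 30, 50)},
    {"bus_id": "B", "time": datetime(2019, 2, 12, 7, 30, 10)},
]
deduplicated = keep_latest_per_bus(snapshot)
```

Whether the earlier or the later record is the better choice is an open question, so this is only one option to evaluate.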
I did not use the calculated velocity for the visualization, but I did use it to remove erroneous values. As I described, the straight-line distance between two records can be shorter than the distance the bus actually traveled. This means the calculated velocity is never faster than the actual average speed. Therefore, if the calculated velocity is 180 km per hour, the actual average speed is at least that. However, it is almost impossible for a bus to keep running at that speed. So, I did not draw a line when the calculated velocity was more than 120 km per hour.
I wanted to explore the dataset through the visualization, so I tried to keep as many records as possible.
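The speed check can be sketched as below. The 120 km per hour threshold comes from the text; the function names are hypothetical and the distance is assumed to be the straight-line distance between the two records.

```python
MAX_SPEED_KMH = 120.0  # threshold above which a pair is treated as an error


def apparent_speed_kmh(distance_km, gap_seconds):
    """Straight-line distance divided by elapsed time, in km per hour."""
    if gap_seconds <= 0:
        raise ValueError("gap_seconds must be positive")
    return distance_km * 3600.0 / gap_seconds


def should_draw(distance_km, gap_seconds):
    """Draw the line only if the apparent speed is physically plausible."""
    return apparent_speed_kmh(distance_km, gap_seconds) <= MAX_SPEED_KMH
```

For instance, two records 6 km apart with a two-minute gap imply 180 km per hour, so that line would be skipped.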
Processes
As a first step, I gathered the real-time bus data for February 12, 2019, from MTA Bus Time as XML files. Then I cleaned and formatted the data for visualization. The source code for the cleaning process is available on GitHub. Finally, I made the visualization using openFrameworks.
Result
The visualization shows a beautiful pattern of bus travel. Most of the lines in New Jersey are red. Some lines from Staten Island to Manhattan are green and orange. This means buses in these areas traveled a long distance in a given period of time. These lines are drawn along highways. On the other hand, many roads in Manhattan and Brooklyn are blue or purple. Overall, this result matches our expectations.
Conclusion
I partially achieved my goal through this project. I was able to make a beautiful visualization and to interpret the result. However, again, this visualization does not indicate the actual bus speed. To understand bus travel time and speed, we need to gather more data and find a better way to visualize it.