3 languages changing data science

Wednesday, August 21, 2024, 10:30, by InfoWorld

The most powerful and flexible data science tool is a programming language. Sometimes the only way to properly solve a data science problem is to roll your own solution, even if it’s only a few lines of code. But programming languages can also take time to master, and the language you choose will be influenced by many things, including your existing programming experience, the problem scope, and whether you need to prioritize execution speed or development speed.

In this article, we’ll look at three of the programming languages with the biggest impact on modern data science. These languages are critical not only for their power and speed, but also for convenience and how they enable data-driven work via third-party libraries.

Python

No discussion of modern data science is possible without mentioning Python. The decades-old language has enjoyed a massive boost in popularity over the last 10 years, in part because it’s become the de facto language for data science.

Two of Python’s big selling points come into play here. First, Python is a relatively easy language to program in. You can quickly prototype a working piece of software in Python and (if needed) enhance its performance incrementally over time. Data science projects don’t take as much time to stand up and get running in Python as they might in other languages. It also helps that an existing culture of data science provides many fast templates you can use for your own Python-based projects.

The second big selling point is Python’s third-party library ecosystem, which all but guarantees someone else has developed a prepackaged solution to any problem you have. That ecosystem contains an embarrassment of riches for data science: number-crunching libraries (NumPy, Pandas, Polars); graphing and plotting tools (Bokeh, Plotly); notebook environments for reproducible work (Jupyter); machine learning and AI toolkits (PyTorch); analytics tools (DuckDB); and much more.

For all its popularity and appeal, though, Python has drawbacks that make some data science applications harder to develop and deploy.

Python lacks a native mechanism for deploying a Python application as a standalone program. If you write a Python library, you can package and distribute that library to other users via PyPI. But that implies the other users know how to set up a Python environment to use your code. It’s far harder to package a Python program so that someone with no Python experience can just download and run it. It’s not impossible, mind you; just difficult, and not natively supported by Python’s toolchain.

Data science folks who want to repackage the Python tooling they’ve created for others to use have limited choices. They can deploy their work as a library and assume the other person knows Python (not always true); they can deploy the app in question through a web interface (not always feasible); they can deploy via a container system like Docker (again, not always familiar to the other party); or they can use a third-party tool to package the Python runtime with their app and its libraries (not always reliable).
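To make the packaging gap concrete, here is a minimal sketch using the standard library's zipapp module, which is about as close as Python's own toolchain gets to a single-file app. The app name and its one-line script are hypothetical, chosen only for illustration. The sketch bundles a tiny program into a .pyz archive and runs it, and the run step shows the limitation: the archive still has to be handed to an installed Python interpreter.

```python
import subprocess
import sys
import tempfile
import zipapp
from pathlib import Path


def build_and_run_pyz() -> str:
    """Bundle a tiny app into a .pyz archive and run it with the current interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        app_dir = Path(tmp) / "report_app"  # hypothetical app name
        app_dir.mkdir()
        # zipapp uses __main__.py as the entry point of the archive.
        (app_dir / "__main__.py").write_text(
            "print('mean =', sum([2, 4, 6]) / 3)\n", encoding="utf-8"
        )

        archive = Path(tmp) / "report_app.pyz"
        zipapp.create_archive(app_dir, target=archive)

        # The archive is a single file, but it is not standalone:
        # it only runs when given to an installed Python interpreter.
        result = subprocess.run(
            [sys.executable, str(archive)],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()


if __name__ == "__main__":
    print(build_and_run_pyz())
```

A recipient without Python installed could not run the resulting file, which is why the third-party bundlers mentioned above exist.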

Another drawback is the speed of native Python code. It’s far slower than C, Rust, Julia, or other languages when performing CPU-based computations. This means any high-performance Python code typically isn’t written in Python itself, which imposes an extra level of abstraction between you and the work you’re doing. Work is underway to make native Python faster, but it’s unlikely we’ll see speeds akin to what machine-native code can produce anytime soon.
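A rough way to see this for yourself, using only the standard library: the sketch below times an explicit loop executed by the interpreter against the built-in sum(), which CPython implements in C. The function names and the input size are assumptions made for illustration, and the timings will vary by machine and Python version, so none are quoted here.

```python
import timeit


def python_loop_sum(values):
    """Sum the values with an explicit loop run by the Python interpreter."""
    total = 0
    for v in values:
        total += v
    return total


def compare(n: int = 1_000_000, repeats: int = 5):
    """Return (loop_seconds, builtin_seconds) as best-of-`repeats` timings."""
    data = list(range(n))
    # Both approaches must agree before their timings are worth comparing.
    assert python_loop_sum(data) == sum(data)
    loop_s = min(timeit.repeat(lambda: python_loop_sum(data), number=1, repeat=repeats))
    builtin_s = min(timeit.repeat(lambda: sum(data), number=1, repeat=repeats))
    return loop_s, builtin_s


if __name__ == "__main__":
    loop_s, builtin_s = compare()
    print(f"interpreted loop: {loop_s:.4f} s")
    print(f"built-in sum():   {builtin_s:.4f} s")
```

The same pattern, pushing the hot loop into compiled code, is what libraries such as NumPy and Polars do on a much larger scale.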

Julia

The Julia language, first released in 2012, was created specifically for data scientists. Its creators wanted to have a language as easy to work with as Python, but as fast as C or Fortran, and without having to work in more than one language at a time for the best results.

Julia works its magic by being “just-in-time” compiled, or JITed, to machine-native code, by way of the LLVM compiler system. Julia code has the simplicity of Python’s syntax, so it’s straightforward to write and supports quick results. You can let the compiler infer types at first, then supply type annotations for better performance later on.

Julia’s package collections contain libraries for almost any common data science or analytics work: common math functions (like linear algebra or matrix theory), AI, statistics, and tools for working with parallel computing or GPU-powered computing. Many of the packages are written natively in Julia, but some wrap well-known third-party libraries such as TensorFlow. And if you have existing C or Fortran code in a shared library, you can call it directly from Julia with minimal overhead.

Data scientists use the interactive Jupyter notebook environment for quickly writing and sharing code. The IJulia package adds support for Julia code in Jupyter and the JupyterLab IDE.

So, what are the drawbacks of using Julia? One possible issue is also one of Python’s big limitations: There’s still no easy way to bundle a Julia program so that someone without the Julia runtime can use it. Various workarounds exist, but there’s no one “blessed” solution that handles the entire workflow to create a redistributable app.

Another issue comes up early in a user’s learning experiences with Julia: the “time to first X” problem (also known as “time to first plot” or “TTFP”). Because Julia is JIT-compiled, the first call to a function in a session has to compile it, so a program may execute far more slowly on its first run than on later calls in the same session. Experienced Julia users quickly learn the tools and techniques available to reduce first-run latency.

A third possible obstacle is how some commonly used things found in the core libraries of other languages are only available as third-party items in Julia. For instance, Python’s pathlib library provides an object-oriented way to handle file paths. In Julia, paths are natively handled as strings, which makes some common path operations more complicated than they need to be.
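For readers who haven't used pathlib, this short sketch shows the Python side of that comparison. It contrasts function-based handling of a path held as a plain string (here via Python's posixpath helpers, used only as a stand-in for the string-oriented style) with the object-oriented style. The file path is hypothetical.

```python
import posixpath
from pathlib import PurePosixPath

raw = "/data/experiments/run_01/results.csv"  # hypothetical path

# String-based handling: every operation is a separate helper function.
stem_str = posixpath.splitext(posixpath.basename(raw))[0]
sibling_str = posixpath.join(posixpath.dirname(raw), stem_str + ".parquet")

# Object-based handling: the path carries its own attributes and methods.
path = PurePosixPath(raw)
stem_obj = path.stem
sibling_obj = path.with_suffix(".parquet")

print(stem_str, stem_obj)
print(sibling_str, sibling_obj)
```

Both styles produce the same results; the difference is that the object version chains attributes on one value instead of nesting several function calls.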

Rust

One of the hottest new languages overall, Rust is worth noting for its growing presence in the data science space. Rust allows developers to write data science tools that run fast, use true parallelism, are memory safe, and avoid whole classes of bugs—all features that matter when working with data at scale.

Many data scientists have likely worked with Rust-developed tools by now. The Polars library for dataframes, for instance, was written in Rust, and can be used in many other languages including Python. But Rust’s native package collections (known as “crates”) enable data science work directly in the language, as well. The ndarray crate provides powerful matrix math tools roughly analogous to NumPy in Python. The plotters crate renders charts and graphs. And the evcxr_jupyter project provides a Jupyter kernel for using Rust in a notebook environment.

It’s easier to generate a redistributable binary from a Rust project than it is with Python or Julia, which is a significant advantage for data science. That makes Rust well suited to building data science tools that other people can install and run (like Polars), as opposed to one-off analysis projects.

Rust’s insistence on correctness and memory safety is both a valuable feature and its biggest disadvantage. Whereas Python (and sometimes Julia) trade execution speed for development speed, it’s the other way around with Rust. Rust has a steeper learning curve than Python or Julia, and writing a Rust program can take longer to get right than a Python or Julia program. This makes Rust less suited to projects that need to be prototyped in short order, but better for work where correctness and safety are more important. Rust’s safety features make it ideal for developing data science libraries or public applications, but it may not be your best choice for projects intended only for internal use.

Conclusions

It’s hard to go wrong with choosing Python for data science, thanks to its breadth of support and overall power, although it may require more work than other solutions to become the fastest or most redistributable choice. Julia was built for data science from the ground up, and promises (and delivers) more speed with less overall work, but it’s also difficult to redistribute standalone Julia programs. Rust’s speed and correctness are unmatched, which is why it’s the language of choice for many common data science tools. It just isn’t the best choice for projects where you need fast iteration or quick prototyping.

Read more on InfoWorld

https://www.infoworld.com/article/3486164/3-languages-changing-data-science.html
